# Divert: Making word search puzzles from Catullus

## Devise
Not all pedagogy needs to be exercises, drills, quizzes, tests, etc. We are allowed to have fun as well. There is no shortage of Latin crossword puzzles: the [*Guardian*](https://www.theguardian.com/crosswords/crossword-blog/2015/oct/19/crossword-blog-return-latin) published one in 1930 and restarted (and at some point restopped?) in 2015. Latin word games more generally can be found here and there online, for example [*Hebdomada Aenigmatum*](https://www.latincrosswords.com/). We even have [Latin Wordle](https://wordle.latindictionary.io/) and [Latin Spelling Bee](https://www.examenapium.com/) to keep ourselves entertained as we learn. Speaking from my own experience as a young child, word puzzle magazines from *Dell* and *Pennypress* were a source of endless engagement with language well before I knew exactly what every word meant or could even fill out 10% of a crossword puzzle grid.

So, it is in the *Exploratory Philology* spirit to leverage computational methods toward any activity that brings us closer to that kind of "endless engagement". The subtitle of the book is "learning about Ancient Greek and Latin" and even in a simple word search puzzle we are inevitably learning *about* the language: we are learning about letter patterns, letter frequencies, character cluster frequencies, maybe even something, as we will see below, about lexical categories and key words.

In this notebook, we will build word search puzzles from the works of Catullus. At the end of this activity you will have a letter grid with hidden words that you can distribute to your students for a quick and fun classroom activity. But as we will see, the process of constructing the puzzle is itself a learning experience. In a certain respect, we can think of it as a re-learning experience, since we will be drawing on Python skills from the previous three notebooks and directing them toward a new end.

## Plan

What do we need to do to make a word search puzzle from a given poem of Catullus? There are really two separate tasks to consider here: (1) the technical task of creating a word search puzzle from any text and (2) the more philological task of selecting the best words from any given poem to use for our wordlist. The first task we will largely leave to an existing Python module. Note that this in itself is a useful lesson in computer programming: there is no need for wheel-reinventing, especially when reusing and adapting existing code allows us to focus more on what we are really interested in, namely learning about Latin texts through code. The second task gives us an opportunity to introduce some more text analysis fundamentals, specifically the idea of keyness. As opposed to our earlier Describe task, which was genuinely about frequency counts, here we turn our attention not simply to which words appear most often but rather to which ones play an important role in the text we are looking at. To foreshadow our work in this experiment, if we are making a word search puzzle for Catullus 2, how do we make sure that *passer* is in our wordlist?

Our two-part pseudocode...

**Pseudocode for making word search puzzles for Catullus**

- Task 1
    - Import an off-the-shelf solution for making word search puzzles
- Task 2
    - Load our library of Latin texts, keeping only the poems of Catullus
    - Define a "keyness" measure for our poems, here TF-IDF
    - Measure keyness for a specific poem and select some number of words for our wordlist
    - Make a word search puzzle from our wordlist
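
Before turning to real poems, the contrast between frequency and keyness can be made concrete with a minimal sketch. The three mini-documents below are invented for illustration (they are not lines of Catullus), and the helper names are placeholders; the scoring follows the same TF-IDF formulation used in the Code section of this notebook.

```python
import math
from collections import Counter

# Three invented mini-documents (not real Catullus), already lowercased and tokenized
docs = {
    "a": "et passer et puella et passer".split(),
    "b": "et libellum et nugas".split(),
    "c": "et soles et basia".split(),
}

def tf(word, tokens):
    # Share of the document's tokens taken up by this word
    return tokens.count(word) / len(tokens)

def idf(word, all_docs):
    # Words found in fewer documents get a higher score
    n_containing = sum(1 for tokens in all_docs if word in tokens)
    return math.log(len(all_docs) / (1 + n_containing))

def tfidf(word, tokens, all_docs):
    return tf(word, tokens) * idf(word, all_docs)

target = docs["a"]
all_docs = list(docs.values())

# Raw frequency picks the function word; TF-IDF picks the distinctive word
most_frequent = Counter(target).most_common(1)[0][0]
most_key = max(set(target), key=lambda w: tfidf(w, target, all_docs))

print("Most frequent:", most_frequent)
print("Most key:", most_key)
```

In this toy collection *et* is the most frequent word in document "a" but appears in every document, so its keyness score is low; *passer* appears only in document "a" and so rises to the top.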

## Code

In [None]:
# Preliminary imports
from natsort import natsorted
from pprint import pprint
import random
from time import sleep
from latintools import preprocess

In [None]:
# PC 1: Load our library of Latin texts, keeping only Catullus

from cltkreaders.lat import LatinTesseraeCorpusReader

T = LatinTesseraeCorpusReader()

catullus = [fileid for fileid in T.fileids() if 'catullus' in fileid][0] # There is only one Catullus file
print(catullus)


In [None]:
import pandas as pd

docrows = next(T.doc_rows(catullus))

df = pd.DataFrame.from_dict(docrows, orient='index', columns=['line'])
df

In [None]:
# Format the dataframe
df = pd.DataFrame.from_dict(docrows, orient='index', columns=['line'])
df['author'] = 'catullus'
df['poem'] = df.apply(lambda row: row.name.split('.')[1].replace('>','').strip(), axis=1)
df['line_no'] = df.apply(lambda row: row.name.split('.')[2].replace('>','').strip(), axis=1)
df['line'] = df.apply(lambda row: row.line.replace('\n','').strip(), axis=1)
df = df[['author', 'poem', 'line_no', 'line']]
df = df.reset_index(drop=True)
df
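
The two lambdas above pull the poem number and line number out of each row's index label by splitting on periods. As a hedged illustration of that string logic in isolation, here is a small sketch; `parse_key` is a hypothetical helper and the sample keys are assumed stand-ins, so the real index labels in the corpus may differ in detail.

```python
def parse_key(key):
    """Split a citation-style key into (poem, line_no), mirroring the lambdas above."""
    parts = key.split('.')
    poem = parts[1].replace('>', '').strip()
    line_no = parts[2].replace('>', '').strip()
    return poem, line_no

# Hypothetical keys, assumed for illustration only
sample_keys = ['<cat. 2.1>', '<cat. 2b.3>']
parsed = [parse_key(key) for key in sample_keys]
print(parsed)
```

Note that the poem identifier stays a string, which is why later cells look up poems with `'2'` rather than `2`.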

In [None]:
df['poem'].unique()

In [None]:
df[df['poem'] == '2']

In [None]:
poem = "\n".join(df[df['poem'] == '2']['line'].tolist())
print(poem)

In [None]:
def make_poem(df, poem_no):
    poem = "\n".join(df[df['poem'] == str(poem_no)]['line'].tolist())
    return poem

catullus_poems = {}

for poem_no in df['poem'].unique():
    poem = make_poem(df, poem_no)
    catullus_poems[poem_no] = poem

In [None]:
pprint(list(catullus_poems.items())[:2])

In [None]:
catullus_poems['1']

In [None]:
catullus_preprocess = {no: preprocess(poem, remove_lines=True) for no, poem in catullus_poems.items()}

In [None]:
pprint(catullus_preprocess['1'])

In [None]:
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
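
Written out as formulas, the four functions above compute the following quantities. The symbols are my own shorthand for this sketch: $c(w, d)$ is the number of times word $w$ occurs in poem $d$, $|d|$ is the number of words in $d$, $N$ is the number of poems, and $n_w$ is the number of poems that contain $w$.

```latex
\mathrm{tf}(w, d) = \frac{c(w, d)}{|d|}
\qquad
\mathrm{idf}(w) = \ln \frac{N}{1 + n_w}
\qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

One consequence of the $1 + n_w$ in the denominator is that a word found in every poem ($n_w = N$) gets a slightly negative score, and a word found in all but one poem gets a score of exactly zero. For our purposes that is harmless, since such words are precisely the ones we do not want in a wordlist.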

In [None]:
catullus_preprocess_vals = [preprocess(poem) for poem in catullus_poems.values()]

In [None]:
bloblist = [tb(poem) for poem in catullus_preprocess_vals]

In [None]:
scores = [{word: tfidf(word, blob, bloblist) for word in blob.words} for blob in bloblist]

In [None]:
test = sorted(list(scores[2].items()), key=lambda x: x[1], reverse=True)[:10]

from tabulate import tabulate

print(tabulate(test, headers=['Word', 'Score']))

In [None]:
vocab_poems = [list(blob.words) for blob in bloblist]
vocab = set([word for poem in vocab_poems for word in poem])


In [None]:
vocab_df = {}

for poem in vocab_poems:
    for word in set(poem):
        if word in vocab_df:
            vocab_df[word] += 1
        else:
            vocab_df[word] = 1

vocab_df = sorted(vocab_df.items(), key=lambda x: x[1], reverse=True)
vocab_df[:10]

In [None]:
from collections import Counter
Counter([word for poem in vocab_poems for word in poem]).most_common(10)

In [None]:
test = sorted(list(scores[2].items()), key=lambda x: x[1], reverse=True)
pprint(test[:10])

In [None]:
wordlist_base = [word for word, score in test]
wordlist_base = [word for word in wordlist_base if len(word) > 4]
wordlist_top = wordlist_base[:10]
wordlist_random = random.sample(wordlist_base[10:40], 5)
wordlist = wordlist_top + wordlist_random
wordlist = [[word, None] for word in wordlist]  # [word, clue] pairs, as Crossword appears to expect; we supply no clues

In [None]:
from wordsearch import Crossword

In [None]:
puzzle = Crossword(20, 20, '-', 5000, wordlist)
puzzle.compute_crossword(5, 5)

In [None]:
print(puzzle.word_bank())

In [None]:
print(puzzle.word_find())

In [None]:
print(puzzle.solution())
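
For readers curious about what a puzzle generator has to do, here is a minimal sketch of one possible approach: try random positions and directions for each word, keep a placement only if it stays on the grid and agrees with letters already there, then fill the gaps with random letters. This is an illustration under my own assumptions and not a description of how the `wordsearch` module works; `fits` and `make_grid` are hypothetical names, and the wordlist is a hand-picked example.

```python
import random
import string

# Eight directions as (row step, column step)
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (-1, 1), (0, -1), (-1, 0), (-1, -1), (1, -1)]

def fits(grid, word, row, col, d_row, d_col):
    """Return True if word can be written from (row, col) without leaving the grid or clashing."""
    size = len(grid)
    for i, letter in enumerate(word):
        r, c = row + i * d_row, col + i * d_col
        if not (0 <= r < size and 0 <= c < size):
            return False
        if grid[r][c] not in (None, letter):
            return False
    return True

def make_grid(words, size=12, tries=500, seed=None):
    """Place words at random positions and directions, then fill empty cells with random letters."""
    rng = random.Random(seed)
    grid = [[None] * size for _ in range(size)]
    placed = []
    # Longest words first, since they have the fewest possible positions
    for word in sorted(words, key=len, reverse=True):
        for _ in range(tries):
            d_row, d_col = rng.choice(DIRECTIONS)
            row, col = rng.randrange(size), rng.randrange(size)
            if fits(grid, word, row, col, d_row, d_col):
                for i, letter in enumerate(word):
                    grid[row + i * d_row][col + i * d_col] = letter
                placed.append(word)
                break
    filled = [[cell or rng.choice(string.ascii_lowercase) for cell in row] for row in grid]
    return filled, placed

grid, placed = make_grid(["passer", "deliciae", "puella", "dolor"], size=10, seed=1)
print("\n".join(" ".join(row) for row in grid))
print("Placed:", placed)
```

A word that cannot be placed within the allotted tries is simply skipped, so it is worth checking `placed` against the wordlist before handing out the puzzle.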

## Explore

### Next steps

- ***Change collection(s)***: What authors or texts other than Catullus would you (or your students!) want to make word search puzzles for? Remember that you will need to use both a set of "figure" texts from which to make the puzzle *and* some collection of "ground" texts from which to measure keyness. For example, if you wanted to make a word search puzzle for Virgil's *Aeneid*, you might use the *Aeneid* as your figure text and then the entire collection of Latin epic as your ground text.
- ***Change puzzle***: Word search puzzles can be fun, but try to think of other activities, puzzles, games, etc. that work with lists of words and especially with sets of "key" words. We can even move away from the world of "play" and consider how keyness could be useful to us as Latin teachers for, say, scaffolding vocabulary in a pre-reading activity.
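
The figure-and-ground idea from the first suggestion above can be sketched as a single function that scores the words of one text against a wider collection. This is a rough sketch under stated assumptions: `key_words` is a hypothetical helper, the token lists are invented toy data rather than real epic texts, and the scoring reuses the TF-IDF formulation from the Code section.

```python
import math
from collections import Counter

def key_words(figure_tokens, collection, n=5, min_len=5):
    """Rank words in the figure text by TF-IDF, with document frequency measured over the collection."""
    counts = Counter(figure_tokens)
    total = len(figure_tokens)
    n_docs = len(collection)
    doc_sets = [set(doc) for doc in collection]
    scores = {}
    for word, count in counts.items():
        if len(word) < min_len:
            continue  # very short words make poor puzzle entries
        doc_freq = sum(1 for doc in doc_sets if word in doc)
        scores[word] = (count / total) * math.log(n_docs / (1 + doc_freq))
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Invented toy data: one "figure" text and three "ground" texts
figure = ["arma", "arma", "troiae", "fata", "troiae"]
ground = [
    ["arma", "bella", "campos"],
    ["arma", "acies", "regna"],
    ["arma", "fata", "numero"],
]

# As in the notebook, the figure text is itself part of the collection
ranked = key_words(figure, ground + [figure], n=3, min_len=4)
print(ranked)
```

The choice of ground matters: a word that is distinctive against all of Latin literature may be unremarkable against other texts in the same genre, so the same figure text can yield different wordlists.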

### For the future

- ***Work better at scale***: We have gone about deriving TF-IDF in a manual fashion here, and for good reason, as we want to make sure that we grasp the underlying concept of keyness before ramping things up. But there are much more efficient ways of calculating such measures for large amounts of text. For example, the `TfidfVectorizer` in the `scikit-learn` package can handle much of the overhead with much more flexibility in how we handle minimum counts, stopwords, etc. Here is a sample [tutorial](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a). I have given an example of a count vectorizer and a tf-idf vectorizer dataframe for Catullus in an appendix below.

## Further Reading
- Luca, D. 2018. Hebdomada Aenigmatum. Les premiers mots croisés en Latin et Grec. Paris: Dictionnaire.

## Appendix: Vectorized Catullus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

poem_nos, texts = zip(*catullus_preprocess.items())

In [None]:
CV = CountVectorizer()
CV_matrix = CV.fit_transform(texts)
vocab = CV.get_feature_names_out()
df_counts = pd.DataFrame(CV_matrix.toarray(), columns=vocab, index=poem_nos)
df_counts

In [None]:
df_counts.iloc[0].sort_values(ascending=False).where(lambda x: x > 0).dropna().index

In [None]:
tfidfvectorizer = TfidfVectorizer()
tfidf_wm = tfidfvectorizer.fit_transform(texts)
tfidf_tokens = tfidfvectorizer.get_feature_names_out()
df_tfidfvect = pd.DataFrame(data=tfidf_wm.toarray(), index=poem_nos, columns=tfidf_tokens)
df_tfidfvect

In [None]:
df_tfidfvect[['libellum']].head()

In [None]:
df_tfidfvect[['passer']].head()

In [None]:
df_tfidfvect.iloc[2].sort_values(ascending=False)[:10]