# Describe: Creating 'cloze' exercises with Cicero

## Devise



## Plan

As always, let's plan out our work before we start writing Python code. We will use the following steps to create our cloze exercises:

**Pseudocode for Ciceronian 'cloze' exercises**

- Load our library of Latin texts, keeping only those by Cicero
- Create a list of sentences from which we can draw our exercises, keeping them at a certain length (~10-25 words)
- Pick a sentence at random
- Pick a word at random to mask
- Create a set of multiple-choice answers, i.e. three random words in addition to the removed word
- Ask user for input to test whether the removed word can be correctly identified
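
Before bringing in the corpus reader, the steps above can be tried out on plain Python strings. What follows is only a rough sketch under stated assumptions, not the implementation used in this notebook: the function name `make_cloze`, the hand-tokenized sample sentences, and the use of `str.isalpha` in place of token attributes are all illustrative choices, and the sentence-length filter is skipped because the sample is tiny.

```python
import random

def make_cloze(tokens, vocab, rng, n_wrong=3, blank="_____"):
    """Mask one alphabetic token and return (cloze, answer, choices)."""
    # Only purely alphabetic tokens are candidates for masking
    candidates = [i for i, tok in enumerate(tokens) if tok.isalpha()]
    masked = rng.choice(candidates)
    answer = tokens[masked]
    cloze = " ".join(blank if i == masked else tok for i, tok in enumerate(tokens))
    # Sort before sampling so that a seeded generator gives repeatable results
    wrong = rng.sample(sorted(vocab - {answer}), n_wrong)
    choices = sorted([answer] + wrong)
    return cloze, answer, choices

# Sample data for illustration only: two lines from Cicero's first Catilinarian,
# tokenized by hand (an assumption; the notebook tokenizes with the corpus reader)
sentences = [
    ["Quo", "usque", "tandem", "abutere", ",", "Catilina", ",", "patientia", "nostra", "?"],
    ["O", "tempora", ",", "o", "mores", "!"],
]
vocab = {tok for sent in sentences for tok in sent if tok.isalpha()}

rng = random.Random(42)
cloze, answer, choices = make_cloze(rng.choice(sentences), vocab, rng)
print(cloze)
print(choices)
```

Passing in a `random.Random` instance, rather than seeding the global generator, keeps the sketch repeatable without affecting other cells.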

## Code

In [None]:
# Preliminary imports
from natsort import natsorted
from pprint import pprint
from time import sleep

As always, let's set up our corpus reader and pull out the texts we want to describe.

In [None]:
# PC 1: Load our library of Latin texts, keeping only those by Cicero

from cltkreaders.lat import LatinTesseraeCorpusReader

T = LatinTesseraeCorpusReader()

# Note: this filter selects only the De Finibus files, not every Cicero file in the corpus
cicero = natsorted([fileid for fileid in T.fileids() if 'de_finibus' in fileid])
pprint(cicero[:10])

In [None]:
# PC 2a: Create a list of sentences from which we can draw our exercises

sents = list(T.sents(fileids=cicero))

In [None]:
# Show example sentences

for i, sent in enumerate(sents[:10], 1):
    print(f'{i}: {sent}')

In [None]:
print(len(sents))

In [None]:
# PC 2b: Keep only sentences of moderate length (len counts tokens, punctuation included)
sents = [sent.as_doc() for sent in sents if 10 < len(sent) < 25]
for i, sent in enumerate(sents[:10], 1):
    print(f'{i}: {sent}')

Loading these sentences into memory takes 15 seconds on my machine, and loading all of the sentences from the Cicero files would take even longer. When we find ourselves in a situation like this, it can be a huge timesaver to write computation-intensive results to disk for quick retrieval later. Here is an example of "pickling" the sentences we just loaded so that they can be read from disk rather than reprocessed.

In [None]:
import pickle

# The sentences are stored as Doc objects, not strings; later cells rely on
# token attributes such as .i and .is_alpha
with open('../data/cicero-sents.pickle', 'wb') as f:
    pickle.dump(sents, f)

In [None]:
with open('../data/cicero-sents.pickle', 'rb') as f:
    sents = pickle.load(f)
sents[0]
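
The write-then-read pattern above is often wrapped in a small helper that only recomputes when no cached file exists. The sketch below is one possible way to do this, not part of the original notebook: the helper name `load_or_build` and the stand-in `slow_build` function are hypothetical, and a temporary directory is used so the example does not touch `../data`. Keep in mind that pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(cache_path, build):
    """Return the object cached at cache_path, building and saving it first if needed."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = build()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result

# Hypothetical stand-in for the slow step (in the notebook, the sentence loading above)
calls = []
def slow_build():
    calls.append(1)  # record each time the expensive step actually runs
    return ["sentence one", "sentence two"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sents.pickle"
    first = load_or_build(path, slow_build)   # builds the result and writes the file
    second = load_or_build(path, slow_build)  # reads the file; slow_build is not called again

print(first == second, len(calls))
```

If the filtering criteria change, the cached file has to be deleted by hand, since this helper has no way of knowing that the cache is stale.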

In [None]:
# PC 3: Pick a sentence at random

import random
random.seed(42)

exercise = random.choice(sents)
exercise

In [None]:
# PC 4: Pick a word at random to mask

In [None]:
for i, token in enumerate(exercise):
    print(f'{i}: {token}')

In [None]:
for token in exercise:
    print(f'{token.i}: {token.text}')

In [None]:
for token in exercise:
    print(f'{token.i}: {token.is_alpha}')

In [None]:
remove_options = [token.i for token in exercise if token.is_alpha]

In [None]:
random.seed(1)
remove_choice = random.choice(remove_options)
remove_choice

In [None]:
cloze = ' '.join([token.text if token.i != remove_choice else '_____' for token in exercise])

In [None]:
cloze

In [None]:
answer = exercise[remove_choice].text
answer

In [None]:
# PC 5: Create a set of multiple-choice answers, i.e. three random words in addition to the removed word

In [None]:
# Keep only alphabetic tokens so that punctuation never appears as an answer option
vocab = {word.text for sent in sents for word in sent if word.is_alpha}

In [None]:
random.seed(42)
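# Sort first: the iteration order of a set of strings can vary between runs,
# which would defeat the seed
wrong_answers = random.sample(sorted(vocab - {answer}), 3)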
wrong_answers

In [None]:
# PC 6: Ask user for input to test whether the removed word can be correctly identified

quiz = {cloze: [answer] + wrong_answers}
pprint(quiz)

In [None]:
for question, alternatives in quiz.items():
    correct_answer = alternatives[0]
    for alternative in sorted(alternatives):
        print(f"  - {alternative}")
    print()

    # Use a separate name for the user's input so that `answer` is not overwritten
    response = input(f"{question}? ").strip()
    if response == correct_answer:
        print("Correct!")
    else:
        print(f"Incorrect! The answer is {correct_answer}")

In [None]:
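def create_cloze_qa_bank(sents, vocab, n=10):
    """Return a dict mapping n cloze sentences to [answer, wrong1, wrong2, wrong3]."""
    sents = random.sample(sents, n)
    options = sorted(vocab)  # fixed order so that seeded runs are repeatable
    cloze_qa_bank = {}
    for sent in sents:
        # Only alphabetic tokens are candidates for masking
        remove_options = [token.i for token in sent if token.is_alpha]
        remove_choice = random.choice(remove_options)
        cloze = ' '.join([token.text if token.i != remove_choice else '_____' for token in sent])
        answer = sent[remove_choice].text
        wrong_answers = random.sample([word for word in options if word != answer], 3)
        # The correct answer is always stored first; the quiz loop sorts before display
        cloze_qa_bank[cloze] = [answer] + wrong_answers
    return cloze_qa_bank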

In [None]:
# quiz = create_cloze_qa_bank(sents, vocab, n=10)

# for question, alternatives in quiz.items():
#     correct_answer = alternatives[0]
#     for alternative in sorted(alternatives):
#         print(f"  - {alternative}")
#     print()
    
#     answer = input(f"{question}? ")
#     if answer == correct_answer:
#         print("Correct!")
#     else:
#         print(f"Incorrect! The answer is {correct_answer}")
#     print()

## Explore

### Next steps

- ***Change author***: It is becoming a pattern! But that is because this is where exploration lies for us, at least in the early stages. Experiment with sentences from authors other than Cicero, or from works of Cicero that we have not yet looked at.
- ***Change objective***: Try inserting a random word into a sentence and seeing if the user can identify the errant addition. Try scrambling the letters of one or more words (all?) in a sentence. Get the part of speech of masked words and ask the user for Mad Libs-style insertions. This is a Deform experiment, so feel free to manipulate the text in any way you see fit. Claassen 1991 recommends an exercise where "the program omits at random intervals the last two letters of any word... [and] the students must complete the blanks." How would you implement this?
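
As one possible starting point for the Claassen exercise in the last bullet, here is a minimal sketch that works on a plain list of strings. It rests on several assumptions of my own: "random intervals" is read as an independent per-word probability (`rate`), very short words are left alone (`min_len`), and the function name `omit_endings` and the sample sentence are illustrative. With the sentences loaded in this notebook you would pass `[token.text for token in sent]` instead.

```python
import random

def omit_endings(tokens, rng, rate=0.3, min_len=4, n_letters=2):
    """Blank the last n_letters of randomly chosen words; return (exercise, answers)."""
    exercise, answers = [], []
    for tok in tokens:
        # Skip punctuation and words too short to lose their ending sensibly
        eligible = tok.isalpha() and len(tok) >= min_len
        if eligible and rng.random() < rate:
            exercise.append(tok[:-n_letters] + "_" * n_letters)
            answers.append(tok)
        else:
            exercise.append(tok)
    return " ".join(exercise), answers

# Sample sentence for illustration, tokenized by hand
tokens = ["Quo", "usque", "tandem", "abutere", ",", "Catilina", ",", "patientia", "nostra", "?"]
exercise, answers = omit_endings(tokens, random.Random(1))
print(exercise)
print(answers)
```

Because Latin carries so much grammatical information in its endings, a natural variation would be to blank only words with a particular part of speech, which is left here as an exercise.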

### For the future

- ***Consider the multiple choice***: Right now we are inserting random words from Cicero's vocabulary into the multiple choice. There are more principled ways of going about this process, though. It may already have been clear from previous examples that choosing words randomly from the vocabulary produces some pretty unlikely candidates for filling in the blank. Consider how we would address this. One idea would be to return only words with the same part of speech. Another idea, this one cribbed from Duolingo's language courses, would be to build a list of words with similar but not identical spelling, e.g. *manet* for *monet*; you can read up on the idea of "edit distance" as an entry point into this approach. But the real payoff is going to come from vector semantics or word embedding models. Word embeddings are numerical representations of lexical items, specifically dense vectors, and because they are numerical, word vectors can be compared for similarity. Look at Notebook 8a for a quick tour of how vectors can be used for such a task.
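
To make the edit-distance idea concrete, the sketch below computes Levenshtein distance in pure Python and uses it to pick near-miss distractors. It is an illustration under assumptions rather than a recommended implementation: the function names, the `max_distance` threshold, and the mini-vocabulary are all made up for the example, and in the notebook you would pass in the `vocab` set built earlier.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def near_misses(answer, vocab, max_distance=2, n=3):
    """Return up to n vocabulary words within max_distance edits of answer, closest first."""
    scored = [(edit_distance(answer, word), word) for word in vocab if word != answer]
    close = sorted((dist, word) for dist, word in scored if dist <= max_distance)
    return [word for _, word in close[:n]]

# Hypothetical mini-vocabulary, for illustration only
vocab = {"monet", "manet", "movet", "monent", "amat", "res", "publica"}
print(near_misses("monet", vocab))
```

When fewer than `n` words fall within the threshold, the list comes back short, so a fuller version would need a fallback such as topping up with random words.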

## Further Reading
- Claassen, J.-M. 1991. “The Design of Computer Software for Learning Latin.” *Per Linguam* 7(1): 3–23.