# Lemmatizing with CLTK

Let's investigate to what degree speeches that come second in a conversation re-use the language of the speech to which they reply.

## Preliminaries

### A couple of useful packages

In [None]:
import random
import re
import ipywidgets as widgets
from IPython.display import display

### The DICES API

In [None]:
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar
api = DicesAPI(progress_class=NotebookPBar)

### CLTK setup

If you haven't used CLTK before, you may need to download models and texts before some functions will work. Normally, you only need to run this once. If you're working on Binder, though, you start from a clean system every time.

In [None]:
from cltk.corpus.utils.importer import CorpusImporter

print('Downloading models:')

for lang in ['latin', 'greek']:
    print(' - ' + lang)
    downloader = CorpusImporter(lang)
    downloader.import_corpus(f'{lang}_models_cltk')

### Tokenizers and Lemmatizers

CLTK uses language-specific tokenizers and lemmatizers. I like to have one convenience function that I can call on every speech, regardless of language. That means I have to set up language-specific tokenizers and lemmatizers first, and also cook up some kludgey regular expression substitutions to normalize orthography.

In [None]:
from cltk.tokenize.word import WordTokenizer
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer

# language-specific tokenizers
tokenizer = {
    'greek': WordTokenizer('greek'),
    'latin': WordTokenizer('latin'),
}

# language-specific lemmatizers
lemmatizer = {
    'greek': BackoffGreekLemmatizer(),
    'latin': BackoffLatinLemmatizer(),    
}

# regular expressions to tidy up Perseus texts for CLTK
replacements = {
    'greek': [
        (r'·', ','),           # FIXME: raised dot? 
        (chr(700), chr(8217)), # two different apostrophes that look alike
    ],
    'latin': [
        
    ],
}

# compile the regexes
for lang in ['greek', 'latin']:
    replacements[lang] = [(re.compile(pat), repl) for pat, repl in replacements[lang]]

# wrap everything in a generic tokenize-lemmatize function
def lemmatize(text, lang):
    '''return a list of (token, lemma) pairs for a string'''
    
    for pat, repl in replacements[lang]:
        text = pat.sub(repl, text)
    
    tokens = tokenizer[lang].tokenize(text)
    lemmata = lemmatizer[lang].lemmatize(tokens)
    
    return lemmata
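
To see what the normalization step does on its own, here is a minimal sketch that applies the Greek substitutions to a short string without calling CLTK. The helper name `normalize` and the sample string are illustrative only and are not part of the notebook's pipeline.

```python
import re

RAISED_DOT = '·'  # the raised dot replaced in the table above

# the Greek substitution table from above, repeated so this cell runs standalone
greek_rules = [
    (re.compile(re.escape(RAISED_DOT)), ','),
    (re.compile(chr(700)), chr(8217)),
]

def normalize(text, rules):
    '''apply each (pattern, replacement) pair in order (hypothetical helper)'''
    for pat, repl in rules:
        text = pat.sub(repl, text)
    return text

# invented sample: an elided word written with chr(700), then a raised dot
sample = 'ἀλλ' + chr(700) + ' ἄγε' + RAISED_DOT
cleaned = normalize(sample, greek_rules)
print(repr(cleaned))
```

Both substitutions are one-for-one character swaps, so the cleaned string has the same length as the input.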

## Process the speeches

### Query the DICES API

For the moment, at least, it's generally easier to download an inclusive set of speeches from the remote server all at once, then filter them locally using the client library. Here, we download all speeches in Homer. 

<div class="alert alert-warning" style="margin: 1em 2em">
    <p>We could have used <code>author_name='Homer'</code> as the sole search param, but this way we can showcase concatenation of results with the <code>+</code> operator.</p>
</div>

In [None]:
speeches = api.getSpeeches(work_title='Iliad', progress=True) + \
            api.getSpeeches(work_title='Odyssey', progress=True)
speeches.sort()

### Download the text of the speeches

Before we can do any NLP, we have to get the text of the speeches from the remote library. In this loop, we download the CTS passage for each speech in turn, appending the plain text of the passage to the respective speech object as a new attribute.

In [None]:
# create a progress bar
pbar = NotebookPBar(start=0, max=len(speeches))

# download text, add to speech object as new attribute
for s in speeches:
    cts_passage = s.getCTS()
    s.text = cts_passage.text
    pbar.update()

### Parse the speech text with CLTK

Now we can run CLTK's tokenizers and lemmatizers, using the wrapper function defined above. The wrapper returns a list of (token, lemma) pairs, where each lemma is the dictionary form of its token. I split the pairs into two parallel lists, one of tokens and one of lemmata, and save each list to the original speech object as a new attribute, just to make sure I don't lose track of which lemmata go with which speech.

In [None]:
# create a progress bar
pbar = NotebookPBar(start=0, max=len(speeches))

# iterate over speeches
for s in speeches:
    lang = s.work.lang
    toks = []
    lems = []
    
    # lemmatizer delivers (token, lemma) pairs; split them into two lists
    for t, l in lemmatize(s.text.lower(), lang):
        toks.append(t)
        lems.append(l)
        
    # append toks, lems, to speech object as new attributes
    s.tokens = toks
    s.lemmata = lems
        
    pbar.update()
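
Since `lemmatize` yields (token, lemma) pairs, the two parallel lists can also be built in one step with `zip`. This is only a sketch on a hand-written list of pairs standing in for real lemmatizer output; the pairs themselves are invented.

```python
# invented (token, lemma) pairs standing in for lemmatizer output
pairs = [('arma', 'arma'), ('virum', 'vir'), ('cano', 'cano'), (',', 'punc')]

# zip(*pairs) transposes the list of pairs into two columns;
# the guard covers an empty speech, where there would be nothing to unpack
toks, lems = (list(col) for col in zip(*pairs)) if pairs else ([], [])

print(toks)
print(lems)
```

The explicit loop above does the same job and is arguably easier to read; the `zip` form is just more compact.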

## Looking for shared lemmata

Now that we've got the raw data, let's try a simple experiment: **To what extent do replies reuse language from the speech they're replying to?**

For this test:

1. We'll consider as replies those speeches that come second in their conversation, i.e. `part==2`.
2. We'll measure language reuse as lemmata shared between part 1 and part 2 of a speech cluster, as a fraction of the lemmata in part 1. We'll count only distinct *types*, that is, duplicate lemmata in a single speech won't be counted.
3. To estimate how much overlap we might expect by chance, we'll also compare reply speeches to some randomly selected initial speeches from unrelated conversations.
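
The measure in step 2 can be tried on a toy example before touching real data. The lemma lists below are invented and transliterated for illustration, and the function name `overlap` is hypothetical; it is not the helper used on the real speeches.

```python
def overlap(incipit_lemmata, reply_lemmata):
    '''fraction of distinct incipit lemmata that also occur in the reply'''
    incipit_types = set(incipit_lemmata)
    reply_types = set(reply_lemmata)
    return len(incipit_types & reply_types) / len(incipit_types)

# invented lemma lists; 'ego' occurs twice in the incipit but counts once
toy_incipit = ['ego', 'lego', 'ego', 'theos', 'naus']
toy_reply = ['lego', 'theos', 'pater']

score = overlap(toy_incipit, toy_reply)
print(score)
```

Here the incipit has four distinct lemmata, two of which recur in the reply, so the score is 0.5. Note that the measure is asymmetric: swapping the two arguments changes the denominator.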

### Replies

First, let's gather all speeches whose `part` attribute is `2`.

In [None]:
# filter by part
replies = speeches.filterParts([2])

# how many results?
print(len(replies))

### Incipits

Now, let's gather all speeches whose `part` is `1`.

In [None]:
# filter by part
incipits = speeches.filterParts([1])

# how many results?
print(len(incipits))

There are a lot more initial speeches than replies because so many speech clusters only have a single speech. Let's further limit the incipits under consideration to those from clusters represented in `replies`:

In [None]:
# filter by clusters present in replies
incipits = incipits.filterClusters(replies.getClusters())

# how many results?
print(len(incipits))

I want to organize these two groups in two different ways. In one treatment, each reply is paired with its incipit. In the control, each reply is paired with a random incipit.

### Incipit-reply matched pairs

Let's try pairing them off, first, and see what happens:

In [None]:
# start with empty list
reply_pairs = []

# check each reply
for reply in replies:

    # get incipits with same cluster id
    results = incipits.filterClusters([reply.cluster])
    
    # should be only one
    if len(results) == 0:
        print(f'found no incipit for cluster {reply.cluster.id}')
    elif len(results) > 1:
        print(f'found {len(results)} incipits for cluster {reply.cluster.id}')
    else:
        reply_pairs.append((results[0], reply))
        

# how many pairs?
print(len(reply_pairs))
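
The loop above filters the whole incipit list once per reply. If that ever becomes slow, one alternative is to index the incipits by cluster first. The sketch below uses toy stand-in objects rather than real DICES speeches; the attribute name `cluster_id` and all the values are assumptions made for illustration.

```python
from types import SimpleNamespace

# toy stand-ins for speech objects; only `id` and `cluster_id` are modelled
toy_incipits = [
    SimpleNamespace(id=1, cluster_id=10),
    SimpleNamespace(id=3, cluster_id=20),
]
toy_replies = [
    SimpleNamespace(id=2, cluster_id=10),
    SimpleNamespace(id=4, cluster_id=20),
    SimpleNamespace(id=6, cluster_id=30),   # has no incipit in the toy data
]

# index incipits by cluster id once, so each reply needs a single lookup
by_cluster = {}
for inc in toy_incipits:
    by_cluster.setdefault(inc.cluster_id, []).append(inc)

toy_pairs = []
for rep in toy_replies:
    matches = by_cluster.get(rep.cluster_id, [])
    if len(matches) == 1:   # keep only unambiguous matches, as in the loop above
        toy_pairs.append((matches[0], rep))

print([(a.id, b.id) for a, b in toy_pairs])
```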

### Randomized control set

Now let's shuffle the incipits and replies to create some control pairs.

In [None]:
# start with empty list
random_pairs = []

# choose incipits, replies at random
for rep in range(1000):
    i = random.randint(0, len(incipits)-1)
    j = random.randint(0, len(replies)-1)
    if incipits[i].cluster.id != replies[j].cluster.id:
        random_pairs.append((incipits[i], replies[j]))

# how many pairs?
print(len(random_pairs))
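
Because draws from the same cluster are discarded, the loop above can yield fewer than 1000 pairs, and the result changes from run to run. The following sketch shows one way to get a fixed number of pairs reproducibly. It runs on toy `(label, cluster_id)` tuples rather than DICES speech objects, and the function name `sample_control` is hypothetical.

```python
import random

def sample_control(incipits, replies, n, seed=0):
    '''draw n (incipit, reply) pairs from different clusters, reproducibly'''
    # guard against looping forever when no cross-cluster pair exists
    if not any(i[1] != r[1] for i in incipits for r in replies):
        raise ValueError('no incipit and reply from different clusters')

    rng = random.Random(seed)   # private generator, fixed seed
    pairs = []
    while len(pairs) < n:
        inc = rng.choice(incipits)
        rep = rng.choice(replies)
        if inc[1] != rep[1]:    # second field is the cluster id
            pairs.append((inc, rep))
    return pairs

# toy data: (label, cluster_id)
toy_incipits = [('inc-a', 1), ('inc-b', 2), ('inc-c', 3)]
toy_replies = [('rep-a', 1), ('rep-b', 2), ('rep-c', 3)]

control = sample_control(toy_incipits, toy_replies, n=20, seed=42)
print(len(control))
```

Sampling is with replacement, as in the original loop, so the same pair may appear more than once.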

### Metric for shared lemmata

For a given pair of speeches, one incipit and one reply, we're looking for the number of unique, shared lemmata divided by the number of unique lemmata in the incipit alone.

First, we'll write a custom function to return the shared lemmata (omitting punctuation, which CLTK assigns to the lemma "punc"):

In [None]:
def shared(speech_a, speech_b, inc_punc=False):
    '''Return the set of lemmata that occur in both speeches'''
    
    # set intersection keeps only the distinct lemmata present in both
    common = set(speech_a.lemmata) & set(speech_b.lemmata)
    
    # CLTK lemmatizes punctuation as 'punc'; drop it unless requested
    if not inc_punc:
        common.discard('punc')
    
    return common

### Comparison of incipit-reply pairs to randomized control

Now we just run through our two groups of pairs, and calculate the metric for each group.

In [None]:
# start with empty list
nshared_reply = []

# calculate for matched pairs
for incipit, reply in reply_pairs:
    nshared_reply.append(len(shared(incipit, reply))/len(set(incipit.lemmata)))

# how many values?
print(len(nshared_reply))

In [None]:
# start with empty list
nshared_random = []

# calculate for random pairs
for incipit, reply in random_pairs:
    nshared_random.append(len(shared(incipit, reply))/len(set(incipit.lemmata)))

# how many values?
print(len(nshared_random))

### Visualize results

Are the two groups similar? Let's use a simple box-and-whisker plot to get an overview of the distributions.

In [None]:
from matplotlib import pyplot
%matplotlib inline

In [None]:
pyplot.boxplot([nshared_reply, nshared_random], notch=True, labels=['matched pairs', 'random pairs'])
pyplot.show()

### Digging a little deeper...

Well, the results are suggestive, but how significant are they? One way forward might be a statistical analysis of the distributions of this metric in the two groups.

In [None]:
pyplot.hist(nshared_reply)
pyplot.show()

In [None]:
pyplot.hist(nshared_random)
pyplot.show()

In [None]:
import scipy.stats

In [None]:
scipy.stats.ttest_ind(nshared_reply, nshared_random)
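
By default, `ttest_ind` assumes roughly normal data with equal variances in the two groups. Since the metric here is a proportion bounded between 0 and 1, a permutation test is one alternative that makes no such assumption. The sketch below is not part of the original analysis: the function name `permutation_test` is hypothetical and the numbers are invented toy values, not results from the speeches.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=2000, seed=0):
    '''two-sided p-value for the difference in means, by shuffling labels'''
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # re-split the shuffled values into groups of the original sizes
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # add-one correction keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_iter + 1)

# invented toy values standing in for the two lists of overlap scores
toy_matched = [0.42, 0.51, 0.47, 0.55, 0.49, 0.53]
toy_random = [0.21, 0.30, 0.25, 0.28, 0.19, 0.27]

p = permutation_test(toy_matched, toy_random)
print(p)
```

On the real data, the call would be `permutation_test(nshared_reply, nshared_random)`. A small p-value means that a difference in means this large rarely arises when the group labels are assigned at random.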