<div class="well" style="margin:1em 2em">
<p>This Notebook reproduces and expands on a demo from “Distant Reading of Direct Speech in Epic: An Illustrated Workflow,” a talk I gave at the FIEC / CA annual meeting in London, July 8, 2019.</p>
</div>


# Heroes and their moms

Let's say we're young scholars interested in Telemachus' speech to Penelope.
 - How often does he speak to her?
 - What kind of language does he use?
 - How does the narrator refer to these speeches?

## Preliminaries

In [None]:
# this lets me change the api while the notebook is open
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import re
import ipywidgets as widgets
from IPython.display import display
from collections import Counter
from matplotlib import pyplot
%matplotlib inline

### The DICES API

See example 1 for notes.

In [None]:
import dicesapi  # module namespace, needed later for dicesapi.CharacterInstance
from dicesapi import DicesAPI
api = DicesAPI(
    dices_api = 'https://fierce-ravine-99183.herokuapp.com/api',
    cts_api = 'http://cts.perseids.org/api/cts/',
)

### CLTK

In [None]:
from cltk.tokenize.word import WordTokenizer
tokenizer = {
    'greek': WordTokenizer('greek'),
    'latin': WordTokenizer('latin'),
}

from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
lemmatizer = {
    'greek': BackoffGreekLemmatizer(),
    'latin': BackoffLatinLemmatizer(),    
}

# regular expressions to tidy up perseus texts for cltk
replacements = {
    'greek': [
        (r"·", ','),           # FIXME: raised dot? 
        (chr(700), chr(8217)), # two different apostrophes that look alike
    ],
    'latin': [
        
    ],
}

# compile the regexs
for lang in ['greek', 'latin']:
    replacements[lang] = [(re.compile(pat), repl) for pat, repl in replacements[lang]]
    

# generic tokenize-lemmatize function
def lemmatize(text, lang):
    '''return a list of (token, lemma) pairs for a string'''
    
    for pat, repl in replacements[lang]:
        text = pat.sub(repl, text)
    
    tokens = tokenizer[lang].tokenize(text)
    lemmata = lemmatizer[lang].lemmatize(tokens)
    
    return lemmata
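
To make the tidy-up step concrete, here is a minimal sketch of the same regex pass run on its own, without CLTK. The helper name `tidy` and the sample string are made up for illustration, and I'm assuming the raised dot arrives as either U+00B7 or U+0387; the real texts may differ.

```python
import re

# Assumed code points for the raised dot: U+00B7 (middle dot) and U+0387 (Greek ano teleia).
# chr(700) is the modifier-letter apostrophe; chr(8217) is the right single quotation mark.
greek_replacements = [
    (re.compile('[' + chr(0xB7) + chr(0x387) + ']'), ','),
    (re.compile(chr(700)), chr(8217)),
]

def tidy(text):
    '''apply each (pattern, replacement) pair in order (hypothetical helper)'''
    for pat, repl in greek_replacements:
        text = pat.sub(repl, text)
    return text

# an invented sample containing both problem characters
sample = 'μῆτερ' + chr(0xB7) + ' ἀλλ' + chr(700)
tidied = tidy(sample)
print(tidied)
```

Each substitution swaps one character for one character, so token offsets in the tidied string still line up with the original.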

### WikiData

In [None]:
from qwikidata.linked_data_interface import get_entity_dict_from_api
from qwikidata.entity import WikidataItem, WikidataProperty

## Part 1

Let's start by building a lexicon for all the words Telemachus speaks to Penelope.

### Identify and download the speeches

Using the hand-rolled DICES API code, we can search speeches using keywords. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download. I've added a progress bar widget because I get impatient.

Note that I can specify both the speaker and the addressee.

In [None]:
speeches = api.getSpeeches(spkr_name='Telemachus', addr_name='Penelope')
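
For readers curious about what "paged" means in practice, here is a rough sketch of the general pattern: keep requesting pages until there is no next one. The `results`/`next` layout and the `fetch_all` helper are assumptions for illustration, not a description of the actual DICES responses, and the pages are faked in memory so nothing touches the network.

```python
def fetch_all(first_url, get_page):
    '''collect results across pages, following each page's "next" link (assumed layout)'''
    results = []
    url = first_url
    while url is not None:
        page = get_page(url)           # in real use, an HTTP GET that returns parsed JSON
        results.extend(page['results'])
        url = page.get('next')         # None on the last page ends the loop
    return results

# fake pages standing in for API responses
fake_pages = {
    'page1': {'results': ['speech 1', 'speech 2'], 'next': 'page2'},
    'page2': {'results': ['speech 3'], 'next': None},
}

all_results = fetch_all('page1', fake_pages.__getitem__)
print(len(all_results))
```

A progress bar fits naturally inside the `while` loop, updated once per page.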

What did we get?

In [None]:
for s in speeches:
    print(s)

### Retrieve the passages from a remote library

In [None]:
passages = []
for s in speeches:
    cts_passage = s.getCTS()
    text = cts_passage.text
    passages.append(text)
    
    print(f'{s.author.name} {s.work.title} {s.l_range}')
    print(text)
    print()

### Use CLTK to parse the text

In [None]:
lems = Counter()
for s, p in zip(speeches, passages):
    # take the language from the speech this passage belongs to
    lang = s.getLang()
    lemmatized = lemmatize(p.lower(), lang)
    
    these_lems = [lem for tok, lem in lemmatized]
    lems.update(these_lems)

In [None]:
results = pd.DataFrame(lems.most_common(), columns=['lemma', 'count'])
results
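
The counting step can be tried without CLTK or a network connection. This is a minimal sketch on invented (token, lemma) pairs; the placeholder strings stand in for real Greek forms.

```python
from collections import Counter

# made-up lemmatizer output: one list of (token, lemma) pairs per passage
lemmatized_passages = [
    [('tok_a1', 'lemma_a'), ('tok_b1', 'lemma_b'), ('tok_a2', 'lemma_a')],
    [('tok_a3', 'lemma_a'), ('tok_c1', 'lemma_c')],
]

lexicon = Counter()
for pairs in lemmatized_passages:
    # keep only the lemma from each pair, so inflected forms are counted together
    lexicon.update(lem for tok, lem in pairs)

top = lexicon.most_common(1)
print(top)
```

Passing `lexicon.most_common()` to `pd.DataFrame` gives the sorted table of lemma counts.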

## Part 2

Now let's think more broadly. How typical is this kind of speech? We can use external linked data to find other examples of mother-son conversations in the corpus.

### Some custom code to query WikiData

This lets us ask whether a given addressee belongs to the set of people having a certain relationship to a given speaker.

In [None]:
def checkWD(c):
    '''make sure character has wikidata id'''
    if c.char is not None:
        if c.char.wd is not None:
            if len(c.char.wd.strip()) > 0:
                return c.char.wd.strip()

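def checkWDRelation(s, a, relation, cache={}):
    '''check whether character a is listed under WikiData property `relation` for character s'''
    # include the relation in the key so one cache can serve several properties;
    # the mutable default is deliberate: it persists as a memo between calls
    key = (s.id, a.id, relation)
    if key in cache:
        return cache[key]

    res = False

    # fetch the speaker's WikiData entity only once
    if not hasattr(s, 'wd_ent'):
        s.wd_ent = WikidataItem(get_entity_dict_from_api(s.wd))

    claim_group = s.wd_ent.get_truthy_claim_group(relation)
    for claim in claim_group:
        # skip claims with no concrete value
        if claim.mainsnak.datavalue is None:
            continue
        if claim.mainsnak.datavalue.value['id'] == a.wd:
            res = True
    
    cache[key] = res
    return res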
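
To show what the relation check is doing underneath, here is a sketch that works on a plain dictionary shaped like a WikiData entity, with no library or network involved. The nested layout and the rank rule follow my understanding of WikiData's JSON and of "truthy" statements (preferred rank if any, otherwise normal, never deprecated). The IDs `Q_MOTHER` and `Q_OTHER` and both helper names are placeholders, not real identifiers.

```python
def truthy_claims(entity, prop):
    '''best-rank statements for a property: preferred if present, else normal'''
    claims = entity.get('claims', {}).get(prop, [])
    preferred = [c for c in claims if c.get('rank') == 'preferred']
    if preferred:
        return preferred
    return [c for c in claims if c.get('rank') == 'normal']

def has_relation(entity, prop, target_id):
    '''True if any truthy statement for prop points at target_id'''
    for claim in truthy_claims(entity, prop):
        snak = claim['mainsnak']
        if snak.get('snaktype') != 'value':   # "unknown value" / "no value" snaks
            continue
        if snak['datavalue']['value'].get('id') == target_id:
            return True
    return False

# a toy entity with placeholder IDs
toy_entity = {'claims': {'P25': [
    {'rank': 'normal', 'mainsnak': {'snaktype': 'value',
                                    'datavalue': {'value': {'id': 'Q_MOTHER'}}}},
    {'rank': 'deprecated', 'mainsnak': {'snaktype': 'value',
                                        'datavalue': {'value': {'id': 'Q_OTHER'}}}},
    {'rank': 'normal', 'mainsnak': {'snaktype': 'somevalue'}},
]}}

print(has_relation(toy_entity, 'P25', 'Q_MOTHER'))
```

The deprecated statement and the value-less snak are both ignored, which is the same reason the notebook code skips claims whose `datavalue` is `None`.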

For example, the WikiData property "mother" has the ID `'P25'`; it points from a child to their mother. Here's how we ask whether a given addressee is the mother of a given speaker:

In [None]:
speaker = api.getCharacters(name='Telemachus')[0]
addressee = api.getCharacters(name='Penelope')[0]

checkWDRelation(speaker, addressee, 'P25')

### Using WikiData to filter the speeches

The DICES dataset includes WikiData IDs for most of the characters (not all). The DICES API doesn't let us query WikiData itself, though. For now, the easiest thing is just to download all the speeches and character IDs, and then cross-reference them against WikiData using its own API.

In [None]:
# download all the speeches: takes a minute
speeches = api.getSpeeches(progress=True)

In [None]:
cache_mothers = {}
cache_fathers = {}

In [None]:
df = []

# create a progress bar
pbar = widgets.IntProgress(
    value = 0,
    min = 0,
    max = len(speeches),
    bar_style='info',
    orientation='horizontal'
)
pbar_label = widgets.Label(value = f'{pbar.value}/{len(speeches)}')
display(widgets.HBox([pbar, pbar_label]))

for s in speeches:
    if s.spkr is not None and s.addr is not None:
        for spkr in s.spkr:
            spkr_wd = checkWD(spkr)
            if spkr_wd is not None:

                for addr in s.addr:
                    addr_wd = checkWD(addr)
                    if addr_wd is not None:
                        df.append((
                            s.id,
                            spkr.char.name, spkr_wd, 
                            addr.char.name, addr_wd,
                            checkWDRelation(spkr.char, addr.char, 'P25', cache=cache_mothers),
                            checkWDRelation(spkr.char, addr.char, 'P22', cache=cache_fathers),
                            ))
    pbar.value += 1
    pbar_label.value = f'{pbar.value}/{len(speeches)}'

df = pd.DataFrame(df, columns=['id', 'spkr', 'sp_wd', 'addr', 'ad_wd', 'mother', 'father'])
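
The same speaker-addressee pairs come up again and again across the corpus, which is why the relation checks are cached. As a rough sketch of the idea, with a fake lookup standing in for the WikiData request, the names `make_cached_check` and `slow_lookup` are invented for this example:

```python
def make_cached_check(lookup):
    '''wrap a slow lookup so each (speaker, addressee, relation) is resolved once'''
    cache = {}
    def check(speaker_id, addressee_id, relation):
        key = (speaker_id, addressee_id, relation)
        if key not in cache:
            cache[key] = lookup(speaker_id, addressee_id, relation)
        return cache[key]
    return check

calls = []  # records every time the "remote" lookup actually runs

def slow_lookup(speaker_id, addressee_id, relation):
    '''stand-in for a remote query; the answer is hard-coded for the toy data'''
    calls.append((speaker_id, addressee_id, relation))
    return (speaker_id, addressee_id) == ('son', 'mother') and relation == 'P25'

check = make_cached_check(slow_lookup)

# four speeches, but only two distinct pairs
pairs = [('son', 'mother'), ('son', 'mother'), ('mother', 'son'), ('son', 'mother')]
answers = [check(s, a, 'P25') for s, a in pairs]
print(answers, len(calls))
```

Only distinct pairs trigger a lookup, so the cost scales with the number of character pairings rather than the number of speeches.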

In [None]:
df

In [None]:
df[df['father']]

In [None]:
# mother = [checkWD(s.)]  # FIXME: incomplete expression

In [None]:
speeches[0].data

In [None]:
spkr = speeches[0].data['spkr']

In [None]:
tel = dicesapi.CharacterInstance(spkr[0])

In [None]:
insts = [dicesapi.CharacterInstance(c, api=speeches[0].api) for c in speeches[0].data['spkr']]

In [None]:
insts[0].char.name

In [None]:
from sys import getsizeof

In [None]:
getsizeof(speeches)

In [None]:
len(speeches)

In [None]:
tel = speeches[1164].spkr[0].char

In [None]:
pen = speeches[1162].spkr[0].char

In [None]:
tel.wd_ent = WikidataItem(get_entity_dict_from_api(tel.wd))

In [None]:
tel.wd_ent

In [None]:
claim_group = tel.wd_ent.get_truthy_claim_group('P25')

In [None]:
for claim in claim_group:
    print(claim.mainsnak.datavalue.value['id'])

In [None]:
replies = api.getSpeeches(part=2)

In [None]:
q = api.getSpeeches(part=2)

replies = []
openings = []

for reply in q:
    s = api.getSpeeches(cluster_id=reply.cluster.id)
    if len(s) > 1:
        openings.append(s[0])
        replies.append(s[1])
        print(s[0].part, s[1].part)
        

In [None]:
for r in replies:
    print(r, r.part, r.cluster.id)


In [None]:
opening_texts = []
for s in openings:
    try:
        text = s.getCTS().text
    except Exception:  # don't swallow KeyboardInterrupt during long downloads
        print(f'{s} failed')
        text = None
    opening_texts.append(text)

In [None]:
reply_texts = []
for s in replies:
    try:
        text = s.getCTS().text
    except Exception:  # don't swallow KeyboardInterrupt during long downloads
        print(f'{s} failed')
        text = None
    reply_texts.append(text)

In [None]:
opening_texts