<div class="well" style="margin:1em 2em">
<p>This Notebook demonstrates results reported in the DSH article, <a href="https://doi.org/10.1093/llc/fqac006">"Towards a linked open data resource for direct speech acts in Greek and Latin epic"</a>.</p>
</div>


# Heroes and their moms

Let's say we're young scholars interested in Telemachus' speech to Penelope.
 - How often does he speak to her?
 - What kind of language does he use?
 - How does the narrator refer to these speeches?
 
We'll start by showing how the DICES database and Python library can be used to retrieve and manipulate the speeches in question. Then we'll expand our perspective to show how DICES enables research on a "distant reading" scale, taking in all heroes and their mothers. Finally, we'll check the accuracy of the automated methods by comparing against a benchmark of hand-curated mother-child speech data.

## Preliminaries

In [1]:
import pandas as pd
from collections import Counter

### Install the client library

If you don't have the DICES client library, you can install it with **pip**:
```
pip install git+https://github.com/cwf2/dices-client.git
```

### The DICES API

When you instantiate the API, you can optionally provide endpoints for the DICES database and for a CTS server hosting the texts.

- The default endpoint for DICES is our Uni-Rostock server.

- The default for texts is the [Perseids CTS server](https://cts.perseids.org/).

Finally, just for Jupyter, I'm passing an optional progress bar generator, and sending log messages to a local file.

In [2]:
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar

api = DicesAPI(progress_class=NotebookPBar, logfile='dices.log')

### CLTK

#### Set up tokenizers, lemmatizers

I like to have one convenience function that I can call on every speech, regardless of language. That means I have to set up language-specific pipelines first.

In [3]:
from cltk.nlp import NLP

cltk_nlp = {
    'greek': NLP('grc', suppress_banner=True),
    'latin': NLP('lat', suppress_banner=True),    
}

In [4]:
# trim each pipeline to its first two processes for speed;
# these must still supply the tokens and lemmata used below
cltk_nlp['greek'].pipeline.processes = cltk_nlp['greek'].pipeline.processes[:2]
cltk_nlp['latin'].pipeline.processes = cltk_nlp['latin'].pipeline.processes[:2]
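
The "one call for every language" idea can be illustrated without CLTK. The following is only a sketch: the two `fake_*_pipeline` functions and the `analyze` helper are hypothetical stand-ins introduced for illustration, not CLTK objects. The point is simply that a dictionary keyed by language name lets the calling code stay the same for every speech.

```python
# Stand-in "pipelines": plain functions, NOT real CLTK NLP objects.
def fake_greek_pipeline(text):
    return [('grc', tok) for tok in text.split()]

def fake_latin_pipeline(text):
    return [('lat', tok) for tok in text.split()]

# Same shape as the cltk_nlp dictionary: language name -> callable.
pipelines = {
    'greek': fake_greek_pipeline,
    'latin': fake_latin_pipeline,
}

def analyze(lang, text):
    """Run whichever pipeline is registered for `lang` on `text`."""
    try:
        pipeline = pipelines[lang]
    except KeyError:
        raise ValueError(f'no pipeline registered for language {lang!r}') from None
    return pipeline(text)

print(analyze('latin', 'arma virumque cano'))
# [('lat', 'arma'), ('lat', 'virumque'), ('lat', 'cano')]
```

Failing loudly on an unregistered language is a design choice made for this sketch; it makes a missing pipeline obvious before any counting starts.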

### WikiData

To figure out how characters are related, which isn't in the DICES metadata, I'm going to use [WikiData](https://www.wikidata.org), via the *qwikidata* package.

In [5]:
from qwikidata.linked_data_interface import get_entity_dict_from_api
from qwikidata.entity import WikidataItem, WikidataProperty

## Part 1

Let's start by building a lexicon for all the words Telemachus speaks to Penelope.

### Identify and download the speeches

Using the hand-rolled DICES API code, we can search speeches using keywords. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download. I've added a progress bar widget because I get impatient.

Note that I can specify both the speaker and the addressee.

In [6]:
speeches = api.getSpeeches(spkr_name='Telemachus', addr_name='Penelope')

What did we get?

In [7]:
for s in speeches:
    print(s)

<Speech 713: Odyssey 1.346-1.359>
<Speech 1103: Odyssey 17.46-17.56>
<Speech 1107: Odyssey 17.108-17.149>
<Speech 1170: Odyssey 18.227-18.242>
<Speech 1270: Odyssey 21.344-21.353>
<Speech 1323: Odyssey 23.97-23.103>


### Retrieve the passages from a remote library

We have the metadata for each speech; now we need the text. The DICES library uses [MyCapytain](https://mycapytain.readthedocs.io) under the hood to retrieve the passages from a remote CTS server.

In [8]:
for s in speeches:
    cts_passage = s.getCTS()
    s.text = cts_passage.text
    
    print(f'{s.author.name} {s.work.title} {s.l_range}')
    print(s.text)
    print()

Homer Odyssey 1.346-1.359
μῆτερ ἐμή, τί τʼ ἄρα φθονέεις ἐρίηρον ἀοιδὸν τέρπειν ὅππῃ οἱ νόος ὄρνυται; οὔ νύ τʼ ἀοιδοὶ αἴτιοι, ἀλλά ποθι Ζεὺς αἴτιος, ὅς τε δίδωσιν ἀνδράσιν ἀλφηστῇσιν, ὅπως ἐθέλῃσιν, ἑκάστῳ. τούτῳ δʼ οὐ νέμεσις Δαναῶν κακὸν οἶτον ἀείδειν· τὴν γὰρ ἀοιδὴν μᾶλλον ἐπικλείουσʼ ἄνθρωποι, ἥ τις ἀκουόντεσσι νεωτάτη ἀμφιπέληται. σοὶ δʼ ἐπιτολμάτω κραδίη καὶ θυμὸς ἀκούειν· οὐ γὰρ Ὀδυσσεὺς οἶος ἀπώλεσε νόστιμον ἦμαρ ἐν Τροίῃ, πολλοὶ δὲ καὶ ἄλλοι φῶτες ὄλοντο. ἀλλʼ εἰς οἶκον ἰοῦσα τὰ σʼ αὐτῆς ἔργα κόμιζε, ἱστόν τʼ ἠλακάτην τε, καὶ ἀμφιπόλοισι κέλευε ἔργον ἐποίχεσθαι· μῦθος δʼ ἄνδρεσσι μελήσει πᾶσι, μάλιστα δʼ ἐμοί· τοῦ γὰρ κράτος ἔστʼ ἐνὶ οἴκῳ.

Homer Odyssey 17.46-17.56
μῆτερ ἐμή, μή μοι γόον ὄρνυθι μηδέ μοι ἦτορ ἐν στήθεσσιν ὄρινε φυγόντι περ αἰπὺν ὄλεθρον· ἀλλʼ ὑδρηναμένη, καθαρὰ χροῒ εἵμαθʼ ἑλοῦσα, εἰς ὑπερῷʼ ἀναβᾶσα σὺν ἀμφιπόλοισι γυναιξὶν εὔχεο πᾶσι θεοῖσι τεληέσσας ἑκατόμβας ῥέξειν, αἴ κέ ποθι Ζεὺς ἄντιτα ἔργα τελέσσῃ. αὐτὰρ ἐγὼν ἀγορὴν ἐσελεύσομαι, ὄφρα καλέσσω ξεῖνον, ὅτις

### Use CLTK to parse the text

Running the CLTK NLP pipeline on the text of a speech will return a specialized cltk `Doc` object containing the bundled results of all processes in the pipeline. First, we’ll run the appropriate language-specific pipeline on each speech and save the resulting object as a new attribute of the speech.

In [9]:
pbar = NotebookPBar(max=len(speeches))

for s in speeches:
    s.cltk_doc = cltk_nlp[s.lang](s.text)
    pbar.update()

HBox(children=(IntProgress(value=0, bar_style='info', max=6), Label(value='0/6')))

Now let’s go through again and count the lemmata. We’ll create a lemma counter to hold the tallies.

In [10]:
lems = Counter()

for s in speeches:
    lems.update([w.lemma for w in s.cltk_doc])

Convert the counter to a Pandas data frame for tidier presentation.

In [11]:
results = pd.DataFrame(lems.most_common(20), columns=['lemma', 'count'])
results

Unnamed: 0,lemma,count
0,ἐγώ,23
1,καί,20
2,ὁ,19
3,δʼ,17
4,οὐ,14
5,ὅς,13
6,τε,9
7,σύ,9
8,ἐν,9
9,εἰμί,9
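
Raw counts like these are hard to compare between speakers who say different amounts. As a minimal sketch, using toy lemma lists rather than real CLTK output, the same `Counter` tally can be turned into relative frequencies. The variable `speech_lemmata` and its contents are invented for illustration.

```python
from collections import Counter

# Toy lemma lists standing in for the per-speech CLTK results.
speech_lemmata = [
    ['μήτηρ', 'ἐγώ', 'καί'],
    ['ἐγώ', 'καί', 'ἐγώ'],
]

# Tally lemmata across all speeches, as in the cell above.
lems = Counter()
for lemmata in speech_lemmata:
    lems.update(lemmata)

# Report each of the top lemmata as a share of all tokens counted.
total = sum(lems.values())
for lemma, count in lems.most_common(2):
    print(lemma, count, f'{count / total:.2f}')
```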


## Part 2

Now let's think more broadly. How typical is this kind of speech? We can use external linked data to find other examples of mother-son conversations in the corpus.

### Some custom code to query WikiData

This lets us ask whether a given addressee belongs to the set of people having a certain relationship to a given speaker. It takes a while to download the WikiData entities, and I had to run this a number of times, so I cache the WD data in the respective character objects once it's downloaded.

In [12]:
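def checkWD(c):
    '''return the character's WikiData ID, or None if it is missing or blank'''
    if c.char is not None:
        if c.char.wd is not None:
            if len(c.char.wd.strip()) > 0:
                return c.char.wd.strip()

def checkWDRelation(s, a, relation, cache=None):
    '''check whether `a` is a value of the WikiData property `relation` on `s`;
    with 'P25', this asks whether `a` is the mother of `s`'''
    if cache is None:
        cache = {}
    else:
        # reuse an earlier answer for this ordered (subject, object) pair
        if (s.id, a.id) in cache:
            return cache[(s.id, a.id)]

    res = False

    # download the WikiData entity once and keep it on the character object
    if not hasattr(s, 'wd_ent'):
        s.wd_ent = WikidataItem(get_entity_dict_from_api(s.wd))

    claim_group = s.wd_ent.get_truthy_claim_group(relation)

    for claim in claim_group:
        # skip claims that carry no concrete value
        if claim.mainsnak.datavalue is None:
            continue
        if claim.mainsnak.datavalue.value['id'] == a.wd:
            res = True
    
    # the cache key omits `relation`, so use a separate cache per relation
    cache[(s.id, a.id)] = res
    return res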
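
To make the lookup less opaque, here is a rough sketch of the same check run against a hand-written dictionary shaped like WikiData's entity JSON. The `mother_ids` helper and the `jason` dictionary are illustrative assumptions; only the two IDs (Jason `Q176758`, Alcimede `Q2718542`) come from this notebook's data. Unlike `get_truthy_claim_group`, this simplified version ignores claim ranks.

```python
def mother_ids(entity_dict):
    """Collect the item IDs listed under P25 ("mother") in an entity dict."""
    ids = set()
    for claim in entity_dict.get('claims', {}).get('P25', []):
        datavalue = claim.get('mainsnak', {}).get('datavalue')
        if datavalue is None:
            # "unknown value" / "no value" claims have no datavalue
            continue
        ids.add(datavalue['value']['id'])
    return ids

# Hand-written stand-in for a downloaded entity; only the fields used here.
jason = {
    'id': 'Q176758',
    'claims': {
        'P25': [
            {'mainsnak': {'snaktype': 'value',
                          'datavalue': {'type': 'wikibase-entityid',
                                        'value': {'entity-type': 'item',
                                                  'id': 'Q2718542'}}}},
            {'mainsnak': {'snaktype': 'somevalue'}},
        ]
    },
}

print('Q2718542' in mother_ids(jason))  # True: Alcimede is listed as mother
print('Q176758' in mother_ids(jason))   # False: Jason is not his own mother
```

The check is directional: the claims are read from the child's entity, so asking whether the speaker is the addressee's mother means starting from the addressee's entity instead.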

For example, the WikiData property "mother" has the ID `'P25'`. Here's how we ask whether a given addressee is the mother of a given speaker:

In [13]:
speaker = api.getCharacters(name='Telemachus')[0]
addressee = api.getCharacters(name='Penelope')[0]

print(f'Is {addressee.name} the mother of {speaker.name}?')
print(checkWDRelation(speaker, addressee, 'P25'))

Is Penelope the mother of Telemachus?
True


I also added a separate cache just for the boolean result of checkWDRelation, to save a little more time.

In [14]:
cache_mothers = {}

### Using WikiData to filter the speeches

The DICES dataset includes WikiData ids for most of the characters (not all). The DICES API doesn't let us query WikiData itself, though. For now, the easiest thing is just to download all the speeches and character IDs, and then cross reference them against WikiData using its own API.

In [15]:
# download all the speeches: takes a minute
speeches = (
    api.getSpeeches(work_title='Iliad', progress=True) +
    api.getSpeeches(work_title='Odyssey', progress=True) +    
    api.getSpeeches(author_name='Apollonius', progress=True) +
    api.getSpeeches(author_name='Virgil', progress=True))

speeches.sort()

HBox(children=(IntProgress(value=0, bar_style='info', max=698), Label(value='0/698')))

HBox(children=(IntProgress(value=0, bar_style='info', max=673), Label(value='0/673')))

HBox(children=(IntProgress(value=0, bar_style='info', max=143), Label(value='0/143')))

HBox(children=(IntProgress(value=0, bar_style='info', max=341), Label(value='0/341')))

**Check each speaker-addressee pair against WikiData**

What we actually do here is download the WikiData entity for each speaker, if we don't already have it cached. Then we ask the WD entity for its mom(s), and check the WD ID of the addressee against the results.

In [16]:
# start with an empty table
rows = []

# create a progress bar
pbar = NotebookPBar(start=0, max=len(speeches))

# iterate over all the speeches, checking each speaker-addressee combination
for s in speeches:
    if s.spkr is not None and s.addr is not None:
        for spkr in s.spkr:
            spkr_wd = checkWD(spkr)
            if spkr_wd is not None:

                for addr in s.addr:
                    addr_wd = checkWD(addr)
                    if addr_wd is not None:
                        rows.append((
                            s.id,
                            s.work.title,
                            s.l_fi,
                            s.l_la,
                            spkr.char.name, spkr_wd, 
                            addr.char.name, addr_wd,
                            checkWDRelation(addr.char, spkr.char, 'P25', cache=cache_mothers),  # sp_is_mom: speaker is the addressee's mother
                            checkWDRelation(spkr.char, addr.char, 'P25', cache=cache_mothers)   # ad_is_mom: addressee is the speaker's mother
                            ))
    pbar.update()

# finally, organize the table as a pandas data frame
df = pd.DataFrame(rows, columns=['id', 'work', 'l_first', 'l_last', 'spkr', 'sp_wd', 'addr', 'ad_wd', 'sp_is_mom', 'ad_is_mom'])

HBox(children=(IntProgress(value=0, bar_style='info', max=1855), Label(value='0/1855')))

🤔 Let's take a look at the results. Here is the complete set of speeches, with two additional columns: `sp_is_mom` is true if the speaker is the addressee's mother, and `ad_is_mom` is true if the addressee is the speaker's mother.

As a quick sanity check, the first two speeches in the Argonautica, which were at the top of the list when I ran this, are between Jason and his mother, Alcimede.

In [17]:
df[df['work']=='Argonautica']

Unnamed: 0,id,work,l_first,l_last,spkr,sp_wd,addr,ad_wd,sp_is_mom,ad_is_mom
0,1387,Argonautica,1.278,1.291,Alcimede,Q2718542,Jason,Q176758,True,False
1,1388,Argonautica,1.295,1.305,Jason,Q176758,Alcimede,Q2718542,False,True
2,1392,Argonautica,1.411,1.424,Jason,Q176758,Apollo,Q37340,False,False
3,1394,Argonautica,1.463,1.471,Idas,Q1136130,Jason,Q176758,False,False
4,1395,Argonautica,1.476,1.484,Idmon,Q748144,Idas,Q1136130,False,False
...,...,...,...,...,...,...,...,...,...,...
94,1510,Argonautica,4.1073,4.1095,Arete,Q3622209,Alcinous,Q496595,False,False
95,1511,Argonautica,4.1098,4.1109,Alcinous,Q496595,Arete,Q3622209,False,False
96,1522,Argonautica,4.1564,4.1570,Euphemus,Q749123,Triton,Q148030,False,False
97,1524,Argonautica,4.1597,4.1600,Jason,Q176758,Triton,Q148030,False,False


Thanks to pandas, we can filter the data frame on the new boolean columns to show only speeches between mother and child.

In [18]:
hits = df.loc[df['sp_is_mom'] | df['ad_is_mom'],
             ['work', 'l_first', 'l_last', 'spkr', 'addr']]
hits

Unnamed: 0,work,l_first,l_last,spkr,addr
0,Argonautica,1.278,1.291,Alcimede,Jason
1,Argonautica,1.295,1.305,Jason,Alcimede
49,Argonautica,3.129,3.144,Aphrodite,Eros
50,Argonautica,3.151,3.153,Aphrodite,Eros
51,Argonautica,3.26,3.267,Chalciope,Argus (son of Phrixus)
124,Iliad,1.352,1.356,Achilles,Thetis
125,Iliad,1.362,1.363,Thetis,Achilles
126,Iliad,1.365,1.412,Achilles,Thetis
127,Iliad,1.414,1.427,Thetis,Achilles
137,Iliad,1.586,1.594,Hephaestus,Hera


Pandas also comes in handy if I want to export this data to a CSV file, which Excel can open:

In [19]:
hits.to_csv('example.csv')

### Validation

Let's see how well the automated approach worked. We'll load up a hand-corrected list of mother-child speeches and compare.

In [20]:
bench = pd.read_csv('data/moms-bench.csv', dtype=str)
bench

Unnamed: 0,work,l_first,l_last,spkr,addr
0,Iliad,1.352,356,Achilles,Thetis
1,Iliad,1.362,363,Thetis,Achilles
2,Iliad,1.365,412,Achilles,Thetis
3,Iliad,1.414,427,Thetis,Achilles
4,Iliad,1.586,594,Hephaestus,Hera
...,...,...,...,...,...
56,Aeneid,6.194,197,Aeneas,two doves and Venus
57,Aeneid,8.612,614,Venus,Aeneas
58,Aeneid,9.83,92,Cybele,Jupiter
59,Aeneid,9.94,103,Jupiter,Cybele


Let's look at the union of `hits` and `bench` to see how we did:

In [21]:
results = hits.merge(bench, on=['work', 'l_first'], how='outer', 
                        suffixes=['_h', '_b'], indicator=True)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(results[['work', 'l_first', 'spkr_h', 'addr_h', 'spkr_b', 'addr_b', '_merge']])

Unnamed: 0,work,l_first,spkr_h,addr_h,spkr_b,addr_b,_merge
0,Argonautica,1.278,Alcimede,Jason,Alcimede,Jason,both
1,Argonautica,1.295,Jason,Alcimede,Jason,Alcimede,both
2,Argonautica,3.129,Aphrodite,Eros,Aphrodite,Eros,both
3,Argonautica,3.151,Aphrodite,Eros,Aphrodite,Eros,both
4,Argonautica,3.26,Chalciope,Argus (son of Phrixus),Chalciope,Argus,both
5,Iliad,1.352,Achilles,Thetis,Achilles,Thetis,both
6,Iliad,1.362,Thetis,Achilles,Thetis,Achilles,both
7,Iliad,1.365,Achilles,Thetis,Achilles,Thetis,both
8,Iliad,1.414,Thetis,Achilles,Thetis,Achilles,both
9,Iliad,1.586,Hephaestus,Hera,Hephaestus,Hera,both
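
The logic behind the `_merge` column can be sketched with plain Python sets. This is not how pandas implements the merge; it is a simplified stand-in, with a hypothetical `classify` helper and a handful of toy keys, that shows what `both`, `left_only` and `right_only` mean for a `(work, l_first)` key.

```python
def classify(hit_keys, bench_keys):
    """Label each (work, l_first) key the way an outer merge indicator would."""
    hit_keys, bench_keys = set(hit_keys), set(bench_keys)
    labels = {}
    for key in sorted(hit_keys | bench_keys):
        if key in hit_keys and key in bench_keys:
            labels[key] = 'both'          # found automatically and in the benchmark
        elif key in hit_keys:
            labels[key] = 'left_only'     # found automatically only
        else:
            labels[key] = 'right_only'    # in the benchmark only
    return labels

# Toy keys, a small subset chosen for illustration.
hit_keys = [('Iliad', '1.352'), ('Iliad', '1.362')]
bench_keys = [('Iliad', '1.352'), ('Iliad', '1.362'), ('Aeneid', '9.83')]

for key, label in classify(hit_keys, bench_keys).items():
    print(key, label)
```

Because the keys are compared exactly, the line numbers need to be formatted identically in both tables; a key stored as `'3.26'` in one and `'3.260'` in the other would not match.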


#### Precision and Recall

In [22]:
true_pos = sum(results['_merge'] == 'both')

p = true_pos / hits.shape[0]
r = true_pos / bench.shape[0]

print(f'Precision: {p:.2f}')
print(f'Recall:    {r:.2f}')

Precision: 1.00
Recall:    0.92
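
For readers new to these measures: precision is the share of retrieved speeches that are correct, and recall is the share of benchmark speeches that were retrieved. The sketch below restates the calculation as a small function with made-up counts; the `precision_recall` helper and its zero-division guard are additions for illustration, not part of the notebook's workflow.

```python
def precision_recall(true_pos, n_retrieved, n_relevant):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = true_pos / n_retrieved if n_retrieved else 0.0
    recall = true_pos / n_relevant if n_relevant else 0.0
    return precision, recall

# Toy counts, NOT the notebook's results:
# 10 speeches retrieved, 9 of them correct, 12 in the benchmark.
p, r = precision_recall(true_pos=9, n_retrieved=10, n_relevant=12)
print(f'Precision: {p:.2f}')  # Precision: 0.90
print(f'Recall:    {r:.2f}')  # Recall:    0.75
```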


### Discussion

#### The good news
We got almost all of the benchmark set, with exceptions to be discussed below, and no false positives.

#### The bad news
Let's look a little more closely at the speeches we missed:

In [23]:
missed = results[results['_merge'] == 'right_only'][
                ['work', 'l_first', 'spkr_h', 'addr_h', 'spkr_b', 'addr_b']]
missed

Unnamed: 0,work,l_first,spkr_h,addr_h,spkr_b,addr_b
56,Iliad,15.104,,,Hera,gods
57,Iliad,15.115,,,Ares,gods
58,Aeneid,9.83,,,Cybele,Jupiter
59,Aeneid,9.94,,,Jupiter,Cybele
60,Aeneid,9.481,,,Euryalus' mother,Euryalus


At a glance, I'd say these fall into three groups:

 1. A conversation in the Iliad between Hera and a group of gods, some of whom are her children.
 2. A conversation in the Aeneid between Jupiter and "Cybele," i.e., Rhea.
 3. A conversation in the Aeneid between Euryalus and his anonymous mother.

#### Digging a little deeper

First, let's check which of these speeches made it into the database results.

In [24]:
missed.merge(df, how='left', on=['work', 'l_first'])[[
    'work', 'l_first',                               # keys: work and locus
    'id', 'spkr', 'addr', 'sp_is_mom', 'ad_is_mom',  # cols from df
    'spkr_b', 'addr_b'                               # cols from bench
    
]]

Unnamed: 0,work,l_first,id,spkr,addr,sp_is_mom,ad_is_mom,spkr_b,addr_b
0,Iliad,15.104,,,,,,Hera,gods
1,Iliad,15.115,,,,,,Ares,gods
2,Aeneid,9.83,1737.0,Cybele,Zeus,False,False,Cybele,Jupiter
3,Aeneid,9.94,1738.0,Zeus,Cybele,False,False,Jupiter,Cybele
4,Aeneid,9.481,,,,,,Euryalus' mother,Euryalus


The speech between Euryalus and his mom is missing from the database. That's because, as of this writing, we don't have a systematic way of including anonymous characters like her, that is, characters described only by a family relation or an occupation. Because the character doesn't fit our current data model, she gets omitted, and this speech fails to be added to the database.

The conversation between Hera and the gods is there in the database, but the speaker-addressee pairs are not registering as mother-child relationships. In this case, it's because "gods" isn't being parsed as including all the individual gods, but rather as a corporate entity that doesn't have "mother" or "child" as properties. This highlights another issue that needs to be resolved in our data model.

Finally, the conversation between Jupiter and his mom is also in the database, and each of these characters is matched with a WikiData entity, but we're not getting the right answer about their kinship relation because WikiData has distinct entities for the Greek goddess Rhea and the Phrygian goddess Cybele. We can fix this by pointing the character's WikiData ID to the Greek goddess instead (as we did for the Roman deities), but maybe we should think about the larger problem of poetic ambiguity/metonymy/syncretism.
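
One possible variant of that fix, sketched here purely as an assumption and not as anything DICES currently does, is to leave the stored IDs alone and apply an alias table at comparison time. The IDs below (`WD_CYBELE`, `WD_RHEA`) are placeholders, not real WikiData identifiers, and `canonical` and `is_mother` are hypothetical helpers.

```python
# Placeholder IDs: NOT real WikiData identifiers.
ALIASES = {'WD_CYBELE': 'WD_RHEA'}

def canonical(wd_id, aliases=ALIASES):
    """Map an ID to its preferred equivalent, or return it unchanged."""
    return aliases.get(wd_id, wd_id)

def is_mother(speaker_mother_ids, addressee_id, aliases=ALIASES):
    """Compare IDs only after both sides have been canonicalized."""
    mothers = {canonical(m, aliases) for m in speaker_mother_ids}
    return canonical(addressee_id, aliases) in mothers

# Suppose the speaker's entity lists only the placeholder for Rhea as mother.
jupiter_mothers = {'WD_RHEA'}

print(is_mother(jupiter_mothers, 'WD_CYBELE'))              # True
print(is_mother(jupiter_mothers, 'WD_CYBELE', aliases={}))  # False
```

Keeping the equivalences in a separate table would make each identification an explicit, reviewable decision, which matters when two figures are the same in one poem and distinct in another.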

### Takeaways

 - WikiData gave us a lot for free -- all of the individual mother-child relationships were in there when we knew where to look.
 
 - There is still some important work to be done refining our underlying data model.
 
 - If we want to rely on linked open data for high-stakes work, we need resources that are sensitive to the details we care about. We hope that MANTO, because it's specific to Classical myth and hand-curated by domain experts, can help us with problems like when to treat Cybele and Rhea as independent entities and when to consider them identical.