# Annotating TLG results with DICES speech data

In this example, we have a file (`data/tlg_κερτομεωLemma.csv`) that contains exported results from a TLG lemma search for κερτομέω. Each line represents one result, giving author, work, and locus, but in TLG's idiosyncratic format. Our goal here is to check the TLG output file line by line against the DICES database to see whether each use of κερτομέω occurs in a speech, in the introduction to a speech, or in the narrator text.

To do this, we'll use the `tlg` attribute recently added to DICES Work records. Once we know which work we're looking at, we have to parse TLG's representation of the locus and convert it to something like **book.line**. Next, we download all the speeches in the corresponding work and check whether the TLG locus falls between `l_fi`, the speech's first line, and `l_la`, the speech's last line.

## Preliminaries

### import statements

In [1]:
import os
import re
import pandas as pd
from dicesapi import DicesAPI
from dicesapi.text import CtsAPI
from modules import tlg

### input file

In [2]:
tlg_file = os.path.join('data', 'tlg_κερτομεωLemma.csv')

### set up the connection to DICES

At the moment, this only works with the development server; I'm hopeful that in the next week it will work with the default server.

In [3]:
server = 'http://dices.ub.uni-rostock.de/dev/api'
api = DicesAPI(dices_api=server, logfile='dices.log')

## Input

### Read the TLG file

Here we use pandas to load the table of values exported from TLG.

- I'm stripping some extraneous characters from the locus
- The date and locus ended up in the same column, so I'm separating them
- The original locus, in English, is left as `tlg_locus`
- We try our best to parse it and create a new, more standard locus under `locus`

In [4]:
df = pd.read_csv(tlg_file, header=None)
df.columns = ['author', 'work', 'tlg_id', 'locus']
df['tlg_id'] = df['tlg_id'].str.strip(r'="')

date_pat = r'^\s*\((.+?)\)\s+'

df['date'] = df['locus'].str.extract(date_pat)
df['tlg_locus'] = df['locus'].str.replace(date_pat, '', regex=True)
df['locus'] = df['tlg_locus'].apply(tlg.extractLoc)
df

| | author | work | tlg_id | locus | date | tlg_locus |
|---|---|---|---|---|---|---|
| 0 | HOMERUS Epic. | Ilias | 0012.001 | 2.256 | 8 B.C. | Book 2 line 256 |
| 1 | HOMERUS Epic. | Ilias | 0012.001 | 16.261 | 8 B.C. | Book 16 line 261 |
| 2 | HOMERUS Epic. | Odyssea | 0012.002 | 2.323 | 8 B.C. | Book 2 line 323 |
| 3 | HOMERUS Epic. | Odyssea | 0012.002 | 7.17 | 8 B.C. | Book 7 line 17 |
| 4 | HOMERUS Epic. | Odyssea | 0012.002 | 8.153 | 8 B.C. | Book 8 line 153 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | COSMAS Hierosolymitanus Poeta | Canon | 3025.005 | 4.27 | A.D. 7-8 | Ode 4 line 27 |
| 96 | IGNATIUS Diaconus Hagiogr. | Tetrasticha iambica | 9012.007 | 2.11.2 | A.D. 8-9 | Book 2 poem 11 line 2 |
| 97 | IGNATIUS Diaconus Hagiogr. | Tetrasticha iambica | 9012.007 | 2.15.4 | A.D. 8-9 | Book 2 poem 15 line 4 |
| 98 | ANONYMUS LEXICOGRAPHUS Lexicogr. | Συναγωγὴ λέξεων χρησίμων (Versio antiqua) | 4160.001 | | A.D. 8/9 | Alphabetic letter kappa lemma 294 line 1 |
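
The locus conversion itself happens in `tlg.extractLoc`, which lives in `modules/tlg.py` and is not shown in this notebook. As a rough illustration only, a conversion like the one visible in the table could be sketched as follows. The name `parse_tlg_locus` and the list of recognised labels (book, ode, poem, line) are assumptions made for this example, not the project's actual code.

```python
import re

# Hypothetical stand-in for tlg.extractLoc; the real helper may work differently.
# One "unit" is a label we recognise followed by a number, e.g. "Book 2".
UNIT = r'(?:book|ode|poem|line)\s+\d+'
WHOLE_LOCUS = re.compile(rf'\s*{UNIT}(?:\s+{UNIT})*\s*', re.IGNORECASE)

def parse_tlg_locus(tlg_locus):
    """Turn 'Book 2 line 256' into '2.256'; return None for formats we don't recognise."""
    if not WHOLE_LOCUS.fullmatch(tlg_locus):
        return None
    # Keep only the numbers, in order, joined with dots
    return '.'.join(re.findall(r'\d+', tlg_locus))

for example in ['Book 2 line 256',
                'Book 2 poem 11 line 2',
                'Alphabetic letter kappa lemma 294 line 1']:
    print(example, '->', parse_tlg_locus(example))
```

Returning `None` for unrecognised formats mirrors the empty `locus` value in the last row of the table, where the lexicon entry has no book/line structure to convert.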


### Download the speech data

We download the list of Work records and check the `tlg` attribute against the list of TLG ids in the input file. For each work that occurs in the data, we download all the speeches. I'm keeping them in one big dictionary, with TLG ids as the keys and SpeechGroup objects as the values.

In [5]:
works = api.getWorks()
speeches = {}

for w in works:
    tlg_id = tlg.getTLG(w)
    if (tlg_id is not None) and (tlg_id in df['tlg_id'].values):
        print(f'Downloading speeches for {w}')
        speeches[tlg_id] = api.getSpeeches(work_id=w.id)

Downloading speeches for <Work 3: Argonautica>
Downloading speeches for <Work 19: Rape of Helen>
Downloading speeches for <Work 15: Theogony>
Downloading speeches for <Work 1: Iliad>
Downloading speeches for <Work 2: Odyssey>
Downloading speeches for <Work 47: 4 To Hermes>
Downloading speeches for <Work 11: Dionysiaca>
Downloading speeches for <Work 23: Halieutica>
Downloading speeches for <Work 12: Posthomerica>


## Use TLG results to filter the speech list

### An annotation function

This function takes a given filter and applies it to all the records in the TLG results. There are two filters defined (in `modules/tlg.py`):

 - `lineIsInSpeech(line, speech)` returns True if the locus is within the bounds of the speech
 - `lineIsSpeechIntro(line, speech, window)` returns True if the locus is within *window* lines of the speech start

In [6]:
def getAnnotations(label, filterFunc, **kwargs):
    results = []

    for rec in df.to_dict(orient='records'):
        tlg_id = rec['tlg_id']
        line = rec['locus']

        note = dict(
            label = None,
            spkr = None,
            addr = None,
            l_fi = None,
            l_la = None,
            tags = None,
        )

        if tlg_id in speeches:
            matches = speeches[tlg_id].advancedFilter(lambda s: filterFunc(line, s, **kwargs))

            # if multiple matches, we likely have embedded speech: prefer most embedded
            matches.sort(key=lambda s: s.level, reverse=True)

            if len(matches) >= 1:
                note['label'] = label
                note['spkr'] = '; '.join([inst.name for inst in matches[0].spkr])
                note['addr'] = '; '.join([inst.name for inst in matches[0].addr])
                note['l_fi'] = matches[0].l_fi
                note['l_la'] = matches[0].l_la
                note['tags'] = '; '.join([t['type'] for t in matches[0]._attributes['tags']])

        results.append(note)
        
    return pd.DataFrame(results)
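
The two filters are defined in `modules/tlg.py` and are not reproduced here. The sketch below shows one way such tests could be written, assuming that loci are plain dotted numbers such as `2.256` and that a speech exposes `l_fi` and `l_la` as strings. The function names, the `locus_key` helper and the `SimpleNamespace` stand-in for a speech are all hypothetical; the real functions presumably also handle irregular line numbers, which this version does not.

```python
from types import SimpleNamespace

def locus_key(locus):
    """Convert '2.256' to (2, 256) so loci compare numerically, not as strings."""
    return tuple(int(part) for part in locus.split('.'))

def line_is_in_speech(line, speech):
    """True if the locus lies between the speech's first and last line, inclusive."""
    return locus_key(speech.l_fi) <= locus_key(line) <= locus_key(speech.l_la)

def line_is_speech_intro(line, speech, window=2):
    """True if the locus is in the same book and at most `window` lines before the speech."""
    *book, n = locus_key(line)
    *first_book, first = locus_key(speech.l_fi)
    return book == first_book and 0 < first - n <= window

# Stand-in for a DICES speech record, using loci from the example data
speech = SimpleNamespace(l_fi='2.325', l_la='2.330')
print(line_is_in_speech('2.323', speech))
print(line_is_speech_intro('2.323', speech, window=2))
```

Comparing tuples of integers rather than raw strings matters here: as strings, `'2.99'` would sort after `'2.256'`.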

### Apply filters and generate annotation

For now I'm collecting the results of each filter in a separate dataframe.

In [7]:
in_speech = getAnnotations('speech', tlg.lineIsInSpeech)
in_intro = getAnnotations('intro', tlg.lineIsSpeechIntro, window=2)

Can't parse loci: 2.116, 305-307
Can't parse loci: 2.116, 336-349
Can't parse loci: 2.116, 560-564
Can't parse loci: 2.304, 305-307
Can't parse loci: 2.304, 336-349
Can't parse loci: 2.304, 560-564


### Do the annotations overlap at all?

This could happen if κερτομέω is used at the end of one speech, within `window` lines of the start of the following speech.

In [8]:
if ((in_speech.label == 'speech') & (in_intro.label == 'intro')).any():
    print('Some annotations overlap!')
else:
    print('Everything is okay')

Everything is okay


### Combine the annotations

As long as the annotations don't overlap, we can combine them into a single table. The `label` column records whether the annotation refers to a within-speech use of κερτομέω or one in a speech introduction.

In [9]:
# copy so that in_speech itself is not modified
annotations = in_speech.copy()
annotations.loc[in_intro.label == 'intro', :] = in_intro
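
To make the merge logic easier to follow, here is a small self-contained sketch with made-up three-row tables standing in for the real annotation dataframes. The toy values are invented for illustration. It checks for overlap first, then overlays the intro rows on a copy of the speech rows, so neither input table is changed.

```python
import pandas as pd

# Toy stand-ins for the two annotation tables (values are illustrative only)
in_speech = pd.DataFrame({'label': ['speech', None, None],
                          'spkr': ['Odysseus', None, None]})
in_intro = pd.DataFrame({'label': [None, 'intro', None],
                         'spkr': [None, 'suitor', None]})

# A row overlaps if both filters annotated it
overlap = (in_speech['label'] == 'speech') & (in_intro['label'] == 'intro')
print('overlap:', overlap.any())

# Overlay the intro rows on a copy, leaving in_speech untouched
is_intro = in_intro['label'] == 'intro'
annotations = in_speech.copy()
annotations.loc[is_intro, :] = in_intro.loc[is_intro, :]
print(annotations)
```

If overlaps did occur, the overlay would silently replace the within-speech annotation with the intro one, which is why the check comes first.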

In [10]:
pd.concat([df, annotations], axis=1)

| | author | work | tlg_id | locus | date | tlg_locus | label | spkr | addr | l_fi | l_la | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOMERUS Epic. | Ilias | 0012.001 | 2.256 | 8 B.C. | Book 2 line 256 | speech | Odysseus | Thersites | 2.246 | 2.264 | vit; thr |
| 1 | HOMERUS Epic. | Ilias | 0012.001 | 16.261 | 8 B.C. | Book 16 line 261 | | | | | | |
| 2 | HOMERUS Epic. | Odyssea | 0012.002 | 2.323 | 8 B.C. | Book 2 line 323 | intro | suitor of Penelope | suitors of Penelope | 2.325 | 2.330 | del |
| 3 | HOMERUS Epic. | Odyssea | 0012.002 | 7.17 | 8 B.C. | Book 7 line 17 | | | | | | |
| 4 | HOMERUS Epic. | Odyssea | 0012.002 | 8.153 | 8 B.C. | Book 8 line 153 | speech | Odysseus | Laodamas | 8.153 | 8.157 | res; lam |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | COSMAS Hierosolymitanus Poeta | Canon | 3025.005 | 4.27 | A.D. 7-8 | Ode 4 line 27 | | | | | | |
| 96 | IGNATIUS Diaconus Hagiogr. | Tetrasticha iambica | 9012.007 | 2.11.2 | A.D. 8-9 | Book 2 poem 11 line 2 | | | | | | |
| 97 | IGNATIUS Diaconus Hagiogr. | Tetrasticha iambica | 9012.007 | 2.15.4 | A.D. 8-9 | Book 2 poem 15 line 4 | | | | | | |
| 98 | ANONYMUS LEXICOGRAPHUS Lexicogr. | Συναγωγὴ λέξεων χρησίμων (Versio antiqua) | 4160.001 | | A.D. 8/9 | Alphabetic letter kappa lemma 294 line 1 | | | | | | |


## Save the results

Results are saved to a new csv file (`κερτομεω_annotated.csv`).

In [11]:
output = 'κερτομεω_annotated.csv'
pd.concat([df, annotations], axis=1).to_csv(output, index=False)