### Install beta version of DICES client library

```
pip install git+https://github.com/cwf2/dices-client.git
```

### Import statements

In [1]:
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar
import pandas as pd
import re
from copy import deepcopy
from IPython.display import HTML, display

### Set up connection to DICES database

In [2]:
dices = DicesAPI(
    dices_api = 'http://csa20211203-005.uni-rostock.de/api',
    cts_api = 'https://scaife-cts.perseus.org/api/cts',
    progress_class = NotebookPBar,
    logfile='dices.log',
)

### Download Lucan's speeches

In [3]:
speeches = dices.getSpeeches(author_name='Lucan', progress=True)


A super quick look at what we get:

In [4]:
pd.DataFrame([dict(
    urn = s.urn,
    first_line = s.l_fi,
    last_line = s.l_la,
    speaker = ', '.join([inst.name for inst in s.spkr]),
    addressee = ', '.join([inst.name for inst in s.addr]),
) for s in speeches])

| | urn | first_line | last_line | speaker | addressee |
|---|---|---|---|---|---|
| 0 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 1.190 | 1.192 | Roma | Gaius Julius Caesar, soldiers |
| 1 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 1.195 | 1.203 | Gaius Julius Caesar | Roma |
| 2 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 1.248 | 1.257 | people of Arminium | Fortuna, people of Arminium |
| 3 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 1.273 | 1.291 | Gaius Scribonius Curio | Gaius Julius Caesar |
| 4 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 1.299 | 1.351 | Gaius Julius Caesar | soldiers |
| ... | ... | ... | ... | ... | ... |
| 118 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:9... | 9.1064 | 9.1104 | Gaius Julius Caesar | |
| 119 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 10.85 | 10.103 | Cleopatra VII Philopator | Gaius Julius Caesar |
| 120 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 10.176 | 10.192 | Gaius Julius Caesar | Acoreus |
| 121 | urn:cts:latinLit:phi0917.phi001.perseus-lat2:1... | 10.194 | 10.331 | Acoreus | Gaius Julius Caesar |


### Getting the text

The speech records in DICES hold only metadata; to get the text, we use CTS to request each passage from Perseus. I'm going to tack the passages onto the existing speech objects.

One limitation of DICES right now: **the line is the finest granularity we have for beginnings and endings.** So for speeches that start or end partway through a line, we pick up *verba dicendi* and other extra material. For a lot of our Greek texts this isn't an issue, and for some other authors quotation marks or `<q>` tags in the XML let us find the edges of the speech, but not for Lucan.
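
For texts that do mark speech with `<q>`, finding the edges could look roughly like the sketch below. The sample markup and the helper name `quoted_text` are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` rather than whatever the client library returns.

```python
import xml.etree.ElementTree as ET

TEI_NS = '{http://www.tei-c.org/ns/1.0}'

# hypothetical sample markup: a verbum dicendi, then quoted speech
sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<l n="1">Tum sic fatur: <q>Quo tenditis ultra?</q></l>'
    '<l n="2"><q>Quo fertis mea signa, viri?</q> Dixit.</l>'
    '</div>'
)

def quoted_text(xml_string):
    '''Return the text inside each <q> element, in document order.'''
    root = ET.fromstring(xml_string)
    return [''.join(q.itertext()) for q in root.iter(TEI_NS + 'q')]

quotes = quoted_text(sample)
print(quotes)
```

Anything outside the `<q>` elements (here the framing words) is simply left out, which is the trimming we can't do for Lucan.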

In [5]:
# takes long enough that I like a progress bar
pbar = NotebookPBar(max=len(speeches))

for s in speeches:
    try:
        s.cts_passage = s.getCTS()
    except Exception as e:
        # a bare except would also swallow KeyboardInterrupt
        print('Failed to get', s, '-', e)
        s.cts_passage = None
    pbar.update()


#### Whole speeches

The simplest way to get the text is the `text` attribute of each CTS passage.

In [6]:
pd.DataFrame([dict(
    first_line = s.l_fi,
    last_line = s.l_la,
    speaker = ', '.join([inst.name for inst in s.spkr]),
    addressee = ', '.join([inst.name for inst in s.addr]),
    text = s.cts_passage.text,
) for s in speeches])

| | first_line | last_line | speaker | addressee | text |
|---|---|---|---|---|---|
| 0 | 1.190 | 1.192 | Roma | Gaius Julius Caesar, soldiers | Et gemitu permixta loqui: Quo tenditis ultra? ... |
| 1 | 1.195 | 1.203 | Gaius Julius Caesar | Roma | Mox ait: O magnae qui moenia prospicis urbis T... |
| 2 | 1.248 | 1.257 | people of Arminium | Fortuna, people of Arminium | O male vicinis haec moenia condita Gallis, O t... |
| 3 | 1.273 | 1.291 | Gaius Scribonius Curio | Gaius Julius Caesar | Conspexit: Dum voce tuae potuere iuvari, Caesa... |
| 4 | 1.299 | 1.351 | Gaius Julius Caesar | soldiers | Bellorum o socii, qui, mille pericula Martis M... |
| ... | ... | ... | ... | ... | ... |
| 118 | 9.1064 | 9.1104 | Gaius Julius Caesar | | Aufer ab adspectu nostro funesta, satelles, Re... |
| 119 | 10.85 | 10.103 | Cleopatra VII Philopator | Gaius Julius Caesar | Et sic orsa loqui: Si qua est, o maxime Caesar... |
| 120 | 10.176 | 10.192 | Gaius Julius Caesar | Acoreus | O sacris devote senex, quodque arguit aetas, N... |
| 121 | 10.194 | 10.331 | Acoreus | Gaius Julius Caesar | Fas mihi, magnorum, Caesar, secreta parentum P... |


#### Line-by-line

This is the best way I've come up with to parse the CTS passages into lines. **💁🏻‍♂️ Any suggestions here?**

In [None]:
xpath = '//{http://www.tei-c.org/ns/1.0}l'

for s in speeches:
    s.verse_array = [dict(
        n = l.get('n'), 
        text = ''.join(l.itertext()),
    ) for l in s.cts_passage.xml.getroottree().findall(xpath)]

In [None]:
pd.DataFrame(speeches[-1].verse_array)
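
One possible variation, in case embedded tags like `<note>` turn out to pollute the line text: collect the text recursively and skip unwanted elements while keeping the text that follows them. This is only a sketch on a made-up fragment using the standard library parser; `text_without` is a hypothetical helper, not part of the client library.

```python
import xml.etree.ElementTree as ET

TEI_NS = '{http://www.tei-c.org/ns/1.0}'

def text_without(elem, skip=('note',)):
    '''Collect the text of elem, leaving out any descendant whose
          local tag name is in skip, but keeping the text after it.
    '''
    parts = [elem.text or '']
    for child in elem:
        local_name = child.tag.split('}')[-1]
        if local_name not in skip:
            parts.append(text_without(child, skip))
        # tail text belongs to the enclosing line, so keep it either way
        parts.append(child.tail or '')
    return ''.join(parts)

# hypothetical fragment standing in for the XML of a CTS passage
sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<l n="1">Bella per Emathios<note>editorial note</note> plus quam civilia campos</l>'
    '<l n="2">Iusque datum sceleri canimus</l>'
    '</div>'
)

root = ET.fromstring(sample)
verse_array = [dict(n=l.get('n'), text=text_without(l))
               for l in root.iter(TEI_NS + 'l')]
print(verse_array)
```

Simply deleting the `<note>` element would also throw away its tail text (the rest of the verse), which is why the helper walks the children by hand.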

### NLP with CLTK

In [None]:
from cltk import NLP

#### Working with language-specific pipelines

This isn't necessary when we're just looking at Lucan, but I'm including it to show my more general workflow, in combination with the `.lang` attribute of DICES Speech objects.

In [None]:
cltk_nlp = dict(
    latin = NLP('lat', suppress_banner=True),
    greek = NLP('grc', suppress_banner=True),
)

# remove LatinLexiconProcess
#     - assumes it's the last process
cltk_nlp['latin'].pipeline.processes = cltk_nlp['latin'].pipeline.processes[:-1]
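
A slightly less fragile alternative to slicing off the last element would be to filter by class name. The sketch below uses stand-in classes so it runs on its own; `drop_process` is a hypothetical helper, and it assumes the entries in `pipeline.processes` are process classes (it also copes with instances).

```python
def drop_process(processes, name):
    '''Return a copy of processes without any entry whose class name is name.'''
    def class_name(p):
        # entries may be classes or instances; handle both
        return p.__name__ if isinstance(p, type) else type(p).__name__
    return [p for p in processes if class_name(p) != name]

# stand-in classes for illustration only (not the real CLTK classes)
class LatinStanzaProcess: pass
class LatinLexiconProcess: pass
class StopsProcess: pass

example = [LatinStanzaProcess, LatinLexiconProcess, StopsProcess]
kept = drop_process(example, 'LatinLexiconProcess')
print([p.__name__ for p in kept])
```

With the real pipeline this would be something like `cltk_nlp['latin'].pipeline.processes = drop_process(cltk_nlp['latin'].pipeline.processes, 'LatinLexiconProcess')`, which no longer depends on the lexicon step being last.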

#### Parsing the whole text of each speech

In [None]:
# this takes a long time
pbar = NotebookPBar(max=len(speeches))

for s in speeches:
    s.cltk_doc = cltk_nlp[s.lang](s.cts_passage.text)
    pbar.update()

💁🏻‍♂️ Questions:

 - ~Can I leave out of the pipeline whatever is retrieving all the dictionary entries?~ ✅
 - Can I make this any faster? ✅ (see the previous item)
 - Can I get the attributes `index_char_start` and `index_char_stop`?
 - ~Should I be breaking this up into sentences?~
   - Seems like
     - it gets broken into sentences anyway
     - it's slightly faster to do it all at once
 - Would it work on individual lines, even if they're not grammatically complete?
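
On the character-offset question: if the parser doesn't supply start and stop positions, they can be approximated by scanning the original string for each token in order. A minimal sketch, assuming the token strings appear verbatim and in sequence in the text (which may not hold where the tokenizer splits enclitics or normalizes spelling); `token_char_spans` is a hypothetical helper.

```python
def token_char_spans(text, tokens):
    '''Find each token in text, scanning left to right.

    Returns one (start, stop) pair per token, or None for a token
    that can't be found after the previous match.
    '''
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            spans.append(None)
            continue
        stop = start + len(tok)
        spans.append((start, stop))
        # never look behind a token we've already placed
        cursor = stop
    return spans

text = 'Quo tenditis ultra? Quo fertis mea signa, viri?'
tokens = ['Quo', 'tenditis', 'ultra', '?', 'Quo', 'fertis',
          'mea', 'signa', ',', 'viri', '?']
spans = token_char_spans(text, tokens)
print(spans)
```

Because the cursor only moves forward, the second `Quo` is matched to its second occurrence rather than the first.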


#### Identifying line of origin for each token in the cltk document

Yeah, this seems ugly. But it works, and I want to leave time for my fanfiction project this evening...

The function below should create a record for every token in the `cltk_doc.words` list, giving the index of its line within `verse_array`, the canonical line number, and its start and end positions within the string representation of the verse. I have a feeling that tags like `<note>` will have to be pruned for this to be really robust.

In [None]:
def get_word_loci(cltk_doc, verse_array):
    '''Look up each string in the verse array, return locs'''
    
    # we're going to look up each token string from the full speech
    #     in the verse-line array, crossing off each as we go
    
    # these are the strings to lookup
    tok_strings = [w.string for w in cltk_doc.words]
    
    # this is the current position in the verse array
    #     and string of the verse text
    verse_i = 0
    working_text = verse_array[verse_i]['text']
    
    # the last line that matched something, with its crossed-off text
    last_good_i = verse_i
    last_good_text = working_text
    
    # holds results
    token_loc = []

    for tok in tok_strings:
        while True:
        
            if working_text is None:
                working_text = ''
        
            # look for the word with boundaries
            regex = re.compile(r'\b' + re.escape(tok) + r'\b')
            m = regex.search(working_text)
            
            # then try without boundaries (punctuation, enclitics?)
            if m is None:
                regex = re.compile(re.escape(tok))
                m = regex.search(working_text)
                
                # still no? then try next line
                if m is None:
                    verse_i += 1
                
                    if verse_i < len(verse_array):
                        working_text = verse_array[verse_i]['text']
                        continue
                        
                    # if we're at the end of the verse array, give up
                    else:
                        token_loc.append(dict())
                        
                        # go back to the last line that matched something,
                        #     restoring its crossed-off text as well
                        verse_i = last_good_i
                        working_text = last_good_text
                        break

            # if we found a match, cross it off
            offset = m.start()
            working_text = regex.sub('🧀'*len(tok), working_text, count=1)
            
            # remember this line in case a later token can't be found
            last_good_i = verse_i
            last_good_text = working_text

            token_loc.append(dict(
                i = verse_i,
                start = offset,
                end = offset + len(tok),
                n = verse_array[verse_i]['n'],
            ))
            break
    
    return token_loc

Add this info to every speech, so I can use it later to correlate the parsed tokens in `cltk_doc.words` with the loci and text of `verse_array`.

In [None]:
for s in speeches:
    s.token_loc = get_word_loci(s.cltk_doc, s.verse_array)

Here's an example.

In [None]:
pd.DataFrame(speeches[-1].token_loc)
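
A possible alternative to searching line by line: join the verse texts into one string, remember where each line starts, and map any character offset back to its line with `bisect`. This is a sketch of a different technique, not what the function above does, and it assumes the parser is run on the joined string rather than on the passage's `text` attribute. The sample rows and helper names are made up.

```python
from bisect import bisect_right

def build_line_index(verse_array, sep='\n'):
    '''Join the verse texts and record the offset where each line starts.'''
    starts, pos = [], 0
    for row in verse_array:
        starts.append(pos)
        pos += len(row['text']) + len(sep)
    joined = sep.join(row['text'] for row in verse_array)
    return joined, starts

def locate(char_offset, starts, verse_array):
    '''Map a character offset in the joined text to a line record.'''
    # rightmost line whose start is <= char_offset
    i = bisect_right(starts, char_offset) - 1
    return dict(i=i, start=char_offset - starts[i], n=verse_array[i]['n'])

# made-up sample rows in the same shape as verse_array
verse_array = [
    dict(n='190', text='Quo tenditis ultra?'),
    dict(n='191', text='Quo fertis mea signa, viri?'),
]
joined, starts = build_line_index(verse_array)
loc = locate(joined.index('fertis'), starts, verse_array)
print(loc)
```

Combined with per-token character offsets in the joined string, this would give the same `i`, `start` and `n` fields without any regex searching.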

### Colour all the verbs

A test case. Can I pick out tokens based on a cltk feature like part of speech, and then use that info in a line-based treatment of the passage, such as displaying one line per row in an HTML table?

In [None]:
def html_table_coloured_verbs(s):
    '''Produce an HTML table of the speech text,
          with all verbs highlighted.
    '''
    
    rows = deepcopy(s.verse_array)
    
    # work through the tokens backwards, so when we modify
    #     the line strings, our start/end offsets aren't
    #     messed up by earlier words
    
    for tok_index in reversed(range(len(s.cltk_doc.words))):
        
        tok = s.cltk_doc.words[tok_index]
        loc = s.token_loc[tok_index]
        
        # tokens that get_word_loci couldn't place have an empty record
        if not loc:
            continue
        verse_index = loc['i']
        
        if tok.pos.name == 'verb' and tok.features['VerbForm'][0].name == 'finite':
            start = loc['start']
            end = loc['end']
            text = rows[verse_index]['text']
            
            rows[verse_index]['text'] = '{before}<span class="d-inline-block" data-bs-toggle="tooltip" title="{meta}"><span style="color:red">{middle}</span></span>{after}'.format(
                meta = f'Lemma: {tok.lemma}\n' + ', '.join(f'{k}:{v}' for k,v in tok.features.all()),
                before = text[:start],
                middle = text[start:end],
                after = text[end:],
            )
    
    rows = [f'<tr><td>{row["n"]}</td><td style="text-align:left">{row["text"]}</td></tr>' 
                for row in rows]
    
    return ('<table class="table table-border" style="width:80%">' +
             '<thead><tr><th style="width:5em">line</th><th style="text-align:left">text</th></tr></thead>'+ 
             '<tbody>' +
                ''.join(rows) +
             '</tbody>' +
            '</table>')
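
One thing to watch with the tooltip: the feature summary goes into a `title` attribute, so any quotation marks or ampersands in it would break the markup. A small sketch of escaping it with the standard library's `html.escape`; `wrap_token` and the sample tooltip are hypothetical. The text before and after the token is deliberately left alone, since the part after may already contain spans added for later tokens and changing the part before would shift the offsets of earlier ones.

```python
from html import escape

def wrap_token(text, start, end, tooltip, colour='red'):
    '''Wrap text[start:end] in a coloured span with an escaped tooltip.'''
    return (
        text[:start]
        + f'<span title="{escape(tooltip, quote=True)}" style="color:{colour}">'
        + escape(text[start:end])
        + '</span>'
        + text[end:]
    )

line = 'Quo tenditis ultra?'
# made-up tooltip containing characters that need escaping
marked = wrap_token(line, 4, 12, 'Lemma: tendo\nMood:"indicative" & more')
print(marked)
```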

In [None]:
for s in speeches:
    display(
        HTML(f'<h3>{s.work.title} {s.l_range}</h3>'),
        HTML(html_table_coloured_verbs(s))
    )

There are some odd mistakes in here: I've added tooltips above to help examine the details. Below I look at all the words in one speech.

In [None]:
# use full option names: short forms like 'max_rows' are ambiguous in newer pandas
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', 100):
    display(pd.DataFrame(dict(
        string = w.string,
        pos = w.pos.name,
        lemma = w.lemma,
        features = '; '.join(f'{k}:{v[0].name}' for k, v in w.features.all())
    ) for w in speeches[0].cltk_doc.words))
    

In [None]:
s

In [None]:
for w in s.cltk_doc:
    if w.pos.name == 'verb' and w.features['VerbForm'][0].name =='finite':
        if 'Mood' in w.features.features:
            print(type(w.features['Mood']))

In [None]:
print(type(w.features))
print(type(w.features.features))