### Install beta version of DICES client library

```
pip install git+https://github.com/cwf2/dices-client.git
```

### Import statements

In [None]:
from dicesapi import DicesAPI
from dicesapi.jupyter import NotebookPBar
import pandas as pd

### Set up connection to DICES database

In [None]:
dices = DicesAPI(
    dices_api = 'http://csa20211203-005.uni-rostock.de/api',
    cts_api = 'https://scaife-cts.perseus.org/api/cts',
    progress_class = NotebookPBar,
    logfile='dices.log',
)

### Download Lucan's speeches

In [None]:
speeches = dices.getSpeeches(author_name='Lucan', progress=True)

A super quick look at what we get:

In [None]:
pd.DataFrame([dict(
    urn = s.urn,
    first_line = s.l_fi,
    last_line = s.l_la,
    speaker = ', '.join([inst.name for inst in s.spkr]),
    addressee = ', '.join([inst.name for inst in s.addr]),
) for s in speeches])

### Getting the text

The speech records in DICES only have metadata; to get the text, we use CTS to request each passage from Perseus. I'm going to tack the passages onto the existing speech objects.

One limitation of DICES right now: **the line is the finest granularity we have for beginnings and endings.** So for speeches that start or end partway through a line, we pick up *verba dicendi* and other extra material. For a lot of our Greek texts this isn't an issue, and for some other authors there are quotation marks or `<q>` tags in the XML that let us find the edges of the speech, but not for Lucan.
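
For authors whose editions do mark speeches with quotation marks, trimming a line-bounded passage down to the quoted span could look roughly like this. It is a minimal sketch: `trim_to_quotes` is a hypothetical helper (not part of `dicesapi`), the sample string is invented, and it assumes each passage holds a single speech wrapped in one outer pair of double quotation marks.

```python
# Hypothetical helper, not part of dicesapi.
# Assumes one outer pair of double quotation marks (straight, curly, or guillemets).
OPENERS = '"“«'
CLOSERS = '"”»'

def trim_to_quotes(text):
    """Return the text between the first opening and last closing quotation mark.

    If no such pair is found, the text is returned unchanged.
    """
    start = next((i for i, ch in enumerate(text) if ch in OPENERS), None)
    end = next((i for i in range(len(text) - 1, -1, -1) if text[i] in CLOSERS), None)
    if start is None or end is None or end <= start:
        return text
    return text[start + 1:end].strip()

# Invented sample: the speech starts and ends partway through a line.
sample = 'Then he said: "first words of the speech\nlast words of the speech," and left.'
print(trim_to_quotes(sample))
```

Falling back to the untrimmed text keeps the behaviour unchanged for texts like Lucan's, where there is nothing to anchor on.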

In [None]:
# takes long enough that I like a progress bar
pbar = NotebookPBar(max=len(speeches))

for s in speeches:
    try:
        s.cts_passage = s.getCTS()
    except Exception as e:
        # a bare `except:` would also swallow KeyboardInterrupt
        print('Failed to get', s, '-', e)
        s.cts_passage = None
    pbar.update()

#### Whole speeches

The simplest way to get the text is the `text` attribute of the CTS passages.

In [None]:
pd.DataFrame([dict(
    first_line = s.l_fi,
    last_line = s.l_la,
    speaker = ', '.join([inst.name for inst in s.spkr]),
    addressee = ', '.join([inst.name for inst in s.addr]),
    # cts_passage is None when the request above failed
    text = s.cts_passage.text if s.cts_passage is not None else None,
) for s in speeches])

#### Line-by-line

This is the best way I've come up with to parse the CTS passages into lines. **💁🏻‍♂️ Any suggestions here?**

In [None]:
xpath = '//{http://www.tei-c.org/ns/1.0}l'

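for s in speeches:
    # skip speeches whose CTS request failed above
    if s.cts_passage is None:
        s.verse_array = []
        continue
    s.verse_array = [dict(
        n = l.get('n'),
        text = l.text,
    ) for l in s.cts_passage.xml.getroottree().findall(xpath)]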

In [None]:
pd.DataFrame(speeches[2].verse_array)
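
One possible refinement of the line parsing: `l.text` only returns the text before the first child element, so a line containing inline markup would come back truncated. The sketch below uses `itertext()` instead. It runs on a made-up TEI fragment with the standard library's `xml.etree.ElementTree` so that it is self-contained; I'm assuming the same `iter()` and `itertext()` calls would apply to the elements in `s.cts_passage.xml`, which I haven't verified here.

```python
import xml.etree.ElementTree as ET

TEI_NS = 'http://www.tei-c.org/ns/1.0'

# Toy TEI fragment, made up for illustration (not real Perseus output).
# The second line has a child element, which is where `.text` falls short.
sample_xml = f'''
<TEI xmlns="{TEI_NS}">
  <text><body><div>
    <l n="1">plain line of verse</l>
    <l n="2">first half <gap/>second half</l>
  </div></body></text>
</TEI>
'''

root = ET.fromstring(sample_xml)

verse_array = [
    dict(
        n=l.get('n'),
        # itertext() yields the text of the line and of everything nested in it;
        # split/join then normalises the whitespace
        text=' '.join(''.join(l.itertext()).split()),
    )
    for l in root.iter(f'{{{TEI_NS}}}l')
]
print(verse_array)
```

With `l.text` alone, line 2 would come back as `'first half '` and the rest of the verse would be lost.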

### NLP with CLTK

In [None]:
from cltk import NLP

#### Working with language-specific pipelines

This isn't necessary when we're just looking at Lucan, but I'm including it to show my more general workflow, in combination with the `.lang` attribute of DICES Speech objects.

In [None]:
cltk_nlp = dict(
    latin = NLP('lat'),
    greek = NLP('grc'),
)
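
A variation on the dictionary above, in case loading every pipeline up front is wasteful: build each pipeline only when a speech in that language first turns up. This is a sketch under assumptions. `make_pipeline` is a stand-in so the example runs without CLTK installed (in the notebook it would return `NLP(code)`), and the keys `'latin'` and `'greek'` are assumed to match the values of `.lang`.

```python
from functools import lru_cache

# Map the language names used as dictionary keys to CLTK language codes.
LANG_CODES = {'latin': 'lat', 'greek': 'grc'}

def make_pipeline(code):
    # Stand-in for `NLP(code)` so the sketch runs without CLTK.
    return f'<pipeline for {code}>'

@lru_cache(maxsize=None)
def get_pipeline(lang):
    """Build the pipeline for `lang` on first use, then reuse the cached one."""
    return make_pipeline(LANG_CODES[lang])

print(get_pipeline('latin'))
# The second call returns the cached object rather than building a new one.
print(get_pipeline('latin') is get_pipeline('latin'))
```

In the parsing loop, `get_pipeline(s.lang)` would then take the place of `cltk_nlp[s.lang]`.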

#### Parsing the whole text of each speech

In [None]:
# this takes a long time and I've never actually run it all the way through...
pbar = NotebookPBar(max=len(speeches))

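for s in speeches:
    # skip the parse for speeches whose CTS request failed
    text = s.cts_passage.text if s.cts_passage is not None else None
    s.cltk_doc = cltk_nlp[s.lang](text) if text is not None else None
    pbar.update()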

💁🏻‍♂️ Questions:

 - Can I leave out whatever pipeline step is retrieving all the dictionary entries?
 - Can I make this any faster?
 - I notice that the words have placeholder attributes for the start and end positions in the string. Can I turn these on?
 - Should I be breaking this up into sentences?
 - Would it work on individual lines, even if they're not grammatically complete?
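
A tentative sketch for the first question, assuming the `NLP` object exposes its steps as a mutable list at `pipeline.processes` (my understanding of CLTK 1.x, to be checked against the installed version). `drop_processes` is a hypothetical helper, and the stand-in classes below exist only so the example runs without CLTK.

```python
from types import SimpleNamespace

def drop_processes(nlp, *names):
    """Remove processes whose class name is in `names`; return what was removed."""
    kept, removed = [], []
    for proc in nlp.pipeline.processes:
        # Entries may be classes or instances, so handle both.
        name = proc.__name__ if isinstance(proc, type) else type(proc).__name__
        (removed if name in names else kept).append(proc)
    nlp.pipeline.processes = kept
    return removed

# Stand-in classes and object, invented for this example.
class TokenizeProcess:
    pass

class LexiconProcess:
    pass

fake_nlp = SimpleNamespace(
    pipeline=SimpleNamespace(processes=[TokenizeProcess, LexiconProcess])
)
removed = drop_processes(fake_nlp, 'LexiconProcess')
print([p.__name__ for p in fake_nlp.pipeline.processes])  # → ['TokenizeProcess']
```

If the assumption holds, something like `drop_processes(cltk_nlp['latin'], 'LatinLexiconProcess')` before the parsing loop would skip the dictionary lookups; that class name is my guess at the step responsible, so it is worth printing `pipeline.processes` first to see what is actually there.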


#### Breaking into sentences

In [None]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer
splitter = LatinPunktSentenceTokenizer()

for s in speeches:
    # no sentences for speeches whose CTS request failed
    s.sentences = splitter.tokenize(s.cts_passage.text) if s.cts_passage is not None else []