# Part I

In the first part, we connect to the databases and collect and parse the speeches.


## `import` statements

This section imports the third-party packages the notebook depends on; none of them is part of base Python.

In [None]:
# code related to DICES
from dicesapi import DicesAPI
from dicesapi.text import CtsAPI
from dicesapi.jupyter import NotebookPBar

# science and graphing tools
import pandas as pd
from matplotlib import pyplot as plt

## Create connections to external data sources

This section instantiates two important "objects" and saves them to variables for later use. One, `api`, is a connection to the DICES database. We'll use this to download speech data. The other, `cts`, is a connection to the Perseus Digital Library. We'll use it to download the actual text of the speeches once we know their beginning and ending loci.

In [None]:
# connection to DICES
api = DicesAPI(
    logfile = 'dices.log',
    progress_class = NotebookPBar,
)

# connection to Perseus
cts = CtsAPI(
    dices_api = api,
)

## Download all the speeches

Here we download all the speeches from DICES using a single command. The resulting collection of data (we call it a `SpeechGroup`) is saved to a variable called `speeches`.

In [None]:
speeches = api.getSpeeches(progress=True)

## Select only the Latin speeches

For now, let's look just at the Latin speeches. We can select a subset of `speeches` by using the `advancedFilter` method. This command takes as its argument a simple function definition. That function is then run on every one of the speeches in the SpeechGroup: any speeches for which the function returns `True` are selected; those for which it returns `False` are left behind.

The function definition is created by the `lambda` keyword -- don't worry too much about the details, but basically the function we're creating here just returns `True` if the speech's `lang` tag is set to `'latin'` and `False` otherwise.

In [None]:
latin_speeches = speeches.advancedFilter(lambda s: s.lang == 'latin')
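
If the `lambda` syntax is unfamiliar, the following minimal sketch may help. It does not use DICES at all: `FakeSpeech` and `advanced_filter` are hypothetical stand-ins written for this illustration, and the real `advancedFilter` method may well be implemented differently. The point is only to show that a `lambda` is a compact way of writing a one-line function.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a DICES speech; only the `lang` tag matters here.
@dataclass
class FakeSpeech:
    id: int
    lang: str


def advanced_filter(items, predicate):
    """Keep only the items for which predicate(item) is true."""
    return [item for item in items if predicate(item)]


toy_speeches = [
    FakeSpeech(1, 'latin'),
    FakeSpeech(2, 'greek'),
    FakeSpeech(3, 'latin'),
]

# the lambda is shorthand for a one-line function definition
toy_latin = advanced_filter(toy_speeches, lambda s: s.lang == 'latin')


# the same test written as an ordinary named function
def is_latin(s):
    return s.lang == 'latin'


# both versions select exactly the same speeches
assert toy_latin == advanced_filter(toy_speeches, is_latin)

print([s.id for s in toy_latin])
```

Either form works as a filter; the `lambda` version simply saves us from having to name a function we only use once.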

### Sanity check: did the filter work?

How many speeches are there in total? How many are in the Latin subset?

In [None]:
print('total speeches:', len(speeches))
print('latin speeches:', len(latin_speeches))

## Download the text of the speeches from Perseus

In this section, we loop over all the speeches in `latin_speeches`. Our **loop variable**, here called `speech`, is set to each of the Latin speeches in turn as we repeatedly execute all the indented commands.

Within the loop, we attempt to download the text of the speech using `cts`, our connection to the Perseus Digital Library. Some of the speeches don't work. In some cases, whole texts aren't available from Perseus; in others, the textual edition used by DICES doesn't line up with the one used by Perseus.

In [None]:
# create a progress bar: this can take a while
pbar = NotebookPBar(max=len(latin_speeches))

for speech in latin_speeches:
    
    # advance the progress bar
    pbar.update()

    # if this speech has already been downloaded, skip it
    if hasattr(speech, 'passage') and (speech.passage is not None):
        continue
    
    # otherwise, try to download
    speech.passage = cts.getPassage(speech)
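
Because the loop skips any speech that already has a passage, the cell can be re-run safely: only the speeches that failed the first time are tried again. The sketch below illustrates that pattern without touching the network. `FakeSpeech` and `fake_get_passage` are hypothetical stand-ins invented for this example; the only behaviour borrowed from the real workflow is that a failed download leaves `None` in place of a passage.

```python
class FakeSpeech:
    """Hypothetical stand-in for a DICES speech."""

    def __init__(self, id):
        self.id = id


calls = []  # records which speech ids we actually tried to fetch


def fake_get_passage(speech):
    """Hypothetical stand-in for a download: returns None for odd ids."""
    calls.append(speech.id)
    if speech.id % 2 == 0:
        return f'text of speech {speech.id}'
    return None


def download_all(speeches):
    for speech in speeches:
        # getattr with a default covers both "no attribute yet"
        # and "attribute exists but is None"
        if getattr(speech, 'passage', None) is not None:
            continue
        speech.passage = fake_get_passage(speech)


toy = [FakeSpeech(i) for i in range(4)]

download_all(toy)  # first run: tries all four speeches
download_all(toy)  # second run: retries only the two that failed

print(calls)
```

The second call touches only speeches 1 and 3, so an interrupted or partly failed download costs nothing to resume.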

## Drop speeches for which text download failed

Here we weed out any speeches for which the previous step didn't work. The final line in the loop above attempts to download the text from Perseus as a CTS Passage object, and saves the result as a new attribute of the speech, here called `speech.passage`. If this step fails, then `speech.passage` will be `None` instead of a new Passage object.

In [None]:
selected_speeches = latin_speeches.advancedFilter(lambda s: s.passage is not None)

In [None]:
print('latin speeches:', len(latin_speeches))
print('selected:', len(selected_speeches))

## Parse the text of the speeches with spaCy

In this section, we parse all the speeches with the Natural Language Processing toolkit spaCy. For the Latin texts, we're using Patrick Burns' [LatinCy](https://huggingface.co/latincy), specifically the model `la_core_web_sm`.

In [None]:
# create a progress bar
pbar = NotebookPBar(max=len(selected_speeches))

for speech in selected_speeches:
    
    # update the progress bar
    pbar.update()
    
    # run spaCy
    speech.passage.runSpacyPipeline()

# Part II

Now that we've got the speeches parsed, let's explore the data a little. We'll start with a single speech, Juno's speech to Aeolus in *Aeneid* 1. I happen to know its speech ID is 1529.

In [None]:
speech = selected_speeches.filterIDs([1529])[0]
print(speech)

### Speech text

The plain text of the speech is stored as the `.passage.text` attribute.

In [None]:
print(speech.passage.text)

### spaCy document

After performing NLP, spaCy collects information about the text in an object called a "Document" (spaCy's `Doc` class), which is saved for us here as `.passage.spacy_doc`. One way we can use this document is as a container of tokens: looping over it gives us the tokens one at a time, in text order.

In [None]:
for token in speech.passage.spacy_doc:
    print(token)

### spaCy tokens

Each of these tokens carries a number of useful attributes:
- `lemma_`: the dictionary headword
- `pos_`: a universal part of speech tag
- `morph`: a collection of morphological attributes

Let's examine the first ten tokens more closely:

In [None]:
for token in speech.passage.spacy_doc[:10]:
    print(token.text, token.lemma_, token.pos_, token.morph, sep='\t')
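
Once we can read these attributes, simple summaries follow naturally. The sketch below counts part-of-speech tags and collects the lemmas of the non-punctuation tokens. To keep it runnable without downloading anything, it uses hand-written stand-in tokens (`Tok` and `toy_doc` are hypothetical, and their values are not LatinCy output); with real data you would loop over `speech.passage.spacy_doc` instead.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy tokens: hand-written values, not model output.
Tok = namedtuple('Tok', ['text', 'lemma_', 'pos_'])

toy_doc = [
    Tok('arma', 'arma', 'NOUN'),
    Tok('virum', 'vir', 'NOUN'),
    Tok('cano', 'cano', 'VERB'),
    Tok('.', '.', 'PUNCT'),
]

# how many tokens carry each part-of-speech tag?
pos_counts = Counter(tok.pos_ for tok in toy_doc)

# dictionary headwords, leaving out punctuation
lemmas = [tok.lemma_ for tok in toy_doc if tok.pos_ != 'PUNCT']

print(pos_counts.most_common())
print(lemmas)
```

Counting by `pos_` or `lemma_` rather than by `text` means that inflected forms of the same word are grouped together, which matters a great deal for a highly inflected language like Latin.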