# Part I

In the first part, we connect to the databases and collect and parse the speeches.


## `import` statements

This section loads ancillary code that isn't part of base Python.

In [None]:
# code related to DICES
from dicesapi import DicesAPI
from dicesapi.text import CtsAPI
from dicesapi.jupyter import NotebookPBar

# science and graphing tools
import pandas as pd
from matplotlib import pyplot as plt

## Create connections to external data sources

This section instantiates two important "objects" and saves them to variables for later use. One, `api`, is a connection to the DICES database. We'll use this to download speech data. The other, `cts`, is a connection to the Perseus Digital Library. It will be used to download the actual text of the speeches once we know their beginning and ending loci.

In [None]:
# connection to DICES
api = DicesAPI(
    logfile = 'dices.log',
    progress_class = NotebookPBar,
)

# connection to Perseus
cts = CtsAPI(
    dices_api = api,
)

## Download some speeches

Here we download all the speeches by Achilles from DICES. The resulting collection of data (we call it a `SpeechGroup`) is saved to a variable called `speeches`.

In [None]:
speeches = api.getSpeeches(spkr_name="Achilles")

### 🤔 Sanity check

#### How many speeches did we get?

The command `len()` tells us the **length** of the collection, i.e., how many speeches it contains.

In [None]:
print(len(speeches))

#### List the speeches in tabular format

In [None]:
table = pd.DataFrame(dict(
        speech_id = speech.id,
        language = speech.lang,
        author = speech.author.name,
        work = speech.work.title,
        loci = speech.l_range,
) for speech in speeches)

display(table)

#### Summarize by language

In [None]:
table.groupby('language').agg(count=('speech_id', 'count'))

#### Summarize by author and work

In [None]:
table.groupby(['author', 'work']).agg(count=('speech_id', 'count'))
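
To make explicit what these `groupby` summaries compute, here is a minimal sketch in plain Python. It assumes a handful of made-up rows rather than real DICES records (the author and work names are placeholders) and counts rows per group with `collections.Counter`, which is comparable to counting `speech_id` per group when no ids are missing.

```python
from collections import Counter

# Made-up rows standing in for the real table (placeholder values only).
rows = [
    {"speech_id": 1, "language": "greek", "author": "Author A", "work": "Work X"},
    {"speech_id": 2, "language": "greek", "author": "Author A", "work": "Work X"},
    {"speech_id": 3, "language": "latin", "author": "Author B", "work": "Work Y"},
]

# Count rows per language, comparable to grouping by 'language'.
by_language = Counter(row["language"] for row in rows)

# Count rows per (author, work) pair, comparable to grouping by both columns.
by_author_work = Counter((row["author"], row["work"]) for row in rows)

for key, count in sorted(by_author_work.items()):
    print(key, count)
```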

## Download the text of the speeches from Perseus

In this section, we loop over all the speeches in the SpeechGroup. Our **loop variable**, here called `speech`, is set to each of the speeches in turn as we repeatedly execute all the indented commands.

Within the loop, we attempt to download the text of the speech using `cts`, our connection to the Perseus Digital Library. Some of the speeches don't work. In some cases, whole texts aren't available from Perseus; in others, the textual editions used by DICES and Perseus are misaligned.

In [None]:
# create a progress bar: this can take a while
pbar = NotebookPBar(max=len(speeches))

for speech in speeches:
    
    # advance the progress bar
    pbar.update()

    # if this speech has already been downloaded, skip it
    if hasattr(speech, 'passage') and (speech.passage is not None):
        continue
    
    # otherwise, try to download
    speech.passage = cts.getPassage(speech)
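
The skip-if-already-downloaded check makes the loop safe to re-run after an interruption. Below is a minimal sketch of that pattern. It assumes a hypothetical `fake_fetch` function in place of `cts.getPassage` and simple stand-in objects in place of real speeches; nothing here touches DICES or Perseus.

```python
from types import SimpleNamespace

# Stand-in speeches; the ids are made up for illustration.
demo_speeches = [SimpleNamespace(id=i) for i in range(4)]

calls = []  # records which speeches were actually fetched


def fake_fetch(speech):
    """Hypothetical stand-in for a download: returns None for odd ids."""
    calls.append(speech.id)
    return None if speech.id % 2 else f"text of speech {speech.id}"


def download_all(speeches):
    for speech in speeches:
        # skip anything that already has a passage from an earlier run
        if getattr(speech, "passage", None) is not None:
            continue
        speech.passage = fake_fetch(speech)


download_all(demo_speeches)  # first run tries every speech
download_all(demo_speeches)  # second run retries only the failures
print(calls)
```

Because a failed download is stored as `None`, the failed speeches are attempted again on every re-run, while the successful ones are left alone.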

## Drop speeches for which text download failed

Here we weed out any speeches for which the previous step didn't work. The final line in the loop above attempts to download the text from Perseus as a CTS Passage object, and saves the result as a new attribute of the speech, here called `speech.passage`. If this step fails, then `speech.passage` will be `None` instead of a new Passage object.

#### Which ones failed?

In [None]:
print('Failed:')
for s in speeches:
    if s.passage is None:
        print('  ', s)

#### Keep only speeches with a `passage` attribute

In [None]:
selected_speeches = speeches.advancedFilter(lambda s: s.passage is not None)
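
The filter keeps every speech for which the supplied function returns `True`. As a rough sketch of the same idea with an ordinary Python list (this does not reproduce `advancedFilter` itself, and the stand-in objects are made up):

```python
from types import SimpleNamespace

# Stand-in speeches: two with text, one whose download failed.
demo = [
    SimpleNamespace(id=1, passage="some text"),
    SimpleNamespace(id=2, passage=None),
    SimpleNamespace(id=3, passage="more text"),
]


def has_passage(s):
    """The predicate: True for speeches worth keeping."""
    return s.passage is not None


kept = [s for s in demo if has_passage(s)]
failed = [s for s in demo if not has_passage(s)]

print([s.id for s in kept])
print([s.id for s in failed])
```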

## Parse the text of the speeches with SpaCy

In this section, we parse all the speeches with the Natural Language Processing toolkit SpaCy. For the Latin texts, we're using Patrick Burns' [LatinCy](https://huggingface.co/latincy), specifically the model `la_core_web_sm`.

In [None]:
# create a progress bar
pbar = NotebookPBar(max=len(selected_speeches))

for speech in selected_speeches:
    
    # update the progress bar
    pbar.update()
    
    # run SpaCy
    speech.passage.runSpacyPipeline()

# Part II

## Examining SpaCy output

We've successfully run NLP on all the speeches that we could get. What do the results look like? Let's inspect the first speech a little more closely.

In [None]:
speech = selected_speeches[0]
speech

### The speech passage

The DICES client collects information about the text of the speech in a special object saved as the `passage` attribute. The passage object has a couple of important attributes of its own, the most basic of which is just `text`, the plain text of the speech.

Let's look at the text of Iliad 1.59-1.67.

In [None]:
print(speech.passage.text)

### SpaCy document

After performing NLP, SpaCy collects information about the text in an object called a "Document", which is saved for us here as `.passage.spacy_doc`. One way we can use this document is as a container of **tokens**. A token is a unit of parsed language, most often a word.

### Print the first 10 tokens

In this simple `for` loop, we iterate over the first 10 tokens in Il. 1.59 *ff*. and just print them to the screen.

In [None]:
for token in speech.passage.spacy_doc[:10]:
    print(token)

### SpaCy tokens

Each of these tokens carries a number of useful attributes:
- `lemma_`: the dictionary headword
- `pos_`: a universal part of speech tag
- `morph`: a collection of morphological attributes

Let's examine the first ten tokens more closely:

In [None]:
for token in speech.passage.spacy_doc[:10]:
    print(token.text, token.lemma_, token.pos_, token.morph, sep='\t')

### All the tokens in a speech, tabular format

Here we use Pandas to build a nice table with one row per token, putting the token attributes into individual columns. Using a generator expression (a close cousin of the list comprehension), we iterate over all the tokens in the speech.

In [None]:
pd.DataFrame(dict(
    form = token.text,
    lemma = token.lemma_,
    upos = token.pos_,
    features = token.morph,
) for token in speech.passage.spacy_doc)

### All the tokens in all the speeches, tabular format

To create one table containing all tokens from **all** speeches, we need a double comprehension: one loop variable, `speech`, stands for the current speech, and another, `token`, iterates over all the tokens in `speech`.

In [None]:
token_table = pd.DataFrame(dict(
    speech_id = speech.id,
    author = speech.author.name,
    work = speech.work.title,
    spkr = [inst.name for inst in speech.spkr],
    addr = [inst.name for inst in speech.addr],
    token = token.text,
    lemma = token.lemma_,
    pos = token.pos_,
    features = token.morph,
) for speech in selected_speeches for token in speech.passage.spacy_doc)

display(token_table)
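
The order of the two `for` clauses matters: the first is the outer loop and the second is the inner loop. A minimal sketch with made-up data (plain dictionaries and placeholder strings instead of real speeches and tokens):

```python
# Made-up speeches, each with a short list of placeholder tokens.
demo_speeches = [
    {"id": "A", "tokens": ["alpha", "beta"]},
    {"id": "B", "tokens": ["gamma"]},
]

# One row per (speech, token) pair: outer loop first, inner loop second.
rows = [
    dict(speech_id=speech["id"], token=token)
    for speech in demo_speeches
    for token in speech["tokens"]
]

for row in rows:
    print(row)
```

Each speech-level value (here `speech_id`) is repeated on every row that belongs to that speech, which is what lets us group or filter tokens by speech later.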

### Morphological features

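SpaCy stores a bundle of morphological information in the `morph` attribute. If the cells were run in order, `token` still refers to the last token printed by the `for` loop above.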

In [None]:
print(token)
print(token.morph)

We can tease out the details a bit further and extract features we're interested in by treating `morph` as a dictionary. Instead of just printing out e.g. `Case=Acc|Gender=Masc|Number=Sing`, we can query by keyword to get one feature at a time.

The `morph.to_dict()` method turns the bundle of features represented by `morph` into a Python dictionary. We can use the dictionary method `get()` to extract one feature by name. 

In [None]:
print(token.morph.to_dict().get('Case'))
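
To see what treating the features as a dictionary buys us, here is a sketch in plain Python that does not use SpaCy at all. The helper `parse_features` is hypothetical; it splits a feature string of the `Key=Value|Key=Value` form shown above into a dictionary, which is roughly the shape of data that `to_dict()` hands back.

```python
def parse_features(feature_string):
    """Hypothetical helper: split 'Key=Value|Key=Value' into a dict."""
    if not feature_string:
        return {}
    return dict(pair.split("=", 1) for pair in feature_string.split("|"))


features = parse_features("Case=Acc|Gender=Masc|Number=Sing")

print(features.get("Case"))  # a feature that is present
print(features.get("Mood"))  # a feature that is absent
```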

### Extracting all features of interest

One of the nice things about `get()` is that if the named feature does not exist in the dictionary, we get the useful answer `None` instead of an error. So we can go ahead and ask for everything we're interested in, e.g. Mood, Case, Voice, with every single token. If a given feature isn't applicable, `get()` will return `None` and Python will move on.

In [None]:
token_table = pd.DataFrame(dict(
    speech_id = speech.id,
    author = speech.author.name,
    work = speech.work.title,
    spkr = [inst.name for inst in speech.spkr],
    addr = [inst.name for inst in speech.addr],
    token = token.text,
    lemma = token.lemma_,
    pos = token.pos_,
    mood = token.morph.to_dict().get('Mood'),
    voice = token.morph.to_dict().get('Voice'),
    person = token.morph.to_dict().get('Person'),
    number = token.morph.to_dict().get('Number'),
    gender = token.morph.to_dict().get('Gender'),    
    case = token.morph.to_dict().get('Case'),
) for speech in selected_speeches for token in speech.passage.spacy_doc)

display(token_table)
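
The cell above calls `to_dict()` six times per token. One possible refactoring, sketched here with a plain dictionary standing in for the result of `token.morph.to_dict()`, converts once and then picks out the wanted features. The helper name `pick_features` is an assumption for illustration, not part of SpaCy or the DICES client.

```python
WANTED = ("Mood", "Voice", "Person", "Number", "Gender", "Case")


def pick_features(feature_dict, wanted=WANTED):
    """Map lower-cased feature names to their values, or None if absent."""
    return {name.lower(): feature_dict.get(name) for name in wanted}


# Stand-in for token.morph.to_dict() on a noun-like token.
noun_features = {"Case": "Acc", "Gender": "Masc", "Number": "Sing"}

row = dict(token="example", **pick_features(noun_features))
print(row)
```

In the real table, the expanded dictionary would sit alongside the speech-level columns in the same `dict(...)` call.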

### Export to Excel

Let's pause here for a moment to export the full dataset to a CSV file, so that we can look at it in Excel if we want to.

In [None]:
token_table.to_csv('achilles_tokens.csv', index=False)
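
One caveat when the CSV contains Greek: Excel may guess the wrong encoding for a UTF-8 file that lacks a byte-order mark. Writing with the `utf-8-sig` encoding adds that mark, and pandas' `to_csv` accepts it through its `encoding` argument. The sketch below shows the effect using only the standard library and a throwaway temporary file; the file name and the sample row are illustrative.

```python
import csv
import os
import tempfile

rows = [{"token": "μῆνιν", "lemma": "μῆνις"}]  # illustrative sample row

path = os.path.join(tempfile.mkdtemp(), "demo_tokens.csv")

# 'utf-8-sig' writes a byte-order mark at the start of the file
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["token", "lemma"])
    writer.writeheader()
    writer.writerows(rows)

# inspect the raw bytes to confirm the mark is there
with open(path, "rb") as f:
    first_bytes = f.read(3)

# reading back with the same encoding strips the mark again
with open(path, newline="", encoding="utf-8-sig") as f:
    round_trip = list(csv.DictReader(f))

print(first_bytes == b"\xef\xbb\xbf", round_trip)
```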