# Working with text

In the second part of the workshop, we go beyond the speech metadata recorded by DICES. We're going to download the text of the speeches from the Perseus Digital Library and then parse the Latin and Greek with the natural language processing framework spaCy.

The DICES client includes basic functionality for working with speech text, including convenience functions that wrap third-party tools for interacting with Perseus and spaCy. If you want to do more complex tasks, or work with Greek and Latin passages that aren't speeches, you can use these tools directly.

## Preliminaries

### Install DICES client software

Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.

In [None]:
# revert pip for language model compatibility
!pip install -q --force-reinstall pip==22.0.0
# install DICES client
!pip install -q git+https://github.com/cwf2/dices-client
# install language models
!pip install -q https://huggingface.co/latincy/la_core_web_md/resolve/main/la_core_web_md-any-py3-none-any.whl

### Import statements

Here we tell Python which ancillary packages we want to make accessible: in this case, the DICES client, Pandas, and IPython's HTML display helper.

In [None]:
# for talking to DICES
from dicesapi import DicesAPI
# for talking to Perseus
from dicesapi.text import CtsAPI

# for displaying HTML
from IPython.display import HTML
# for creating data tables
import pandas as pd

### Connect to external resources
Here we instantiate a connection to the DICES database as in our earlier examples. We also create a second connection, this time to an external digital library. By default, `CtsAPI` uses the Perseus Digital Library's [CTS endpoint](https://scaife-cts.perseus.org/api/cts?request=GetCapabilities). 

Not all of the texts in Perseus are currently accessible via this interface, but its capabilities continue to improve. For now, we need to check whether a given speech has downloaded successfully before we can use its text in our scripts.

In [None]:
# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)

# create a new CTS connection
cts = CtsAPI(dices_api=api)

## Download some speeches

In these examples, we're going to look at a subset of speeches. Downloading and parsing the text of every speech in DICES takes about half an hour and exceeds the computational limits of Colab's free tier.

For this example, we will try to compare two speeches by female characters who are bidding a male hero farewell: Calypso to Odysseus in *Odyssey* 5 and Dido to Aeneas in *Aeneid* 4.

### Speech metadata from DICES

The DICES client's ability to communicate very specific or complex criteria to the database server is somewhat limited. Two strategies can help:
1. Make multiple requests. You can add together the results of multiple `getSpeeches()` calls with the `+` operator. At the moment, this can shuffle the speeches together, so you might want to sort them afterwards.
2. Make a less specific request and then filter the results. The DICES client is better at sorting and filtering speeches once you have them in Python.
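
The second strategy is not demonstrated elsewhere in this notebook. As a rough sketch of the idea only, the example below uses plain Python dictionaries as hypothetical stand-ins for speech records. The field names and values are placeholders, not the DICES client's actual interface or data.

```python
# Hypothetical stand-ins for speech records; real DICES speech objects
# have a different interface. The values are placeholders, not DICES data.
records = [
    {"id": 3, "work": "Work B", "spkr_gender": "female"},
    {"id": 1, "work": "Work A", "spkr_gender": "male"},
    {"id": 2, "work": "Work A", "spkr_gender": "female"},
]

# strategy 2: start from a broad result set, then narrow it down in Python
by_women = [rec for rec in records if rec["spkr_gender"] == "female"]

# restore a predictable order after combining or filtering
by_women.sort(key=lambda rec: rec["id"])

print([rec["id"] for rec in by_women])
```

The same pattern applies whenever the server can't express a criterion directly: request more than you need, then keep only the records that pass a test written in Python.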

In [None]:
# download metadata on speeches by women in the Odyssey and the Aeneid
speeches = (
    api.getSpeeches(work_title="Odyssey", spkr_gender="female") +
    api.getSpeeches(work_title="Aeneid", spkr_gender="female")
).sorted()

# check that it worked
print(len(speeches), "speeches.")

### Quick overview of the results

Let's just take a quick look to see what we got. It's nice to identify potentially interesting or erroneous data before we get too far into our experiment.

In [None]:
# make a table of speeches
table = pd.DataFrame(dict(
    speech_id = speech.id,
    author = speech.author.name,
    work = speech.work.title,
    locus = speech.l_range,
    speaker = speech.getSpkrString(),
    addressee = speech.getAddrString(),
) for speech in speeches)

# show table
display(table)

### Summary statistics

#### Who are the most loquacious women in our data?

In [None]:
(
    table                                         # original data
    .groupby("speaker")                           # factor to group by
    .aggregate(                                   # create a summary table
        speeches = ("speech_id", "count"))        #   - count of speech_id
    .sort_values("speeches", ascending=False)     # sort the results from highest to lowest
    .head(10)                                     # take the top 10
)

#### Who do Dido and Calypso talk to?

In [None]:
(
    table                                         # original data
    .query("speaker in ['Calypso', 'Dido']")      # filter on speaker
    .groupby(["speaker", "addressee"])            # group by speaker-addressee pairs
    .aggregate(                                   # create a summary table
        speeches = ("speech_id", "count"))        #   - count of speech_id
    .sort_values(["speaker", "speeches"],         # sort by speaker, then from
                 ascending=[True, False])         #   highest to lowest count
)

### Isolate speeches of interest

The easiest way to identify the two speeches we're interested in is to inspect individual records and note the speech IDs.

Here's one way to find the Calypso speech:

In [None]:
# display an anonymous view on the table of speech data
(
    table 
    .query("speaker=='Calypso'")           # speeches by Calypso
    .query("addressee=='Odysseus'")        #   ... to Odysseus
)

The speech we want is the eleven-line speech at 5.203-213, with ID 820.

Let's look for Dido's speech next:

In [None]:
# display an anonymous view on the table of speech data
(
    table
    .query("speaker=='Dido'")        # speeches by Dido
    .query("addressee=='Aeneas'")    #    ... to Aeneas
)

The correct Dido speech has ID 1610. 

Knowing the IDs of the two speeches, we can create a new list of just these two speeches by filtering the larger set of results on ID. The DICES client has a collection of filtering tools: in this case, the `filterIDs()` method is what we want. It takes a list of IDs to keep and produces a new speech group.

In [None]:
# speech IDs to keep
ids = [820, 1610]

# filter the list of speeches by ID
farewells = speeches.filterIDs(ids)

# confirm that it worked
for speech in farewells:
    print(speech)

### Download the text from Perseus
This is the part where we download the text from the remote digital library. If you're interested, you can compare the Calypso speech in Perseus' [human interface](https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:5.203-5.213/) versus the [machine interface](https://scaife-cts.perseus.org/api/cts?urn=urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:5.203-5.213&request=GetPassage).

The `getPassage()` method takes a DICES speech as its argument. It generates an appropriate call to the machine interface and processes the XML response. In the code below, we save the results back to the speech object as a new attribute called `passage`.
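
To make the "appropriate call" a little more concrete, here is a minimal sketch of how a machine-interface URL like the one linked above could be assembled from a work URN and a line range. The helper `build_passage_url()` is hypothetical and is not how `getPassage()` is actually implemented; only the endpoint and URN are taken from the links above.

```python
from urllib.parse import urlencode

# endpoint copied from the Perseus machine-interface link above
CTS_ENDPOINT = "https://scaife-cts.perseus.org/api/cts"

def build_passage_url(work_urn, first_line, last_line):
    """Hypothetical helper: build a CTS GetPassage URL for a line range."""
    # a passage URN is the work URN plus a "first-last" reference
    passage_urn = f"{work_urn}:{first_line}-{last_line}"
    # keep ":" unescaped so the URN stays readable in the query string
    query = urlencode({"urn": passage_urn, "request": "GetPassage"}, safe=":")
    return f"{CTS_ENDPOINT}?{query}"

url = build_passage_url(
    "urn:cts:greekLit:tlg0012.tlg002.perseus-grc2", "5.203", "5.213"
)
print(url)
```

The result should match the machine-interface link for the Calypso speech given above.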

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # download the text from perseus
    speech.passage = cts.getPassage(speech)

    # display the results
    print(speech, "\n")
    print(speech.passage.text, "\n")

#### 💁🏻‍♂️ a note about DICES text passages

The text returned by Perseus has line numbers embedded in it. The DICES client tries to keep track of them in an attribute of the passage called `line_array`. You can see it here:

In [None]:
farewells[0].passage.line_array

`line_array` is a *list* of *dictionaries*. Each element of the list represents one verse line. Each dictionary has two keys, `n` and `text`, whose values are the line number and verse text respectively.

Meanwhile, the passage attribute called `text` joins the text together in a single string. Depending on what you want to do, you might want to iterate over the individual lines using `line_array` or work with the complete speech using `text`.
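
As a small illustration of the two approaches, the sketch below uses a hand-made list with the same shape as `line_array`. The line numbers and text are placeholders rather than real Perseus output, and joining with a single space is an assumption about how `text` is built.

```python
# Hypothetical line_array: same shape as described above, but with
# placeholder line numbers and text rather than real Perseus output.
line_array = [
    {"n": "1", "text": "words of the first line"},
    {"n": "2", "text": "words of the second line"},
    {"n": "3", "text": "words of the third line"},
]

# work line by line, e.g. to count whitespace-separated words per verse
words_per_line = {line["n"]: len(line["text"].split()) for line in line_array}

# or join everything into one string
# (joining with a space is an assumption about how `text` is built)
full_text = " ".join(line["text"] for line in line_array)

print(words_per_line)
print(full_text)
```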

*These features are relatively undeveloped so far. If you're interested in using them for something, please don't hesitate to get in touch and let me know how I can make them work more intuitively.*

### Parse the text using natural language processing

In this step, we process the downloaded text of the speeches using a natural language processing (NLP) package for Python called [spaCy](https://spacy.io/). spaCy is a versatile NLP framework that can perform tasks including:
 - tokenization (breaking the text into words)
 - lemmatization (identifying dictionary headword for each inflected form)
 - part of speech (POS) tagging (noun, verb, adjective, etc.)
 - morphological tagging (mood, voice, case, gender, etc.)
 - dependency tagging (syntactic structure)

spaCy depends on third-party language models that must be trained for each of these tasks. There are a couple of trained models for Ancient Greek and Latin, and this is currently an area of rapid change. We're going to use [ΟδύCy](https://centre-for-humanities-computing.github.io/odyCy/getting_started.html) for Greek and [LatinCy](https://huggingface.co/latincy) for Latin. Each of these projects has multiple models available, with tradeoffs between model size, processing requirements, and accuracy; we're using their "small" and "medium" models, respectively.

**Note:** these models are a form of AI. Unlike older "stemmers" that work by a set of fixed rules (e.g. cutting off -us or -ος to find noun stems), they may produce different results for the same form depending on context. That said, they can also *make things up* just like other forms of AI.
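
To show what "a set of fixed rules" means in practice, here is a toy suffix-stripping stemmer. It is purely illustrative: it is not part of spaCy, the language models, or DICES, and the suffix list is an arbitrary assumption.

```python
# Illustrative only: a toy rule-based stemmer, not part of spaCy or DICES.
SUFFIXES = ("us", "um")

def naive_stem(word):
    """Strip the first matching suffix; the rule never looks at context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# second-declension noun: the rule behaves sensibly
print(naive_stem("dominus"))

# third-declension neuter (opus, operis): the same rule misfires
print(naive_stem("opus"))
```

A rule like this always gives the same answer for the same string, whether or not that answer is right. The trained models trade that predictability for sensitivity to context.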

#### Running NLP

The passage method `runSpacyPipeline()` runs the appropriate spaCy pipeline based on the passage's language and saves the output to a new attribute called `spacy_doc`.

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # run spacy
    speech.passage.runSpacyPipeline()

    # check that it worked
    print(speech, len(speech.passage.spacy_doc), "tokens")

#### spaCy output

`spacy_doc` gives you access to all the output of the NLP pipeline. The easiest way to use it is as a collection of *tokens*. These represent spaCy's best attempt to cut up the continuous string of text into words (or word-like elements). Each token has many attributes of its own; here are a few that might be useful:

- `text`: this is the way the word looked in the original text
- `lemma_`: this is a dictionary headword assigned by the model
- `pos_`: this is a part of speech tag assigned by the model
- `morph`: this is a collection of morphological tags

**Notes:**
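1. Some of these attributes end in an underscore. In spaCy, the underscore versions (`lemma_`, `pos_`) return human-readable strings; the same names without the underscore return internal integer IDs.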
2. The `morph` attribute is a collection of tags; different parts of speech may have different tags. To check for a tag, use `morph.get()`.

### Create a table of tokens

This code creates a table with one row per token. Using nested `for` loops, we take each speech in turn, then each token of that speech. We build a record for the row that incorporates information about the speech and the token. At the end, we turn the list of rows into a Pandas data frame.

In [None]:
# start with an empty list of rows
rows = []

# iterate over speeches
for speech in farewells:
    
    # iterate over tokens
    for token in speech.passage.spacy_doc:

        # create a record for this token
        row = {
            "speech_id": speech.id,
            "language": speech.lang,
            "author": speech.author.name,
            "title": speech.work.title,
            "book": speech.getPrefix(),                       # see note 1 below
            "line": speech.passage.getLine(token)["n"],       # see note 2
            "speaker": speech.getSpkrString(),
            "addressee": speech.getAddrString(),
            "token": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "verbform": ''.join(token.morph.get("VerbForm")), # see note 3
            "mood": ''.join(token.morph.get("Mood")),
            "voice": ''.join(token.morph.get("Voice")),
            "tense": ''.join(token.morph.get("Tense")),
            "person": ''.join(token.morph.get("Person")),
            "case": ''.join(token.morph.get("Case")),
            "gender": ''.join(token.morph.get("Gender")),
            "number": ''.join(token.morph.get("Number")),    
        }

        # add row to the list of rows
        rows.append(row)

# convert to a table
tokens = pd.DataFrame(rows)

**Notes:**
1. I'm using the `getPrefix()` method to get just the book number of the locus.
2. `passage.getLine(token)` returns the one line of `line_array` that contains `token`. This allows us to give the individual line number for each token, rather than just the first and last lines for the speech as a whole.
3. The output of `morph.get()` is always a list, but it usually has one element (if the tag exists) or zero (if it doesn't). The expression `''.join(token.morph.get("VerbForm"))` checks for the tag "VerbForm" and joins any elements of the resulting list into a single string. If `get()` returns an empty list, the output is an empty string "". If `get()` returns a list of one tag, then the output is just the tag name.
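
The `''.join()` idiom from note 3 can be tried out on plain lists standing in for the output of `morph.get()`. The two-value case below is a hypothetical edge case, included to show why you might prefer a delimiter if a model ever returns more than one value for a tag.

```python
# Plain lists standing in for the output of token.morph.get()
no_tag = []                   # tag absent
one_tag = ["Fin"]             # the usual case
two_tags = ["Masc", "Neut"]   # hypothetical ambiguous analysis

print(repr("".join(no_tag)))     # empty list gives an empty string
print(repr("".join(one_tag)))    # single value comes through unchanged
print(repr("".join(two_tags)))   # multiple values run together
print(repr(",".join(two_tags)))  # a delimiter keeps them distinguishable
```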

In [None]:
# write the tokens to a tab-separated (TSV) file
tokens.to_csv("tokens.tsv", sep="\t", index=False)

# display the table
display(tokens)
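
If you later want to read a tab-separated file like this one without Pandas, Python's built-in `csv` module is one option. The sketch below writes a tiny stand-in file first so that it runs on its own; the file name, columns, and rows are placeholders, not real parser output.

```python
import csv
import os
import tempfile

# Write a tiny stand-in file with the same tab-separated layout.
# The rows are placeholders, not real parser output.
path = os.path.join(tempfile.mkdtemp(), "tokens_demo.tsv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["speech_id", "token", "pos", "mood"])
    writer.writerow([1, "word1", "VERB", "Ind"])
    writer.writerow([1, "word2", "NOUN", ""])

# Read it back: every value arrives as a string,
# and empty cells stay empty strings
with open(path, newline="", encoding="utf-8") as f:
    rows_in = list(csv.DictReader(f, delimiter="\t"))

# keep only the verbs
verbs = [row["token"] for row in rows_in if row["pos"] == "VERB"]
print(verbs)
```

Note that numeric columns such as `speech_id` come back as strings with this approach, so convert them explicitly if you need to compare or sort by number.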