# Assignment: word count

In this assignment we create basic word counts for two speeches: Calypso to Odysseus at *Odyssey* 5.203-5.213 and Dido to Aeneas at *Aeneid* 4.305-4.330.

The first two sections fast-forward through the same steps as the **Example 2** notebook, skipping some of the longer explanations. If you're continuing from that point, you can skip ahead to [Section 3](#3.-Word-counts) below.

## 1. Preliminaries

### Install DICES client software

Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.

In [None]:
# revert pip for language model compatibility
!pip install -q --force-reinstall pip==22.0.0
# install DICES client
!pip install -q git+https://github.com/cwf2/dices-client
# install language models
!pip install -q https://huggingface.co/latincy/la_core_web_md/resolve/main/la_core_web_md-any-py3-none-any.whl
!pip install -q https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl

### Import statements

Here we tell Python which ancillary packages we want to make accessible: in this case, the DICES client and Pandas.

In [None]:
# for talking to DICES
from dicesapi import DicesAPI
# for talking to Perseus
from dicesapi.text import CtsAPI
# for creating data tables
import pandas as pd

### Connect to external resources
Here we instantiate connections to the DICES database and to the Perseus Digital Library.

In [None]:
# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)

# create a new CTS connection
cts = CtsAPI(dices_api=api)

## 2. Download speeches

It's a little hard to request these exact speeches from DICES. I'm going to:

- request all speeches by Calypso to Odysseus
- request all speeches by Dido to Aeneas
- add the results together
- look at them manually and note the IDs of the speeches I want
- filter the results for just those IDs

### Get speeches from DICES

In [None]:
# download metadata on speeches by Calypso to Odysseus and by Dido to Aeneas
speeches = (
    api.getSpeeches(spkr_name="Calypso", addr_name="Odysseus") +
    api.getSpeeches(spkr_name="Dido", addr_name="Aeneas")
).sorted()

# check that it worked
print(len(speeches), "speeches.")

### Make a speech table

Here we create a Pandas data frame from our results.

In [None]:
# an empty list of rows
rows = []

# iterate over all speeches
for speech in speeches:
    
    # create a new row
    row = {
        "speech_id": speech.id,
        "language": speech.lang,
        "author": speech.author.name,
        "work": speech.work.title,
        "first_line": speech.l_fi,
        "last_line": speech.l_la,
        "speaker": speech.getSpkrString(),
        "addressee": speech.getAddrString(),
    }

    # add row to the list
    rows.append(row)

# make a table
speech_table = pd.DataFrame(rows)

# display
display(speech_table)

### Filter on ID

From the table, I can see that the speeches I want have IDs 820 and 1610. I'm going to create a new group with just those.

In [None]:
# filter on ID
farewells = speeches.filterIDs([820, 1610])

# check that it worked
for speech in farewells:
    print(speech)

### Download the text from Perseus

Next, I'm going to download the text of both speeches from Perseus using the CTS connection I created earlier. The `getPassage()` method takes a DICES speech as its argument and returns the corresponding passage from Perseus. I'll save the passage as a new attribute of the speech object to make sure that the speech and its text stay together.

It's important to check that this step works, because a few speeches in Perseus don't work well with CTS yet.

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # download the text from perseus
    speech.passage = cts.getPassage(speech)

    # display the results
    print(speech, "\n")
    print(speech.passage.text, "\n")

### Parse the text using natural language processing

In this step, we process the downloaded text of the speeches using NLP. The method `runSpacyPipeline()` runs the appropriate spaCy pipeline based on the passage's language and saves the output to a new attribute called `spacy_doc`.

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # run spacy
    speech.passage.runSpacyPipeline()

    # check that it worked
    print(speech, len(speech.passage.spacy_doc), "tokens")

### Create a table of tokens

This code creates a table with one row per token. Using nested `for` loops, we take each speech in turn, then each token of that speech. We build a record for the row that incorporates information about the speech and the token. At the end, we turn the list of rows into a Pandas data frame.

In [None]:
# start with an empty list of rows
rows = []

# iterate over speeches
for speech in farewells:
    
    # iterate over tokens
    for token in speech.passage.spacy_doc:

        # create a record for this token
        row = {
            "speech_id": speech.id,
            "language": speech.lang,
            "author": speech.author.name,
            "title": speech.work.title,
            "book": speech.getPrefix(),                       # see note 1 below
            "line": speech.passage.getLine(token)["n"],       # see note 2
            "speaker": speech.getSpkrString(),
            "addressee": speech.getAddrString(),
            "token": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "verbform": ''.join(token.morph.get("VerbForm")), # see note 3
            "mood": ''.join(token.morph.get("Mood")),
            "voice": ''.join(token.morph.get("Voice")),
            "tense": ''.join(token.morph.get("Tense")),
            "person": ''.join(token.morph.get("Person")),
            "case": ''.join(token.morph.get("Case")),
            "gender": ''.join(token.morph.get("Gender")),
            "number": ''.join(token.morph.get("Number")),    
        }

        # add row to the list of rows
        rows.append(row)

# convert to a table
tokens = pd.DataFrame(rows)

# write the tokens to a tab-separated (TSV) file
tokens.to_csv("tokens.tsv", sep="\t", index=False)

# display the table
display(tokens)

## 3. Word counts

Beginning from the token table, we can create counts of any feature using `groupby()` and `agg()`.
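
To see what this pattern does in isolation, here is a minimal sketch on a made-up table. The data frame `toy` and the column name `n_tokens` are my own placeholders, not part of the assignment data. `groupby()` splits the rows by speaker, and each keyword argument inside `agg()` defines one new column as a pair: the source column and the function to apply to it.

```python
import pandas as pd

# hypothetical toy data, not drawn from either speech
toy = pd.DataFrame({
    "speaker": ["A", "A", "A", "B", "B"],
    "token":   ["x", "y", "x", "z", "z"],
})

# one row per speaker; "n_tokens" counts the values in the "token" column
toy_counts = toy.groupby("speaker").agg(
    n_tokens=("token", "count"),
)

print(toy_counts)
```

Speaker A ends up with a count of 3 and speaker B with 2. Note that `count` counts every row in the group, repeats included.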

### Basic counts

#### How many words does each character speak?

In [None]:
# word count by character
wc = (
    tokens                            # original data 
    .groupby("speaker")               # group by character
    .agg(                             # define new table
        tokens = ("token", "count"),  #   - token count per character
    )
)

# show results
display(wc)

#### Type-token ratio

We can add to this the number of unique lemmata used by each character. The ratio of lemmata to tokens is one measure of language richness. Because Dido and Calypso speak different languages, and because their speeches are of different lengths, we might be cautious about drawing conclusions from this without further study.
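
As a quick illustration of the arithmetic, here is the same calculation done by hand in plain Python on five invented (token, lemma) pairs. The pairs are a toy sample I made up for this sketch, not text from either speech.

```python
# toy (token, lemma) pairs, for illustration only
pairs = [
    ("est", "sum"),
    ("sunt", "sum"),
    ("amor", "amor"),
    ("amorem", "amor"),
    ("et", "et"),
]

# every pair is one token
n_tokens = len(pairs)

# a set keeps each lemma only once
n_lemmas = len({lemma for _, lemma in pairs})

# ratio of unique lemmata to tokens
ratio = round(n_lemmas / n_tokens, 2)

print(n_tokens, n_lemmas, ratio)  # 5 3 0.6
```

Five tokens share three lemmata, giving a ratio of 0.6. The Pandas version below does the same thing per speaker, with `count` and `nunique` standing in for `len()` and the set.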

In [None]:
# word count by character
wc = (
    tokens                             # original data 
    .groupby("speaker")                # group by character
    .agg(                              # define new table
        tokens = ("token", "count"),   #   - token count per character
        lemmas = ("lemma", "nunique"), #   - number of unique lemmata
    )
)

# add a column for the ratio of lemmata to tokens
wc["ratio"] = (wc["lemmas"] / wc["tokens"]).round(2)

# show results
display(wc)

### Ranked word list

Let's create a table of word counts for each lemma in Calypso's speech. To examine her speech separately we can use `query()` to filter on speaker.

In [None]:
# create a word count by lemma
wc = (
    tokens                                 # original data
    .query("speaker=='Calypso'")           # filter by speaker
    .groupby("lemma")                      # group by lemma
    .agg(                                  # define new table
        count = ("token", "count"),        #  - count how many tokens have this lemma
    )
    .sort_values("count", ascending=False) # sort results from highest to lowest
)

# show results
display(wc)

### Comparing speakers

If I want to make lists for both speakers, I can use a `for` loop to avoid writing the same code twice.

In [None]:
# consider each speaker in turn
for name in ["Calypso", "Dido"]:

    # create a word count by lemma
    wc = (
        tokens                                 # original data
        .query(f"speaker=='{name}'")           # filter by speaker
        .groupby("lemma")                      # group by lemma
        .agg(                                  # define new table
            count = ("token", "count"),        #  - count how many tokens have this lemma
        )
        .sort_values("count", ascending=False) # sort results from highest to lowest
    )

    # show just the top of the list
    display(name, wc.head(10))

**Notes**

- The variable `name` is set to "Calypso" and "Dido" in turn.
- In the expression `f"speaker=='{name}'"`, the `f` before the quotation marks allows me to insert variables into the quoted text by using curly braces.
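
Here is a small sketch of the difference the `f` makes, using only plain Python. The variable names `expr` and `plain` are just placeholders for this example.

```python
name = "Dido"

# with the f prefix, {name} is replaced by the value of the variable
expr = f"speaker=='{name}'"
print(expr)   # speaker=='Dido'

# without the prefix, the curly braces are kept literally
plain = "speaker=='{name}'"
print(plain)  # speaker=='{name}'
```

The first string is what `query()` needs. The second would look for a speaker literally called `{name}` and match nothing.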

### Add frequency

Because the two speeches are different lengths, it makes sense to divide by the total number of tokens. Word frequencies are often expressed per 1000 words, to make them easier to read.
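
To make the calculation concrete, here is a minimal sketch with invented numbers. The helper `per_thousand()` is a hypothetical function written for this example only, and the totals of 120 and 240 are not the real lengths of either speech.

```python
def per_thousand(count, total):
    """Express a raw count as a rate per 1000 tokens (hypothetical helper)."""
    return round(1000 * count / total, 1)

# the same raw count of 3 means different things in speeches of different lengths
print(per_thousand(3, 120))  # 25.0
print(per_thousand(3, 240))  # 12.5
```

A lemma that occurs three times is twice as frequent in the shorter speech, which the raw counts alone would hide.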

In [None]:
# consider each speaker in turn
for name in ["Calypso", "Dido"]:

    # create a word count by lemma
    wc = (
        tokens                                 # original data
        .query(f"speaker=='{name}'")           # filter by speaker
        .groupby("lemma")                      # group by lemma
        .agg(                                  # define new table
            count = ("token", "count"),        #  - count how many tokens have this lemma
        )
        .sort_values("count", ascending=False) # sort results from highest to lowest
    )
    
    # calculate total tokens
    total = wc["count"].sum()
    
    # add frequency column
    wc["freq"] = (1000 * wc["count"] / total).round(1)
    
    # show just the top
    display(name, wc.head(10))

### A miniature concordance

Let's make use of some additional information in the table. Alongside the token count for each lemma, we can include a list of the unique forms that actually occur, plus a list of lines on which they are found.

**Notes**

- Not all aggregation functions produce a single number. `unique` produces an array of values with duplicates removed.
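
Before running this on the real token table, here is a sketch of what `unique` returns, using a tiny made-up table. The rows in `toy` are invented for illustration and the line numbers are arbitrary.

```python
import pandas as pd

# hypothetical toy token table
toy = pd.DataFrame({
    "lemma": ["sum", "sum", "sum", "et"],
    "token": ["est", "sunt", "est", "et"],
    "line":  ["1", "2", "2", "1"],
})

toy_conc = toy.groupby("lemma").agg(
    count=("token", "count"),    # every occurrence, repeats included
    forms=("token", "unique"),   # distinct forms, in order of first appearance
    lines=("line", "unique"),    # distinct line labels
)

print(toy_conc)
```

For the lemma `sum`, the count is 3, but `forms` holds only `est` and `sunt` because the repeated `est` is dropped. Each cell in `forms` and `lines` holds a whole collection of values rather than a single number.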

In [None]:
# consider each speaker in turn
for name in ["Calypso", "Dido"]:

    # create a word count by lemma
    wc = (
        tokens                                 # original data
        .query(f"speaker=='{name}'")           # filter by speaker
        .groupby("lemma")                      # group by lemma
        .agg(                                  # define new table
            count = ("token", "count"),        #  - count how many tokens have this lemma
            forms = ("token", "unique"),       #  - which forms are being counted
            lines = ("line", "unique"),        #  - which lines does it occur on
        )
        .sort_values("count", ascending=False) # sort results from highest to lowest
    )
    
    # calculate total tokens
    total = wc["count"].sum()
    
    # add frequency column
    wc.insert(1, "freq", (1000 * wc["count"] / total).round(1))
    
    # show just the top
    display(name, wc.head(10))