<a href="https://colab.research.google.com/github/cwf2/style_2025/blob/main/Assignment%202b%20-%20pronouns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment: pronouns

In this assignment we look at the use of pronouns in two speeches: Calypso to Odysseus at *Odyssey* 5.203-5.213 and Dido to Aeneas at *Aeneid* 4.305-4.330.

The first section fast-forwards through the same steps as the **Example 2** notebook, skipping some of the longer explanations. If you're continuing from that point, you can skip ahead to [Section 3](#3.-Part-of-speech-counts) below.

## 1. Preliminaries

### Install DICES client software

Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.

In [None]:
# revert pip for language model compatibility
!pip install -q --force-reinstall pip==22.0.0
# install DICES client
!pip install -q git+https://github.com/cwf2/dices-client
# install language models
!pip install -q https://huggingface.co/latincy/la_core_web_md/resolve/main/la_core_web_md-any-py3-none-any.whl

### Import statements

Here we tell Python which ancillary packages we want to make accessible: in this case, the DICES client and Pandas.

In [None]:
# for talking to DICES
from dicesapi import DicesAPI
# for talking to Perseus
from dicesapi.text import CtsAPI
# for creating data tables
import pandas as pd

### Connect to external resources
Here we instantiate connections to the DICES database and to the Perseus Digital Library.

In [None]:
# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)

# create a new CTS connection
cts = CtsAPI(dices_api=api)

## 2. Download speeches

It's a little hard to request these exact speeches from DICES. I'm going to:

- request all speeches by Calypso to Odysseus
- request all speeches by Dido to Aeneas
- add the results together
- look at them manually and note the IDs of the speeches I want
- filter the results for just those IDs

### Get speeches from DICES

In [None]:
# download metadata on speeches by Calypso to Odysseus and by Dido to Aeneas
speeches = (
    api.getSpeeches(spkr_name="Calypso", addr_name="Odysseus") +
    api.getSpeeches(spkr_name="Dido", addr_name="Aeneas")
).sorted()

# check that it worked
print(len(speeches), "speeches.")

### Make a speech table

Here we create a Pandas data frame from our results.

In [None]:
# an empty list of rows
rows = []

# iterate over all speeches
for speech in speeches:

    # create a new row
    row = {
        "speech_id": speech.id,
        "language": speech.lang,
        "author": speech.author.name,
        "work": speech.work.title,
        "first_line": speech.l_fi,
        "last_line": speech.l_la,
        "speaker": speech.getSpkrString(),
        "address": speech.getAddrString(),
    }

    # add row to the list
    rows.append(row)

# make a table
speech_table = pd.DataFrame(rows)

# display
display(speech_table)

### Filter on ID

From the table, I can see that the speeches I want have IDs 820 and 1610. I'm going to create a new group with just those.

In [None]:
# filter on ID
farewells = speeches.filterIDs([820, 1610])

# check that it worked
for speech in farewells:
    print(speech)

### Download the text from Perseus

Next, I'm going to download the text of both speeches from Perseus using the CTS connection I created earlier. The `getPassage()` method takes a DICES speech as its argument and returns the corresponding passage from Perseus. I'll save the passage as a new attribute of the speech object to make sure that the speech and its text stay together.

It's important to check that this step works... a few speeches in Perseus don't work well with CTS yet.

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # download the text from perseus
    speech.passage = cts.getPassage(speech)

    # display the results
    print(speech, "\n")
    print(speech.passage.text, "\n")

### Parse the text using natural language processing

In this step, we process the downloaded text of the speeches using NLP. The method `runSpacyPipeline()` runs the appropriate spaCy pipeline based on the passage's language and saves the output to a new attribute called `spacy_doc`.

In [None]:
# loop over two farewell speeches
for speech in farewells:
    # run spacy
    speech.passage.runSpacyPipeline()

    # check that it worked
    print(speech, len(speech.passage.spacy_doc), "tokens")

### Create a table of tokens

This code creates a table with one row per token. Using nested `for` loops, we take each speech in turn, then each token of that speech. We build a record for the row that incorporates information about the speech and the token. At the end, we turn the list of rows into a Pandas data frame.

In [None]:
# start with an empty list of rows
rows = []

# iterate over speeches
for speech in farewells:

    # iterate over tokens
    for token in speech.passage.spacy_doc:

        # create a record for this token
        row = {
            "speech_id": speech.id,
            "language": speech.lang,
            "author": speech.author.name,
            "title": speech.work.title,
            "book": speech.getPrefix(),                       # book: the prefix of the line reference
            "line": speech.passage.getLine(token)["n"],       # line containing this token
            "speaker": speech.getSpkrString(),
            "addressee": speech.getAddrString(),
            "token": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "verbform": ''.join(token.morph.get("VerbForm")), # morph.get() returns a list; join it into a string
            "mood": ''.join(token.morph.get("Mood")),
            "voice": ''.join(token.morph.get("Voice")),
            "tense": ''.join(token.morph.get("Tense")),
            "person": ''.join(token.morph.get("Person")),
            "case": ''.join(token.morph.get("Case")),
            "gender": ''.join(token.morph.get("Gender")),
            "number": ''.join(token.morph.get("Number")),
        }

        # add row to the list of rows
        rows.append(row)

# convert to a table
tokens = pd.DataFrame(rows)

# write the tokens to a tab-separated (TSV) file
tokens.to_csv("tokens.tsv", sep="\t", index=False)

# display the table
display(tokens)
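
The `''.join(...)` idiom in the morphology columns is worth a brief illustration. spaCy's `token.morph.get()` returns a list of values, which is empty when the token lacks that feature; joining the list gives a plain string for the table. A minimal sketch, using hand-made lists as stand-ins for the spaCy output (the variable names are illustrative only):

```python
# stand-ins for what token.morph.get(...) returns: a list of values
case_values = ["Nom"]   # feature present on this token
mood_values = []        # feature absent (e.g. asking a noun for its mood)

# joining with an empty separator turns each list into a plain string
case = "".join(case_values)
mood = "".join(mood_values)

# a present feature becomes its value; an absent one becomes an empty string
print(repr(case), repr(mood))
```

This is why the morphology columns hold empty strings, rather than missing values, wherever a feature doesn't apply.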

## 3. Part-of-speech counts

Beginning from the token table, we can create counts of any feature using `groupby()` and `agg()`. Note that both the Latin and Greek models are using [Universal Part of Speech](https://universaldependencies.org/u/pos/) tags. In theory, this should allow at least some cross-language comparison.

### Basic counts

This is the same sort of data manipulation we did in the first half of the workshop. We're grouping by two factors, speaker and part-of-speech. We want our summary table to have rows drawn from the values of **speaker** and columns from the values of **pos**.

In [None]:
# part of speech usage by character
pos = (
    tokens                            # original data
    .groupby(["speaker", "pos"])      # group by speaker and part of speech
    .agg(                             # define new table
        count = ("token", "count"),   #   - token count
    )
    .unstack()                        # move second factor to columns
    .fillna(0)                        # add zeros for missing data
    .astype(int)                      # convert to integer (since counts are whole numbers)
)

# show results
display(pos)

**Notes**

- If a given part of speech occurs only in one speaker, the default (stacked) version of the table will just leave it out for the other speaker. But when we `unstack()` to move POS tags to columns, then something has to go in the cell for that column. Pandas defaults to putting `NaN`, representing missing data. Sometimes `NaN` is the best answer, but in this case we can safely replace missing values with zeros.
- Because `NaN` is itself a floating-point value, the columns that receive it are converted to *floating-point* format (i.e. numbers with a decimal). Because all our counts are necessarily whole numbers, the decimal doesn't make sense. I'm adding a line to force the whole table back to integer format.
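
A small sketch of both points on made-up data (the speakers `A` and `B` and their tokens are invented for illustration, not drawn from the speeches):

```python
import pandas as pd

# toy token table: speaker B never uses a pronoun
toy = pd.DataFrame({
    "speaker": ["A", "A", "A", "B", "B"],
    "pos":     ["NOUN", "PRON", "NOUN", "NOUN", "VERB"],
    "token":   ["t1", "t2", "t3", "t4", "t5"],
})

# stacked counts: the (B, PRON) combination is simply absent
stacked = toy.groupby(["speaker", "pos"]).agg(count=("token", "count"))

# unstacking creates a cell for (B, PRON), filled with NaN
wide = stacked.unstack()
print(wide)

# replace NaN with zero and convert back to whole numbers
clean = wide.fillna(0).astype(int)
print(clean)
```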

### Converting to frequencies

Since the two speeches are different lengths, it makes sense to convert raw counts to frequencies. We'll represent these as per 1000 words to keep the numbers manageable.

In [None]:
# calculate total number of tokens for each speech
totals = pos.sum(axis=1)

# divide each value in each row by the respective total
pos_freq = (
    pos["count"]              # needed because column names have two levels
    .div(totals, axis=0)      # divide by totals, row-wise
    .multiply(1000)           # convert to freq per 1000 words
    .round(1)                 # round answer to 1 decimal
)

# show results
display(pos_freq)

Now we can see more clearly that, for example, while Calypso uses fewer adjectives in absolute terms, they represent a larger proportion of her speech. Nouns and verbs, meanwhile, seem relatively similar between the two speeches.

Some things, like adverbs, will depend a lot on the difference between the languages and the behaviour of the models. It's worth looking more closely to see what is included in these tags.

As for pronouns, the subject of our interest here, Dido uses more both in absolute terms and proportionally.
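
To make the normalization concrete, here is the same calculation on invented counts (the row labels and numbers below are made up, not taken from the speeches). It shows how a shorter speech can have fewer adjectives in absolute terms and still a higher rate per 1000 tokens.

```python
import pandas as pd

# invented counts: rows are speeches, columns are POS tags
counts = pd.DataFrame(
    {"ADJ": [5, 8], "NOUN": [45, 152]},
    index=["short_speech", "long_speech"],
)

# total tokens per row: 50 and 160
totals = counts.sum(axis=1)

# divide each row by its own total, then scale to "per 1000"
freq = counts.div(totals, axis=0).multiply(1000).round(1)
print(freq)
```

In this toy case the short speech has 5 adjectives against 8, but 100 per 1000 tokens against 50.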

### A little concordance of pronouns

Let's create a list of pronouns used by each speaker. We'll count how often each pronoun is used, and include a list of the forms it takes and where it occurs in the speech.

In [None]:
# create a concordance of pronouns
conc = (
    tokens                                 # start with full token table
    .query("pos=='PRON'")                  # keep only pronouns
    .groupby(["speaker", "lemma"])         # group by speaker and lemma
    .agg(                                  # define new summary table
        count = ("token", "count"),        #  - count of tokens per lemma
        forms = ("token", "unique"),       #  - list of all unique forms
        lines = ("line", "unique"),        #  - list of loci
    )
    .sort_values(["speaker", "count"], ascending=False) # sort results
)

display(conc)
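
One detail of the sort: a single `ascending=False` applies to both keys, so the speakers are listed in reverse alphabetical order as well. If you'd rather keep speakers alphabetical and sort only the counts downward, `sort_values()` also accepts one flag per key. A sketch on an invented summary table (the speakers, lemmas and counts are placeholders):

```python
import pandas as pd

# invented summary table with the same index layout as the concordance
toy_conc = pd.DataFrame(
    {"count": [1, 3, 2, 5]},
    index=pd.MultiIndex.from_tuples(
        [("A", "x"), ("A", "y"), ("B", "x"), ("B", "z")],
        names=["speaker", "lemma"],
    ),
)

# speakers ascending, counts descending within each speaker
by_speaker = toy_conc.sort_values(["speaker", "count"], ascending=[True, False])
print(by_speaker)
```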

### Distribution within the speeches

While our little concordance gives us some sense of the pronouns' distributions in their respective speeches, it can be hard to visualize. Let's count pronouns per line so we can get a clearer picture of where they are in the speech.

In [None]:
# create a table of POS counts per line
pos_by_line = (
    tokens                                # original token table
    .query("speaker=='Calypso'")          # filter by speaker name
    .groupby(["line", "pos"])             # group by line and part of speech
    .agg(                                 # define new table
        count = ("token", "count"),       #  - tokens per line
    )
    .unstack()          # move pos to columns
    .fillna(0)          # fill missing values with zero
    .astype(int)        # convert to whole numbers
)

# drop the extra "count" level in the column names
pos_by_line = pos_by_line["count"]

# show results
display(pos_by_line)


### Graph distribution

Now let's graph the resulting distribution.

In [None]:
# display a bar plot of PRON counts
pos_by_line["PRON"].plot.bar(title="Calypso")

#### Likewise for Dido...

In [None]:
# create a table of POS counts per line
pos_by_line = (
    tokens                                # original token table
    .query("speaker=='Dido'")             # filter by speaker name
    .groupby(["line", "pos"])             # group by line and part of speech
    .agg(                                 # define new table
        count = ("token", "count"),       #  - tokens per line
    )
    .unstack()          # move pos to columns
    .fillna(0)          # fill missing values with zero
    .astype(int)        # convert to whole numbers
)

# drop the extra "count" level in the column names
pos_by_line = pos_by_line["count"]

# display a bar plot of PRON counts
pos_by_line["PRON"].plot.bar(title="Dido")
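
Since the Calypso and Dido cells differ only in the speaker's name, the shared steps could be wrapped in a function. This is one possible sketch, not part of the original workflow: the function name `pos_counts_by_line` and the toy table are hypothetical, and the real `tokens` table would be passed in the same way.

```python
import pandas as pd

def pos_counts_by_line(tokens, speaker):
    """Return a table of token counts (rows: lines, columns: POS tags) for one speaker."""
    table = (
        tokens
        .query("speaker == @speaker")     # filter by speaker name
        .groupby(["line", "pos"])         # group by line and part of speech
        .agg(count=("token", "count"))    # tokens per line and tag
        .unstack()                        # move pos to columns
        .fillna(0)                        # fill missing values with zero
        .astype(int)                      # convert to whole numbers
    )
    # drop the extra "count" level in the column names
    return table["count"]

# invented token table for demonstration only
toy = pd.DataFrame({
    "speaker": ["X", "X", "X", "Y"],
    "line":    ["1", "1", "2", "1"],
    "pos":     ["PRON", "NOUN", "PRON", "PRON"],
    "token":   ["a", "b", "c", "d"],
})

x_counts = pos_counts_by_line(toy, "X")
print(x_counts)
```

With the real data, `pos_counts_by_line(tokens, "Dido")["PRON"].plot.bar(title="Dido")` should reproduce the plot above.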