<a href="https://colab.research.google.com/github/cwf2/style_2025/blob/main/Assignment%201b%20-%20female%20speakers%20in%20Il%20and%20Od.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install DICES client software

This step is only necessary once on most machines, but because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.

In [None]:
!pip install -q git+https://github.com/cwf2/dices-client

### Import statements

This tells Python which ancillary functions we want to use in this notebook.

In [None]:
from dicesapi import DicesAPI
from dicesapi.text import CtsAPI
import pandas as pd
import seaborn as sns

### Initialize connection to external sources

This creates a connection to the DICES database.

In [None]:
# DICES database
api = DicesAPI(logfile="dices.log", logdetail=0)

### Get some speeches

This is the basic search function to get speeches from DICES according to specific parameters.

In [None]:
# Download speeches by female speakers in Homer
speeches = api.getSpeeches(author_name="Homer", spkr_gender="female")

# how many did we get?
n = len(speeches)

# print out a message
print(f"Retrieved {n} speeches")

### Print a list of speeches retrieved

In addition to the basic attributes of each speech, we do a rough calculation of its length in lines, based on the first and last line numbers. The calculation assumes that a speech begins and ends in the same book.

In [None]:
for speech in speeches:
    # separate book and line numbers
    book_first, line_first = speech.l_fi.split(".")
    book_last, line_last = speech.l_la.split(".")

    # calculate length of speech
    nlines = int(line_last) - int(line_first) + 1

    # print row
    print(
        speech.id,
        speech.author.name,
        speech.work.title,
        book_first,
        line_first,
        line_last,
        nlines,
        speech.getSpkrString(),
        speech.getAddrString(),
        sep="\t")
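The line count above is only approximate because it ignores the book number of the last line. As a minimal sketch, the same arithmetic can be wrapped in a helper that flags speeches crossing a book boundary instead of returning a misleading number. The function name `rough_line_count` and the sample loci are hypothetical, chosen for illustration; they are not part of DICES.

```python
def rough_line_count(first_loc, last_loc):
    """Estimate speech length from loci written as 'book.line'.

    Returns None when the speech starts and ends in different books,
    because simple subtraction is not meaningful in that case.
    """
    # separate book and line numbers
    book_first, line_first = first_loc.split(".")
    book_last, line_last = last_loc.split(".")

    # a speech that crosses a book boundary cannot be measured this way
    if book_first != book_last:
        return None

    # inclusive count: both the first and the last line belong to the speech
    return int(line_last) - int(line_first) + 1


# invented loci, for illustration only
print(rough_line_count("1.207", "1.214"))
print(rough_line_count("1.600", "2.5"))
```

The first call prints `8`, since lines 207 through 214 inclusive make eight lines; the second prints `None`.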

### Make a table

Python can work with tabular data like a spreadsheet with the help of the ancillary package [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide). Here we make the same data into a Pandas data frame.

We can choose which of the attributes collected above go into the table.

In [None]:
# an empty list to hold the rows
rows = list()

# iterate over the speeches
for speech in speeches:
    # separate book and line numbers
    book_first, line_first = speech.l_fi.split(".")
    book_last, line_last = speech.l_la.split(".")

    # calculate length of speech
    nlines = int(line_last) - int(line_first) + 1

    # create a new row, labelling all the data values
    row = {
        "id": speech.id,
        "author": speech.author.name,
        "work": speech.work.title,
        "book": int(book_first),
        "first_line": line_first,
        "last_line": line_last,
        "num_lines": nlines,
        "speaker": speech.getSpkrString(),
    }

    # add the row to the list
    rows.append(row)

# make the table
table = pd.DataFrame(rows)

# write the table to a file for import to Excel
table.to_csv("speeches.tsv", sep="\t", index=False)

# display the table
display(table)

### Summarize data

Just like in Excel, we can summarize tabular data with a pivot table (draaitabel). In this example, we'll count how many speeches are attributed to female speakers in each book of the *Iliad* and the *Odyssey*.

We need to specify which columns in the original table we want to use:
- The rows (or "index") of our summary table will come from **book**. Each book number gets one row in the new table.
- The columns will come from **work**, i.e., "iliad" vs "odyssey".
- We'll derive the values for each cell from the `id` column: that is, we're going to count how many speeches by female speakers fall in each book of each work.

We also need to specify how we want to summarize the speech ids. In this case, we just want to count them. We tell Python this using the `aggfunc` ("aggregation function") parameter.

In [None]:
count_by_book = (
    table
    .pivot_table(
        index="book",
        columns="work",
        values="id",
        aggfunc="count"
    )
    .fillna(0)
    .astype(int)
)
# keep the index when saving: it holds the book numbers
count_by_book.to_csv("speech_count_by_book.csv")
display(count_by_book)
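To see what `aggfunc="count"` does, it can help to run the same pivot on a tiny hand-made table. The four rows below are invented for illustration and are not DICES data; the name `toy` is likewise hypothetical.

```python
import pandas as pd

# four invented speeches: three in book 1, one in book 3
toy = pd.DataFrame([
    {"id": 1, "work": "Iliad", "book": 1},
    {"id": 2, "work": "Iliad", "book": 1},
    {"id": 3, "work": "Odyssey", "book": 1},
    {"id": 4, "work": "Iliad", "book": 3},
])

toy_counts = (
    toy
    .pivot_table(
        index="book",      # one row per book number
        columns="work",    # one column per poem
        values="id",       # the thing being counted
        aggfunc="count"
    )
    .fillna(0)             # a book/work pair with no speeches becomes 0
    .astype(int)           # counts are whole numbers
)

print(toy_counts)
```

Book 3 has no speech in the toy *Odyssey* rows, so that cell starts out empty and is filled with 0 by `.fillna(0)`.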

### Make a graph

Pandas has some basic visualization functions built in. Let's turn the summary table above into a bar graph.

In [None]:
# generate a bar graph
plot_by_book = count_by_book.plot.bar(title="Speeches by female speakers", ylabel="number of speeches")

# save to an image file
plot_by_book.figure.savefig("speech_count_by_book.png")

### More aggregation options

Let's do a second summary, this time looking at the number of lines spoken by women in each book of each poem. The rows and columns of our summary table will be the same as last time. But now the values will come from `num_lines` and the aggregation function will be `"sum"` instead of `"count"`.

In [None]:
count_by_line = (
    table
    .pivot_table(
        index="book",
        columns="work",
        values="num_lines",
        aggfunc="sum"
    )
    .fillna(0)
    .astype(int)
)
# keep the index when saving: it holds the book numbers
count_by_line.to_csv("lines_by_gender.csv")
display(count_by_line)
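A pivot table with `aggfunc="sum"` is closely related to grouping rows and adding up a column. The sketch below, again on invented rows rather than DICES data, builds the same summary both ways and checks that the numbers agree. The names `toy`, `by_pivot`, and `by_group` are hypothetical.

```python
import pandas as pd

# invented speeches with made-up lengths
toy = pd.DataFrame([
    {"id": 1, "work": "Iliad", "book": 1, "num_lines": 8},
    {"id": 2, "work": "Iliad", "book": 1, "num_lines": 5},
    {"id": 3, "work": "Odyssey", "book": 1, "num_lines": 12},
    {"id": 4, "work": "Iliad", "book": 3, "num_lines": 4},
])

# approach 1: pivot table, as in the cell above
by_pivot = (
    toy
    .pivot_table(index="book", columns="work", values="num_lines", aggfunc="sum")
    .fillna(0)
    .astype(int)
)

# approach 2: group by book and work, sum the lines, then spread works into columns
by_group = (
    toy
    .groupby(["book", "work"])["num_lines"]
    .sum()
    .unstack(fill_value=0)
)

print(by_pivot)
print(by_group)

# compare the cell values of the two summaries
same = by_pivot.values.tolist() == by_group.values.tolist()
print(same)
```

Either form works for this notebook; the pivot table keeps the row, column, and value choices visible in one call, which is why it reads more like the Excel equivalent.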

In [None]:
# generate a bar graph
plot_by_line = count_by_line.plot.bar(title="Lines spoken by female speakers", ylabel="number of lines")

# save to an image file
plot_by_line.figure.savefig("line_count_by_gender.png")