<a target="_blank" href="https://colab.research.google.com/github/cwf2/toronto2024/blob/main/Ex_01%20-%20Data%20Structures.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Introduction to the Python DICES client

In this workshop, we’re going to look at how to retrieve and work with DICES data inside a Python script. While the DICES web interface can be helpful for browsing and exploring the data, more complicated tasks are better suited to a script.

For example:
- when your search has several steps, and you want to make sure they're done in a specific order
- when you want to repeat an operation many times and collate the results
- when you have to connect information from different sources, like DICES, Perseus, MANTO, etc.

## The DICES API

You could, if you wanted, examine the machine-oriented version of the database manually in your web browser. A separate set of URLs provides access to the same data, but without the human-friendly tables, drop-downs and buttons. For example, compare the two pages below. Both represent the same search: speeches made by Aphrodite to her son, Eros.

- for humans: http://dices.ub.uni-rostock.de/app/speeches?spkr_name=Aphrodite&addr_name=Eros
- for machines: http://dices.ub.uni-rostock.de/api/speeches?spkr_name=Aphrodite&addr_name=Eros

The machine-actionable API is built on the Django web framework. If you’re interested in working with the API directly or have questions or suggestions about its implementation, please feel free to let us know!
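
For the curious, here is a minimal sketch of what talking to the API directly might look like, using only the Python standard library. The base address and the search parameters are taken from the machine-oriented URL above; the helper names `build_url()` and `fetch()` are made up for illustration, and the sketch assumes the server answers with JSON, without assuming anything about how that JSON is organized.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# base address of the machine-oriented pages shown above
API_BASE = 'http://dices.ub.uni-rostock.de/api'

def build_url(endpoint, **params):
    '''Assemble a query URL from an endpoint name and search parameters.'''
    return f'{API_BASE}/{endpoint}?{urlencode(params)}'

def fetch(url):
    '''Download a URL and decode the body (assumes the response is JSON).'''
    with urlopen(url) as response:
        return json.load(response)

# rebuild the "for machines" URL from its parts
url = build_url('speeches', spkr_name='Aphrodite', addr_name='Eros')
print(url)

# uncomment to contact the server (needs a network connection)
# data = fetch(url)
# print(type(data))
```

The download step is left commented out so that the cell runs without a network connection. The rest of this tutorial uses the DICES client instead, which takes care of these details for you.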

## The DICES client

Most of the time, working with URLs like those above and parsing the JSON responses from the server isn’t something you want to have to deal with. The Python DICES client provides a wrapper around the API that lets you make requests and manipulate the results using Python objects.

### The database connection

The client provides a class, **DicesAPI**, which allows you to manage your connection to the database. This is how you request data; it also lets you specify a custom server, in case you’re running your own mirror of the database. The first part of this tutorial will cover searching the database and manipulating the results.

In [None]:
# Google Colab only:
#   run the line below to install the DICES client

!pip install --quiet git+https://github.com/cwf2/dices-client.git

In [None]:
from dicesapi import DicesAPI

# create a connection to the DICES database
api = DicesAPI(logdetail=0)

### Searching speech records

Now that we have a connection to the database, we can search for speeches in more or less the same terms as via the web interface. Here's how we download records for speeches from Aphrodite to Eros using the API.

The code below uses `api`, our connection to the database, to ask for speeches (via the `getSpeeches()` method) and then saves (or "assigns") the results to a new variable, `speeches`.

In [None]:
# request speeches from Aphrodite to Eros
speeches = api.getSpeeches(
    spkr_name = 'Aphrodite',
    addr_name = 'Eros',
)

#### 🤔 Did anything happen?
Let's find out. We can check the length of `speeches` to see how many results we got. We can also iterate over the results with a `for` loop and print them to the screen one at a time.

In [None]:
# check the number of speeches
print(len(speeches), 'speeches found.\n')

# iterate over all the results
for s in speeches:
    print(s)

#### DICES object classes

By default, the speeches print to the screen with some basic information about their location in the text. But each speech is represented in Python as a complex object with several additional properties.

For example, let's look at just the first speech.

In [None]:
# select the first speech
s = speeches[0]

print('author:', s.author)
print('work:', s.work)
print('loci:', s.l_range)
print('language:', s.lang)
print('speaker:', s.spkr)
print('addressee:', s.addr)
print('embedded level:', s.level)
print('discourse type:', s.type)
print('turn:', s.part)

Notice that the author, work, speaker(s) and addressee(s) are all represented as objects. That means that we can interrogate their attributes as well. The dot notation (`.`) allows us to drill down through nested objects.

In [None]:
print('author name:', s.author.name)
print('work title:', s.work.title)

#### Characters and Character Instances

Much of what we're interested in has to do with speakers and addressees. DICES has information about a large catalogue of epic characters. Representing these characters across multiple texts can be challenging, especially when key attributes change according to context.

We use two different levels of representation to handle this: **Characters** represent the underlying, core characteristics of a character, while **Character Instances** are used to represent the instantiation of a character in a particular context.

We can see this reflected in the use of Roman names for Aphrodite and Eros in some of our results:

In [None]:
for s in speeches:
    print(s.work.title, s.l_range)
    print(' - speaker:', s.getSpkrString())
    print(' - addressee:', s.getAddrString())
    print()

In all of these cases the **Character** Aphrodite is the speaker, but some of the **Character Instances** of this character have a different name. In more complex situations, the Character Instance may also have a different gender from the Character.

Note that the `spkr` and `addr` attributes of a speech are always **lists** of `CharacterInstance` objects. Often those lists have only one CharacterInstance in them, but some speeches have multiple addressees and a few have multiple speakers. To keep things consistent, we use lists throughout. Use `getSpkrString()` and `getAddrString()` if you want a single value. They'll join multiple names with a comma if need be.
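
To make the two levels concrete, here is a rough sketch using stand-in classes. These are not the client's own classes: `Character`, `CharacterInstance`, `names_string()` and `char_names()` below are simplified, hypothetical versions written for illustration, with a `char` field linking an instance to its character, and the `', '` separator is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Character:
    # stand-in for the core, context-independent character
    name: str

@dataclass
class CharacterInstance:
    # stand-in for a character as it appears in one particular text
    name: str
    char: Optional[Character] = None   # None when there is no underlying character

def names_string(instances, sep=', '):
    '''Join the instance names into a single string (hypothetical helper).'''
    return sep.join(inst.name for inst in instances)

def char_names(instances):
    '''Prefer the underlying character's name, else the instance's own name.'''
    return [inst.char.name if inst.char is not None else inst.name
            for inst in instances]

aphrodite = Character(name='Aphrodite')
venus = CharacterInstance(name='Venus', char=aphrodite)
stranger = CharacterInstance(name='a stranger')   # no underlying character

athena = CharacterInstance(name='Athena', char=Character(name='Athena'))
hera = CharacterInstance(name='Hera', char=Character(name='Hera'))

print(venus.name, '->', venus.char.name)
print(char_names([venus, stranger]))
print(names_string([athena, hera]))
```

The point of the sketch is the shape of the data: one character can stand behind several differently named instances, and a speech holds a list of instances even when there is only one.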

## Tabular data

Even with only a few results, it can be helpful to organize records as a table of values. The **Pandas** package is a great way to work with tables (Pandas calls them **Data Frames**) in Python. You can also use Pandas to import/export tabular data from a spreadsheet.

In [None]:
import pandas as pd

# define the table
table = pd.DataFrame(dict(
    author = s.author.name,
    work = s.work.title,
    loci = s.l_range,
    spkr = s.getSpkrString(),
    addr = s.getAddrString(),
    turn = s.part,
    level = s.level,
    lang = s.lang,
) for s in speeches)

display(table)
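
If the construction above looks dense, the same pattern can be tried on a few made-up records. This is only a sketch: the works, loci and languages below are invented placeholders, not DICES data. Each record becomes one dictionary, each dictionary becomes one row, and the keys become the column names.

```python
import pandas as pd

# made-up records standing in for speech objects
records = [
    ('Work A', '1.10-1.20', 'greek'),
    ('Work A', '2.5-2.9', 'greek'),
    ('Work B', '4.1-4.30', 'latin'),
]

# one dict per record; the keys become the column names
toy_table = pd.DataFrame(
    dict(work=work, loci=loci, lang=lang)
    for work, loci, lang in records
)

print(toy_table)
```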

#### A slightly more complicated example

Let's look at how we can use Pandas to manage a slightly larger set of results. This time, we'll search for all of Aphrodite's speeches throughout the corpus.

In [None]:
# request records for all speeches by Aphrodite
aph_speeches = api.getSpeeches(spkr_name='Aphrodite')

Now we build a table that includes the gender of every speaker and addressee. I'm going to use character names rather than character instance names here, to avoid, for example, dividing Aphrodite's speeches to the same god between "Mars" and "Ares".

Remember that the `spkr` and `addr` attributes of each speech are lists of character instances. To drill down to the character level, we need to check the `char` of each instance. Note that some anonymous instances and collectives don't have an underlying `char`, so the code below falls back to the instance's own name in that case.

In [None]:
aph_table = pd.DataFrame(dict(
    author = s.author.name,
    work = s.work.title,
    loci = s.l_range,
    spkr_name = [inst.char.name if inst.char is not None else inst.name for inst in s.spkr],
    addr_name = [inst.char.name if inst.char is not None else inst.name for inst in s.addr],
    spkr_gend = [inst.gender for inst in s.spkr],
    addr_gend = [inst.gender for inst in s.addr],
    turn = s.part,
    level = s.level,
    lang = s.lang,
) for s in aph_speeches)

display(aph_table)

The square brackets that you see in the speaker and addressee columns are Python's way of showing us that all the values are lists. Most of the lists contain a single element, but sometimes Aphrodite addresses both Athena and Hera together, for example.

Suppose we want to tally every one of Aphrodite's addressees, so that, for example, Argonautica 3.52-3.54 is counted twice: once for Athena and once for Hera. In that case, we need to break out all the lists so that each row has only one addressee. Pandas does that with the `explode()` method.
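
A small made-up table shows what `explode()` does. The values below are placeholders rather than DICES data. Note that passing a list of columns, so that they are exploded in step with one another, needs pandas 1.3 or newer, and the lists in those columns must have matching lengths within each row.

```python
import pandas as pd

# two rows; the second has two addressees
toy = pd.DataFrame({
    'loci': ['1.1-1.5', '2.10-2.12'],
    'addr_name': [['A'], ['B', 'C']],
    'addr_gend': [['female'], ['female', 'male']],
})

# explode both list columns together so names and genders stay aligned
exploded = toy.explode(['addr_name', 'addr_gend'])

print(exploded)
```

The second row is now repeated once per addressee, and both copies keep the original row label. If you would rather have a fresh running index, `reset_index(drop=True)` provides one.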

In [None]:
aph_table = aph_table.explode(['spkr_name', 'spkr_gend'])
aph_table = aph_table.explode(['addr_name', 'addr_gend'])
display(aph_table)

### Exporting tabular data

At this point the table may already be something we'd like to work with in Excel. To save the table to a file, use the `to_csv()` method.

In [None]:
aph_table.to_csv('aphrodite_speeches.csv')
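
By default, `to_csv()` also writes the row labels as an unnamed first column. The sketch below, on made-up data, shows one way to leave them out with `index=False` and to read the result back with `read_csv()`. It writes to an in-memory buffer rather than a real file, purely so that it leaves nothing behind on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'addr_name': ['A', 'B'],
    'lang': ['greek', 'latin'],
})

# write the table as CSV text, without the row labels
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# rewind the buffer and read the table back in
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```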

### Aggregating results

On the other hand, there's a lot we can do right here in Python. For example, let's tally how many speeches Aphrodite has in each language. First we **group** the rows by language using `groupby()`, and then we **aggregate** the groups using `aggregate()` ... or `agg()` for short.

When we aggregate, we create a new table in which each of the groups is represented by a single row. We can choose which columns we want to summarize and what **aggregation function**(s) we want to use for each. The format is `NEW_COL = ('OLD_COL', 'FUNC')`. For example, here I'm using
```
speeches = ('loci', 'count')
```
which means, create a new column called **speeches** based on the old column **loci**, using the function `count()`. The value of **speeches** for a given group is calculated as the count of loci in that group.

In [None]:
aph_table.groupby('lang').agg(speeches = ('loci', 'count'))
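
The same pattern is easier to check by eye on a tiny made-up table, where the counts can be verified by hand. The values below are placeholders, not DICES data.

```python
import pandas as pd

toy = pd.DataFrame({
    'loci': ['1.1', '1.2', '2.1', '3.1'],
    'lang': ['greek', 'greek', 'latin', 'greek'],
})

# one row per language; "speeches" counts the loci in each group
counts = toy.groupby('lang').agg(speeches=('loci', 'count'))

print(counts)
```

Three of the four rows are marked `greek` and one `latin`, so those are the counts we expect. Keep in mind that `count` tallies rows of the table: if multiple addressees have been broken out onto separate rows, each row is counted.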

Let's consider how the gender of addressees breaks down by language. For that we can **cross-tabulate** two columns in the table with `crosstab()`.

To pull a given column from a table, we use square brackets around the column name.
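
As a warm-up, here is `crosstab()` on a few made-up rows (placeholder values, not DICES data). The second call uses the `normalize='index'` option to turn the counts in each row into proportions, which can make groups of different sizes easier to compare.

```python
import pandas as pd

toy = pd.DataFrame({
    'lang': ['greek', 'greek', 'latin', 'latin', 'latin'],
    'addr_gend': ['female', 'male', 'male', 'male', 'female'],
})

# rows are languages, columns are genders, cells are counts
xtab = pd.crosstab(toy['lang'], toy['addr_gend'])
print(xtab)

# the same table, with each row rescaled to sum to 1
shares = pd.crosstab(toy['lang'], toy['addr_gend'], normalize='index')
print(shares)
```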

In [None]:
pd.crosstab(aph_table['lang'], aph_table['addr_gend'])

From this we can see that Aphrodite talks to women much less in Latin texts than in Greek texts. We also see an instance of a non-binary addressee. Let's find that row in the table and see who it is.

We can use criteria to select particular rows of the table with `.loc`. In this case, we'll pull out the row we want based on the value in the **addr_gend** column.
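
It can help to see the two steps separately. In this sketch, on made-up data, the comparison first produces a column of `True`/`False` values, one per row, and `.loc` then keeps only the rows marked `True`.

```python
import pandas as pd

toy = pd.DataFrame({
    'addr_name': ['A', 'B', 'C'],
    'addr_gend': ['female', 'x', 'male'],
})

# step 1: test every row against the criterion
mask = toy['addr_gend'] == 'x'
print(mask)

# step 2: keep only the rows where the test came out True
selected = toy.loc[mask]
print(selected)
```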

In [None]:
aph_table.loc[aph_table['addr_gend'] == 'x']

Using the criterion, we found the speech we were interested in. It's not a non-binary character, but a collective addressee including both male and female deities.

#### Ranking addressees

Let's use `groupby()` and `agg()` again, this time to tally how many times Aphrodite addresses each named addressee. While we're at it, let's sort the list from greatest number of speeches to least.

In [None]:
(aph_table
    .groupby('addr_name')
    .agg(speeches = ('loci', 'count'))
    .sort_values('speeches', ascending=False)
)
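
For a simple tally like this one, pandas also offers a shortcut: `value_counts()` counts how often each value occurs in a column and sorts the result from most to least frequent. The sketch below compares the two approaches on made-up data. The results agree here because every row has a value in both columns; with missing values they could differ.

```python
import pandas as pd

toy = pd.DataFrame({
    'loci': ['1.1', '1.2', '2.1', '2.2', '3.1', '3.2'],
    'addr_name': ['A', 'B', 'A', 'C', 'A', 'B'],
})

# the long way: group, count, sort
ranked = (toy
    .groupby('addr_name')
    .agg(speeches=('loci', 'count'))
    .sort_values('speeches', ascending=False)
)
print(ranked)

# the shortcut: count the values of a single column
shortcut = toy['addr_name'].value_counts()
print(shortcut)
```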