<a target="_blank" href="https://colab.research.google.com/github/cwf2/toronto2024/blob/main/Ex_03%20-%20Text.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Retrieving texts and counting words

In this example, we'll retrieve the texts of speeches from a remote server and do a basic word count.

## Scenario

Let's say we want to know how many words Aphrodite speaks to each of her interlocutors. We can search the DICES database for the relevant speeches using the API. Then, to count the number of words, we'll have to retrieve the text of the speeches themselves. Since the DICES *Speech* objects include CTS URNs, we can request the passages from a remote server.
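
To make the role of the CTS URN concrete, here is a minimal sketch of how such an identifier breaks down into a namespace, a work, and a passage reference. The helper `parse_cts_urn` and the `CtsUrn` tuple are hypothetical names invented for this illustration and are not part of the DICES client; the URN itself is only an example.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str           # e.g. "greekLit"
    work: str                # dot-separated textgroup.work[.version]
    passage: Optional[str]   # e.g. "1.1-1.5", or None if absent


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its colon-separated components."""
    parts = urn.split(':')
    if len(parts) < 4 or parts[0] != 'urn' or parts[1] != 'cts':
        raise ValueError(f'not a CTS URN: {urn!r}')
    # the passage component is optional
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


# illustrative URN only
example = parse_cts_urn('urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.5')
print(example.work, example.passage)
```

The passage component is what lets a server return just the lines of a single speech rather than the whole work.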

## Preliminaries

### The DICES API

The first step is to instantiate a connection to the DICES API.

In [None]:
# Google Colab only:
#   run the line below to install the DICES client

!pip install --quiet git+https://github.com/cwf2/dices-client.git

In [None]:
from dicesapi import DicesAPI
api = DicesAPI(logfile='dices.log', logdetail=0)

### Connection to the digital library

Text-retrieval and processing tools are being moved into the `text` module. We retrieve the text from an online (or locally mirrored) digital library. By default, this is the Perseus [CTS endpoint](https://scaife-cts.perseus.org/api/cts).

In [None]:
from dicesapi.text import CtsAPI
cts = CtsAPI(dices_api=api)

### Additional modules

Let's also import **pyplot**, for drawing a simple bar graph of the results, and Pandas for tabular results.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt

## Download the speeches

### First, the speech metadata from DICES

Using the API, we can search speeches using a set of key-value pairs. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download.

In [None]:
speeches = api.getSpeeches(spkr_name='Aphrodite', work_title='Iliad')

print(f'Got {len(speeches)} speeches.')
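
The client takes care of paging for you, but it may help to see what "paged" means in practice. The sketch below assumes that each page is a JSON object with a `results` list and a `next` link; that layout, the `iter_paged` helper, and the fake pages are all assumptions made for illustration, not a description of the DICES API's actual responses.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_paged(first_url: str, fetch: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every record from a paged listing, following 'next' links."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get('results', [])
        # a missing or empty 'next' link ends the loop
        url = page.get('next')


# stand-in for an HTTP GET returning decoded JSON (hypothetical data)
FAKE_PAGES = {
    'page1': {'results': [{'id': 1}, {'id': 2}], 'next': 'page2'},
    'page2': {'results': [{'id': 3}], 'next': None},
}

records = list(iter_paged('page1', FAKE_PAGES.__getitem__))
print(f'Got {len(records)} records.')
```

Each page costs one round trip to the server, which is why a search with many results can take a while.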

### Next, the text of the speeches from Perseus

This involves retrieving each passage from the CTS server, and extracting the plaintext of its contents.

In [None]:
failed = []

# iterate over all speeches
for i, s in enumerate(speeches):
    print(f'\r [{i+1}/{len(speeches)}]', end='')
    
    # retrieve the passage from the remote library
    s.passage = cts.getPassage(s)
    
    if s.passage is None:
        failed.append(s)
        
print(' done.')
if len(failed) > 0:
    print(f'Failed to download text for {len(failed)} speeches:')
    for s in failed:
        print(s)
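
For a rough idea of what "extracting the plaintext" involves, here is a small sketch using only the standard library. It assumes the server returns TEI-style XML with verse lines in `<l>` elements; the sample fragment and the `plaintext` helper are made up for illustration and do not reproduce what `cts.getPassage()` does internally.

```python
import xml.etree.ElementTree as ET

TEI_NS = '{http://www.tei-c.org/ns/1.0}'

# made-up TEI-like fragment, for illustration only
SAMPLE_XML = '''
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <l n="1">first <hi>line</hi> of verse</l>
    <l n="2">second line
       of verse</l>
  </body></text>
</TEI>
'''.strip()


def plaintext(xml_string: str) -> str:
    """Return one line of text per <l> element, with markup removed."""
    root = ET.fromstring(xml_string)
    lines = []
    for line in root.iter(f'{TEI_NS}l'):
        # itertext() also collects text nested inside child elements
        text = ''.join(line.itertext())
        # collapse runs of whitespace left over from the XML layout
        lines.append(' '.join(text.split()))
    return '\n'.join(lines)


print(plaintext(SAMPLE_XML))
```

Real editions often include notes, gaps, and other editorial markup that need more careful handling than this.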

### Run CLTK's NLP pipeline

We run a stripped-down version of CLTK's default NLP pipeline, specific to the speech's language. By default, our wrapper method around the NLP pipeline also creates an index recording the locus of each token. For this example, I'm going to turn that feature off to save time.

In [None]:
failed = []

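# iterate over all speeches
for i, s in enumerate(speeches):
    print(f'\r [{i+1}/{len(speeches)}]', end='')

    # skip speeches whose text could not be downloaded
    if s.passage is None:
        failed.append(s)
        continue

    # parse with CLTK
    s.passage.runCltkPipeline(index=False)
    
    if s.passage.cltk_doc is None:
        failed.append(s)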

print(' done.')
if len(failed) > 0:
    print(f'CLTK pipeline failed for {len(failed)} speeches:')
    for s in failed:
        print(s)

### Examine the results

`Passage.cltk_doc` gives us access to the `Doc` object created by CLTK's `NLP()`. It's a pretty complicated object.

In [None]:
# select the first speech
s = speeches[0]

# examine the results from CLTK
print(s.passage.cltk_doc)

The simplest way to work with the NLP results is to iterate over `cltk_doc`: this will give you one word at a time (according to CLTK's partitioning of the text, anyway).

In [None]:
# look at the first ten words
for word in s.passage.cltk_doc[:10]:
    print(word)

Let's count the words spoken to each of Aphrodite's interlocutors.

In [None]:
count = {}
for s in speeches:
    for addressee in s.addr:
        count[addressee.name] = count.get(addressee.name, 0) + len(s.passage.cltk_doc.words)

for name in sorted(count):
    print(name, count[name])
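
One consequence of this loop is worth spelling out: a speech with several addressees contributes its full length to each of them, so the per-addressee totals can add up to more than the number of words actually spoken. The sketch below shows the same tally with `collections.Counter`; the names and word counts are invented stand-ins, not results from DICES.

```python
from collections import Counter

# hypothetical stand-in data: (addressee names, number of words) per speech
sample_speeches = [
    (['Addressee A'], 12),
    (['Addressee A', 'Addressee B'], 7),
    (['Addressee C'], 5),
]

sample_count = Counter()
for addressees, n_words in sample_speeches:
    for name in addressees:
        # the whole speech is credited to every one of its addressees
        sample_count[name] += n_words

words_spoken = sum(n for _, n in sample_speeches)
words_credited = sum(sample_count.values())

for name, n in sample_count.most_common():
    print(name, n)
print(f'{words_spoken} words spoken, {words_credited} credited')
```

Whether to split a shared speech among its addressees or count it in full for each is a choice about the research question, not a technical one.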

### Make a simple graph with pyplot

Seems good. Now let's visualize it with a simple bar chart.

In [None]:
# data for the graph
names = sorted(count)
y_pos = range(len(names))
bars = [count[name] for name in names]

# create a new figure
fig, ax = plt.subplots(figsize=(8, 8))

# draw the bars
ax.barh(y_pos, bars, align='center')

# annotate the graph
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Number of Words')
ax.set_ylabel('Addressee')
ax.set_title("Length of Aphrodite's speeches")

plt.show()

### Extracting additional details about words

Each word object has many associated attributes, including the string as it appears in the original text, the lemma that CLTK has assigned to it, and some syntactic and morphological information.

Here's a recipe for creating a big table of words, organizing some of those attributes, using Pandas.

In [None]:
pd.DataFrame(dict(
    speech_id = s.id,
    author = s.author.name,
    work = s.work.title,
    loci = s.l_range,
    spkr = s.getSpkrString(),
    addr = s.getAddrString(),
    token = w.string,
    lemma = w.lemma,
    pos = w.upos, 
) for s in speeches for w in s.passage.cltk_doc)
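
Once the words are laid out as rows like this, simple frequency counts are easy to get. As a sketch, here is a tally of lemmas and parts of speech using only the standard library; the rows are made-up placeholders that reuse the `token`, `lemma`, and `pos` keys from the table above.

```python
from collections import Counter

# made-up rows, using the same keys as the table of words
rows = [
    {'token': 'spoke', 'lemma': 'speak', 'pos': 'VERB'},
    {'token': 'speaks', 'lemma': 'speak', 'pos': 'VERB'},
    {'token': 'words', 'lemma': 'word', 'pos': 'NOUN'},
]

# count how often each lemma and each part of speech occurs
lemma_freq = Counter(row['lemma'] for row in rows)
pos_freq = Counter(row['pos'] for row in rows)

print(lemma_freq.most_common(1))
print(dict(pos_freq))
```

If the table is assigned to a variable, say `df`, then `df['lemma'].value_counts()` gives the same kind of tally directly in Pandas.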