# Retrieving texts and counting words

In this example, we'll retrieve the texts of speeches from a remote server and do a basic word count.

## Scenario

Let's say we want to know how many words Achilles speaks to each of his interlocutors. We can search the DICES database for the relevant speeches using the API. Then, to count the number of words, we'll have to retrieve the text of the speeches themselves. Since the DICES *Speech* objects include CTS URNs, we can request the passages from a remote server.
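
To make the idea of a CTS URN concrete, here is a minimal sketch that splits one into its colon-separated parts. The `CtsUrn` type and `parse_cts_urn` helper are hypothetical names made up for this illustration; they are not part of the DICES client. The example URN is only illustrative of the general shape.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str           # e.g. "greekLit"
    work: str                # textgroup.work[.version]
    passage: Optional[str]   # e.g. "1.149-1.171", or None if absent


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN of the form urn:cts:NAMESPACE:WORK[:PASSAGE]."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    # the passage component is optional
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(parts[2], parts[3], passage)


# illustrative URN: a passage reference within an edition of the Iliad
example = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.149-1.171")
print(example.work, example.passage)
```

The passage component is what lets a server return just the lines of one speech rather than the whole work.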

## Preliminaries

### The DICES API

The first step is to instantiate a connection to the DICES API.

In [None]:
# Google Colab only:
#   run the line below to install the DICES client

!pip install --quiet git+https://github.com/cwf2/dices-client.git

In [2]:
from dicesapi import DicesAPI
api = DicesAPI(logfile='dices.log', logdetail=0)

### Connection to the digital library

Text-retrieval and processing tools are moving to the module `text`. We retrieve the text from an online (or locally mirrored) digital library. The default is Perseus's [CTS endpoint](https://scaife-cts.perseus.org/api/cts).
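
For a sense of what a passage request looks like on the wire, the sketch below builds a CTS `GetPassage` URL by hand. It only constructs the URL and makes no network call. The helper name `get_passage_url` is hypothetical, and this is not necessarily how `CtsAPI` builds its requests internally.

```python
from urllib.parse import urlencode

# endpoint taken from the text above
CTS_ENDPOINT = "https://scaife-cts.perseus.org/api/cts"


def get_passage_url(urn: str, endpoint: str = CTS_ENDPOINT) -> str:
    """Build a CTS GetPassage request URL for the given URN."""
    query = urlencode({"request": "GetPassage", "urn": urn})
    return f"{endpoint}?{query}"


# illustrative URN only
url = get_passage_url("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.149-1.171")
print(url)
```

Pointing `endpoint` at a local mirror would be the only change needed to work offline, assuming the mirror speaks the same protocol.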

In [3]:
from dicesapi.text import CtsAPI
cts = CtsAPI(dices_api=api)


### Matplotlib for figures

Let's also import **pyplot**, for drawing a simple bar graph of the results. Note the Jupyter magic `%matplotlib inline` to display the figure right in the notebook. Some people like `%matplotlib notebook` better — it gives you some fancier display options.

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

## Download the speeches

### First, the speech metadata from DICES

Using the API, we can search speeches using a set of key-value pairs. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download.

In [None]:
speeches = api.getSpeeches(spkr_name='Achilles', work_title='Iliad', progress=True)
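
The client handles paging for you, but the general pattern is easy to sketch. The example below assumes each page is a JSON object with a `results` list and a `next` link that is empty on the last page; those key names are an assumption about the response shape, not something confirmed here. The pages are faked with a dictionary so that the code runs without a network connection.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_paged(fetch: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield every record from a paged API by following 'next' links."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page["results"]   # assumed key for the records
        url = page.get("next")       # assumed key for the next page link


# stand-in for real HTTP responses (made-up data)
FAKE_PAGES: Dict[str, dict] = {
    "page1": {"results": [{"id": 1}, {"id": 2}], "next": "page2"},
    "page2": {"results": [{"id": 3}], "next": None},
}

records = list(iter_paged(FAKE_PAGES.__getitem__, "page1"))
print(len(records))
```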

### Next, the text of the speeches from Perseus

This involves retrieving each passage from the CTS server, and extracting the plaintext of its contents.

In [None]:
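# progress bar helper; assumed to live in dicesapi.jupyter
# (adjust the import if your version of the client differs)
from dicesapi.jupyter import NotebookPBar

pbar = NotebookPBar(max=len(speeches))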

# iterate over all speeches
for s in speeches:
    
    # retrieve the passage from the remote library
    s.passage = cts.getPassage(s)
    
    if s.passage is None:
        print(f'Failed to download {s.urn}')
        
    pbar.update()

### Run CLTK's NLP pipeline

We run a stripped-down version of CLTK's default NLP pipeline, specific to the speech's language. By default, our wrapper method around the NLP pipeline also creates an index recording the locus of each token. For this example, I'm going to turn that feature off to save time.

In [None]:
pbar = NotebookPBar(max=len(speeches))

# iterate over all speeches
for s in speeches:
    
    # parse with CLTK
    s.passage.runCltkPipeline(index=False)
    
    pbar.update()
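
To illustrate what an index "recording the locus of each token" might look like, here is a toy version that pairs each whitespace-separated token with the line reference it came from. The function name, the data layout, and the sample lines are all hypothetical; the real wrapper's index may be structured quite differently.

```python
def index_tokens(lines):
    """Pair every token with the locus (line reference) it occurs in.

    `lines` is a list of (locus, text) tuples.
    """
    index = []
    for locus, text in lines:
        for token in text.split():
            index.append((token, locus))
    return index


# placeholder text, not a real passage
sample_lines = [("1.1", "alpha beta"), ("1.2", "gamma")]
token_index = index_tokens(sample_lines)
print(token_index)
```

Building such an index means touching every token a second time, which is why skipping it saves time when only a word count is needed.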

### Count words in each speech

`Passage.cltk_doc` gives us access to the `Doc` object created by CLTK's `NLP()`.

In [None]:
count = {}
for s in speeches:
    for addressee in s.addr:
        count[addressee.name] = count.get(addressee.name, 0) + len(s.passage.cltk_doc.words)
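
One consequence of this loop is worth seeing in isolation: a speech with several addressees adds its full length to each of them, so the per-addressee totals can sum to more than the number of words actually spoken. The sketch below reproduces the same tally with `collections.Counter` on made-up stand-in objects; the names and word counts are invented for illustration.

```python
from collections import Counter
from types import SimpleNamespace

# made-up stand-ins for Speech objects; n_words replaces len(...cltk_doc.words)
demo_speeches = [
    SimpleNamespace(addr=[SimpleNamespace(name="A")], n_words=10),
    SimpleNamespace(
        addr=[SimpleNamespace(name="A"), SimpleNamespace(name="B")],
        n_words=5,
    ),
]

demo_count = Counter()
for s in demo_speeches:
    for addressee in s.addr:
        # the whole speech is credited to every addressee
        demo_count[addressee.name] += s.n_words

print(dict(demo_count))
print(sum(demo_count.values()), "credited vs",
      sum(s.n_words for s in demo_speeches), "spoken")
```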

### Examine the results

🤔 Let's see whether it worked!

In [None]:
for name in sorted(count):
    print(name, count[name])
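
The listing above is alphabetical. If you would rather see the longest totals first, a small helper along these lines works on any dictionary of counts; `by_count` is a hypothetical name and the numbers are placeholders.

```python
def by_count(counts):
    """Return (name, total) pairs, largest total first; ties in name order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# placeholder numbers, not real results
demo = {"A": 15, "B": 40, "C": 15}
for name, total in by_count(demo):
    print(f"{name:<12}{total:>6}")
```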

### Make a simple graph with pyplot

Seems good. Let's visualize it with a simple bar chart.

In [None]:
# data for the graph
names = sorted(count)
y_pos = range(len(names))
bars = [count[name] for name in names]

# create a new figure
fig, ax = plt.subplots(figsize=(8, 8))

# draw the bars
ax.barh(y_pos, bars, align='center')

# annotate the graph
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Number of Words')
ax.set_ylabel('Addressee')
ax.set_title('Length of Achilles\' speeches')

plt.show()

### NLP with SpaCy

The DICES text module also provides wrappers around SpaCy's NLP pipeline. This produces a different parsing of the text with different relative strengths. Running SpaCy via the DICES library is parallel to running CLTK, and the resulting output is very similar in structure, although the token attributes have slightly different names and properties.

By default, the SpaCy wrapper uses Jacobo Myerston's [grc_proiel_sm](https://huggingface.co/Jacobo/grc_proiel_sm) for Greek and LatinCy's [la_core_web_sm](https://huggingface.co/latincy/la_core_web_sm) for Latin.

In [None]:
pbar = NotebookPBar(max=len(speeches))

# iterate over all speeches
for s in speeches:
    
    # parse with SpaCy
    s.passage.runSpacyPipeline(index=False)
    
    pbar.update()

### Count words using SpaCy

`Passage.spacy_doc` gives us access to a SpaCy `Doc` object that, like the CLTK doc, can be treated as a container of tokens.

In [None]:
count = {}
for s in speeches:
    for addressee in s.addr:
        count[addressee.name] = count.get(addressee.name, 0) + len(s.passage.spacy_doc)
        
for name in sorted(count):
    print(name, count[name])
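
Note that this cell reuses the name `count`, so the CLTK totals are overwritten. If you want to compare the two tokenizations, which may split the text differently, one option is to keep each result under its own name and take the difference. The sketch below uses placeholder numbers, and `compare_counts` is a hypothetical helper.

```python
def compare_counts(a, b):
    """Return b minus a for every name that appears in either dictionary."""
    names = sorted(set(a) | set(b))
    return {name: b.get(name, 0) - a.get(name, 0) for name in names}


# placeholder totals, not real results
cltk_demo = {"A": 100, "B": 40}
spacy_demo = {"A": 104, "B": 40, "C": 7}

diff = compare_counts(cltk_demo, spacy_demo)
for name, delta in diff.items():
    print(f"{name:<12}{delta:+d}")
```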

### Plot

In [None]:
# data for the graph
names = sorted(count)
y_pos = range(len(names))
bars = [count[name] for name in names]

# create a new figure
fig, ax = plt.subplots(figsize=(8, 8))

# draw the bars
ax.barh(y_pos, bars, align='center')

# annotate the graph
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Number of Words')
ax.set_ylabel('Addressee')
ax.set_title('Length of Achilles\' speeches')

plt.show()