<a href="https://colab.research.google.com/github/cwf2/dices-mta/blob/main/spacy_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install packages

This cell is necessary for Google Colab. It installs the language models, Capitains/Nautilus, local copies of the texts, and the DICES client.

In [None]:
# install the language models
!pip install https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
!pip install https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl

# install Capitains/Nautilus
!pip install git+https://github.com/Capitains/Nautilus.git

# download local copies of the text repositories
!git clone https://github.com/cwf2/canonical-latinLit.git
!git clone https://github.com/cwf2/canonical-greekLit.git

# install DICES client
!pip install git+https://github.com/cwf2/dices-client

## Import statements

In [None]:
# DICES client
from dicesapi import DicesAPI
from dicesapi.text import CtsAPI, spacy_load

# necessary for retrieving text from local repositories
from MyCapytain.resolvers.cts.local import CtsCapitainsLocalResolver
from MyCapytain.resources.prototypes.metadata import UnknownCollection

# Pandas for tabular data
import pandas as pd

## Initialize DICES connection

This is the DICES API, allowing us to search for speeches.

In [None]:
# create connection to DICES
api = DicesAPI(logdetail=0)

## Initialize CTS connection

This is the CTS API, allowing us to retrieve texts by URN. In this example, we not only instantiate a default CTS API, but we also create a *local resolver* that can serve texts from the local repositories we downloaded in the first cell.

We have to do a little surgery to overwrite the default CTS API object's resolver with the local one.

- Note: The resolver will generate a lot of errors; these can be ignored unless they pertain to a text you want to retrieve.

In [None]:
# create a local resolver
local_resolver = CtsCapitainsLocalResolver(['canonical-greekLit', 'canonical-latinLit'])

# initialize the CTS API
cts = CtsAPI(dices_api=api)

# overwrite the default resolver
cts._resolvers = {None: local_resolver}

# Retrieve some speeches

## First, get the speech metadata from DICES

Using the API, we can search speeches using a set of key-value pairs. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download.

In [None]:
# search for speeches by Achilles
speeches = api.getSpeeches(spkr_name='Achilles')
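
To make the paging behaviour concrete, here is a minimal sketch of how a client can walk through paged JSON results. It does not call DICES: `fetch_page`, `fetch_all`, and the `results`/`next` keys are hypothetical stand-ins chosen for illustration, not the actual DICES response format or the client's internals.

```python
# Hypothetical paged responses, standing in for JSON returned by a server.
# The "results" and "next" keys are assumptions for illustration only.
FAKE_PAGES = {
    1: {"results": ["speech 1", "speech 2"], "next": 2},
    2: {"results": ["speech 3", "speech 4"], "next": 3},
    3: {"results": ["speech 5"], "next": None},
}

def fetch_page(page_number):
    """Stand-in for one HTTP request; returns one page of results."""
    return FAKE_PAGES[page_number]

def fetch_all(first_page=1):
    """Follow the 'next' pointers until no pages remain, collecting every result."""
    results = []
    page_number = first_page
    while page_number is not None:
        page = fetch_page(page_number)
        results.extend(page["results"])
        page_number = page["next"]
    return results

all_results = fetch_all()
print(len(all_results))
```

Each page costs one round trip, which is why a search with many results can take a while to finish.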

## Retrieve the text of the speeches

- When using a local resolver, we have to explicitly trap the errors (`UnknownCollection`) raised for missing texts.

In [None]:
# iterate over all the speeches
#  - retrieve the text with CTS
for s in speeches:
    try:
        s.passage = cts.getPassage(s)
    except UnknownCollection:
        s.passage = None
        print(f'failed: {s}')
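
If many texts are missing, it can help to tally the failures rather than only printing them. The sketch below shows the same try/except pattern on stand-in data. `MissingText`, `get_passage`, `LOCAL_TEXTS`, and the short URNs are hypothetical names that play the roles of `UnknownCollection`, `cts.getPassage`, and the local repositories.

```python
class MissingText(Exception):
    """Hypothetical stand-in for the resolver's UnknownCollection error."""

# Hypothetical local corpus: URN -> text
LOCAL_TEXTS = {
    "urn:a": "first passage",
    "urn:c": "third passage",
}

def get_passage(urn):
    """Stand-in for cts.getPassage: raise if the text is not available locally."""
    try:
        return LOCAL_TEXTS[urn]
    except KeyError:
        raise MissingText(urn) from None

passages = {}
failed = []
for urn in ["urn:a", "urn:b", "urn:c"]:
    try:
        passages[urn] = get_passage(urn)
    except MissingText:
        # record the failure, but keep going with the remaining texts
        passages[urn] = None
        failed.append(urn)

print(f"{len(failed)} of {len(passages)} passages could not be retrieved: {failed}")
```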

## Natural language processing with SpaCy

### Set the SpaCy language models

Here, we're using different language models than the defaults. We downloaded these in the first cell.

In [None]:
# initialize spacy models
spacy_load(
    latin_model = 'la_core_web_lg',
    greek_model = 'grc_odycy_joint_trf',
)

### Run the SpaCy pipeline to parse the text of each passage

In [None]:
# parse each speech, reporting progress and the number of tokens
for i, s in enumerate(speeches):
    print(f'[{i+1}/{len(speeches)}] {s.author.name} {s.work.title} {s.l_range}', end=' ... ')
    if s.passage is not None:
        s.passage.runSpacyPipeline()
        if s.passage.spacy_doc is not None:
            print(f'{len(s.passage.spacy_doc)} tokens')
        else:
            print('failed')
    else:
        print('no text')

### Extract the token features

In [None]:
# build one row per token, with speech metadata and morphological features
#  - skip speeches with no text or where parsing failed
token_table = pd.DataFrame(dict(
    speech = s.id,
    urn = s.work.urn,
    author = s.author.name,
    work = s.work.title,
    l_fi = s.l_fi,
    l_la = s.l_la,
    spkr = [inst.name for inst in s.spkr],
    addr = [inst.name for inst in s.addr],
    line = s.passage.line_array[s.passage.getLineIndex(tok)]['n'],
    lpos = s.passage.getLinePos(tok),
    token = tok.text,
    lemma = tok.lemma_,
    pos = tok.pos_,
    mood = tok.morph.get('Mood'),
    tense = tok.morph.get('Tense'),
    voice = tok.morph.get('Voice'),
    person = tok.morph.get('Person'),
    number = tok.morph.get('Number'),
    case = tok.morph.get('Case'),
    gender = tok.morph.get('Gender'),
) for s in speeches
    if s.passage is not None and s.passage.spacy_doc is not None
    for tok in s.passage.spacy_doc)

display(token_table)
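
Several columns in this table hold lists rather than single values: the speaker and addressee names, and every feature read with `tok.morph.get`, which returns a list (empty when the token lacks that feature). Here is a minimal sketch of flattening such values into plain strings. It uses hand-made rows rather than real parser output; the `flatten` helper and the sample values are assumptions for illustration.

```python
def flatten(value, sep="|"):
    """Join a list of strings into one string; leave other values unchanged."""
    if isinstance(value, list):
        return sep.join(value)
    return value

# Hand-made rows imitating the shape of the token table (not real output)
rows = [
    {"token": "arma", "pos": "NOUN", "case": ["Acc"], "mood": []},
    {"token": "cano", "pos": "VERB", "case": [], "mood": ["Ind"]},
]

# apply the helper to every value in every row
flat_rows = [{key: flatten(value) for key, value in row.items()} for row in rows]

for row in flat_rows:
    print(row)
```

With pandas, the same helper could be applied to a single column, for example `token_table['case'].map(flatten)`.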