# Table of contents
1. [Introduction](#introduction)
1. [Install pre-release of CLTK](#install)
1. [Get data](#get-data)
1. [Run NLP pipeline with `NLP()`](#run-nlp)
1. [Inspect CLTK `Doc`](#inspect-doc)
1. [Inspect CLTK `Word`](#inspect-word)
1. [Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`](#morph)
1. [Modeling syntax with `Form` and `DependencyTree`](#syntax)
1. [Feature extraction](#features)
1. [Brief demonstration of `NLP()` for Ancient Greek](#greek-nlp)

# Introduction <a name="introduction"></a>

This notebook demonstrates how to use `NLP()`, the CLTK's primary interface, in Latin and Ancient Greek. Pipelines are available for 17 languages (see [Languages](https://docs.cltk.org/en/latest/languages.html) in the docs).

Full documentation is available at <https://docs.cltk.org/en/latest/cltk.html#cltk.nlp.NLP>.

Note that there is a large amount of code from this project's first six years (v. `0.1`), not all of which has been or will be moved over to v. `1.0`. Docs for `0.1` are still available at [https://docs.cltk.org](https://docs.cltk.org/en/latest/) and tutorial notebooks at [https://github.com/cltk/tutorials](https://github.com/cltk/tutorials).

# Install pre-release of CLTK <a name="install"></a>

This notebook comes from <https://github.com/cltk/cltk/tree/dev/notebooks>. Full instructions on installing the CLTK are available at <https://docs.cltk.org/en/latest/installation.html>.

In [1]:
## Requires Python 3.7, 3.8, 3.9 on a POSIX-compliant OS

## The latest published beta:
# !pip install --pre cltk

In [2]:
## Or directly from this repo:
# cd .. && make install

# %load_ext autoreload
# %autoreload 2

# Get data <a name="get-data"></a>

The following cells download two plaintext documents, one by each of two Classical authors. A subset of each will be used to demonstrate the CLTK.

In [3]:
# Get Latin text
# https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370
# now at `notebooks/lat-livy.txt`
!curl -O -L https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370/raw/05b7a17af4b216a4986d897c57a9987e836cc91a/lat-livy.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  899k  100  899k    0     0   370k      0  0:00:02  0:00:02 --:--:-- 5469k


In [4]:
# Get Ancient Greek text
# https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0
# now at `notebooks/grc-thucydides.txt`
!curl -O -L https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0/raw/e08e47849f64484b0950b14563bb5a9fd1e1ef1c/grc-thucydides.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  899k  100  899k    0     0   692k      0  0:00:01  0:00:01 --:--:-- 1660k


In [5]:
# read the Latin file
# We'll run the full demonstration in the Latin language first
with open("lat-livy.txt", encoding="utf-8") as fo:
    livy_full = fo.read()

In [6]:
print("Text snippet:", livy_full[:200])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))

Text snippet: Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii et quia pacis reddendaeque Helenae semper auctores fuerant, omne ius
Character count: 920884
Approximate token count: 129799


In [7]:
len(livy_full) // 12

76740

In [8]:
# Now let's cut this down to roughly 10k tokens for this demonstration's purposes
livy = livy_full[:len(livy_full) // 12]
print("Approximate token count:", len(livy.split()))

Approximate token count: 10911
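
Slicing by character offset, as above, can end the excerpt in the middle of a word. The following is a minimal sketch of one way to back up to the previous whitespace, assuming a slightly shorter excerpt is acceptable. `truncate_at_whitespace` is a hypothetical helper and not part of the CLTK.

```python
def truncate_at_whitespace(text: str, max_chars: int) -> str:
    """Return at most ``max_chars`` characters of ``text``, ending on a word boundary."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the first dropped character is whitespace, the cut already ends a word.
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise walk backwards to the last whitespace and drop the partial word.
    for i in range(len(cut) - 1, -1, -1):
        if cut[i].isspace():
            return cut[:i].rstrip()
    # No whitespace at all: fall back to the plain character slice.
    return cut


sample = "Iam primum omnium satis constat"
print(repr(truncate_at_whitespace(sample, 13)))
```

In this notebook the same helper could be called as `truncate_at_whitespace(livy_full, len(livy_full) // 12)` in place of the plain slice.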


# Run NLP pipeline with `NLP()` <a name="run-nlp"></a>

In [9]:
# For most users, this is the only import required
from cltk import NLP

In [10]:
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")

‎𐤀 CLTK version 'cltk 1.0.0b10'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.


In [11]:
# Removing ``LatinLexiconProcess`` for this demo b/c it is slow (adds ~9 mins total)
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)

[<class 'cltk.alphabet.processes.LatinNormalizeProcess'>, <class 'cltk.dependency.processes.LatinStanzaProcess'>, <class 'cltk.embeddings.processes.LatinEmbeddingsProcess'>, <class 'cltk.stops.processes.StopsProcess'>, <class 'cltk.ner.processes.LatinNERProcess'>]
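
Popping by position (`pop(-1)`) relies on the lexicon process being last in the list. The following is a minimal sketch of removing a process by class name instead. The three classes are empty stand-ins so the example runs without the CLTK installed, and `drop_process` is a hypothetical helper.

```python
from typing import List


# Empty stand-ins for the real process classes, used only to keep this sketch runnable.
class LatinNormalizeProcess:
    pass


class LatinStanzaProcess:
    pass


class LatinLexiconProcess:
    pass


def drop_process(processes: List[type], name: str) -> List[type]:
    """Return a new list without any process class whose ``__name__`` equals ``name``."""
    return [proc for proc in processes if proc.__name__ != name]


processes = [LatinNormalizeProcess, LatinStanzaProcess, LatinLexiconProcess]
processes = drop_process(processes, "LatinLexiconProcess")
print([proc.__name__ for proc in processes])
```

With a real pipeline, the same filter would presumably be applied to `cltk_nlp.pipeline.processes`, which the output above shows to be a list of classes.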


In [12]:
# Now execute NLP algorithms upon input text
# Aside from download, execution time is ~50 sec on a 2015 Macbook Pro
%time cltk_doc = cltk_nlp.analyze(text=livy)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 1min 21s, sys: 10 s, total: 1min 31s
Wall time: 1min 24s


# Inspect CLTK `Doc` <a name="inspect-doc"></a>

In [13]:
# We can now inspect the result
print(type(cltk_doc))

<class 'cltk.core.data_types.Doc'>


In [14]:
# All accessors
print([x for x in dir(cltk_doc) if not x.startswith("__")])

['_get_words_attribute', 'embeddings', 'embeddings_model', 'language', 'lemmata', 'morphosyntactic_features', 'normalized_text', 'pipeline', 'pos', 'raw', 'sentences', 'sentences_strings', 'sentences_tokens', 'stanza_doc', 'stems', 'tokens', 'tokens_stops_filtered', 'words']


In [15]:
# Several of the more useful accessors

# List of tokens
print(cltk_doc.tokens[:20])

['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti']


In [16]:
# List of lemmas
print(cltk_doc.lemmata[:20])

['Iam', 'primus', 'omnis', 'satis', 'consto', 'mroia', 'capio', 'in', 'ceterus', 'saevio', 'sum', 'mroianus', ',', 'duo', ',', 'menea', 'mntenorique', ',', 'et', 'vetus']


In [17]:
# Basic part-of-speech info
print(cltk_doc.pos[:20])

['ADV', 'ADJ', 'PRON', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'PRON', 'VERB', 'AUX', 'NOUN', 'PUNCT', 'NUM', 'PUNCT', 'NOUN', 'ADV', 'PUNCT', 'CCONJ', 'ADJ']


In [18]:
# A list of lists of tokens (one inner list per sentence)
print(cltk_doc.sentences_tokens[:2])

[['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti', 'iure', 'hospitii', 'et', 'quia', 'pacis', 'reddendaeque', 'Helenae', 'semper', 'auctores', 'fuerant', ',', 'omne', 'ius', 'belli', 'Achiuos', 'abstinuisse', ';'], ['casibus', 'deinde', 'variis', 'Antenorem', 'cum', 'multitudine', 'Enetum', ',', 'qui', 'seditione', 'ex', 'Paphlagonia', 'pulsi', 'et', 'sedes', 'et', 'ducem', 'rege', 'Pylaemene', 'ad', 'Troiam', 'amisso', 'quaerebant', ',', 'venisse', 'in', 'intimum', 'maris', 'Hadriatici', 'sinum', ',', 'Euganeisque', 'qui', 'inter', 'mare', 'Alpesque', 'incolebant', 'pulsis', 'Enetos', 'Troianosque', 'eas', 'tenuisse', 'terras', '.']]


# Inspect CLTK `Word` <a name="inspect-word"></a>

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all information that was generated during the NLP pipeline.

In [19]:
# One ``Word`` object for each token
print(len(cltk_doc.words))

12556


Users can go token-by-token via ``Doc.words`` or via the intermediary step of looping through sentences.
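
The two ways of iterating can be sketched with a stand-in class. `ToyWord` below is hypothetical: it mirrors only the `string` and `index_sentence` fields of `Word` so that the example runs without the CLTK, and the grouping shown is one possible approach rather than the CLTK's own.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class ToyWord:
    """Stand-in with two of the fields found on ``Word``; not the CLTK class."""

    string: str
    index_sentence: int


words = [
    ToyWord("Iam", 0),
    ToyWord("primum", 0),
    ToyWord("omnium", 0),
    ToyWord("casibus", 1),
    ToyWord("deinde", 1),
]

# Token-by-token, as with ``Doc.words``
for word in words:
    print(word.index_sentence, word.string)

# Sentence-by-sentence: group consecutive words that share an ``index_sentence``
sentences: List[List[ToyWord]] = [
    list(group) for _, group in groupby(words, key=lambda w: w.index_sentence)
]
print([[w.string for w in sentence] for sentence in sentences])
```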

In [20]:
# Let's look at a non-trivial sentence from Book 1
print("Original:", cltk_doc.sentences_strings[6])
print("")
print("Translation:", "Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.")
# source: http://www.perseus.tufts.edu/hopper/text?doc=Liv.+1+1+5&fromdoc=Perseus%3Atext%3A1999.02.0151
sentence_6 = cltk_doc.sentences[6]  # type: List[Word]

Original: Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aboriginesque qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .

Translation: Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.


In [21]:
# Looking at one Word, 'concurrunt' ('they run together')
a_word_concurrunt = sentence_6[40]
print(a_word_concurrunt)

Word(index_char_start=None, index_char_stop=None, index_token=40, index_sentence=6, string='concurrunt', pos=verb, lemma='concurro', stem=None, scansion=None, xpos='L3|modA|tem1|gen9', upos='VERB', dependency_relation='acl:relcl', governor=33, features={Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}, category={F: [neg], N: [neg], V: [pos]}, embedding=array([-0.16746  , -0.18548  ,  0.30632  , -0.29627  , -0.27262  ,
       -0.0767   ,  0.19405  ,  0.12386  , -0.0076342,  0.13037  ,
        0.17128  ,  0.1189   , -0.22169  , -0.57089  ,  0.28066  ,
       -0.14514  , -0.041256 , -0.021754 ,  0.02212  , -0.25983  ,
        0.53374  , -0.042267 ,  0.27314  ,  0.083616 ,  0.30746  ,
        0.087764 , -0.10098  ,  0.22689  , -0.17577  , -0.35894  ,
       -0.39609  ,  0.43406  ,  0.21306  ,  0.26909  ,  0.099561 ,
        0.26916  , -0.46547  ,  0.1416   , -0.21319  , -0.15126  ,
        0.36604  , -0.020737 ,  0.42397  ,  0.01

In this `Word`, you can see information for lexicography (`.lemma`), semantics (`.embedding`), morphology (`.pos`, `.features`), syntax (`.governor`, `.dependency_relation`), plus other information most users would find helpful (`.stop`, `.named_entity`).

# Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle` <a name="morph"></a>

When a language's `Pipeline` builds each `Word` object, morphological information is stored at several accessors. Those of interest to most users are `.pos` and `.features`.

In [22]:
print("`Word.string`:", a_word_concurrunt.string)
print("")
# Part-of-speech is always available at `.pos`.
print("`Word.pos`:", a_word_concurrunt.pos)

`Word.string`: concurrunt

`Word.pos`: verb


The CLTK contains a specific class for each of [the annotation types defined by v2 of the Universal Dependencies project](https://universaldependencies.org/u/feat/all.html). In the CLTK's codebase, these are located at [cltk/morphology/universal_dependencies_features.py](https://github.com/cltk/cltk/blob/dev/src/cltk/morphology/universal_dependencies_features.py).

For instance, a Latin verb requires a label for its [mood](https://universaldependencies.org/u/feat/all.html#al-u-feat/Mood) (e.g., indicative), which the UD project defines as "a feature that expresses modality and subclassifies finite verb forms".

Though morphological taggers may annotate a verb's mood variously ("ind.", "indicative", "Indic", etc.), the CLTK maps each such label onto the following standardized `Mood` class.

``` python
class Mood(MorphosyntacticFeature):
    """The mood of a verb.
    see https://universaldependencies.org/u/feat/Mood.html
    """

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()
```
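
The normalization described above can be sketched as a lookup from a tagger's raw label to a `Mood` member. This is an illustration only, not the CLTK's actual mapping code: it redefines `Mood` on a plain `IntEnum` so that it runs standalone, and `RAW_TO_MOOD` and `normalize_mood` are hypothetical names.

```python
from enum import IntEnum, auto


class Mood(IntEnum):
    """Standalone copy of the members listed above, for illustration only."""

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()


# Hypothetical table of raw tagger labels; a real tagger would need its own entries.
RAW_TO_MOOD = {
    "ind": Mood.indicative,
    "indic": Mood.indicative,
    "indicative": Mood.indicative,
    "sub": Mood.subjunctive,
    "subjunctive": Mood.subjunctive,
}


def normalize_mood(raw_label: str) -> Mood:
    """Map a raw label such as "ind." or "Indic" onto a ``Mood`` member."""
    key = raw_label.strip().rstrip(".").lower()
    return RAW_TO_MOOD[key]  # raises KeyError for labels missing from the table


for label in ("ind.", "indicative", "Indic"):
    print(label, "->", normalize_mood(label).name)
```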

Turning back to the above example word, we can see such features at `.features`.

In [23]:
# type
print("type(`Word.features`):", type(a_word_concurrunt.features))
print("")
# str repr of `MorphosyntacticFeatureBundle`
print("`Word.features`:", a_word_concurrunt.features)

type(`Word.features`): <class 'cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle'>

`Word.features`: {Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}


A user may inspect a `MorphosyntacticFeatureBundle` in a manner similar to a `dict`:

In [24]:
print("Mood:", a_word_concurrunt.features["Mood"])  # type: List[Mood]
print("Number:", a_word_concurrunt.features["Number"])  # type: List[Number]
print("Person:", a_word_concurrunt.features["Person"])  # type: List[Person]
print("Tense:", a_word_concurrunt.features["Tense"])  # type: List[Tense]
print("VerbForm:", a_word_concurrunt.features["VerbForm"])  # type: List[VerbForm]
print("Voice:", a_word_concurrunt.features["Voice"])  # type: List[Voice]

# Note: The values returned here are lists, though normally only one
# morphological form will be available

Mood: [indicative]
Number: [plural]
Person: [third]
Tense: [present]
VerbForm: [finite]
Voice: [active]
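
Dict-like access of this kind can be modeled with a small container keyed by each feature's class name. The following is a rough sketch of the idea only. `ToyBundle` and the two reduced enums are hypothetical, and this is not how `MorphosyntacticFeatureBundle` is implemented.

```python
from enum import IntEnum, auto
from typing import Dict, List


class Mood(IntEnum):
    indicative = auto()
    subjunctive = auto()


class Number(IntEnum):
    singular = auto()
    plural = auto()


class ToyBundle:
    """Groups feature values by the name of their enum class."""

    def __init__(self, *features: IntEnum) -> None:
        self._features: Dict[str, List[IntEnum]] = {}
        for feature in features:
            self._features.setdefault(type(feature).__name__, []).append(feature)

    def __getitem__(self, feature_name: str) -> List[IntEnum]:
        # Values are lists, as in the output above; usually one value per feature
        return self._features[feature_name]


bundle = ToyBundle(Mood.indicative, Number.plural)
print("Mood:", [value.name for value in bundle["Mood"]])
print("Number:", [value.name for value in bundle["Number"]])
```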


Looking a bit closer at `MorphosyntacticFeature`, we can see that its data type inherits from the Python standard library's [IntEnum](https://docs.python.org/3/library/enum.html#enum.IntEnum).

In [25]:
a_mood_obj = a_word_concurrunt.features["Mood"][0]
# see type
print("type(a_mood_obj):", type(a_mood_obj))
print("")
# See inheritance
from enum import IntEnum
print("Is `IntEnum`?", isinstance(a_mood_obj, IntEnum))
print("")
# See that `Mood` is also a `MorphosyntacticFeature`
from cltk.morphology.morphosyntax import MorphosyntacticFeature
print("`Mood` inherits from `MorphosyntacticFeature`?", isinstance(a_mood_obj, MorphosyntacticFeature))

type(a_mood_obj): Mood

Is `IntEnum`? True

`Mood` inherits from `MorphosyntacticFeature`? True


In [26]:
# You can manipulate this object as any IntEnum plus a few extras

print("`MorphosyntacticFeature` accessors:", [x for x in dir(a_mood_obj) if not x.startswith("__")])
print("")
print("MorphosyntacticFeature.name:", a_mood_obj.name)  # type: str
# A stable int value is available, too, associated with this name
print("MorphosyntacticFeature.value:", a_mood_obj.value)  # type: int

`MorphosyntacticFeature` accessors: ['as_integer_ratio', 'bit_length', 'conjugate', 'denominator', 'from_bytes', 'imag', 'name', 'numerator', 'real', 'to_bytes', 'value']

MorphosyntacticFeature.name: indicative
MorphosyntacticFeature.value: 5


Users can create their own `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`:

In [27]:
from cltk.morphology.morphosyntax import MorphosyntacticFeatureBundle
from cltk.morphology.universal_dependencies_features import Mood, Number, Person, VerbForm, Voice

latin_word_sim = "sim"

mood = Mood.subjunctive
voice = Voice.active
person = Person.first
number = Number.singular
verb_form = VerbForm.finite

latin_word_sim_bundle = MorphosyntacticFeatureBundle(mood, voice, person, number, verb_form)
print(latin_word_sim_bundle)

{Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}


Finally, we may even construct a `Word` with this information:

In [28]:
from cltk.core.data_types import Word

print(Word(string="sim", features=latin_word_sim_bundle))

Word(index_char_start=None, index_char_stop=None, index_token=None, index_sentence=None, string='sim', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}, category={}, embedding=None, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)


In [29]:
# For more on this or any other CLTK class, use `help()`
# help(a_mood_obj)
# help(MorphosyntacticFeatureBundle)

In [30]:
# Note: Extra morphological info may be written as `str` values
# at `.upos` and `.xpos` for languages that use the
# Stanza project

# Note: The particular annotations at these are often inconsistent across
# languages or even treebanks within a single language; hence the benefit
# of the CLTK's modeling at `.pos`.
print("`Word.upos`:", a_word_concurrunt.upos)
print("`Word.xpos`:", a_word_concurrunt.xpos)

`Word.upos`: VERB
`Word.xpos`: L3|modA|tem1|gen9


# Modeling syntax with `Form` and `DependencyTree`  <a name="syntax"></a>

The CLTK uses the `xml` library from the Python standard library to build trees for modeling dependency parses. A `Word` is mapped into a `Form`, then `ElementTree` is used to organize these `Form`s into a `DependencyTree`. With a tree, certain measurements are more efficient (counting depth, breadth, edge types).
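
To illustrate the idea without the CLTK, the following sketch builds a small tree directly with `xml.etree.ElementTree` and measures its depth and edge types. It is not the `Form` or `DependencyTree` implementation. The rows are hand-copied from part of the tree printed later in this section, and marking the root with a governor of `None` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (token index, string, governor index or None for the root, relation)
rows: List[Tuple[int, str, Optional[int], str]] = [
    (0, "Ibi", 1, "advmod"),
    (1, "egressi", None, "root"),
    (2, "Troiani", 1, "nsubj:pass"),
    (10, "nihil", 15, "nsubj"),
    (15, "superesset", 2, "acl:relcl"),
]

# One XML element per token; attribute values must be strings
elements: Dict[int, ET.Element] = {
    index: ET.Element(
        "form", {"index": str(index), "string": string, "relation": relation}
    )
    for index, string, _governor, relation in rows
}

# Attach each element to its governor; the element without a governor is the root
root: Optional[ET.Element] = None
for index, _string, governor, _relation in rows:
    if governor is None:
        root = elements[index]
    else:
        elements[governor].append(elements[index])
assert root is not None


def depth(node: ET.Element) -> int:
    """Number of nodes on the longest path from ``node`` down to a leaf."""
    return 1 + max((depth(child) for child in node), default=0)


relation_counts = Counter(node.get("relation") for node in root.iter())
print("depth:", depth(root))
print("edge types:", dict(relation_counts))
```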

In [31]:
from cltk.dependency.tree import DependencyTree

In [32]:
# Let's look at this sentence again
print(cltk_doc.sentences_strings[6])  # text form of `sentence_6`

Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aboriginesque qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .


In [33]:
a_tree = DependencyTree.to_tree(sentence_6)

In [34]:
from pprint import pprint

pprint(a_tree.get_dependencies())

[advmod(egressi_1/verb, Ibi_0/adverb),
 punct(superesset_15/verb, ,_3/punctuation),
 mark(superesset_15/verb, ut_4/subordinating_conjunction),
 obl(superesset_15/verb, quibus_5/pronoun),
 obl(superesset_15/verb, errore_9/noun),
 obl(superesset_15/verb, arma_12/noun),
 case(immenso_7/adjective, ab_6/adposition),
 obl:arg(superesset_15/verb, immenso_7/adjective),
 case(errore_9/noun, prope_8/adposition),
 obl(superesset_15/verb, quibus_5/pronoun),
 obl(superesset_15/verb, errore_9/noun),
 obl(superesset_15/verb, arma_12/noun),
 nsubj(superesset_15/verb, nihil_10/pronoun),
 case(arma_12/noun, praeter_11/adposition),
 cc(naues_14/noun, et_13/coordinating_conjunction),
 conj(arma_12/noun, naues_14/noun),
 obl(superesset_15/verb, quibus_5/pronoun),
 obl(superesset_15/verb, errore_9/noun),
 obl(superesset_15/verb, arma_12/noun),
 punct(agerent_21/verb, ,_16/punctuation),
 punct(agerent_21/verb, ,_22/punctuation),
 mark(agerent_21/verb, cum_17/subordinating_conjunction),
 obj(agerent_21/verb, 

In [35]:
a_tree.print_tree()

root | egressi_1/verb
    └─ advmod | Ibi_0/adverb
    └─ nsubj:pass | Troiani_2/noun
        └─ acl:relcl | superesset_15/verb
            └─ punct | ,_3/punctuation
            └─ mark | ut_4/subordinating_conjunction
            └─ obl | quibus_5/pronoun
            └─ obl:arg | immenso_7/adjective
                └─ case | ab_6/adposition
            └─ obl | errore_9/noun
                └─ case | prope_8/adposition
            └─ nsubj | nihil_10/pronoun
            └─ obl | arma_12/noun
                └─ case | praeter_11/adposition
                └─ conj | naues_14/noun
                    └─ cc | et_13/coordinating_conjunction
            └─ advcl | agerent_21/verb
                └─ punct | ,_16/punctuation
                └─ mark | cum_17/subordinating_conjunction
                └─ obj | praedam_18/noun
                └─ obl | agris_20/noun
                    └─ case | ex_19/adposition
                └─ punct | ,_22/punctuation
    └─ conj | rex_24/noun
        └─ amod

# Feature extraction <a name="features"></a>

The CLTK offers the function `cltk_doc_to_features_table()`, which assists users when preparing a `Doc` as training data for machine learning. It converts the list of `Word` objects at `Doc.words` into a tabular list of lists.

In [36]:
from cltk.utils.feature_extraction import cltk_doc_to_features_table

In [37]:
feature_names, list_of_list_features = cltk_doc_to_features_table(cltk_doc=cltk_doc)

In [38]:
# See here the names of the features extracted
print(feature_names)

['string', 'lemma', 'embedding', 'is_stop', 'lemma', 'pos', 'case', 'gender', 'animacy', 'number', 'definiteness', 'degree', 'strength', 'verbform', 'tense', 'mood', 'aspect', 'voice', 'person', 'polarity', 'politeness', 'clusivity', 'evidentiality', 'strength', 'governing_word', 'governor_case', 'governor_gender', 'governor_animacy', 'governor_number', 'governor_definiteness', 'governor_degree', 'governor_strength', 'governor_verbform', 'governor_tense', 'governor_mood', 'governor_aspect', 'governor_voice', 'governor_person', 'governor_polarity', 'governor_politeness', 'governor_clusivity', 'governor_evidentiality', 'governor_strength', 'governing_relationship']
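
The list of names printed above contains repeats (`'lemma'`, `'strength'`, and `'governor_strength'` each appear twice), so building a `dict` directly from `zip(feature_names, row)` would silently drop columns. The following is a minimal sketch of one workaround, which numbers repeated names before zipping. `dedupe_names` is a hypothetical helper, not part of the CLTK, and the short row reuses values from the Ancient Greek example near the end of this notebook.

```python
from collections import Counter
from typing import List


def dedupe_names(names: List[str]) -> List[str]:
    """Append ``_2``, ``_3``, ... to the second and later occurrences of a name."""
    seen: Counter = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# A shortened set of columns (the embedding column is left out for brevity)
names = ["string", "lemma", "is_stop", "lemma", "pos"]
row = ["πόλεμον", "πόλεμος", False, False, "noun"]

record = dict(zip(dedupe_names(names), row))
print(record)
```

Unique column names also matter when loading the table into a data frame or writing it to CSV.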


In [39]:
# Number of "inner lists" matches number of tokens
print("Number tokens:", len(cltk_doc.words))
print("len() of feature instances (one for each token):", len(list_of_list_features))

Number tokens: 12556
len() of feature instances (one for each token): 12556


In [40]:
# Look at one row of data `(variable name, variable value)`
pprint(list(zip(feature_names, list_of_list_features[108])))

[('string', 'ad'),
 ('lemma', 'ad'),
 ('embedding',
  array([-0.018321 ,  0.012199 , -0.13971  ,  0.04643  , -0.24308  ,
       -0.17181  ,  0.11337  ,  0.050907 ,  0.059226 , -0.10186  ,
        0.025673 ,  0.39837  , -0.35537  , -0.12414  , -0.22189  ,
        0.19608  , -0.13432  ,  0.17865  , -0.15335  , -0.76687  ,
        0.17368  , -0.30593  ,  0.18471  ,  0.021884 ,  0.030866 ,
        0.046143 ,  0.55945  ,  0.034793 , -0.20696  , -0.038531 ,
       -0.06099  , -0.45349  ,  0.054395 ,  0.044977 , -0.21321  ,
        0.0084667, -0.14771  , -0.10546  , -0.17701  ,  0.090003 ,
        0.096426 ,  0.060619 , -0.20689  ,  0.022415 , -0.046664 ,
       -0.088348 , -0.093567 ,  0.21098  , -0.15069  , -0.11391  ,
        0.094938 ,  0.07905  , -0.15513  ,  0.041997 , -0.043762 ,
        0.19156  ,  0.019991 ,  0.05901  , -0.23694  ,  0.17699  ,
       -0.020411 , -0.12895  , -0.31144  ,  0.056038 ,  0.34857  ,
       -0.20495  , -0.25597  ,  0.31019  ,  0.020982 ,  0.029523 ,
        

# Brief demonstration of `NLP()` for Ancient Greek <a name="greek-nlp"></a>

The API for Ancient Greek is the same as for Latin.

In [41]:
# read the Ancient Greek file
with open("grc-thucydides.txt", encoding="utf-8") as fo:
    thucydides_full = fo.read()

In [42]:
print("Text snippet:", thucydides_full[0:200])
print("Character count:", len(thucydides_full))
print("Approximate token count:", len(thucydides_full.split()))

Text snippet: Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τ
Character count: 470056
Approximate token count: 72619


In [43]:
len(thucydides_full) // 7

67150

In [44]:
# Cut this down to roughly 10k tokens for this demonstration's purposes
thucydides = thucydides_full[:len(thucydides_full) // 7]
print("Approximate token count:", len(thucydides.split()))

Approximate token count: 10309


In [45]:
thucydides[:200]

'Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τ'

In [46]:
cltk_nlp_grc = NLP(language="grc")

‎𐤀 CLTK version 'cltk 1.0.0b10'.
Pipeline for language 'Ancient Greek' (ISO: 'grc'): `GreekNormalizeProcess`, `GreekStanzaProcess`, `GreekEmbeddingsProcess`, `StopsProcess`, `GreekNERProcess`.


In [47]:
# Execution time is 50 sec on a 2015 Macbook Pro
%time cltk_doc_grc = cltk_nlp_grc.analyze(text=thucydides)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 41.5 s, sys: 8.31 s, total: 49.9 s
Wall time: 43.3 s


In [48]:
print("`Doc.tokens`:", cltk_doc_grc.tokens[:20])

`Doc.tokens`: ['Θουκυδίδης', 'Ἀθηναῖος', 'ξυνέγραψε', 'τὸν', 'πόλεμον', 'τῶν', 'Πελοποννησίων', 'καὶ', 'Ἀθηναίων,', 'ὡς', 'ἐπολέμησαν', 'πρὸς', 'ἀλλήλους,', 'ἀρξάμενος', 'εὐθὺς', 'καθισταμένου', 'καὶ', 'ἐλπίσας', 'μέγαν', 'τε']


In [49]:
print(cltk_doc_grc.words[4])  # πόλεμον ('war')

Word(index_char_start=None, index_char_stop=None, index_token=4, index_sentence=0, string='πόλεμον', pos=noun, lemma='πόλεμος', stem=None, scansion=None, xpos='Nb', upos='NOUN', dependency_relation='obj', governor=2, features={Case: [accusative], Gender: [masculine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, embedding=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), stop=False, named_entity=False, syllables=None, phonetic_transcription=None, definition=None)
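
The embedding printed above is all zeros. The following is a minimal sketch for flagging such vectors before using them as features, on the assumption (not confirmed by this notebook) that an all-zero vector means no pretrained embedding was found for the token. `is_zero_vector` is a hypothetical helper that accepts any iterable of numbers.

```python
from typing import Iterable


def is_zero_vector(vector: Iterable[float]) -> bool:
    """Return True when every component of ``vector`` equals zero."""
    return all(component == 0 for component in vector)


print(is_zero_vector([0.0, 0.0, 0.0]))
print(is_zero_vector([0.0, -0.16746, 0.30632]))
```

Note that an empty vector also counts as all-zero under this definition.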


In [50]:
a_tree_grc = DependencyTree.to_tree(cltk_doc_grc.sentences[0])

In [51]:
pprint(a_tree_grc.get_dependencies())

[appos(Θουκυδίδης_0/proper_noun, Ἀθηναῖος_1/adjective),
 nsubj(ξυνέγραψε_2/verb, Θουκυδίδης_0/proper_noun),
 det(πόλεμον_4/noun, τὸν_3/determiner),
 det(Πελοποννησίων_6/noun, τῶν_5/determiner),
 cc(Πελοποννησίων_6/noun, καὶ_7/coordinating_conjunction),
 conj(Πελοποννησίων_6/noun, Ἀθηναίων,_8/noun),
 nmod(πόλεμον_4/noun, Πελοποννησίων_6/noun),
 obj(ξυνέγραψε_2/verb, πόλεμον_4/noun),
 mark(ἐπολέμησαν_10/verb, ὡς_9/subordinating_conjunction),
 case(ἀλλήλους,_12/adjective, πρὸς_11/adposition),
 obl(ἐπολέμησαν_10/verb, ἀλλήλους,_12/adjective),
 advmod(ἀρξάμενος_13/verb, εὐθὺς_14/adverb),
 iobj(ἀρξάμενος_13/verb, καθισταμένου_15/noun),
 cc(ἀρξάμενος_13/verb, καὶ_16/coordinating_conjunction),
 cc(μέγαν_18/noun, τε_19/coordinating_conjunction),
 cc(μέγαν_18/noun, καὶ_21/coordinating_conjunction),
 cop(μέγαν_18/noun, ἔσεσθαι_20/auxiliary),
 cc(μέγαν_18/noun, τε_19/coordinating_conjunction),
 cc(μέγαν_18/noun, καὶ_21/coordinating_conjunction),
 det(προγεγενημένων,_24/noun, τῶν_23/determiner),
 i

In [52]:
print(cltk_doc_grc.sentences_strings[0])
print("")
print("Translation:", "Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.")
print("")
a_tree_grc.print_tree()

Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον.

Translation: Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.

root | ξυνέγραψε_2/verb
    └─ nsubj | Θουκυδίδης_0/proper_noun
        └─ appos | Ἀθηναῖος_1/

In [53]:
feature_names_grc, list_of_list_features_grc = cltk_doc_to_features_table(cltk_doc=cltk_doc_grc)

In [54]:
print(feature_names_grc)

['string', 'lemma', 'embedding', 'is_stop', 'lemma', 'pos', 'case', 'gender', 'animacy', 'number', 'definiteness', 'degree', 'strength', 'verbform', 'tense', 'mood', 'aspect', 'voice', 'person', 'polarity', 'politeness', 'clusivity', 'evidentiality', 'strength', 'governing_word', 'governor_case', 'governor_gender', 'governor_animacy', 'governor_number', 'governor_definiteness', 'governor_degree', 'governor_strength', 'governor_verbform', 'governor_tense', 'governor_mood', 'governor_aspect', 'governor_voice', 'governor_person', 'governor_polarity', 'governor_politeness', 'governor_clusivity', 'governor_evidentiality', 'governor_strength', 'governing_relationship']


In [55]:
print("len() of feature instances (one for each token):", len(list_of_list_features_grc))
print("")
print("Example of one instance row:", list_of_list_features_grc[4])

len() of feature instances (one for each token): 10531

Example of one instance row: ['πόλεμον', 'πόλεμος', array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), False, False, 'noun', 'accusative', 'masculine', None, 'singular', None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'verb', None, None, None, 'singular', None, None, None, 'finite', 'past', 'indicative', 'perfective', 'active', 'third', None, None, None, None, None, 'obj']


In [56]:
# Putting these together for easier reading
pprint(list(zip(feature_names_grc, list_of_list_features_grc[4])))

[('string', 'πόλεμον'),
 ('lemma', 'πόλεμος'),
 ('embedding',
  array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])),
 ('is_stop', False),
 ('lemma', False),
 ('pos', 'noun'),
 ('case', 'accusative'),
 ('gender', 'masculine'),
 ('animacy', None),
 ('number', 'singular'),
 ('definiteness', None),
 ('degree', None),
 ('strength', None),
 ('verbform', None),
 ('tense', None),
 ('mood', None),
 ('aspect', None),
 ('voice', None),
 ('person', None),
 ('polarity', None),
 ('politeness', None),
 ('clusivity', None),
 ('evidentiality', None),
 ('strength', None),
 ('governing_word', 'verb'),
 ('governor_case',