# Table of contents
1. [Introduction](#introduction)
1. [Install CLTK](#install)
1. [Get data](#get-data)
1. [Run NLP pipeline with `NLP()`](#run-nlp)
1. [Inspect CLTK `Doc`](#inspect-doc)
1. [Inspect CLTK `Word`](#inspect-word)
1. [Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`](#morph)
1. [Modeling syntax with `Form` and `DependencyTree`](#syntax)
1. [Feature extraction](#features)
1. [Brief demonstration of `NLP()` for Ancient Greek](#greek-nlp)

# Introduction <a name="introduction"></a>

This notebook demonstrates how to use `NLP()`, the CLTK's primary interface, with Latin and Ancient Greek. Pipelines are available for 17 languages (see [Languages](https://docs.cltk.org/en/latest/languages.html) in the docs).

Full documentation is available at <https://docs.cltk.org/en/latest/cltk.html#cltk.nlp.NLP>.

Note that there is a large amount of code from this project's first six years (v. `0.1`), not all of which has been or will be moved over to v. `1.0`. Docs for `0.1` are still available at [https://legacy.cltk.org](https://legacy.cltk.org/) and tutorial notebooks at [https://github.com/cltk/tutorials](https://github.com/cltk/tutorials).

# Install CLTK <a name="install"></a>

This notebook comes from <https://github.com/cltk/cltk/tree/master/notebooks>. Full instructions for installing the CLTK are available at <https://docs.cltk.org/en/latest/installation.html>.

In [1]:
# !pip install cltk

In [2]:
## Or directly from this repo:
# cd .. && make install

# %load_ext autoreload
# %autoreload 2

# Get data <a name="get-data"></a>

The following cells obtain plaintext documents by two Classical authors. A subset of each will be used to demonstrate the CLTK.

In [3]:
# Get Latin text
# https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370
# now at `notebooks/lat-livy.txt`
# !curl -O -L https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370/raw/05b7a17af4b216a4986d897c57a9987e836cc91a/lat-livy.txt

In [4]:
# Get Ancient Greek text
# https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0
# now at `notebooks/grc-thucydides.txt`
# !curl -O -L https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0/raw/e08e47849f64484b0950b14563bb5a9fd1e1ef1c/grc-thucydides.txt

In [5]:
# read the Latin file
# We'll run the full demonstration in the Latin language first
with open("lat-livy.txt") as fo:
    livy_full = fo.read()

In [6]:
print("Text snippet:", livy_full[:200])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))

Text snippet: Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii et quia pacis reddendaeque Helenae semper auctores fuerant, omne ius
Character count: 920884
Approximate token count: 129799


In [7]:
len(livy_full) // 12

76740

In [8]:
# Now let's cut this down to roughly 10k tokens for this demonstration's purposes
livy = livy_full[:len(livy_full) // 12]
print("Approximate token count:", len(livy.split()))

Approximate token count: 10911
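
Slicing by character count, as above, can cut the final token in half. If that matters for your use case, the cut can be moved back to the nearest whitespace. The following is a minimal sketch using only the standard library; the helper name `truncate_at_whitespace` is hypothetical and not part of the CLTK.

```python
def truncate_at_whitespace(text: str, max_chars: int) -> str:
    """Return at most `max_chars` characters, cut back to the last whitespace."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    head = text[:max_chars]
    # If the cut already falls between tokens, keep the head as-is
    if text[max_chars].isspace() or head[-1].isspace():
        return head.rstrip()
    # Otherwise drop the partial final token
    parts = head.rsplit(maxsplit=1)
    return parts[0] if len(parts) == 2 else head


sample = "Iam primum omnium satis constat Troia capta"
snippet = truncate_at_whitespace(sample, 20)
print(snippet)
```

With the Livy text, this could be called as `truncate_at_whitespace(livy_full, len(livy_full) // 12)` in place of the plain slice.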


# Run NLP pipeline with `NLP()` <a name="run-nlp"></a>

In [9]:
# For most users, this is the only import required
from cltk import NLP

In [10]:
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")

‎𐤀 CLTK version '1.1.7'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinSpacyProcess`, `LatinLexiconProcess`.


In [11]:
# Removing ``LatinLexiconProcess`` for this demo b/c it is slow (adds ~9 mins total)
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)

[<class 'cltk.alphabet.processes.LatinNormalizeProcess'>, <class 'cltk.dependency.processes.LatinSpacyProcess'>]


In [12]:
# Now execute NLP algorithms upon input text
# Aside from download, execution time is ~50 sec on a 2015 Macbook Pro
%time cltk_doc = cltk_nlp.analyze(text=livy)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 4.27 s, sys: 604 ms, total: 4.88 s
Wall time: 4.91 s


# Inspect CLTK `Doc` <a name="inspect-doc"></a>

In [13]:
# We can now inspect the result
print(type(cltk_doc))

<class 'cltk.core.data_types.Doc'>


In [14]:
# All accessors
print([x for x in dir(cltk_doc) if not x.startswith("__")])

['_get_words_attribute', 'embeddings', 'embeddings_model', 'language', 'lemmata', 'morphosyntactic_features', 'normalized_text', 'pipeline', 'pos', 'raw', 'sentence_embeddings', 'sentences', 'sentences_strings', 'sentences_tokens', 'spacy_doc', 'stems', 'tokens', 'tokens_stops_filtered', 'words']


In [15]:
# Several of the more useful

# List of tokens
print(cltk_doc.tokens[:20])

['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenori', 'que', ',', 'et']


In [16]:
# List of lemmas
print(cltk_doc.lemmata[:20])

['iam', 'primum', 'omnis', 'satis', 'consto', 'Troia', 'campo', 'in', 'ceterus', 'saevitum', 'sum', 'Troianos', ',', 'duo', ',', 'aeneae', 'Antenor', 'que', ',', 'et']


In [17]:
# Basic part-of-speech info
print(cltk_doc.pos[:20])

['ADV', 'ADJ', 'ADJ', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'ADJ', 'VERB', 'AUX', 'ADJ', 'PUNCT', 'NUM', 'PUNCT', 'PROPN', 'PROPN', 'NOUN', 'PUNCT', 'CCONJ']
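
Because `Doc.tokens` and `Doc.pos` are parallel lists of strings, ordinary Python tools apply to them directly. The sketch below copies the first 20 tokens and tags from the outputs above into literal lists (so it runs without the CLTK installed) and tallies the tags with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# First 20 tokens and POS tags, copied from the cell outputs above
tokens = ['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta',
          'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus',
          ',', 'Aeneae', 'Antenori', 'que', ',', 'et']
pos_tags = ['ADV', 'ADJ', 'ADJ', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP',
            'ADJ', 'VERB', 'AUX', 'ADJ', 'PUNCT', 'NUM', 'PUNCT', 'PROPN',
            'PROPN', 'NOUN', 'PUNCT', 'CCONJ']

# Frequency of each tag
pos_counts = Counter(pos_tags)
print(pos_counts)

# Pair tokens with tags and drop punctuation
content_tokens = [tok for tok, tag in zip(tokens, pos_tags) if tag != "PUNCT"]
print(content_tokens)
```

On a full document, the same two lines would take `cltk_doc.tokens` and `cltk_doc.pos` as input.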


In [18]:
# A list of lists of tokens
print(cltk_doc.sentences_tokens[:2])

[['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenori', 'que', ',', 'et', 'vetusti', 'iure', 'hospitii', 'et', 'quia', 'pacis', 'reddendae', 'que', 'Helenae', 'semper', 'auctores', 'fuerant', ',', 'omne', 'ius', 'belli', 'Achiuos', 'abstinuisse', ';'], ['casibus', 'deinde', 'variis', 'Antenorem', 'cum', 'multitudine', 'Enetum', ',', 'qui', 'seditione', 'ex', 'Paphlagonia', 'pulsi', 'et', 'sedes', 'et', 'ducem', 'rege', 'Pylaemene', 'ad', 'Troiam', 'amisso', 'quaerebant', ',', 'venisse', 'in', 'intimum', 'maris', 'Hadriatici', 'sinum', ',', 'Euganeis', 'que', 'qui', 'inter', 'mare', 'Alpes', 'que', 'incolebant', 'pulsis', 'Enetos', 'Troianos', 'que', 'eas', 'tenuisse', 'terras', '.']]


# Inspect CLTK `Word` <a name="inspect-word"></a>

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all information that was generated during the NLP pipeline.

In [19]:
# One ``Word`` object for each token
print(len(cltk_doc.words))

12986


Users can go token-by-token via ``Doc.words`` or via the intermediary step of looping through sentences.

In [20]:
# Let's look at a non-trivial sentence from Book 1
print("Original:", cltk_doc.sentences_strings[5])
print("")
print("Translation:", "Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.")
# source: http://www.perseus.tufts.edu/hopper/text?doc=Liv.+1+1+5&fromdoc=Perseus%3Atext%3A1999.02.0151
sentence_6 = cltk_doc.sentences[5]  # type: List[Word]

Original: Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aborigines que qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .

Translation: Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.


In [37]:
# Looking at one Word, 'concurrunt' ('they run together')
a_word_concurrunt = sentence_6[41]
print(a_word_concurrunt)

Word(index_char_start=1091, index_char_stop=1101, index_token=188, index_sentence=5, string='concurrunt', pos=verb, lemma='concurro', stem=None, scansion=None, xpos='verb', upos='VERB', dependency_relation='conj', governor=162, features={Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}, category={F: [neg], N: [neg], V: [pos]}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)


In this word, you can see information for lexicography (`.lemma`), semantics (`.embedding`), morphology (`.pos`, `.features`), syntax (`.governor`, `.dependency_relation`), plus other information most users would find helpful (`.stop`, `.named_entity`).

# Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle` <a name="morph"></a>

When a language's `Pipeline` builds each `Word` object, morphological information is stored at several accessors. Those of interest to most users are `.pos` and `.features`.

In [38]:
print("`Word.string`:", a_word_concurrunt.string)
print("")
# Part-of-speech is always available at `.pos`.
print("`Word.pos`:", a_word_concurrunt.pos)

`Word.string`: concurrunt

`Word.pos`: verb


The CLTK contains a specific class for each of [the annotation types defined by v2 of the Universal Dependencies project](https://universaldependencies.org/u/feat/all.html). In the CLTK's codebase, these are located at [cltk/morphology/universal_dependencies_features.py](https://github.com/cltk/cltk/blob/dev/src/cltk/morphology/universal_dependencies_features.py).

For instance, a Latin verb requires a label for its [mood](https://universaldependencies.org/u/feat/all.html#al-u-feat/Mood) (e.g., indicative), which the UD project defines as "a feature that expresses modality and subclassifies finite verb forms".

Though morphological taggers may annotate a verb's mood variously ("ind.", "indicative", "Indic", etc.), the CLTK maps the term into the following standardized `Mood`.

``` python
class Mood(MorphosyntacticFeature):
    """The mood of a verb.
    see https://universaldependencies.org/u/feat/Mood.html
    """

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()
```
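
To illustrate the idea of mapping varied tagger labels onto one standardized member, here is a minimal sketch. It does not reproduce the CLTK's own mapping code: `ToyMood`, `TAGGER_LABEL_TO_MOOD`, and `normalize_mood` are hypothetical names, and the set of accepted labels is an assumption made for the example.

```python
from enum import IntEnum, auto


class ToyMood(IntEnum):
    """Hypothetical stand-in for a standardized mood label (not the CLTK class)."""

    imperative = auto()
    indicative = auto()
    subjunctive = auto()


# Hypothetical tagger-specific labels, lowercased and without a trailing period
TAGGER_LABEL_TO_MOOD = {
    "ind": ToyMood.indicative,
    "indic": ToyMood.indicative,
    "indicative": ToyMood.indicative,
    "sub": ToyMood.subjunctive,
    "subjunctive": ToyMood.subjunctive,
    "imp": ToyMood.imperative,
    "imperative": ToyMood.imperative,
}


def normalize_mood(label: str) -> ToyMood:
    """Map a tagger's mood label onto one standardized enum member."""
    key = label.strip().lower().rstrip(".")
    try:
        return TAGGER_LABEL_TO_MOOD[key]
    except KeyError:
        raise ValueError(f"Unknown mood label: {label!r}") from None


for raw_label in ("ind.", "indicative", "Indic"):
    print(raw_label, "->", normalize_mood(raw_label).name)
```

Whatever the tagger emits, downstream code then compares against a single enum member rather than a set of strings.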

Turning back to the above example word, we can see such features at `.features`.

In [39]:
# type
print("type(`Word.features`):", type(a_word_concurrunt.features))
print("")
# str repr of `MorphosyntacticFeatureBundle`
print("`Word.features`:", a_word_concurrunt.features)

type(`Word.features`): <class 'cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle'>

`Word.features`: {Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}


A user may inspect a `MorphosyntacticFeatureBundle` in a manner similar to a `dict`:

In [40]:
print("Mood:", a_word_concurrunt.features["Mood"])  # type: List[Mood]
print("Number:", a_word_concurrunt.features["Number"])  # type: List[Number]
print("Person:", a_word_concurrunt.features["Person"])  # type: List[Person]
print("Tense:", a_word_concurrunt.features["Tense"])  # type: List[Tense]
print("VerbForm:", a_word_concurrunt.features["VerbForm"])  # type: List[VerbForm]
print("Voice:", a_word_concurrunt.features["Voice"])  # type: List[Voice]

# Note: The values returned here are lists, though normally only one
# morphological form will be available

Mood: [indicative]
Number: [plural]
Person: [third]
Tense: [present]
VerbForm: [finite]
Voice: [active]
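
One way to picture this `dict`-like behavior, including why each value is a list, is a small container that groups enum members by the name of their class. The sketch below is a hypothetical stand-in written for illustration; `ToyNumber`, `ToyPerson`, and `ToyFeatureBundle` are not CLTK classes, and the CLTK's actual implementation may differ.

```python
from enum import IntEnum, auto
from typing import Dict, List


class ToyNumber(IntEnum):
    singular = auto()
    plural = auto()


class ToyPerson(IntEnum):
    first = auto()
    second = auto()
    third = auto()


class ToyFeatureBundle:
    """Hypothetical bundle: groups feature values under their class name."""

    def __init__(self, *features: IntEnum) -> None:
        self._features: Dict[str, List[IntEnum]] = {}
        for feature in features:
            # Several values of the same feature type end up in one list
            self._features.setdefault(type(feature).__name__, []).append(feature)

    def __getitem__(self, key: str) -> List[IntEnum]:
        return self._features[key]

    def items(self):
        return self._features.items()


toy_bundle = ToyFeatureBundle(ToyNumber.plural, ToyPerson.third)
print(toy_bundle["ToyNumber"])
for feature_name, values in toy_bundle.items():
    print(feature_name, [value.name for value in values])
```

Storing a list per feature leaves room for an ambiguous form to carry more than one value of the same type.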


Looking a bit closer at `MorphosyntacticFeature`, we can see that its data type inherits from the Python standard library's [IntEnum](https://docs.python.org/3/library/enum.html#enum.IntEnum).

In [41]:
a_mood_obj = a_word_concurrunt.features["Mood"][0]
# see type
print("type(a_mood_obj):", type(a_mood_obj))
print("")
# See inheritance
from enum import IntEnum
print("Is `IntEnum`?", isinstance(a_mood_obj, IntEnum))
print("")
# See inheritance from `MorphosyntacticFeature`
from cltk.morphology.morphosyntax import MorphosyntacticFeature
print("`Mood` inherits from `MorphosyntacticFeature`?", isinstance(a_mood_obj, MorphosyntacticFeature))

type(a_mood_obj): Mood

Is `IntEnum`? True

`Mood` inherits from `MorphosyntacticFeature`? True


In [42]:
# You can manipulate this object as any IntEnum plus a few extras

print("`MorphosyntacticFeature` accessors:", [x for x in dir(a_mood_obj) if not x.startswith("__")])
print("")
print("MorphosyntacticFeature.name:", a_mood_obj.name)  # type: str
# A stable int value is available, too, associated with this name
print("MorphosyntacticFeature.value:", a_mood_obj.value)  # type: int

`MorphosyntacticFeature` accessors: ['_generate_next_value_', '_member_map_', '_member_names_', '_member_type_', '_missing_', '_name_', '_new_member_', '_sort_order_', '_unhashable_values_', '_use_args_', '_value2member_map_', '_value_', '_value_repr_', 'admirative', 'as_integer_ratio', 'bit_count', 'bit_length', 'conditional', 'conjugate', 'denominator', 'desiderative', 'from_bytes', 'gerundive', 'imag', 'imperative', 'indicative', 'infinitive', 'jussive', 'name', 'necessitative', 'numerator', 'optative', 'potential', 'purposive', 'quotative', 'real', 'subjunctive', 'to_bytes', 'value']

MorphosyntacticFeature.name: indicative
MorphosyntacticFeature.value: 6


Users can create their own `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`:

In [43]:
from cltk.morphology.morphosyntax import MorphosyntacticFeatureBundle
from cltk.morphology.universal_dependencies_features import Mood, Number, Person, VerbForm, Voice

latin_word_sim = "sim"

mood = Mood.subjunctive
voice = Voice.active
person = Person.first
number = Number.singular
verb_form = VerbForm.finite

latin_word_sim_bundle = MorphosyntacticFeatureBundle(mood, voice, person, number, verb_form)
print(latin_word_sim_bundle)

{Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}


This too can be iterated through like a `dict`:

In [44]:
for feature, value in latin_word_sim_bundle.items():
    print(feature, value)  # Note: value has `list` type

Mood [subjunctive]
Voice [active]
Person [first]
Number [singular]
VerbForm [finite]


Finally, we may even construct a `Word` with this information:

In [45]:
from cltk.core.data_types import Word

print(Word(string="sim", features=latin_word_sim_bundle))

Word(index_char_start=None, index_char_stop=None, index_token=None, index_sentence=None, string='sim', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)


In [46]:
# For more on this or any other CLTK class, use `help()`
# help(a_mood_obj)
# help(MorphosyntacticFeatureBundle)

In [47]:
# Note: Extra morphological info may be written as `str` values
# at `.upos` and `.xpos` for languages that use the
# Stanza project

# Note: The particular annotations at these are often inconsistent across
# languages or even treebanks within a single language; hence the benefit
# of the CLTK's modeling at `.pos`.
print("`Word.upos`:", a_word_concurrunt.upos)
print("`Word.xpos`:", a_word_concurrunt.xpos)

`Word.upos`: VERB
`Word.xpos`: verb


# Modeling syntax with `Form` and `DependencyTree`  <a name="syntax"></a>

The CLTK uses the standard-library `xml` package to build trees for modeling dependency parses. A `Word` is mapped into a `Form`, then `ElementTree` is used to organize these `Form`s into a `DependencyTree`. With a tree, certain measurements are more efficient (counting depth, breadth, edge types).
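
The general technique can be sketched with `xml.etree.ElementTree` alone. The example below is not the CLTK's `Form` or `DependencyTree` code: the four-token parse is hand-made for illustration (it is not parser output), and the `build_tree` and `depth` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made toy parse: (token index, string, governor index, relation).
# A governor of -1 marks the root.
toy_parse = [
    (0, "Troiani", 3, "nsubj"),
    (1, "magna", 2, "amod"),
    (2, "arma", 3, "obj"),
    (3, "habent", -1, "root"),
]


def build_tree(parse):
    """Attach each token's element beneath the element of its governor."""
    nodes = {
        idx: ET.Element("form", {"string": string, "relation": relation})
        for idx, string, _governor, relation in parse
    }
    root = None
    for idx, _string, governor, _relation in parse:
        if governor == -1:
            root = nodes[idx]
        else:
            nodes[governor].append(nodes[idx])
    if root is None:
        raise ValueError("Parse has no root token")
    return root


def depth(node):
    """Number of levels in the subtree rooted at `node`."""
    return 1 + max((depth(child) for child in node), default=0)


root = build_tree(toy_parse)
print("Root:", root.get("string"))
print("Depth:", depth(root))
print("Edge types:", Counter(node.get("relation") for node in root.iter()))
```

Once the tokens are arranged as a tree, depth and edge-type counts reduce to short traversals like these.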

In [48]:
from cltk.dependency.tree import DependencyTree

In [49]:
# Let's look at this sentence again
print(cltk_doc.sentences_strings[5])  # text form of `sentence_6`

Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aborigines que qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .


In [50]:
a_tree = DependencyTree.to_tree(sentence_6)

In [51]:
from pprint import pprint

pprint(a_tree.get_dependencies())

AttributeError: 'NoneType' object has no attribute 'get_dependencies'

In [52]:
a_tree.print_tree()

AttributeError: 'NoneType' object has no attribute 'print_tree'

# Feature extraction <a name="features"></a>

The CLTK offers the function `cltk_doc_to_features_table()`, which assists users when preparing a `Doc` as training data for machine learning. It converts the list of `Word` objects at `Doc.words` into a tabular list of lists.
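
To make the target shape concrete, the sketch below flattens a few word-like objects into a header row plus one row per token. It illustrates the general idea only, not `cltk_doc_to_features_table()` itself: `ToyWord` and `words_to_table` are hypothetical, and the columns the CLTK function actually produces may differ. The sample values are taken from the outputs earlier in this notebook.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class ToyWord:
    """Hypothetical, much-reduced stand-in for a CLTK `Word`."""

    string: str
    lemma: Optional[str] = None
    upos: Optional[str] = None
    governor: Optional[int] = None


def words_to_table(words: List[ToyWord]) -> Tuple[List[str], List[list]]:
    """Return (column names, one row of values per word)."""
    names = [field.name for field in fields(ToyWord)]
    rows = [[getattr(word, name) for name in names] for word in words]
    return names, rows


toy_words = [
    ToyWord(string="Iam", lemma="iam", upos="ADV"),
    ToyWord(string="primum", lemma="primum", upos="ADJ"),
    ToyWord(string="concurrunt", lemma="concurro", upos="VERB", governor=162),
]

toy_names, toy_rows = words_to_table(toy_words)
print(toy_names)
for row in toy_rows:
    print(row)
```

A table in this form can be handed to `csv.writer` or wrapped in a dataframe without further reshaping.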

In [53]:
from cltk.utils.feature_extraction import cltk_doc_to_features_table

In [54]:
feature_names, list_of_list_features = cltk_doc_to_features_table(cltk_doc=cltk_doc)

IndexError: list index out of range

In [None]:
# See here the names of the features extracted
print(feature_names)

In [None]:
# Number of "inner lists" matches number of tokens
print("Number tokens:", len(cltk_doc.words))
print("len() of feature instances (one for each token):", len(list_of_list_features))

In [None]:
# Look at one row of data `(variable name, variable value)`
pprint(list(zip(feature_names, list_of_list_features[108])))

# Brief demonstration of `NLP()` for Ancient Greek <a name="greek-nlp"></a>

The API for Greek is the same as for Latin.

In [None]:
# read the Ancient Greek file
with open("grc-thucydides.txt") as fo:
    thucydides_full = fo.read()

In [None]:
print("Text snippet:", thucydides_full[0:200])
print("Character count:", len(thucydides_full))
print("Approximate token count:", len(thucydides_full.split()))

In [None]:
len(thucydides_full) // 7

In [None]:
# Cut this down to roughly 10k tokens for this demonstration's purposes
thucydides = thucydides_full[:len(thucydides_full) // 7]
print("Approximate token count:", len(thucydides.split()))

In [None]:
thucydides[:200]

In [None]:
cltk_nlp_grc = NLP(language="grc")

In [None]:
# Execution time is 50 sec on a 2015 Macbook Pro
%time cltk_doc_grc = cltk_nlp_grc.analyze(text=thucydides)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

In [None]:
print("`Doc.tokens`:", cltk_doc_grc.tokens[:20])

In [None]:
print(cltk_doc_grc.words[4])  # πόλεμον ('war')

In [None]:
a_tree_grc = DependencyTree.to_tree(cltk_doc_grc.sentences[0])  #81

In [None]:
pprint(a_tree_grc.get_dependencies())

In [None]:
print(cltk_doc_grc.sentences_strings[0])
print("")
print("Translation:", "Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.")
print("")
a_tree_grc.print_tree()

In [None]:
feature_names_grc, list_of_list_features_grc = cltk_doc_to_features_table(cltk_doc=cltk_doc_grc)

In [None]:
print(feature_names_grc)

In [None]:
print("len() of feature instances (one for each token):", len(list_of_list_features_grc))
print("")
print("Example of one instance row:", list_of_list_features_grc[4])

In [None]:
# Putting these together for easier reading
pprint(list(zip(feature_names_grc, list_of_list_features_grc[4])))