# Table of contents
1. [Introduction](#introduction)
1. [Install pre-release of CLTK](#install)
1. [Get data](#get-data)
1. [Run NLP pipeline with `NLP()`](#run-nlp)
1. [Inspect CLTK `Doc`](#inspect-doc)
1. [Inspect CLTK `Word`](#inspect-word)
1. [Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`](#morph)
1. [Modeling syntax with `Form` and `DependencyTree`](#syntax)
1. [Feature extraction](#features)
1. [Brief demonstration of `NLP()` for Ancient Greek](#greek-nlp)

# Introduction <a name="introduction"></a>

TODO

# Install pre-release of CLTK <a name="install"></a>

In [None]:
# Requires Python 3.7 or 3.8 on a POSIX-compliant OS
# the latest published beta
!pip install --pre cltk

In [None]:
# or directly from this repo
# !cd .. && make install

# Get data <a name="get-data"></a>

The following cells download two plaintext documents, one by each of two Classical authors. A subset of each will be used to demonstrate the CLTK.

In [None]:
# Get Latin text
# https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370
# now at `notebooks/lat-livy.txt`
!curl -L -O https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370/raw/4b98b15017b1bd31e77447309bd9b7cb9086349c/lat-livy.txt

In [None]:
# Get Ancient Greek text
# https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0
# now at `notebooks/grc-thucydides.txt`
!curl -L -O https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0/raw/8f5aa440363dc66952bb1eb12effc7d3ada101a8/grc-thucydides.txt

In [1]:
# read the Latin file
# We'll run the full demonstration in the Latin language first
with open("lat-livy.txt", encoding="utf-8") as fo:
    livy_full = fo.read()

In [2]:
print("Text snippet:", livy_full[200:400])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))

Text snippet: riptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili pa
Character count: 3580331
Approximate token count: 503818


In [3]:
len(livy_full) // 50

71606

In [4]:
# Now let's cut this down to roughly 10k tokens for this demonstration's purposes
livy = livy_full[:len(livy_full) // 50]
print("Approximate token count:", len(livy.split()))

Approximate token count: 10209
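
Slicing by character count can cut the final word in half. A minimal sketch of one way around this, assuming a hypothetical helper `truncate_at_whitespace` (not part of the CLTK) that backs up to the last whitespace before the cut:

```python
def truncate_at_whitespace(text: str, max_chars: int) -> str:
    """Return at most `max_chars` characters of `text`, ending on a token boundary.

    Hypothetical helper for illustration; not part of the CLTK.
    """
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut fell exactly on a token boundary: nothing is split
    if cut[-1:].isspace() or text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial token at the end of the cut
    parts = cut.rsplit(None, 1)
    # A single unbroken token longer than `max_chars`: keep the raw cut
    return parts[0] if len(parts) == 2 else cut


sample = "facturusne operae pretium sim, si a primordio urbis"
snippet = truncate_at_whitespace(sample, 20)
print(repr(snippet))
```

With the notebook's variables, the call would look like `truncate_at_whitespace(livy_full, len(livy_full) // 50)`.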


# Run NLP pipeline with `NLP()` <a name="run-nlp"></a>

In [5]:
# For most users, this is the only import required
from cltk import NLP

In [6]:
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")

In [7]:
# Now execute NLP algorithms upon input text
# Execution time is roughly 1 to 1.5 min on a 2015 MacBook Pro (see the timing output below)
%time cltk_doc = cltk_nlp.analyze(text=livy)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 1min 26s, sys: 7.61 s, total: 1min 33s
Wall time: 1min 27s


# Inspect CLTK `Doc` <a name="inspect-doc"></a>

In [8]:
# We can now inspect the result
print(type(cltk_doc))

<class 'cltk.core.data_types.Doc'>


In [9]:
# All accessors
print([x for x in dir(cltk_doc) if not x.startswith("__")])

['_get_words_attribute', 'embeddings', 'embeddings_model', 'language', 'lemmata', 'morphosyntactic_features', 'pipeline', 'pos', 'raw', 'sentences', 'sentences_strings', 'sentences_tokens', 'stanza_doc', 'stems', 'tokens', 'tokens_stops_filtered', 'words']


In [10]:
# Several of the more useful

# List of tokens
print(cltk_doc.tokens[:20])

['facturusne', 'operae', 'pretium', 'sim', ',', 'si', 'a', 'primordio', 'urbis', 'res', 'populi', 'Romani', 'perscripserim', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si']


In [11]:
# List of lemmas
print(cltk_doc.lemmata[:20])

['facturusne', 'opus', 'pretium', 'sum', ',', 'si', 'ab', 'primordius', 'urbis', 'res', 'populus', 'momanum', 'perscribo', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si']


In [12]:
# Basic part-of-speech info
print(cltk_doc.pos[:20])

['ADV', 'NOUN', 'NOUN', 'AUX', 'PUNCT', 'SCONJ', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'CCONJ', 'ADV', 'VERB', 'CCONJ', 'PUNCT', 'SCONJ']
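
Because `Doc.pos` is a plain list of strings, standard-library tools apply to it directly. As a small illustration, the sketch below tallies the twenty tags printed above with `collections.Counter`; the list is copied from that output rather than recomputed by the pipeline.

```python
from collections import Counter

# The first 20 tags, copied from the output above
pos_tags = [
    "ADV", "NOUN", "NOUN", "AUX", "PUNCT", "SCONJ", "ADP", "ADJ", "NOUN", "NOUN",
    "NOUN", "NOUN", "VERB", "PUNCT", "CCONJ", "ADV", "VERB", "CCONJ", "PUNCT", "SCONJ",
]

pos_counts = Counter(pos_tags)

# Show the most frequent tags first
for tag, count in pos_counts.most_common(3):
    print(tag, count)
```

On a full document, the same call would be `Counter(cltk_doc.pos)`.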


In [13]:
# A list of lists of tokens (one inner list per sentence)
print(cltk_doc.sentences_tokens[:2])

[['facturusne', 'operae', 'pretium', 'sim', ',', 'si', 'a', 'primordio', 'urbis', 'res', 'populi', 'Romani', 'perscripserim', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si', 'sciam', ',', 'dicere', 'ausim', ',', 'quippe', 'qui', 'cum', 'veterem', 'tum', 'vulgatam', 'esse', 'rem', 'videam', ',', 'dum', 'novi', 'semper', 'scriptores', 'aut', 'in', 'rebus', 'certius', 'aliquid', 'allaturos', 'se', 'aut', 'scribendi', 'arte', 'rudem', 'vetustatem', 'superaturos', 'credunt', '.'], ['utcumque', 'erit', ',', 'iuvabit', 'tamen', 'rerum', 'gestarum', 'memoriae', 'principis', 'terrarum', 'populi', 'pro', 'virili', 'parte', 'et', 'ipsum', 'consuluisse', ';']]


# Inspect CLTK `Word` <a name="inspect-word"></a>

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all information that was generated during the NLP pipeline.

In [14]:
# One ``Word`` object for each token
print(len(cltk_doc.words))

11735


Users can go token-by-token via ``Doc.words`` or via the intermediary step of looping through sentences.

In [15]:
# Let's look at a non-trivial sentence from Book 1
print("Original:", cltk_doc.sentences_strings[26])
print("")
print("Translation:", "Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.")
# source: http://www.perseus.tufts.edu/hopper/text?doc=Liv.+1+1+5&fromdoc=Perseus%3Atext%3A1999.02.0151
sentence_26 = cltk_doc.sentences[26]  # type: List[Word]

Original: Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aboriginesque qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .

Translation: Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.


In [16]:
# Looking at one Word, 'concurrunt' ('they run together')
a_word_concurrunt = sentence_26[40]
print(a_word_concurrunt)

Word(index_char_start=None, index_char_stop=None, index_token=40, index_sentence=26, string='concurrunt', pos=verb, lemma='concurro', stem=None, scansion=None, xpos='L3|modA|tem1|gen9', upos='VERB', dependency_relation='acl:relcl', governor=33, features={Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}, category={F: [neg], N: [neg], V: [pos]}, embedding=array([-0.16746  , -0.18548  ,  0.30632  , -0.29627  , -0.27262  ,
       -0.0767   ,  0.19405  ,  0.12386  , -0.0076342,  0.13037  ,
        0.17128  ,  0.1189   , -0.22169  , -0.57089  ,  0.28066  ,
       -0.14514  , -0.041256 , -0.021754 ,  0.02212  , -0.25983  ,
        0.53374  , -0.042267 ,  0.27314  ,  0.083616 ,  0.30746  ,
        0.087764 , -0.10098  ,  0.22689  , -0.17577  , -0.35894  ,
       -0.39609  ,  0.43406  ,  0.21306  ,  0.26909  ,  0.099561 ,
        0.26916  , -0.46547  ,  0.1416   , -0.21319  , -0.15126  ,
        0.36604  , -0.020737 ,  0.42397  ,  0.0

In this word, you can see information for lexicography (`.lemma`), semantics (`.embedding`), morphology (`.pos`, `.features`), syntax (`.governor`, `.dependency_relation`), plus other information most users would find helpful (`.stop`, `.named_entity`).

# Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle` <a name="morph"></a>

When a language's `Pipeline` builds each `Word` object, morphological information is stored at several accessors. Those of interest to most users are `.pos` and `.features`.

In [17]:
print("`Word.string`:", a_word_concurrunt.string)
print("")
# Part-of-speech is always available at `.pos`.
print("`Word.pos`:", a_word_concurrunt.pos)

`Word.string`: concurrunt

`Word.pos`: POS.verb


The CLTK contains a specific class for each of [the annotation types defined by v2 of the Universal Dependencies project](https://universaldependencies.org/u/feat/all.html). In the CLTK's codebase, these are located at [cltk/morphology/universal_dependencies_features.py](https://github.com/cltk/cltk/blob/dev/src/cltk/morphology/universal_dependencies_features.py).

For instance, a Latin verb requires a label for its [mood](https://universaldependencies.org/u/feat/all.html#al-u-feat/Mood) (e.g., indicative), which the UD project defines as "a feature that expresses modality and subclassifies finite verb forms".

Though morphological taggers may annotate a verb's mood variously ("ind.", "indicative", "Indic", etc.), the CLTK maps the term into the following, standardized `Mood`.

``` python
class Mood(MorphosyntacticFeature):
    """The mood of a verb.
    see https://universaldependencies.org/u/feat/Mood.html
    """

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()
```
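
The idea of mapping heterogeneous tagger labels onto one standardized value can be sketched as follows. This is a minimal illustration and not the CLTK's actual mapping code: the `Mood` enum here is a trimmed stand-in built on a plain `IntEnum`, and the alias table `MOOD_ALIASES` and the function `normalize_mood` are hypothetical.

```python
from enum import IntEnum, auto


class Mood(IntEnum):
    """Trimmed stand-in for the CLTK's `Mood`; only three members are shown."""

    imperative = auto()
    indicative = auto()
    subjunctive = auto()


# Hypothetical alias table: lowercased tagger labels -> standardized member
MOOD_ALIASES = {
    "ind": Mood.indicative,
    "indic": Mood.indicative,
    "indicative": Mood.indicative,
    "sub": Mood.subjunctive,
    "subjunctive": Mood.subjunctive,
    "imp": Mood.imperative,
    "imperative": Mood.imperative,
}


def normalize_mood(label: str) -> Mood:
    """Map a tagger-specific label such as "ind." or "Indic" to a `Mood`."""
    key = label.strip().rstrip(".").lower()
    try:
        return MOOD_ALIASES[key]
    except KeyError:
        raise ValueError(f"Unknown mood label: {label!r}") from None


for raw in ("ind.", "indicative", "Indic"):
    print(raw, "->", normalize_mood(raw).name)
```

Raising on unknown labels, rather than guessing, keeps annotation errors visible when a new tagger is added.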

Turning back to the above example word, we can see such features at `.features`.

In [18]:
# type
print("type(`Word.features`):", type(a_word_concurrunt.features))
print("")
# str repr of `MorphosyntacticFeatureBundle`
print("`Word.features`:", a_word_concurrunt.features)

type(`Word.features`): <class 'cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle'>

`Word.features`: {Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}


A user may inspect a `MorphosyntacticFeatureBundle` in a manner similar to a `dict`:

In [19]:
print("Mood:", a_word_concurrunt.features["Mood"])  # type: List[Mood]
print("Number:", a_word_concurrunt.features["Number"])  # type: List[Number]
print("Person:", a_word_concurrunt.features["Person"])  # type: List[Person]
print("Tense:", a_word_concurrunt.features["Tense"])  # type: List[Tense]
print("VerbForm:", a_word_concurrunt.features["VerbForm"])  # type: List[VerbForm]
print("Voice:", a_word_concurrunt.features["Voice"])  # type: List[Voice]

# Note: The values returned here are lists, though normally only one
# morphological form will be available

Mood: [indicative]
Number: [plural]
Person: [third]
Tense: [present]
VerbForm: [finite]
Voice: [active]
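
To make the `dict`-like behavior concrete, here is a minimal sketch of a container keyed by feature-class name. It is a stand-in written under stated assumptions, not the CLTK's `MorphosyntacticFeatureBundle`: the class `FeatureBundleSketch` and the two tiny enums are hypothetical.

```python
from enum import IntEnum, auto
from typing import Dict, List


class Mood(IntEnum):
    """Tiny stand-in enum, for illustration only."""

    indicative = auto()
    subjunctive = auto()


class Number(IntEnum):
    """Tiny stand-in enum, for illustration only."""

    singular = auto()
    plural = auto()


class FeatureBundleSketch:
    """Hypothetical container that groups enum members by their class name."""

    def __init__(self, *features: IntEnum) -> None:
        self._features: Dict[str, List[IntEnum]] = {}
        for feature in features:
            # Key on the class name so that lookups work with plain strings
            self._features.setdefault(type(feature).__name__, []).append(feature)

    def __getitem__(self, name: str) -> List[IntEnum]:
        return self._features[name]

    def __contains__(self, name: str) -> bool:
        return name in self._features

    def __repr__(self) -> str:
        inner = ", ".join(
            f"{name}: [{', '.join(value.name for value in values)}]"
            for name, values in self._features.items()
        )
        return "{" + inner + "}"


bundle = FeatureBundleSketch(Mood.indicative, Number.plural)
print(bundle)
print(bundle["Mood"][0].name)
```

Storing a list per key mirrors the list-valued lookups shown above, which leave room for more than one value per feature.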


Looking a bit closer at `MorphosyntacticFeature`, we can see that it inherits from the Python standard library's [IntEnum](https://docs.python.org/3/library/enum.html#enum.IntEnum).

In [20]:
a_mood_obj = a_word_concurrunt.features["Mood"][0]
# see type
print("type(a_mood_obj):", type(a_mood_obj))
print("")
# See inheritance
from enum import IntEnum
print("Is `IntEnum`?", isinstance(a_mood_obj, IntEnum))
print("")
# See CLTK inheritance
from cltk.morphology.morphosyntax import MorphosyntacticFeature
print("`Mood` inherits from `MorphosyntacticFeature`?", isinstance(a_mood_obj, MorphosyntacticFeature))

type(a_mood_obj): Mood

Is `IntEnum`? True

`Mood` inherits from `MorphosyntacticFeature`? True


In [21]:
# You can manipulate this object as any IntEnum plus a few extras

print("`MorphosyntacticFeature` accessors:", [x for x in dir(a_mood_obj) if not x.startswith("__")])
print("")
print("MorphosyntacticFeature.name:", a_mood_obj.name)  # type: str
# A stable int value is available, too, associated with this name
print("MorphosyntacticFeature.value:", a_mood_obj.value)  # type: int

`MorphosyntacticFeature` accessors: ['as_integer_ratio', 'bit_length', 'conjugate', 'denominator', 'from_bytes', 'imag', 'name', 'numerator', 'real', 'to_bytes', 'value']

MorphosyntacticFeature.name: indicative
MorphosyntacticFeature.value: 5
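
The value `5` is consistent with how `auto()` numbers members of an `IntEnum`: starting at 1, in definition order. The sketch below re-declares the `Mood` listing shown earlier on a plain `IntEnum`. This is a stand-in; the CLTK's `MorphosyntacticFeature` base class is not used here and may add behavior of its own.

```python
from enum import IntEnum, auto


class Mood(IntEnum):
    """Stand-in with the same members, in the same order, as the listing above."""

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()


print(Mood.indicative.value)
# Members behave like ints, so they compare and sort numerically
print(Mood.indicative < Mood.subjunctive)
# A member can be recovered from its integer value
print(Mood(5).name)
```

Integer-valued members are convenient when features are later fed to numeric code, since no separate label encoding step is needed.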


Users can create their own `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`:

In [22]:
from cltk.morphology.morphosyntax import MorphosyntacticFeatureBundle
from cltk.morphology.universal_dependencies_features import Mood, Number, Person, VerbForm, Voice

latin_word_sim = "sim"

mood = Mood.subjunctive
voice = Voice.active
person = Person.first
number = Number.singular
verb_form = VerbForm.finite

latin_word_sim_bundle = MorphosyntacticFeatureBundle(mood, voice, person, number, verb_form)
print(latin_word_sim_bundle)

{Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}


Finally, we may even construct a `Word` with this information:

In [23]:
from cltk.core.data_types import Word

print(Word(string="sim", features=latin_word_sim_bundle))

Word(index_char_start=None, index_char_stop=None, index_token=None, index_sentence=None, string='sim', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={Mood: [subjunctive], Voice: [active], Person: [first], Number: [singular], VerbForm: [finite]}, category=None, embedding=None, stop=None, named_entity=None, syllables=None, phonetic_transcription=None)


In [24]:
# For more on this or any other CLTK class, use `help()`
# help(a_mood_obj)
# help(MorphosyntacticFeatureBundle)

In [25]:
# Note: Extra morphological info may be written in `str` type
# to the values at `.upos` and `.xpos` for languages using
# the Stanza project

# Note: The particular annotations at these are often inconsistent across
# languages or even treebanks within a single language; hence the benefit
# of the CLTK's modeling at `.pos`.
print("`Word.upos`:", a_word_concurrunt.upos)
print("`Word.xpos`:", a_word_concurrunt.xpos)

`Word.upos`: VERB
`Word.xpos`: L3|modA|tem1|gen9


# Modeling syntax with `Form` and `DependencyTree`  <a name="syntax"></a>

The CLTK uses the builtin `xml` library to build a tree for modeling dependency parses. A `Word` is mapped into a `Form`, then `ElementTree` is used to organize these `Form`s into a `DependencyTree`. With a tree, certain measurements are more efficient (counting depth, breadth, edge types).
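
As a rough illustration of the idea, the sketch below builds a small tree with `xml.etree.ElementTree` from a handful of dependency edges belonging to sentence 26 and measures its depth. It does not use the CLTK's `Form` or `DependencyTree` classes; the tuple layout and the helper names `build_tree` and `depth` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# (token index, string, governor index or None for the root, relation)
Edge = Tuple[int, str, Optional[int], str]

edges: List[Edge] = [
    (0, "Ibi", 1, "advmod"),
    (1, "egressi", None, "root"),
    (2, "Troiani", 1, "nsubj:pass"),
    (4, "ut", 15, "mark"),
    (15, "superesset", 2, "acl:relcl"),
]


def build_tree(rows: List[Edge]) -> ET.Element:
    """Create one XML element per token, then attach each to its governor."""
    nodes: Dict[int, ET.Element] = {
        index: ET.Element("form", {"string": string, "relation": relation})
        for index, string, _, relation in rows
    }
    root = None
    for index, _, governor, _ in rows:
        if governor is None:
            root = nodes[index]
        else:
            nodes[governor].append(nodes[index])
    if root is None:
        raise ValueError("No root token found")
    return root


def depth(node: ET.Element) -> int:
    """Number of nodes on the longest path from `node` down to a leaf."""
    return 1 + max((depth(child) for child in node), default=0)


tree_root = build_tree(edges)
print(tree_root.get("string"), depth(tree_root))
```

Once the tokens are nested this way, measurements such as depth reduce to short recursive walks over child elements.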

In [26]:
from cltk.dependency.tree import DependencyTree

In [96]:
# Let's look at this sentence again
print(cltk_doc.sentences_strings[26])  # text form of `sentence_26`

Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aboriginesque qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .


In [29]:
a_tree = DependencyTree.to_tree(sentence_26)

In [94]:
from pprint import pprint

pprint(a_tree.get_dependencies())

[advmod(egressi_1, Ibi_0),
 punct(superesset_15, ,_3),
 mark(superesset_15, ut_4),
 obl(superesset_15, quibus_5),
 obl(superesset_15, errore_9),
 obl(superesset_15, arma_12),
 case(immenso_7, ab_6),
 obl:arg(superesset_15, immenso_7),
 case(errore_9, prope_8),
 obl(superesset_15, quibus_5),
 obl(superesset_15, errore_9),
 obl(superesset_15, arma_12),
 nsubj(superesset_15, nihil_10),
 case(arma_12, praeter_11),
 cc(naues_14, et_13),
 conj(arma_12, naues_14),
 obl(superesset_15, quibus_5),
 obl(superesset_15, errore_9),
 obl(superesset_15, arma_12),
 punct(agerent_21, ,_16),
 punct(agerent_21, ,_22),
 mark(agerent_21, cum_17),
 obj(agerent_21, praedam_18),
 case(agris_20, ex_19),
 obl(agerent_21, agris_20),
 punct(agerent_21, ,_16),
 punct(agerent_21, ,_22),
 advcl(superesset_15, agerent_21),
 acl:relcl(Troiani_2, superesset_15),
 nsubj:pass(egressi_1, Troiani_2),
 amod(rex_24, Latinus_23),
 nsubj(tenebant_29, qui_26),
 advmod:emph(tenebant_29, tum_27),
 obj(tenebant_29, ea_28),
 obj(ten

In [95]:
a_tree.print_tree()

root | egressi_1
    └─ advmod | Ibi_0
    └─ nsubj:pass | Troiani_2
        └─ acl:relcl | superesset_15
            └─ punct | ,_3
            └─ mark | ut_4
            └─ obl | quibus_5
            └─ obl:arg | immenso_7
                └─ case | ab_6
            └─ obl | errore_9
                └─ case | prope_8
            └─ nsubj | nihil_10
            └─ obl | arma_12
                └─ case | praeter_11
                └─ conj | naues_14
                    └─ cc | et_13
            └─ advcl | agerent_21
                └─ punct | ,_16
                └─ mark | cum_17
                └─ obj | praedam_18
                └─ obl | agris_20
                    └─ case | ex_19
                └─ punct | ,_22
    └─ conj | rex_24
        └─ amod | Latinus_23
        └─ orphan | Aboriginesque_25
            └─ acl:relcl | tenebant_29
                └─ nsubj | qui_26
                └─ advmod:emph | tum_27
                └─ obj | ea_28
                └─ obj | loca_30
            

# Feature extraction <a name="features"></a>

The CLTK offers the function `cltk_doc_to_features_table()`, which assists users in preparing a `Doc` as training data for machine learning. It converts the list of `Word` objects at `Doc.words` into a tabular list of lists.

In [32]:
from cltk.utils.feature_extraction import cltk_doc_to_features_table

In [33]:
feature_names, list_of_list_features = cltk_doc_to_features_table(cltk_doc=cltk_doc)

In [34]:
# See here the names of the features extracted
print(feature_names)

['string', 'lemma', 'embedding', 'is_stop', 'lemma', 'pos', 'case', 'gender', 'animacy', 'number', 'definiteness', 'degree', 'strength', 'verbform', 'tense', 'mood', 'aspect', 'voice', 'person', 'polarity', 'politeness', 'clusivity', 'evidentiality', 'strength', 'governing_word', 'governor_case', 'governor_gender', 'governor_animacy', 'governor_number', 'governor_definiteness', 'governor_degree', 'governor_strength', 'governor_verbform', 'governor_tense', 'governor_mood', 'governor_aspect', 'governor_voice', 'governor_person', 'governor_polarity', 'governor_politeness', 'governor_clusivity', 'governor_evidentiality', 'governor_strength', 'governing_relationship']
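
Note that the header above repeats some names (`lemma`, `strength`, and `governor_strength` each appear twice), so building a `dict` per row with `dict(zip(...))` would silently keep only the last value for each repeated name. A minimal sketch of one workaround, using a hypothetical helper `make_unique` (not part of the CLTK) that suffixes repeated names:

```python
from collections import Counter
from typing import List


def make_unique(names: List[str]) -> List[str]:
    """Suffix the 2nd, 3rd, ... occurrence of a name with `_2`, `_3`, ..."""
    seen: Counter = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# A shortened header and row, for illustration only;
# `None` stands for values not shown in this example
header = ["string", "lemma", "is_stop", "lemma", "pos"]
row = ["annum", "annus", None, "annus", None]

naive = dict(zip(header, row))  # the repeated "lemma" key collapses
safe = dict(zip(make_unique(header), row))  # every column is preserved

print(len(naive), len(safe))
print(list(safe))
```

This simple scheme assumes that no original column is already named like `lemma_2`; a stricter version would check for such collisions.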


In [99]:
# Number of "inner lists" matches number of tokens
print("Number tokens:", len(cltk_doc.words))
print("len() of feature instances (one for each token):", len(list_of_list_features))

Number tokens: 11735
len() of feature instances (one for each token): 11735


In [102]:
# Look at one row of data `(variable name, variable value)`
from pprint import pprint

pprint(list(zip(feature_names, list_of_list_features[108])))

[('string', 'annum'),
 ('lemma', 'annus'),
 ('embedding',
  array([ 0.17616  , -0.13633  , -0.090246 , -0.0033078, -0.10723  ,
       -0.46832  ,  0.048903 , -0.38701  ,  0.055101 ,  0.40824  ,
       -0.31564  ,  0.40112  , -0.28038  ,  0.11526  ,  0.24486  ,
        0.021453 ,  0.22076  ,  0.27146  ,  0.19096  , -0.1995   ,
       -0.1798   , -0.015442 ,  0.44574  , -0.23831  , -0.094208 ,
        0.039914 , -0.37496  ,  0.029465 ,  0.20246  , -0.038341 ,
       -0.23648  , -0.33928  ,  0.41969  ,  0.058407 ,  0.1202   ,
        0.31858  ,  0.24648  , -0.51506  , -0.16672  , -0.095096 ,
        0.23202  ,  0.55107  ,  0.195    , -0.044926 ,  0.030759 ,
        0.24807  , -0.081363 , -0.03986  , -0.28034  , -0.27438  ,
       -0.23777  , -0.0058744, -0.3847   , -0.099186 , -0.098886 ,
       -0.3086   ,  0.13187  ,  0.45063  , -0.084955 , -0.020178 ,
       -0.041306 , -0.028307 , -0.62158  ,  0.027616 ,  0.30818  ,
       -0.44131  , -0.25724  ,  0.022891 ,  0.2309   , -0.24465  ,
  

# Brief demonstration of `NLP()` for Ancient Greek <a name="greek-nlp"></a>

The API for Greek is the same as for Latin.

In [93]:
# read the Ancient Greek file
with open("grc-thucydides.txt", encoding="utf-8") as fo:
    thucydides_full = fo.read()

In [38]:
print("Text snippet:", thucydides_full[0:200])
print("Character count:", len(thucydides_full))
print("Approximate token count:", len(thucydides_full.split()))

Text snippet: Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τ
Character count: 470056
Approximate token count: 72619


In [39]:
len(thucydides_full) // 7

67150

In [40]:
# Cut this down to roughly 10k tokens for this demonstration's purposes
thucydides = thucydides_full[:len(thucydides_full) // 7]
print("Approximate token count:", len(thucydides.split()))

Approximate token count: 10309


In [41]:
thucydides[:200]

'Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τ'

In [42]:
cltk_nlp_grc = NLP(language="grc")

In [43]:
# Execution time is 57 sec on a 2015 Macbook Pro
%time cltk_doc_grc = cltk_nlp_grc.analyze(text=thucydides)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 57.3 s, sys: 7 s, total: 1min 4s
Wall time: 58.3 s


In [44]:
print("`Doc.tokens`:", cltk_doc_grc.tokens[:20])

`Doc.tokens`: ['Θουκυδίδης', 'Ἀθηναῖος', 'ξυνέγραψε', 'τὸν', 'πόλεμον', 'τῶν', 'Πελοποννησίων', 'καὶ', 'Ἀθηναίων,', 'ὡς', 'ἐπολέμησαν', 'πρὸς', 'ἀλλήλους,', 'ἀρξάμενος', 'εὐθὺς', 'καθισταμένου', 'καὶ', 'ἐλπίσας', 'μέγαν', 'τε']


In [45]:
print(cltk_doc_grc.words[4])  # πόλεμον ('war')

Word(index_char_start=None, index_char_stop=None, index_token=4, index_sentence=0, string='πόλεμον', pos=noun, lemma='πόλεμος', stem=None, scansion=None, xpos='Nb', upos='NOUN', dependency_relation='obj', governor=2, features={Case: [accusative], Gender: [masculine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, embedding=array([ 0.026979, -0.201583,  0.403359, -0.283483, -0.009569, -0.199355,
        0.045749,  0.004704,  0.142157,  0.256203,  0.597056,  0.051038,
       -0.163049, -0.373365,  0.036745, -0.091862,  0.297455,  0.215132,
        0.038652, -0.162991, -0.365283, -0.290551,  0.702512, -0.229625,
        0.36157 , -0.160029,  0.142917, -0.015726,  0.27188 ,  0.122487,
        0.138659,  0.046546,  0.257647,  0.176612, -0.379691, -0.106146,
       -0.111118, -0.351822, -0.149012,  0.266751,  0.07048 ,  0.036137,
       -0.087169, -0.609234,  0.015767,  0.18534 ,  0.090315, -0.163389,
        0.080256, -0.213097, -0.208633, -0.153822, -0.126109,  0.170786,
   

In [87]:
a_tree_grc = DependencyTree.to_tree(cltk_doc_grc.sentences[0])

In [91]:
pprint(a_tree_grc.get_dependencies())

[appos(Θουκυδίδης_0, Ἀθηναῖος_1),
 nsubj(ξυνέγραψε_2, Θουκυδίδης_0),
 det(πόλεμον_4, τὸν_3),
 det(Πελοποννησίων_6, τῶν_5),
 cc(Πελοποννησίων_6, καὶ_7),
 conj(Πελοποννησίων_6, Ἀθηναίων,_8),
 nmod(πόλεμον_4, Πελοποννησίων_6),
 obj(ξυνέγραψε_2, πόλεμον_4),
 mark(ἐπολέμησαν_10, ὡς_9),
 case(ἀλλήλους,_12, πρὸς_11),
 obl(ἐπολέμησαν_10, ἀλλήλους,_12),
 cc(μέγαν_18, τε_19),
 cc(μέγαν_18, καὶ_21),
 cop(μέγαν_18, ἔσεσθαι_20),
 cc(μέγαν_18, τε_19),
 cc(μέγαν_18, καὶ_21),
 det(προγεγενημένων,_24, τῶν_23),
 iobj(ἀξιολογώτατον_22, προγεγενημένων,_24),
 conj(μέγαν_18, ἀξιολογώτατον_22),
 xcomp(ἐλπίσας_17, μέγαν_18),
 mark(ᾖσαν_29, ὅτι_26),
 advcl(ᾖσαν_29, ἀκμάζοντές_27),
 advcl(ᾖσαν_29, ὁρῶν_40),
 cc(ᾖσαν_29, τε_28),
 cc(ᾖσαν_29, καὶ_36),
 case(αὐτὸν_31, ἐς_30),
 obl(ᾖσαν_29, αὐτὸν_31),
 obl(ᾖσαν_29, παρασκευῇ_33),
 nsubj(ᾖσαν_29, ἀμφότεροι_32),
 det(παρασκευῇ_33, τῇ_34),
 det(παρασκευῇ_33, πάσῃ_35),
 det(παρασκευῇ_33, τῇ_34),
 det(παρασκευῇ_33, πάσῃ_35),
 obl(ᾖσαν_29, αὐτὸν_31),
 obl(ᾖσαν_29, παρασκ

In [92]:
print(cltk_doc_grc.sentences_strings[0])
print("")
print("Translation:", "Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.")
print("")
a_tree_grc.print_tree()

Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον.

Translation: Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.

root | ξυνέγραψε_2
    └─ nsubj | Θουκυδίδης_0
        └─ appos | Ἀθηναῖος_1
    └─ obj | πόλε

In [49]:
feature_names_grc, list_of_list_features_grc = cltk_doc_to_features_table(cltk_doc=cltk_doc_grc)

In [50]:
print(feature_names_grc)

['string', 'lemma', 'embedding', 'is_stop', 'lemma', 'pos', 'case', 'gender', 'animacy', 'number', 'definiteness', 'degree', 'strength', 'verbform', 'tense', 'mood', 'aspect', 'voice', 'person', 'polarity', 'politeness', 'clusivity', 'evidentiality', 'strength', 'governing_word', 'governor_case', 'governor_gender', 'governor_animacy', 'governor_number', 'governor_definiteness', 'governor_degree', 'governor_strength', 'governor_verbform', 'governor_tense', 'governor_mood', 'governor_aspect', 'governor_voice', 'governor_person', 'governor_polarity', 'governor_politeness', 'governor_clusivity', 'governor_evidentiality', 'governor_strength', 'governing_relationship']


In [51]:
print("len() of feature instances (one for each token):", len(list_of_list_features_grc))
print("")
print("Example of one instance row:", list_of_list_features_grc[4])

len() of feature instances (one for each token): 10337

Example of one instance row: ['πόλεμον', 'πόλεμος', array([ 0.026979, -0.201583,  0.403359, -0.283483, -0.009569, -0.199355,
        0.045749,  0.004704,  0.142157,  0.256203,  0.597056,  0.051038,
       -0.163049, -0.373365,  0.036745, -0.091862,  0.297455,  0.215132,
        0.038652, -0.162991, -0.365283, -0.290551,  0.702512, -0.229625,
        0.36157 , -0.160029,  0.142917, -0.015726,  0.27188 ,  0.122487,
        0.138659,  0.046546,  0.257647,  0.176612, -0.379691, -0.106146,
       -0.111118, -0.351822, -0.149012,  0.266751,  0.07048 ,  0.036137,
       -0.087169, -0.609234,  0.015767,  0.18534 ,  0.090315, -0.163389,
        0.080256, -0.213097, -0.208633, -0.153822, -0.126109,  0.170786,
        0.364725, -0.21801 ,  0.445456, -0.046112,  0.442441, -0.444369,
        0.045849, -0.107836,  0.046578, -0.171076,  0.377415,  0.554179,
        0.566092, -0.06977 ,  0.598988,  0.273217, -0.489505,  0.067709,
       -0.388921

In [90]:
# Putting these together for easier reading
from pprint import pprint

pprint(list(zip(feature_names_grc, list_of_list_features_grc[4])))

[('string', 'πόλεμον'),
 ('lemma', 'πόλεμος'),
 ('embedding',
  array([ 0.026979, -0.201583,  0.403359, -0.283483, -0.009569, -0.199355,
        0.045749,  0.004704,  0.142157,  0.256203,  0.597056,  0.051038,
       -0.163049, -0.373365,  0.036745, -0.091862,  0.297455,  0.215132,
        0.038652, -0.162991, -0.365283, -0.290551,  0.702512, -0.229625,
        0.36157 , -0.160029,  0.142917, -0.015726,  0.27188 ,  0.122487,
        0.138659,  0.046546,  0.257647,  0.176612, -0.379691, -0.106146,
       -0.111118, -0.351822, -0.149012,  0.266751,  0.07048 ,  0.036137,
       -0.087169, -0.609234,  0.015767,  0.18534 ,  0.090315, -0.163389,
        0.080256, -0.213097, -0.208633, -0.153822, -0.126109,  0.170786,
        0.364725, -0.21801 ,  0.445456, -0.046112,  0.442441, -0.444369,
        0.045849, -0.107836,  0.046578, -0.171076,  0.377415,  0.554179,
        0.566092, -0.06977 ,  0.598988,  0.273217, -0.489505,  0.067709,
       -0.388921,  0.130601, -0.258088,  0.148842, -0.084431