# Introduction

TODO

# Get data

The following obtain two plaintext documents of two Classical authors. A subset of each will be used to demonstrate the CLTK.

In [2]:
# Get Latin text
# https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370
# now at `notebooks/lat-livy.txt`
!curl -O https://gist.github.com/kylepjohnson/2f9376fcf15699c250a0d09b37683370/raw/4b98b15017b1bd31e77447309bd9b7cb9086349c/lat-livy.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0


In [1]:
# Get Ancient Greek text
# https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0
# now at `notebooks/grc-thucydides.txt`
!curl -O https://gist.github.com/kylepjohnson/9835c36fb06ca30ebf29b7f2c7bd29e0/raw/8f5aa440363dc66952bb1eb12effc7d3ada101a8/grc-thucydides.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0


In [75]:
# read the Latin file
# We'll run the full demonstration in the Latin language first
with open("lat-livy.txt") as fo:
    livy_full = fo.read()

In [76]:
print("Text snippet:", livy_full[200:400])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))

Text snippet: riptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem vetustatem superaturos credunt. utcumque erit, iuvabit tamen rerum gestarum memoriae principis terrarum populi pro virili pa
Character count: 3580331
Approximate token count: 503818


In [77]:
len(livy_full) // 50

71606

In [78]:
# Now let's cut this down to roughly 10k tokens for this demonstration's purposes
livy = livy_full[:len(livy_full) // 50]
print("Approximate token count:", len(livy.split()))

Approximate token count: 10209


# Run NLP pipeline with `NLP()`

In [79]:
# For most users, this is the only import required
from cltk import NLP

In [80]:
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")

In [81]:
# Now execute NLP algorithms upon input text
# Execution time is 54 sec on a 2015 Macbook Pro
%time cltk_doc = cltk_nlp.analyze(text=livy)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

CPU times: user 50.8 s, sys: 7.84 s, total: 58.7 s
Wall time: 51.9 s


# Inspect CLTK `Doc`

In [82]:
# We can now inspect the result
print(type(cltk_doc))

<class 'cltk.core.data_types.Doc'>


In [83]:
# All accessors
print([x for x in dir(cltk_doc) if not x.startswith("__")])

['_get_words_attribute', 'embeddings', 'embeddings_model', 'language', 'lemmata', 'morphosyntactic_features', 'pipeline', 'pos', 'raw', 'sentences', 'sentences_strings', 'sentences_tokens', 'stanza_doc', 'stems', 'tokens', 'tokens_stops_filtered', 'words']


In [84]:
# Several of the more useful

# List of tokens
print(cltk_doc.tokens[:20])

['facturusne', 'operae', 'pretium', 'sim', ',', 'si', 'a', 'primordio', 'urbis', 'res', 'populi', 'Romani', 'perscripserim', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si']


In [85]:
# List of lemmas
print(cltk_doc.lemmata[:20])

['facturusne', 'opus', 'pretium', 'sum', ',', 'si', 'ab', 'primordius', 'urbis', 'res', 'populus', 'momanum', 'perscribo', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si']


In [86]:
# Basic part-of-speech info
print(cltk_doc.pos[:20])

['ADV', 'NOUN', 'NOUN', 'AUX', 'PUNCT', 'SCONJ', 'ADP', 'ADJ', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PUNCT', 'CCONJ', 'ADV', 'VERB', 'CCONJ', 'PUNCT', 'SCONJ']


In [92]:
# A list of list of tokens
print(cltk_doc.sentences_tokens[:2])

[['facturusne', 'operae', 'pretium', 'sim', ',', 'si', 'a', 'primordio', 'urbis', 'res', 'populi', 'Romani', 'perscripserim', ',', 'nec', 'satis', 'scio', 'nec', ',', 'si', 'sciam', ',', 'dicere', 'ausim', ',', 'quippe', 'qui', 'cum', 'veterem', 'tum', 'vulgatam', 'esse', 'rem', 'videam', ',', 'dum', 'novi', 'semper', 'scriptores', 'aut', 'in', 'rebus', 'certius', 'aliquid', 'allaturos', 'se', 'aut', 'scribendi', 'arte', 'rudem', 'vetustatem', 'superaturos', 'credunt', '.'], ['utcumque', 'erit', ',', 'iuvabit', 'tamen', 'rerum', 'gestarum', 'memoriae', 'principis', 'terrarum', 'populi', 'pro', 'virili', 'parte', 'et', 'ipsum', 'consuluisse', ';']]


# Inspect CLTK `Word`

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all information that was generated during the NLP pipeline

In [87]:
# One ``Word`` object for each token
print(len(cltk_doc.words))

11735


Users can go token-by-token via ``Doc.words`` or via the intermediary step of looping through sentences.

In [123]:
# Let's look at a non-trivial sentence from Book 1
print("Original:", cltk_doc.sentences_strings[26])
print("")
print("Translation:", "Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.")
# source: http://www.perseus.tufts.edu/hopper/text?doc=Liv.+1+1+5&fromdoc=Perseus%3Atext%3A1999.02.0151
sentence_26 = cltk_doc.sentences[26]  # type: List[Word]

Original: Ibi egressi Troiani , ut quibus ab immenso prope errore nihil praeter arma et naues superesset , cum praedam ex agris agerent , Latinus rex Aboriginesque qui tum ea tenebant loca ad arcendam vim advenarum armati ex urbe atque agris concurrunt .

Translation: Landing there, the Trojans, as men who, after their all but immeasurable wanderings, had nothing left but their swords and ships, were driving booty from the fields, when King Latinus and the Aborigines, who then occupied that country, rushed down from their city and their fields to repel with arms the violence of the invaders.


In [129]:
# Looking at one Word, 'concurrunt' ('they run together')
sentence_26[40]

Word(index_char_start=None, index_char_stop=None, index_token=40, index_sentence=26, string='concurrunt', pos=verb, lemma='concurro', stem=None, scansion=None, xpos='L3|modA|tem1|gen9', upos='VERB', dependency_relation='acl:relcl', governor=33, features={Mood: [indicative], Number: [plural], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}, category={F: [neg], N: [neg], V: [pos]}, embedding=array([-0.16746  , -0.18548  ,  0.30632  , -0.29627  , -0.27262  ,
       -0.0767   ,  0.19405  ,  0.12386  , -0.0076342,  0.13037  ,
        0.17128  ,  0.1189   , -0.22169  , -0.57089  ,  0.28066  ,
       -0.14514  , -0.041256 , -0.021754 ,  0.02212  , -0.25983  ,
        0.53374  , -0.042267 ,  0.27314  ,  0.083616 ,  0.30746  ,
        0.087764 , -0.10098  ,  0.22689  , -0.17577  , -0.35894  ,
       -0.39609  ,  0.43406  ,  0.21306  ,  0.26909  ,  0.099561 ,
        0.26916  , -0.46547  ,  0.1416   , -0.21319  , -0.15126  ,
        0.36604  , -0.020737 ,  0.42397  ,  0.0

In this word, you can see information for lexicography (`.lemmata`), semantics (`.embedding`), morphology (`.pos`, `.features`), syntax (`.governor`, `.dependency_relation`), plus other information most users would find helpful (`.stop`, `.named_entity`).

# Modeling morphology

TODO

# Modeling syntax

TODO

# Feature extraction

The following give some examples of helpers which assist in preparing `Doc` information for machine learning.

TODO

## Extras

In [30]:
# View default processes
print(cltk_nlp.pipeline)

LatinPipeline(description='Pipeline for the Latin language', processes=[<class 'cltk.dependency.processes.LatinStanzaProcess'>, <class 'cltk.embeddings.processes.LatinEmbeddingsProcess'>, <class 'cltk.stops.processes.StopsProcess'>, <class 'cltk.ner.processes.LatinNERProcess'>], language=Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a'))


In [114]:
#[(i, s) for i, s in enumerate(cltk_doc.sentences_strings[:100])]