# CLTK Readers demo notebook

*A note on using Jupyter notebooks in this live demo*  

- TL;DR: "SHIFT + ENTER/RETURN" will do pretty much everything you need for the demo.
- Look for the comment "## Imports" below and click anywhere in its surrounding box to activate that input cell. These cells contain the code that we will run in today's demo.
- If you press "SHIFT + ENTER/RETURN", you will *run* that cell and the output will appear in the space below the cell; you will also advance to the next cell. If you press "CTRL + ENTER/RETURN", you will run the cell without advancing. If you press "OPTION (or ALT) + ENTER/RETURN", you will run the cell and insert a new empty cell below it.
- Note the brackets on the left side of the cells. If they are empty (that is, if there is blank space between the brackets), the cell has not yet been run. If there is a number between the brackets, the cell has been run, and the number denotes the order in which the cells were run.

## A reader example—the LatinTesseraeCorpusReader

In [1]:
## Imports

from cltkreaders.lat import LatinTesseraeCorpusReader

from os.path import expanduser
from natsort import natsorted
from tqdm import tqdm
from pprint import pprint

In [2]:
## Set up reader
# NB: If you do not have the CLTK-Tesserae corpus already installed in CLTK_DATA, you will be prompted to download the corpus.

T = LatinTesseraeCorpusReader()

### Working with files

In [3]:
## First 10 filenames

pprint(T.fileids()[:10])

['ammianus.rerum_gestarum.part.14.tess',
 'ammianus.rerum_gestarum.part.15.tess',
 'ammianus.rerum_gestarum.part.16.tess',
 'ammianus.rerum_gestarum.part.17.tess',
 'ammianus.rerum_gestarum.part.18.tess',
 'ammianus.rerum_gestarum.part.19.tess',
 'ammianus.rerum_gestarum.part.20.tess',
 'ammianus.rerum_gestarum.part.21.tess',
 'ammianus.rerum_gestarum.part.22.tess',
 'ammianus.rerum_gestarum.part.23.tess']


In [4]:
## First 10 works of Cicero

cicero = [file for file in T.fileids() if 'cicero' in file]
pprint(cicero[:10])

['cicero.academica.tess',
 'cicero.brutus.tess',
 'cicero.cum_populo_gratias_egit.tess',
 'cicero.de_amicitia.tess',
 'cicero.de_divinatione.tess',
 'cicero.de_domo_sua.tess',
 'cicero.de_fato.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.1.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.2.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.3.tess']


In [5]:
## Books of the Aeneid, sorted

aeneid = natsorted([file for file in T.fileids() if 'aeneid' in file])
pprint(aeneid)

['vergil.aeneid.part.1.tess',
 'vergil.aeneid.part.2.tess',
 'vergil.aeneid.part.3.tess',
 'vergil.aeneid.part.4.tess',
 'vergil.aeneid.part.5.tess',
 'vergil.aeneid.part.6.tess',
 'vergil.aeneid.part.7.tess',
 'vergil.aeneid.part.8.tess',
 'vergil.aeneid.part.9.tess',
 'vergil.aeneid.part.10.tess',
 'vergil.aeneid.part.11.tess',
 'vergil.aeneid.part.12.tess']
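
Plain lexicographic sorting would place `part.10` before `part.2`, which is why the cell above uses `natsorted`. The sketch below shows the difference using only the standard library. `natural_key` is a hypothetical helper written for this illustration; it is not part of `natsort` and only approximates its behavior for filenames shaped like these.

```python
import re

def natural_key(fileid):
    # Hypothetical helper: split into digit and non-digit runs so that
    # numeric parts compare as integers rather than as strings.
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', fileid)]

files = [f'vergil.aeneid.part.{n}.tess' for n in (1, 2, 10, 11)]

plain = sorted(files)                      # string order: 1, 10, 11, 2
natural = sorted(files, key=natural_key)   # numeric order: 1, 2, 10, 11

print(plain)
print(natural)
```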


### Working with doc structures

In [6]:
catullus = 'catullus.carmina.tess'

In [7]:
## Docs

catullus_doc = T.docs(catullus)
print(next(catullus_doc)[:446])

<cat. 1.1>	Cui dono lepidum novum libellum
<cat. 1.2>	arido modo pumice expolitum?
<cat. 1.3>	Corneli, tibi; namque tu solebas
<cat. 1.4>	meas esse aliquid putare nugas,
<cat. 1.5>	iam tum cum ausus es unus Italorum
<cat. 1.6>	omne aevum tribus explicare chartis,
<cat. 1.7>	doctis, Iuppiter, et laboriosis!
<cat. 1.8>	quare habe tibi quidquid hoc libelli
<cat. 1.9>	qualecumque, quod, o patrona virgo,
<cat. 1.10>	plus uno maneat perenne saeclo.


In [8]:
## Texts

catullus_text = T.texts(catullus)
print(next(catullus_text)[:335])

Cui dono lepidum novum libellum
arido modo pumice expolitum?
Corneli, tibi; namque tu solebas
meas esse aliquid putare nugas,
iam tum cum ausus es unus Italorum
omne aevum tribus explicare chartis,
doctis, Iuppiter, et laboriosis!
quare habe tibi quidquid hoc libelli
qualecumque, quod, o patrona virgo,
plus uno maneat perenne saeclo.


In [9]:
## Doc Rows

catullus_docrows = T.doc_rows(catullus)

print('This is a string representation of what the output dictionary looks like...')
print(f'{str(next(catullus_docrows))[:94]} etc. }}\n')


catullus_docrows = T.doc_rows(catullus)
print('Here are the first 10 items of the dict output...')
pprint(list(next(catullus_docrows).items())[:10])


This is a string representation of what the output dictionary looks like...
{'<cat. 1.1>': 'Cui dono lepidum novum libellum', '<cat. 1.2>': 'arido modo pumice expolitum?' etc. }

Here are the first 10 items of the dict output...
[('<cat. 1.1>', 'Cui dono lepidum novum libellum'),
 ('<cat. 1.2>', 'arido modo pumice expolitum?'),
 ('<cat. 1.3>', 'Corneli, tibi; namque tu solebas'),
 ('<cat. 1.4>', 'meas esse aliquid putare nugas,'),
 ('<cat. 1.5>', 'iam tum cum ausus es unus Italorum'),
 ('<cat. 1.6>', 'omne aevum tribus explicare chartis,'),
 ('<cat. 1.7>', 'doctis, Iuppiter, et laboriosis!'),
 ('<cat. 1.8>', 'quare habe tibi quidquid hoc libelli'),
 ('<cat. 1.9>', 'qualecumque, quod, o patrona virgo,'),
 ('<cat. 1.10>', 'plus uno maneat perenne saeclo.')]
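
To make the relationship between `docs` and `doc_rows` concrete, here is a minimal sketch of how citation/text pairs could be pulled out of raw Tesserae-style lines. It assumes each row is a citation tag, a tab, and then the text, as the `docs` output above suggests. `parse_tess_rows` is a hypothetical helper for illustration and is not the reader's own implementation.

```python
def parse_tess_rows(raw):
    # Hypothetical helper: map each citation tag to its line of text,
    # assuming one "<citation>\ttext" pair per line.
    rows = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        citation, _, text = line.partition('\t')
        rows[citation.strip()] = text.strip()
    return rows

sample = (
    "<cat. 1.1>\tCui dono lepidum novum libellum\n"
    "<cat. 1.2>\tarido modo pumice expolitum?\n"
)

rows = parse_tess_rows(sample)
print(rows['<cat. 1.2>'])
```

Because the result is an ordinary dict, the citation tag can be used directly as a lookup key.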


### Working with doc units (i.e. philological units)

In [10]:
catilinam = 'cicero.in_catilinam.tess'

In [11]:
## Paras

catilinam_paras = T.paras(catilinam)

for i in range(1,6):
    print(f'Para {i}: {next(catilinam_paras)}')
    print()

Para 1: quo usque tandem abutere, Catilina, patientia nostra? quam diu etiam furor iste tuus nos eludet? quem ad finem sese effrenata iactabit audacia? nihilne te nocturnum praesidium Palati, nihil urbis vigiliae, nihil timor populi, nihil concursus bonorum omnium, nihil hic munitissimus habendi senatus locus, nihil horum ora voltusque moverunt? patere tua consilia non sentis, constrictam iam horum omnium scientia teneri coniurationem tuam non vides? quid proxima, quid superiore nocte egeris, ubi fueris, quos convocaveris, quid consili ceperis quem nostrum ignorare arbitraris?

Para 2: O tempora, o mores! senatus haec intellegit, consul videt; hic tamen vivit. vivit? immo vero etiam in senatum venit, fit publici consili particeps, notat et designat oculis ad caedem unum quemque nostrum. nos autem fortes viri satis facere rei publicae videmur, si istius furorem ac tela vitamus. ad mortem te, Catilina, duci iussu consulis iam pridem oportebat, in te conferri pestem quam tu in nos omnis i

Note that for the Tesserae texts, `paras` are *not* truly implemented, as they are not consistently marked in the original files. For prose texts, what `paras` returns is the entire section by citation. The verse texts (which are defined by default as cited sections less than 75 characters long) yield the entire text as a single para, as for *Aeneid* 1 below...

In [12]:
## Paras

aeneid_paras = T.paras('vergil.aeneid.part.1.tess')

for i in range(1,2):
    print(f'Para {i}: {next(aeneid_paras)}')
    print()

Para 1: Arma virumque cano, Troiae qui primus ab oris
Italiam, fato profugus, Laviniaque venit
litora, multum ille et terris iactatus et alto
vi superum saevae memorem Iunonis ob iram;
multa quoque et bello passus, dum conderet urbem,
inferretque deos Latio, genus unde Latinum,
Albanique patres, atque altae moenia Romae.
Musa, mihi causas memora, quo numine laeso,
quidve dolens, regina deum tot volvere casus
insignem pietate virum, tot adire labores
impulerit. Tantaene animis caelestibus irae?
Urbs antiqua fuit, Tyrii tenuere coloni,
Karthago, Italiam contra Tiberinaque longe
ostia, dives opum studiisque asperrima belli;
quam Iuno fertur terris magis omnibus unam
posthabita coluisse Samo; hic illius arma,
hic currus fuit; hoc regnum dea gentibus esse,
si qua fata sinant, iam tum tenditque fovetque.
Progeniem sed enim Troiano a sanguine duci
audierat, Tyrias olim quae verteret arces;
hinc populum late regem belloque superbum
venturum excidio Libyae: sic volvere Parcas.
Id metuens, veter
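
The prose/verse distinction described above rests on the length of the cited sections. The sketch below illustrates one way such a check could work. It is only an assumption that the average section length is what gets compared against the 75-character threshold; `looks_like_verse` is a hypothetical helper and not the reader's actual logic.

```python
from statistics import mean

def looks_like_verse(sections, threshold=75):
    # Hypothetical heuristic: treat a text as verse when its cited
    # sections average fewer than `threshold` characters.
    return mean(len(section) for section in sections) < threshold

aeneid_lines = [
    'Arma virumque cano, Troiae qui primus ab oris',
    'Italiam, fato profugus, Laviniaque venit',
]
catilinam_sections = [
    'quo usque tandem abutere, Catilina, patientia nostra? quam diu etiam '
    'furor iste tuus nos eludet? quem ad finem sese effrenata iactabit audacia?',
]

print(looks_like_verse(aeneid_lines))        # short cited lines
print(looks_like_verse(catilinam_sections))  # one long cited section
```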

In [13]:
# Sents

# By default, segmentation, tokenization, and other tagging is done using the spaCy model 'la_dep_cltk_sm'

catilinam_sents = T.sents(catilinam)

for i in range(1,6):
    print(f'Sent {i}: {next(catilinam_sents)}')

Sent 1: quo usque tandem abutere, Catilina, patientia nostra?
Sent 2: quam diu etiam furor iste tuus nos eludet?
Sent 3: quem ad finem sese effrenata iactabit audacia?
Sent 4: nihilne te nocturnum praesidium Palati, nihil urbis vigiliae, nihil timor populi, nihil concursus bonorum omnium, nihil hic munitissimus habendi senatus locus, nihil horum ora voltusque moverunt?
Sent 5: patere tua consilia non sentis, constrictam iam horum omnium scientia teneri coniurationem tuam non vides?


In [14]:
# Words

# By default, segmentation, tokenization, and other tagging is done using the spaCy model 'la_dep_cltk_sm'

catilinam_words = T.words(catilinam)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')


Word 1: quo
Word 2: usque
Word 3: tandem
Word 4: abutere
Word 5: ,
Word 6: Catilina
Word 7: ,
Word 8: patientia
Word 9: nostra


In [15]:
# You can pass a preprocessor to `words` 

def custom_preprocess(text):
    text = text.lower()
    return text

catilinam_words = T.words(catilinam, preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')


Word 1: quo
Word 2: usque
Word 3: tandem
Word 4: abutere
Word 5: ,
Word 6: catilina
Word 7: ,
Word 8: patientia
Word 9: nostra


In [16]:
# Tokenized sents

# i.e. Sents in the form of a list of tuples of the form `(token, lemma, tag)`

catilinam_tokenized_sents = T.tokenized_sents(catilinam)

for i in range(1,4):
    print(f'Tok Sent {i}: {next(catilinam_tokenized_sents)}')
    print()


Tok Sent 1: [('quo', 'qui', 'PRON'), ('usque', 'usque', 'ADV'), ('tandem', 'tandem', 'ADV'), ('abutere', 'abutor', 'VERB'), (',', ',', 'PUNCT'), ('Catilina', 'Catilina', 'NOUN'), (',', ',', 'PUNCT'), ('patientia', 'patientia', 'NOUN'), ('nostra', 'noster', 'DET'), ('?', '?', 'PUNCT')]

Tok Sent 2: [('quam', 'qui', 'PRON'), ('diu', 'diu', 'ADV'), ('etiam', 'etiam', 'ADV'), ('furor', 'furor', 'NOUN'), ('iste', 'iste', 'DET'), ('tuus', 'tuus', 'PRON'), ('nos', 'nos', 'PRON'), ('eludet', 'eludo', 'VERB'), ('?', '?', 'PUNCT')]

Tok Sent 3: [('quem', 'qui', 'PRON'), ('ad', 'ad', 'ADP'), ('finem', 'finis', 'NOUN'), ('sese', 'se', 'PRON'), ('effrenata', 'effreno', 'VERB'), ('iactabit', 'iacto', 'VERB'), ('audacia', 'audacia', 'NOUN'), ('?', '?', 'PUNCT')]



In [17]:
# Tokenized sents, simplified

# i.e. Sents in the form of a list of tokens

catilinam_tokenized_sents = T.tokenized_sents(catilinam, simple=True)

for i in range(1,4):
    print(f'Tok Sent {i}: {next(catilinam_tokenized_sents)}')
    print()


Tok Sent 1: ['quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']

Tok Sent 2: ['quam', 'diu', 'etiam', 'furor', 'iste', 'tuus', 'nos', 'eludet', '?']

Tok Sent 3: ['quem', 'ad', 'finem', 'sese', 'effrenata', 'iactabit', 'audacia', '?']



In [18]:
# POS-tagged sents

# i.e. Sents in the form of a list of strings of the form `token/POS`

catilinam_pos_sents = T.pos_sents(catilinam)

for i in range(1,2):
    print(f'POS Sent {i}: {next(catilinam_pos_sents)}')


POS Sent 1: ['quo/PRON', 'usque/ADV', 'tandem/ADV', 'abutere/VERB', ',/PUNCT', 'Catilina/NOUN', ',/PUNCT', 'patientia/NOUN', 'nostra/DET', '?/PUNCT']
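
Since `pos_sents` yields plain `token/POS` strings, a little post-processing is needed to work with the tags separately. The following sketch uses only the standard library and the sentence printed above; the variable names are illustrative.

```python
from collections import Counter

pos_sent = ['quo/PRON', 'usque/ADV', 'tandem/ADV', 'abutere/VERB', ',/PUNCT',
            'Catilina/NOUN', ',/PUNCT', 'patientia/NOUN', 'nostra/DET', '?/PUNCT']

# Split on the last slash only, so a token that itself contains "/" stays intact.
pairs = [tuple(item.rsplit('/', 1)) for item in pos_sent]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts['PUNCT'])
```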


### Working with doc descriptions

In [19]:
metamorphoses = natsorted([file for file in T.fileids() if 'ovid.metamorphoses' in file])
pprint(metamorphoses)

['ovid.metamorphoses.part.1.tess',
 'ovid.metamorphoses.part.2.tess',
 'ovid.metamorphoses.part.3.tess',
 'ovid.metamorphoses.part.4.tess',
 'ovid.metamorphoses.part.5.tess',
 'ovid.metamorphoses.part.6.tess',
 'ovid.metamorphoses.part.7.tess',
 'ovid.metamorphoses.part.8.tess',
 'ovid.metamorphoses.part.9.tess',
 'ovid.metamorphoses.part.10.tess',
 'ovid.metamorphoses.part.11.tess',
 'ovid.metamorphoses.part.12.tess',
 'ovid.metamorphoses.part.13.tess',
 'ovid.metamorphoses.part.14.tess',
 'ovid.metamorphoses.part.15.tess']


In [20]:
def custom_preprocess(text):
    from cltk.alphabet.lat import JVReplacer
    replacer = JVReplacer()

    text = text.lower() # Lowercase
    text = replacer.replace(text)  # Normalize u/v & i/j

    # Remove punctuation
    punctuation = "\"#$%&\'()*+,/:;<=>@[\\]^_`{|}~.?!«»—“-”"
    misc = '¡£¤¥¦§¨©¯°±²³´µ¶·¸¹º¼½¾¿÷·–‘’†•ↄ∞⏑〈〉（）'
    misc += punctuation
    translator = str.maketrans({key: " " for key in misc})
    text = text.translate(translator)

    # Remove numbers
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    return " ".join(text.split()).strip()

In [21]:
## Concordance, using Tesserae citations

# NB: Concordancing is currently only available for the Tesserae readers

metamorphoses_concordances = T.concordance(metamorphoses, preprocess=custom_preprocess)

met_conc_sample = next(metamorphoses_concordances)
pprint(list(met_conc_sample.items())[:3])

[('a', [('<ov. met. 1.145>', 2), ('<ov. met. 1.587>', 1)]),
 ('ab',
  [('<ov. met. 1.3>', 3),
   ('<ov. met. 1.23>', 4),
   ('<ov. met. 1.34>', 5),
   ('<ov. met. 1.40>', 5),
   ('<ov. met. 1.66>', 4),
   ('<ov. met. 1.80>', 5),
   ('<ov. met. 1.144>', 5),
   ('<ov. met. 1.185>', 7),
   ('<ov. met. 1.233>', 4),
   ('<ov. met. 1.254>', 6),
   ('<ov. met. 1.269>', 5),
   ('<ov. met. 1.313>', 4),
   ('<ov. met. 1.336>', 6),
   ('<ov. met. 1.417>', 6),
   ('<ov. met. 1.431>', 2),
   ('<ov. met. 1.568>', 6),
   ('<ov. met. 1.607>', 5),
   ('<ov. met. 1.672>', 6),
   ('<ov. met. 1.774>', 7)]),
 ('aberant', [('<ov. met. 1.91>', 2)])]


In [22]:
# Concordances are by default built on a file-by-file basis, but can easily be combined with the `compiled` parameter

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)

full_met_conc_sample = next(metamorphoses_concordances)
pprint(list(full_met_conc_sample.items())[96:102])

[('accensis', [('<ov. met. 12.12>', 2)]),
 ('accensum', [('<ov. met. 2.228>', 1)]),
 ('accensus', [('<ov. met. 11.527>', 4)]),
 ('acceperat', [('<ov. met. 3.121>', 5)]),
 ('accepere', [('<ov. met. 9.719>', 0), ('<ov. met. 15.641>', 4)]),
 ('accepisse',
  [('<ov. met. 6.357>', 0),
   ('<ov. met. 14.844>', 4),
   ('<ov. met. 15.481>', 0)])]


In [23]:
# Since the concordances are output as dictionaries, you can retrieve location information using the token as a dict key...

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)
full_met_conc_sample = next(metamorphoses_concordances)

print(f'\'corpus\' appears {len(full_met_conc_sample["corpus"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpus'][:5])

print()

print(f'\'corpora\' appears {len(full_met_conc_sample["corpora"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpora'][:5])

'corpus' appears 67 times in the Metamorphoses.
Here are the first five instances...
[('<ov. met. 1.190>', 5), ('<ov. met. 2.362>', 6), ('<ov. met. 2.611>', 0), ('<ov. met. 2.647>', 2), ('<ov. met. 2.648>', 2)]

'corpora' appears 86 times in the Metamorphoses.
Here are the first five instances...
[('<ov. met. 1.2>', 0), ('<ov. met. 1.156>', 4), ('<ov. met. 1.300>', 5), ('<ov. met. 1.527>', 5), ('<ov. met. 2.235>', 4)]
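
The shape of the concordance output (token mapped to a list of citation/position pairs) can be reproduced on a small scale with the standard library. The sketch below assumes that the second element of each pair is a zero-based token offset within the cited line, which is inferred from the output above and not confirmed against the library. `build_concordance` is a hypothetical helper, and its simple lowercasing and punctuation stripping stand in for a real preprocessor.

```python
from collections import defaultdict

def build_concordance(rows):
    # Hypothetical helper: map each lowercased token to a list of
    # (citation, zero-based position) pairs, with keys sorted alphabetically.
    concordance = defaultdict(list)
    for citation, text in rows.items():
        for position, token in enumerate(text.lower().split()):
            concordance[token.strip('.,;:?!')].append((citation, position))
    return dict(sorted(concordance.items()))

rows = {
    '<cat. 1.1>': 'Cui dono lepidum novum libellum',
    '<cat. 1.3>': 'Corneli, tibi; namque tu solebas',
    '<cat. 1.8>': 'quare habe tibi quidquid hoc libelli',
}

concordance = build_concordance(rows)
print(concordance['tibi'])
```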


In [24]:
# ## Basic descriptive data; note: this takes several minutes to run

# tess_describe = T.describe()
# pprint(tess_describe)

Sample output:  

{'files': 748,  
 'lexdiv': 24.255701516259066,  
 'secs': 143.71532320976257,  
 'sents': 314436,  
 'vocab': 329693,  
 'words': 7996935}  

In [25]:
## This data can also be returned for individual files or lists of files

print('Stats on just the file \'catullus.carmina.tess\'')
pprint(T.describe(catullus))

Stats on just the file 'catullus.carmina.tess'
{'files': 1,
 'lexdiv': 2.6185304598676398,
 'secs': 9.656116008758545,
 'sents': 910,
 'vocab': 5893,
 'words': 15431}


In [26]:
# ## Basic descriptive data; note: this takes several minutes to run

# print('Stats on just the group of files assigned above to the variable `metamorphoses`')
# pprint(T.describe(metamorphoses))

Stats on just the group of files assigned above to the variable `metamorphoses`  

{'files': 15,  
 'lexdiv': 4.389574250997125,  
 'secs': 58.47328305244446,  
 'sents': 6140,  
 'vocab': 21562,  
 'words': 94648}  
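
The `lexdiv` values in the sample outputs above are consistent with the number of words divided by the vocabulary size. That relationship is inferred here from the printed figures and not from the library's source, so treat the check below as a sanity test and not as a description of how `describe` is implemented.

```python
import math

# Figures copied from the sample outputs above.
reported = {
    'all files': {'words': 7996935, 'vocab': 329693, 'lexdiv': 24.255701516259066},
    'catullus': {'words': 15431, 'vocab': 5893, 'lexdiv': 2.6185304598676398},
    'metamorphoses': {'words': 94648, 'vocab': 21562, 'lexdiv': 4.389574250997125},
}

# Recompute the ratio of running words to distinct word forms.
ratios = {name: stats['words'] / stats['vocab'] for name, stats in reported.items()}

for name, ratio in ratios.items():
    match = math.isclose(ratio, reported[name]['lexdiv'], rel_tol=1e-6)
    print(f'{name}: words / vocab = {ratio:.4f} (matches reported lexdiv: {match})')
```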

## Another reader example—the GreekTesseraeCorpusReader

In [27]:
## Imports

from cltkreaders.grc import GreekTesseraeCorpusReader

In [28]:
## Set up reader
# NB: If you do not have the CLTK-Tesserae corpus already installed in CLTK_DATA, you will be prompted to download the corpus.

T = GreekTesseraeCorpusReader()

In [29]:
## First 10 filenames

pprint(T.fileids()[:10])

['achilles_tatius.leucippe_et_clitophon.tess',
 'aelian.de_natura_animalium.tess',
 'aelian.epistulae_rusticae.tess',
 'aelian.varia_historia.tess',
 'aelius_aristides.ars_rhetorica.tess',
 'aelius_aristides.orationes.tess',
 'aeschines.against_ctesiphon.tess',
 'aeschines.against_timarchus.tess',
 'aeschines.on_the_embassy.tess',
 'aeschylus.agamemnon.tess']


In [30]:
apollonius = [file for file in T.fileids() if 'apollonius' in file]

pprint(natsorted(apollonius))

['apollonius.argonautica.part.1.tess',
 'apollonius.argonautica.part.2.tess',
 'apollonius.argonautica.part.3.tess',
 'apollonius.argonautica.part.4.tess']


In [31]:
# Sents

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_sents = T.sents(apollonius)

for i in range(1,6):
    print(f'Sent {i}: {next(apollonius_sents)}')
    print()

Sent 1: ̓Αρχόμενος σέο, Φοῖβε, παλαιγενέων κλέα φωτῶν
μνήσομαι, οἳ Πόντοιο κατὰ στόμα καὶ διὰ πέτρας
Κυανέας βασιλῆος ἐφημοσύνῃ Πελίαο
χρύσειον μετὰ κῶας ἐύζυγον ἤλασαν ̓Αργώ.

Sent 2: Τοίην γὰρ Πελίης φάτιν ἔκλυεν, ὥς μιν ὀπίσσω
μοῖρα μένει στυγερή, τοῦδ' ἀνέρος, ὅντιν' ἴδοιτο
δημόθεν οἰοπέδιλον, ὑπ' ἐννεσίῃσι δαμῆναι.

Sent 3: δηρὸν δ' οὐ μετέπειτα τεὴν κατὰ βάξιν ̓Ιήσων
χειμερίοιο ῥέεθρα κιὼν διὰ ποσσὶν ̓Αναύρου
ἄλλο μὲν ἐξεσάωσεν ὑπ' ἰλύος, ἄλλο δ' ἔνερθεν
κάλλιπεν αὖθι πέδιλον ἐνισχόμενον προχοῇσιν.

Sent 4: ἵκετο δ' ἐς Πελίην αὐτοσχεδὸν ἀντιβολήσων
εἰλαπίνης, ἣν πατρὶ Ποσειδάωνι καὶ ἄλλοις
ῥέζε θεοῖς, ̔́Ηρης δὲ Πελασγίδος οὐκ ἀλέγιζεν.

Sent 5: αἶψα δὲ τόνγ' ἐσιδὼν ἐφράσσατο, καί οἱ ἄεθλον
ἔντυε ναυτιλίης πολυκηδέος, ὄφρ' ἐνὶ πόντῳ
ἠὲ καὶ ἀλλοδαποῖσι μετ' ἀνδράσι νόστον ὀλέσσῃ.



In [32]:
# Sents, with unline

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_sents = T.sents(apollonius, unline=True)

for i in range(1,6):
    print(f'Sent {i}: {next(apollonius_sents)}')
    print()

Sent 1: ̓Αρχόμενος σέο, Φοῖβε, παλαιγενέων κλέα φωτῶν μνήσομαι, οἳ Πόντοιο κατὰ στόμα καὶ διὰ πέτρας Κυανέας βασιλῆος ἐφημοσύνῃ Πελίαο χρύσειον μετὰ κῶας ἐύζυγον ἤλασαν ̓Αργώ.

Sent 2: Τοίην γὰρ Πελίης φάτιν ἔκλυεν, ὥς μιν ὀπίσσω μοῖρα μένει στυγερή, τοῦδ' ἀνέρος, ὅντιν' ἴδοιτο δημόθεν οἰοπέδιλον, ὑπ' ἐννεσίῃσι δαμῆναι.

Sent 3: δηρὸν δ' οὐ μετέπειτα τεὴν κατὰ βάξιν ̓Ιήσων χειμερίοιο ῥέεθρα κιὼν διὰ ποσσὶν ̓Αναύρου ἄλλο μὲν ἐξεσάωσεν ὑπ' ἰλύος, ἄλλο δ' ἔνερθεν κάλλιπεν αὖθι πέδιλον ἐνισχόμενον προχοῇσιν.

Sent 4: ἵκετο δ' ἐς Πελίην αὐτοσχεδὸν ἀντιβολήσων εἰλαπίνης, ἣν πατρὶ Ποσειδάωνι καὶ ἄλλοις ῥέζε θεοῖς, ̔́Ηρης δὲ Πελασγίδος οὐκ ἀλέγιζεν.

Sent 5: αἶψα δὲ τόνγ' ἐσιδὼν ἐφράσσατο, καί οἱ ἄεθλον ἔντυε ναυτιλίης πολυκηδέος, ὄφρ' ἐνὶ πόντῳ ἠὲ καὶ ἀλλοδαποῖσι μετ' ἀνδράσι νόστον ὀλέσσῃ.
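
Judging by the two outputs above, `unline=True` has the effect of replacing the verse line breaks inside each sentence with spaces. A rough standard-library equivalent is sketched below; `join_lines` is a hypothetical stand-in, and the reader's own handling may differ in detail.

```python
def join_lines(sentence):
    # Hypothetical stand-in: collapse line breaks (and any other runs of
    # whitespace) into single spaces.
    return ' '.join(sentence.split())

sent = ("Τοίην γὰρ Πελίης φάτιν ἔκλυεν, ὥς μιν ὀπίσσω\n"
        "μοῖρα μένει στυγερή, τοῦδ' ἀνέρος, ὅντιν' ἴδοιτο\n"
        "δημόθεν οἰοπέδιλον, ὑπ' ἐννεσίῃσι δαμῆναι.")

flat = join_lines(sent)
print(flat)
```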



In [33]:
# Words

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_words = T.words(apollonius, preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(apollonius_words)}')


Word 1: ̓αρχόμενος
Word 2: σέο
Word 3: φοῖβε
Word 4: παλαιγενέων
Word 5: κλέα
Word 6: φωτῶν
Word 7: μνήσομαι
Word 8: οἳ
Word 9: πόντοιο


## Another reader example—the UDCorpusReader

In [34]:
from cltkreaders.readers import UDCorpusReader

In [35]:
## Set up reader

UD = UDCorpusReader('../data')

In [36]:
# Print fileids

print(UD.fileids())

['la_perseus-ud-sample.conllu']


In [37]:
# Raw text; sample from Phaedrus 2 prologue

print(UD.raw()[:1710])
print('etc...')

# newdoc id = phi0975.phi001.perseus-lat1.tb.xml
# sent_id = phi0975.phi001.perseus-lat1.tb.xml@226
# text = Cuius verbosa ne sit commendatio, Attende, cur negare cupidis debeas, Modestis etiam offerre quod non petierint.
1	Cuius	qui	PRON	p-s---fg-	Case=Gen|Gender=Fem|Number=Sing	5	nmod	_	LId=qui1
2	verbosa	verbosus	ADJ	a-s---fn-	Case=Nom|Gender=Fem|Number=Sing	7	advcl	_	_
3	ne	ne	SCONJ	c--------	_	2	mark	_	LId=ne1
4	sit	sum	AUX	v3spsa---	Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	cop	_	LId=sum1
5	commendatio	commendatio	NOUN	n-s---fn-	Case=Nom|Gender=Fem|Number=Sing	2	nsubj	_	SpaceAfter=No
6	,	,	PUNCT	u--------	_	2	punct	_	_
7	Attende	attendo	VERB	v2spma---	Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	SpaceAfter=No
8	,	,	PUNCT	u--------	_	9	punct	_	_
9	cur	cur	ADV	d--------	_	12	mark	_	_
10	negare	nego	VERB	v--pna---	Tense=Pres|VerbForm=Inf|Voice=Act	12	xcomp	_	_
11	cupidis	cupidus	ADJ	a-p---md-	Case=Dat|Gender=Masc|Number=Plur	10	obj

In [38]:
# Words

# This reader also reads annotations directly from data files

ud_words = UD.words()

for i in range(1,10):
    print(f'Word {i}: {next(ud_words)}')


Word 1: Cuius
Word 2: verbosa
Word 3: ne
Word 4: sit
Word 5: commendatio
Word 6: ,
Word 7: Attende
Word 8: ,
Word 9: cur


In [39]:
# Words, with preprocessing

# This reader also reads annotations directly from data files

ud_words = UD.words(preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(ud_words)}')


Word 1: cuius
Word 2: uerbosa
Word 3: ne
Word 4: sit
Word 5: commendatio
Word 6: attende
Word 7: cur
Word 8: negare
Word 9: cupidis


In [40]:
# Annotated sents, directly from data files

annotated_sents = UD.annotated_sents()

for sent in annotated_sents:
    pprint(sent)
    break

[('Cuius', 'qui', 'PRON'),
 ('verbosa', 'verbosus', 'ADJ'),
 ('ne', 'ne', 'SCONJ'),
 ('sit', 'sum', 'AUX'),
 ('commendatio', 'commendatio', 'NOUN'),
 (',', ',', 'PUNCT'),
 ('Attende', 'attendo', 'VERB'),
 (',', ',', 'PUNCT'),
 ('cur', 'cur', 'ADV'),
 ('negare', 'nego', 'VERB'),
 ('cupidis', 'cupidus', 'ADJ'),
 ('debeas', 'debeo', 'VERB'),
 (',', ',', 'PUNCT'),
 ('Modestis', 'modestus', 'ADJ'),
 ('etiam', 'etiam', 'CCONJ'),
 ('offerre', 'offero', 'VERB'),
 ('quod', 'qui', 'PRON'),
 ('non', 'non', 'ADV'),
 ('petierint', 'peto', 'VERB'),
 ('.', '.', 'PUNCT')]
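
The `(token, lemma, tag)` tuples above correspond to the FORM, LEMMA, and UPOS columns of each token line in the raw file. As a minimal sketch of that mapping, the code below splits one of the raw lines shown earlier into the ten standard CoNLL-U fields. `parse_token_line` is a hypothetical helper for illustration, not the reader's own parser, and it ignores comment lines and multiword tokens.

```python
CONLLU_FIELDS = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
                 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']

def parse_token_line(line):
    # Hypothetical helper: pair the ten tab-separated CoNLL-U columns
    # with their standard field names.
    return dict(zip(CONLLU_FIELDS, line.rstrip('\n').split('\t')))

line = ("1\tCuius\tqui\tPRON\tp-s---fg-\t"
        "Case=Gen|Gender=Fem|Number=Sing\t5\tnmod\t_\tLId=qui1")

token = parse_token_line(line)
annotated = (token['FORM'], token['LEMMA'], token['UPOS'])
print(annotated)
```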


In [41]:
# CoNLL-U files transformed to Python dicts

next(UD.sent_dicts())

[{'ID': '1',
  'FORM': 'Cuius',
  'LEMMA': 'qui',
  'UPOS': 'PRON',
  'XPOS': 'p-s---fg-',
  'FEATS': 'Case=Gen|Gender=Fem|Number=Sing',
  'HEAD': '5',
  'DEPREL': 'nmod',
  'DEPS': '_',
  'MISC': 'LId=qui1'},
 {'ID': '2',
  'FORM': 'verbosa',
  'LEMMA': 'verbosus',
  'UPOS': 'ADJ',
  'XPOS': 'a-s---fn-',
  'FEATS': 'Case=Nom|Gender=Fem|Number=Sing',
  'HEAD': '7',
  'DEPREL': 'advcl',
  'DEPS': '_',
  'MISC': '_'},
 {'ID': '3',
  'FORM': 'ne',
  'LEMMA': 'ne',
  'UPOS': 'SCONJ',
  'XPOS': 'c--------',
  'FEATS': '_',
  'HEAD': '2',
  'DEPREL': 'mark',
  'DEPS': '_',
  'MISC': 'LId=ne1'},
 {'ID': '4',
  'FORM': 'sit',
  'LEMMA': 'sum',
  'UPOS': 'AUX',
  'XPOS': 'v3spsa---',
  'FEATS': 'Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act',
  'HEAD': '2',
  'DEPREL': 'cop',
  'DEPS': '_',
  'MISC': 'LId=sum1'},
 {'ID': '5',
  'FORM': 'commendatio',
  'LEMMA': 'commendatio',
  'UPOS': 'NOUN',
  'XPOS': 'n-s---fn-',
  'FEATS': 'Case=Nom|Gender=Fem|Number=Sing',
  'HEAD': '2',
