# Introduction to CLTK Readers

CLTK Readers is a corpus reader extension written for use with the Classical Language Toolkit, LatinCy, etc. It gives single-line API access to philologically sensible units (e.g. words, sentences, paragraphs, documents) for basic processing and analysis of Latin text collections. Some readers have extended, higher-order functionality, such as the concordancer for the CLTK Tesserae texts shown below. The repository for CLTK Readers can be found [here](https://github.com/diyclassics/cltk_readers).

In [1]:
## Imports

from cltkreaders.lat import LatinTesseraeCorpusReader
from os.path import expanduser
from natsort import natsorted
from pprint import pprint

In [2]:
## Set up reader
# NB: If you do not have the CLTK-Tesserae corpus already installed in CLTK_DATA, you will be prompted to download the corpus.

T = LatinTesseraeCorpusReader()

## Fileids

In [3]:
## First 10 filenames

pprint(T.fileids()[:10])

['ammianus.rerum_gestarum.part.14.tess',
 'ammianus.rerum_gestarum.part.15.tess',
 'ammianus.rerum_gestarum.part.16.tess',
 'ammianus.rerum_gestarum.part.17.tess',
 'ammianus.rerum_gestarum.part.18.tess',
 'ammianus.rerum_gestarum.part.19.tess',
 'ammianus.rerum_gestarum.part.20.tess',
 'ammianus.rerum_gestarum.part.21.tess',
 'ammianus.rerum_gestarum.part.22.tess',
 'ammianus.rerum_gestarum.part.23.tess']


In [4]:
## First 10 works of Cicero

cicero = [file for file in T.fileids() if 'cicero' in file]
pprint(cicero[:10])

['cicero.academica.tess',
 'cicero.brutus.tess',
 'cicero.cum_populo_gratias_egit.tess',
 'cicero.de_amicitia.tess',
 'cicero.de_divinatione.tess',
 'cicero.de_domo_sua.tess',
 'cicero.de_fato.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.1.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.2.tess',
 'cicero.de_finibus_bonorum_et_malorum.part.3.tess']
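
The fileids above appear to follow an `author.work[.part.N].tess` naming pattern, so simple string handling is enough to group them. Here is a minimal sketch, assuming that pattern holds for the files of interest; the sample list is copied from the outputs in this notebook, and `split_fileid` is a hypothetical helper, not part of CLTK Readers.

```python
from collections import defaultdict

# Sample fileids copied from this notebook's outputs; with the reader loaded,
# T.fileids() would supply the full list instead.
sample_fileids = [
    'ammianus.rerum_gestarum.part.14.tess',
    'cicero.academica.tess',
    'cicero.de_finibus_bonorum_et_malorum.part.1.tess',
    'cicero.de_finibus_bonorum_et_malorum.part.2.tess',
    'vergil.aeneid.part.1.tess',
    'vergil.aeneid.part.10.tess',
]

def split_fileid(fileid):
    """Hypothetical helper: return (author, work, part) for a fileid.

    `part` is kept as a string, or None for single-file works.
    """
    parts = fileid.split('.')
    if parts[-1] == 'tess':
        parts = parts[:-1]  # drop the file extension
    author, work = parts[0], parts[1]
    part = parts[3] if len(parts) >= 4 and parts[2] == 'part' else None
    return author, work, part

# Collect the distinct works available for each author
works_by_author = defaultdict(set)
for fileid in sample_fileids:
    author, work, _ = split_fileid(fileid)
    works_by_author[author].add(work)

for author, works in sorted(works_by_author.items()):
    print(author, sorted(works))
```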


In [5]:
## Books of the Aeneid, sorted

aeneid = natsorted([file for file in T.fileids() if 'aeneid' in file])
pprint(aeneid)

['vergil.aeneid.part.1.tess',
 'vergil.aeneid.part.2.tess',
 'vergil.aeneid.part.3.tess',
 'vergil.aeneid.part.4.tess',
 'vergil.aeneid.part.5.tess',
 'vergil.aeneid.part.6.tess',
 'vergil.aeneid.part.7.tess',
 'vergil.aeneid.part.8.tess',
 'vergil.aeneid.part.9.tess',
 'vergil.aeneid.part.10.tess',
 'vergil.aeneid.part.11.tess',
 'vergil.aeneid.part.12.tess']
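
Why `natsorted` rather than plain `sorted`? Plain string sorting compares digits character by character, so `part.10` lands before `part.2`. The sketch below reproduces the effect with the standard library only; `natural_key` is a hypothetical stand-in for what `natsort` does far more thoroughly, and it assumes all fileids being compared share the same overall shape.

```python
import re

# The twelve Aeneid fileids, in book order
aeneid_files = [f'vergil.aeneid.part.{n}.tess' for n in range(1, 13)]

# Lexicographic order: 'part.10', 'part.11', 'part.12' sort before 'part.2'
plain = sorted(aeneid_files)

def natural_key(fileid):
    """Split a fileid into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', fileid)]

natural = sorted(aeneid_files, key=natural_key)

print(plain[:4])
print(natural[:4])
```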


## Doc structures

In [6]:
catullus = 'catullus.carmina.tess'

In [7]:
## Docs

catullus_doc = T.docs(catullus)
print(next(catullus_doc)[:446])

<cat. 1.1>	Cui dono lepidum novum libellum
<cat. 1.2>	arido modo pumice expolitum?
<cat. 1.3>	Corneli, tibi; namque tu solebas
<cat. 1.4>	meas esse aliquid putare nugas,
<cat. 1.5>	iam tum cum ausus es unus Italorum
<cat. 1.6>	omne aevum tribus explicare chartis,
<cat. 1.7>	doctis, Iuppiter, et laboriosis!
<cat. 1.8>	quare habe tibi quidquid hoc libelli
<cat. 1.9>	qualecumque, quod, o patrona virgo,
<cat. 1.10>	plus uno maneat perenne saeclo.


In [8]:
## Texts

catullus_text = T.texts(catullus)
print(next(catullus_text)[:335])

Cui dono lepidum novum libellum
arido modo pumice expolitum?
Corneli, tibi; namque tu solebas
meas esse aliquid putare nugas,
iam tum cum ausus es unus Italorum
omne aevum tribus explicare chartis,
doctis, Iuppiter, et laboriosis!
quare habe tibi quidquid hoc libelli
qualecumque, quod, o patrona virgo,
plus uno maneat perenne saeclo.


In [9]:
## Doc Rows

catullus_docrows = T.doc_rows(catullus)

print('This is a string representation of what the output dictionary looks like...')
print(f'{str(next(catullus_docrows))[:94]} etc. }}\n')


catullus_docrows = T.doc_rows(catullus)
print('Here are the first 10 items of the dict output...')
pprint(list(next(catullus_docrows).items())[:10])


This is a string representation of what the output dictionary looks like...
{'<cat. 1.1>': 'Cui dono lepidum novum libellum', '<cat. 1.2>': 'arido modo pumice expolitum?' etc. }

Here are the first 10 items of the dict output...
[('<cat. 1.1>', 'Cui dono lepidum novum libellum'),
 ('<cat. 1.2>', 'arido modo pumice expolitum?'),
 ('<cat. 1.3>', 'Corneli, tibi; namque tu solebas'),
 ('<cat. 1.4>', 'meas esse aliquid putare nugas,'),
 ('<cat. 1.5>', 'iam tum cum ausus es unus Italorum'),
 ('<cat. 1.6>', 'omne aevum tribus explicare chartis,'),
 ('<cat. 1.7>', 'doctis, Iuppiter, et laboriosis!'),
 ('<cat. 1.8>', 'quare habe tibi quidquid hoc libelli'),
 ('<cat. 1.9>', 'qualecumque, quod, o patrona virgo,'),
 ('<cat. 1.10>', 'plus uno maneat perenne saeclo.')]
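
Since the keys of the `doc_rows` dictionaries are Tesserae citations, they can be parsed into numbers for grouping or sorting. The following is a sketch only: the regular expression and the `parse_citation` helper are assumptions that cover the purely numeric citations shown in this notebook (e.g. `<cat. 1.10>`, `<ov. met. 1.145>`), and other citation shapes in the corpus may need a different pattern.

```python
import re
from collections import defaultdict

# Sample rows copied from the output above
sample_rows = {
    '<cat. 1.1>': 'Cui dono lepidum novum libellum',
    '<cat. 1.2>': 'arido modo pumice expolitum?',
    '<cat. 1.3>': 'Corneli, tibi; namque tu solebas',
}

# Abbreviation, whitespace, then a dotted numeric locator, e.g. 'cat.' + '1.10'
CITATION = re.compile(r'<(?P<abbrev>.+?)\s+(?P<loc>\d+(?:\.\d+)*)>')

def parse_citation(citation):
    """Hypothetical helper: '<cat. 1.10>' -> ('cat.', (1, 10))."""
    match = CITATION.fullmatch(citation)
    if match is None:
        raise ValueError(f'Unrecognized citation format: {citation!r}')
    loc = tuple(int(n) for n in match['loc'].split('.'))
    return match['abbrev'], loc

# Group lines by the first number in the locator (the poem, for Catullus)
lines_by_poem = defaultdict(list)
for citation, text in sample_rows.items():
    _, loc = parse_citation(citation)
    lines_by_poem[loc[0]].append(text)

print(len(lines_by_poem[1]), 'lines in poem 1')
```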


## Doc units

In [10]:
catilinam = 'cicero.in_catilinam.tess'

In [11]:
## Paras

print("Note that for the Tesserae texts, `paras` are *not* implemented, as they are not consistently marked in the original files.")

Note that for the Tesserae texts, `paras` are *not* implemented, as they are not consistently marked in the original files.


In [12]:
# Sents

# NB: Sents are segmented by default with the CLTK LatinPunktSentenceTokenizer

catilinam_sents = T.sents(catilinam)

for i in range(1,6):
    print(f'Sent {i}: {next(catilinam_sents)}')

Sent 1: quo usque tandem abutere, Catilina, patientia nostra?
Sent 2: quam diu etiam furor iste tuus nos eludet?
Sent 3: quem ad finem sese effrenata iactabit audacia?
Sent 4: nihilne te nocturnum praesidium Palati, nihil urbis vigiliae, nihil timor populi, nihil concursus bonorum omnium, nihil hic munitissimus habendi senatus locus, nihil horum ora voltusque moverunt?
Sent 5: patere tua consilia non sentis, constrictam iam horum omnium scientia teneri coniurationem tuam non vides?


In [13]:
# Words

# NB: Words are tokenized by default with the CLTK LatinWordTokenizer

catilinam_words = T.words(catilinam)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')


Word 1: quo
Word 2: usque
Word 3: tandem
Word 4: abutere
Word 5: ,
Word 6: Catilina
Word 7: ,
Word 8: patientia
Word 9: nostra
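
The cells above pull items one at a time with `next`, which works because the reader methods return iterators. An alternative worth knowing is `itertools.islice`, which takes the first *n* items and stops quietly if the iterator runs out early. In this sketch `sample_words` is a stand-in generator for `T.words(catilinam)`, with tokens copied from the output above.

```python
from itertools import islice

def sample_words():
    """Stand-in for T.words(catilinam); yields the tokens shown above."""
    yield from ['quo', 'usque', 'tandem', 'abutere', ',',
                'Catilina', ',', 'patientia', 'nostra']

# Ask for 20 items although only 9 exist: islice simply stops at the end,
# whereas a 20th call to next() on the bare iterator would raise StopIteration.
first_words = list(islice(sample_words(), 20))

for i, word in enumerate(first_words, start=1):
    print(f'Word {i}: {word}')
```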


In [14]:
# You can pass a preprocessor to `words` 

def custom_preprocess(text):
    text = text.lower()
    return text

catilinam_words = T.words(catilinam, preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')


Word 1: quo
Word 2: usque
Word 3: tandem
Word 4: abutere
Word 5: ,
Word 6: catilina
Word 7: ,
Word 8: patientia
Word 9: nostra
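
Because `words` yields one token at a time, its output can be fed straight into `collections.Counter` for frequency counts. A minimal sketch, again using a stand-in generator (`sample_words`, with the lowercased tokens from the output above) in place of `T.words(catilinam, preprocess=custom_preprocess)`:

```python
from collections import Counter

def sample_words():
    """Stand-in for T.words(catilinam, preprocess=custom_preprocess)."""
    yield from ['quo', 'usque', 'tandem', 'abutere', ',',
                'catilina', ',', 'patientia', 'nostra']

# Keep only tokens containing at least one letter, which drops bare punctuation
counts = Counter(word for word in sample_words()
                 if any(ch.isalpha() for ch in word))

print(counts.most_common(3))
```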


In [15]:
# Tokenized sents

# Sentences split into tokens; as the output below shows, each token is returned as a (token, lemma, POS tag) tuple. Convenient for many applications that require lists of tokenized sentences

catilinam_tokenized_sents = T.tokenized_sents(catilinam)

for i in range(1,10):
    print(f'Tok Sent {i}: {next(catilinam_tokenized_sents)}')


Tok Sent 1: [('quo', 'qui', 'PRON'), ('usque', 'usque', 'ADV'), ('tandem', 'tandem', 'ADV'), ('abutere', 'abutor', 'VERB'), (',', ',', 'PUNCT'), ('Catilina', 'Catilina', 'NOUN'), (',', ',', 'PUNCT'), ('patientia', 'patientia', 'NOUN'), ('nostra', 'noster', 'ADJ'), ('?', '?', 'PUNCT')]
Tok Sent 2: [('quam', 'qui', 'PRON'), ('diu', 'diu', 'ADV'), ('etiam', 'etiam', 'ADV'), ('furor', 'furor', 'NOUN'), ('iste', 'iste', 'DET'), ('tuus', 'tuus', 'ADJ'), ('nos', 'nos', 'PRON'), ('eludet', 'eludo', 'VERB'), ('?', '?', 'PUNCT')]
Tok Sent 3: [('quem', 'qui', 'PRON'), ('ad', 'ad', 'ADP'), ('finem', 'finis', 'NOUN'), ('sese', 'se', 'PRON'), ('effrenata', 'effreno', 'VERB'), ('iactabit', 'iacto', 'VERB'), ('audacia', 'audacia', 'NOUN'), ('?', '?', 'PUNCT')]
Tok Sent 4: [('nihilne', 'nihilne', 'ADJ'), ('te', 'tu', 'PRON'), ('nocturnum', 'nocturnus', 'ADJ'), ('praesidium', 'praesidium', 'NOUN'), ('Palati', 'Palati', 'NOUN'), (',', ',', 'PUNCT'), ('nihil', 'nihil', ''), ('urbis', 'urbs', 'NOUN'), ('vi
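
Judging from the output above, each tokenized sentence is a list of `(token, lemma, POS tag)` tuples, which makes lemma extraction and POS filtering a matter of list comprehensions. A short sketch on the first two sentences, copied from that output:

```python
# First two tokenized sentences, copied from the output above
sample_tokenized_sents = [
    [('quo', 'qui', 'PRON'), ('usque', 'usque', 'ADV'), ('tandem', 'tandem', 'ADV'),
     ('abutere', 'abutor', 'VERB'), (',', ',', 'PUNCT'), ('Catilina', 'Catilina', 'NOUN'),
     (',', ',', 'PUNCT'), ('patientia', 'patientia', 'NOUN'), ('nostra', 'noster', 'ADJ'),
     ('?', '?', 'PUNCT')],
    [('quam', 'qui', 'PRON'), ('diu', 'diu', 'ADV'), ('etiam', 'etiam', 'ADV'),
     ('furor', 'furor', 'NOUN'), ('iste', 'iste', 'DET'), ('tuus', 'tuus', 'ADJ'),
     ('nos', 'nos', 'PRON'), ('eludet', 'eludo', 'VERB'), ('?', '?', 'PUNCT')],
]

# Lemmatized sentences without punctuation
lemmatized = [[lemma for _, lemma, pos in sent if pos != 'PUNCT']
              for sent in sample_tokenized_sents]

# All surface forms tagged as verbs, across sentences
verbs = [token for sent in sample_tokenized_sents
         for token, _, pos in sent if pos == 'VERB']

print(lemmatized[0])
print(verbs)
```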

## Doc description

In [16]:
metamorphoses = natsorted([file for file in T.fileids() if 'ovid.metamorphoses' in file])

In [17]:
def custom_preprocess(text):
    from cltk.alphabet.lat import JVReplacer
    replacer = JVReplacer()

    text = text.lower() # Lowercase
    text = replacer.replace(text)  # Normalize u/v & i/j

    # Remove punctuation
    punctuation = "\"#$%&\'()*+,/:;<=>@[\\]^_`{|}~.?!«»—“-”"  # `\\` is an escaped backslash
    misc = '¡£¤¥¦§¨©¯°±²³´µ¶·¸¹º¼½¾¿÷·–‘’†•ↄ∞⏑〈〉（）'
    misc += punctuation
    translator = str.maketrans({key: " " for key in misc})
    text = text.translate(translator)

    # Remove numbers
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    return " ".join(text.split()).strip()

In [18]:
## Concordance, using Tesserae citations

metamorphoses_concordances = T.concordance(metamorphoses, preprocess=custom_preprocess)

met_conc_sample = next(metamorphoses_concordances)
pprint(list(met_conc_sample.items())[:3])

[('a', [('<ov. met. 1.145>', 2), ('<ov. met. 1.587>', 1)]),
 ('ab',
  [('<ov. met. 1.3>', 3),
   ('<ov. met. 1.23>', 4),
   ('<ov. met. 1.34>', 5),
   ('<ov. met. 1.40>', 5),
   ('<ov. met. 1.66>', 4),
   ('<ov. met. 1.80>', 5),
   ('<ov. met. 1.144>', 5),
   ('<ov. met. 1.185>', 7),
   ('<ov. met. 1.233>', 4),
   ('<ov. met. 1.254>', 6),
   ('<ov. met. 1.269>', 5),
   ('<ov. met. 1.313>', 4),
   ('<ov. met. 1.336>', 6),
   ('<ov. met. 1.417>', 6),
   ('<ov. met. 1.431>', 2),
   ('<ov. met. 1.568>', 6),
   ('<ov. met. 1.607>', 5),
   ('<ov. met. 1.672>', 6),
   ('<ov. met. 1.774>', 7)]),
 ('aberant', [('<ov. met. 1.91>', 2)])]


In [19]:
# Concordances are by default built on a file-by-file basis, but can easily be combined with the `compiled` parameter

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)

full_met_conc_sample = next(metamorphoses_concordances)
pprint(list(full_met_conc_sample.items())[96:102])

[('accensis', [('<ov. met. 12.12>', 2)]),
 ('accensum', [('<ov. met. 2.228>', 1)]),
 ('accensus', [('<ov. met. 11.527>', 4)]),
 ('acceperat', [('<ov. met. 3.121>', 5)]),
 ('accepere', [('<ov. met. 9.719>', 0), ('<ov. met. 15.641>', 4)]),
 ('accepisse',
  [('<ov. met. 6.357>', 0),
   ('<ov. met. 14.844>', 4),
   ('<ov. met. 15.481>', 0)])]
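
One way to picture what combining file-by-file concordances involves is a merge of dictionaries whose values are lists of `(citation, position)` pairs. The sketch below is an illustration only, not the library's implementation of `compiled`; `merge_concordances` is a hypothetical helper, and the two per-book dictionaries are small excerpts reconstructed from the compiled output above.

```python
from collections import defaultdict

# Excerpts reconstructed from the compiled output above, split by book
book_9 = {'accepere': [('<ov. met. 9.719>', 0)]}
book_15 = {
    'accepere': [('<ov. met. 15.641>', 4)],
    'accepisse': [('<ov. met. 15.481>', 0)],
}

def merge_concordances(concordances):
    """Hypothetical helper: merge per-file concordances into one dict.

    Hits for a token are concatenated in the order the files are given.
    """
    merged = defaultdict(list)
    for concordance in concordances:
        for token, hits in concordance.items():
            merged[token].extend(hits)
    return dict(sorted(merged.items()))

combined = merge_concordances([book_9, book_15])
print(combined['accepere'])
```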


In [20]:
# Since the concordances are output as dictionaries, you can retrieve location information using the token as a dict key...

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)
full_met_conc_sample = next(metamorphoses_concordances)

print(f'\'corpus\' appears {len(full_met_conc_sample["corpus"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpus'][:5])

print()

print(f'\'corpora\' appears {len(full_met_conc_sample["corpora"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpora'][:5])

'corpus' appears 67 times in the Metamorphoses.
Here are the first five instances...
[('<ov. met. 1.190>', 5), ('<ov. met. 2.362>', 6), ('<ov. met. 2.611>', 0), ('<ov. met. 2.647>', 2), ('<ov. met. 2.648>', 2)]

'corpora' appears 86 times in the Metamorphoses.
Here are the first five instances...
[('<ov. met. 1.2>', 0), ('<ov. met. 1.156>', 4), ('<ov. met. 1.300>', 5), ('<ov. met. 1.527>', 5), ('<ov. met. 2.235>', 4)]
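
To make the structure of these entries concrete, here is a small sketch that builds the same kind of mapping (token to a list of `(citation, position)` pairs) from citation-keyed rows. It is a simplified illustration under stated assumptions, not the code behind `T.concordance`: `simple_preprocess` is a reduced stand-in for `custom_preprocess` (no u/v normalization), `build_concordance` is hypothetical, and positions are counted from 0 over whitespace-separated tokens. The sample rows are the first two lines of the Metamorphoses.

```python
import string
from collections import defaultdict

# First two lines of the Metamorphoses, keyed by Tesserae-style citations
sample_rows = {
    '<ov. met. 1.1>': 'In nova fert animus mutatas dicere formas',
    '<ov. met. 1.2>': 'corpora; di, coeptis (nam vos mutastis et illas)',
}

def simple_preprocess(text):
    """Lowercase and replace ASCII punctuation with spaces."""
    table = str.maketrans({ch: ' ' for ch in string.punctuation})
    return ' '.join(text.lower().translate(table).split())

def build_concordance(rows, preprocess=simple_preprocess):
    """Hypothetical helper: map each token to its (citation, position) hits."""
    concordance = defaultdict(list)
    for citation, text in rows.items():
        for position, token in enumerate(preprocess(text).split()):
            concordance[token].append((citation, position))
    return dict(sorted(concordance.items()))

sample_concordance = build_concordance(sample_rows)
print(sample_concordance['corpora'])
```

In this sketch `corpora` is found at position 0 of `<ov. met. 1.2>`, which agrees with the first `corpora` entry in the output above.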


In [21]:
## Basic descriptive data; this data can also be returned for individual files or lists of files
# Here just Catullus...

print('Stats on just the file \'catullus.carmina.tess\'')
pprint(T.describe(catullus))

Stats on just the file 'catullus.carmina.tess'
{'files': 1,
 'lexdiv': 2.676677535610091,
 'secs': 7.7393670082092285,
 'sents': 822,
 'vocab': 5827,
 'words': 15597}


In [22]:
print('Stats on just the group of files assigned above to the variable `metamorphoses`')
pprint(T.describe(metamorphoses))

Stats on just the group of files assigned above to the variable `metamorphoses`
{'files': 15,
 'lexdiv': 5.033443456162643,
 'secs': 49.04771399497986,
 'sents': 5372,
 'vocab': 19675,
 'words': 99033}
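
The `lexdiv` values reported above are numerically consistent with `words` divided by `vocab`; that reading is an inference from the two outputs in this notebook rather than something stated here. A quick check, with the figures copied from those outputs and `words_per_vocab` as a hypothetical helper:

```python
# Figures copied from the two describe() outputs above
catullus_stats = {'files': 1, 'lexdiv': 2.676677535610091,
                  'sents': 822, 'vocab': 5827, 'words': 15597}
metamorphoses_stats = {'files': 15, 'lexdiv': 5.033443456162643,
                       'sents': 5372, 'vocab': 19675, 'words': 99033}

def words_per_vocab(stats):
    """Hypothetical helper: tokens per distinct vocabulary item."""
    return stats['words'] / stats['vocab']

for name, stats in [('catullus', catullus_stats),
                    ('metamorphoses', metamorphoses_stats)]:
    ratio = words_per_vocab(stats)
    print(f'{name}: words/vocab = {ratio:.4f}, reported lexdiv = {stats["lexdiv"]:.4f}')
```

Note that a ratio of this kind grows with text length, so comparing it across works of very different sizes (one file of Catullus against fifteen of Ovid) calls for some caution.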


In [23]:
# ## Basic descriptive data for the whole corpus; note that this takes several minutes to run

# tess_describe = T.describe()
# pprint(tess_describe)