# CLTK Readers demo notebook

*A note on using Jupyter notebooks in this live demo*  

- TL;DR: "SHIFT + ENTER/RETURN" will do pretty much everything you need for the demo.
- Look for the comment "## Imports" below and click anywhere in its surrounding box to activate that input cell. These cells contain the code that we will run in today's demo.
- If you press "SHIFT + ENTER/RETURN", you will *run* that cell and the output will appear in the space below the cell; you will also advance to the next cell. If you press "CTRL + ENTER/RETURN", you will run the cell without advancing. If you press "OPTION (or ALT) + ENTER/RETURN", you will run the cell and insert a new empty cell below it.
- Note the brackets on the left side of the cells. If they are empty (that is, if there is blank space between the brackets), the cell has not yet been run. A number between the brackets indicates that the cell has been run and denotes the order in which the cells were run.

## A reader example—the LatinTesseraeCorpusReader

In [None]:
## Imports

from cltkreaders.lat import LatinTesseraeCorpusReader

from os.path import expanduser
from natsort import natsorted
from tqdm import tqdm
from pprint import pprint

In [None]:
## Set up reader
# NB: If you do not have the CLTK-Tesserae corpus already installed in CLTK_DATA, you will be prompted to download the corpus.

T = LatinTesseraeCorpusReader()

### Working with files

In [None]:
## First 10 filenames

pprint(T.fileids()[:10])

In [None]:
## First 10 works of Cicero

cicero = [file for file in T.fileids() if 'cicero' in file]
pprint(cicero[:10])

In [None]:
## Books of the Aeneid, sorted

aeneid = natsorted([file for file in T.fileids() if 'aeneid' in file])
pprint(aeneid)
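
`natsorted` matters here because plain `sorted` compares strings character by character, so a part numbered 10 lands before part 2. The stdlib-only sketch below illustrates the difference. `natural_key` is a hypothetical helper written for this note (it is not how `natsort` is implemented), and the file names simply follow the `vergil.aeneid.part.N.tess` pattern used later in this notebook.

```python
import re

def natural_key(name):
    """Split a string into text and integer chunks so digits compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk for chunk in re.split(r'(\d+)', name)]

# Illustrative file names only; the real list comes from T.fileids()
files = [f'vergil.aeneid.part.{n}.tess' for n in (10, 2, 1)]

print(sorted(files))                   # lexicographic order: part 1, part 10, part 2
print(sorted(files, key=natural_key))  # natural order: part 1, part 2, part 10
```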

### Working with doc structures

In [None]:
catullus = 'catullus.carmina.tess'

In [None]:
## Docs

catullus_doc = T.docs(catullus)
print(next(catullus_doc)[:446])

In [None]:
## Texts

catullus_text = T.texts(catullus)
print(next(catullus_text)[:335])

In [None]:
## Doc Rows

catullus_docrows = T.doc_rows(catullus)

print('This is a string representation of what the output dictionary looks like...')
print(f'{str(next(catullus_docrows))[:94]} etc. }}\n')


catullus_docrows = T.doc_rows(catullus)
print('Here are the first 10 items of the dict output...')
pprint(list(next(catullus_docrows).items())[:10])
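
To make the shape of that dictionary concrete, here is a minimal sketch of turning citation-tagged lines into a citation-to-text mapping. It assumes a `.tess` layout of `<citation>`, a tab, then the text, and it assumes the dictionary is keyed by citation. `parse_rows` and the sample string are written for illustration and are not the reader's own code.

```python
import re

# Assumed .tess layout: "<citation>\ttext", one cited unit per line.
ROW = re.compile(r'^<(?P<citation>[^>]+)>\s+(?P<text>.*)$')

def parse_rows(raw):
    """Map each citation to its text, skipping lines that do not match."""
    rows = {}
    for line in raw.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match['citation']] = match['text']
    return rows

# Illustrative sample in the assumed layout
sample = (
    "<verg. aen. 1.1>\tArma virumque cano, Troiae qui primus ab oris\n"
    "<verg. aen. 1.2>\tItaliam, fato profugus, Laviniaque venit\n"
)

rows = parse_rows(sample)
for citation, text in rows.items():
    print(citation, '->', text)
```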


### Working with doc units (i.e. philological units)

In [None]:
catilinam = 'cicero.in_catilinam.tess'

In [None]:
## Paras

catilinam_paras = T.paras(catilinam)

for i in range(1,6):
    print(f'Para {i}: {next(catilinam_paras)}')
    print()

Note that for the Tesserae texts, `paras` is *not* truly implemented, because paragraphs are not consistently marked in the original files. For prose texts, `paras` returns each cited section in its entirety. Verse texts (defined by default as texts whose cited sections are less than 75 characters long) yield the entire text as a single para, as for *Aeneid* 1 below...

In [None]:
## Paras

aeneid_paras = T.paras('vergil.aeneid.part.1.tess')

for i in range(1,2):
    print(f'Para {i}: {next(aeneid_paras)}')
    print()
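
The prose/verse distinction described above can be sketched with a simple length check. This is an assumption-laden illustration: `looks_like_verse` is a hypothetical helper, and comparing the *average* section length against the 75-character cutoff is a guess at the rule, not the reader's documented behavior.

```python
from statistics import mean

def looks_like_verse(sections, threshold=75):
    """Guess verse when cited sections average fewer than `threshold` characters."""
    return mean(len(section) for section in sections) < threshold

verse = [
    'Arma virumque cano, Troiae qui primus ab oris',
    'Italiam, fato profugus, Laviniaque venit',
]
prose = [
    'Quo usque tandem abutere, Catilina, patientia nostra? '
    'quam diu etiam furor iste tuus nos eludet?'
]

print(looks_like_verse(verse))
print(looks_like_verse(prose))
```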

In [None]:
# Sents

# By default, segmentation, tokenization, and other tagging is done using the spaCy model 'la_dep_cltk_sm'

catilinam_sents = T.sents(catilinam)

for i in range(1,6):
    print(f'Sent {i}: {next(catilinam_sents)}')

In [None]:
# Words

# By default, segmentation, tokenization, and other tagging is done using the spaCy model 'la_dep_cltk_sm'

catilinam_words = T.words(catilinam)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')
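
The loops above call `next()` a fixed number of times, which raises `StopIteration` if the generator runs out early. A slightly more defensive pattern, sketched here with a stand-in generator rather than the real reader, uses `itertools.islice`. `fake_words` is a hypothetical placeholder for any generator such as `T.words(...)`.

```python
from itertools import islice

def fake_words():
    """Stand-in for a reader generator; yields only three tokens."""
    yield from ['Quo', 'usque', 'tandem']

# islice stops quietly when the generator is exhausted
first_words = list(islice(fake_words(), 9))

for i, word in enumerate(first_words, start=1):
    print(f'Word {i}: {word}')
```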


In [None]:
# You can pass a preprocessor to `words` 

def custom_preprocess(text):
    text = text.lower()
    return text

catilinam_words = T.words(catilinam, preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(catilinam_words)}')
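
Conceptually, a `preprocess` argument is just a function from string to string that runs before tokenization. The sketch below shows that pattern with a naive whitespace tokenizer. `simple_words` is hypothetical and much cruder than the spaCy-based tokenization the reader uses.

```python
def simple_words(text, preprocess=None):
    """Yield whitespace-separated tokens, optionally transforming the text first."""
    if preprocess is not None:
        text = preprocess(text)
    yield from text.split()

sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'

print(list(simple_words(sentence)))
print(list(simple_words(sentence, preprocess=str.lower)))
```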


In [None]:
# Tokenized sents

# i.e. Sents in the form of a list of tuples of the form `(token, lemma, tag)`

catilinam_tokenized_sents = T.tokenized_sents(catilinam)

for i in range(1,4):
    print(f'Tok Sent {i}: {next(catilinam_tokenized_sents)}')
    print()


In [None]:
# Tokenized sents, simplified

# i.e. Sents in the form of a list of tokens

catilinam_tokenized_sents = T.tokenized_sents(catilinam, simple=True)

for i in range(1,4):
    print(f'Tok Sent {i}: {next(catilinam_tokenized_sents)}')
    print()


In [None]:
# POS-tagged sents

# i.e. Sents in the form of a list of strings of the form `token/POS`

catilinam_pos_sents = T.pos_sents(catilinam)

for i in range(1,2):
    print(f'POS Sent {i}: {next(catilinam_pos_sents)}')
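
If you need the token and tag separately, a `token/POS` string can be split on its last slash. This sketch assumes the format described in the comment above. The tagged strings are made up for illustration, and `split_tagged` is a hypothetical helper.

```python
def split_tagged(tagged):
    """Split 'token/POS' on the final slash so tokens containing '/' survive."""
    token, _, pos = tagged.rpartition('/')
    return token, pos

# Hand-written sample in the `token/POS` shape, not actual reader output
pos_sent = ['Quo/ADV', 'usque/ADV', 'tandem/ADV', 'abutere/VERB']

pairs = [split_tagged(item) for item in pos_sent]
print(pairs)
```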


### Working with doc descriptions

In [None]:
metamorphoses = natsorted([file for file in T.fileids() if 'ovid.metamorphoses' in file])
pprint(metamorphoses)

In [None]:
def custom_preprocess(text):
    from cltk.alphabet.lat import JVReplacer
    replacer = JVReplacer()

    text = text.lower() # Lowercase
    text = replacer.replace(text)  # Normalize u/v & i/j

    # Remove punctuation (the backslash is escaped to avoid an invalid escape sequence)
    punctuation = "\"#$%&'()*+,/:;<=>@[\\]^_`{|}~.?!«»—“-”"
    misc = '¡£¤¥¦§¨©¯°±²³´µ¶·¸¹º¼½¾¿÷·–‘’†•ↄ∞⏑〈〉（）'
    misc += punctuation
    translator = str.maketrans({key: " " for key in misc})
    text = text.translate(translator)

    # Remove numbers
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    return " ".join(text.split()).strip()
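
To see what this kind of normalization does to a line, here is a dependency-free sketch. It assumes the u/v and i/j normalization amounts to mapping `j` to `i` and `v` to `u`, and it uses `string.punctuation` instead of the fuller character lists above. `normalize` is a hypothetical stand-in, not the CLTK implementation.

```python
import string

# Assumption: j/v folding is a plain character mapping
JV = str.maketrans({'j': 'i', 'v': 'u', 'J': 'I', 'V': 'U'})
PUNCT = str.maketrans({char: ' ' for char in string.punctuation + string.digits})

def normalize(text):
    """Lowercase, fold j/v into i/u, blank out punctuation and digits, squeeze spaces."""
    text = text.lower().translate(JV).translate(PUNCT)
    return ' '.join(text.split())

print(normalize('Arma virumque cano, Troiae qui primus ab oris'))
```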

In [None]:
## Concordance, using Tesserae citations

# NB: Concordancing is currently only available for the Tesserae readers

metamorphoses_concordances = T.concordance(metamorphoses, preprocess=custom_preprocess)

met_conc_sample = next(metamorphoses_concordances)
pprint(list(met_conc_sample.items())[:3])

In [None]:
# Concordances are by default built on a file-by-file basis, but can easily be combined with the `compiled` parameter

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)

full_met_conc_sample = next(metamorphoses_concordances)
pprint(list(full_met_conc_sample.items())[96:102])

In [None]:
# Since the concordances are output as dictionaries, you can retrieve location information using the token as a dict key...

metamorphoses_concordances = T.concordance(metamorphoses, compiled=True, preprocess=custom_preprocess)
full_met_conc_sample = next(metamorphoses_concordances)

print(f'\'corpus\' appears {len(full_met_conc_sample["corpus"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpus'][:5])

print()

print(f'\'corpora\' appears {len(full_met_conc_sample["corpora"])} times in the Metamorphoses.')
print('Here are the first five instances...')
print(full_met_conc_sample['corpora'][:5])
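
The idea behind these dictionaries can be sketched in a few lines: walk each cited unit and record the citation under every token it contains. The code below assumes a citation-to-text mapping as input and whitespace tokenization. `build_concordance` and `compile_concordances` are hypothetical helpers, and the exact shape of the reader's location data may differ.

```python
from collections import defaultdict

def build_concordance(rows):
    """Map each token to the list of citations where it occurs."""
    concordance = defaultdict(list)
    for citation, text in rows.items():
        for token in text.lower().split():
            concordance[token].append(citation)
    return dict(concordance)

def compile_concordances(concordances):
    """Merge per-file concordances into one, keeping every location."""
    merged = defaultdict(list)
    for concordance in concordances:
        for token, locations in concordance.items():
            merged[token].extend(locations)
    return dict(merged)

# Made-up citations and text, for illustration only
book_a = build_concordance({'a 1.1': 'corpora mutatas', 'a 1.2': 'in nova corpora'})
book_b = build_concordance({'b 2.1': 'corpora vertit'})

merged = compile_concordances([book_a, book_b])
print(merged['corpora'])
```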

In [None]:
# ## Basic descriptive data; note: takes several minutes to run

# tess_describe = T.describe()
# pprint(tess_describe)

Sample output:  

{'files': 748,  
 'lexdiv': 24.255701516259066,  
 'secs': 143.71532320976257,  
 'sents': 314436,  
 'vocab': 329693,  
 'words': 7996935}  

In [None]:
## This data can also be returned for individual files or lists of files

print('Stats on just the file \'catullus.carmina.tess\'')
pprint(T.describe(catullus))

In [None]:
# ## Basic descriptive data; note: can take a minute or so to run

# print('Stats on just the group of files assigned above to the variable `metamorphoses`')
# pprint(T.describe(metamorphoses))

Stats on just the group of files assigned above to the variable `metamorphoses`  

{'files': 15,  
 'lexdiv': 4.389574250997125,  
 'secs': 58.47328305244446,  
 'sents': 6140,  
 'vocab': 21562,  
 'words': 94648}  
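
In both sample outputs, `lexdiv` equals `words` divided by `vocab` (for example, 94648 / 21562 ≈ 4.39). A rough sketch of computing the same kind of summary from a token list follows. `describe_tokens` is a hypothetical helper and says nothing about how the reader counts sentences or normalizes tokens.

```python
def describe_tokens(tokens):
    """Return word count, vocabulary size, and their ratio."""
    words = len(tokens)
    vocab = len(set(tokens))
    return {'words': words, 'vocab': vocab, 'lexdiv': words / vocab}

tokens = 'arma virumque cano arma cano arma'.split()
print(describe_tokens(tokens))

# The ratio implied by the sample output for the Metamorphoses files
print(94648 / 21562)
```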

## Another reader example—the GreekTesseraeCorpusReader

In [None]:
## Imports

from cltkreaders.grc import GreekTesseraeCorpusReader

In [None]:
## Set up reader
# NB: If you do not have the CLTK-Tesserae corpus already installed in CLTK_DATA, you will be prompted to download the corpus.

T = GreekTesseraeCorpusReader()

In [None]:
## First 10 filenames

pprint(T.fileids()[:10])

In [None]:
apollonius = [file for file in T.fileids() if 'apollonius' in file]

pprint(natsorted(apollonius))

In [None]:
# Sents

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_sents = T.sents(apollonius)

for i in range(1,6):
    print(f'Sent {i}: {next(apollonius_sents)}')
    print()

In [None]:
# Sents, with unline

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_sents = T.sents(apollonius, unline=True)

for i in range(1,6):
    print(f'Sent {i}: {next(apollonius_sents)}')
    print()

In [None]:
# Words

# By default, segmentation, tokenization, and other tagging is done using the cltk v1 pipeline

apollonius_words = T.words(apollonius, preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(apollonius_words)}')


## Another reader example—the UDCorpusReader

In [None]:
from cltkreaders.readers import UDCorpusReader

In [None]:
## Set up reader

UD = UDCorpusReader('../data')

In [None]:
# Print fileids

print(UD.fileids())

In [None]:
# Raw text; sample from Phaedrus 2 prologue

print(UD.raw()[:1710])
print('etc...')

In [None]:
# Words

# This reader also reads annotations directly from data files

ud_words = UD.words()

for i in range(1,10):
    print(f'Word {i}: {next(ud_words)}')


In [None]:
# Words, with preprocessing

# This reader also reads annotations directly from data files

ud_words = UD.words(preprocess=custom_preprocess)

for i in range(1,10):
    print(f'Word {i}: {next(ud_words)}')


In [None]:
# Annotated sents, directly from data files

annotated_sents = UD.annotated_sents()

for sent in annotated_sents:
    pprint(sent)
    break

In [None]:
# CoNLL-U files transformed to Python dicts

next(UD.sent_dicts())
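
For readers unfamiliar with CoNLL-U, each token line has ten tab-separated fields. The sketch below turns one sentence into a list of dicts keyed by the standard field names. The two annotated rows are hand-written for illustration, `parse_conllu_sentence` is a hypothetical helper, and the dicts returned by `sent_dicts` may be structured differently.

```python
# The ten standard CoNLL-U columns, lowercased for use as dict keys
FIELDS = ['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc']

def parse_conllu_sentence(block):
    """Convert the token lines of one CoNLL-U sentence into a list of dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and sentence-level comments
        tokens.append(dict(zip(FIELDS, line.split('\t'))))
    return tokens

# Hand-written sample rows, not taken from any treebank
sample = (
    "# text = Arma cano\n"
    "1\tArma\tarma\tNOUN\t_\tCase=Acc|Number=Plur\t2\tobj\t_\t_\n"
    "2\tcano\tcano\tVERB\t_\tMood=Ind|Number=Sing\t0\troot\t_\t_\n"
)

for token in parse_conllu_sentence(sample):
    print(token['form'], token['lemma'], token['upos'], token['deprel'])
```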