# LatinLibrary Reader Demo

This notebook demonstrates the `LatinLibraryReader` for working with the [Latin Library](https://thelatinlibrary.com/) corpus.

**Key differences from TesseraeReader:**
- Plain text files (`.txt`) instead of citation-annotated `.tess` files
- No built-in citation system (uses `fileid:sentN` fallback)
- Paragraph extraction via `paras()` method
- Title metadata extracted from first line of each file
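
To make the `fileid:sentN` fallback concrete, here is a minimal sketch of how such citations could be derived from sentence position alone. The helper `fallback_citations` is hypothetical and not part of `latincyreaders`; zero-based numbering is an assumption, based on the `sent0` citation that appears in the search output later in this notebook.

```python
def fallback_citations(fileid, sentences):
    """Pair each sentence with a 'fileid:sentN' label (N assumed zero-based)."""
    return [(f"{fileid}:sent{n}", sent) for n, sent in enumerate(sentences)]


# Illustrative sentences only, not the reader's actual segmentation
sample = ["quis furor, o ciues, quae tanta licentia ferri?", "sentiet axis onus."]
for citation, sent in fallback_citations("lucan/lucan1.txt", sample):
    print(citation, "->", sent)
```

Because the label depends only on position, it changes whenever sentence segmentation changes, which is the main practical difference from a citation stored in the file itself.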

In [1]:
## Imports

from latincyreaders import LatinLibraryReader, AnnotationLevel

from pprint import pprint

In [2]:
## Set up reader

# Auto-downloads corpus on first use if not found
L = LatinLibraryReader()

## Fileids and metadata

In [3]:
## First 10 filenames

files = L.fileids()[:10]
pprint(files)

['12tables.txt',
 '1644.txt',
 'abbofloracensis.txt',
 'abelard/dialogus.txt',
 'abelard/epistola.txt',
 'abelard/historia.txt',
 'addison/barometri.txt',
 'addison/burnett.txt',
 'addison/hannes.txt',
 'addison/machinae.txt']


In [4]:
# Get files by pattern match (regex)
files = L.fileids(match='horace')
pprint(files)

['horace/arspoet.txt',
 'horace/carm1.txt',
 'horace/carm2.txt',
 'horace/carm3.txt',
 'horace/carm4.txt',
 'horace/carmsaec.txt',
 'horace/ep.txt',
 'horace/epist1.txt',
 'horace/epist2.txt',
 'horace/serm1.txt',
 'horace/serm2.txt',
 'suetonius/suet.horace.txt']


In [5]:
# Get files by pattern - supports regex
files = L.fileids(match=r'vergil.*aen')
pprint(files)

['vergil/aen1.txt',
 'vergil/aen2.txt',
 'vergil/aen3.txt',
 'vergil/aen4.txt',
 'vergil/aen5.txt',
 'vergil/aen6.txt',
 'vergil/aen7.txt',
 'vergil/aen8.txt',
 'vergil/aen9.txt',
 'vergil/aen10.txt',
 'vergil/aen11.txt',
 'vergil/aen12.txt']


In [6]:
# Get files by partial match
files = L.fileids(match='cicero')[:10]
pprint(files)

['cicero/acad.txt',
 'cicero/adbrutum1.txt',
 'cicero/adbrutum2.txt',
 'cicero/amic.txt',
 'cicero/arch.txt',
 'cicero/att1.txt',
 'cicero/att2.txt',
 'cicero/att3.txt',
 'cicero/att4.txt',
 'cicero/att5.txt']


In [7]:
# Case-insensitive regex matching
files = L.fileids(match=r'ovid')[:15]
pprint(files)

['ovid/ovid.amor1.txt',
 'ovid/ovid.amor2.txt',
 'ovid/ovid.amor3.txt',
 'ovid/ovid.artis1.txt',
 'ovid/ovid.artis2.txt',
 'ovid/ovid.artis3.txt',
 'ovid/ovid.fasti1.txt',
 'ovid/ovid.fasti2.txt',
 'ovid/ovid.fasti3.txt',
 'ovid/ovid.fasti4.txt',
 'ovid/ovid.fasti5.txt',
 'ovid/ovid.fasti6.txt',
 'ovid/ovid.her1.txt',
 'ovid/ovid.her2.txt',
 'ovid/ovid.her3.txt']


In [8]:
# Get all files
all_files = L.fileids()
print(f"Total files: {len(all_files)}")

Total files: 2141
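
The `match` results above are consistent with a case-insensitive regex search anywhere in the file path (for example, `'horace'` also finds `suetonius/suet.horace.txt`). The sketch below reproduces that behavior with the standard library; `match_fileids` is a hypothetical helper, and the use of `re.search` with `re.IGNORECASE` is an assumption inferred from the outputs, not the reader's confirmed implementation.

```python
import re


def match_fileids(fileids, pattern):
    """Keep the fileids in which the regex matches anywhere, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [f for f in fileids if rx.search(f)]


# A few fileids taken from the outputs above
sample_ids = [
    "cicero/acad.txt",
    "horace/carm1.txt",
    "suetonius/suet.horace.txt",
    "vergil/aen1.txt",
]
print(match_fileids(sample_ids, "horace"))
print(match_fileids(sample_ids, r"vergil.*aen"))
```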


### Metadata

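LatinLibrary files have minimal metadata compared to Tesserae. The `title` is extracted from the first line of each file. In the run shown below, however, `get_metadata()` returns an empty dict, so the loop falls back to `'No title'`.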
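
As a rough illustration of first-line title extraction, the sketch below takes the first non-empty line of a text as its title. `title_from_first_line` is a hypothetical helper, and skipping blank lines is an assumption; the sample string is illustrative and does not reproduce an actual corpus file.

```python
def title_from_first_line(text, default="No title"):
    """Return the first non-empty line, or a default when there is none."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return default


sample_text = "\nLucan Liber I\n\nBella per Emathios plus quam ciuilia campos\n"
print({"title": title_from_first_line(sample_text)})
print({"title": title_from_first_line("")})
```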

In [9]:
# Get metadata for a specific file
# Note: LatinLibrary extracts title from first line

sample_file = L.fileids(match='cicero')[0]
meta = L.get_metadata(sample_file)

print(f"Metadata for {sample_file}:")
pprint(meta)

Metadata for cicero/acad.txt:
{}


In [10]:
# Iterate through metadata
for fileid, meta in list(L.metadata())[:5]:
    title = meta.get('title', 'No title')
    print(f"{fileid}: {title[:50]}...")

12tables.txt: No title...
1644.txt: No title...
abbofloracensis.txt: No title...
abelard/dialogus.txt: No title...
abelard/epistola.txt: No title...


## Doc structures

In [11]:
# Define files to work with - using Lucan's Pharsalia
lucan_files = L.fileids(match='lucan')
print(f"Lucan files: {lucan_files}")
lucan = lucan_files[0] if lucan_files else None

Lucan files: ['lucan/lucan1.txt', 'lucan/lucan2.txt', 'lucan/lucan3.txt', 'lucan/lucan4.txt', 'lucan/lucan5.txt', 'lucan/lucan6.txt', 'lucan/lucan7.txt', 'lucan/lucan8.txt', 'lucan/lucan9.txt', 'lucan/lucan10.txt', 'suetonius/suet.lucan.txt']


In [12]:
## Docs - spaCy Doc objects with NLP annotations

if lucan:
    lucan_doc = next(L.docs(lucan))
    # Show characters 200-700 to skip title
    print(lucan_doc.text[200:700])

sque acies, et rupto foedere regni certatum totis concussi uiribus orbis 5 in commune nefas, infestisque obuia signis signa, pares aquilas et pila minantia pilis. quis furor, o ciues, quae tanta licentia ferri? gentibus inuisis Latium praebere cruorem cumque superba foret Babylon spolianda tropaeis 10 Ausoniis umbraque erraret Crassus inulta bella geri placuit nullos habitura triumphos? heu, quantum terrae potuit pelagique parari hoc quem ciuiles hauserunt sanguine dextrae, unde uenit Titan et n


In [13]:
## Texts - raw strings (zero NLP overhead)

if lucan:
    lucan_text = next(L.texts(lucan))
    # Show characters 200-600 to skip title
    pprint(lucan_text[200:600])

('asque acies, et rupto foedere regni\n'
 'certatum totis concussi uiribus orbis                  5\n'
 'in commune nefas, infestisque obuia signis\n'
 'signa, pares aquilas et pila minantia pilis.\n'
 '      quis furor, o ciues, quae tanta licentia ferri?\n'
 'gentibus inuisis Latium praebere cruorem\n'
 'cumque superba foret Babylon spolianda tropaeis                  10\n'
 'Ausoniis umbraque erraret Crassus inulta\n'
 'bella geri plac')


### Note: No doc_rows() or lines()

Unlike TesseraeReader, LatinLibraryReader does not have citation-annotated lines. Use `sents()` or `paras()` instead.
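
For a sense of what paragraph-level units look like in plain text, here is a minimal sketch that splits on blank lines. `split_paragraphs` is a hypothetical helper, and the blank-line rule is an assumption; `paras()` may apply a different rule and returns spaCy spans rather than strings.

```python
import re


def split_paragraphs(text):
    """Split text on runs of blank lines and drop empty blocks."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


sample_text = "Lucan Liber I\n\nBella per Emathios\nplus quam ciuilia campos\n\n\nquis furor, o ciues"
for i, para in enumerate(split_paragraphs(sample_text), start=1):
    print(f"Para {i}: {para!r}")
```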

## Doc units

In [14]:
# Using Lucan for doc unit examples
print(f"Using file: {lucan}")

Using file: lucan/lucan1.txt


In [15]:
## Paras - paragraph spans (available in LatinLibrary!)
# Note: Skip early paragraphs to avoid paratextual material (titles, headers)

if lucan:
    paras = list(L.paras(lucan))
    print(f"Total paragraphs: {len(paras)}")
    print()
    # Show paragraphs 5-7 (skipping header material); prints nothing if the file has fewer than 5 paragraphs
    for i, para in enumerate(paras[4:7], start=5):
        print(f"Para {i}: {para.text[:100]}...")
        print()

Total paragraphs: 4



In [16]:
# Sents - spaCy Span objects
# Skip early sentences to avoid paratextual material
from itertools import islice

if lucan:
    sents = L.sents(lucan)
    # Show sentences 11-16 (1-based), skipping the first 10
    for i, sent in enumerate(islice(sents, 10, 16), start=11):
        print(f'Sent {i}: {sent}')

Sent 11: scelera ipsa nefasque hac mercede placent.
Sent 12: diros Pharsalia campos inpleat et Poeni saturentur sanguine manes, ultima funesta concurrant proelia Munda, 40 his, Caesar, Perusina fames Mutinaeque labores accedant fatis et quas premit aspera classes Leucas et ardenti seruilia bella sub Aetna, multum Roma tamen debet ciuilibus armis quod tibi res acta est.
Sent 13: te, cum statione peracta 45 astra petes serus, praelati regia caeli excipiet gaudente polo:
Sent 14: seu sceptra tenere seu te flammigeros Phoebi conscendere currus telluremque nihil mutato sole timentem igne uago lustrare iuuet, tibi numine ab omni 50 cedetur, iurisque tui natura relinquet quis deus esse uelis, ubi regnum ponere mundi.
Sent 15: sed neque in Arctoo sedem tibi legeris orbe nec polus auersi calidus qua uergitur Austri, unde tuam uideas obliquo sidere Romam.
Sent 16: 55 aetheris inmensi partem si presseris unam, sentiet axis onus.


In [17]:
# Tokens - spaCy Token objects
# Skip early tokens to show actual Latin content

if lucan:
    tokens = L.tokens(lucan)
    # Show tokens 51-61 (1-based), skipping the first 50
    for i, token in enumerate(islice(tokens, 50, 61), start=51):
        print(f'Word {i}: {token}')

Word 51: infestis
Word 52: que
Word 53: obuia
Word 54: signis
Word 55: signa
Word 56: ,
Word 57: pares
Word 58: aquilas
Word 59: et
Word 60: pila
Word 61: minantia


In [18]:
# Token linguistic attributes (BASIC level: text, lemma, POS, tag)

if lucan:
    tokens = L.tokens(lucan)
    # Skip to token 50 to avoid header material
    t = next(islice(tokens, 50, 51))
    print(f"text: {t.text}, lemma: {t.lemma_}, pos: {t.pos_}, tag: {t.tag_}")

text: infestis, lemma: infestus, pos: ADJ, tag: adjective


In [19]:
# For custom text processing, work with the raw text or spaCy Doc

if lucan:
    # Get tokens as strings, skip header material
    tokens = islice(L.tokens(lucan, as_text=True), 50, 55)
    for token_text in tokens:
        processed = token_text.lower()
        print(processed, end=' ')

infestis que obuia signis signa 

In [20]:
# Tokenized sents - use spaCy directly
# Get (token, lemma, tag) tuples from sentences

if lucan:
    sents = L.sents(lucan)
    # Show sentences 11-13 (1-based), skipping the first 10
    for i, sent in enumerate(islice(sents, 10, 13), start=11):
        tok_sent = [(t.text, t.lemma_, t.tag_) for t in sent]
        print(f'Tok Sent {i}: {tok_sent}')
        print()

Tok Sent 11: [('scelera', 'scelus', 'noun'), ('ipsa', 'ipse', 'determiner'), ('nefas', 'nefas', 'noun'), ('que', 'que', 'conjunction'), ('hac', 'hic', 'adjective'), ('mercede', 'merces', 'noun'), ('placent', 'placeo', 'verb'), ('.', '.', 'punc')]

Tok Sent 12: [('diros', 'dirus', 'adjective'), ('Pharsalia', 'Pharsalia', 'adjective'), ('campos', 'campus', 'noun'), ('inpleat', 'impleo', 'verb'), ('et', 'et', 'conjunction'), ('Poeni', 'Poeni', 'proper_noun'), ('saturentur', 'saturo', 'verb'), ('sanguine', 'sanguis', 'noun'), ('manes', 'maneo', 'noun'), (',', ',', 'punc'), ('ultima', 'ultimus', 'adjective'), ('funesta', 'funestus', 'adjective'), ('concurrant', 'concurro', 'verb'), ('proelia', 'proelium', 'noun'), ('Munda', 'Munda', 'proper_noun'), (',', ',', 'punc'), ('40', '40', 'number'), ('his', 'hic', 'adjective'), (',', ',', 'punc'), ('Caesar', 'Caesar', 'proper_noun'), (',', ',', 'punc'), ('Perusina', 'perusinus', 'adjective'), ('fames', 'fames', 'noun'), ('Mutinae', 'Mutina', 'prope

In [21]:
# POS-tagged sents - token/POS pairs

if lucan:
    sents = L.sents(lucan)
    # Show sentences 11-12 (1-based), skipping the first 10
    for i, sent in enumerate(islice(sents, 10, 12), start=11):
        pos_sent = [f"{t.text}/{t.pos_}" for t in sent]
        print(f'POS Sent {i}: {" ".join(pos_sent)}')

POS Sent 11: scelera/NOUN ipsa/DET nefas/NOUN que/CCONJ hac/DET mercede/NOUN placent/VERB ./PUNCT
POS Sent 12: diros/ADJ Pharsalia/ADJ campos/NOUN inpleat/VERB et/CCONJ Poeni/PROPN saturentur/VERB sanguine/NOUN manes/VERB ,/PUNCT ultima/ADJ funesta/ADJ concurrant/VERB proelia/NOUN Munda/PROPN ,/PUNCT 40/NUM his/DET ,/PUNCT Caesar/PROPN ,/PUNCT Perusina/ADJ fames/NOUN Mutinae/ADJ que/CCONJ labores/NOUN accedant/VERB fatis/NOUN et/CCONJ quas/PRON premit/VERB aspera/ADJ classes/NOUN Leucas/ADJ et/CCONJ ardenti/ADJ seruilia/ADJ bella/NOUN sub/ADP Aetna/PROPN ,/PUNCT multum/ADV Roma/PROPN tamen/ADV debet/VERB ciuilibus/ADJ armis/NOUN quod/SCONJ tibi/PRON res/NOUN acta/VERB est/AUX ./PUNCT


In [22]:
# spaCy Token objects by default
if lucan:
    tokens = L.tokens(lucan)
    # Skip to token 50
    token = next(islice(tokens, 50, 51))
    print(token)
    print(type(token))

infestis
<class 'spacy.tokens.token.Token'>


In [23]:
# Tokens as plain strings with as_text=True

if lucan:
    plaintext_tokens = L.tokens(lucan, as_text=True)
    # Skip to token 50
    plaintext_token = next(islice(plaintext_tokens, 50, 51))
    print(plaintext_token)
    print(type(plaintext_token))

infestis
<class 'str'>


## Concordance

In [24]:
# Build a concordance: word -> list of citations where it appears
# Note: Without Tesserae citations, a positional fallback is used (printed below as fileid:N)

# Use Catullus for concordance examples
catullus_files = L.fileids(match='catullus')
catullus = catullus_files[0] if catullus_files else None

if catullus:
    conc = L.concordance(fileids=catullus, basis="lemma")
    print(f"Unique lemmas in {catullus}: {len(conc)}")
    print()

    # Look up a specific lemma
    if "amor" in conc:
        print("Citations for 'amor':")
        for cit in conc["amor"][:10]:
            print(f"  {cit}")
        if len(conc["amor"]) > 10:
            print(f"  ... and {len(conc['amor']) - 10} more")

Unique lemmas in catullus.txt: 3446

Citations for 'amor':
  catullus.txt:746
  catullus.txt:801
  catullus.txt:1051
  catullus.txt:1399
  catullus.txt:1590
  catullus.txt:1812
  catullus.txt:2282
  catullus.txt:3297
  catullus.txt:3750
  catullus.txt:4080
  ... and 43 more
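
Conceptually, a concordance is an inverted index from a key (lemma or surface form) to the places it occurs. The sketch below builds one from pre-computed `(citation, key)` pairs; `build_concordance` and the citation strings are hypothetical and only illustrate the data structure, not how `L.concordance()` is implemented.

```python
from collections import defaultdict


def build_concordance(cited_keys):
    """Map each key to the list of citations where it occurs, in order."""
    conc = defaultdict(list)
    for citation, key in cited_keys:
        conc[key].append(citation)
    return dict(conc)


# Hypothetical (citation, lemma) pairs
pairs = [
    ("catullus.txt:1", "amor"),
    ("catullus.txt:1", "puella"),
    ("catullus.txt:7", "amor"),
]
conc_demo = build_concordance(pairs)
print(conc_demo["amor"])
print(len(conc_demo))
```

Switching between `basis="lemma"` and `basis="text"` then amounts to choosing which token attribute supplies the key.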


In [25]:
# Concordance by surface text form (exact spelling)
if catullus:
    conc_text = L.concordance(fileids=catullus, basis="text")

    # Different forms of 'puella' (girl)
    puella_forms = ["puella", "puellae", "puellam", "puellas", "puellis"]
    print("Occurrences of 'puella' forms:")
    for form in puella_forms:
        if form in conc_text:
            count = len(conc_text[form])
            print(f"  {form}: {count} occurrences")

Occurrences of 'puella' forms:
  puella: 19 occurrences
  puellae: 18 occurrences
  puellam: 2 occurrences
  puellis: 2 occurrences


## KWIC (Keyword in Context)

Find words with surrounding context - useful for studying word usage patterns.
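
The idea behind KWIC can be sketched over a plain token list: for each hit, take `window` tokens on either side. `kwic_sketch` below is a hypothetical stand-in, not the reader's `kwic()` method; case-insensitive matching is an assumption, and the token list is taken from one of the Catullus hits shown in this section.

```python
def kwic_sketch(tokens, keyword, window=5):
    """Return left/match/right context for every case-insensitive hit."""
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            hits.append({
                "left": " ".join(tokens[max(0, i - window):i]),
                "match": tok,
                "right": " ".join(tokens[i + 1:i + 1 + window]),
            })
    return hits


tokens_demo = "semper concordia uestras , semper amor sedes incolat assiduus . tu".split()
for hit in kwic_sketch(tokens_demo, "amor", window=5):
    print(f"{hit['left']} [{hit['match']}] {hit['right']}")
```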

In [26]:
# Basic KWIC search - find "amor" with 5 tokens of context on each side
if catullus:
    for hit in L.kwic("amor", fileids=catullus, window=5, limit=5):
        print(f"{hit['left']} [{hit['match']}] {hit['right']}")
        print(f"  -- {hit['citation']}")
        print()

' hoc ut dixit , [Amor] sinistra ut ante dextra sternuit
  -- catullus.txt:4778

' hoc ut dixit , [Amor] sinistra ut ante dextra sternuit
  -- catullus.txt:4831

umquam contexit amores , nullus [amor] tali coniunxit foedere amantes ,
  -- catullus.txt:10675

sub Latmia saxa relegans dulcis [amor] gyro deuocet aereo : idem
  -- catullus.txt:11377

semper concordia uestras , semper [amor] sedes incolat assiduus . tu
  -- catullus.txt:11976



In [27]:
# KWIC by lemma - finds all forms of a word (e.g., amo, amat, amant, amavit)

if catullus:
    for hit in L.kwic("amo", fileids=catullus, by_lemma=True, window=4, limit=5):
        print(f"{hit['left']} [{hit['match']}] {hit['right']}")
        print(f"  -- {hit['citation']}")
        print()

plus illa oculis suis [amabat] . nam mellitus erat
  -- catullus.txt:307

mea Lesbia , atque [amemus] , rumores que senum
  -- catullus.txt:565

uentitabas quo puella ducebat [amata] nobis quantum amabitur nulla
  -- catullus.txt:854

ducebat amata nobis quantum [amabitur] nulla . ibi illa
  -- catullus.txt:857

bella ? quem nunc [amabis] ? cuius esse diceris
  -- catullus.txt:950



## N-grams and Skipgrams

Extract contiguous token sequences (n-grams) or sequences with gaps (skipgrams) for collocation analysis and language modeling.
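
The difference between the two is easiest to see on a short token list. The sketch below uses plain strings and the standard library; `ngrams_sketch` and `skipgrams_sketch` are hypothetical helpers, and the skipgram definition used here (each token paired with later tokens, allowing up to `k` skipped positions) is an assumption chosen for illustration.

```python
from itertools import combinations


def ngrams_sketch(tokens, n):
    """Contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams_sketch(tokens, n, k):
    """n-token sequences starting at each token, allowing up to k skips in total."""
    grams = []
    for i, head in enumerate(tokens):
        tail = tokens[i + 1:i + n + k]  # candidates that may follow the head
        for rest in combinations(tail, n - 1):
            grams.append(" ".join((head,) + rest))
    return grams


tokens_demo = ["Catullus", "C.", "VALERIVS", "CATVLLVS"]
print(ngrams_sketch(tokens_demo, 2))
print(skipgrams_sketch(tokens_demo, 2, 1))
```

With `k=0` the skipgram function reduces to ordinary n-grams, which is a quick sanity check on the definition.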

In [28]:
# Extract bigrams (2-word sequences)
from itertools import islice

if catullus:
    bigrams = list(islice(L.ngrams(n=2, fileids=catullus), 20))
    print("First 20 bigrams:")
    pprint(bigrams)

First 20 bigrams:
['Catullus C.',
 'C. VALERIVS',
 'VALERIVS CATVLLVS',
 'CATVLLVS 1',
 '1 2',
 '2 2b',
 '2b 3',
 '3 4',
 '4 5',
 '5 6',
 '6 7',
 '7 8',
 '8 9',
 '9 10',
 '10 11',
 '11 12',
 '12 13',
 '13 14',
 '14 14b',
 '14b 15']


In [29]:
# Trigrams (3-word sequences)
if catullus:
    trigrams = list(islice(L.ngrams(n=3, fileids=catullus), 10))
    print("First 10 trigrams:")
    pprint(trigrams)

First 10 trigrams:
['Catullus C. VALERIVS',
 'C. VALERIVS CATVLLVS',
 'VALERIVS CATVLLVS 1',
 'CATVLLVS 1 2',
 '1 2 2b',
 '2 2b 3',
 '2b 3 4',
 '3 4 5',
 '4 5 6',
 '5 6 7']


In [30]:
# Get n-grams as token tuples for linguistic analysis

if catullus:
    for gram in islice(L.ngrams(n=2, fileids=catullus, as_tuples=True), 5):
        print([(t.text, t.lemma_, t.pos_) for t in gram])

[('Catullus', 'Catullus', 'PROPN'), ('C.', 'Gaius', 'PROPN')]
[('C.', 'Gaius', 'PROPN'), ('VALERIVS', 'UALERIVS', 'X')]
[('VALERIVS', 'UALERIVS', 'X'), ('CATVLLVS', '', 'X')]
[('CATVLLVS', '', 'X'), ('1', '1', 'NUM')]
[('1', '1', 'NUM'), ('2', '2', 'NUM')]


In [31]:
# Bigram frequency analysis - find most common word pairs
from collections import Counter

if catullus:
    bigram_counts = Counter(L.ngrams(n=2, fileids=catullus))
    print("Most common bigrams:")
    for bigram, count in bigram_counts.most_common(10):
        print(f"  {bigram}: {count}")

Most common bigrams:
  Hymen Hymenaee: 26
  o Hymenaee: 22
  io Hymen: 20
  currite ducentes: 12
  ducentes subtegmina: 12
  Hymenaee io: 11
  ad Lesbiam: 10
  non est: 9
  Hymen o: 9
  Hymen ades: 9


In [32]:
# Skipgrams - word pairs with gaps between them

if catullus:
    skipgrams = list(islice(L.skipgrams(n=2, k=1, fileids=catullus), 15))
    print("First 15 skipgrams (bigrams with 1 skip):")
    pprint(skipgrams)

First 15 skipgrams (bigrams with 1 skip):
['Catullus C.',
 'Catullus VALERIVS',
 'C. VALERIVS',
 'C. CATVLLVS',
 'VALERIVS CATVLLVS',
 'VALERIVS 1',
 'CATVLLVS 1',
 'CATVLLVS 2',
 '1 2',
 '1 2b',
 '2 2b',
 '2 3',
 '2b 3',
 '2b 4',
 '3 4']


In [33]:
# N-grams by lemma - normalize inflected forms

if catullus:
    print("Bigrams by surface text:")
    text_bigrams = list(islice(L.ngrams(n=2, fileids=catullus, basis="text"), 5))
    pprint(text_bigrams)

    print("\nBigrams by lemma:")
    lemma_bigrams = list(islice(L.ngrams(n=2, fileids=catullus, basis="lemma"), 5))
    pprint(lemma_bigrams)

    print("\nMost common lemma bigrams:")
    lemma_counts = Counter(L.ngrams(n=2, fileids=catullus, basis="lemma"))
    for bigram, count in lemma_counts.most_common(10):
        print(f"  {bigram}: {count}")

Bigrams by surface text:
['Catullus C.', 'C. VALERIVS', 'VALERIVS CATVLLVS', 'CATVLLVS 1', '1 2']

Bigrams by lemma:
['Catullus Gaius', 'Gaius UALERIVS', 'UALERIVS ', ' 1', '1 2']

Most common lemma bigrams:
   : 56
  Hymen hymenaeus: 26
  qui tu: 24
  o hymenaeus: 22
  io Hymen: 20
  qui ego: 13
  curro duco: 12
  duco subtegmen: 12
  hic sum: 11
  non sum: 11


## Basic descriptive stats

In [34]:
# Quick corpus overview
files = L.fileids()
print(f"Total files: {len(files)}")

# Sample stats from one file
sample_file = files[0]
sample_text = next(L.texts(sample_file))
print(f"\nSample file: {sample_file}")
print(f"Character count: {len(sample_text)}")
print(f"Word count (approx): {len(sample_text.split())}")

Total files: 2141

Sample file: 12tables.txt
Character count: 4810
Word count (approx): 798


In [35]:
## Stats for a specific file

# Use Cicero for stats example
cicero_files = L.fileids(match='cicero')
cicero_file = cicero_files[0] if cicero_files else None

if cicero_file:
    doc = next(L.docs(cicero_file))
    print(f'Stats for {cicero_file}:')
    print(f'  Sentences: {len(list(doc.sents))}')
    print(f'  Tokens: {len(doc)}')

Stats for cicero/acad.txt:
  Sentences: 236
  Tokens: 5730


## Search Features

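Sentence-level search methods inherited from `BaseCorpusReader`, such as `find_sents()`, are available. The line-based `search()` and `find_lines()` are not, as the comparison table at the end of this notebook shows.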
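
To show the shape of a sentence-level search result, here is a minimal sketch of matching by exact surface forms over plain strings. `find_by_forms` is a hypothetical helper; the real `find_sents()` works on spaCy documents and also supports lemma and Matcher patterns, and the `citation`/`sentence`/`matches` keys here simply mirror the fields used in the cells of this section.

```python
import re


def find_by_forms(fileid, sentences, forms):
    """Yield a hit dict for each sentence containing any of the given forms."""
    wanted = set(forms)
    for n, sent in enumerate(sentences):
        matches = [w for w in re.findall(r"\w+", sent) if w in wanted]
        if matches:
            yield {
                "citation": f"{fileid}:sent{n}",
                "sentence": sent,
                "matches": matches,
            }


sents_demo = [
    "quis furor, o ciues, quae tanta licentia ferri?",
    "iam gelidas Caesar cursu superauerat Alpes.",
]
for hit in find_by_forms("lucan/lucan1.txt", sents_demo, ["Caesar", "Caesarem"]):
    print(hit["citation"], hit["matches"])
```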

In [36]:
# find_sents() - find sentences containing specific words
# Works with pattern, forms, lemma, or matcher_pattern
# Restrict to Lucan's texts

lucan_fileids = L.fileids(match="lucan")

for hit in islice(L.find_sents(forms=["Caesar", "Caesarem", "Caesaris"], fileids=lucan_fileids), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched: {hit['matches']}")
    print()

lucan/lucan1.txt:sent11: diros Pharsalia campos inpleat et Poeni saturentur sanguine manes, ultima funest...
  Matched: ['Caesar']

lucan/lucan1.txt:sent68: iam gelidas Caesar cursu superauerat Alpes ingentisque animo motus bellumque fut...
  Matched: ['Caesar']

lucan/lucan1.txt:sent75: en, adsum uictor terraque marique Caesar, ubique tuus (liceat modo, nunc quoque)...
  Matched: ['Caesar']

lucan/lucan1.txt:sent83: Caesar, ut aduersam superato gurgite ripam attigit, Hesperiae uetitis et constit...
  Matched: ['Caesar']

lucan/lucan1.txt:sent92: ut notae fulsere aquilae Romanaque signa et celsus medio conspectus in agmine Ca...
  Matched: ['Caesar']



In [37]:
# find_sents() by lemma - slower but finds ALL forms
# Searching in Lucan for "bellum" (war) - fitting for the Pharsalia!

for hit in islice(L.find_sents(lemma="bellum", fileids=lucan_fileids), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched forms: {hit['matches']}")
    print()

lucan/lucan1.txt:sent2: gentibus inuisis Latium praebere cruorem cumque superba foret Babylon spolianda ...
  Matched forms: ['bella']

lucan/lucan1.txt:sent4: sub iuga iam Seres, iam barbarus isset Araxes et gens siqua iacet nascenti consc...
  Matched forms: ['belli']

lucan/lucan1.txt:sent9: quod si non aliam uenturo fata Neroni inuenere uiam magnoque aeterna parantur re...
  Matched forms: ['bella']

lucan/lucan1.txt:sent11: diros Pharsalia campos inpleat et Poeni saturentur sanguine manes, ultima funest...
  Matched forms: ['bella']

lucan/lucan1.txt:sent36: nam sola futuri Crassus erat belli medius mora....
  Matched forms: ['belli']



In [38]:
# find_sents() with spaCy Matcher patterns
# Search for ADJ + NOUN sequences in Lucan

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
for hit in islice(L.find_sents(matcher_pattern=pattern, fileids=lucan_fileids), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched: {hit['matches']}")
    print()

lucan/lucan1.txt:sent0: Lucan Liber I M. ANNAEI LVCANI BELLI CIVILIS LIBER PRIMVS Bella per Emathios plu...
  Matched: ['ciuilia campos', 'commune nefas', 'obuia signis', 'pares aquilas']

lucan/lucan1.txt:sent2: gentibus inuisis Latium praebere cruorem cumque superba foret Babylon spolianda ...
  Matched: ['Ausoniis umbra', 'inulta bella']

lucan/lucan1.txt:sent3: heu, quantum terrae potuit pelagique parari hoc quem ciuiles hauserunt sanguine ...
  Matched: ['glacialem frigore']

lucan/lucan1.txt:sent4: sub iuga iam Seres, iam barbarus isset Araxes et gens siqua iacet nascenti consc...
  Matched: ['Latias leges', 'miseris orbem']

lucan/lucan1.txt:sent6: at nunc semirutis pendent quod moenia tectis urbibus Italiae lapsisque ingentia ...
  Matched: ['ingentia muris', 'antiquis habitator']



In [39]:
# More complex Matcher patterns
# Find sentences with "magnus" followed by a noun in Lucan

pattern = [{"LEMMA": "magnus"}, {"POS": "NOUN"}]
for hit in islice(L.find_sents(matcher_pattern=pattern, fileids=lucan_fileids), 5):
    print(f"{hit['citation']}: {hit['matches']}")

lucan/lucan1.txt:sent53: ['magni nominis']
lucan/lucan1.txt:sent251: ['magnorum fata']
lucan/lucan2.txt:sent134: ['magna senatus']
lucan/lucan3.txt:sent36: ['magno populis']
lucan/lucan3.txt:sent63: ['magnam uictor']

## Annotation Levels

Control NLP processing overhead with `AnnotationLevel`.
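
One way to think about the levels is as cumulative tiers, each adding work on top of the previous one. The sketch below models that with a stand-in enum; `Level`, `ADDED`, and `steps_for` are hypothetical names, the step labels are taken from the level descriptions in this notebook, and the cumulative ordering is an assumption rather than a statement about how `AnnotationLevel` is implemented.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical stand-in mirroring the four level names and values."""
    NONE = "none"
    TOKENIZE = "tokenize"
    BASIC = "basic"
    FULL = "full"


ORDER = [Level.NONE, Level.TOKENIZE, Level.BASIC, Level.FULL]

# Work added at each tier (illustrative labels only)
ADDED = {
    Level.NONE: [],
    Level.TOKENIZE: ["tokenization", "sentence boundaries"],
    Level.BASIC: ["lemmatization", "POS tagging"],
    Level.FULL: ["NER", "dependency parsing"],
}


def steps_for(level):
    """Collect every step from the lowest tier up to the requested one."""
    steps = []
    for tier in ORDER[:ORDER.index(level) + 1]:
        steps.extend(ADDED[tier])
    return steps


print(steps_for(Level.BASIC))
```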

In [40]:
# AnnotationLevel controls how much NLP processing to apply

# NONE - use texts() for raw strings (fastest)
# TOKENIZE - tokenization + sentence boundaries only
# BASIC - adds lemmatization and POS tagging (default)
# FULL - full pipeline including NER and dependency parsing

reader_fast = LatinLibraryReader(annotation_level=AnnotationLevel.TOKENIZE)
reader_full = LatinLibraryReader(annotation_level=AnnotationLevel.FULL)

print("Available annotation levels:")
for level in AnnotationLevel:
    print(f"  {level.name}: {level.value}")

Available annotation levels:
  NONE: none
  TOKENIZE: tokenize
  BASIC: basic
  FULL: full


## Document Caching

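Parsed documents are cached by default, so repeated calls on the same file skip re-parsing. `cache_stats()` reports hits, misses, the current size, and the maximum size.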
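
The effect of such a cache can be illustrated with the standard library's `functools.lru_cache`. This is only an analogy: `load_doc` is a hypothetical stand-in for an expensive parse, and whether the reader uses `lru_cache` internally is not something this notebook confirms.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def load_doc(fileid):
    """Stand-in for an expensive NLP parse of one file."""
    return f"parsed:{fileid}"


# Two distinct files, one of them requested three times
for fid in ["lucan/lucan1.txt", "catullus.txt", "lucan/lucan1.txt", "lucan/lucan1.txt"]:
    load_doc(fid)

info = load_doc.cache_info()
stats = {"hits": info.hits, "misses": info.misses, "size": info.currsize, "maxsize": info.maxsize}
print(stats)
```

Each first request counts as a miss and each repeat as a hit, which is how a hit count well above the miss count can arise from a notebook that keeps returning to the same few files.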

In [41]:
# Check cache statistics
print("Cache stats:", L.cache_stats())

# Clear the cache if needed
# L.clear_cache()

Cache stats: {'hits': 23, 'misses': 5, 'size': 5, 'maxsize': 128}


## FileSelector API

Fluent file filtering with `select()`.
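
A fluent interface of this kind returns a new selection from each filter call, so filters can be chained. The sketch below shows the pattern with a hypothetical `Selection` class; its method names echo the calls used in this section, but the class itself and its case-insensitive matching are assumptions, not the library's implementation.

```python
import re


class Selection:
    """Hypothetical immutable list of fileids with chainable filters."""

    def __init__(self, fileids):
        self._fileids = list(fileids)

    def match(self, pattern):
        """Return a new Selection keeping fileids that match the regex."""
        rx = re.compile(pattern, re.IGNORECASE)
        return Selection(f for f in self._fileids if rx.search(f))

    def preview(self, n=5):
        return self._fileids[:n]

    def __len__(self):
        return len(self._fileids)

    def __iter__(self):
        return iter(self._fileids)


ids_demo = ["cicero/acad.txt", "vergil/aen1.txt", "vergil/aen2.txt", "horace/carm1.txt"]
chained = Selection(ids_demo).match(r"vergil").match(r"aen1")
print(len(chained), chained.preview(5))
```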

In [42]:
# Filter by filename pattern
selection = L.select().match(r"vergil")
print(f"Vergil files: {len(selection)}")
print(selection.preview(5))

Vergil files: 26
['vergil/aen1.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt']


In [43]:
# Build a selection (only one filter is applied here; more can be chained)
selection = L.select().match(r"cicero")
print(f"Cicero files: {len(selection)}")

# Use with docs()
for doc in islice(L.docs(selection), 2):
    print(f"{doc._.fileid}: {len(list(doc.sents))} sentences")

Cicero files: 138
cicero/acad.txt: 236 sentences
cicero/adbrutum1.txt: 620 sentences


## Comparison with TesseraeReader

| Feature | TesseraeReader | LatinLibraryReader |
|---------|----------------|--------------------|
| File format | `.tess` | `.txt` |
| Citation system | Built-in (`<author. work. line>`) | Fallback (`fileid:sentN`) |
| `lines()` | Yes (citation-annotated spans) | No |
| `doc_rows()` | Yes (citation -> span mapping) | No |
| `paras()` | No (format doesn't support) | Yes |
| `texts_by_line()` | Yes | No |
| `search()` | Yes (fast regex on lines) | No |
| `find_lines()` | Yes | No |
| `find_sents()` | Yes | Yes |
| Metadata | Rich (author, date, genre) | Minimal (title from first line) |
| Auto-download | Yes | Yes |