# LatinLibrary Reader Demo

This notebook demonstrates the `LatinLibraryReader` for working with the [Latin Library](https://thelatinlibrary.com/) corpus.

**Key differences from TesseraeReader:**
- Plain text files (`.txt`) instead of citation-annotated `.tess` files
- No built-in citation system (uses `fileid:sentN` fallback)
- Paragraph extraction via `paras()` method
- Title metadata extracted from first line of each file

In [None]:
## Imports

from latincyreaders import LatinLibraryReader, AnnotationLevel

from pprint import pprint

In [None]:
## Set up reader

# Auto-downloads corpus on first use if not found
L = LatinLibraryReader()

## Fileids and metadata

In [None]:
## First 10 filenames

files = L.fileids()[:10]
pprint(files)

In [None]:
# Get files by pattern match (regex)
files = L.fileids(match='horace')
pprint(files)

In [None]:
# Regex patterns can span parts of a filename (here: Vergil's Aeneid)
files = L.fileids(match=r'vergil.*aen')
pprint(files)

In [None]:
# Get files by partial match
files = L.fileids(match='cicero')[:10]
pprint(files)

In [None]:
# Case-insensitive regex matching
files = L.fileids(match=r'ovid')[:15]
pprint(files)

In [None]:
# Get all files
all_files = L.fileids()
print(f"Total files: {len(all_files)}")

### Metadata

LatinLibrary files have minimal metadata compared to Tesserae. The `title` is extracted from the first line of each file.
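
As a rough illustration of what first-line title extraction amounts to, here is a minimal sketch on an in-memory string. `extract_title` is a hypothetical helper, not part of `latincyreaders`, and the reader's own logic may differ (for instance in how it treats leading blank lines).

```python
def extract_title(text: str) -> str:
    """Return the first line of a plain-text file, stripped.

    Hypothetical helper for illustration only.
    """
    lines = text.splitlines()
    return lines[0].strip() if lines else ""


# Made-up sample content: a title line, a blank line, then the text
sample = "Catullus\n\nOdi et amo.\n"
title = extract_title(sample)
print(title)
```

An empty file yields an empty title here instead of raising an error; that is a choice made for this sketch.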

In [None]:
# Get metadata for a specific file
# Note: LatinLibrary extracts title from first line

sample_file = L.fileids(match='cicero')[0]
meta = L.get_metadata(sample_file)

print(f"Metadata for {sample_file}:")
pprint(meta)

In [None]:
# Iterate through metadata for the first five files
from itertools import islice

for fileid, meta in islice(L.metadata(), 5):
    title = meta.get('title', 'No title')
    print(f"{fileid}: {title[:50]}...")

## Doc structures

In [None]:
# Define a file to work with
catullus_files = L.fileids(match='catullus')
print(f"Catullus files: {catullus_files}")
catullus = catullus_files[0] if catullus_files else None

In [None]:
## Docs - spaCy Doc objects with NLP annotations

if catullus:
    catullus_doc = next(L.docs(catullus))
    # Slicing a Doc is token-based: this prints the first 500 tokens
    print(catullus_doc[:500])

In [None]:
## Texts - raw strings (zero NLP overhead)

if catullus:
    catullus_text = next(L.texts(catullus))
    pprint(catullus_text[:400])

### Note: No doc_rows() or lines()

Unlike TesseraeReader, LatinLibraryReader does not have citation-annotated lines. Use `sents()` or `paras()` instead.
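
Because plain-text files carry no line citations, paragraphs have to come from the text layout itself. The sketch below assumes paragraphs are separated by one or more blank lines. That is an assumption about the file layout, not a statement of how `paras()` is implemented, and `split_paragraphs` is a hypothetical helper.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs on blank lines (hypothetical helper)."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Made-up sample: a title line followed by two paragraphs
sample = "Catullus\n\nOdi et amo.\nQuare id faciam.\n\nNescio.\n"
paragraphs = split_paragraphs(sample)
for number, para in enumerate(paragraphs, start=1):
    print(f"Para {number}: {para!r}")
```

Under this assumption the title line comes out as its own paragraph, so a real reader would probably treat it separately.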

## Doc units

In [None]:
# Get a Cicero file for examples
cicero_files = L.fileids(match='cicero')
catilinam = cicero_files[0] if cicero_files else None
print(f"Using file: {catilinam}")

In [None]:
## Paras - paragraph spans (available in LatinLibrary!)

if catilinam:
    paras = list(L.paras(catilinam))
    print(f"Total paragraphs: {len(paras)}")
    print()
    for i, para in enumerate(paras[:3]):
        print(f"Para {i+1}: {para.text[:100]}...")
        print()

In [None]:
# Sents - spaCy Span objects

if catilinam:
    sents = L.sents(catilinam)
    for i in range(1, 6):
        print(f'Sent {i}: {next(sents)}')

In [None]:
# Tokens - spaCy Token objects

if catilinam:
    tokens = L.tokens(catilinam)
    for i in range(1, 10):
        print(f'Word {i}: {next(tokens)}')

In [None]:
# Token linguistic attributes (BASIC level: text, lemma, POS, tag)

if catilinam:
    tokens = L.tokens(catilinam)
    t = next(tokens)
    print(f"text: {t.text}, lemma: {t.lemma_}, pos: {t.pos_}, tag: {t.tag_}")

In [None]:
# For custom text processing, request tokens as plain strings

if catilinam:
    for token_text in L.tokens(catilinam, as_text=True):
        processed = token_text.lower()  # e.g. normalize case
        print(processed)
        break  # Just show the first token

In [None]:
# Tokenized sents - use spaCy directly
# Get (token, lemma, tag) tuples from sentences

if catilinam:
    sents = L.sents(catilinam)
    for i in range(1, 4):
        sent = next(sents)
        tok_sent = [(t.text, t.lemma_, t.tag_) for t in sent]
        print(f'Tok Sent {i}: {tok_sent}')
        print()

In [None]:
# POS-tagged sents - token/POS pairs

if catilinam:
    sents = L.sents(catilinam)
    for i in range(1, 3):
        sent = next(sents)
        pos_sent = [f"{t.text}/{t.pos_}" for t in sent]
        print(f'POS Sent {i}: {" ".join(pos_sent)}')

In [None]:
# spaCy Token objects by default
if catilinam:
    tokens = L.tokens(catilinam)
    token = next(tokens)
    print(token)
    print(type(token))  # type of the token just printed

In [None]:
# Tokens as plain strings with as_text=True

if catilinam:
    plaintext_tokens = L.tokens(catilinam, as_text=True)
    plaintext_token = next(plaintext_tokens)
    print(plaintext_token)
    print(type(plaintext_token))

## Concordance

In [None]:
# Build a concordance: word -> list of citations where it appears
# Note: Without Tesserae citations, uses fileid:sentN format

if catullus:
    conc = L.concordance(fileids=catullus, basis="lemma")
    print(f"Unique lemmas: {len(conc)}")
    print()

    # Look up a specific lemma
    if "amor" in conc:
        print("Citations for 'amor':")
        for cit in conc["amor"][:10]:
            print(f"  {cit}")
        if len(conc["amor"]) > 10:
            print(f"  ... and {len(conc['amor']) - 10} more")

In [None]:
# Concordance by surface text form (exact spelling)
if catullus:
    conc_text = L.concordance(fileids=catullus, basis="text")

    # Different forms of 'puella' (girl)
    puella_forms = ["puella", "puellae", "puellam", "puellas", "puellis"]
    print("Occurrences of 'puella' forms:")
    for form in puella_forms:
        if form in conc_text:
            count = len(conc_text[form])
            print(f"  {form}: {count} occurrences")
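
To make the shape of a concordance concrete, here is a minimal pure-Python sketch that maps each word to a list of `fileid:sentN` style citations. `build_concordance`, the sample data, and numbering sentences from 1 are all assumptions made for illustration. It works on whitespace-split, lower-cased strings, not spaCy tokens or lemmas, so it is not the reader's implementation.

```python
from collections import defaultdict


def build_concordance(sentences_by_file: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each lower-cased word to the citations where it occurs.

    Hypothetical helper; citations use a 'fileid:sentN' pattern,
    with N counted from 1 (an assumption).
    """
    conc = defaultdict(list)
    for fileid, sentences in sentences_by_file.items():
        for n, sentence in enumerate(sentences, start=1):
            for word in sentence.lower().split():
                conc[word].append(f"{fileid}:sent{n}")
    return dict(conc)


# Made-up, punctuation-free sample data
sample = {"catullus.txt": ["odi et amo", "nescio sed fieri sentio et excrucior"]}
conc_sketch = build_concordance(sample)
print(conc_sketch["et"])
```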

## KWIC (Keyword in Context)

Find words with surrounding context - useful for studying word usage patterns.
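
The idea behind a KWIC display can be sketched in a few lines of plain Python over a token list. The `left`/`match`/`right` keys mirror the fields printed in the cells in this section; `kwic_sketch` is a hypothetical function, and its case-insensitive exact matching is an assumption, not the reader's documented behavior.

```python
def kwic_sketch(tokens: list[str], keyword: str, window: int = 5) -> list[dict[str, str]]:
    """Return each occurrence of keyword with up to `window` tokens per side.

    Hypothetical helper for illustration only.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            hits.append({
                "left": " ".join(tokens[max(0, i - window):i]),
                "match": tok,
                "right": " ".join(tokens[i + 1:i + 1 + window]),
            })
    return hits


# Punctuation stripped for simplicity
sample_tokens = "Vivamus mea Lesbia atque amemus".split()
kwic_hits = kwic_sketch(sample_tokens, "lesbia", window=2)
for hit in kwic_hits:
    print(f"{hit['left']} [{hit['match']}] {hit['right']}")
```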

In [None]:
# Basic KWIC search - find "amor" with 5 tokens of context on each side
if catullus:
    for hit in L.kwic("amor", fileids=catullus, window=5, limit=5):
        print(f"{hit['left']} [{hit['match']}] {hit['right']}")
        print(f"  -- {hit['citation']}")
        print()

In [None]:
# KWIC by lemma - finds all forms of a word (e.g., amo, amat, amant, amavit)

if catullus:
    for hit in L.kwic("amo", fileids=catullus, by_lemma=True, window=4, limit=5):
        print(f"{hit['left']} [{hit['match']}] {hit['right']}")
        print(f"  -- {hit['citation']}")
        print()

## N-grams and Skipgrams

Extract contiguous token sequences (n-grams) or sequences with gaps (skipgrams) for collocation analysis and language modeling.
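
For readers new to the terms, the following pure-Python sketch shows the difference on a four-token example. The skipgram definition used here (n tokens in order, with at most `k` tokens skipped in total) follows the common NLTK-style convention; whether the reader's `skipgrams()` uses exactly this definition is an assumption. Both functions are hypothetical helpers.

```python
from itertools import combinations


def ngrams_sketch(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams_sketch(tokens: list[str], n: int, k: int) -> list[tuple[str, ...]]:
    """n-token sequences in order, skipping at most k tokens in total."""
    grams = []
    for i in range(len(tokens)):
        # Candidates for the remaining n-1 slots: the next n-1+k tokens
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((tokens[i], *rest))
    return grams


sample_tokens = ["odi", "et", "amo", "quare"]
bigrams_sketch = ngrams_sketch(sample_tokens, 2)
skip_bigrams_sketch = skipgrams_sketch(sample_tokens, n=2, k=1)
print(bigrams_sketch)
print(skip_bigrams_sketch)
```

With `k=0` the skipgram function reduces to ordinary n-grams, which is a quick sanity check on the definition.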

In [None]:
# Extract bigrams (2-word sequences)
from itertools import islice

if catullus:
    bigrams = list(islice(L.ngrams(n=2, fileids=catullus), 20))
    print("First 20 bigrams:")
    pprint(bigrams)

In [None]:
# Trigrams (3-word sequences)
if catullus:
    trigrams = list(islice(L.ngrams(n=3, fileids=catullus), 10))
    print("First 10 trigrams:")
    pprint(trigrams)

In [None]:
# Get n-grams as token tuples for linguistic analysis

if catullus:
    for gram in islice(L.ngrams(n=2, fileids=catullus, as_tuples=True), 5):
        print([(t.text, t.lemma_, t.pos_) for t in gram])

In [None]:
# Bigram frequency analysis - find most common word pairs
from collections import Counter

if catullus:
    bigram_counts = Counter(L.ngrams(n=2, fileids=catullus))
    print("Most common bigrams:")
    for bigram, count in bigram_counts.most_common(10):
        print(f"  {bigram}: {count}")

In [None]:
# Skipgrams - word pairs with gaps between them

if catullus:
    skipgrams = list(islice(L.skipgrams(n=2, k=1, fileids=catullus), 15))
    print("First 15 skipgrams (bigrams with 1 skip):")
    pprint(skipgrams)

In [None]:
# N-grams by lemma - normalize inflected forms

if catullus:
    print("Bigrams by surface text:")
    text_bigrams = list(islice(L.ngrams(n=2, fileids=catullus, basis="text"), 5))
    pprint(text_bigrams)

    print("\nBigrams by lemma:")
    lemma_bigrams = list(islice(L.ngrams(n=2, fileids=catullus, basis="lemma"), 5))
    pprint(lemma_bigrams)

    print("\nMost common lemma bigrams:")
    lemma_counts = Counter(L.ngrams(n=2, fileids=catullus, basis="lemma"))
    for bigram, count in lemma_counts.most_common(10):
        print(f"  {bigram}: {count}")

## Basic descriptive stats

In [None]:
# Quick corpus overview
files = L.fileids()
print(f"Total files: {len(files)}")

# Sample stats from one file
sample_file = files[0]
sample_text = next(L.texts(sample_file))
print(f"\nSample file: {sample_file}")
print(f"Character count: {len(sample_text)}")
print(f"Word count (approx): {len(sample_text.split())}")

In [None]:
## Stats for a specific file

if catullus:
    doc = next(L.docs(catullus))
    print(f'Stats for {catullus}:')
    print(f'  Sentences: {len(list(doc.sents))}')
    print(f'  Tokens: {len(doc)}')

## Search Features

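Sentence-level search via `find_sents()`, inherited from `BaseCorpusReader`, is available. The line-based `search()` and `find_lines()` of TesseraeReader are not (see the comparison table at the end).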

In [None]:
# find_sents() - find sentences containing specific words
# Works with pattern, forms, lemma, or matcher_pattern

for hit in islice(L.find_sents(forms=["Caesar", "Caesarem", "Caesaris"]), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched: {hit['matches']}")
    print()

In [None]:
# find_sents() by lemma - slower but finds ALL forms

for hit in islice(L.find_sents(lemma="bellum"), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched forms: {hit['matches']}")
    print()

In [None]:
# find_sents() with spaCy Matcher patterns
# Search for ADJ + NOUN sequences

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
for hit in islice(L.find_sents(matcher_pattern=pattern, fileids=L.fileids(match="catullus")), 5):
    print(f"{hit['citation']}: {hit['sentence'][:80]}...")
    print(f"  Matched: {hit['matches']}")
    print()

In [None]:
# More complex Matcher patterns
# Find sentences with a specific lemma followed by a noun

pattern = [{"LEMMA": "magnus"}, {"POS": "NOUN"}]
for hit in islice(L.find_sents(matcher_pattern=pattern), 5):
    print(f"{hit['citation']}: {hit['matches']}")

## Annotation Levels

Control NLP processing overhead with `AnnotationLevel`.

In [None]:
# AnnotationLevel controls how much NLP processing to apply

# NONE - use texts() for raw strings (fastest)
# TOKENIZE - tokenization + sentence boundaries only
# BASIC - adds lemmatization and POS tagging (default)
# FULL - full pipeline including NER and dependency parsing

reader_fast = LatinLibraryReader(annotation_level=AnnotationLevel.TOKENIZE)
reader_full = LatinLibraryReader(annotation_level=AnnotationLevel.FULL)

print("Available annotation levels:")
for level in AnnotationLevel:
    print(f"  {level.name}: {level.value}")

## Document Caching

Documents are cached by default for better performance.
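
The general benefit of caching parsed documents can be illustrated with the standard library alone. The sketch below uses `functools.lru_cache` as a stand-in; it is not the reader's caching mechanism, and `load_doc` is a hypothetical function that only pretends to parse a file.

```python
from functools import lru_cache

parse_calls = []  # records every time the "expensive" work actually runs


@lru_cache(maxsize=128)
def load_doc(fileid: str) -> str:
    """Pretend to parse a file; a stand-in for an expensive NLP pipeline run."""
    parse_calls.append(fileid)
    return f"<parsed {fileid}>"


load_doc("catullus.txt")  # first access: parsed and stored
load_doc("catullus.txt")  # second access: served from the cache
cache_info = load_doc.cache_info()
print(f"hits={cache_info.hits}, misses={cache_info.misses}")
```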

In [None]:
# Check cache statistics
print("Cache stats:", L.cache_stats())

# Clear the cache if needed
# L.clear_cache()

## FileSelector API

Fluent file filtering with `select()`.
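
"Fluent" here means that each filter returns a new selection, so calls can be chained. The toy class below sketches that pattern on a list of made-up file names. `TinySelector` is hypothetical and only imitates the `match()`, `len()` and `preview()` calls used in this section; the real selector may offer more filters and different matching rules.

```python
import re


class TinySelector:
    """Toy fluent selector over a list of file ids (illustration only)."""

    def __init__(self, fileids):
        self._fileids = list(fileids)

    def match(self, pattern: str) -> "TinySelector":
        # Return a new selector so that calls can be chained
        rx = re.compile(pattern)
        return TinySelector(f for f in self._fileids if rx.search(f))

    def __len__(self) -> int:
        return len(self._fileids)

    def preview(self, n: int = 5) -> list[str]:
        return self._fileids[:n]


# Made-up file names
names = ["cicero/cat1.txt", "cicero/off1.txt", "vergil/aen1.txt"]
chained = TinySelector(names).match(r"cicero").match(r"cat")
print(len(chained), chained.preview())
```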

In [None]:
# Filter by filename pattern
selection = L.select().match(r"vergil")
print(f"Vergil files: {len(selection)}")
print(selection.preview(5))

In [None]:
# Build a selection from a single filter
selection = L.select().match(r"cicero")
print(f"Cicero files: {len(selection)}")

# Use with docs()
for doc in islice(L.docs(selection), 2):
    print(f"{doc._.fileid}: {len(list(doc.sents))} sentences")

## Comparison with TesseraeReader

| Feature | TesseraeReader | LatinLibraryReader |
|---------|----------------|--------------------|
| File format | `.tess` | `.txt` |
| Citation system | Built-in (`<author. work. line>`) | Fallback (`fileid:sentN`) |
| `lines()` | Yes (citation-annotated spans) | No |
| `doc_rows()` | Yes (citation -> span mapping) | No |
| `paras()` | No (format doesn't support) | Yes |
| `texts_by_line()` | Yes | No |
| `search()` | Yes (fast regex on lines) | No |
| `find_lines()` | Yes | No |
| `find_sents()` | Yes | Yes |
| Metadata | Rich (author, date, genre) | Minimal (title from first line) |
| Auto-download | Yes | Yes |