# Counting Words in Homer

Word frequency analysis of Homer's *Iliad* and *Odyssey* using the Greek Tesserae corpus and `GreekTesseraeReader`.

## Setup

```bash
# Install latincy-readers
pip install latincy-readers

# Install OdyCy model (for lemmatization and POS tagging)
pip install https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl
```

In [None]:
from collections import Counter

from latincyreaders import GreekTesseraeReader, AnnotationLevel

## Load the Greek Tesserae Corpus

The corpus will be downloaded automatically on first use from the [CLTK Greek Tesserae repository](https://github.com/cltk/grc_text_tesserae).

In [None]:
# Use TOKENIZE for fast iteration (no OdyCy model needed)
reader = GreekTesseraeReader(annotation_level=AnnotationLevel.TOKENIZE)

# List available files
all_files = reader.fileids()
print(f"Total files: {len(all_files)}")
print("\nHomer files:")
homer_files = reader.fileids(match=r"homer")
for f in homer_files:
    print(f"  {f}")

## Word Frequency with `texts_by_line()`

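The fastest approach: iterate over raw text lines and split on whitespace, with no NLP pipeline involved. Counts are per surface form, so inflected forms of the same word are tallied separately.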

In [None]:
# Simple word counting from raw text
word_counts = Counter()

for citation, text in reader.texts_by_line(fileids=homer_files):
    words = text.split()
    word_counts.update(words)

print(f"Total word tokens: {sum(word_counts.values())}")
print(f"Unique word types: {len(word_counts)}")
print("\nMost common words:")
for word, count in word_counts.most_common(20):
    print(f"  {word:20s} {count:>6d}")
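
Whitespace splitting keeps trailing punctuation on tokens, and the same Greek letter can be encoded with more than one codepoint. A minimal sketch of a cleanup step that could run before counting, assuming plain strings as input. The `normalize_token` helper, the punctuation set, and the sample lines are illustrative and not part of `latincy-readers`:

```python
import unicodedata
from collections import Counter

# Punctuation trimmed from token edges: ASCII marks plus the middle dot.
# U+0387 and U+037E are listed for readability; NFC already maps them
# to U+00B7 and ';' respectively.
EDGE_PUNCT = ".,;:!?()[]\u00b7\u0387\u037e"


def normalize_token(token: str) -> str:
    """Return the token in NFC form with edge punctuation removed."""
    return unicodedata.normalize("NFC", token).strip(EDGE_PUNCT)


# Illustrative input: two sample verse lines as plain strings.
sample_lines = [
    "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
    "οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,",
]

counts = Counter()
for line in sample_lines:
    tokens = (normalize_token(t) for t in line.split())
    counts.update(t for t in tokens if t)  # skip tokens that were only punctuation

print(counts.most_common(5))
```

Elision marks inside a token (as in `μυρί᾽`) are left untouched here; whether to keep them is an analysis decision.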

## Search with `search()` and `find_lines()`

Find specific words and patterns across the corpus.

In [None]:
# Search for Achilles (various forms) and show the first 10 hits
print("Lines mentioning Achilles:")
hits = reader.search(r"Ἀχιλ", fileids=homer_files)
for i, (fileid, citation, text, matches) in enumerate(hits):
    if i >= 10:
        # Stop after 10 lines instead of re-running the search on every iteration
        print("  ...")
        break
    print(f"  {citation}: {text[:80]}...")

In [None]:
# Count lines that mention each character (patterns match name stems)
characters = {
    "Achilles": r"Ἀχιλ",
    "Hector": r"Ἕκτ",
    "Odysseus": r"Ὀδυσ",
    "Zeus": r"Ζε[υύ]",
    "Athena": r"Ἀθην",
}

print("Character mentions in Homer:")
for name, pattern in characters.items():
    count = len(list(reader.search(pattern, fileids=homer_files)))
    print(f"  {name:15s} {count:>4d} lines")
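
Stem patterns such as `Ἀχιλ` match one exact spelling: a capital alpha with smooth breathing, encoded the same way as in the corpus. One possible workaround, sketched here on plain strings, is to fold case and strip combining marks from both the pattern and the text before matching. The `fold` helper and the sample lines are assumptions for illustration, not part of the reader API:

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Lowercase and drop combining marks (accents, breathings, iota subscript)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


# Illustrative lines; in the notebook these would come from the reader.
lines = [
    "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
    "ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον",
    "τὸν δ᾽ ἀπαμειβόμενος προσέφη πόδας ὠκὺς Ἀχιλλεύς",
]

# Fold the pattern the same way as the text so both sides are comparable.
pattern = re.compile(fold("Ἀχιλ"))
matching = [line for line in lines if pattern.search(fold(line))]
print(len(matching))
```

Folding trades precision for recall: it also merges words that differ only by accent or breathing, so it suits rough counts better than close reading.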

## KWIC (Keyword in Context)

See words in their surrounding context.

In [None]:
# KWIC for a key term
print("KWIC for 'μῆνιν' (wrath):")
for hit in reader.kwic("μῆνιν", fileids=homer_files, window=5, limit=10):
    print(f"  {hit['left']:>40s} [{hit['match']}] {hit['right']:<40s}")
    print(f"  {'':>40s}  {hit['citation']}")
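
To make the `window` idea concrete, here is a minimal KWIC sketch over a plain token list. It is not the library's implementation; `kwic_sketch` is a hypothetical helper that assumes exact token matches and a window measured in tokens:

```python
def kwic_sketch(tokens, keyword, window=5):
    """Yield (left, match, right) context strings for each exact match."""
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield left, token, right


# Illustrative token list built from a sample line.
tokens = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην".split()

for left, match, right in kwic_sketch(tokens, tokens[2], window=2):
    print(f"{left:>20s} [{match}] {right}")
```

At the start or end of the token list the slices shrink, so the left or right context can be shorter than `window` or empty.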

## N-grams

Extract bigrams and trigrams for collocational analysis.

In [None]:
# Bigram frequency
bigram_counts = Counter(reader.ngrams(n=2, fileids=homer_files))

print("Most common bigrams:")
for bigram, count in bigram_counts.most_common(15):
    print(f"  {bigram:30s} {count:>4d}")
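
For reference, n-gram extraction itself needs nothing beyond `zip`. The sketch below counts bigrams per line, so no bigram spans a line boundary; whether `reader.ngrams()` behaves the same way is not assumed here. `token_ngrams` and the sample lines are illustrative:

```python
from collections import Counter


def token_ngrams(tokens, n=2):
    """Return the n-grams of a token sequence as a list of tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Illustrative lines sharing a formulaic ending.
formula = "προσέφη πόδας ὠκὺς Ἀχιλλεύς"
lines = [
    f"τὸν δ᾽ ἀπαμειβόμενος {formula}",
    f"τὴν δὲ μέγ᾽ ὀχθήσας {formula}",
]

bigram_counts = Counter()
for line in lines:
    bigram_counts.update(token_ngrams(line.split(), n=2))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

Keeping n-grams as tuples makes them usable as `Counter` keys directly; joining them into strings is only needed for display.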

## Concordance (with OdyCy)

Build a concordance keyed by lemma. Lemmatization requires the OdyCy model; if you skipped that step in Setup, install it now:

```bash
pip install https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl
```

In [None]:
# Reload with BASIC annotation level for lemmatization
reader_nlp = GreekTesseraeReader(annotation_level=AnnotationLevel.BASIC)

# Build concordance for Homer
conc = reader_nlp.concordance(fileids=homer_files, basis="lemma")

print(f"Unique lemmas: {len(conc)}")
print("\nMost cited lemmas:")
top_lemmas = sorted(conc.items(), key=lambda x: len(x[1]), reverse=True)[:15]
for lemma, citations in top_lemmas:
    print(f"  {lemma:20s} {len(citations):>4d} occurrences")
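
Conceptually, a concordance is a mapping from a key to the list of places where it occurs. A minimal sketch keyed by surface form, since no lemmatizer is loaded here; with a model, each token would be replaced by its lemma before indexing. The citation labels and sample lines are made up for illustration and do not follow the reader's citation format:

```python
from collections import defaultdict

# Illustrative (citation, text) pairs with hypothetical citation labels.
formula = "προσέφη πόδας ὠκὺς Ἀχιλλεύς"
cited_lines = [
    ("sample-a", f"τὸν δ᾽ ἀπαμειβόμενος {formula}"),
    ("sample-b", f"τὴν δὲ μέγ᾽ ὀχθήσας {formula}"),
]

# Map each token to every citation where it occurs, in reading order.
concordance = defaultdict(list)
for citation, text in cited_lines:
    for token in text.split():
        concordance[token].append(citation)

# Rank keys by number of citations, as in the cell above.
top = sorted(concordance.items(), key=lambda item: len(item[1]), reverse=True)[:3]
for token, citations in top:
    print(token, citations)
```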

## Export Results

Export search results for further analysis.

In [None]:
# Export search results as TSV
results = reader.find_sents(pattern=r"Ἀχιλ", fileids=homer_files)
tsv = reader.export_search_results(results, format="tsv")
print(tsv[:500])
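
If results are already held as plain Python records, the standard library's `csv` module can produce the same kind of TSV. A sketch under that assumption; the field names and rows are hypothetical and do not mirror the columns that `export_search_results` emits:

```python
import csv
import io

# Hypothetical records; real rows would come from a search.
rows = [
    {"citation": "sample-a", "text": "τὸν δ᾽ ἀπαμειβόμενος προσέφη πόδας ὠκὺς Ἀχιλλεύς"},
    {"citation": "sample-b", "text": "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"},
]

# Write a header row and the records as tab-separated text in memory.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["citation", "text"], delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Writing `tsv_text` to disk with `encoding="utf-8"` keeps the polytonic Greek intact when the file is opened in other tools.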