# Universal Dependencies Reader Demo

This notebook demonstrates the `UDReader` and `LatinUDReader` for working with Latin Universal Dependencies treebanks.

**Key features:**
- Parses CoNLL-U format files
- Constructs spaCy Docs directly from gold-standard UD annotations
- Preserves all UD data in the `token._.ud` extension
- Exposes sentence spans with `sent_id` as citations
- Auto-downloads 6 Latin treebanks

In [None]:
## Imports

from latincyreaders import UDReader, LatinUDReader, PROIELReader
from pprint import pprint

## Available Latin Treebanks

There are 6 Latin UD treebanks available for auto-download.

In [None]:
# See all available Latin UD treebanks

treebanks = LatinUDReader.available_treebanks()
print("Available Latin UD Treebanks:")
print()
for name, description in treebanks.items():
    print(f"  {name:10} - {description}")

In [None]:
# Download a specific treebank (PROIEL - contains Caesar, Cicero, Vulgate)
# This will prompt for confirmation if not already downloaded

reader = PROIELReader()

## File Discovery

In [None]:
# List available files

files = reader.fileids()
print(f"Total files: {len(files)}")
print()
pprint(files)

In [None]:
# Filter by pattern (regex)

train_files = reader.fileids(match='train')
print("Training files:")
pprint(train_files)
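
For readers unfamiliar with regex-based filtering, here is a minimal standalone sketch of the idea. The helper `filter_fileids` and the file names are illustrative only, and whether `fileids(match=...)` searches anywhere in the name (as `re.search` does here) or anchors at the start is an assumption, not something this notebook confirms.

```python
import re


def filter_fileids(fileids, match=None):
    """Return the file ids that contain a match for the regex `match`."""
    if match is None:
        return list(fileids)
    pattern = re.compile(match)
    # re.search matches anywhere in the string (assumed behaviour)
    return [f for f in fileids if pattern.search(f)]


# Illustrative file names following the usual UD naming scheme
files = [
    "la_proiel-ud-dev.conllu",
    "la_proiel-ud-test.conllu",
    "la_proiel-ud-train.conllu",
]

train_only = filter_fileids(files, match="train")
dev_or_test = filter_fileids(files, match=r"(dev|test)\.conllu$")

print(train_only)
print(dev_or_test)
```

A plain substring such as `train` is already a valid regex, so simple filters need no special syntax.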

## Working with Documents

Unlike other readers, `UDReader` constructs spaCy Docs directly from the gold-standard UD annotations. It does **not** run the spaCy NLP pipeline.
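
Building a Doc from gold annotations mostly means handing the CoNLL-U columns to spaCy instead of predicting them. One detail worth spelling out is the head index: CoNLL-U `HEAD` values are 1-based within each sentence, with 0 marking the root, while spaCy's `Doc` constructor takes `heads` as absolute 0-based positions in the document, with the root pointing at itself. The sketch below shows that conversion. The helper name is hypothetical, and whether `UDReader` does it this way internally is an assumption.

```python
def ud_heads_to_doc_indices(sentences):
    """Convert per-sentence 1-based UD HEAD values to 0-based document indices.

    `sentences` is a list of sentences, each a list of HEAD integers
    (0 marks the root). A root token is mapped to its own index.
    """
    heads = []
    offset = 0  # index of the current sentence's first token in the document
    for sent_heads in sentences:
        for i, head in enumerate(sent_heads):
            if head == 0:
                heads.append(offset + i)         # root: points at itself
            else:
                heads.append(offset + head - 1)  # shift to 0-based, add offset
        offset += len(sent_heads)
    return heads


# Two toy sentences: three tokens headed by token 2, then two tokens headed by token 1
toy_heads = ud_heads_to_doc_indices([[2, 0, 2], [0, 1]])
print(toy_heads)
```

The resulting list has one entry per token in the document and can be passed alongside `words`, `lemmas`, `pos`, and `deps` when constructing a `Doc` by hand.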

In [None]:
# Get the first document

doc = next(reader.docs())

print(f"Fileid: {doc._.fileid}")
print(f"Metadata: {doc._.metadata}")
print(f"Tokens: {len(doc)}")
print(f"Sentences: {len(doc.spans.get('ud_sents', []))}")

## Sentences with Citations

UD sentence boundaries are preserved in `doc.spans["ud_sents"]`. Each span has:
- `span._.citation` - the `sent_id` from the CoNLL-U file
- `span._.metadata` - includes the original `# text = ...` comment
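
To make the source of these values concrete, here is a minimal sketch of how the `# key = value` comment lines at the top of a CoNLL-U sentence block can be collected into a dict. It is not the reader's own parser: the function name is hypothetical, and the sample block (including its `sent_id`) is made up for illustration.

```python
def parse_sentence_comments(block):
    """Collect `# key = value` comment lines from one CoNLL-U sentence block."""
    metadata = {}
    for line in block.splitlines():
        if not line.startswith("#"):
            continue  # token lines are ignored here
        key, sep, value = line[1:].partition("=")
        if sep:  # skip comments that are not key/value pairs
            metadata[key.strip()] = value.strip()
    return metadata


# Made-up sentence block; the token lines are abbreviated
block = (
    "# sent_id = example-1\n"
    "# text = Gallia est omnis divisa in partes tres\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t4\tnsubj\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t4\tcop\t_\t_\n"
)

meta = parse_sentence_comments(block)
print(meta["sent_id"])
print(meta["text"])
```

Splitting on the first `=` only keeps any later `=` characters inside the value intact.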

In [None]:
# Iterate sentences with citations

for sent in list(doc.spans["ud_sents"])[:10]:
    print(f"{sent._.citation}: {sent.text[:60]}...")

In [None]:
# Access sentence metadata

sent = doc.spans["ud_sents"][0]
print(f"Citation: {sent._.citation}")
print(f"Metadata: {sent._.metadata}")

In [None]:
# Use ud_sents() for convenient iteration across files

from itertools import islice

for sent in islice(reader.ud_sents(), 5):
    print(f"{sent._.citation}: {sent.text[:70]}...")

## UD Annotations (token._.ud)

All 10 CoNLL-U columns are preserved in `token._.ud`:
- `id`, `form`, `lemma`, `upos`, `xpos`
- `feats` (parsed dict), `head`, `deprel`, `deps`, `misc` (parsed dict)
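
As a rough illustration of where these fields come from, the sketch below splits one tab-separated CoNLL-U token line into the ten columns and parses `feats` and `misc` into dicts. It is a standalone approximation, not the reader's implementation: the function names are hypothetical, the sample line is hand-written, and keeping `id` and `head` as strings is an assumption about value types.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_pairs(value):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means the column is empty."""
    if value == "_":
        return {}
    pairs = {}
    for item in value.split("|"):
        key, _, val = item.partition("=")
        pairs[key] = val
    return pairs


def parse_token_line(line):
    """Map one CoNLL-U token line onto the ten column names."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    token = dict(zip(FIELDS, columns))
    token["feats"] = parse_pairs(token["feats"])
    token["misc"] = parse_pairs(token["misc"])
    return token  # id, head, and the other columns stay as strings here


# Hand-written example line, not copied from a treebank
line = "1\tGallia\tGallia\tPROPN\t_\tCase=Nom|Gender=Fem|Number=Sing\t4\tnsubj\t_\t_"
parsed = parse_token_line(line)
print(parsed["feats"])
```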

In [None]:
# Examine token UD annotations

token = doc[0]
print(f"Token: {token.text}")
print()
print("UD annotations (token._.ud):")
pprint(token._.ud)

In [None]:
# Compare UD data with spaCy attributes
# Both are populated from the gold UD annotations

print(f"{'Token':<12} {'lemma_':<12} {'pos_':<8} {'dep_':<10} {'ud[feats]'}")
print("-" * 70)

for token in doc[:10]:
    feats = token._.ud.get('feats', {})
    feats_str = ', '.join(f"{k}={v}" for k, v in feats.items()) if feats else '-'
    print(f"{token.text:<12} {token.lemma_:<12} {token.pos_:<8} {token.dep_:<10} {feats_str}")

In [None]:
# Access morphological features

print("Tokens with Case feature:")
for token in doc[:20]:
    feats = token._.ud.get('feats', {})
    if 'Case' in feats:
        print(f"  {token.text}: {feats['Case']}")

## spaCy Integration

Standard spaCy attributes are populated from UD data, so you can use familiar spaCy patterns.

In [None]:
# Find all proper nouns (named-entity candidates)

propn_tokens = [t for t in doc if t.pos_ == "PROPN"]
print(f"Proper nouns in document: {len(propn_tokens)}")
print()
print("First 20:")
for t in propn_tokens[:20]:
    print(f"  {t.text} (UD id: {t._.ud.get('id', '?')})")

In [None]:
# Dependency structure is preserved

sent = doc.spans["ud_sents"][0]
print(f"Sentence: {sent.text}")
print()
print(f"{'Token':<12} {'Head':<12} {'Deprel':<10}")
print("-" * 35)
for token in sent:
    print(f"{token.text:<12} {token.head.text:<12} {token.dep_:<10}")

## LatinUDReader: All Treebanks at Once

Use `LatinUDReader` to access multiple treebanks through a single interface.

In [None]:
# Create a reader for specific treebanks
# (Set auto_download=False to skip download prompts in demo)

# unified = LatinUDReader(treebanks=["proiel", "perseus"])
# for sent in islice(unified.ud_sents(), 10):
#     print(f"{sent._.citation}: {sent.text[:60]}...")

In [None]:
# Download all treebanks at once (run manually when ready)

# LatinUDReader.download_all()

## Use Case: Bootstrapping NER Datasets

The UD reader is useful for bootstrapping NER annotation projects:
1. Gold-standard tokenization and sentence boundaries
2. PROPN tags as a **heuristic** for finding candidate sentences (not ground truth!)
3. Morphological features may help with entity classification
4. Sentence citations provide traceability back to source

**Note:** PROPN ≠ named entity. This is a starting point for finding sentences worth annotating, not a labeled dataset.

In [None]:
# Find sentences containing proper nouns (candidates for annotation)
# PROPN is a heuristic - these need human review!

ner_candidates = []

for sent in reader.ud_sents():
    propns = [t for t in sent if t.pos_ == "PROPN"]
    if propns:
        ner_candidates.append({
            'citation': sent._.citation,
            'text': sent.text,
            'propn_hints': [t.text for t in propns],  # hints, not labels
        })

print(f"Sentences with PROPN tokens (candidates for annotation): {len(ner_candidates)}")

In [None]:
# Preview candidates for annotation

for item in ner_candidates[:10]:
    print(f"{item['citation']}")
    print(f"  Text: {item['text'][:70]}...")
    print(f"  PROPN hints: {item['propn_hints']}")
    print()

In [None]:
# Export candidates for annotation (e.g., to JSONL for Label Studio, Prodigy, etc.)

import json

# Sample export
for item in ner_candidates[:3]:
    print(json.dumps(item, ensure_ascii=False))
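
The cell above only prints the records. To hand them to an annotation tool, they would typically be written to a `.jsonl` file, one JSON object per line. A minimal sketch using only the standard library follows. The `write_jsonl` helper, the output file name, and the sample record are all hypothetical stand-ins for `ner_candidates`.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Made-up record in the same shape as the items in `ner_candidates`
sample = [
    {
        "citation": "example-1",
        "text": "Gallia est omnis divisa in partes tres",
        "propn_hints": ["Gallia"],
    }
]

# Write to a temporary directory so the sketch leaves no files behind in the project
out_path = os.path.join(tempfile.mkdtemp(), "ner_candidates.jsonl")
write_jsonl(sample, out_path)

# Read the file back to confirm it round-trips
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded == sample)
```

Different tools expect different field names, so the keys may need renaming before import.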

## Raw Text Access

Use `texts()` for raw strings with zero NLP overhead.

In [None]:
# Raw text iteration (reads from # text = comments)

for text in islice(reader.texts(), 5):
    print(text)

In [None]:
# Sentences as strings

for text in islice(reader.sents(as_text=True), 5):
    print(text)

## Corpus Statistics

In [None]:
# Basic stats for a treebank

total_sents = 0
total_tokens = 0

for doc in reader.docs():
    total_sents += len(doc.spans.get('ud_sents', []))
    total_tokens += len(doc)

print("PROIEL Treebank Statistics:")
print(f"  Files: {len(reader.fileids())}")
print(f"  Sentences: {total_sents:,}")
print(f"  Tokens: {total_tokens:,}")

In [None]:
# POS tag distribution

from collections import Counter

pos_counts = Counter()
for doc in reader.docs():
    for token in doc:
        pos_counts[token.pos_] += 1

print("POS Tag Distribution:")
for pos, count in pos_counts.most_common(15):
    print(f"  {pos:<8} {count:>8,}")