# UDReader Demo

This notebook demonstrates `UDReader`, which reads files in Universal Dependencies (UD) formats.

**Supported formats:**
- `.conllu` - Full CoNLL-U format with dependency annotations
- `.conllup` - LASLA variant without dependency columns (HEAD, DEPREL)

**Key features:**
- By default, annotations come directly from the file rather than from LatinCy
- Access UD-specific fields via `Token._.ud_*` extensions
- Detect format type with `has_dependencies()`
- Access raw token data with `tokens_with_annotations()`
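
As background for the formats listed above: each token line in a standard CoNLL-U file carries ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` marking an empty value. The sketch below is plain Python and does not use `UDReader`; the helper name `parse_token_line` and the sample line are illustrative assumptions, not part of the library.

```python
# Illustrative only: a hand-rolled parser for one CoNLL-U token line.
# `parse_token_line` is a hypothetical helper, not a latincyreaders function.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line):
    """Split one tab-separated token line into a dict; '_' becomes None."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}"
        )
    return {
        name: (None if value == "_" else value)
        for name, value in zip(CONLLU_FIELDS, values)
    }


# Made-up sample line for the Latin word "puella"
sample = "1\tpuella\tpuella\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_"
token = parse_token_line(sample)
print(token["form"], token["upos"], token["head"], token["deprel"])
# prints: puella NOUN 2 nsubj
```

A `.conllup` file in the LASLA variant would simply lack the HEAD and DEPREL columns, which is why this ten-column sketch applies only to `.conllu` input.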

In [None]:
## Imports

from latincyreaders import UDReader, AnnotationLevel

from pprint import pprint

In [None]:
## Set up reader

from pathlib import Path

# Point to your CONLLU/CONLLUP corpus directory
# (e.g. UD Latin treebanks or the LASLA corpus):
# UD_PATH = Path("/path/to/your/ud/corpus")

# For this demo, we use the test fixtures
UD_PATH = Path("../tests/fixtures/ud")

U = UDReader(root=UD_PATH)

## Fileids and format detection

In [None]:
## List available files

files = U.fileids()
print(f"Total files: {len(files)}")
pprint(files)

In [None]:
# Create readers for specific formats

# CONLLU only (full UD with dependencies)
U_conllu = UDReader(root=UD_PATH, fileids="*.conllu")
print(f"CONLLU files: {U_conllu.fileids()}")

# CONLLUP only (LASLA format, no dependencies)
U_conllup = UDReader(root=UD_PATH, fileids="*.conllup")
print(f"CONLLUP files: {U_conllup.fileids()}")

In [None]:
# Detect format: CONLLU vs CONLLUP
# CONLLU has dependency annotations, CONLLUP does not

print(f"CONLLU has dependencies: {U_conllu.has_dependencies()}")
print(f"CONLLUP has dependencies: {U_conllup.has_dependencies()}")

## Annotation source: File vs LatinCy

By default, `UDReader` takes annotations directly from the CONLLU/CONLLUP file, which preserves the original treebank annotations.

Set `use_file_annotations=False` to annotate with LatinCy instead; the original file values remain available in the `Token._.ud_*` extensions.

In [None]:
# Default: use_file_annotations=True
# Annotations come directly from the UD file

print(f"use_file_annotations: {U.use_file_annotations}")

In [None]:
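# Compare: file annotations vs LatinCy
# (The LatinCy case requires a LatinCy model to be installed)

# File annotations (default)
doc_file = next(U_conllu.docs())
print("Using file annotations:")
for token in doc_file[:5]:
    print(f"  {token.text}: lemma={token.lemma_}, pos={token.pos_}")

# LatinCy annotations; the file's values stay in Token._.ud_*
U_latincy = UDReader(root=UD_PATH, fileids="*.conllu", use_file_annotations=False)
doc_latincy = next(U_latincy.docs())
print("\nUsing LatinCy annotations:")
for token in doc_latincy[:5]:
    print(f"  {token.text}: lemma={token.lemma_} (file: {token._.ud_lemma}), pos={token.pos_} (file: {token._.ud_upos})")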

## Doc structures

In [None]:
# Define a file to work with
sample_file = U_conllu.fileids()[0]
print(f"Working with: {sample_file}")

In [None]:
## Docs - spaCy Doc objects with UD annotations

doc = next(U_conllu.docs(sample_file))
print(f"Doc text: {doc.text}")
print(f"Number of tokens: {len(doc)}")

In [None]:
## Texts - raw strings (zero NLP overhead)

text = next(U_conllu.texts(sample_file))
print(f"Raw text: {text}")

## UD-specific Token extensions

`UDReader` provides access to all CoNLL-U fields via custom extensions:

| Extension | Description |
|-----------|-------------|
| `Token._.ud_id` | Original token ID from file |
| `Token._.ud_lemma` | Lemma from UD file |
| `Token._.ud_upos` | Universal POS tag |
| `Token._.ud_xpos` | Language-specific POS tag |
| `Token._.ud_feats` | Morphological features (dict) |
| `Token._.ud_head` | Head token index |
| `Token._.ud_deprel` | Dependency relation |
| `Token._.ud_deps` | Enhanced dependencies |
| `Token._.ud_misc` | Miscellaneous field (dict) |
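
To make the "(dict)" entries in the table concrete: in a CoNLL-U file, FEATS and MISC are written as `Key=Value` pairs joined by `|`, with `_` for an empty field. The following is a minimal sketch of that conversion in plain Python. `parse_feats` is a hypothetical helper for illustration, and whether `UDReader` represents an empty field as `{}` or `None` is an assumption not confirmed here.

```python
# Illustrative only: convert a raw FEATS/MISC string into a dict.
# `parse_feats` is a hypothetical helper, not a latincyreaders function.
def parse_feats(raw):
    """Turn 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    if raw is None or raw == "_":
        return {}  # assumption: represent an empty field as an empty dict
    feats = {}
    for pair in raw.split("|"):
        key, _, value = pair.partition("=")
        feats[key] = value
    return feats


print(parse_feats("Case=Nom|Gender=Fem|Number=Sing"))
# prints: {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
print(parse_feats("_"))
# prints: {}
```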

In [None]:
# Access UD extensions on tokens

doc = next(U_conllu.docs())

print("Token details from UD file:")
for token in doc[:5]:
    print(f"\n{token.text}:")
    print(f"  ud_id: {token._.ud_id}")
    print(f"  ud_lemma: {token._.ud_lemma}")
    print(f"  ud_upos: {token._.ud_upos}")
    print(f"  ud_xpos: {token._.ud_xpos}")
    print(f"  ud_feats: {token._.ud_feats}")

In [None]:
# Morphological features are stored as dictionaries

doc = next(U_conllu.docs())

print("Morphological analysis:")
for token in doc:
    if token._.ud_feats:  # If token has features
        print(f"\n{token.text} ({token._.ud_upos}):")
        for feat, value in token._.ud_feats.items():
            print(f"  {feat}: {value}")

In [None]:
# Dependency information (CONLLU only, not CONLLUP)

doc = next(U_conllu.docs())

print("Dependency structure:")
for token in doc[:8]:
    if token._.ud_head is not None:
        print(f"{token._.ud_id} {token.text} --{token._.ud_deprel}--> {token._.ud_head}")

In [None]:
# Compare CONLLU (with deps) vs CONLLUP (without deps)

doc_conllu = next(U_conllu.docs())
doc_conllup = next(U_conllup.docs())

print("CONLLU token (has dependencies):")
t = doc_conllu[0]
print(f"  {t.text}: head={t._.ud_head}, deprel={t._.ud_deprel}")

print("\nCONLLUP token (no dependencies):")
t = doc_conllup[0]
print(f"  {t.text}: head={t._.ud_head}, deprel={t._.ud_deprel}")

## Sentences

Each sentence keeps the `sent_id` from its CoNLL-U metadata as its citation, available as `sent._.citation`.
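
For reference, `sent_id` lives in the comment lines that precede each sentence block in a CoNLL-U file (`# sent_id = ...`). The sketch below pulls those IDs out of raw text with plain Python. The helper `iter_sent_ids` and the two-sentence sample are made up for illustration and say nothing about how `UDReader` reads them internally.

```python
# Illustrative only: collect sent_id values from raw CoNLL-U text.
# `iter_sent_ids` is a hypothetical helper, not a latincyreaders function.
def iter_sent_ids(conllu_text):
    """Yield the value of every '# sent_id = ...' comment line."""
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep and key.strip() == "sent_id":
                yield value.strip()


# Made-up two-sentence sample
sample = (
    "# sent_id = demo-1\n"
    "# text = Puella cantat.\n"
    "1\tPuella\tpuella\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tcantat\tcanto\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "# text = Puer currit.\n"
    "1\tPuer\tpuer\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tcurrit\tcurro\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
)

print(list(iter_sent_ids(sample)))
# prints: ['demo-1', 'demo-2']
```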

In [None]:
# Iterate over sentences

for sent in U_conllu.sentences():
    print(f"{sent._.citation}: {sent.text}")

In [None]:
# Sentences are stored in doc.spans["sentences"]

doc = next(U_conllu.docs())

print(f"Number of sentences: {len(doc.spans.get('sentences', []))}")
for sent in doc.spans.get("sentences", []):
    print(f"  {sent._.citation}: {sent.text[:50]}...")

## Raw token access with tokens_with_annotations()

When you need all UD fields and want to avoid the overhead of spaCy objects, use `tokens_with_annotations()`, which returns plain dictionaries.

In [None]:
# Get tokens as dictionaries with all UD fields

from itertools import islice

tokens = list(islice(U_conllu.tokens_with_annotations(), 5))

print("First 5 tokens as dicts:")
for tok in tokens:
    pprint(tok)
    print()

In [None]:
# Extract specific fields across the corpus

# Get all unique UPOS tags
upos_tags = set(
    tok["upos"] for tok in U.tokens_with_annotations()
    if tok["upos"]
)
print(f"UPOS tags in corpus: {sorted(upos_tags)}")

In [None]:
# Count morphological features

from collections import Counter

case_counts = Counter()
for tok in U.tokens_with_annotations():
    feats = tok["feats"] or {}  # guard against tokens without features
    if "Case" in feats:
        case_counts[feats["Case"]] += 1

print("Case distribution:")
for case, count in case_counts.most_common():
    print(f"  {case}: {count}")

## Standard token access

In [None]:
# Tokens - spaCy Token objects

tokens = list(islice(U_conllu.tokens(), 10))

for i, t in enumerate(tokens, 1):
    print(f"Token {i}: {t.text} (lemma: {t.lemma_}, pos: {t.pos_})")

In [None]:
# Token attributes come from the UD file

token = next(U_conllu.tokens())

print("Standard spaCy attributes (from UD file):")
print(f"  text: {token.text}")
print(f"  lemma_: {token.lemma_}")
print(f"  pos_: {token.pos_}")
print(f"  tag_: {token.tag_}")
print(f"  morph: {token.morph}")

## POS tag analysis

In [None]:
# POS distribution in the corpus

from collections import Counter

pos_counts = Counter(t.pos_ for t in U.tokens() if t.pos_)

print("POS tag distribution:")
for pos, count in pos_counts.most_common():
    print(f"  {pos}: {count}")

In [None]:
# Find all nouns with their cases

doc = next(U_conllu.docs())

print("Nouns and their cases:")
for token in doc:
    if token.pos_ == "NOUN" and token._.ud_feats:
        case = token._.ud_feats.get("Case", "?")
        print(f"  {token.text} ({token.lemma_}): {case}")

## Concordance and search

In [None]:
# Build a concordance by lemma

conc = U.concordance(basis="lemma")

print(f"Unique lemmas: {len(conc)}")
print("\nSample entries:")
for lemma in list(conc.keys())[:5]:
    print(f"  {lemma}: {conc[lemma]}")

In [None]:
# find_sents() works with UDReader too

# Search by exact form
for hit in U.find_sents(forms=["augur", "narrare"]):
    print(f"{hit['citation']}: {hit['sentence']}")
    print(f"  Matched: {hit['matches']}")
    print()

## Metadata

In [None]:
# Get metadata for a file

doc = next(U.docs())

print(f"File: {doc._.fileid}")
print(f"Metadata: {doc._.metadata}")

In [None]:
# Metadata includes format detection

for fileid, meta in U.metadata():
    print(f"{fileid}:")
    print(f"  format: {meta.get('format', 'unknown')}")
    print(f"  n_sentences: {meta.get('n_sentences', '?')}")

## Caching

In [None]:
# Documents are cached by default

# First access - cache miss
doc1 = next(U.docs())
print(f"After first access: {U.cache_stats()}")

# Second access - cache hit
doc2 = next(U.docs())
print(f"After second access: {U.cache_stats()}")

In [None]:
# Disable caching for memory-constrained environments

U_nocache = UDReader(root=UD_PATH, cache=False)
print(f"Caching enabled: {U_nocache.cache_enabled}")

## CONLLU vs CONLLUP format comparison

| Feature | CONLLU | CONLLUP (LASLA) |
|---------|--------|------------------|
| File extension | `.conllu` | `.conllup` |
| Dependency annotations | Yes | No |
| `Token._.ud_head` | Populated | `None` |
| `Token._.ud_deprel` | Populated | `None` |
| Lemma, POS, Morph | Yes | Yes |
| `has_dependencies()` | `True` | `False` |
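
One way to see the difference in the raw files: the CoNLL-U Plus format declares its columns in a first-line `# global.columns = ...` comment, so a file's dependency columns can be checked without parsing any tokens. The sketch below assumes the file carries such a header and falls back to the ten standard CoNLL-U columns otherwise. The helper names and the sample column set are hypothetical, and this is not a description of how `has_dependencies()` is implemented.

```python
# Illustrative only: inspect declared columns in raw CoNLL-U / CoNLL-U Plus text.
# Both helpers are hypothetical, not latincyreaders functions.
DEFAULT_COLUMNS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                   "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")
HEADER_PREFIX = "# global.columns"


def declared_columns(text):
    """Columns named in a leading '# global.columns = ...' line, else the default."""
    first_line = text.lstrip().split("\n", 1)[0]
    if first_line.startswith(HEADER_PREFIX):
        _, _, names = first_line.partition("=")
        return names.split()
    return list(DEFAULT_COLUMNS)


def declares_dependencies(text):
    """True if both HEAD and DEPREL are among the declared columns."""
    columns = declared_columns(text)
    return "HEAD" in columns and "DEPREL" in columns


# Made-up samples; the column set in the first one is an assumption
plus_text = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS MISC\n"
    "1\tpuella\tpuella\tNOUN\t_\tCase=Nom\t_\n"
)
plain_text = (
    "# sent_id = demo-1\n"
    "1\tpuella\tpuella\tNOUN\t_\tCase=Nom\t2\tnsubj\t_\t_\n"
)

print(declares_dependencies(plus_text))   # prints: False
print(declares_dependencies(plain_text))  # prints: True
```

This checks only which columns are declared, not whether their values are populated.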

## Summary

The `UDReader` is ideal when:

- You have pre-annotated corpora in CoNLL-U or CONLLUP format
- You want to preserve original treebank annotations
- You need access to full morphological analysis
- You're working with LASLA or other CONLLUP data

Use `use_file_annotations=False` when you want to compare treebank annotations with LatinCy's predictions.
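
As a rough sketch of such a comparison, the function below computes the share of tokens on which the treebank value and the predicted value agree. It is plain Python over `(treebank, predicted)` pairs; the name `agreement_rate` and the sample pairs are made up for illustration.

```python
# Illustrative only: agreement between treebank values and predictions.
# `agreement_rate` is a hypothetical helper, not a latincyreaders function.
def agreement_rate(pairs):
    """Share of (treebank, predicted) pairs that match.

    Pairs without a treebank value (None) are left out of the score.
    """
    scored = [(gold, pred) for gold, pred in pairs if gold is not None]
    if not scored:
        return 0.0
    matches = sum(1 for gold, pred in scored if gold == pred)
    return matches / len(scored)


# Made-up lemma pairs: (value from file, predicted value)
pairs = [("puella", "puella"), ("canto", "cano"), ("et", "et"), (None, "x")]
print(f"{agreement_rate(pairs):.2f}")
# prints: 0.67
```

With a reader created using `use_file_annotations=False`, the pairs for lemmas could be built as `[(t._.ud_lemma, t.lemma_) for t in doc]`, and likewise with `ud_upos` and `pos_` for POS tags.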