# Universal Dependencies Reader Demo

This notebook demonstrates `UDReader`, `LatinUDReader`, and `PROIELReader` for working with Latin Universal Dependencies treebanks.

**Key features:**
- Parses CoNLL-U format files
- Constructs spaCy Docs directly from gold-standard UD annotations
- Preserves all UD data in the `token._.ud` extension
- Exposes sentence spans with `sent_id` as citations
- Auto-downloads 6 Latin treebanks

In [1]:
## Imports

from latincyreaders import UDReader, LatinUDReader, PROIELReader
from pprint import pprint

## Available Latin Treebanks

There are 6 Latin UD treebanks available for auto-download.

In [2]:
# See all available Latin UD treebanks

treebanks = LatinUDReader.available_treebanks()
print("Available Latin UD Treebanks:")
print()
for name, description in treebanks.items():
    print(f"  {name:10} - {description}")

Available Latin UD Treebanks:

  proiel     - Vulgate, Caesar, Cicero, Palladius
  perseus    - Classical texts from Perseus Digital Library
  ittb       - Index Thomisticus (Thomas Aquinas)
  llct       - Late Latin Charter Treebank
  udante     - Dante's Latin works
  circse     - CIRCSE Latin treebank


In [3]:
# Download a specific treebank (PROIEL - contains Caesar, Cicero, Vulgate)
# This will prompt for confirmation if not already downloaded

reader = PROIELReader()

## File Discovery

In [4]:
# List available files

files = reader.fileids()
print(f"Total files: {len(files)}")
print()
pprint(files)

Total files: 3

['la_proiel-ud-dev.conllu',
 'la_proiel-ud-test.conllu',
 'la_proiel-ud-train.conllu']


In [5]:
# Filter by pattern (regex)

train_files = reader.fileids(match='train')
print("Training files:")
pprint(train_files)

Training files:
['la_proiel-ud-train.conllu']
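The `match='train'` call above selects a file whose name contains `train` in the middle, which is consistent with search-style regex matching. The sketch below mimics that behaviour under the assumption of `re.search` semantics; `filter_fileids` is a hypothetical stand-in written for this demo, and the reader's actual implementation may differ.

```python
import re


def filter_fileids(fileids, match=None):
    """Return the file ids whose name contains a match for the regex `match`."""
    if match is None:
        return list(fileids)
    pattern = re.compile(match)
    # search() matches anywhere in the name (assumed behaviour, see above)
    return [fileid for fileid in fileids if pattern.search(fileid)]


files = [
    "la_proiel-ud-dev.conllu",
    "la_proiel-ud-test.conllu",
    "la_proiel-ud-train.conllu",
]
train_only = filter_fileids(files, match="train")
dev_and_test = filter_fileids(files, match=r"(dev|test)\.conllu$")
print(train_only)
print(dev_and_test)
```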


## Working with Documents

Unlike other readers, `UDReader` constructs spaCy Docs directly from the gold-standard UD annotations. It does **not** run the spaCy NLP pipeline.

In [6]:
# Get the first document

doc = next(reader.docs())

print(f"Fileid: {doc._.fileid}")
print(f"Metadata: {doc._.metadata}")
print(f"Tokens: {len(doc)}")
print(f"Sentences: {len(doc.spans.get('ud_sents', []))}")

Fileid: la_proiel-ud-dev.conllu
Metadata: {'source': 'universal_dependencies'}
Tokens: 13917
Sentences: 1233


## Sentences with Citations

UD sentence boundaries are preserved in `doc.spans["ud_sents"]`. Each span has:
- `span._.citation` - the `sent_id` from the CoNLL-U file
- `span._.metadata` - includes the original `# text = ...` comment
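For orientation, CoNLL-U stores sentence-level metadata as `# key = value` comment lines ahead of the token rows. The sketch below parses such comments into a dict of the same shape as `span._.metadata`; `parse_sentence_comments` is a hypothetical helper for illustration, not part of `latincyreaders`, and the token row is a blank placeholder.

```python
def parse_sentence_comments(block: str) -> dict[str, str]:
    """Collect `# key = value` comment lines from one CoNLL-U sentence block."""
    metadata = {}
    for line in block.splitlines():
        if not line.startswith("#"):
            continue  # token rows are ignored here
        key, sep, value = line[1:].partition("=")
        if sep:  # skip comments that are not key/value pairs
            metadata[key.strip()] = value.strip()
    return metadata


sample = "\n".join([
    "# source = Jerome's Vulgate, Matthew 5",
    "# text = videns autem turbas ascendit in montem et cum sedisset "
    "accesserunt ad eum discipuli eius et aperiens os suum docebat eos dicens",
    "# sent_id = 12841",
    "1\tvidens\t_\t_\t_\t_\t_\t_\t_\t_",  # placeholder token row
])
meta = parse_sentence_comments(sample)
print(meta["sent_id"], "|", meta["source"])
```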

In [7]:
# Iterate sentences with citations

for sent in list(doc.spans["ud_sents"])[:10]:
    print(f"{sent._.citation}: {sent.text[:60]}...")

12841: videns autem turbas ascendit in montem et cum sedisset acces...
12842: beati pauperes spiritu quoniam ipsorum est regnum caelorum...
12844: beati mites quoniam ipsi possidebunt terram...
12846: beati qui lugent quoniam ipsi consolabuntur...
12848: beati qui esuriunt et sitiunt iustitiam quoniam ipsi saturab...
12850: beati misericordes quia ipsi misericordiam consequentur...
12852: beati mundo corde quoniam ipsi Deum videbunt...
12854: beati pacifici quoniam filii Dei vocabuntur...
12856: beati qui persecutionem patiuntur propter iustitiam quoniam ...
12858: beati estis cum maledixerint vobis et persecuti vos fuerint ...


In [8]:
# Access sentence metadata

sent = doc.spans["ud_sents"][0]
print(f"Citation: {sent._.citation}")
print(f"Metadata: {sent._.metadata}")

Citation: 12841
Metadata: {'text': 'videns autem turbas ascendit in montem et cum sedisset accesserunt ad eum discipuli eius et aperiens os suum docebat eos dicens', 'source': "Jerome's Vulgate, Matthew 5", 'sent_id': '12841'}


In [9]:
# Use ud_sents() for convenient iteration across files

from itertools import islice

for sent in islice(reader.ud_sents(), 5):
    print(f"{sent._.citation}: {sent.text[:70]}...")

12841: videns autem turbas ascendit in montem et cum sedisset accesserunt ad ...
12842: beati pauperes spiritu quoniam ipsorum est regnum caelorum...
12844: beati mites quoniam ipsi possidebunt terram...
12846: beati qui lugent quoniam ipsi consolabuntur...
12848: beati qui esuriunt et sitiunt iustitiam quoniam ipsi saturabuntur...


## UD Annotations (token._.ud)

All 10 CoNLL-U columns are preserved in `token._.ud`:
- `id`, `form`, `lemma`, `upos`, `xpos`
- `feats` (parsed dict), `head`, `deprel`, `deps`, `misc` (parsed dict)
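To make the column layout concrete, here is a minimal sketch that splits one tab-separated CoNLL-U token row into the ten named fields. `parse_token_line` and `parse_pairs` are hypothetical helpers, not the reader's own code, and the sample row is reconstructed from the `token._.ud` dict printed for `videns` in this notebook. Representing an empty `feats` or `misc` column as `{}` is an assumption of this sketch.

```python
COLUMNS = ("id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc")


def parse_pairs(value: str) -> dict:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' (empty) becomes {}."""
    pairs = {}
    if value != "_":
        for item in value.split("|"):
            key, _, val = item.partition("=")
            pairs[key] = val
    return pairs


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token row into its ten columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["id"] = int(row["id"])      # assumes a plain word ID, not a range like 1-2
    row["head"] = int(row["head"])  # 0 marks the sentence root
    row["feats"] = parse_pairs(row["feats"])
    row["misc"] = parse_pairs(row["misc"])
    row["deps"] = None if row["deps"] == "_" else row["deps"]
    return row


sample_line = "\t".join([
    "1", "videns", "video", "VERB", "V-",
    "Case=Nom|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act",
    "4", "advcl", "_", "Ref=MATT_5.1",
])
row = parse_token_line(sample_line)
print(row["form"], row["upos"], row["head"], row["feats"]["Case"])
```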

In [10]:
# Examine token UD annotations

token = doc[0]
print(f"Token: {token.text}")
print()
print("UD annotations (token._.ud):")
pprint(token._.ud)

Token: videns

UD annotations (token._.ud):
{'deprel': 'advcl',
 'deps': None,
 'feats': {'Case': 'Nom',
           'Gender': 'Neut',
           'Number': 'Sing',
           'Tense': 'Pres',
           'VerbForm': 'Part',
           'Voice': 'Act'},
 'form': 'videns',
 'head': 4,
 'id': 1,
 'lemma': 'video',
 'misc': {'Ref': 'MATT_5.1'},
 'upos': 'VERB',
 'xpos': 'V-'}


In [11]:
# Compare UD data with spaCy attributes
# Both are populated from the gold UD annotations

print(f"{'Token':<12} {'lemma_':<12} {'pos_':<8} {'dep_':<10} {'ud[feats]'}")
print("-" * 70)

for token in doc[:10]:
    feats = token._.ud.get('feats', {})
    feats_str = ', '.join(f"{k}={v}" for k, v in feats.items()) if feats else '-'
    print(f"{token.text:<12} {token.lemma_:<12} {token.pos_:<8} {token.dep_:<10} {feats_str}")

Token        lemma_       pos_     dep_       ud[feats]
----------------------------------------------------------------------
videns       video        VERB     advcl      Case=Nom, Gender=Neut, Number=Sing, Tense=Pres, VerbForm=Part, Voice=Act
autem        autem        ADV      discourse  -
turbas       turba        NOUN     obj        Case=Acc, Gender=Fem, Number=Plur
ascendit     ascendo      VERB     root       Mood=Ind, Number=Sing, Person=3, Tense=Pres, VerbForm=Fin, Voice=Act
in           in           ADP      case       -
montem       mons         NOUN     obl        Case=Acc, Gender=Masc, Number=Sing
et           et           CCONJ    cc         -
cum          cum          SCONJ    mark       -
sedisset     sedeo        VERB     advcl      Mood=Sub, Number=Sing, Person=3, Tense=Pqp, VerbForm=Fin, Voice=Act
accesserunt  accedo       VERB     conj       Aspect=Perf, Mood=Ind, Number=Plur, Person=3, Tense=Past, VerbForm=Fin, Voice=Act


In [12]:
# Access morphological features

print("Tokens with Case feature:")
for token in doc[:20]:
    feats = token._.ud.get('feats') or {}  # also covers a stored None
    if 'Case' in feats:
        print(f"  {token.text}: {feats['Case']}")

Tokens with Case feature:
  videns: Nom
  turbas: Acc
  montem: Acc
  eum: Acc
  discipuli: Nom
  eius: Gen
  aperiens: Nom
  os: Acc
  suum: Acc
  eos: Acc


## spaCy Integration

Standard spaCy attributes are populated from UD data, so you can use familiar spaCy patterns.

In [13]:
# Find all proper nouns (named-entity candidates)

propn_tokens = [t for t in doc if t.pos_ == "PROPN"]
print(f"Proper nouns in document: {len(propn_tokens)}")
print()
print("First 20:")
for t in propn_tokens[:20]:
    print(f"  {t.text} (token id: {t._.ud.get('id', '?')})")

Proper nouns in document: 449

First 20:
  Hierosolymam (token id: 28)
  Iesum (token id: 3)
  Iesu (token id: 5)
  Legio (token id: 1)
  Iesus (token id: 5)
  Iesum (token id: 4)
  Decapoli (token id: 7)
  Iesus (token id: 11)
  Iesus (token id: 4)
  Iairus (token id: 7)
  Iesu (token id: 32)
  Iesus (token id: 3)
  Iesus (token id: 1)
  Petrum (token id: 8)
  Iacobum (token id: 10)
  Iohannem (token id: 12)
  Iacobi (token id: 14)
  Gennesareth (token id: 18)
  Simonis (token id: 8)
  Simonem (token id: 7)


In [14]:
# Dependency structure is preserved

sent = doc.spans["ud_sents"][0]
print(f"Sentence: {sent.text}")
print()
print(f"{'Token':<12} {'Head':<12} {'Deprel':<10}")
print("-" * 35)
for token in sent:
    print(f"{token.text:<12} {token.head.text:<12} {token.dep_:<10}")

Sentence: videns autem turbas ascendit in montem et cum sedisset accesserunt ad eum discipuli eius et aperiens os suum docebat eos dicens

Token        Head         Deprel    
-----------------------------------
videns       ascendit     advcl     
autem        ascendit     discourse 
turbas       videns       obj       
ascendit     ascendit     root      
in           montem       case      
montem       ascendit     obl       
et           accesserunt  cc        
cum          sedisset     mark      
sedisset     accesserunt  advcl     
accesserunt  ascendit     conj      
ad           eum          case      
eum          accesserunt  obl       
discipuli    accesserunt  nsubj     
eius         discipuli    det       
et           docebat      cc        
aperiens     docebat      advcl     
os           aperiens     obj       
suum         os           det       
docebat      ascendit     conj      
eos          docebat      obj       
dicens       docebat      advcl     
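
In the table above, the root `ascendit` lists itself as its head, which is how spaCy represents a root. CoNLL-U instead marks the root with `HEAD = 0` and numbers tokens from 1 within each sentence. The sketch below shows one way to convert between the two conventions; `ud_heads_to_doc_indices` is a hypothetical helper, and the head values are read off the first six rows of the table.

```python
def ud_heads_to_doc_indices(heads, sentence_start=0):
    """Map 1-based, sentence-local UD HEAD values to 0-based document indices.

    A HEAD of 0 marks the root, which is mapped to the token's own position.
    `sentence_start` is the document index of the sentence's first token.
    """
    indices = []
    for position, head in enumerate(heads):
        if head == 0:
            indices.append(sentence_start + position)
        else:
            indices.append(sentence_start + head - 1)
    return indices


words = ["videns", "autem", "turbas", "ascendit", "in", "montem"]
ud_heads = [4, 4, 1, 0, 6, 4]  # HEAD column values for the six tokens
doc_heads = ud_heads_to_doc_indices(ud_heads)
for word, head_index in zip(words, doc_heads):
    print(f"{word:<12} {words[head_index]}")
```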


## LatinUDReader: All Treebanks at Once

Use `LatinUDReader` to access multiple treebanks through a single interface.
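
Conceptually, a unified reader can be thought of as chaining several per-treebank sentence streams into one lazy iterator. The sketch below illustrates that idea with plain strings; `tagged` and `unified_sents` are hypothetical names, the `perseus` entry is a placeholder rather than real treebank text, and nothing here claims to reflect how `LatinUDReader` is implemented.

```python
from itertools import chain, islice


def tagged(name, sentences):
    """Yield (treebank_name, sentence) pairs lazily."""
    for sentence in sentences:
        yield name, sentence


def unified_sents(sources):
    """Chain per-treebank iterables into one stream, in insertion order."""
    return chain.from_iterable(
        tagged(name, sentences) for name, sentences in sources.items()
    )


sources = {
    "proiel": [
        "beati mites quoniam ipsi possidebunt terram",
        "beati qui lugent quoniam ipsi consolabuntur",
    ],
    "perseus": ["<placeholder sentence>"],
}
for name, sentence in islice(unified_sents(sources), 3):
    print(f"[{name}] {sentence}")
```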

In [15]:
# Create a reader for specific treebanks
# (commented out in this demo; set auto_download=False to skip download prompts)

# unified = LatinUDReader(treebanks=["proiel", "perseus"])
# for sent in islice(unified.ud_sents(), 10):
#     print(f"{sent._.citation}: {sent.text[:60]}...")

In [16]:
# Download all treebanks at once (run manually when ready)

# LatinUDReader.download_all()

## Use Case: Bootstrapping NER Datasets

The UD reader offers several things that help bootstrap NER annotation projects:
1. Gold-standard tokenization and sentence boundaries
2. PROPN tags as a **heuristic** for finding candidate sentences (not ground truth!)
3. Morphological features may help with entity classification
4. Sentence citations provide traceability back to source

**Note:** PROPN ≠ named entity. This is a starting point for finding sentences worth annotating, not a labeled dataset.

In [17]:
# Find sentences containing proper nouns (candidates for annotation)
# PROPN is a heuristic - these need human review!

ner_candidates = []

for sent in reader.ud_sents():
    propns = [t for t in sent if t.pos_ == "PROPN"]
    if propns:
        ner_candidates.append({
            'citation': sent._.citation,
            'text': sent.text,
            'propn_hints': [t.text for t in propns],  # hints, not labels
        })

print(f"Sentences with PROPN tokens (candidates for annotation): {len(ner_candidates)}")

Sentences with PROPN tokens (candidates for annotation): 4518


In [18]:
# Preview candidates for annotation

for item in ner_candidates[:10]:
    print(f"{item['citation']}")
    print(f"  Text: {item['text'][:70]}...")
    print(f"  PROPN hints: {item['propn_hints']}")
    print()

12906
  Text: ego autem dico vobis non iurare omnino ne que per caelum quia thronus ...
  PROPN hints: ['Hierosolymam']

10555
  Text: videns autem Iesum a longe cucurrit et adoravit eum...
  PROPN hints: ['Iesum']

10557
  Text: quid mihi et tibi Iesu Fili Dei summi...
  PROPN hints: ['Iesu']

10564
  Text: Legio nomen mihi est quia multi sumus...
  PROPN hints: ['Legio']

10569
  Text: et concessit eis statim Iesus...
  PROPN hints: ['Iesus']

10574
  Text: et veniunt ad Iesum...
  PROPN hints: ['Iesum']

10580
  Text: et abiit et coepit praedicare in Decapoli quanta sibi fecisset Iesus...
  PROPN hints: ['Decapoli', 'Iesus']

10582
  Text: et cum transcendisset Iesus in navi rursus trans fretum convenit turba...
  PROPN hints: ['Iesus']

10583
  Text: et venit quidam de archisynagogis nomine Iairus et videns eum procidit...
  PROPN hints: ['Iairus']

10586
  Text: et mulier quae erat in profluvio sanguinis annis duodecim et fuerat mu...
  PROPN hints: ['Iesu']



In [19]:
# Export candidates for annotation (e.g., to JSONL for Label Studio, Prodigy, etc.)

import json

# Sample export
for item in ner_candidates[:3]:
    print(json.dumps(item, ensure_ascii=False))

{"citation": "12906", "text": "ego autem dico vobis non iurare omnino ne que per caelum quia thronus Dei est ne que per terram quia scabillum est pedum eius ne que per Hierosolymam quia civitas est magni Regis ne que per caput tuum iuraveris quia non potes unum capillum album facere aut nigrum", "propn_hints": ["Hierosolymam"]}
{"citation": "10555", "text": "videns autem Iesum a longe cucurrit et adoravit eum", "propn_hints": ["Iesum"]}
{"citation": "10557", "text": "quid mihi et tibi Iesu Fili Dei summi", "propn_hints": ["Iesu"]}
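
Annotation tools generally work with character offsets rather than bare strings, so it can help to attach suggested spans to each record before export. The sketch below locates each hint in the sentence text by whole-word string matching. `hint_spans` and the `PROPN_HINT` label are hypothetical, and the `start`/`end`/`label` layout is a generic shape rather than any specific tool's required schema.

```python
import json
import re


def hint_spans(text, hints, label="PROPN_HINT"):
    """Return character-offset spans for every whole-word occurrence of a hint."""
    spans = []
    for hint in dict.fromkeys(hints):  # de-duplicate while keeping order
        for match in re.finditer(rf"\b{re.escape(hint)}\b", text):
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    return sorted(spans, key=lambda span: span["start"])


record = {
    "citation": "10555",
    "text": "videns autem Iesum a longe cucurrit et adoravit eum",
    "propn_hints": ["Iesum"],
}
record["spans"] = hint_spans(record["text"], record["propn_hints"])
print(json.dumps(record, ensure_ascii=False))
```

String matching will also catch any identical word form that was not tagged PROPN. When the spaCy `Doc` is still at hand, computing offsets from `token.idx` relative to `sent.start_char` is likely the more precise route.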


## Raw Text Access

Use `texts()` for raw strings with zero NLP overhead.
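
Raw text access of this kind is cheap because only the `# text = ...` comment lines need to be read and the token rows can be skipped entirely. The generator below is a minimal sketch of that idea over any iterable of lines; `iter_texts` is a hypothetical name, not the reader's implementation, and the token row is a blank placeholder.

```python
import io


def iter_texts(lines):
    """Yield the sentence text from each `# text = ...` comment line."""
    prefix = "# text = "
    for line in lines:
        if line.startswith(prefix):
            yield line[len(prefix):].rstrip("\n")


sample = io.StringIO(
    "# sent_id = 12842\n"
    "# text = beati pauperes spiritu quoniam ipsorum est regnum caelorum\n"
    "1\tbeati\t_\t_\t_\t_\t_\t_\t_\t_\n"  # placeholder token row, never parsed
    "\n"
    "# sent_id = 12844\n"
    "# text = beati mites quoniam ipsi possidebunt terram\n"
)
texts = list(iter_texts(sample))
for text in texts:
    print(text)
```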

In [20]:
# Raw text iteration (reads from # text = comments)

for text in islice(reader.texts(), 5):
    print(text)

videns autem turbas ascendit in montem et cum sedisset accesserunt ad eum discipuli eius et aperiens os suum docebat eos dicens
beati pauperes spiritu quoniam ipsorum est regnum caelorum
beati mites quoniam ipsi possidebunt terram
beati qui lugent quoniam ipsi consolabuntur
beati qui esuriunt et sitiunt iustitiam quoniam ipsi saturabuntur


In [21]:
# Sentences as strings

for text in islice(reader.sents(as_text=True), 5):
    print(text)

videns autem turbas ascendit in montem et cum sedisset accesserunt ad eum discipuli eius et aperiens os suum docebat eos dicens
beati pauperes spiritu quoniam ipsorum est regnum caelorum
beati mites quoniam ipsi possidebunt terram
beati qui lugent quoniam ipsi consolabuntur
beati qui esuriunt et sitiunt iustitiam quoniam ipsi saturabuntur


## Corpus Statistics

In [22]:
# Basic stats for a treebank

total_sents = 0
total_tokens = 0

for doc in reader.docs():
    total_sents += len(doc.spans.get('ud_sents', []))
    total_tokens += len(doc)

print("PROIEL Treebank Statistics:")
print(f"  Files: {len(reader.fileids())}")
print(f"  Sentences: {total_sents:,}")
print(f"  Tokens: {total_tokens:,}")

PROIEL Treebank Statistics:
  Files: 3
  Sentences: 18,689
  Tokens: 205,566


In [23]:
# POS tag distribution

from collections import Counter

pos_counts = Counter()
for doc in reader.docs():
    for token in doc:
        pos_counts[token.pos_] += 1

print("POS Tag Distribution:")
for pos, count in pos_counts.most_common(15):
    print(f"  {pos:<8} {count:>8,}")

POS Tag Distribution:
  VERB       41,826
  NOUN       41,501
  PRON       25,411
  ADV        21,952
  ADP        16,020
  CCONJ      15,030
  ADJ        11,475
  AUX         8,101
  DET         7,701
  PROPN       7,273
  SCONJ       6,914
  NUM         1,712
  INTJ          547
  X             103
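
The raw counts above sum to the 205,566 tokens reported in the treebank statistics, so they can be converted directly into relative frequencies. The sketch below copies the printed counts into a plain dict so it runs without the treebank. With the reader loaded, the `pos_counts` Counter from the previous cell could be used instead.

```python
# Counts copied from the POS tag distribution printed above
pos_counts = {
    "VERB": 41826, "NOUN": 41501, "PRON": 25411, "ADV": 21952,
    "ADP": 16020, "CCONJ": 15030, "ADJ": 11475, "AUX": 8101,
    "DET": 7701, "PROPN": 7273, "SCONJ": 6914, "NUM": 1712,
    "INTJ": 547, "X": 103,
}

total = sum(pos_counts.values())
shares = {pos: count / total for pos, count in pos_counts.items()}

print(f"Total tokens: {total:,}")
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for pos, share in ranked[:5]:
    print(f"  {pos:<8} {share:6.1%}")
```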
