# Dependency Parsing for Russian EDUs (.rs3)

This notebook demonstrates how to:
- Extract Elementary Discourse Units (EDUs) from the Russian RST .rs3 files of a treebank directory
- Parse EDUs with spaCy's dependency parser and print each token's relation and head
- Visualize the dependency trees for the first 5 EDUs

All comments and code are in English. Adjust file paths as needed.


## 1. Import Libraries


In [6]:
import spacy
import glob
from spacy import displacy
import xml.etree.ElementTree as ET


## 2. Load the Russian spaCy Model


In [7]:
nlp_ru = spacy.load('ru_core_news_sm')


## 3. Extract EDUs from the Russian .rs3 File


In [10]:
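# Reuses nlp_ru, the Russian model loaded in the previous cell.
# Collect every .rs3 file under the treebank directory, including subfolders.
rs3_files = glob.glob('RuRsTreebank_full/**/*.rs3', recursive=True)

for rs3_path in rs3_files:
    print(f"Processing file: {rs3_path}")
    # ET.parse raises ParseError if the file is not well-formed XML.
    root = ET.parse(rs3_path).getroot()
    # Each <segment> element holds the text of one EDU; skip empty ones.
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = (segment.text or '').strip()
        if edu_text:
            edus.append(edu_text)
    print(f"Total EDUs in this file: {len(edus)}")
    # Print token, dependency label, and head for the first two EDUs only.
    for idx, edu in enumerate(edus[:2]):
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        for token in doc:
            print(token.text, token.dep_, token.head.text)
        print('-' * 20)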


Processing file: RuRsTreebank_full/blogs/test/blogs_36.rs3
Total EDUs in this file: 97
EDU 1: ##### https://ff-mag.livejournal.com/61921.html
# ROOT #
# appos #
# punct #
# punct #
# appos #
https://ff-mag.livejournal.com/61921.html appos #
--------------------
EDU 2: ##### Завтрак: обязательное условие для всех, кто хочет быть в форме, или уловка маркетологов?
# ROOT #
# appos #
# punct #
# punct #
# punct #
Завтрак appos #
: punct условие
обязательное amod условие
условие parataxis Завтрак
для case всех
всех nmod условие
, punct хочет
кто nsubj хочет
хочет acl:relcl всех
быть cop форме
в case форме
форме obl хочет
, punct уловка
или cc уловка
уловка conj условие
маркетологов nmod уловка
? punct Завтрак
--------------------
Processing file: RuRsTreebank_full/blogs/test/blogs_1.rs3
Total EDUs in this file: 102
EDU 1: ##### https://kosmetista.ru/blog/uhodovaya-kosmetika/97696.html
# ROOT #
# appos #
# punct #
# punct #
# appos #
https://kosmetista.ru/blog/uhodovaya-kosmetika/97696.html 

ParseError: not well-formed (invalid token): line 44, column 125 (<string>)
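
The ParseError above stops the loop at the first file that is not well-formed XML, so the remaining files are never processed. A minimal sketch of one way around this, assuming it is acceptable to skip such files and report them: the helper names `clean_edu` and `extract_edus` are hypothetical, and treating the leading `#####` run seen in the output as a layout marker rather than EDU content is also an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a leading run of '#' characters is a layout marker, not EDU text.
_MARKER = re.compile(r'^#+\s*')

def clean_edu(text):
    """Trim whitespace and drop a leading '#####'-style marker."""
    return _MARKER.sub('', (text or '').strip())

def extract_edus(xml_data):
    """Return the non-empty EDU strings of one .rs3 document.

    Returns None instead of raising if the document is not well-formed XML.
    """
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError as err:
        print(f"Skipping malformed document: {err}")
        return None
    edus = [clean_edu(segment.text) for segment in root.iter('segment')]
    return [edu for edu in edus if edu]

sample = ('<rst><body><segment id="1">##### Завтрак</segment>'
          '<segment id="2"> </segment></body></rst>')
print(extract_edus(sample))                                  # ['Завтрак']
print(extract_edus('<rst><segment>a & b</segment></rst>'))   # None (unescaped '&')
```

In the loop, this could be called as `extract_edus(pathlib.Path(rs3_path).read_bytes())`, continuing to the next file whenever the result is `None`. Passing bytes lets the parser honor the encoding declared in the file.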

## 4. Dependency Parsing and Visualization for the First 5 EDUs


In [11]:
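# Visualize dependency trees for the first 5 EDUs (in Jupyter only).
# Note: edus is overwritten on every loop iteration in the previous cell,
# so it holds only the EDUs of the last file that was read.
for idx, edu in enumerate(edus[:5]):
    print(f'EDU {idx+1}: {edu}')
    doc = nlp_ru(edu)
    displacy.render(doc, style='dep', jupyter=True)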


EDU 1: УДК 004.056


ImportError: cannot import name 'display' from 'IPython.core.display' (/Users/arturbegichev/miniconda3/lib/python3.13/site-packages/IPython/core/display.py)
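
The ImportError above is raised inside `displacy.render(..., jupyter=True)`: the installed spaCy version apparently imports `display` from `IPython.core.display`, which the installed IPython no longer provides. A possible workaround, sketched under that assumption, is to request the markup with `jupyter=False` and display it through `IPython.display` instead. The helper name `show_dep_tree` is hypothetical; upgrading spaCy may also resolve the error.

```python
def show_dep_tree(doc):
    """Display one parsed spaCy Doc as an inline dependency tree.

    Returns the SVG markup so it can also be saved to a file.
    """
    # Local imports: the helper can be defined even where spaCy or IPython is missing.
    from IPython.display import HTML, display
    from spacy import displacy

    # jupyter=False makes displacy return the SVG string instead of displaying it itself.
    svg = displacy.render(doc, style='dep', jupyter=False)
    display(HTML(svg))
    return svg

# Usage in the notebook (assumes nlp_ru and edus from the cells above):
# for idx, edu in enumerate(edus[:5]):
#     print(f'EDU {idx+1}: {edu}')
#     show_dep_tree(nlp_ru(edu))
```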