# Russian EDU Dependency Parsing
This notebook extracts Elementary Discourse Units (EDUs) from Russian `.rs3` files and performs syntactic dependency parsing using spaCy.

The dataset is based on the **Ru-RSTreebank**, a Russian corpus annotated according to Rhetorical Structure Theory.

## Step 1: Import libraries
We import libraries needed for XML parsing, file handling, and dependency analysis.

In [1]:
# Import required libraries
import spacy
import glob
import os
import xml.etree.ElementTree as ET

## Step 2: Load the spaCy language model
We use the small Russian language model `ru_core_news_sm` for tokenization and dependency parsing.

In [2]:
# Load Russian spaCy model
nlp_ru = spacy.load('ru_core_news_sm')
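
`spacy.load` raises an `OSError` when the requested model is not installed. Since spaCy models are distributed as ordinary Python packages, one way to check for the model beforehand is the minimal sketch below. The helper name `model_installed` is hypothetical, and the sketch assumes the model would be installed with `python -m spacy download ru_core_news_sm`.

```python
import importlib.util


def model_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check before calling spacy.load(...)
if not model_installed("ru_core_news_sm"):
    print("Model missing; try: python -m spacy download ru_core_news_sm")
```

The check only confirms that a package with that name is importable; it does not verify the model version.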

## Step 3: Load `.rs3` files
We recursively search the `../RuRsTreebank_full` folder (one level above the notebook) for `.rs3` files, which contain the Russian discourse-annotated texts.

In [4]:
# Find all .rs3 files from the Russian Treebank
rs3_files = glob.glob('../RuRsTreebank_full/**/*.rs3', recursive=True)
print(f'Found {len(rs3_files)} files.')

Found 333 files.
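
The order of `glob.glob` results depends on the file system, so `rs3_files[:5]` may select different files on different machines. A possible alternative, sketched here with `pathlib`, sorts the paths and counts files per top-level subfolder. The helper names are hypothetical, and the sketch assumes that each top-level subfolder of the corpus (such as `blogs`) groups related texts.

```python
from collections import Counter
from pathlib import Path


def find_rs3_files(root):
    """Return all .rs3 files under `root`, sorted for a reproducible order."""
    return sorted(Path(root).rglob("*.rs3"))


def count_by_folder(paths, root):
    """Count files per top-level subfolder of `root`."""
    counts = Counter(p.relative_to(root).parts[0] for p in paths)
    return dict(counts)


rs3_paths = find_rs3_files("../RuRsTreebank_full")
print(f"Found {len(rs3_paths)} files.")
print(count_by_folder(rs3_paths, "../RuRsTreebank_full"))
```

`Path` objects also print with a single, consistent separator style, unlike the mixed `/` and `\` seen in string paths on Windows.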


## Step 4: Extract and analyze EDUs
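From each `.rs3` file, we collect the text of every `<segment>` element (one EDU per segment), remove the `#####` marker, and skip empty segments. For the first three EDUs of the first five files, we parse the text with spaCy and print each token as `token → dependency label → head`.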
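
The extraction step can be tried in isolation on an inline document. In the notebook cell, `strip()` runs before `replace("#####", "")`, so a segment such as `##### text` keeps a leading space. The sketch below removes the marker first and strips afterwards. The sample XML and the helper names `clean_edu` and `extract_edus` are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature .rs3 document, for illustration only.
SAMPLE_RS3 = """<rst>
  <body>
    <segment id="1" parent="3" relname="span">##### First unit,</segment>
    <segment id="2" parent="1" relname="elaboration">second unit.</segment>
    <segment id="3" parent="4" relname="span">#####</segment>
    <group id="4" type="span"/>
  </body>
</rst>"""


def clean_edu(raw):
    """Remove the ##### marker, then trim surrounding whitespace."""
    if raw is None:
        return ""
    return raw.replace("#####", "").strip()


def extract_edus(root):
    """Return the non-empty, cleaned text of every <segment> element."""
    edus = []
    for segment in root.findall(".//segment"):
        text = clean_edu(segment.text)
        if text:
            edus.append(text)
    return edus


edus = extract_edus(ET.fromstring(SAMPLE_RS3))
print(edus)  # ['First unit,', 'second unit.']
```

For a file on disk, `extract_edus(ET.parse(rs3_path).getroot())` would be the equivalent call.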

In [7]:
# Extract EDUs and print their dependency structure
for rs3_path in rs3_files[:5]:  # limit to first 5 files for demo
    print(f"\n📂 File: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text:
            edus.append(edu_text)
    print(f"🔹 Total EDUs: {len(edus)}\n")

    for idx, edu in enumerate(edus[:3]):  # show first 3 EDUs only
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        for token in doc:
            print(f"  {token.text} → {token.dep_} → {token.head.text}")
        print('-' * 30)



📂 File: ../RuRsTreebank_full\blogs\dev\blogs_0.rs3
🔹 Total EDUs: 75

EDU 1:  https://kosmetista.ru/blog/otzivi/96908.html
    → dep →  
  https://kosmetista.ru/blog/otzivi/96908.html → ROOT → https://kosmetista.ru/blog/otzivi/96908.html
------------------------------
EDU 2:  Мои непомадные помады.
    → dep → Мои
  Мои → det → помады
  непомадные → amod → помады
  помады → ROOT → помады
  . → punct → помады
------------------------------
EDU 3: Relouis Velvet metallic 08 и Nyx Liquid Suede 02
  Relouis → ROOT → Relouis
  Velvet → flat:foreign → Relouis
  metallic → flat:foreign → Relouis
  08 → appos → Relouis
  и → cc → Nyx
  Nyx → conj → Relouis
  Liquid → flat:foreign → Nyx
  Suede → flat:foreign → Nyx
  02 → flat:foreign → Relouis
------------------------------

📂 File: ../RuRsTreebank_full\blogs\dev\blogs_14.rs3
🔹 Total EDUs: 216

EDU 1:  https://my-first-time.livejournal.com/570846.html
    → dep →  
  https://my-first-time.livejournal.com/570846.html → ROOT → https://my-first-t
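
In the output above, a leading space is parsed as its own token and receives the label `dep`. When relations are summarised across EDUs rather than printed, such tokens can be skipped through spaCy's `Token.is_space` attribute. The following is a minimal sketch with a hypothetical helper, `dep_label_counts`. The stand-in tokens at the bottom only mimic the attributes the function reads and are not real spaCy output.

```python
from collections import Counter
from types import SimpleNamespace


def dep_label_counts(docs):
    """Count dependency labels over parsed docs, ignoring whitespace tokens."""
    counts = Counter()
    for doc in docs:
        for token in doc:
            if token.is_space:
                continue  # whitespace tokens carry no useful relation
            counts[token.dep_] += 1
    return counts


# Stand-in for one parsed EDU (hypothetical, mirrors the printed labels).
fake_doc = [
    SimpleNamespace(text=" ", dep_="dep", is_space=True),
    SimpleNamespace(text="Мои", dep_="det", is_space=False),
    SimpleNamespace(text="непомадные", dep_="amod", is_space=False),
    SimpleNamespace(text="помады", dep_="ROOT", is_space=False),
    SimpleNamespace(text=".", dep_="punct", is_space=False),
]

print(dep_label_counts([fake_doc]).most_common())
```

With the objects from this notebook, the call would look like `dep_label_counts(nlp_ru.pipe(edus))`.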

## Step 5: Visualize dependency trees
We use `displacy.render` to draw the dependency trees of the first five EDUs of the first file inside Jupyter, skipping EDUs that contain a URL or an `IMG` placeholder.

In [None]:
# Dependency visualization with displacy (Jupyter only)
from spacy import displacy

for rs3_path in rs3_files[:1]:  # for a single file
    print(f"\n📊 Visualization for file: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text and "https://" not in edu_text and "IMG" not in edu_text: 
            edus.append(edu_text)

    for idx, edu in enumerate(edus[:5]):  
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        displacy.render(doc, style='dep', jupyter=True)



📊 Visualization for file: ../RuRsTreebank_full\blogs\dev\blogs_0.rs3
EDU 1:  Мои непомадные помады.


EDU 2: Relouis Velvet metallic 08 и Nyx Liquid Suede 02


EDU 3:  В этом посте я расскажу о помадах, которые я приобрела специально


EDU 4: для использования не по прямому назначению.


EDU 5:  IMG
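
With `jupyter=True`, the trees are only displayed inline. Outside a notebook, `displacy.render` called with `jupyter=False` returns the markup as a string, which can be written to disk. The sketch below shows one way to do that. The helper `save_svgs` and the file naming scheme are hypothetical, and the renderer is passed in as a callable so that the file-writing logic does not depend on a loaded model.

```python
from pathlib import Path


def save_svgs(docs, out_dir, render):
    """Write one SVG file per doc and return the list of written paths.

    `render` is any callable that turns a doc into SVG markup.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for idx, doc in enumerate(docs, start=1):
        path = out_dir / f"edu_{idx:03d}.svg"
        path.write_text(render(doc), encoding="utf-8")
        written.append(path)
    return written


# Intended use with the notebook's objects (not executed here):
# from spacy import displacy
# save_svgs(nlp_ru.pipe(edus[:5]), "dep_svgs",
#           lambda d: displacy.render(d, style="dep", jupyter=False))
```

Keeping the renderer as a parameter also makes it easy to switch to other display options without touching the file handling.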
