# Russian EDU Dependency Parsing
This notebook extracts Elementary Discourse Units (EDUs) from Russian `.rs3` files and performs syntactic dependency parsing using spaCy.

The dataset is based on the **Ru-RSTreebank**, a Russian corpus annotated according to Rhetorical Structure Theory.

## Step 1: Import libraries
We import libraries needed for XML parsing, file handling, and dependency analysis.

In [16]:
# Import required libraries
import spacy
import glob
import os
import xml.etree.ElementTree as ET

## Step 2: Install Russian spaCy model
Before loading the model, we need to ensure that the Russian language model is installed. This step only needs to be run once.

In [17]:
# Install Russian spaCy model if not already installed
import subprocess
import sys

try:
    import ru_core_news_sm
    print("Russian spaCy model 'ru_core_news_sm' is already installed")
except ImportError:
    print("Installing Russian spaCy model 'ru_core_news_sm'...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", 
        "https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.8.0/ru_core_news_sm-3.8.0-py3-none-any.whl"
    ])
    print("Russian spaCy model installed successfully!")

Russian spaCy model 'ru_core_news_sm' is already installed


## Step 3: Load the spaCy language model
We use the small Russian language model `ru_core_news_sm` for tokenization and dependency parsing.

In [18]:
# Load Russian spaCy model
nlp_ru = spacy.load('ru_core_news_sm')

## Step 4: Load `.rs3` files
We recursively search for `.rs3` files in the `RuRsTreebank_full` folder, which contains Russian discourse-annotated texts.

In [19]:
# Find all .rs3 files from the Russian Treebank
rs3_files = glob.glob('../RuRsTreebank_full/**/*.rs3', recursive=True)
print(f'Found {len(rs3_files)} files.')

Found 333 files.
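
To see how those files are distributed across the corpus, a per-folder count can help. The following is a minimal sketch, not part of the original notebook: the helper name `count_rs3_by_folder` is hypothetical, and it assumes that the top-level subfolders of `RuRsTreebank_full` (such as `blogs`, which appears in the file paths printed later) are a meaningful way to group the files.

```python
from collections import Counter
from pathlib import Path


def count_rs3_by_folder(corpus_root):
    """Count .rs3 files under corpus_root, keyed by top-level subfolder.

    Hypothetical helper for illustration; returns an empty Counter when
    the folder does not exist.
    """
    root = Path(corpus_root)
    counts = Counter()
    if not root.is_dir():
        return counts
    for path in root.rglob("*.rs3"):
        rel = path.relative_to(root)
        # Files that sit directly under the root are grouped under "."
        top = rel.parts[0] if len(rel.parts) > 1 else "."
        counts[top] += 1
    return counts


if __name__ == "__main__":
    # Same relative path as the glob call in the notebook
    for folder, n in sorted(count_rs3_by_folder("../RuRsTreebank_full").items()):
        print(f"{folder}: {n}")
```

The counts should sum to the total reported by `glob.glob(..., recursive=True)`, since both approaches walk the same tree.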


## Step 5: Extract and analyze EDUs
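From each `.rs3` file, we read the text of every `<segment>` element (one EDU each), remove the `#####` marker strings, and parse each EDU with spaCy to inspect its dependency relations. To keep the output short, the demo covers the first five files and prints only the first three EDUs of each.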

In [20]:
# Extract EDUs and print their dependency structure
for rs3_path in rs3_files[:5]:  # limit to first 5 files for demo
    print(f"\n📂 File: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text:
            edus.append(edu_text)
    print(f"🔹 Total EDUs: {len(edus)}\n")

    for idx, edu in enumerate(edus[:3]):  # show first 3 EDUs only
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        for token in doc:
            print(f"  {token.text} → {token.dep_} → {token.head.text}")
        print('-' * 30)



📂 File: ../RuRsTreebank_full/blogs/test/blogs_36.rs3
🔹 Total EDUs: 97

EDU 1:  https://ff-mag.livejournal.com/61921.html
    → dep →  
  https://ff-mag.livejournal.com/61921.html → ROOT → https://ff-mag.livejournal.com/61921.html
------------------------------
EDU 2:  Завтрак: обязательное условие для всех, кто хочет быть в форме, или уловка маркетологов?
    → dep → Завтрак
  Завтрак → nsubj → условие
  : → punct → условие
  обязательное → amod → условие
  условие → ROOT → условие
  для → case → всех
  всех → nmod → условие
  , → punct → хочет
  кто → nsubj → хочет
  хочет → acl:relcl → всех
  быть → cop → форме
  в → case → форме
  форме → obl → хочет
  , → punct → уловка
  или → cc → уловка
  уло
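
The first token printed for each EDU above is a bare whitespace token. That is consistent with the cell calling `strip()` before `replace("#####", "")`: if the marker sits at the start of the segment, the space that followed it survives. The sketch below shows one possible way to avoid this by removing the marker first and stripping afterwards. The function name `extract_edus` and the sample XML are invented for illustration (the sample only mimics the `<segment>` layout of an `.rs3` file and is not taken from the corpus).

```python
import xml.etree.ElementTree as ET

# Invented sample in the .rs3 layout; attribute values are placeholders.
SAMPLE_RS3 = """<rst>
  <header><relations/></header>
  <body>
    <segment id="1" parent="4" relname="span">##### Завтраки любят</segment>
    <segment id="2" parent="1" relname="joint">и ненавидят,</segment>
    <segment id="3"></segment>
    <group id="4" type="span"/>
  </body>
</rst>"""


def extract_edus(root, marker="#####"):
    """Return the non-empty EDU strings found in <segment> elements."""
    edus = []
    for segment in root.iter("segment"):
        # Remove the marker first, then strip, so no stray space is left over
        text = (segment.text or "").replace(marker, "").strip()
        if text:
            edus.append(text)
    return edus


edus = extract_edus(ET.fromstring(SAMPLE_RS3))
print(edus)
```

For a real file, `ET.parse(rs3_path).getroot()` can be passed in place of `ET.fromstring(SAMPLE_RS3)`.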

## Step 6: Visualize dependency trees
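We use `displacy.render` to draw the dependency trees of the first five EDUs from one file in Jupyter. EDUs that contain a URL or an `IMG` placeholder are skipped, since they have little syntactic structure to show.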

In [21]:
# Dependency visualization with displacy (Jupyter only)
from spacy import displacy

for rs3_path in rs3_files[:1]:  # for a single file
    print(f"\n📊 Visualization for file: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text and "https://" not in edu_text and "IMG" not in edu_text: 
            edus.append(edu_text)

    for idx, edu in enumerate(edus[:5]):  
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        displacy.render(doc, style='dep', jupyter=True)



📊 Visualization for file: ../RuRsTreebank_full/blogs/test/blogs_36.rs3
EDU 1:  Завтрак: обязательное условие для всех, кто хочет быть в форме, или уловка маркетологов?


EDU 2:  Завтраки любят


EDU 3: и ненавидят,


EDU 4: мечтают о них перед сном


EDU 5: и посвящают им длинные посты в инстаграме,
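
Outside Jupyter, the same trees can be written to disk: with `jupyter=False`, `displacy.render` returns the SVG markup as a string instead of displaying it. The following is a rough sketch under stated assumptions. The two-token parse is built by hand (the `obj` and `ROOT` labels are assigned for illustration and are not model output) so that the example runs without `ru_core_news_sm`, and the output file name `edu_1.svg` is hypothetical. In the notebook, `doc = nlp_ru(edu)` would take the place of the hand-built `Doc`.

```python
import tempfile
from pathlib import Path

from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built parse for illustration only; heads are absolute token indices.
words = ["Завтраки", "любят"]
doc = Doc(Vocab(), words=words, heads=[1, 1], deps=["obj", "ROOT"])

# jupyter=False makes render return the markup rather than display it
svg = displacy.render(doc, style="dep", jupyter=False)

# Hypothetical output location
out_path = Path(tempfile.gettempdir()) / "edu_1.svg"
out_path.write_text(svg, encoding="utf-8")
print(f"Wrote {len(svg)} characters to {out_path}")
```

Saving one SVG per EDU makes the visualizations usable in reports or scripts that run without a notebook front end.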
