# Russian EDU Dependency Parsing
This notebook extracts Elementary Discourse Units (EDUs) from Russian `.rs3` files and performs syntactic dependency parsing using spaCy.

The dataset is based on the **Ru-RSTreebank**, a Russian corpus annotated according to Rhetorical Structure Theory.

## Step 1: Import libraries
We import libraries needed for XML parsing, file handling, and dependency analysis.

In [1]:
# Import required libraries
import spacy
import glob
import os
import xml.etree.ElementTree as ET

## Step 2: Load the spaCy language model
We use the small Russian language model `ru_core_news_sm` for tokenization and dependency parsing.

In [2]:
# Load Russian spaCy model
nlp_ru = spacy.load('ru_core_news_sm')

## Step 3: Load `.rs3` files
We recursively search for `.rs3` files in the `RuRsTreebank_full` folder, which contains Russian discourse-annotated texts.

In [4]:
# Find all .rs3 files from the Russian Treebank
rs3_files = glob.glob('../RuRsTreebank_full/**/*.rs3', recursive=True)
print(f'Found {len(rs3_files)} files.')

Found 333 files.
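
To see how the files found above are spread across the corpus, they can be grouped by the folder directly under the treebank root. The sketch below is a minimal illustration, assuming a `<root>/<genre>/<split>/<file>.rs3` layout, as the `blogs\dev\blogs_0.rs3` paths printed later suggest. The helper names `genre_of` and `count_by_genre` and the `news` path are hypothetical and not part of the notebook.

```python
from collections import Counter
from pathlib import PurePosixPath


def genre_of(path, root_name="RuRsTreebank_full"):
    """Return the folder directly under root_name, or None if it cannot be found."""
    # glob on Windows can return mixed separators, so normalise them to '/'.
    parts = PurePosixPath(path.replace("\\", "/")).parts
    if root_name in parts:
        idx = parts.index(root_name)
        # The next part must be a folder, not the file name itself.
        if idx + 1 < len(parts) - 1:
            return parts[idx + 1]
    return None


def count_by_genre(paths):
    """Count how many paths fall under each top-level folder."""
    return Counter(genre_of(p) for p in paths)


sample_paths = [
    "../RuRsTreebank_full\\blogs\\dev\\blogs_0.rs3",
    "../RuRsTreebank_full\\blogs\\dev\\blogs_14.rs3",
    "../RuRsTreebank_full/news/train/news_1.rs3",  # hypothetical path, for illustration
]
genre_counts = count_by_genre(sample_paths)
print(genre_counts)
```

In the notebook itself, `count_by_genre(rs3_files)` would give the per-folder totals behind the file count.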


## Step 4: Extract and analyze EDUs
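From each `.rs3` file, we collect the text of every `<segment>` element (one EDU per segment), remove the `#####` marker, and parse each EDU with spaCy to inspect its dependency relations. To keep the demo short, only the first three EDUs of the first five files are printed.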

In [7]:
# Extract EDUs and print their dependency structure
for rs3_path in rs3_files[:5]:  # limit to first 5 files for demo
    print(f"\nüìÇ File: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text:
            edus.append(edu_text)
    print(f"üîπ Total EDUs: {len(edus)}\n")

    for idx, edu in enumerate(edus[:3]):  # show first 3 EDUs only
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        for token in doc:
            print(f"  {token.text} → {token.dep_} → {token.head.text}")
        print('-' * 30)



📂 File: ../RuRsTreebank_full\blogs\dev\blogs_0.rs3
🔹 Total EDUs: 75

EDU 1:  https://kosmetista.ru/blog/otzivi/96908.html
    → dep →  
  https://kosmetista.ru/blog/otzivi/96908.html → ROOT → https://kosmetista.ru/blog/otzivi/96908.html
------------------------------
EDU 2:  Мои непомадные помады.
    → dep → Мои
  Мои → det → помады
  непомадные → amod → помады
  помады → ROOT → помады
  . → punct → помады
------------------------------
EDU 3: Relouis Velvet metallic 08 и Nyx Liquid Suede 02
  Relouis → ROOT → Relouis
  Velvet → flat:foreign → Relouis
  metallic → flat:foreign → Relouis
  08 → appos → Relouis
  и → cc → Nyx
  Nyx → conj → Relouis
  Liquid → flat:foreign → Nyx
  Suede → flat:foreign → Nyx
  02 → flat:foreign → Relouis
------------------------------

📂 File: ../RuRsTreebank_full\blogs\dev\blogs_14.rs3
🔹 Total EDUs: 216

EDU 1:  
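
Some EDUs in the output above start with a whitespace token labelled `dep`. One plausible cause is that the extraction calls `strip()` before it removes `#####`, so the space that followed the marker survives. The sketch below shows a variant that removes the marker first and trims afterwards. It runs on a hypothetical miniature `.rs3` document, assuming the marker sits at the start of the segment text. The names `clean_edu`, `extract_edus`, and `SAMPLE_RS3` are illustrative only, and real files carry more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature .rs3 document, for illustration only.
SAMPLE_RS3 = """<rst>
  <body>
    <segment id="1" parent="3" relname="span">##### Мои непомадные помады.</segment>
    <segment id="2" parent="1" relname="elaboration">Relouis Velvet metallic 08</segment>
    <segment id="4">   </segment>
    <group id="3" type="span"/>
  </body>
</rst>"""


def clean_edu(raw):
    """Remove the ##### marker first, then trim, so no stray space survives."""
    if raw is None:
        return ""
    return raw.replace("#####", "").strip()


def extract_edus(root):
    """Return the cleaned, non-empty text of every <segment> element."""
    cleaned = (clean_edu(segment.text) for segment in root.iter("segment"))
    return [edu for edu in cleaned if edu]


sample_edus = extract_edus(ET.fromstring(SAMPLE_RS3))
print(sample_edus)
```

With this ordering, the parser no longer receives a leading space, so the extra whitespace token should not appear.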

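Printing every token is useful for a handful of EDUs, but a frequency table of dependency labels is easier to read across many EDUs. The sketch below counts the `dep_` attribute over any iterable of spaCy-like documents. To stay runnable without a model, it uses stand-in tokens that mirror the EDU 2 output above. The helper name `dep_label_counts` and the `fake_doc` stand-ins are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def dep_label_counts(docs):
    """Count dependency labels over an iterable of spaCy-like docs."""
    counts = Counter()
    for doc in docs:
        counts.update(token.dep_ for token in doc)
    return counts


# Stand-in tokens (not real spaCy objects) copied from the EDU 2 output.
fake_doc = [
    SimpleNamespace(text=text, dep_=dep)
    for text, dep in [
        ("Мои", "det"),
        ("непомадные", "amod"),
        ("помады", "ROOT"),
        (".", "punct"),
    ]
]
label_counts = dep_label_counts([fake_doc])
print(label_counts.most_common())
```

With the notebook's own objects, the call would be `dep_label_counts(nlp_ru.pipe(edus))`, where `nlp_ru.pipe` streams the EDUs through the loaded pipeline.
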
## Step 5: Visualize dependency trees
We use `displacy.render` to draw the dependency trees of the first five EDUs of the first file inline in Jupyter, skipping any EDU that contains a URL or the string `IMG`.

In [None]:
# Dependency visualization with displacy (Jupyter only)
from spacy import displacy

for rs3_path in rs3_files[:1]:  # first file only
    print(f"\n📊 Visualization for file: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text and "https://" not in edu_text and "IMG" not in edu_text: 
            edus.append(edu_text)

    for idx, edu in enumerate(edus[:5]):  
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        displacy.render(doc, style='dep', jupyter=True)



📊 Visualization for file: ../RuRsTreebank_full\blogs\dev\blogs_0.rs3
EDU 1:  Мои непомадные помады.


EDU 2: Relouis Velvet metallic 08 и Nyx Liquid Suede 02


EDU 3:  В этом посте я расскажу о помадах, которые я приобрела специально


EDU 4: для использования не по прямому назначению.


EDU 5:  IMG
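
The substring checks used in the filter above also drop any EDU that merely mentions a link or contains the letters `IMG`. A stricter variant, sketched below, rejects an EDU only when the whole text is a URL or is exactly `IMG`. It assumes `IMG` marks an image placeholder in the corpus. The name `is_linguistic` and the regular expression are illustrative choices, not part of the notebook.

```python
import re

# Matches a string that is a single http(s) URL and nothing else.
URL_ONLY = re.compile(r"https?://\S+")


def is_linguistic(edu):
    """Return True if the EDU looks like running text, not a bare URL or IMG."""
    text = edu.strip()
    if not text or text == "IMG":
        return False
    return URL_ONLY.fullmatch(text) is None


candidates = [
    " https://kosmetista.ru/blog/otzivi/96908.html",
    " Мои непомадные помады.",
    " IMG",
]
kept = [edu for edu in candidates if is_linguistic(edu)]
print(kept)
```

Whether to keep EDUs that mix text with a link is a judgment call. This version keeps them so that less of the corpus is discarded before parsing.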
