# Russian EDU Dependency Parsing
This notebook extracts Elementary Discourse Units (EDUs) from Russian `.rs3` files and performs syntactic dependency parsing using spaCy.

The dataset is based on the **Ru-RSTreebank**, a Russian corpus annotated according to Rhetorical Structure Theory.

## Step 1: Import libraries
We import libraries needed for XML parsing, file handling, and dependency analysis.

In [16]:
# Import required libraries
import spacy
import glob
import os
import xml.etree.ElementTree as ET

## Step 1.5: Create output directory
We create a directory to store all analysis results including dependency structures and visualizations.

In [22]:
# Create output directory for results
import datetime
import json

# Create timestamped output directory
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"../results/russian_edu_analysis_{timestamp}"
os.makedirs(output_dir, exist_ok=True)

# Create subdirectories
dependency_dir = os.path.join(output_dir, "dependency_structures")
visualization_dir = os.path.join(output_dir, "visualizations")
summary_dir = os.path.join(output_dir, "summaries")

os.makedirs(dependency_dir, exist_ok=True)
os.makedirs(visualization_dir, exist_ok=True)
os.makedirs(summary_dir, exist_ok=True)

print(f"📁 Created output directory: {output_dir}")
print("📁 Subdirectories created:")
print(f"  - {dependency_dir}")
print(f"  - {visualization_dir}")
print(f"  - {summary_dir}")

📁 Created output directory: ../results/russian_edu_analysis_20250816_194212
📁 Subdirectories created:
  - ../results/russian_edu_analysis_20250816_194212/dependency_structures
  - ../results/russian_edu_analysis_20250816_194212/visualizations
  - ../results/russian_edu_analysis_20250816_194212/summaries


## Step 2: Install Russian spaCy model
Before loading the model, we need to ensure that the Russian language model is installed. This step only needs to be run once.

In [17]:
# Install Russian spaCy model if not already installed
import subprocess
import sys

try:
    import ru_core_news_sm
    print("Russian spaCy model 'ru_core_news_sm' is already installed")
except ImportError:
    print("Installing Russian spaCy model 'ru_core_news_sm'...")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", 
        "https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.8.0/ru_core_news_sm-3.8.0-py3-none-any.whl"
    ])
    print("Russian spaCy model installed successfully!")

Russian spaCy model 'ru_core_news_sm' is already installed
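
The installation check can also be done without importing the model package, using the standard library's `importlib.util.find_spec`. This is only a sketch: `is_installed` is a hypothetical helper name, and the suggested command is spaCy's `python -m spacy download` CLI, offered here as an assumed alternative to installing the wheel by URL.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


MODEL = "ru_core_news_sm"

if is_installed(MODEL):
    print(f"{MODEL} is installed")
else:
    # Assumed alternative to the pinned wheel URL: spaCy's download command.
    print(f"{MODEL} is missing; try: {sys.executable} -m spacy download {MODEL}")
```

Either branch only prints a message, so the sketch is safe to run in an environment without spaCy.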


## Step 3: Load the spaCy language model
We use the small Russian language model `ru_core_news_sm` for tokenization and dependency parsing.

In [18]:
# Load Russian spaCy model
nlp_ru = spacy.load('ru_core_news_sm')

## Step 4: Load `.rs3` files
We recursively search for `.rs3` files in the `RuRsTreebank_full` folder, which contains Russian discourse-annotated texts.

In [19]:
# Find all .rs3 files from the Russian Treebank
rs3_files = glob.glob('../RuRsTreebank_full/**/*.rs3', recursive=True)
print(f'Found {len(rs3_files)} files.')

Found 333 files.
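
`glob.glob` does not guarantee a particular order, so `rs3_files[:5]` may select different files on different machines. Below is a minimal sketch of a deterministic variant, assuming sorted path order is acceptable. `find_rs3_files` is a hypothetical helper, and the temporary directory only stands in for the corpus folder.

```python
import tempfile
from pathlib import Path


def find_rs3_files(corpus_root):
    """Recursively collect .rs3 paths under corpus_root, in sorted order."""
    return sorted(str(p) for p in Path(corpus_root).rglob("*.rs3"))


# Build a tiny stand-in corpus to show the behaviour.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("news/b.rs3", "blogs/a.rs3", "blogs/readme.txt"):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    found = find_rs3_files(tmp)
    relative = [Path(p).relative_to(tmp).as_posix() for p in found]
    print(relative)
```

With the real corpus the call would be `find_rs3_files('../RuRsTreebank_full')`.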


## Step 5: Extract and analyze EDUs
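From each `.rs3` file, we extract the text of every `<segment>` element (one EDU per segment), remove the `#####` marker, and parse the EDU with spaCy to inspect its dependency relations. To keep the demo short, only the first five files and the first three EDUs of each file are parsed.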

In [23]:
# Extract EDUs and save their dependency structure
all_results = []
file_counter = 0

for rs3_path in rs3_files[:5]:  # limit to first 5 files for demo
    file_counter += 1
    print(f"\n📂 File {file_counter}: {rs3_path}")
    
    # Parse XML file
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text:
            edus.append(edu_text)
    
    print(f"🔹 Total EDUs: {len(edus)}\n")
    
    # Prepare file-specific results
    file_results = {
        "file_path": rs3_path,
        "total_edus": len(edus),
        "edus_analysis": []
    }
    
    # Analyze first 3 EDUs
    for idx, edu in enumerate(edus[:3]):
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        
        # Collect dependency information
        edu_analysis = {
            "edu_id": idx + 1,
            "text": edu,
            "tokens": [],
            "dependencies": []
        }
        
        for token in doc:
            print(f"  {token.text} → {token.dep_} → {token.head.text}")
            
            # Store token and dependency info
            edu_analysis["tokens"].append({
                "text": token.text,
                "lemma": token.lemma_,
                "pos": token.pos_,
                "tag": token.tag_,
                "dep": token.dep_,
                "head": token.head.text,
                "head_pos": token.head.pos_
            })
            
            edu_analysis["dependencies"].append({
                "token": token.text,
                "relation": token.dep_,
                "head": token.head.text
            })
        
        file_results["edus_analysis"].append(edu_analysis)
        print('-' * 30)
    
    # Save individual file results
    filename = os.path.basename(rs3_path).replace('.rs3', '')
    output_file = os.path.join(dependency_dir, f"{filename}_dependency_analysis.json")
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(file_results, f, ensure_ascii=False, indent=2)
    
    print(f"💾 Saved dependency analysis to: {output_file}")
    all_results.append(file_results)

# Save combined results
combined_file = os.path.join(summary_dir, "combined_dependency_analysis.json")
with open(combined_file, 'w', encoding='utf-8') as f:
    json.dump(all_results, f, ensure_ascii=False, indent=2)

print(f"\n💾 Saved combined results to: {combined_file}")
print(f"📊 Total files processed: {len(all_results)}")


📂 File 1: ../RuRsTreebank_full/blogs/test/blogs_36.rs3
🔹 Total EDUs: 97

EDU 1:  https://ff-mag.livejournal.com/61921.html
    → dep →  
  https://ff-mag.livejournal.com/61921.html → ROOT → https://ff-mag.livejournal.com/61921.html
------------------------------
EDU 2:  Завтрак: обязательное условие для всех, кто хочет быть в форме, или уловка маркетологов?
    → dep → Завтрак
  Завтрак → nsubj → условие
  : → punct → условие
  обязательное → amod → условие
  условие → ROOT → условие
  для → case → всех
  всех → nmod → условие
  , → punct → хочет
  кто → nsubj → хочет
  хочет → acl:relcl → всех
  быть → cop → форме
  в → case → форме
  форме → obl → хочет
  , → punct → уловка
  или → cc → уловка
  уловка → conj → условие
  маркетологов → nmod → уловка
  ? → punct → условие
------------------------------
EDU 3:  Завтраки любят
    → dep → Завтраки
  Завтраки → nsubj → любят
  любят → ROOT → любят
------------------------------
💾 Saved dependency analysis to: ../results/russian_edu_anal
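
In the output above, every parsed EDU begins with a whitespace token labelled `dep`, and the summary in Step 7 reports 15 `SPACE` tokens for the 15 analyzed EDUs. This is consistent with `strip()` running before the `#####` marker is removed, which leaves behind the space that followed the marker. A possible fix is to remove the marker first and then normalize whitespace. The sketch below uses hypothetical helpers `clean_edu` and `extract_edus` and an invented sample document.

```python
import re
import xml.etree.ElementTree as ET

MARKER = "#####"  # marker string that the notebook removes from segment text


def clean_edu(raw_text):
    """Drop the marker, collapse runs of whitespace, and trim both ends."""
    if raw_text is None:
        return ""
    without_marker = raw_text.replace(MARKER, " ")
    return re.sub(r"\s+", " ", without_marker).strip()


def extract_edus(root):
    """Return the cleaned, non-empty text of every <segment> element."""
    edus = []
    for segment in root.iter("segment"):
        text = clean_edu(segment.text)
        if text:
            edus.append(text)
    return edus


# Invented sample in the shape of an .rs3 body (not taken from the corpus files).
sample = """<rst><body>
<segment id="1">##### Завтраки любят</segment>
<segment id="2">и ненавидят,</segment>
<segment id="3">   </segment>
</body></rst>"""

edus = extract_edus(ET.fromstring(sample))
print(edus)
```

With the leading space gone, spaCy should no longer emit the extra whitespace token at the start of each EDU, which would also change the `dep` and `SPACE` counts in the summary.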

## Step 6: Visualize dependency trees
We use `displacy.render` to draw dependency trees for the first five EDUs of one file, both inline in Jupyter and as standalone HTML files. Segments containing `https://` or `IMG` are skipped.

In [24]:
# Dependency visualization with displacy (Jupyter only) and save to files
from spacy import displacy

visualization_counter = 0

for rs3_path in rs3_files[:1]:  # for a single file
    print(f"\n📊 Visualization for file: {rs3_path}")
    root = ET.parse(rs3_path).getroot()
    edus = []
    for segment in root.findall('.//segment'):
        edu_text = segment.text.strip().replace("#####", "") if segment.text else ''
        if edu_text and "https://" not in edu_text and "IMG" not in edu_text: 
            edus.append(edu_text)

    # Prepare HTML file for all visualizations from this file
    filename = os.path.basename(rs3_path).replace('.rs3', '')
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Dependency Visualization - {filename}</title>
        <meta charset="utf-8">
        <style>
            body {{ font-family: Arial, sans-serif; margin: 20px; }}
            .edu-section {{ margin: 30px 0; padding: 20px; border: 1px solid #ccc; }}
            .edu-header {{ font-weight: bold; color: #333; margin-bottom: 10px; }}
            .edu-text {{ background-color: #f5f5f5; padding: 10px; margin: 10px 0; }}
        </style>
    </head>
    <body>
        <h1>Dependency Visualization for {filename}</h1>
        <p>Generated on: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
    """

    for idx, edu in enumerate(edus[:5]):  
        visualization_counter += 1
        print(f"EDU {idx+1}: {edu}")
        doc = nlp_ru(edu)
        
        # Display in Jupyter
        displacy.render(doc, style='dep', jupyter=True)
        
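        # Generate HTML for saving; jupyter=False makes render() return the
        # markup instead of displaying it (and returning None) inside a notebook
        html_viz = displacy.render(doc, style='dep', page=False, jupyter=False)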
        
        html_content += f"""
        <div class="edu-section">
            <div class="edu-header">EDU {idx+1}:</div>
            <div class="edu-text">{edu}</div>
            <div class="visualization">
                {html_viz}
            </div>
        </div>
        """
        
        # Save individual EDU visualization
        edu_html_file = os.path.join(visualization_dir, f"{filename}_edu_{idx+1}_dependency.html")
        individual_html = f"""
        <!DOCTYPE html>
        <html>
        <head>
            <title>EDU {idx+1} Dependency - {filename}</title>
            <meta charset="utf-8">
        </head>
        <body>
            <h2>EDU {idx+1} from {filename}</h2>
            <p><strong>Text:</strong> {edu}</p>
            {html_viz}
        </body>
        </html>
        """
        
        with open(edu_html_file, 'w', encoding='utf-8') as f:
            f.write(individual_html)
        
        print(f"💾 Saved individual visualization: {edu_html_file}")
    
    # Close and save combined HTML file
    html_content += """
    </body>
    </html>
    """
    
    combined_html_file = os.path.join(visualization_dir, f"{filename}_all_dependencies.html")
    with open(combined_html_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    
    print(f"💾 Saved combined visualization: {combined_html_file}")

print(f"\n🎯 Total visualizations created: {visualization_counter}")
print(f"📁 All visualizations saved in: {visualization_dir}")


📊 Visualization for file: ../RuRsTreebank_full/blogs/test/blogs_36.rs3
EDU 1:  Завтрак: обязательное условие для всех, кто хочет быть в форме, или уловка маркетологов?


💾 Saved individual visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_edu_1_dependency.html
EDU 2:  Завтраки любят


💾 Saved individual visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_edu_2_dependency.html
EDU 3: и ненавидят,


💾 Saved individual visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_edu_3_dependency.html
EDU 4: мечтают о них перед сном


💾 Saved individual visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_edu_4_dependency.html
EDU 5: и посвящают им длинные посты в инстаграме,


💾 Saved individual visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_edu_5_dependency.html
💾 Saved combined visualization: ../results/russian_edu_analysis_20250816_194212/visualizations/blogs_36_all_dependencies.html

🎯 Total visualizations created: 5
📁 All visualizations saved in: ../results/russian_edu_analysis_20250816_194212/visualizations
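
The visualization cell interpolates each EDU's text directly into HTML. Characters such as `<` or `&` in an EDU would then be read as markup. A minimal sketch of escaping the text first with the standard library's `html.escape` follows. `edu_section` is a hypothetical helper, and the `<svg>` string is a placeholder for the markup returned by displaCy.

```python
import html


def edu_section(index, edu_text, svg_markup):
    """Build one HTML section; the EDU text is escaped, the SVG is inserted as-is."""
    safe_text = html.escape(edu_text)
    return (
        '<div class="edu-section">\n'
        f'  <div class="edu-header">EDU {index}:</div>\n'
        f'  <div class="edu-text">{safe_text}</div>\n'
        f'  <div class="visualization">{svg_markup}</div>\n'
        '</div>'
    )


section = edu_section(1, "a < b & c", "<svg></svg>")
print(section)
```

Only the text is escaped. The SVG has to stay unescaped so that the browser renders it.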


## Step 7: Results Summary
We aggregate token, POS-tag, and dependency-relation counts across the analyzed EDUs and save the summary as both JSON and plain text in the `summaries` directory.
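
The counting loops in this step can also be written with `collections.Counter`, whose `most_common` method returns entries sorted by frequency. The sketch below assumes the result structure built in Step 5. `count_patterns` is a hypothetical helper and `sample_results` is invented data. It reads the relation from each token record (`dep`), which holds the same value as the `relation` field of the parallel `dependencies` list.

```python
from collections import Counter


def count_patterns(all_results):
    """Count dependency relations and POS tags over all analyzed EDUs."""
    dep_counts = Counter()
    pos_counts = Counter()
    for file_result in all_results:
        for edu in file_result["edus_analysis"]:
            for token in edu["tokens"]:
                dep_counts[token["dep"]] += 1
                pos_counts[token["pos"]] += 1
    return dep_counts, pos_counts


# Invented sample with the same shape as one entry of all_results.
sample_results = [{
    "file_path": "example.rs3",
    "total_edus": 1,
    "edus_analysis": [{
        "edu_id": 1,
        "text": "Завтраки любят",
        "tokens": [
            {"text": "Завтраки", "pos": "NOUN", "dep": "nsubj"},
            {"text": "любят", "pos": "VERB", "dep": "ROOT"},
        ],
    }],
}]

dep_counts, pos_counts = count_patterns(sample_results)
print(dep_counts.most_common(10))
print(pos_counts.most_common(10))
```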

In [25]:
# Generate comprehensive summary report
summary_report = {
    "analysis_metadata": {
        "timestamp": timestamp,
        "output_directory": output_dir,
        "total_files_processed": len(all_results),
        "spacy_model": "ru_core_news_sm"
    },
    "file_statistics": [],
    "dependency_patterns": {},
    "pos_patterns": {}
}

# Collect statistics
total_edus = 0
total_tokens = 0
dependency_counts = {}
pos_counts = {}

for file_result in all_results:
    file_stats = {
        "filename": os.path.basename(file_result["file_path"]),
        "total_edus": file_result["total_edus"],
        "analyzed_edus": len(file_result["edus_analysis"]),
        "total_tokens": 0
    }
    
    for edu in file_result["edus_analysis"]:
        file_stats["total_tokens"] += len(edu["tokens"])
        total_tokens += len(edu["tokens"])
        
        # Count dependencies
        for dep in edu["dependencies"]:
            dep_rel = dep["relation"]
            dependency_counts[dep_rel] = dependency_counts.get(dep_rel, 0) + 1
        
        # Count POS tags
        for token in edu["tokens"]:
            pos_tag = token["pos"]
            pos_counts[pos_tag] = pos_counts.get(pos_tag, 0) + 1
    
    total_edus += file_stats["analyzed_edus"]
    summary_report["file_statistics"].append(file_stats)

# Sort patterns by frequency
summary_report["dependency_patterns"] = dict(sorted(dependency_counts.items(), key=lambda x: x[1], reverse=True))
summary_report["pos_patterns"] = dict(sorted(pos_counts.items(), key=lambda x: x[1], reverse=True))

# Add overall statistics
summary_report["overall_statistics"] = {
    "total_edus_analyzed": total_edus,
    "total_tokens_analyzed": total_tokens,
    "unique_dependency_relations": len(dependency_counts),
    "unique_pos_tags": len(pos_counts),
    "average_tokens_per_edu": total_tokens / total_edus if total_edus > 0 else 0
}

# Save summary report
summary_file = os.path.join(summary_dir, "analysis_summary.json")
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary_report, f, ensure_ascii=False, indent=2)

# Create readable text summary
text_summary = f"""
RUSSIAN EDU DEPENDENCY ANALYSIS SUMMARY
=======================================
Generated: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Output Directory: {output_dir}

OVERALL STATISTICS:
- Files processed: {len(all_results)}
- Total EDUs analyzed: {total_edus}
- Total tokens analyzed: {total_tokens}
- Average tokens per EDU: {total_tokens / total_edus if total_edus > 0 else 0:.2f}
- Unique dependency relations: {len(dependency_counts)}
- Unique POS tags: {len(pos_counts)}

TOP 10 DEPENDENCY RELATIONS:
"""

for i, (dep, count) in enumerate(list(summary_report["dependency_patterns"].items())[:10]):
    text_summary += f"{i+1:2d}. {dep:15s} ({count:3d} occurrences)\n"

text_summary += "\nTOP 10 POS TAGS:\n"
for i, (pos, count) in enumerate(list(summary_report["pos_patterns"].items())[:10]):
    text_summary += f"{i+1:2d}. {pos:15s} ({count:3d} occurrences)\n"

text_summary += "\nFILES PROCESSED:\n"
for file_stat in summary_report["file_statistics"]:
    text_summary += f"- {file_stat['filename']}: {file_stat['analyzed_edus']} EDUs, {file_stat['total_tokens']} tokens\n"

text_summary += f"""
OUTPUT STRUCTURE:
{output_dir}/
├── dependency_structures/     # JSON files with detailed dependency analysis
├── visualizations/           # HTML files with dependency tree visualizations  
└── summaries/               # Summary reports and statistics

"""

# Save text summary
text_summary_file = os.path.join(summary_dir, "analysis_summary.txt")
with open(text_summary_file, 'w', encoding='utf-8') as f:
    f.write(text_summary)

print("="*60)
print("📊 ANALYSIS COMPLETE! 📊")
print("="*60)
print(text_summary)
print(f"💾 Detailed summary saved to: {summary_file}")
print(f"📄 Text summary saved to: {text_summary_file}")
print(f"🗂️  All results available in: {output_dir}")

📊 ANALYSIS COMPLETE! 📊

RUSSIAN EDU DEPENDENCY ANALYSIS SUMMARY
Generated: 2025-08-16 19:42:36
Output Directory: ../results/russian_edu_analysis_20250816_194212

OVERALL STATISTICS:
- Files processed: 5
- Total EDUs analyzed: 15
- Total tokens analyzed: 102
- Average tokens per EDU: 6.80
- Unique dependency relations: 16
- Unique POS tags: 14

TOP 10 DEPENDENCY RELATIONS:
 1. punct           ( 17 occurrences)
 2. dep             ( 15 occurrences)
 3. ROOT            ( 15 occurrences)
 4. nmod            ( 10 occurrences)
 5. case            (  8 occurrences)
 6. conj            (  8 occurrences)
 7. nsubj           (  7 occurrences)
 8. amod            (  6 occurrences)
 9. obl             (  4 occurrences)
10. cc              (  4 occurrences)

TOP 10 POS TAGS:
 1. NOUN            ( 21 occurrences)
 2. SPACE           ( 15 occurrences)
 3. PUNCT           ( 15 occurrences)
 4. PROPN           ( 13 occurrences)
 5. ADJ             (  8 occurrences)
 6. ADP             (  8 occurrences)