# Explainable Rule-Based Relation Extraction
## Milestone 2 - SemEval 2010 Task 8

**Objective:** Implement and evaluate a deterministic, rule-based system for relation extraction that is both effective and fully explainable.

This notebook details the process of building a relation extraction system using spaCy. The core of this approach is an automatic rule discovery mechanism that mines patterns from training data, filters them based on statistical quality (precision and support), and applies them using spaCy's efficient matchers.

**Key Goals for Milestone 2:**
1.  **Implement a Baseline:** Develop a rule-based system from scratch.
2.  **Quantitative Evaluation:** Measure performance using metrics like accuracy, precision, recall, and F1-score.
3.  **Qualitative Analysis:** Analyze the system's behavior, understand its strengths through explainability, and investigate its weaknesses through error analysis.

This notebook will walk through each of these steps, from data preparation to the final analysis.

In [None]:
## 1. Setup: Libraries and Data Loading

# === 1.1 Import Libraries ===
import json
import numpy as np
import spacy
from collections import defaultdict, Counter
from sklearn.metrics import accuracy_score, classification_report
from tqdm.auto import tqdm
import os
from pathlib import Path

# === 1.2 Load spaCy Model ===
nlp = spacy.load("en_core_web_lg")

print("Libraries loaded successfully!")
print(f"spaCy version: {spacy.__version__}")

# === 1.3 Load Datasets ===
# Set working directory to the project root for consistent paths
# Assumes the notebook is run from the root of the project
print(f"Current working directory: {Path.cwd()}")

print("\nLoading datasets...")
try:
    with open('../data/processed/train/train.json', 'r') as f:
        train_data = json.load(f)

    with open('../data/processed/test/test.json', 'r') as f:
        test_data = json.load(f)

    print(f"Training samples: {len(train_data)}")
    print(f"Test samples: {len(test_data)}")
except FileNotFoundError as e:
    print(f"Error: {e}. Make sure you have run the preprocessing scripts and that the data files exist at the specified paths.")


  from .autonotebook import tqdm as notebook_tqdm


Libraries loaded successfully!
spaCy version: 3.8.11
Current working directory: /Users/berke/Desktop/School/TU_Wien-MSC/2025W/194.093_NLP-IE/Project/Token13-tuw-nlp-ie-2025WS

Loading datasets...
Training samples: 8000
Test samples: 2717
Training samples: 8000
Test samples: 2717


## 2. Data Processing and Feature Extraction

To build reliable rule-based patterns, we first transform each annotated sample into a structured linguistic representation. This preprocessing stage provides all the features our rule induction and matching steps depend on.

1. **Reconstructing spaCy `Doc` Objects**

    We rebuild spaCy `Doc` objects directly from the pre-tokenized JSON annotations.
    This gives us access to tokens, lemmas, POS tags, and dependency heads **without** running the spaCy NLP pipeline again.
    Each `Doc` is therefore lightweight but still fully compatible with spaCy’s token and dependency operations.

2. **Identifying Entity Spans**

    For each sample, we use the token indices of the annotated entities (`e1` and `e2`) to recover their corresponding `Span` objects inside the reconstructed `Doc`.
    These spans give us the entity roots, their heads, and their token ranges.

3. **Extracting Linguistic Features**

    We compute two core features for rule construction:

    * **Dependency Path**:
    Instead of using spaCy’s LCA matrix, we compute the dependency path between the entity roots by traversing their ancestor chains and locating the first common ancestor manually.
    This method is simple, deterministic, and works cleanly with our reconstructed dependency trees.

    * **Between-Entity Tokens**:
    We extract the exact token span between `e1` and `e2`, capturing intermediate lemmas, POS tags, and dependency labels.
    These between-words often encode strong relational cues (e.g., “caused by”, “part of”, “located in”).
        - **Span**: the continuous sequence of tokens lying strictly between the two entity spans.
            *Example*: In “A binds to B”, the span between `e1 = A` and `e2 = B` is the tokens “binds to”.

Together, these features give a compact but expressive description of how the two entities relate within the sentence.

The following functions implement this preprocessing pipeline.

In [2]:
from spacy.tokens import Doc

def doc_from_json(item, nlp):
    """
    Create a spaCy Doc from pre-computed JSON annotations.
    """
    tokens_data = item['tokens']
    
    # Extract token attributes
    words = [t['text'] for t in tokens_data]
    spaces = [i < len(words) - 1 for i in range(len(words))]
    
    # Create Doc with words and spaces
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    
    # Set linguistic attributes from pre-computed data
    for token, token_data in zip(doc, tokens_data):
        token.lemma_ = token_data['lemma']
        token.pos_ = token_data['pos']
        token.tag_ = token_data['tag']
        token.dep_ = token_data['dep']
        
        # Set head (dependency parent)
        head_id = token_data['head']
        if head_id != token.i:
            token.head = doc[head_id]
    
    return doc


def get_dependency_path(doc, e1_span, e2_span):
    """Extract dependency path between entity roots via LCA (no matrix)."""
    e1_root = e1_span.root
    e2_root = e2_span.root
    
    # Collect ancestors from e1_root to the root
    ancestors_e1 = []
    cur = e1_root
    while True:
        ancestors_e1.append(cur)
        if cur.head == cur:  # reached root
            break
        cur = cur.head
    
    # Walk up from e2_root until we hit something in ancestors_e1
    path_down_nodes = []
    cur = e2_root
    while cur not in ancestors_e1:
        path_down_nodes.append(cur)
        if cur.head == cur:  # fallback, no intersection (shouldn't happen in a tree)
            break
        cur = cur.head
    
    lca = cur
    # nodes from e1_root up to LCA (exclusive)
    path_up_nodes = []
    cur = e1_root
    while cur != lca:
        path_up_nodes.append(cur)
        cur = cur.head
    
    # Build features
    path_up = [(t.dep_, t.pos_, t.lemma_) for t in path_up_nodes]
    lca_feat = (lca.dep_, lca.pos_, lca.lemma_)
    path_down = [(t.dep_, t.pos_, t.lemma_) for t in reversed(path_down_nodes)]

    return path_up + [lca_feat] + path_down


def get_between_span(doc, e1_span, e2_span):
    """Get span between entities using Doc slicing."""
    if e1_span.start < e2_span.start:
        return doc[e1_span.end:e2_span.start]
    return doc[e2_span.end:e1_span.start]


def preprocess_data(data_list, nlp):
    """
    Process data using pre-computed annotations from JSON.
    """
    processed = []
    
    for item in tqdm(data_list, desc="Processing"):
        # Create Doc from pre-computed annotations
        doc = doc_from_json(item, nlp)
        
        e1_info = item['entities'][0]
        e2_info = item['entities'][1]
        
        # Create spans using token indices
        e1_token_ids = e1_info['token_ids']
        e2_token_ids = e2_info['token_ids']
        e1_span = doc[min(e1_token_ids):max(e1_token_ids)+1]
        e2_span = doc[min(e2_token_ids):max(e2_token_ids)+1]
        
        # Extract features
        dep_path = get_dependency_path(doc, e1_span, e2_span)
        between_span = get_between_span(doc, e1_span, e2_span)
        
        between_words = [
            {'text': t.text, 'lemma': t.lemma_, 'pos': t.pos_, 'dep': t.dep_}
            for t in between_span
        ]
        
        processed.append({
            'id': item['id'],
            'text': item['text'],
            'doc': doc,
            'e1_span': e1_span,
            'e2_span': e2_span,
            'relation': item['relation']['type'],
            'direction': item['relation'].get('direction', ''),
            'dep_path': dep_path,
            'between_words': between_words
        })
    
    return processed


In [3]:
# Process train and test data
print("Processing data...")
print()

train_processed = preprocess_data(train_data, nlp)
print("\nProcessing test data...")
test_processed = preprocess_data(test_data, nlp)

print(f"\nProcessed {len(train_processed)} training samples")
print(f"Processed {len(test_processed)} test samples")

# Display sample
print("\n" + "="*80)
print("Sample output:")
print("="*80)
sample = train_processed[0]
doc = sample['doc']
e1_span = sample['e1_span']
e2_span = sample['e2_span']

print(f"Text: {sample['text']}")
print(f"Entity 1: {e1_span.text} (POS: {e1_span.root.pos_}, DEP: {e1_span.root.dep_})")
print(f"Entity 2: {e2_span.text} (POS: {e2_span.root.pos_}, DEP: {e2_span.root.dep_})")
print(f"Relation: {sample['relation']}")
print(f"\nDependency path: {sample['dep_path'][:3]}...")
print(f"Between words: {[w['text'] for w in sample['between_words']]}")


Processing data...



Processing: 100%|██████████| 8000/8000 [00:00<00:00, 9440.13it/s] 
Processing: 100%|██████████| 8000/8000 [00:00<00:00, 9440.13it/s] 



Processing test data...


Processing: 100%|██████████| 2717/2717 [00:00<00:00, 13037.17it/s]


Processed 8000 training samples
Processed 2717 test samples

Sample output:
Text: The system as described above has its greatest application in an arrayed configuration of antenna elements.
Entity 1: configuration (POS: NOUN, DEP: pobj)
Entity 2: elements (POS: NOUN, DEP: pobj)
Relation: Component-Whole

Dependency path: [('pobj', 'NOUN', 'configuration'), ('prep', 'ADP', 'of'), ('pobj', 'NOUN', 'element')]...
Between words: ['of', 'antenna']





## 3.5 Exploratory Data Analysis - Extract Patterns from Data

Before defining rules manually, let's analyze the actual dataset to discover:
1. Most frequent words/lemmas per relation type
2. Common verbs and prepositions for each relation
3. Dependency patterns extracted from the shortest path between entity roots
4. Discriminative features that distinguish relations

---

**Why These Default Values?**

**Keywords: 30** — Open-class words (nouns, adjectives) have high variety; need more examples to capture diverse expressions  
**Verbs: 15** — Medium-sized vocabulary; syntactic backbone of relations  
**Prepositions: 10** — Small closed-class set (~70 in English); highly discriminative

These balance **coverage** (capture enough patterns) vs. **precision** (avoid noise).


In [4]:
def generate_patterns_from_analysis(relation_features, top_n_keywords=30, top_n_verbs=15, top_n_preps=10):
    """
    Generate RELATION_PATTERNS dictionary from data analysis.
    Extract most frequent and distinctive features per relation.
    """
    generated_patterns = {}
    
    for relation, features in relation_features.items():
        # Extract top keywords (lemmas)
        keywords = [lemma for lemma, count in features['top_lemmas'][:top_n_keywords]]
        
        # Extract top verbs
        verbs = [verb for verb, count in features['top_verbs'][:top_n_verbs]]
        
        # Extract top prepositions
        preps = [prep for prep, count in features['top_preps'][:top_n_preps]]
        
        # Extract dependency patterns (convert tuples back to lists)
        dep_patterns = []
        for path, count in features['top_dep_paths'][:5]:
            if len(path) >= 2:  # At least 2 dependencies
                dep_patterns.append(list(path[:3]))  # Take first 3 deps
        
        generated_patterns[relation] = {
            'keywords': keywords,
            'prep_patterns': preps,
            'verb_patterns': verbs,
            'dependency_patterns': dep_patterns
        }
    
    return generated_patterns

In [5]:
def analyze_relation_features(processed_data):
    """
    Analyze linguistic features for each relation type.
    Returns dictionaries of feature frequencies per relation.
    """
    # Group by relation type
    relation_groups = defaultdict(list)
    for sample in processed_data:
        relation_groups[sample['relation']].append(sample)
    
    # Analyze each relation
    relation_analysis = {}
    
    for relation, samples in relation_groups.items():
        # Collect features from all samples of this relation
        all_lemmas = []
        all_verbs = []
        all_preps = []
        all_dep_paths = []
        all_between_words = []
        
        for sample in samples:
            doc = sample['doc']
            
            # Collect lemmas (excluding entities)
            e1_tokens = set(range(sample['e1_span'].start, sample['e1_span'].end))
            e2_tokens = set(range(sample['e2_span'].start, sample['e2_span'].end))
            
            for token in doc:
                if token.i not in e1_tokens and token.i not in e2_tokens:
                    lemma = token.lemma_.lower()
                    
                    # Collect verbs (don't filter stopwords for verbs)
                    if token.pos_ == 'VERB' and not token.is_punct and len(lemma) > 2:
                        all_verbs.append(lemma)
                    
                    # Collect prepositions (INCLUDE stopwords like "of", "in", "at")
                    if token.pos_ == 'ADP' and not token.is_punct:
                        all_preps.append(lemma)
                    
                    # Collect other lemmas (filter stopwords for general keywords)
                    if not token.is_stop and not token.is_punct and len(lemma) > 2:
                        all_lemmas.append(lemma)
            
            # Collect dependency paths (sequence of dependency labels along shortest path)
            if sample['dep_path']:
                path_deps = tuple([d[0] for d in sample['dep_path']])
                all_dep_paths.append(path_deps)
            
            # Between words (fixed: should be "if word['text'].strip()" not "if not")
            for word in sample['between_words']:
                if word['text'].strip() and len(word['lemma']) > 2:
                    all_between_words.append(word['lemma'].lower())
        
        # Count frequencies
        lemma_freq = Counter(all_lemmas).most_common(30)
        verb_freq = Counter(all_verbs).most_common(15)
        prep_freq = Counter(all_preps).most_common(10)
        dep_path_freq = Counter(all_dep_paths).most_common(10)
        between_freq = Counter(all_between_words).most_common(20)
        
        relation_analysis[relation] = {
            'count': len(samples),
            'top_lemmas': lemma_freq,
            'top_verbs': verb_freq,
            'top_preps': prep_freq,
            'top_dep_paths': dep_path_freq,
            'top_between_words': between_freq
        }
    
    return relation_analysis

In [6]:
# Generate data-driven patterns
print("\n" + "="*80)
print("GENERATING DATA-DRIVEN PATTERNS")
print("Top features per relation extracted from analysis")
print("="*80)

# First, analyze the training data to get relation features
relation_features = analyze_relation_features(train_processed)
data_driven_patterns = generate_patterns_from_analysis(relation_features)

# Display generated patterns
for relation in sorted(data_driven_patterns.keys()):
    patterns = data_driven_patterns[relation]
    print(f"\n{relation}:")
    print(f"  Keywords ({len(patterns['keywords'])}): {patterns['keywords'][:10]}")
    print(f"  Verbs ({len(patterns['verb_patterns'])}): {patterns['verb_patterns']}")
    print(f"  Preps ({len(patterns['prep_patterns'])}): {patterns['prep_patterns']}")
    print(f"  Dep patterns: {len(patterns['dependency_patterns'])} patterns extracted")

print("\n" + "="*80)
print("Data-driven patterns generated successfully!")
print("These patterns are based on actual frequency analysis of the training data.")
print("="*80)


GENERATING DATA-DRIVEN PATTERNS
Top features per relation extracted from analysis

Cause-Effect:
  Keywords (30): ['cause', 'result', 'lead', 'produce', 'trigger', 'year', 'come', 'people', 'water', 'time']
  Verbs (15): ['cause', 'result', 'make', 'have', 'produce', 'trigger', 'come', 'lead', 'take', 'generate', 'use', 'get', 'find', 'help', 'follow']
  Preps (10): ['of', 'by', 'in', 'from', 'to', 'on', 'with', 'for', 'as', 'at']
  Dep patterns: 5 patterns extracted

Component-Whole:
  Keywords (30): ['comprise', 'contain', 'include', 'inside', 'hand', 'small', 'large', 'like', 'show', 'open']
  Verbs (15): ['have', 'use', 'make', 'contain', 'comprise', 'include', 'show', 'see', 'consist', 'take', 'hold', 'provide', 'connect', 'compose', 'move']
  Preps (10): ['of', 'in', 'with', 'on', 'to', 'for', 'from', 'at', 'by', 'as']
  Dep patterns: 5 patterns extracted

Content-Container:
  Keywords (30): ['inside', 'contain', 'find', 'store', 'enclose', 'small', 'lock', 'plastic', 'hide', 'p

## 4. Automatic Rule Discovery from Training Data

We build a deterministic and fully explainable rule-based system by mining
patterns from the training data and converting them into spaCy matchers
(Matcher, PhraseMatcher, DependencyMatcher). Each discovered rule is associated
with:

- a predicted relation,
- a precision value,
- a support count, (the count for the relation with the highest frequency for that pattern)
- and a human-readable explanation.

All rules are ranked by (precision, support) and applied in a deterministic
decision list: the highest-precision rule that matches is selected ("first match
wins"). This produces an efficient, interpretable, and data-driven rule-based
classifier.


**Explainability** <br>
A helper function plots top rules per relation, showing:
   - pattern type,
   - precision,
   - support,
   - and a short textual explanation.

---

## How Patterns Are Scored and Converted into Rules

### 1. Pattern Mining

For each training sentence we extract:

- **Lexical patterns**
   - `LEMMA`: single lemmas between the two entities  
   - `BIGRAM`: lemma pairs between entities  
   - `PREP`: prepositions (ADP) between entities  
   - `BEFORE_E1` / `AFTER_E2`: context tokens next to entities  
   - `ENTITY_POS`: `(POS(E1), POS(E2))` pair  

- **Dependency patterns**
   - `DEP_VERB`: verb lemma + dependency roles of both entities relative to that verb  
   (e.g., `nsubj` → `contain` → `dobj` → Content-Container)
   - `DEP_LABELS`: dependency labels of the entity heads

For each pattern we store a frequency table:
  - pattern_counts[pattern][relation] = number of occurrences


---

### 2. Precision and Support Calculation

In the `filter_and_rank_patterns` function, the precision for each rule is calculated as follows:

1.  **`total_count`**: The total number of occurrences of a specific pattern across *all* relation types is summed up. 
   - $\text{total} = \sum_{\text{relations}} \text{count(pattern, relation)}$
2.  **`best_count`**: The maximum count of this pattern for any single relation is identified. This represents the 'support' for the pattern's most frequent relation.
   - $\text{best\_relation} = \arg\max_{\text{r}} \text{count(pattern, r)}$
   - $\text{support} = \max_{\text{r}} \text{count(pattern, r)}$
3.  **`precision`**: The precision is then calculated by dividing the `best_count` by the `total_count`. This metric indicates how reliably the pattern predicts its dominant relation.
   - $\text{precision} = \frac{\text{support}}{\text{total}}$

This calculation ensures that a rule is considered high-precision only if it consistently points to a single relation type, even if its overall frequency is moderate.


**Rule creation criteria**
- `precision ≥ 0.60`  
- `support ≥ 2`

Patterns passing both thresholds are converted into rules.

---

### 3. Rule Ranking and Representation

Selected rules are sorted by:
1. **precision** (descending)  
2. **support** (descending)

Each rule is stored as:

```python
{
    "name": "Component-Whole_PREP_of",
    "relation": "Component-Whole",
    "matcher_type": "lexical",
    "pattern_type": "PREP",
    "pattern_data": ["of"],
    "precision": 0.76,
    "support": 340,
    "explanation": 'Preposition "of" between entities'
}
```
---
### 4. Why This Approach Works

* **Data-driven**: patterns come directly from labeled examples
* **Reliable**: precision threshold ensures rules are strong predictors
* **Flexible**: supports lexical, syntactic, and contextual signals
* **Explainable**: every prediction includes the rule + statistics
* **Deterministic**: no machine learning model is required at inference time


In [7]:
def extract_candidate_patterns(processed_data):
    """
    Mine candidate lexical and dependency patterns from labeled training data.

    Args:
        processed_data: iterable of samples, each with:
            - 'relation': gold relation label
            - 'doc': spaCy Doc
            - 'e1_span', 'e2_span': entity spans
            - 'dep_path': dependency path between entities (optional)

    Returns:
        lexical_patterns: dict[pattern_key][relation] -> count
        dep_patterns: dict[pattern_key][relation] -> count
    """
    # Group samples by relation
    relation_groups = defaultdict(list)
    for sample in processed_data:
        relation_groups[sample['relation']].append(sample)
    
    # Track pattern occurrences: pattern_key -> {relation -> count}
    lexical_patterns = defaultdict(lambda: defaultdict(int))
    dep_patterns = defaultdict(lambda: defaultdict(int))
    
    print("Mining candidate patterns from training data...")
    print("="*80)
    
    for relation, samples in relation_groups.items():
        print(f"\n{relation}: {len(samples)} samples")
        
        for sample in samples:
            doc = sample['doc']
            e1_span = sample['e1_span']
            e2_span = sample['e2_span']
            
            # Extract between-span features
            if e1_span.start < e2_span.start:
                between_span = doc[e1_span.end:e2_span.start]
            else:
                between_span = doc[e2_span.end:e1_span.start]
            
            # 1. LEXICAL PATTERNS: Between-span lemmas and bigrams
            between_lemmas = [t.lemma_.lower() for t in between_span if not t.is_punct]
            
            # Single lemmas
            for lemma in between_lemmas:
                if len(lemma) > 2:
                    pattern_key = ('LEMMA', lemma)
                    lexical_patterns[pattern_key][relation] += 1
            
            # Bigrams
            for i in range(len(between_lemmas) - 1):
                bigram = (between_lemmas[i], between_lemmas[i+1])
                pattern_key = ('BIGRAM', bigram)
                lexical_patterns[pattern_key][relation] += 1
            
            # Prepositions (very important)
            for token in between_span:
                if token.pos_ == 'ADP':
                    pattern_key = ('PREP', token.lemma_.lower())
                    lexical_patterns[pattern_key][relation] += 1
            
            # Context window: word before e1 and word after e2
            if e1_span.start > 0:
                before_e1 = doc[e1_span.start - 1]
                if not before_e1.is_punct and len(before_e1.lemma_) > 2:
                    pattern_key = ('BEFORE_E1', before_e1.lemma_.lower())
                    lexical_patterns[pattern_key][relation] += 1
            
            if e2_span.end < len(doc):
                after_e2 = doc[e2_span.end]
                if not after_e2.is_punct and len(after_e2.lemma_) > 2:
                    pattern_key = ('AFTER_E2', after_e2.lemma_.lower())
                    lexical_patterns[pattern_key][relation] += 1
            
            # Entity POS tag pattern
            pattern_key = ('ENTITY_POS', e1_span.root.pos_, e2_span.root.pos_)
            lexical_patterns[pattern_key][relation] += 1
            
            # 2. DEPENDENCY PATTERNS: e1 and e2 roles + verb
            e1_head = e1_span.root
            e2_head = e2_span.root
            
            # Find connecting verb (if any)
            dep_path = sample['dep_path']
            path_lemmas = [d[2] for d in dep_path] if dep_path else []
            path_deps = [d[0] for d in dep_path] if dep_path else []
            
            # Look for verb in path
            for token in doc:
                if token.pos_ == 'VERB':
                    # Check if this verb connects e1 and e2
                    e1_dep_to_verb = None
                    e2_dep_to_verb = None
                    
                    # Check e1 relation to verb
                    if e1_head.head == token:
                        e1_dep_to_verb = e1_head.dep_
                    elif e1_head == token:
                        e1_dep_to_verb = 'VERB_IS_E1'
                    
                    # Check e2 relation to verb
                    if e2_head.head == token:
                        e2_dep_to_verb = e2_head.dep_
                    elif e2_head == token:
                        e2_dep_to_verb = 'VERB_IS_E2'
                    
                    if e1_dep_to_verb and e2_dep_to_verb:
                        verb_lemma = token.lemma_.lower()
                        pattern_key = ('DEP_VERB', verb_lemma, e1_dep_to_verb, e2_dep_to_verb)
                        dep_patterns[pattern_key][relation] += 1
            
            # Simpler: just e1 and e2 dependency labels
            pattern_key = ('DEP_LABELS', e1_head.dep_, e2_head.dep_)
            dep_patterns[pattern_key][relation] += 1
    
    return lexical_patterns, dep_patterns

In [8]:
def filter_and_rank_patterns(lexical_patterns, dep_patterns, min_precision=0.60, min_support=2):
    """
    Filter patterns by precision and support, then rank them.
    Lower thresholds (precision=0.60, support=2) for better coverage.
    Returns: ordered list of rule dicts
    """
    rules = []
    
    # Process lexical patterns
    for pattern_key, relation_counts in lexical_patterns.items():
        total_count = sum(relation_counts.values())
        if total_count < min_support:
            continue
        
        # Find dominant relation
        best_relation = max(relation_counts, key=relation_counts.get)
        best_count = relation_counts[best_relation]
        precision = best_count / total_count
        
        if precision >= min_precision:
            pattern_type, *pattern_data = pattern_key
            
            # Create rule dict
            rule = {
                'name': f"{best_relation}_{pattern_type}_{hash(pattern_key) % 10000}",
                'relation': best_relation,
                'direction': 'e1,e2',  # Default direction
                'matcher_type': 'lexical',
                'pattern_type': pattern_type,
                'pattern_data': pattern_data,
                'precision': precision,
                'support': best_count,
                'explanation': f"{pattern_type} pattern: {pattern_data}"
            }
            rules.append(rule)
    
    # Process dependency patterns  
    for pattern_key, relation_counts in dep_patterns.items():
        total_count = sum(relation_counts.values())
        if total_count < min_support:
            continue
        
        best_relation = max(relation_counts, key=relation_counts.get)
        best_count = relation_counts[best_relation]
        precision = best_count / total_count
        
        if precision >= min_precision:
            pattern_type, *pattern_data = pattern_key
            
            rule = {
                'name': f"{best_relation}_{pattern_type}_{hash(pattern_key) % 10000}",
                'relation': best_relation,
                'direction': 'e1,e2',
                'matcher_type': 'dependency',
                'pattern_type': pattern_type,
                'pattern_data': pattern_data,
                'precision': precision,
                'support': best_count,
                'explanation': f"{pattern_type}: {pattern_data}"
            }
            rules.append(rule)
    
    # Sort by precision (descending), then support (descending)
    rules.sort(key=lambda r: (-r['precision'], -r['support']))
    
    return rules

In [9]:
# Mine patterns from training data
print("\nStep 1: Mining patterns from training data...")
lexical_patterns, dep_patterns = extract_candidate_patterns(train_processed)

print(f"\nFound {len(lexical_patterns)} unique lexical pattern candidates")
print(f"Found {len(dep_patterns)} unique dependency pattern candidates")

# Filter and rank patterns
print("\nStep 2: Filtering by precision ≥ 0.60 and support ≥ 2...")
DISCOVERED_RULES = filter_and_rank_patterns(lexical_patterns, dep_patterns, 
                                             min_precision=0.60, min_support=2)

print(f"\nDiscovered {len(DISCOVERED_RULES)} high-quality rules")
print("\nTop 10 rules:")
print("="*100)
print(f"{'Relation':<25} {'Type':<15} {'Precision':<12} {'Support':<10} {'Pattern'}")
print("-"*100)
for rule in DISCOVERED_RULES[:10]:
    print(f"{rule['relation']:<25} {rule['pattern_type']:<15} {rule['precision']:<12.3f} {rule['support']:<10} {str(rule['pattern_data'])[:40]}")


Step 1: Mining patterns from training data...
Mining candidate patterns from training data...

Component-Whole: 941 samples

Other: 1410 samples

Instrument-Agency: 504 samples

Member-Collection: 690 samples

Cause-Effect: 1003 samples

Entity-Destination: 845 samples

Content-Container: 540 samples

Message-Topic: 634 samples

Product-Producer: 717 samples

Entity-Origin: 716 samples

Found 16976 unique lexical pattern candidates
Found 368 unique dependency pattern candidates

Step 2: Filtering by precision ≥ 0.60 and support ≥ 2...

Discovered 1841 high-quality rules

Top 10 rules:
Relation                  Type            Precision    Support    Pattern
----------------------------------------------------------------------------------------------------
Cause-Effect              LEMMA           1.000        29         ['trigger']
Cause-Effect              BIGRAM          1.000        27         [('cause', 'of')]
Cause-Effect              BIGRAM          1.000        25         [('t

In [10]:
# Visualize rules by relation for explainability
def visualize_rules_by_relation(rules, top_n=5):
    """
    Display the top-N highest precision rules for each relation.

    Note:
        `rules` is already be sorted globally by (precision desc, support desc) as produced by `filter_and_rank_patterns()`.
        
        Because of this, taking the first N rules in each relation group
        truly reflects the strongest patterns for that relation.

    This function does not re-sort; it only groups and prints the top rules.
    """
    relation_rules = defaultdict(list)
    
    for rule in rules:
        relation_rules[rule['relation']].append(rule)
    
    print("\n" + "="*100)
    print("TOP RULES BY RELATION TYPE (for Explainability)")
    print("="*100)
    
    for relation in sorted(relation_rules.keys()):
        rules_list = relation_rules[relation][:top_n]
        print(f"\n{'='*100}")
        print(f"Relation: {relation} ({len(relation_rules[relation])} total rules)")
        print(f"{'='*100}")
        
        for i, rule in enumerate(rules_list, 1):
            print(f"\n  Rule {i}: {rule['name']}")
            print(f"    Type: {rule['pattern_type']}")
            print(f"    Precision: {rule['precision']:.3f} | Support: {rule['support']}")
            
            # Convert to spaCy Matcher syntax
            pattern_type = rule['pattern_type']
            pattern_data = rule['pattern_data']
            
            if pattern_type == 'LEMMA':
                spacy_pattern = f'[{{"LEMMA": "{pattern_data[0]}"}}]'
            elif pattern_type == 'BIGRAM':
                spacy_pattern = f'[{{"LEMMA": "{pattern_data[0][0]}"}}, {{"LEMMA": "{pattern_data[0][1]}"}}]'
            elif pattern_type == 'PREP':
                spacy_pattern = f'[{{"LEMMA": "{pattern_data[0]}", "POS": "ADP"}}]'
            elif pattern_type == 'BEFORE_E1':
                spacy_pattern = f'Word before E1: {{"LEMMA": "{pattern_data[0]}"}}'
            elif pattern_type == 'AFTER_E2':
                spacy_pattern = f'Word after E2: {{"LEMMA": "{pattern_data[0]}"}}'
            elif pattern_type == 'ENTITY_POS':
                spacy_pattern = f'E1.pos_=="{pattern_data[0]}" AND E2.pos_=="{pattern_data[1]}"'
            elif pattern_type == 'DEP_VERB':
                verb, e1_dep, e2_dep = pattern_data
                # Show structured DependencyMatcher pattern
                # REL_OPs are ">" indicating head relations
                spacy_pattern = f'''
            DependencyMatcher Pattern:
            [
                {{
                    "RIGHT_ID": "verb",
                    "RIGHT_ATTRS": {{"LEMMA": "{verb}", "POS": "VERB"}}
                }},
                {{
                    "LEFT_ID": "verb",
                    "REL_OP": ">",  
                    "RIGHT_ID": "e1",
                    "RIGHT_ATTRS": {{"DEP": "{e1_dep}"}}
                }},
                {{
                    "LEFT_ID": "verb",
                    "REL_OP": ">",  # verb is head of e2
                    "RIGHT_ID": "e2",
                    "RIGHT_ATTRS": {{"DEP": "{e2_dep}"}}
                }}
            ]'''
            elif pattern_type == 'DEP_LABELS':
                spacy_pattern = f'E1.dep_=="{pattern_data[0]}" AND E2.dep_=="{pattern_data[1]}"'
            else:
                spacy_pattern = str(pattern_data)
            
            print(f"    spaCy Pattern: {spacy_pattern}")

visualize_rules_by_relation(DISCOVERED_RULES, top_n=3)



TOP RULES BY RELATION TYPE (for Explainability)

Relation: Cause-Effect (291 total rules)

  Rule 1: Cause-Effect_LEMMA_9900
    Type: LEMMA
    Precision: 1.000 | Support: 29
    spaCy Pattern: [{"LEMMA": "trigger"}]

  Rule 2: Cause-Effect_BIGRAM_570
    Type: BIGRAM
    Precision: 1.000 | Support: 27
    spaCy Pattern: [{"LEMMA": "cause"}, {"LEMMA": "of"}]

  Rule 3: Cause-Effect_BIGRAM_5528
    Type: BIGRAM
    Precision: 1.000 | Support: 25
    spaCy Pattern: [{"LEMMA": "trigger"}, {"LEMMA": "by"}]

Relation: Component-Whole (128 total rules)

  Rule 1: Component-Whole_BIGRAM_6849
    Type: BIGRAM
    Precision: 1.000 | Support: 8
    spaCy Pattern: [{"LEMMA": "comprise"}, {"LEMMA": "a"}]

  Rule 2: Component-Whole_BIGRAM_242
    Type: BIGRAM
    Precision: 1.000 | Support: 7
    spaCy Pattern: [{"LEMMA": "have"}, {"LEMMA": "two"}]

  Rule 3: Component-Whole_BIGRAM_8533
    Type: BIGRAM
    Precision: 1.000 | Support: 7
    spaCy Pattern: [{"LEMMA": "good"}, {"LEMMA": "part"}]

R

In [11]:
def analyze_relation_features(processed_data):
    """
    Analyze linguistic features for each relation type.

    Each sample in `processed_data` is expected to contain:
        - 'relation': gold relation label (str)
        - 'doc': spaCy Doc
        - 'e1_span', 'e2_span': spaCy spans for the two entities
        - 'dep_path': list of (dep_label, ...) along shortest path (optional)
        - 'between_words': list of dicts with 'text' and 'lemma'

    Returns:
        dict[relation] -> {
            'count': int,
            'top_lemmas': [(lemma, freq)],
            'top_verbs': [(lemma, freq)],
            'top_preps': [(lemma, freq)],
            'top_dep_paths': [(path_tuple, freq)],
            'top_between_words': [(lemma, freq)]
        }
    """
    # Group by relation type
    relation_groups = defaultdict(list)
    for sample in processed_data:
        relation_groups[sample['relation']].append(sample)
    
    # Analyze each relation
    relation_analysis = {}
    
    for relation, samples in relation_groups.items():
        # Collect features from all samples of this relation
        all_lemmas = []
        all_verbs = []
        all_preps = []
        all_dep_paths = []
        all_between_words = []
        
        for sample in samples:
            doc = sample['doc']
            
            # Collect lemmas (excluding entities)
            e1_tokens = set(range(sample['e1_span'].start, sample['e1_span'].end))
            e2_tokens = set(range(sample['e2_span'].start, sample['e2_span'].end))
            
            for token in doc:
                if token.i not in e1_tokens and token.i not in e2_tokens:
                    lemma = token.lemma_.lower()
                    
                    # Collect verbs (don't filter stopwords for verbs)
                    if token.pos_ == 'VERB' and not token.is_punct and len(lemma) > 2:
                        all_verbs.append(lemma)
                    
                    # Collect prepositions (INCLUDE stopwords like "of", "in", "at")
                    if token.pos_ == 'ADP' and not token.is_punct:
                        all_preps.append(lemma)
                    
                    # Collect other lemmas (filter stopwords for general keywords)
                    if not token.is_stop and not token.is_punct and len(lemma) > 2:
                        all_lemmas.append(lemma)
            
            # Collect dependency paths (sequence of dependency labels along shortest path)
            if sample['dep_path']:
                path_deps = tuple([d[0] for d in sample['dep_path']])
                all_dep_paths.append(path_deps)
            
            # Between words (fixed: should be "if word['text'].strip()" not "if not")
            for word in sample['between_words']:
                if word['text'].strip() and len(word['lemma']) > 2:
                    all_between_words.append(word['lemma'].lower())
        
        # Count frequencies
        lemma_freq = Counter(all_lemmas).most_common(30)
        verb_freq = Counter(all_verbs).most_common(15)
        prep_freq = Counter(all_preps).most_common(10)
        dep_path_freq = Counter(all_dep_paths).most_common(10)
        between_freq = Counter(all_between_words).most_common(20)
        
        relation_analysis[relation] = {
            'count': len(samples),
            'top_lemmas': lemma_freq,
            'top_verbs': verb_freq,
            'top_preps': prep_freq,
            'top_dep_paths': dep_path_freq,
            'top_between_words': between_freq
        }
    
    return relation_analysis

## 5. Deterministic Rule Application Engine

Now we implement the core rule application logic using:

- **spaCy Matcher**: For single-token and sequence patterns (LEMMA, BIGRAM, PREP).  
- **spaCy PhraseMatcher**: For efficient matching of lemma-based patterns.  
- **spaCy DependencyMatcher**: For dependency-based patterns (DEP_VERB).

This follows spaCy’s recommended practice for rule-based matching.

---
### How does it work?

Rules are pre-sorted by (precision desc, support desc),
e.g. via `filter_and_rank_patterns()`. 
- We iterate them in order and stop at the first matching rule, which gives a deterministic
decision-list classifier ("first high-precision rule wins").

---

The function `apply_rule_based_classifier` applies the discovered rules to each
sentence using spaCy’s rule-based matchers. At the beginning of the function, we
pre-compile all rule patterns into three matcher objects:

- `Matcher`
- `PhraseMatcher`
- `DependencyMatcher`

These matchers serve as efficient pattern detectors over the spaCy `Doc`.

During classification (the main loop), each sample is processed by iterating
through the rules in ranked order (highest precision first). For each rule, the
corresponding matcher is applied:

- Token and phrase patterns are matched against the *between-entity* span.
- Dependency patterns are matched against the entire sentence.
- Context and POS rules are checked with simple Python conditions.

As soon as a rule’s pattern matches, that rule becomes the prediction
("first-match-wins" decision list). The function records the predicted relation,
the argument direction, and a human-readable explanation including the rule’s
precision and support. If no rule matches, the instance is labeled `"Other"`.

This preserves deterministic behavior and full explainability while using spaCy
only as a fast pattern-matching backend.


In [12]:
from spacy.matcher import Matcher, PhraseMatcher, DependencyMatcher

def apply_rule_based_classifier(samples, rules, nlp):
    """
    Apply rule-based classification using spaCy's proper matchers:
    - Matcher: for token sequences (LEMMA, BIGRAM, PREP)
    - PhraseMatcher: for efficient phrase matching (LEMMA lists)
    - DependencyMatcher: for dependency patterns (DEP_VERB)
    
    This follows spaCy documentation best practices.
    """
    # Pre-compile all matchers
    token_matcher = Matcher(nlp.vocab)
    phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
    dep_matcher = DependencyMatcher(nlp.vocab)
    
    # # Map match IDs back to rules
    # rule_lookup = {}  # match_id -> rule
    
    # === 1. Compile Token Patterns (BIGRAM, PREP) ===
    for i, rule in enumerate(rules):
        match_id = f"rule_{i}"
        # rule_lookup[match_id] = rule
        
        if rule['matcher_type'] == 'lexical':
            pattern_type = rule['pattern_type']
            pattern_data = rule['pattern_data']
            
            if pattern_type == 'BIGRAM':
                pattern = [{"LEMMA": pattern_data[0][0]}, {"LEMMA": pattern_data[0][1]}]
                token_matcher.add(match_id, [pattern])
            
            elif pattern_type == 'PREP':
                pattern = [{"LEMMA": pattern_data[0], "POS": "ADP"}]
                token_matcher.add(match_id, [pattern])
    
    # === 2. Compile Phrase Patterns (LEMMA) - More efficient ===
    lemma_rules = [(i, r) for i, r in enumerate(rules) 
                   if r['matcher_type'] == 'lexical' and r['pattern_type'] == 'LEMMA']
    
    if lemma_rules:
        # Create Doc patterns with full pipeline to get lemmas
        # Use nlp() instead of nlp.make_doc() when attr="LEMMA" is needed
        patterns = [nlp(r['pattern_data'][0]) for _, r in lemma_rules]
        match_ids = [f"rule_{i}" for i, _ in lemma_rules]
        
        for match_id, pattern in zip(match_ids, patterns):
            phrase_matcher.add(match_id, [pattern])
    
    # === 3. Compile Dependency Patterns (DEP_VERB) ===
    for i, rule in enumerate(rules):
        if rule['pattern_type'] == 'DEP_VERB':
            match_id = f"rule_{i}"
            verb_lemma, e1_dep, e2_dep = rule['pattern_data']
            
            # Build DependencyMatcher pattern
            pattern = [
                # Anchor: the verb
                {
                    "RIGHT_ID": "verb",
                    "RIGHT_ATTRS": {"LEMMA": verb_lemma, "POS": "VERB"}
                },
                # E1 connected to verb
                {
                    "LEFT_ID": "verb",
                    "REL_OP": ">",  # verb is head of e1
                    "RIGHT_ID": "e1",
                    "RIGHT_ATTRS": {"DEP": e1_dep}
                },
                # E2 connected to verb
                {
                    "LEFT_ID": "verb",
                    "REL_OP": ">",  # verb is head of e2
                    "RIGHT_ID": "e2",
                    "RIGHT_ATTRS": {"DEP": e2_dep}
                }
            ]
            
            dep_matcher.add(match_id, [pattern])
    
    # === 4. Apply Matchers to Samples ===
    predictions, directions, explanations = [], [], []
    
    for sample in tqdm(samples, desc="Classifying"):
        doc = sample['doc']
        e1_span, e2_span = sample['e1_span'], sample['e2_span']
        between_span = doc[e1_span.end:e2_span.start] if e1_span.start < e2_span.start else doc[e2_span.end:e1_span.start]
        e1_head, e2_head = e1_span.root, e2_span.root
        
        matched_rule = None
        
        # Apply rules in order (iterate through rules to maintain priority)
        for i, rule in enumerate(rules):
            match_id = f"rule_{i}"
            pattern_type = rule['pattern_type']
            pattern_data = rule['pattern_data']
            
            # === Token Matcher (BIGRAM, PREP) ===
            if pattern_type in ['BIGRAM', 'PREP']:
                matches = token_matcher(between_span)
                if any(nlp.vocab.strings[m[0]] == match_id for m in matches):
                    matched_rule = rule
                    break
            
            # === Phrase Matcher (LEMMA) ===
            elif pattern_type == 'LEMMA':
                matches = phrase_matcher(between_span)
                if any(nlp.vocab.strings[m[0]] == match_id for m in matches):
                    matched_rule = rule
                    break
            
            # === Dependency Matcher (DEP_VERB) ===
            elif pattern_type == 'DEP_VERB':
                matches = dep_matcher(doc)
                # Check if match involves our entities
                for match_id_found, token_ids in matches:
                    if nlp.vocab.strings[match_id_found] == match_id:
                        # Verify entities are involved in match
                        e1_in_match = any(t in range(e1_span.start, e1_span.end) for t in token_ids)
                        e2_in_match = any(t in range(e2_span.start, e2_span.end) for t in token_ids)
                        if e1_in_match or e2_in_match:
                            matched_rule = rule
                            break
                if matched_rule:
                    break
            
            # === Context Patterns (Manual checks) ===
            elif pattern_type == 'BEFORE_E1' and e1_span.start > 0:
                if doc[e1_span.start - 1].lemma_.lower() == pattern_data[0]:
                    matched_rule = rule
                    break
            
            elif pattern_type == 'AFTER_E2' and e2_span.end < len(doc):
                if doc[e2_span.end].lemma_.lower() == pattern_data[0]:
                    matched_rule = rule
                    break
            
            # === Entity POS Pattern ===
            elif pattern_type == 'ENTITY_POS':
                if e1_head.pos_ == pattern_data[0] and e2_head.pos_ == pattern_data[1]:
                    matched_rule = rule
                    break
            
            # === Simple Dependency Labels ===
            elif pattern_type == 'DEP_LABELS':
                if e1_head.dep_ == pattern_data[0] and e2_head.dep_ == pattern_data[1]:
                    matched_rule = rule
                    break
        
        # Record prediction
        if matched_rule:
            predictions.append(matched_rule['relation'])
            directions.append(matched_rule['direction'])
            explanations.append(f"Rule {matched_rule['name']}: {matched_rule['explanation']} (precision={matched_rule['precision']:.2f}, support={matched_rule['support']})")
        else:
            predictions.append('Other')
            directions.append('e1,e2')
            explanations.append('No high-precision rule matched; defaulting to Other.')
    
    return predictions, directions, explanations

In [13]:
# Test the rule-based classifier on a few samples
print("Testing rule-based classifier on sample sentences...")
print("="*80)

# Quick test on 5 samples
test_samples = train_processed[:5]
test_preds, test_dirs, test_expls = apply_rule_based_classifier(test_samples, DISCOVERED_RULES, nlp)

for i, (sample, relation, explanation) in enumerate(zip(test_samples, test_preds, test_expls)):
    print(f"\nSample {i+1}:")
    print(f"Text: {sample['text'][:100]}...")
    print(f"E1: '{sample['e1_span'].text}' | E2: '{sample['e2_span'].text}'")
    print(f"True: {sample['relation']}")
    print(f"Predicted: {relation}")
    print(f"Explanation: {explanation}")
    print(f"Match: {'✓' if relation == sample['relation'] else '✗'}")

Testing rule-based classifier on sample sentences...


Classifying: 100%|██████████| 5/5 [00:00<00:00,  8.06it/s]


Sample 1:
Text: The system as described above has its greatest application in an arrayed configuration of antenna el...
E1: 'configuration' | E2: 'elements'
True: Component-Whole
Predicted: Other
Explanation: No high-precision rule matched; defaulting to Other.
Match: ✗

Sample 2:
Text: The child was carefully wrapped and bound into the cradle by means of a cord....
E1: 'child' | E2: 'cradle'
True: Other
Predicted: Entity-Destination
Explanation: Rule Entity-Destination_BIGRAM_9249: BIGRAM pattern: [('into', 'the')] (precision=0.91, support=267)
Match: ✗

Sample 3:
Text: The author of a keygen uses a disassembler to look at the raw assembly code....
E1: 'author' | E2: 'disassembler'
True: Instrument-Agency
Predicted: Instrument-Agency
Explanation: Rule Instrument-Agency_DEP_VERB_8188: DEP_VERB: ['use', 'nsubj', 'dobj'] (precision=0.87, support=45)
Match: ✓

Sample 4:
Text: A misty ridge uprises from the surge....
E1: 'ridge' | E2: 'surge'
True: Other
Predicted: Other
Explanation: Rule




**Generate Predictions Using the Rule-Based Classifier**

We evaluate the discovered rules by applying the deterministic rule engine
to the processed datasets. The function `apply_rule_based_classifier`:

1. Pre-compiles all rules into spaCy matchers (token, phrase, dependency).
2. For each sample, checks the rules in ranked order (by precision).
3. Stops at the first matching rule ("decision list" behavior).
4. Returns the predicted relation and the explanation of the rule that fired.

This gives fully explainable predictions for every instance.

In [14]:
# Generate predictions using the simplified rule-based classifier
print("Evaluating on training set...")
train_predictions, train_directions, train_explanations = apply_rule_based_classifier(train_processed, DISCOVERED_RULES, nlp)
train_true = [s['relation'] for s in train_processed]

print("\nEvaluating on test set...")
test_predictions, test_directions, test_explanations = apply_rule_based_classifier(test_processed, DISCOVERED_RULES, nlp)
test_true = [s['relation'] for s in test_processed]

Evaluating on training set...


Classifying: 100%|██████████| 8000/8000 [16:32<00:00,  8.06it/s]




Evaluating on test set...


Classifying: 100%|██████████| 2717/2717 [05:46<00:00,  7.84it/s]
Classifying: 100%|██████████| 2717/2717 [05:46<00:00,  7.84it/s]


## 6. Evaluation with Deterministic Rules

---

We now evaluate the performance of the deterministic rule-based classifier on
both the training and test sets. For each split, we compute:

- **Accuracy** – overall correctness of predictions.
- **Per-class precision, recall, and F1-score** – to show how well the rules
  capture each relation category.
- **Support** – number of samples per class.

This gives a clear picture of how well the learned rules generalize to unseen
data and which relations are easy or difficult for the rule-based approach.

In [15]:
# Comprehensive evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("="*80)
print("DETERMINISTIC RULE-BASED SYSTEM EVALUATION")
print("="*80)

# Training set evaluation
print("\n### TRAINING SET RESULTS ###\n")
train_accuracy = accuracy_score(train_true, train_predictions)
print(f"Accuracy: {train_accuracy:.3f}")

print("\nPer-class metrics:")
print(classification_report(train_true, train_predictions, zero_division=0))
print("="*80)

# Test set evaluation
print("\n### TEST SET RESULTS ###\n")
test_accuracy = accuracy_score(test_true, test_predictions)
print(f"Accuracy: {test_accuracy:.3f}")

print("\nPer-class metrics:")
print(classification_report(test_true, test_predictions, zero_division=0, digits=3))

DETERMINISTIC RULE-BASED SYSTEM EVALUATION

### TRAINING SET RESULTS ###

Accuracy: 0.605

Per-class metrics:
                    precision    recall  f1-score   support

      Cause-Effect       0.83      0.76      0.79      1003
   Component-Whole       0.71      0.33      0.46       941
 Content-Container       0.73      0.72      0.73       540
Entity-Destination       0.80      0.87      0.83       845
     Entity-Origin       0.81      0.63      0.71       716
 Instrument-Agency       0.74      0.62      0.67       504
 Member-Collection       0.87      0.18      0.30       690
     Message-Topic       0.79      0.81      0.80       634
             Other       0.30      0.62      0.41      1410
  Product-Producer       0.72      0.51      0.60       717

          accuracy                           0.61      8000
         macro avg       0.73      0.61      0.63      8000
      weighted avg       0.70      0.61      0.61      8000


### TEST SET RESULTS ###

Accuracy: 0.521

Per

**Rule Diagnostics and Summary Statistics**

To better understand the behavior of the discovered rules, we compute several
diagnostic statistics:

- **Number of rules per relation** – shows how many high-precision patterns were
  learned for each class.
- **Average precision and support** – summarize the overall quality of the
  rule set. Higher precision indicates more reliable rules; higher support
  indicates patterns that appear frequently in training data.
- **Macro-averaged F1 and accuracy** – provide a global summary of system
  performance on both training and test sets.

These diagnostics help identify relations that are well-covered by rules and
those that may need additional patterns or refinement.


In [16]:
# Rule statistics and diagnostics
print("\n" + "="*80)
print("RULE DIAGNOSTICS")
print("="*80)

# Count rules per relation
relation_rule_counts = defaultdict(int)
for rule in DISCOVERED_RULES:
    relation_rule_counts[rule['relation']] += 1

print("\nRules discovered per relation:")
print(f"{'Relation':<30} {'Number of Rules'}")
print("-"*50)
for relation in sorted(relation_rule_counts.keys()):
    print(f"{relation:<30} {relation_rule_counts[relation]}")

print(f"\nTotal rules: {len(DISCOVERED_RULES)}")
print(f"Average precision: {np.mean([r['precision'] for r in DISCOVERED_RULES]):.3f}")
print(f"Average support: {np.mean([r['support'] for r in DISCOVERED_RULES]):.1f}")

from sklearn.metrics import (
    f1_score,
    precision_score,
    recall_score,
    classification_report,
)

# ========= Macro-averaged metrics =========
test_macro_precision  = precision_score(test_true,  test_predictions,
                                        average='macro', zero_division=0)
train_macro_precision = precision_score(train_true, train_predictions,
                                        average='macro', zero_division=0)

test_macro_recall  = recall_score(test_true,  test_predictions,
                                  average='macro', zero_division=0)
train_macro_recall = recall_score(train_true, train_predictions,
                                  average='macro', zero_division=0)

test_macro_f1  = f1_score(test_true,  test_predictions,
                          average='macro', zero_division=0)
train_macro_f1 = f1_score(train_true, train_predictions,
                          average='macro', zero_division=0)

# `test_accuracy` and `train_accuracy` assumed pre-computed

# ========= Summary table (Train vs Test) =========
print(f"\n{'Metric':<30} {'Test Set':<15} {'Train Set':<15}")
print("-" * 60)
print(f"{'Macro-averaged Precision':<30} {test_macro_precision:<15.3f} {train_macro_precision:<15.3f}")
print(f"{'Macro-averaged Recall':<27}    {test_macro_recall:<15.3f} {train_macro_recall:<15.3f}")
print(f"{'Macro-averaged F1':<23}        {test_macro_f1:<15.3f} {train_macro_f1:<15.3f}")
print(f"{'Accuracy':<13}                  {test_accuracy:<15.3f} {train_accuracy:<15.3f}")


RULE DIAGNOSTICS

Rules discovered per relation:
Relation                       Number of Rules
--------------------------------------------------
Cause-Effect                   291
Component-Whole                128
Content-Container              59
Entity-Destination             221
Entity-Origin                  147
Instrument-Agency              172
Member-Collection              65
Message-Topic                  327
Other                          238
Product-Producer               193

Total rules: 1841
Average precision: 0.868
Average support: 6.3

Metric                         Test Set        Train Set      
------------------------------------------------------------
Macro-averaged Precision       0.601           0.729          
Macro-averaged Recall          0.517           0.605          
Macro-averaged F1              0.530           0.628          
Accuracy                       0.521           0.605          

Metric                         Test Set        Train Set     

In [17]:
# Show example rule firings
print("\n" + "="*80)
print("EXAMPLE RULE FIRINGS (Test Set)")
print("="*80)

# Group test samples by relation
test_by_relation = defaultdict(list)
for i, sample in enumerate(test_processed):
    if test_predictions[i] == sample['relation'] and sample['relation'] != 'Other':
        test_by_relation[sample['relation']].append((sample, i))

# Show 1-2 examples per relation
for relation in sorted(test_by_relation.keys())[:5]:  # First 5 relations
    examples = test_by_relation[relation][:2]  # Up to 2 examples each
    
    print(f"\n### {relation} ###")
    for sample, idx in examples:
        print(f"\nText: {sample['text']}")
        print(f"E1: '{sample['e1_span'].text}' | E2: '{sample['e2_span'].text}'")
        
        # Find the triggered rule to show its structured pattern
        explanation = test_explanations[idx]
        print(f"Rule fired: {explanation}")
        
        # If it's a DEP_VERB rule, show structured DependencyMatcher pattern
        if 'DEP_VERB' in explanation:
            # Extract rule name from explanation
            rule_name = explanation.split(':')[0].replace('Rule ', '')
            # Find the rule in DISCOVERED_RULES
            for rule in DISCOVERED_RULES:
                if rule['name'] == rule_name and rule['pattern_type'] == 'DEP_VERB':
                    verb, e1_dep, e2_dep = rule['pattern_data']
                    print(f"\n  DependencyMatcher Pattern:")
                    print(f"  [")
                    print(f"      {{")
                    print(f"          \"RIGHT_ID\": \"verb\",")
                    print(f"          \"RIGHT_ATTRS\": {{\"LEMMA\": \"{verb}\", \"POS\": \"VERB\"}}")
                    print(f"      }},")
                    print(f"      {{")
                    print(f"          \"LEFT_ID\": \"verb\",")
                    print(f"          \"REL_OP\": \">\",  # verb is head of e1")
                    print(f"          \"RIGHT_ID\": \"e1\",")
                    print(f"          \"RIGHT_ATTRS\": {{\"DEP\": \"{e1_dep}\"}}")
                    print(f"      }},")
                    print(f"      {{")
                    print(f"          \"LEFT_ID\": \"verb\",")
                    print(f"          \"REL_OP\": \">\",  # verb is head of e2")
                    print(f"          \"RIGHT_ID\": \"e2\",")
                    print(f"          \"RIGHT_ATTRS\": {{\"DEP\": \"{e2_dep}\"}}")
                    print(f"      }}")
                    print(f"  ]")
                    break
        
        print("-" * 80)



EXAMPLE RULE FIRINGS (Test Set)

### Cause-Effect ###

Text: Avian influenza is an infectious disease of birds caused by type A strains of the influenza virus.
E1: 'influenza' | E2: 'virus'
Rule fired: Rule Cause-Effect_LEMMA_8248: LEMMA pattern: ['cause'] (precision=0.97, support=492)
--------------------------------------------------------------------------------

Text: Of the hundreds of strains of avian influenza A viruses, only four have caused human infections: H5N1, H7N3, H7N7, and H9N2.
E1: 'viruses' | E2: 'infections'
Rule fired: Rule Cause-Effect_BIGRAM_6209: BIGRAM pattern: [('have', 'cause')] (precision=1.00, support=9)
--------------------------------------------------------------------------------

### Component-Whole ###

Text: Still further, the circuit comprises a digital adder and an analog-to-digital converter with an analog input connected to the output of the operational amplifier and a digital output connected to a first input of the digital adder.
E1: 'circuit' 

## 7. Save Predictions for Official Scorer (Optional)

The files are saved for potential offline evaluation with the official Perl scorer.
Note: The Perl scorer can be slow. Use the sklearn metrics above for quick evaluation.

In [18]:
def save_predictions_for_scorer(predictions, processed_data, output_file):
    """Save predictions in official scorer format: ID\tRelation(e1,e2)"""
    with open(output_file, 'w') as f:
        for pred, sample in zip(predictions, processed_data):
            sample_id = sample['id']
            
            # Get direction from original data (default to e1,e2)
            direction = sample.get('direction', '') or ''
            # Remove parentheses if they exist in direction
            direction = direction.replace('(', '').replace(')', '')
            if not direction:
                direction = 'e1,e2'
            
            # Format: ID\tRelation(e1,e2) - ALL relations need direction
            f.write(f"{sample_id}\t{pred}({direction})\n")
    
    print(f"Saved {len(predictions)} predictions to {output_file}")


def create_answer_key(processed_data, output_file):
    """Create answer key file from processed data in official format."""
    with open(output_file, 'w') as f:
        for sample in processed_data:
            sample_id = sample['id']
            relation = sample['relation']
            direction = sample.get('direction', '') or ''
            # Remove parentheses if they exist in direction
            direction = direction.replace('(', '').replace(')', '')
            if not direction:
                direction = 'e1,e2'
            f.write(f"{sample_id}\t{relation}({direction})\n")
    
    print(f"Saved {len(processed_data)} gold labels to {output_file}")


# Save predictions and answer keys
print("Preparing files for official scorer...")
save_predictions_for_scorer(train_predictions, train_processed, 'train_predictions.txt')
create_answer_key(train_processed, 'train_answer_key.txt')

save_predictions_for_scorer(test_predictions, test_processed, 'test_predictions.txt')
create_answer_key(test_processed, 'test_answer_key.txt')

Preparing files for official scorer...
Saved 8000 predictions to train_predictions.txt
Saved 8000 gold labels to train_answer_key.txt
Saved 2717 predictions to test_predictions.txt
Saved 2717 gold labels to test_answer_key.txt
Saved 2717 predictions to test_predictions.txt
Saved 2717 gold labels to test_answer_key.txt


## 8. Error Analysis

Analyze misclassifications to understand system limitations.


In [19]:
# Analyze misclassifications
def analyze_errors(samples, predictions, true_labels, explanations, n_samples=20):
    """Analyze misclassified examples."""
    errors = []
    
    for i, (sample, pred, true) in enumerate(zip(samples, predictions, true_labels)):
        if pred != true:
            errors.append({
                'index': i,
                'sample': sample,
                'predicted': pred,
                'true': true,
                'text': sample['text'],
                'explanation': explanations[i]
            })
    
    print(f"Total errors: {len(errors)} / {len(samples)} ({len(errors)/len(samples)*100:.1f}%)")
    print(f"\nShowing first {min(n_samples, len(errors))} errors:\n")
    print("="*80)
    
    for i, error in enumerate(errors[:n_samples]):
        print(f"\nError {i+1}:")
        print(f"Text: {error['text']}")
        print(f"Entity 1: {error['sample']['e1_span'].text}")
        print(f"Entity 2: {error['sample']['e2_span'].text}")
        print(f"True relation: {error['true']}")
        print(f"Predicted: {error['predicted']}")
        print(f"Rule applied: {error['explanation']}")
        
        # Show between words and dependency info
        between_words = [w['text'] for w in error['sample']['between_words']]
        print(f"Between words: {between_words}")
        
        # Show dependency path
        dep_path = error['sample']['dep_path']
        if dep_path:
            path_str = ' -> '.join([f"{d[0]}({d[2]})" for d in dep_path[:5]])
            print(f"Dependency path: {path_str}")
        
        print("-" * 80)
    
    return errors

In [20]:
# Error distribution by relation type
def analyze_error_patterns(samples, predictions, true_labels):
    """Analyze error patterns by relation type."""
    error_matrix = defaultdict(lambda: defaultdict(int))
    
    for sample, pred, true in zip(samples, predictions, true_labels):
        if pred != true:
            error_matrix[true][pred] += 1
    
    print("\nMost Common Misclassification Patterns:")
    print("="*80)
    print(f"{'True Label':<25} {'Predicted As':<25} {'Count':<10}")
    print("-"*80)
    
    # Sort by count
    all_errors = []
    for true_label in error_matrix:
        for pred_label in error_matrix[true_label]:
            count = error_matrix[true_label][pred_label]
            all_errors.append((true_label, pred_label, count))
    
    all_errors.sort(key=lambda x: x[2], reverse=True)
    
    for true_label, pred_label, count in all_errors[:15]:
        print(f"{true_label:<25} {pred_label:<25} {count:<10}")
    
    return error_matrix

error_patterns = analyze_error_patterns(test_processed, test_predictions, test_true)


Most Common Misclassification Patterns:
True Label                Predicted As              Count     
--------------------------------------------------------------------------------
Component-Whole           Other                     195       
Member-Collection         Other                     172       
Product-Producer          Other                     99        
Entity-Origin             Other                     85        
Other                     Entity-Destination        56        
Message-Topic             Other                     56        
Instrument-Agency         Other                     51        
Cause-Effect              Other                     42        
Other                     Message-Topic             38        
Other                     Content-Container         31        
Other                     Instrument-Agency         26        
Other                     Component-Whole           25        
Product-Producer          Cause-Effect              24     

In [21]:
# Analyze test set errors
print("Analyzing errors on test set...")
test_errors = analyze_errors(test_processed, test_predictions, test_true, test_explanations, n_samples=15)

Analyzing errors on test set...
Total errors: 1302 / 2717 (47.9%)

Showing first 15 errors:


Error 1:
Text: The school master teaches the lesson with a stick.
Entity 1: master
Entity 2: stick
True relation: Instrument-Agency
Predicted: Other
Rule applied: No high-precision rule matched; defaulting to Other.
Between words: ['teaches', 'the', 'lesson', 'with', 'a']
Dependency path: nsubj(master) -> ROOT(teach) -> prep(with) -> pobj(stick)
--------------------------------------------------------------------------------

Error 2:
Text: The ear of the African elephant is significantly larger--measuring 183 cm by 114 cm in the bush elephant.
Entity 1: ear
Entity 2: elephant
True relation: Component-Whole
Predicted: Other
Rule applied: No high-precision rule matched; defaulting to Other.
Between words: ['of', 'the', 'African']
Dependency path: nsubj(ear) -> prep(of) -> pobj(elephant)
--------------------------------------------------------------------------------

Error 3:
Text: A child is t

# Quantitative, Qualitative and Error Analysis

## Confusion Matrix
---
<img src="confusion_matrix_training__rb_system.png" alt="Confusion Matrix of Training Data Set" width="500" />

---
<img src="confusion_matrix_test__rb_system.png" alt="Confusion Matrix of Test Data Set" width="500" />

## 9. Quantitative and Qualitative Interpretation of Results

The deterministic rule-based system achieves **0.605 accuracy** on the training
set and **0.521 accuracy** on the test set, with a **macro-F1 of 0.53** on the
test split. This reflects a typical profile for rule-based relation extraction:
**high precision** when rules fire, but **lower recall**, especially for relations
without consistent surface cues.

---

### Precision–Recall Behaviour

Because rules only fire when explicit lexical or dependency patterns appear, the
system naturally favors **high precision** and suffers from **lower recall**:

- Precision is strong across most relations (0.70–0.85) because rules are mined
  with strict precision thresholds and ranked by reliability.
- Recall varies widely. Many valid instances contain no trigger pattern, so the
  classifier defaults to `"Other"`, lowering recall for several relations.
- Well-marked relations (e.g., Cause-Effect, Entity-Destination) have a good
  precision–recall balance, while ambiguous ones (Component-Whole,
  Member-Collection) show large precision–recall gaps.

This pattern is expected and confirms that deterministic rules behave as
high-precision, low-recall classifiers.

---

### Strong Performing Relations

Relations with stable lexical and syntactic patterns achieve the highest F1:

- **Cause-Effect** (F1 ≈ 0.80): consistent verbs like *cause*, *lead to*.
- **Content-Container** (F1 ≈ 0.69): clear markers such as *in*, *inside*, *contain*.
- **Entity-Destination** (F1 ≈ 0.78): reliable directional cues (*to*, *into*).
- **Message-Topic** (F1 ≈ 0.69): Topic-marking cues (*about*, *regarding*).

These relations have strong, repeated indicators that rules can capture reliably.

---

### Moderate Performing Relations

These relations have useful cues but also structural variability:

- **Entity-Origin**, **Instrument-Agency**, **Product-Producer**

They share patterns with other relations or appear in multiple syntactic forms,
which reduces both precision and recall.

---

### Challenging Relations

Some relations perform poorly due to ambiguous or sparse linguistic signals:

- **Component-Whole** – overloaded prepositions (*of*, *in*, *on*) across many meanings.
- **Member-Collection** – often requires semantic/world knowledge (e.g., *team–player*).
- **Other** – catch-all category with inconsistent phrasing.

These relations lack strong surface cues, making them inherently difficult for
a rule-based system.

---

### Generalization and Overfitting

The difference between **train macro-F1 (0.63)** and **test macro-F1 (0.53)**
indicates **moderate overfitting**, expected for deterministic rules. Rule mining
captures specific training patterns with high precision, but variations on the
test set reduce recall.

---

### Overall Summary

The system discovers many highly reliable rules (average precision ≈ 0.906),
leading to strong performance on well-marked relations. Performance drops for
relations with ambiguous or weak linguistic cues due to recall limitations.
Overall, the results match the expected behavior of a **high-precision,
low-recall, fully explainable decision-list classifier**.


## Rule Diagnostics Overview

<img src="rule_diagnostics.png" alt="Rule Diagnostics Summary" width="500" />

The rule-mining step generated a total of **1841 rules** across all relations,
with an average precision of **0.868** and average support of **6.3**. This shows
that the majority of discovered patterns are highly reliable when they appear,
while some relations require many small, specialized rules due to high lexical
variability (e.g., Message-Topic, Entity-Destination, Cause-Effect).

Macro-averaged precision and recall reflect the typical behavior of a
deterministic rule-based system: **precision is high**, since rules only fire
when their patterns match exactly, but **recall is lower**, as many instances do
not contain strong surface cues. These diagnostics complement the per-class
evaluation and help characterize the strengths and limitations of the learned
ruleset.


## Error Analysis

<img src="error_analysis__most_common_missclf_patterns.png" alt="Error Analysis of Rule-Based System" width="500" />

The majority of misclassifications follow a consistent pattern: instances from
several relation types fall through to the `"Other"` class. This reflects a
typical limitation of deterministic rule-based systems—rules only trigger when a
specific pattern appears, so relations with weak or highly variable surface cues
are often missed entirely.

The most common errors include:

- **Component-Whole → Other** (195 cases)  
- **Member-Collection → Other** (172 cases)  
- **Product-Producer → Other** (99 cases)

These relations rely on subtle semantic distinctions or ambiguous
prepositions (e.g., *of*, *in*, *with*) that do not form strong, repeated
patterns. As a result, many valid instances lack clear lexical triggers for the
rules.

Several `"Other"` instances are also incorrectly mapped to more specific
relations (e.g., *Other → Entity-Destination*, *Other → Message-Topic*), showing
that generic constructions sometimes resemble the patterns of better-defined
relations.

Overall, the misclassification patterns reinforce the precision–recall trade-off:
the rule-based system is highly accurate when rules fire, but misses many
instances without explicit cues, leading to lower recall for subtle or
ambiguous relation types.
