# Milestone 3 - Notebook 2: Unified Pattern Mining

## Objective

Extract all four pattern types as spaCy `DependencyMatcher` patterns:
- **Type A (Triangle):** Event-driven via LCA (verb anchors)
- **Type B (Bridge):** Prepositional chains
- **Type C (Linear):** Sequence fallback with precedence operators
- **Type D (Direct):** Noun compounds and modifiers
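
As a rough illustration of Type A, the sketch below finds the lowest common ancestor (LCA) of two entity tokens in a toy dependency tree and turns it into a pattern in the list-of-dicts format that spaCy's `DependencyMatcher` accepts. The toy parse and the helper names `path_to_root`, `lowest_common_ancestor`, and `triangle_pattern` are assumptions made for illustration; the actual logic lives in `pattern_miner` and may differ (for example, it may allow longer paths or abstract the anchor lemma to a concept).

```python
def path_to_root(heads, i):
    """Return token indices from i up to the root (the root is its own head)."""
    path = [i]
    while heads[i] != i:
        i = heads[i]
        path.append(i)
    return path


def lowest_common_ancestor(heads, a, b):
    """First node on b's path to the root that also lies on a's path."""
    ancestors_of_a = set(path_to_root(heads, a))
    for node in path_to_root(heads, b):
        if node in ancestors_of_a:
            return node
    return None  # not reached for a well-formed, single-rooted tree


def triangle_pattern(anchor_lemma, e1_dep, e2_dep):
    """Anchor verb with two direct children, in DependencyMatcher format."""
    return [
        {"RIGHT_ID": "anchor", "RIGHT_ATTRS": {"LEMMA": anchor_lemma}},
        {"LEFT_ID": "anchor", "REL_OP": ">", "RIGHT_ID": "e1",
         "RIGHT_ATTRS": {"DEP": e1_dep}},
        {"LEFT_ID": "anchor", "REL_OP": ">", "RIGHT_ID": "e2",
         "RIGHT_ATTRS": {"DEP": e2_dep}},
    ]


# Hand-written toy parse of "The storm caused damage" (not real spaCy output).
tokens = ["The", "storm", "caused", "damage"]
lemmas = ["the", "storm", "cause", "damage"]
heads = [1, 2, 2, 2]  # index of each token's head; the root points to itself
deps = ["det", "nsubj", "ROOT", "dobj"]

e1, e2 = 1, 3  # "storm" and "damage"
lca = lowest_common_ancestor(heads, e1, e2)
pattern = triangle_pattern(lemmas[lca], deps[e1], deps[e2])
print(tokens[lca], pattern)
```

With spaCy loaded, a list like `pattern` could be registered via `DependencyMatcher(nlp.vocab).add("TRIANGLE", [pattern])`; the `>` operator states that the anchor is the immediate head of each entity token.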

## Mining Priority Order

1. Type D (Direct) - Checked first; extract and continue to the remaining types (do not stop)
2. Type A (Triangle) - Extract and stop
3. Type B (Bridge) - Tried only if no Type A pattern was found; extract and stop
4. Type C (Linear) - Fallback when neither Type A nor Type B is found
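
The priority order above can be sketched as a small dispatch function. This is a minimal illustration, not the code in `pattern_miner`: the function `mine_sample`, the four extractor callbacks, and the tuple shapes used for the toy `BRIDGE` and `LINEAR` results are all hypothetical.

```python
def mine_sample(sample, extract_direct, extract_triangle,
                extract_bridge, extract_linear):
    """Apply the D -> A -> B -> C priority order to one sample."""
    patterns = []

    # Type D: record the pattern if present, then keep going.
    direct = extract_direct(sample)
    if direct is not None:
        patterns.append(direct)

    # Types A and B: the first one that matches ends the search.
    for extractor in (extract_triangle, extract_bridge):
        found = extractor(sample)
        if found is not None:
            patterns.append(found)
            return patterns

    # Type C: only reached when neither A nor B matched.
    linear = extract_linear(sample)
    if linear is not None:
        patterns.append(linear)
    return patterns


# Toy extractors: D and B match, A does not, C is never consulted.
result = mine_sample(
    {"text": "toy sample"},
    extract_direct=lambda s: ("DIRECT", "compound", "e1->e2"),
    extract_triangle=lambda s: None,
    extract_bridge=lambda s: ("BRIDGE", "of"),
    extract_linear=lambda s: ("LINEAR", "fallback"),
)
print(result)  # [('DIRECT', 'compound', 'e1->e2'), ('BRIDGE', 'of')]
```

Under this reading, a sample contributes at most two patterns: an optional Type D pattern plus at most one of Types A, B, or C.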

## Output

`../data/raw_patterns.json`

In [1]:
# Imports
import json
import sys
from pathlib import Path
from collections import defaultdict
import spacy
from tqdm.auto import tqdm

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Import modules
from utils import preprocess_data
from pattern_miner import extract_all_patterns, summarize_patterns

print("Imports successful!")

Successfully imported functions from Milestone 2: /Users/egeaydin/Github/TUW2025WS/Token13-tuw-nlp-ie-2025WS/milestone_2/rule_based
Imports successful!


In [2]:
# Load spaCy model
print("Loading spaCy model...")
nlp = spacy.load("en_core_web_lg")
print(f"Model loaded: {nlp.meta['name']}")

Loading spaCy model...
Model loaded: core_web_lg


In [3]:
# Load concept clusters from Notebook 1
print("Loading concept clusters...")
with open('../data/concept_clusters.json', 'r') as f:
    concept_data = json.load(f)

expanded_clusters = concept_data['expanded_clusters']
lemma_to_concept = concept_data['lemma_to_concept']

print(f"Loaded {len(expanded_clusters)} concept clusters")
print(f"Total unique words: {len(lemma_to_concept)}")

Loading concept clusters...
Loaded 8 concept clusters
Total unique words: 141


## 1. Load and Preprocess Data

In [4]:
# Load training data
print("Loading training data...")
data_path = Path('../../data/processed/train/train.json')

with open(data_path, 'r') as f:
    train_data = json.load(f)

print(f"Loaded {len(train_data)} training samples")

Loading training data...
Loaded 8000 training samples


In [5]:
# Preprocess data (reuse M2 function)
print("Preprocessing data...")
train_processed = preprocess_data(train_data, nlp)
print(f"Processed {len(train_processed)} samples")

# Show sample
sample = train_processed[0]
print("\nSample:")
print(f"  Text: {sample['text'][:100]}...")
print(f"  E1: {sample['e1_span'].text}")
print(f"  E2: {sample['e2_span'].text}")
print(f"  Relation: {sample['relation_directed']}")

Preprocessing data...


Processed 8000 samples

Sample:
  Text: The system as described above has its greatest application in an arrayed configuration of antenna el...
  E1: configuration
  E2: elements
  Relation: Component-Whole(e2,e1)


## 2. Extract Patterns (All 4 Types)

In [6]:
# Mine patterns
print("="*80)
print("EXTRACTING PATTERNS")
print("="*80)

pattern_counts = extract_all_patterns(train_processed, lemma_to_concept)

EXTRACTING PATTERNS
Mining patterns from 8000 samples...


Pattern mining complete!
Total unique patterns: 1786

Pattern type counts:
  DIRECT (e1->e2): 516
  DIRECT (e2->e1): 12
  2-HOP: 1751
  SIBLING: 12
  FALLBACK: 400


In [7]:
# Summarize patterns
summary = summarize_patterns(pattern_counts)

print("\nPattern Summary:")
print(f"  Total unique patterns: {summary['total_patterns']}")
print("\nBy type:")
for pattern_type, count in summary['by_type'].items():
    print(f"    {pattern_type}: {count}")
print(f"\nBy relation: {len(summary['by_relation'])} relations covered")


Pattern Summary:
  Total unique patterns: 1786

By type:
    DIRECT_2HOP: 20
    BRIDGE: 55
    TRIANGLE: 1062
    LINEAR: 621
    DIRECT: 7
    FALLBACK: 14
    DIRECT_SIBLING: 7

By relation: 19 relations covered


## 3. Inspect Patterns

In [8]:
# Show examples of each pattern type
print("\nExample patterns by type:\n")

for pattern_type in ['DIRECT', 'TRIANGLE', 'BRIDGE', 'LINEAR']:
    print(f"\n{pattern_type} patterns:")
    count = 0
    for pattern_key, relation_counts in pattern_counts.items():
        if pattern_key[0] == pattern_type:
            print(f"  {pattern_key}")
            print(f"    Relations: {relation_counts}")
            count += 1
            if count >= 3:
                break


Example patterns by type:


DIRECT patterns:
  ('DIRECT', 'compound', 'e1->e2')
    Relations: defaultdict(<class 'int'>, {'Member-Collection(e1,e2)': 7, 'Other': 86, 'Component-Whole(e2,e1)': 79, 'Entity-Origin(e2,e1)': 107, 'Product-Producer(e2,e1)': 6, 'Product-Producer(e1,e2)': 54, 'Content-Container(e1,e2)': 6, 'Member-Collection(e2,e1)': 8, 'Component-Whole(e1,e2)': 9, 'Message-Topic(e2,e1)': 5, 'Instrument-Agency(e1,e2)': 15, 'Cause-Effect(e1,e2)': 12, 'Instrument-Agency(e2,e1)': 2, 'Entity-Origin(e1,e2)': 1})
  ('DIRECT', 'poss', 'e1->e2')
    Relations: defaultdict(<class 'int'>, {'Product-Producer(e1,e2)': 11, 'Instrument-Agency(e2,e1)': 11, 'Product-Producer(e2,e1)': 42, 'Component-Whole(e2,e1)': 15, 'Other': 14, 'Member-Collection(e1,e2)': 1, 'Entity-Origin(e2,e1)': 1, 'Component-Whole(e1,e2)': 1})
  ('DIRECT', 'nmod', 'e1->e2')
    Relations: defaultdict(<class 'int'>, {'Product-Producer(e1,e2)': 4, 'Cause-Effect(e1,e2)': 2, 'Other': 2, 'Component-Whole(e1,e2)': 1, 'Instr
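
The relation counts printed above show that a single pattern can fire for many relations. One way to quantify this, sketched below, is the share of a pattern's occurrences that belong to its most frequent relation. The counts are copied from the `('DIRECT', 'poss', 'e1->e2')` output above; the helper `dominant_relation` is a hypothetical name and is not part of `pattern_miner`.

```python
# Counts copied from the ('DIRECT', 'poss', 'e1->e2') output above.
poss_counts = {
    "Product-Producer(e1,e2)": 11,
    "Instrument-Agency(e2,e1)": 11,
    "Product-Producer(e2,e1)": 42,
    "Component-Whole(e2,e1)": 15,
    "Other": 14,
    "Member-Collection(e1,e2)": 1,
    "Entity-Origin(e2,e1)": 1,
    "Component-Whole(e1,e2)": 1,
}


def dominant_relation(relation_counts):
    """Return (relation, count, share) for the most frequent relation."""
    total = sum(relation_counts.values())
    relation, count = max(relation_counts.items(), key=lambda kv: kv[1])
    return relation, count, count / total


relation, count, share = dominant_relation(poss_counts)
print(f"{relation}: {count}/{sum(poss_counts.values())} = {share:.2%}")
# Product-Producer(e2,e1): 42/96 = 43.75%
```

A share well below 100% suggests that the pattern alone is too ambiguous to assign a relation.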

## 4. Save Raw Patterns

In [9]:
# Convert pattern_counts to serializable format
raw_patterns = {
    'pattern_counts': {str(k): v for k, v in pattern_counts.items()},
    'summary': summary,
    'metadata': {
        'total_samples': len(train_processed),
        'total_patterns': len(pattern_counts),
        # Count clusters, not lemmas: lemma_to_concept has one entry per word
        'concept_clusters_used': len(expanded_clusters)
    }
}

# Save
output_path = Path('../data/raw_patterns.json')
with open(output_path, 'w') as f:
    json.dump(raw_patterns, f, indent=2)

print(f"\nSaved raw patterns to: {output_path}")
print(f"File size: {output_path.stat().st_size / 1024:.2f} KB")


Saved raw patterns to: ../data/raw_patterns.json
File size: 208.51 KB
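
Because the tuple keys are stored with `str(k)`, a later notebook has to parse them back into tuples after loading. The sketch below shows one way to do this with `ast.literal_eval`, using an in-memory round trip and a single toy entry. It assumes every key is a tuple of plain literals such as strings, as in the examples printed above.

```python
import ast
import json

# Toy stand-in for pattern_counts with one tuple key.
pattern_counts = {
    ("DIRECT", "poss", "e1->e2"): {"Product-Producer(e2,e1)": 42, "Other": 14},
}

# Same key conversion as in the save cell, but written to a string.
serialized = json.dumps(
    {"pattern_counts": {str(k): v for k, v in pattern_counts.items()}}
)

# Loading: turn "('DIRECT', 'poss', 'e1->e2')" back into a tuple.
loaded = json.loads(serialized)
restored = {
    ast.literal_eval(key): counts
    for key, counts in loaded["pattern_counts"].items()
}

print(restored == pattern_counts)  # True
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for keys read from a file.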


## Summary

Pattern mining complete!

**Key Results:**
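- Extracted all four pattern types following the priority order: 1786 unique patterns from 8000 training samples, covering 19 relation labels
- A Type D pattern can coexist with a Type A, B, or C pattern for the same sample
- Types A, B, and C are mutually exclusive: each sample yields at most one of them
- Concept abstraction applied to anchor words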

**Next Step:** Notebook 3 - Pattern Refinement & Augmentation