# Milestone 3 - Notebook 3: Pattern Refinement & Augmentation

## Objective

- Apply tiered filtering based on pattern complexity and relation-specific precision floors
- Generate passive voice variants from high-precision active patterns (`min_precision_for_flip=0.75`)
- Sort patterns by (length desc, precision desc)

## Output

`../data/patterns_augmented.json`

In [1]:
import json
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent / 'src'))

from pattern_augmentation import (
    filter_patterns_tiered,
    generate_passive_variants,
    sort_patterns,
    RELATION_MIN_PRECISION
)

print('Imports successful!')

Imports successful!


In [2]:
# Display relation-specific precision thresholds
print("="*60)
print("RELATION-SPECIFIC PRECISION FLOORS")
print("="*60)
print("\nThese thresholds control minimum precision for pattern inclusion:")
print(f"\n{'Relation':<25} | {'Min Precision':<15}")
print("-" * 45)
for rel, thresh in sorted(RELATION_MIN_PRECISION.items()):
    print(f"{rel:<25} | {thresh:.2f}")

print("\nNote: Lower thresholds = more patterns = better coverage")
print("      Higher thresholds = fewer patterns = better precision")

RELATION-SPECIFIC PRECISION FLOORS

These thresholds control minimum precision for pattern inclusion:

Relation                  | Min Precision  
---------------------------------------------
Cause-Effect              | 0.45
Component-Whole           | 0.55
Content-Container         | 0.50
Entity-Destination        | 0.45
Entity-Origin             | 0.55
Instrument-Agency         | 0.45
Member-Collection         | 0.55
Message-Topic             | 0.50
Other                     | 0.80
Product-Producer          | 0.50

Note: Lower thresholds = more patterns = better coverage
      Higher thresholds = fewer patterns = better precision
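
The way a per-relation floor trades coverage for precision can be sketched as follows. This is a minimal illustration, not the code inside `pattern_augmentation`: the three floors are copied from the table above, while `DEFAULT_MIN_PRECISION`, `passes_precision_floor`, and the candidate values are hypothetical.

```python
# Subset of the floors printed above; the real module defines the full table.
RELATION_MIN_PRECISION = {
    "Cause-Effect": 0.45,
    "Component-Whole": 0.55,
    "Other": 0.80,
}
DEFAULT_MIN_PRECISION = 0.50  # assumption: fallback for relations not listed


def passes_precision_floor(relation: str, precision: float) -> bool:
    """Return True if a pattern's precision meets its relation's floor."""
    floor = RELATION_MIN_PRECISION.get(relation, DEFAULT_MIN_PRECISION)
    return precision >= floor


# Hypothetical (relation, precision) candidates.
candidates = [
    ("Cause-Effect", 0.47),     # kept: 0.47 >= 0.45
    ("Component-Whole", 0.47),  # dropped: 0.47 < 0.55
    ("Other", 0.79),            # dropped: 0.79 < 0.80
    ("Other", 0.85),            # kept: 0.85 >= 0.80
]
kept = [(rel, p) for rel, p in candidates if passes_precision_floor(rel, p)]
print(kept)
```

The same precision of 0.47 is accepted for one relation and rejected for another, which is the point of keeping a floor per relation rather than a single global cutoff.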


## 1. Load Raw Patterns

In [3]:
# Load from Notebook 2
with open('../data/raw_patterns.json', 'r') as f:
    raw_data = json.load(f)

with open('../data/concept_clusters.json', 'r') as f:
    concept_data = json.load(f)

expanded_clusters = concept_data['expanded_clusters']

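# Convert pattern_counts keys back from strings to Python literals.
# literal_eval accepts only literals, so unlike eval() it cannot run arbitrary code.
from ast import literal_eval
pattern_counts_str = raw_data['pattern_counts']
pattern_counts = {literal_eval(k): v for k, v in pattern_counts_str.items()}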

print(f"Loaded {len(pattern_counts)} raw patterns")

Loaded 1958 raw patterns
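
JSON object keys must be strings, so tuple-keyed counts have to be encoded on save and decoded on load. The sketch below shows one way to round-trip such keys with `repr()` and `ast.literal_eval`. The key layout is hypothetical; the actual format written by Notebook 2 is not shown here.

```python
import ast
import json

# Hypothetical pattern keys; the real key layout in raw_patterns.json may differ.
counts = {("nsubj", "cause", "dobj"): 12, ("prep_of",): 3}

# Encode each tuple key as its repr() string so json.dumps accepts it.
encoded = json.dumps({repr(k): v for k, v in counts.items()})

# Decode: literal_eval parses Python literals (tuples, strings, numbers) only.
decoded = {ast.literal_eval(k): v for k, v in json.loads(encoded).items()}

# A string containing a function call is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True

print(decoded == counts, rejected)
```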


## 2. Tiered Filtering

In [4]:
print('='*80)
print('TIERED FILTERING')
print('='*80)

filtered_patterns = filter_patterns_tiered(pattern_counts, expanded_clusters)

TIERED FILTERING
Filtering 1958 candidate patterns...
  Global minimum support: 1
Filtered to 1397 patterns
  Complex (len > 3): 433
  Simple (len <= 3): 964
  'Other' patterns: 28

Pattern precision distribution:
  0.25-0.30:    0 (  0.0%)
  0.30-0.40:    0 (  0.0%)
  0.40-0.50:    1 (  0.1%)
  0.50-0.60:   25 (  1.8%)
  0.60-0.70:   64 (  4.6%)
  0.70-0.80:   40 (  2.9%)
  0.80-0.90:   34 (  2.4%)
  0.90-1.00:   18 (  1.3%)

Pattern type distribution:
  BRIDGE              :   38 (  2.7%)
  DIRECT              :    2 (  0.1%)
  DIRECT_2HOP         :   13 (  0.9%)
  DIRECT_SIBLING      :    6 (  0.4%)
  LINEAR              :  465 ( 33.3%)
  TRIANGLE            :  873 ( 62.5%)
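
The internals of `filter_patterns_tiered` are not shown in this notebook. As a rough illustration of the idea only, the sketch below applies a different precision floor depending on whether a pattern is complex (`len > 3`) or simple (`len <= 3`). The `Pattern` record, the two floors, and `keep` are all assumptions made for the example.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    # Hypothetical record; the real pattern schema is not shown in this notebook.
    length: int
    precision: float
    support: int


# Assumed thresholds, chosen only to illustrate the idea of tiers.
COMPLEX_MIN_PRECISION = 0.50  # longer patterns are more specific: lower floor
SIMPLE_MIN_PRECISION = 0.70   # short patterns match broadly: stricter floor
MIN_SUPPORT = 1               # mirrors "Global minimum support: 1" in the log


def keep(p: Pattern) -> bool:
    """Apply the support check first, then the tier-specific precision floor."""
    if p.support < MIN_SUPPORT:
        return False
    floor = COMPLEX_MIN_PRECISION if p.length > 3 else SIMPLE_MIN_PRECISION
    return p.precision >= floor


patterns = [
    Pattern(length=5, precision=0.55, support=2),  # complex, meets 0.50
    Pattern(length=2, precision=0.55, support=9),  # simple, below 0.70
    Pattern(length=3, precision=0.90, support=1),  # simple, meets 0.70
    Pattern(length=4, precision=0.95, support=0),  # fails the support check
]
filtered = [p for p in patterns if keep(p)]
complex_n = sum(p.length > 3 for p in filtered)
simple_n = len(filtered) - complex_n
print(len(filtered), complex_n, simple_n)
```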


## 3. Generate Passive Variants

In [5]:
augmented_patterns = generate_passive_variants(
    filtered_patterns,
    min_precision_for_flip=0.75
)

Generating passive voice variants...
Generated 149 passive variants
Total patterns: 1546
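
The log shows that variants are appended to the filtered set (1397 + 149 = 1546) and gated by `min_precision_for_flip`. The sketch below models only that gating and bookkeeping. How `generate_passive_variants` rewrites the underlying pattern is not shown in this notebook, so the `voice`, `e1_role`, and `e2_role` fields and the role swap are hypothetical stand-ins.

```python
from copy import deepcopy


def add_flipped_variants(patterns, min_precision_for_flip=0.75):
    """Return the original patterns plus one variant per high-precision pattern."""
    variants = []
    for p in patterns:
        if p["precision"] >= min_precision_for_flip:
            v = deepcopy(p)  # never mutate the source pattern
            v["voice"] = "passive"  # hypothetical marker field
            # Hypothetical: swap the argument roles to mimic a passive reading.
            v["e1_role"], v["e2_role"] = p["e2_role"], p["e1_role"]
            variants.append(v)
    return patterns + variants


# Hypothetical active patterns.
active = [
    {"precision": 0.90, "voice": "active", "e1_role": "nsubj", "e2_role": "dobj"},
    {"precision": 0.75, "voice": "active", "e1_role": "nsubj", "e2_role": "dobj"},
    {"precision": 0.60, "voice": "active", "e1_role": "nsubj", "e2_role": "dobj"},
]
augmented = add_flipped_variants(active)
n_variants = len(augmented) - len(active)
print(len(active), n_variants, len(augmented))
```

Only the two patterns at or above 0.75 produce a variant, so low-precision patterns are not duplicated into the augmented set.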


## 4. Sort Patterns

In [6]:
sorted_patterns = sort_patterns(augmented_patterns)

Sorted 1546 patterns by (length desc, precision desc)
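
A two-level descending sort of this kind can be expressed with a tuple key. The sketch below is not the body of `sort_patterns`; it assumes each pattern is a dict with hypothetical `length` and `precision` fields.

```python
# Hypothetical pattern records.
patterns = [
    {"id": "a", "length": 2, "precision": 0.90},
    {"id": "b", "length": 4, "precision": 0.60},
    {"id": "c", "length": 4, "precision": 0.80},
    {"id": "d", "length": 3, "precision": 0.70},
]

# Negate both fields so the default ascending sort yields descending order:
# longest patterns first, ties broken by higher precision.
ordered = sorted(patterns, key=lambda p: (-p["length"], -p["precision"]))
order_ids = [p["id"] for p in ordered]
print(order_ids)
```

Placing longer patterns first means that a consumer that takes the first matching pattern will try the more specific patterns before the shorter, more general ones.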


## 5. Save Augmented Patterns

In [7]:
# Save patterns
output_path = Path('../data/patterns_augmented.json')

with open(output_path, 'w') as f:
    json.dump(sorted_patterns, f, indent=2)

print(f'\nSaved {len(sorted_patterns)} patterns to {output_path}')
print(f'File size: {output_path.stat().st_size / 1024:.2f} KB')


Saved 1546 patterns to ../data/patterns_augmented.json
File size: 1994.20 KB
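
A quick way to confirm that a saved file is readable and unchanged is to load it back and compare. The sketch below does this in a temporary directory so it does not touch `../data`; the pattern records are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records standing in for sorted_patterns.
patterns = [
    {"pattern_type": "TRIANGLE", "precision": 0.8},
    {"pattern_type": "BRIDGE", "precision": 0.6},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "patterns_augmented.json"
    with open(path, "w") as f:
        json.dump(patterns, f, indent=2)
    # Read the file back and measure its size before the directory is removed.
    with open(path) as f:
        reloaded = json.load(f)
    size_kb = path.stat().st_size / 1024

round_trip_ok = reloaded == patterns
print(round_trip_ok, f"{size_kb:.2f} KB")
```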


In [8]:
# VERIFICATION: Pattern counts
print("="*60)
print("VERIFICATION")
print("="*60)

print(f"Filtered patterns: {len(filtered_patterns)}")
print(f"Augmented patterns: {len(augmented_patterns)}")
print(f"Sorted patterns: {len(sorted_patterns)}")

assert 1200 <= len(augmented_patterns) <= 1550, f"Unexpected pattern count: {len(augmented_patterns)}"
assert any(p['pattern_type'] == 'TRIANGLE' for p in sorted_patterns), "Missing TRIANGLE patterns!"
assert any(p['pattern_type'] == 'BRIDGE' for p in sorted_patterns), "Missing BRIDGE patterns!"

print("\n✅ PASS: Pattern refinement verified")

VERIFICATION
Filtered patterns: 1397
Augmented patterns: 1546
Sorted patterns: 1546

✅ PASS: Pattern refinement verified
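
The checks above cover counts and pattern types. An ordering check could be added along the following lines. This is a sketch that assumes each pattern carries hypothetical `length` and `precision` fields; the real schema may name them differently.

```python
def is_sorted_desc(patterns):
    """True if patterns are ordered by (length desc, precision desc)."""
    keys = [(p["length"], p["precision"]) for p in patterns]
    # Every adjacent pair must be non-increasing under tuple comparison.
    return all(a >= b for a, b in zip(keys, keys[1:]))


# Hypothetical records.
good = [
    {"length": 4, "precision": 0.8},
    {"length": 4, "precision": 0.6},
    {"length": 2, "precision": 0.9},
]
bad = [
    {"length": 2, "precision": 0.9},
    {"length": 4, "precision": 0.8},
]
print(is_sorted_desc(good), is_sorted_desc(bad))
```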


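## Summary

- Loaded 1958 raw patterns; tiered filtering kept 1397 (433 complex, 964 simple)
- Generated 149 passive variants with `min_precision_for_flip=0.75`, for 1546 patterns in total
- Sorted patterns by (length desc, precision desc) and saved them to `../data/patterns_augmented.json` (1994.20 KB)

**Next:** Notebook 4 - Execution Engine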