# Milestone 3 - Notebook 3: Pattern Refinement & Augmentation

## Objective

- Apply tiered filtering based on pattern complexity
- Generate passive voice variants from high-precision active patterns
- Sort patterns by (length desc, precision desc)

## Output

`../data/patterns_augmented.json`

In [1]:
import json
import sys
from pathlib import Path
import importlib


def _find_milestone3_src(start: Path) -> Path:
    """Find the correct milestone_3/src directory regardless of Jupyter CWD."""
    start = start.resolve()

    # Common cases:
    # - CWD = <repo>/milestone_3/notebooks        -> parent/src
    # - CWD = <repo>/milestone_3                 -> src
    # - CWD = <repo> (server started from repo)  -> milestone_3/src
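    candidates = [start / "milestone_3" / "src"]
    if start.name == "milestone_3":
        candidates.append(start / "src")
    if start.name == "notebooks":
        candidates.append(start.parent / "src")

    # Return the first candidate that actually exists on disk.
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"Could not locate milestone_3/src from {start}")


sys.path.insert(0, str(_find_milestone3_src(Path.cwd())))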

from pattern_augmentation import (
    filter_patterns_tiered,
    generate_passive_variants,
    sort_patterns
)

print('Imports successful!')

Imports successful!


## 1. Load Raw Patterns

In [2]:
# Load from Notebook 2
with open('../data/raw_patterns.json', 'r') as f:
    raw_data = json.load(f)

with open('../data/concept_clusters.json', 'r') as f:
    concept_data = json.load(f)

expanded_clusters = concept_data['expanded_clusters']

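# Convert pattern_counts keys back from strings to Python literals.
# ast.literal_eval parses literals only; eval would execute arbitrary code.
import ast

pattern_counts_str = raw_data['pattern_counts']
pattern_counts = {ast.literal_eval(k): v for k, v in pattern_counts_str.items()}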

print(f"Loaded {len(pattern_counts)} raw patterns")

Loaded 1786 raw patterns
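
The keys of `pattern_counts` arrive as strings because JSON object keys must be strings. The round trip can be sketched as below, assuming Notebook 2 wrote each key as the `repr()` of a tuple of strings (that serialization step is not shown here, and the example keys are invented). The sketch uses `ast.literal_eval`, which only parses Python literals and so cannot run code embedded in the file.

```python
import ast
import json

# Hypothetical pattern keys: tuples of labels, invented for illustration.
counts = {("nsubj", "VERB", "dobj"): 12, ("nsubj", "VERB", "prep", "pobj"): 5}

# JSON object keys must be strings, so each tuple is stored as its repr().
encoded = json.dumps({repr(k): v for k, v in counts.items()})

# literal_eval turns each string back into a tuple without executing code.
decoded = {ast.literal_eval(k): v for k, v in json.loads(encoded).items()}

# A key that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

If the stored keys ever contain something other than plain literals, `literal_eval` raises `ValueError` rather than evaluating it, which makes a corrupted or tampered file fail loudly.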


## 2. Tiered Filtering

In [3]:
print('='*80)
print('TIERED FILTERING')
print('='*80)

filtered_patterns = filter_patterns_tiered(pattern_counts, expanded_clusters)

TIERED FILTERING
Filtering 1786 candidate patterns...
Filtered to 1723 patterns
  Complex (len > 3): 570
  Simple (len <= 3): 1153
  'Other' patterns: 308

Pattern precision distribution:
  0.25-0.30:    1 (  0.1%)
  0.30-0.40:   27 (  1.6%)
  0.40-0.50:   19 (  1.1%)
  0.50-0.60:   82 (  4.8%)
  0.60-0.70:   70 (  4.1%)
  0.70-0.80:   37 (  2.1%)
  0.80-0.90:   28 (  1.6%)
  0.90-1.00:   15 (  0.9%)

Pattern type distribution:
  BRIDGE              :   51 (  3.0%)
  DIRECT              :    4 (  0.2%)
  DIRECT_2HOP         :   18 (  1.0%)
  DIRECT_SIBLING      :    7 (  0.4%)
  FALLBACK            :   11 (  0.6%)
  LINEAR              :  614 ( 35.6%)
  TRIANGLE            : 1018 ( 59.1%)
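
The output above splits patterns into a complex tier (`len > 3`) and a simple tier (`len <= 3`). The implementation of `filter_patterns_tiered` is not shown in this notebook, so the following is only a sketch of what a length-based tiered filter might look like. The function name `tiered_filter_sketch`, both thresholds, and the input shape (a mapping from pattern tuple to precision) are assumptions, not the project's actual values.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def tiered_filter_sketch(
    precision_by_pattern: Dict[Pattern, float],
    complex_min: float = 0.25,  # hypothetical threshold for len > 3
    simple_min: float = 0.50,   # hypothetical threshold for len <= 3
) -> Dict[Pattern, float]:
    """Keep a pattern if its precision meets the threshold for its tier."""
    kept = {}
    for pattern, precision in precision_by_pattern.items():
        threshold = complex_min if len(pattern) > 3 else simple_min
        if precision >= threshold:
            kept[pattern] = precision
    return kept


example = {
    ("a", "b", "c", "d"): 0.30,  # complex tier, passes the looser threshold
    ("a", "b"): 0.30,            # simple tier, rejected by the stricter one
    ("a", "b", "c"): 0.80,       # simple tier, passes
}
kept = tiered_filter_sketch(example)
```

The idea behind such a design is that longer patterns are more specific, so they may be tolerated at a lower precision than short, generic ones.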


## 3. Generate Passive Variants

In [4]:
augmented_patterns = generate_passive_variants(
    filtered_patterns,
    min_precision_for_flip=0.75
)

\nGenerating passive voice variants...
Generated 155 passive variants
Total patterns: 1878
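
The totals are consistent with variants being appended to the filtered set (1723 + 155 = 1878). A minimal sketch of that precision gate follows. The pattern fields (`subject`, `object`, `voice`), the helper names `add_variants` and `toy_flip`, and the use of `>=` at the threshold are all assumptions made for illustration; the real transformation inside `generate_passive_variants` is not shown here.

```python
from typing import Any, Callable, Dict, List

Pattern = Dict[str, Any]


def add_variants(
    patterns: List[Pattern],
    flip: Callable[[Pattern], Pattern],
    min_precision_for_flip: float = 0.75,
) -> List[Pattern]:
    """Append a flipped copy of every pattern at or above the threshold."""
    variants = [
        flip(p) for p in patterns if p["precision"] >= min_precision_for_flip
    ]
    return patterns + variants


def toy_flip(pattern: Pattern) -> Pattern:
    # Illustrative only: swap the subject and object slots and tag the voice.
    flipped = dict(pattern)
    flipped["subject"], flipped["object"] = pattern["object"], pattern["subject"]
    flipped["voice"] = "passive"
    return flipped


active = [
    {"subject": "X", "object": "Y", "precision": 0.90, "voice": "active"},
    {"subject": "X", "object": "Z", "precision": 0.60, "voice": "active"},
]
augmented = add_variants(active, toy_flip)
```

Only the first pattern clears the 0.75 gate, so the result holds the two originals plus one passive variant.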


## 4. Sort Patterns

In [5]:
sorted_patterns = sort_patterns(augmented_patterns)

\nSorted 1878 patterns by (length desc, precision desc)
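
The ordering `(length desc, precision desc)` can be expressed as a single sort key. The sketch below assumes each pattern is a dict with a `pattern` sequence and a `precision` float; these field names are hypothetical and may differ from what `sort_patterns` actually uses.

```python
# Hypothetical pattern records, invented for illustration.
patterns = [
    {"pattern": ["a", "b"], "precision": 0.9},
    {"pattern": ["a", "b", "c", "d"], "precision": 0.6},
    {"pattern": ["a", "b", "c", "d"], "precision": 0.8},
]

# Negating both components sorts each one in descending order:
# longer patterns first, and higher precision first among equal lengths.
ordered = sorted(
    patterns,
    key=lambda p: (-len(p["pattern"]), -p["precision"]),
)
```

Putting the longest patterns first presumably lets a matcher try the most specific pattern before falling back to shorter, more general ones.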


## 5. Save Augmented Patterns

In [6]:
# Save patterns
output_path = Path('../data/patterns_augmented.json')

with open(output_path, 'w') as f:
    json.dump(sorted_patterns, f, indent=2)

print(f'\nSaved {len(sorted_patterns)} patterns to {output_path}')
print(f'File size: {output_path.stat().st_size / 1024:.2f} KB')


Saved 1878 patterns to ../data/patterns_augmented.json
File size: 2850.60 KB


## Summary

**Next:** Notebook 4 - Execution Engine