# Milestone 3 - Notebook 1: Concept Abstraction

## Objective

Create semantic clusters to generalize from specific words to concepts:
- **Manual seed clusters** for VERB, NOUN, and PREP concepts
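- **Auto-expand** VERB and NOUN clusters using word embeddings (cosine similarity ≥ a threshold; the function default is 0.75, and the run below uses 0.55 globally and 0.70 for TOPIC_VERB)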
- **Manual prepositions only** (no expansion to avoid noise)

## Output

`../data/concept_clusters.json`

In [1]:
# Imports
import json
from collections import Counter, defaultdict

import numpy as np
import spacy
from pathlib import Path
from tqdm.auto import tqdm
from sklearn.metrics.pairwise import cosine_similarity

print("Libraries loaded successfully!")


Libraries loaded successfully!


In [2]:
# Load spaCy model with word vectors
print("Loading spaCy en_core_web_lg model...")
nlp = spacy.load("en_core_web_lg")
print(f"Model loaded: {nlp.meta['name']}")
print(f"Vector dimensions: {nlp.vocab.vectors_length}")

Loading spaCy en_core_web_lg model...
Model loaded: core_web_lg
Vector dimensions: 300


## 1. Define Manual Seed Clusters

### VERBS (will be auto-expanded)
- CAUSE_VERB: Causation and triggering
- CREATE_VERB: Creation and production
- TOPIC_VERB: Discussion and communication
- MOVEMENT_VERB: Motion and travel

### NOUNS (will be auto-expanded)
- GROUP_NOUN: Collections and aggregates

### PREPOSITIONS (manual only - NOT expanded)
- CONTAINER_PREP: Containment
- PART_PREP: Part-whole relationships
- DESTINATION_PREP: Direction and destination

In [3]:
# Manual seed clusters
SEED_CLUSTERS = {
    # VERBS - Will be auto-expanded
    "CAUSE_VERB": ["cause", "trigger", "stem", "result", "lead", "induce", "spark", "provoke"],
    "CREATE_VERB": ["generate", "produce", "manufacture", "build", "assemble", "synthesize", "yield"],
    "TOPIC_VERB": ["discuss", "mention", "about", "cover", "describe", "regarding", "concerning"],
    "MOVEMENT_VERB": ["go", "move", "travel", "ship", "arrive", "depart"],
    
    # NOUNS - Will be auto-expanded
    "GROUP_NOUN": ["group", "set", "collection", "bunch", "fleet", "team", "array", "series"],
    
    # PREPOSITIONS - Manual only (DO NOT EXPAND)
    "CONTAINER_PREP": ["in", "inside", "within", "into"],
    "PART_PREP": ["of", "comprising", "consisting", "composed"],
    "DESTINATION_PREP": ["to", "towards", "into", "onto"]
}

print("Seed clusters defined:")
for concept, seeds in SEED_CLUSTERS.items():
    print(f"  {concept}: {len(seeds)} seeds")

Seed clusters defined:
  CAUSE_VERB: 8 seeds
  CREATE_VERB: 7 seeds
  TOPIC_VERB: 7 seeds
  MOVEMENT_VERB: 6 seeds
  GROUP_NOUN: 8 seeds
  CONTAINER_PREP: 4 seeds
  PART_PREP: 4 seeds
  DESTINATION_PREP: 4 seeds


## 2. Auto-Expansion Logic

Expand VERB and NOUN clusters using word embeddings:
- Compute average vector for seed words
- Find similar words in the vocabulary (cosine similarity ≥ threshold; default 0.75, set to 0.55 globally and 0.70 for TOPIC_VERB in this run)
- Limit to top 50 similar words per cluster
- **Skip PREP clusters** (manual only)
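
The steps above can be illustrated on a toy example before reading the full implementation. This is a minimal sketch, not the notebook's function: `expand_from_seeds` and the hand-picked 2-D vectors are hypothetical stand-ins for the 300-dimensional spaCy vectors.

```python
import numpy as np

def expand_from_seeds(vectors, seeds, threshold=0.75, top_n=50):
    """Rank non-seed words by cosine similarity to the mean seed vector."""
    centroid = np.mean([vectors[s] for s in seeds], axis=0)
    scored = []
    for word, vec in vectors.items():
        if word in seeds:
            continue
        sim = float(np.dot(centroid, vec) / (np.linalg.norm(centroid) * np.linalg.norm(vec)))
        if sim >= threshold:
            scored.append((word, sim))
    # Highest similarity first, then keep at most top_n words
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]

# Hypothetical 2-D "embeddings", chosen by hand for illustration only
toy_vectors = {
    "build": np.array([1.0, 0.0]),
    "produce": np.array([0.8, 0.6]),
    "create": np.array([0.9, 0.3]),
    "banana": np.array([-1.0, 0.2]),
}
toy_expanded = expand_from_seeds(toy_vectors, seeds={"build", "produce"})
print(toy_expanded)  # ['create']
```

Here "create" lies exactly on the seed centroid and passes the threshold, while "banana" points the other way and is rejected.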

In [4]:
def expand_concept_clusters(
    nlp,
    seed_clusters,
    similarity_threshold=0.75,
    top_n=50,
    vocabulary=None,
    excluded_map=None,
    concept_thresholds=None,
    concept_pos_map=None,
    lemma_pos_map=None,
    exclude_stopwords=True,
):
    """
    Expand VERB and NOUN clusters using word embeddings.
    Skip PREP clusters (manual only).

    Args:
        nlp: spaCy model with word vectors
        seed_clusters: Dict of {concept_name: seed_words}
        similarity_threshold: Default minimum cosine similarity (fallback)
        top_n: Maximum expanded words per cluster
        vocabulary: Optional iterable of strings to restrict search space
        excluded_map: Optional dict of {concept_name: [words_to_exclude]}
        concept_thresholds: Optional dict overriding threshold per concept
        concept_pos_map: Optional dict of {concept_name: {allowed_pos_tags}}
        lemma_pos_map: Optional dict of {lemma: pos_tag} (e.g., derived from training)
        exclude_stopwords: If True, filters out spaCy stopwords

    Returns:
        expanded_clusters: Dict with seeds + expanded words
        lemma_to_concept: Reverse mapping for fast lookup
    """
    expanded_clusters = {}
    lemma_to_concept = {}

    for concept, seeds in seed_clusters.items():
        print(f"\nProcessing {concept}...")

        threshold = similarity_threshold
        if concept_thresholds and concept in concept_thresholds:
            threshold = concept_thresholds[concept]

        seeds_lower = {s.lower() for s in seeds}
        excluded_set = set()
        if excluded_map and concept in excluded_map:
            excluded_set = {w.lower() for w in excluded_map[concept]}

        allowed_pos = None
        if concept_pos_map and concept in concept_pos_map:
            allowed_pos = set(concept_pos_map[concept])

        # Skip preposition clusters (manual only)
        if "PREP" in concept:
            print("  PREP cluster - manual only (no expansion)")
            expanded_clusters[concept] = {
                "seeds": seeds,
                "expanded": [],
                "threshold": threshold,
            }
            for word in seeds:
                lemma_to_concept[word.lower()] = concept
            continue

        # Get vectors for seed words (fast: use vocab, avoid full pipeline)
        seed_vectors = []
        valid_seeds = []
        for word in seeds:
            lex = nlp.vocab[word.lower()]
            if lex.has_vector:
                seed_vectors.append(lex.vector)
                valid_seeds.append(word)
            else:
                print(f"  Warning: '{word}' has no vector")

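        if not seed_vectors:
            print(f"  No valid seed vectors for {concept}")
            expanded_clusters[concept] = {
                "seeds": seeds,
                "expanded": [],
                "threshold": threshold,
            }
            # Seeds still belong to the concept, so keep them in the reverse map
            # (same handling as the PREP branch above)
            for word in seeds:
                lemma_to_concept[word.lower()] = concept
            continue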

        avg_vector = np.mean(seed_vectors, axis=0).reshape(1, -1)

        candidates = []

        if vocabulary is not None:
            iterator = tqdm(sorted(vocabulary), desc=f"  {concept}", leave=False)
            print("  Searching restricted training vocabulary for similar words...")
            for word_str in iterator:
                w = str(word_str).lower()

                # Basic filters
                if not w.isalpha():
                    continue
                if w in seeds_lower:
                    continue
                if excluded_set and w in excluded_set:
                    continue

                lex = nlp.vocab[w]

                # Safety filters
                if exclude_stopwords and lex.is_stop:
                    continue

                # POS filter (domain-adapted via training POS map)
                if allowed_pos is not None and lemma_pos_map is not None:
                    w_pos = lemma_pos_map.get(w)
                    if w_pos not in allowed_pos:
                        continue

                if not lex.has_vector:
                    continue

                similarity = cosine_similarity(avg_vector, lex.vector.reshape(1, -1))[0][0]
                if similarity >= threshold:
                    candidates.append((w, similarity))
        else:
            print("  Searching full spaCy vocabulary for similar words...")
            for lex in tqdm(nlp.vocab, desc=f"  {concept}", leave=False):
                w = lex.text.lower()

                if not lex.has_vector:
                    continue
                if not lex.is_alpha:
                    continue
                if w in seeds_lower:
                    continue
                if excluded_set and w in excluded_set:
                    continue
                if exclude_stopwords and lex.is_stop:
                    continue

                similarity = cosine_similarity(avg_vector, lex.vector.reshape(1, -1))[0][0]
                if similarity >= threshold:
                    candidates.append((w, similarity))

        candidates.sort(key=lambda x: x[1], reverse=True)
        expanded_words = [word for word, _sim in candidates[:top_n]]

        print(f"  Found {len(expanded_words)} similar words")
        print(f"  Top 10: {expanded_words[:10]}")

        expanded_clusters[concept] = {
            "seeds": seeds,
            "expanded": expanded_words,
            "threshold": threshold,
        }

        for word in (seeds + expanded_words):
            lemma_to_concept[word.lower()] = concept

    print("\nExpansion complete!")
    print(f"Total concepts: {len(expanded_clusters)}")
    print(f"Total unique words: {len(lemma_to_concept)}")

    return expanded_clusters, lemma_to_concept
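
The function above calls `cosine_similarity` once per candidate word. If that loop ever becomes a bottleneck, the same scores could be computed in one matrix product. The sketch below assumes the candidate vectors have already been stacked into a 2-D array; `cosine_to_centroid` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cosine_to_centroid(centroid, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    centroid = np.asarray(centroid, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(centroid)
    # Guard against zero vectors: report similarity 0.0 instead of dividing by zero
    safe = np.where(denom == 0.0, 1.0, denom)
    sims = (matrix @ centroid) / safe
    return np.where(denom == 0.0, 0.0, sims)

# Toy candidates: same direction, orthogonal, 45 degrees apart, and a zero vector
candidate_matrix = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]])
sims = cosine_to_centroid([1.0, 0.0], candidate_matrix)
print(sims)
```

Thresholding then reduces to a boolean mask such as `sims >= threshold`, and the top-N selection to a sort over the surviving indices.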

In [5]:
# --- Domain adaptation: restrict expansion to training-data vocab + blacklist antonyms ---
# We also derive a lemma→POS map from the training data so we can filter expansions
# by concept type (e.g., keep VERB clusters verb-only).

def _find_train_json_path() -> Path:
    for p in [Path.cwd(), *Path.cwd().parents]:
        candidate = p / "data/processed/train/train.json"
        if candidate.exists():
            return candidate
    raise FileNotFoundError("Could not find data/processed/train/train.json from current working directory")

train_json_path = _find_train_json_path()

with train_json_path.open("r", encoding="utf8") as f:
    train_data = json.load(f)

training_vocab = set()
lemma_pos_counts = defaultdict(Counter)

for item in train_data:
    for tok in item.get("tokens", []):
        lemma = tok.get("lemma")
        pos = tok.get("pos")
        if not lemma or not pos:
            continue
        lemma = lemma.lower()
        training_vocab.add(lemma)
        lemma_pos_counts[lemma][pos] += 1

lemma_pos_map = {lemma: counts.most_common(1)[0][0] for lemma, counts in lemma_pos_counts.items()}

print(f"Restricted search space to {len(training_vocab)} unique lemmas from training data.")

# Safety net: exclude known antonyms / opposites that can be close in vector space
EXCLUDED_WORDS = {
    "CREATE_VERB": ["destroy", "demolish", "remove", "delete", "break"],
    "CAUSE_VERB": ["prevent", "stop", "block"],
    "MOVEMENT_VERB": ["stay", "remain", "stop"],
}


Restricted search space to 15387 unique lemmas from training data.
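
The lemma→POS map built in this cell keeps only the most frequent tag per lemma. A small sketch with hypothetical token records (same `tokens` / `lemma` / `pos` shape that the cell reads) shows what that implies for ambiguous lemmas:

```python
from collections import Counter, defaultdict

# Hypothetical records; the real ones come from train.json
toy_train = [
    {"tokens": [{"lemma": "Lead", "pos": "VERB"}, {"lemma": "result", "pos": "NOUN"}]},
    {"tokens": [{"lemma": "lead", "pos": "VERB"}, {"lemma": "lead", "pos": "NOUN"}]},
    {"tokens": [{"lemma": "result", "pos": "VERB"}, {"lemma": "result", "pos": "NOUN"}]},
]

toy_counts = defaultdict(Counter)
for item in toy_train:
    for tok in item.get("tokens", []):
        # Lowercase so "Lead" and "lead" share one counter
        toy_counts[tok["lemma"].lower()][tok["pos"]] += 1

# Majority vote: one POS tag per lemma
toy_pos_map = {lemma: c.most_common(1)[0][0] for lemma, c in toy_counts.items()}
print(toy_pos_map)  # {'lead': 'VERB', 'result': 'NOUN'}
```

A lemma that is mostly tagged NOUN is therefore filtered out of every VERB cluster, even if it occasionally occurs as a verb. On ties, `Counter.most_common` returns the tag that was encountered first.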


In [6]:
# Run expansion
print("="*80)
print("EXPANDING CONCEPT CLUSTERS")
print("="*80)

# POS constraints per concept (keeps expansions syntactically consistent)
CONCEPT_POS_MAP = {
    "CAUSE_VERB": {"VERB"},
    "CREATE_VERB": {"VERB"},
    "TOPIC_VERB": {"VERB"},
    "MOVEMENT_VERB": {"VERB"},
    "GROUP_NOUN": {"NOUN"},
}

# Optional per-concept thresholds (use global as fallback)
CONCEPT_THRESHOLDS = {
    # Topic words drift easily; keep this one stricter
    "TOPIC_VERB": 0.70,
}

expanded_clusters, lemma_to_concept = expand_concept_clusters(
    nlp=nlp,
    seed_clusters=SEED_CLUSTERS,
    similarity_threshold=0.55,
    top_n=50,
    vocabulary=training_vocab,
    excluded_map=EXCLUDED_WORDS,
    concept_thresholds=CONCEPT_THRESHOLDS,
    concept_pos_map=CONCEPT_POS_MAP,
    lemma_pos_map=lemma_pos_map,
    exclude_stopwords=True,
)


EXPANDING CONCEPT CLUSTERS

Processing CAUSE_VERB...



  Searching restricted training vocabulary for similar words...
  Found 33 similar words
  Top 10: ['affect', 'stimulate', 'occur', 'suppress', 'arise', 'decrease', 'avoid', 'reduce', 'inhibit', 'concern']

Processing CREATE_VERB...



  Searching restricted training vocabulary for similar words...
  Found 35 similar words
  Top 10: ['create', 'develop', 'construct', 'utilize', 'achieve', 'fabricate', 'maintain', 'demonstrate', 'incorporate', 'derive']

Processing TOPIC_VERB...



  Searching restricted training vocabulary for similar words...
  Found 4 similar words
  Top 10: ['explain', 'consider', 'regard', 'understand']

Processing MOVEMENT_VERB...



  Searching restricted training vocabulary for similar words...
  Found 27 similar words
  Top 10: ['leave', 'come', 'begin', 'bring', 'decide', 'wait', 'journey', 'proceed', 'place', 'continue']

Processing GROUP_NOUN...



  Searching restricted training vocabulary for similar words...
  Found 2 similar words
  Top 10: ['number', 'variety']

Processing CONTAINER_PREP...
  PREP cluster - manual only (no expansion)

Processing PART_PREP...
  PREP cluster - manual only (no expansion)

Processing DESTINATION_PREP...
  PREP cluster - manual only (no expansion)

Expansion complete!
Total concepts: 8
Total unique words: 141


## 3. Inspect Results

Validate expansion quality by inspecting expanded clusters.

In [7]:
# Display expanded clusters
print("="*80)
print("EXPANDED CONCEPT CLUSTERS")
print("="*80)

for concept, data in expanded_clusters.items():
    print(f"\n{concept}:")
    print(f"  Seeds ({len(data['seeds'])}): {data['seeds']}")
    print(f"  Expanded ({len(data['expanded'])}): {data['expanded'][:20]}...")
    print(f"  Total words: {len(data['seeds']) + len(data['expanded'])}")

EXPANDED CONCEPT CLUSTERS

CAUSE_VERB:
  Seeds (8): ['cause', 'trigger', 'stem', 'result', 'lead', 'induce', 'spark', 'provoke']
  Expanded (33): ['affect', 'stimulate', 'occur', 'suppress', 'arise', 'decrease', 'avoid', 'reduce', 'inhibit', 'concern', 'turn', 'eliminate', 'overcome', 'happen', 'increase', 'initiate', 'contribute', 'lack', 'produce', 'diminish']...
  Total words: 41

CREATE_VERB:
  Seeds (7): ['generate', 'produce', 'manufacture', 'build', 'assemble', 'synthesize', 'yield']
  Expanded (35): ['create', 'develop', 'construct', 'utilize', 'achieve', 'fabricate', 'maintain', 'demonstrate', 'incorporate', 'derive', 'sustain', 'provide', 'analyze', 'contribute', 'replicate', 'obtain', 'employ', 'optimize', 'deliver', 'evaluate']...
  Total words: 42

TOPIC_VERB:
  Seeds (7): ['discuss', 'mention', 'about', 'cover', 'describe', 'regarding', 'concerning']
  Expanded (4): ['explain', 'consider', 'regard', 'understand']...
  Total words: 11

MOVEMENT_VERB:
  Seeds (6): ['go', 'm

In [8]:
# Display reverse mapping statistics
print("\nReverse mapping (lemma → concept):")
print(f"  Total unique lemmas: {len(lemma_to_concept)}")
print("\nSample mappings:")
for lemma, concept in list(lemma_to_concept.items())[:20]:
    print(f"  '{lemma}' → {concept}")


Reverse mapping (lemma → concept):
  Total unique lemmas: 141

Sample mappings:
  'cause' → CAUSE_VERB
  'trigger' → CAUSE_VERB
  'stem' → CAUSE_VERB
  'result' → CAUSE_VERB
  'lead' → CAUSE_VERB
  'induce' → CAUSE_VERB
  'spark' → CAUSE_VERB
  'provoke' → CAUSE_VERB
  'affect' → CAUSE_VERB
  'stimulate' → CAUSE_VERB
  'occur' → CAUSE_VERB
  'suppress' → CAUSE_VERB
  'arise' → CAUSE_VERB
  'decrease' → CAUSE_VERB
  'avoid' → CAUSE_VERB
  'reduce' → CAUSE_VERB
  'inhibit' → CAUSE_VERB
  'concern' → CAUSE_VERB
  'turn' → MOVEMENT_VERB
  'eliminate' → CAUSE_VERB
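
Note that 'turn' appears in the CAUSE_VERB expansion but maps to MOVEMENT_VERB here. `lemma_to_concept` is a plain dict filled in cluster order, so when two clusters claim the same lemma the later one silently wins; presumably MOVEMENT_VERB also picked up 'turn'. The seed "into" is in the same situation, since it is listed under both CONTAINER_PREP and DESTINATION_PREP. A minimal sketch for listing such collisions, using a hypothetical helper `find_collisions` and toy data in the shape of `expanded_clusters`:

```python
from collections import defaultdict

def find_collisions(clusters):
    """Return {lemma: [concepts...]} for lemmas listed under more than one concept."""
    owners = defaultdict(list)
    for concept, data in clusters.items():
        for word in data["seeds"] + data["expanded"]:
            lemma = word.lower()
            if concept not in owners[lemma]:
                owners[lemma].append(concept)
    return {lemma: concepts for lemma, concepts in owners.items() if len(concepts) > 1}

# Toy input in the same shape as expanded_clusters (contents are illustrative)
toy_clusters = {
    "CAUSE_VERB": {"seeds": ["cause"], "expanded": ["turn", "affect"]},
    "MOVEMENT_VERB": {"seeds": ["go"], "expanded": ["turn"]},
    "CONTAINER_PREP": {"seeds": ["in", "into"], "expanded": []},
    "DESTINATION_PREP": {"seeds": ["to", "into"], "expanded": []},
}
collisions = find_collisions(toy_clusters)
print(collisions)
```

Running `find_collisions(expanded_clusters)` on the real clusters would show which lemmas are affected, which helps when deciding whether to assign each one to its most similar cluster or to exclude it.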


## 4. Save Results

Save expanded clusters and reverse mapping to JSON.

In [9]:
# Prepare output
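output_data = {
    "expanded_clusters": expanded_clusters,
    "lemma_to_concept": lemma_to_concept,
    "metadata": {
        # Global threshold actually passed to expand_concept_clusters above;
        # per-concept overrides are stored in each cluster's "threshold" field
        "similarity_threshold": 0.55,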
        "top_n": 50,
        "total_concepts": len(expanded_clusters),
        "total_unique_words": len(lemma_to_concept),
        "spacy_model": nlp.meta['name']
    }
}

# Save to JSON
output_path = Path("../data/concept_clusters.json")
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, 'w', encoding='utf8') as f:
    json.dump(output_data, f, indent=2)

print(f"\nSaved concept clusters to: {output_path}")
print(f"File size: {output_path.stat().st_size / 1024:.2f} KB")


Saved concept clusters to: ../data/concept_clusters.json
File size: 7.80 KB
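
A downstream notebook could read the file back and use the reverse mapping to replace lemmas with concept labels. How the next notebook actually consumes the file is not shown here, so the sketch below is an assumption: `load_lemma_to_concept` and `abstract_lemmas` are hypothetical helpers, and the demo writes a tiny stand-in file rather than reading the real output.

```python
import json
import tempfile
from pathlib import Path

def load_lemma_to_concept(path):
    """Read the saved JSON and return only the lemma -> concept mapping."""
    with Path(path).open("r", encoding="utf8") as f:
        payload = json.load(f)
    return payload["lemma_to_concept"]

def abstract_lemmas(lemmas, lemma_to_concept):
    """Replace each known lemma with its concept label; leave others unchanged."""
    return [lemma_to_concept.get(lemma.lower(), lemma) for lemma in lemmas]

# Self-contained demo with the same top-level keys as the saved file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "concept_clusters.json"
    demo_path.write_text(json.dumps({
        "expanded_clusters": {},
        "lemma_to_concept": {"cause": "CAUSE_VERB", "into": "DESTINATION_PREP"},
        "metadata": {},
    }), encoding="utf8")
    mapping = load_lemma_to_concept(demo_path)

abstracted = abstract_lemmas(["Smoking", "cause", "cancer"], mapping)
print(abstracted)  # ['Smoking', 'CAUSE_VERB', 'cancer']
```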


## Summary

Concept abstraction complete! 

**Key Results:**
- Manual seed clusters defined for VERB, NOUN, and PREP concepts
- VERB and NOUN clusters auto-expanded using word embeddings (cosine similarity ≥ 0.55, or ≥ 0.70 for TOPIC_VERB), restricted to the training vocabulary and filtered by POS
- PREP clusters kept manual (no expansion)
- Reverse mapping created for fast lemma → concept lookup

**Next Step:** Notebook 2 - Unified Pattern Mining