# 🧠 Anti-Graphiti: Semantic Data Augmentation with Ollama

This notebook implements a **Data Augmentation Pipeline** to address the class imbalance in the Bloom's Taxonomy dataset. 

**Goal:** Generate high-quality, semantically rich questions for minority classes (`Apply`, `Analyze`, `Evaluate`, `Create`) using a local LLM (Ollama).

**Strategy:**
1.  **Analyze Imbalance**: Identify how many samples are needed for each class.
2.  **Expert Prompting**: Use a "Data Engineer" persona with strict **5W1H** constraints.
3.  **Semantic Filtering**: Use `spaCy` to keep only generated questions that are at least four tokens long and either have a verb (or noun) as the dependency **ROOT** or contain a Wh-word.
4.  **Merge**: Combine valid synthetic data with the original training set.

> **Prerequisite**: Ensure Ollama is running (`ollama serve`).
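
A quick way to confirm the prerequisite before running the rest of the notebook is to ask the server which models it has. The sketch below assumes Ollama's `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field; the helper names `fetch_tags` and `model_is_listed` are hypothetical and not part of this notebook.

```python
import json
import urllib.error
import urllib.request

def model_is_listed(tags_payload, model_name):
    """Return True if model_name (with or without a ':tag' suffix) is in the payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return any(n == model_name or n.split(":")[0] == model_name for n in names)

def fetch_tags(base_url="http://127.0.0.1:11434", timeout=5):
    """Return the parsed /api/tags response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a body that is not JSON
        return None

tags = fetch_tags()
if tags is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not model_is_listed(tags, "qwen3"):
    print("Ollama is running, but 'qwen3' is not in the local model list.")
else:
    print("Ollama is running and 'qwen3' is available.")
```

The lookup is split from the network call so the matching rule can be checked without a running server.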

In [None]:
import os
import json
import time
import random
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import spacy
from collections import Counter

# Configuration
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL_NAME = "qwen3" # Must match a model available in the local Ollama install
DATA_PATH = "../data/raw/bloom_questions.csv"
OUTPUT_PATH = "../data/raw/bloom_questions_augmented.csv"

# Load SpaCy for Semantic Filtering
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:  # spacy.load raises OSError when the model is not installed
    print("Downloading SpaCy model...")
    os.system("python -m spacy download en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

print(f"Target Model: {MODEL_NAME}")

## 1. Distribution Analysis

In [None]:
if os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_PATH)
    counts = df['level'].value_counts()
    TARGET_COUNT = counts.max() # Aim to match the majority class
    
    print(f"Majority Class: {counts.idxmax()} ({TARGET_COUNT} samples)")
    print("\nSamples needed to balance:")
    needed = {}
    for label, count in counts.items():
        diff = TARGET_COUNT - count
        if diff > 0:
            needed[label] = diff
            print(f"  - {label}: +{diff}")
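else:
    print("Data file not found!")
    # Empty frame with the columns used for generated rows, so the merge step still runs
    df = pd.DataFrame(columns=['question', 'level'])
    needed = {'Apply': 100, 'Analyze': 100} # Dummy defaults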

## 2. Ollama Integration (The Expert Persona)

We use a specialized prompt to enforce:
1.  **Linguistic Structure**: Strong Action Verb at the ROOT.
2.  **Diversity**: Strict adherence to **5W1H** (Who, What, Where, When, Why, How).
3.  **Context**: Educational domain.
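
Because the model may not follow the requested opener mix exactly, it can help to measure what a batch actually contains. The following is a minimal sketch, not part of the original pipeline: `opener_category` and `opener_mix` are hypothetical helpers, and the sample batch is hand-written for illustration rather than model output.

```python
from collections import Counter

IMPERATIVE = "Imperative/Other"

def opener_category(question):
    """Classify a question by its first word: How, Why, What, or imperative/other."""
    words = question.strip().split()
    first = words[0].strip('"\'.,:;').capitalize() if words else ""
    return first if first in ("How", "Why", "What") else IMPERATIVE

def opener_mix(questions):
    """Return the share of each opener category as a fraction of the batch."""
    counts = Counter(opener_category(q) for q in questions)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

# Hand-written sample batch (illustrative only)
batch = [
    "How would you apply Newton's second law to a braking car?",
    "Why does the author choose an unreliable narrator?",
    "What is the impact of inflation on savings?",
    "Design an experiment to test plant growth under red light.",
    "Compare two sorting algorithms in terms of memory use.",
]
for category, share in sorted(opener_mix(batch).items()):
    print(f"{category}: {share:.0%}")
```

Comparing these shares against the 30/20/20/30 split requested in the prompt gives a rough signal of whether the constraint is being honoured.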

In [None]:
def generate_batch(level, count=5, model=MODEL_NAME):
    """Generate a batch of questions for a specific Bloom level."""
    
    prompt = f"""
    Role: You are an expert AI Data Engineer specializing in Bloom's Taxonomy.
    Task: Generate {count} unique, high-quality educational questions for the Bloom's level: '{level}'.
    
    Constraints:
    1. **5W1H Requirement**: You MUST vary the start of the questions. Use exactly this mix:
       - 30% starting with 'How' (e.g., 'How would you...')
       - 20% starting with 'Why' (e.g., 'Why does...')
       - 20% starting with 'What' (e.g., 'What is the impact...')
       - 30% Imperative/Direct commands (e.g., 'Design...', 'Critique...', 'Compare...')
       
    2. **Linguistic Structure**: Every question must have a main **Action Verb** that clearly maps to the '{level}' category.
    
    3. **Output Format**: Return ONLY a raw CSV list with no headers, no numbering, no preamble. Format: "question"
    """
    
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.85, # High creativity for diversity
            "top_p": 0.9
        }
    }
    
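    try:
        # A timeout keeps the notebook from hanging if Ollama stops responding
        response = requests.post(OLLAMA_URL, json=payload, timeout=300)
        if response.status_code == 200:
            res_json = response.json()
            # Extract text and split by newlines
            text = res_json.get('response', '')
            # Basic cleanup: strip quotes, then list markers from the start only,
            # so digits at the end of a question (e.g. a year) are preserved
            questions = []
            for line in text.split('\n'):
                q = line.strip().strip('"').lstrip('-0123456789.) \t').strip('" \t')
                if len(q) > 10:
                    questions.append(q)
            return questions
        else:
            print(f"Error: {response.status_code}")
            return []
    except (requests.RequestException, ValueError) as e:
        # RequestException: network failure or timeout; ValueError: body is not valid JSON
        print(f"Request Error: {e}")
        return []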

# Test Connection
print("Testing Ollama Connection...")
test_q = generate_batch("Apply", count=1)
print(f"Result: {test_q}")

## 3. Semantic Filtering (SemantiQ Core)

We use **spaCy** to verify that each generated question is structured like a question or a command. A question passes if the dependency **ROOT** of the sentence is a verb (nouns and proper nouns are also accepted) or if the sentence contains a Wh-word.
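
The filter in this notebook also accepts `NOUN` and `PROPN` roots. If a verb-only ROOT is wanted, a stricter variant could look like the sketch below. It is an illustration under assumptions, not the notebook's method: `Tok` is a hypothetical stand-in exposing the same `dep_`, `pos_`, and `tag_` attributes as a spaCy token, and the example annotations are written by hand rather than produced by the parser.

```python
from collections import namedtuple

# Stand-in for a spaCy token; only the attributes the checks read
Tok = namedtuple("Tok", ["text", "dep_", "pos_", "tag_"])

def has_verb_root(tokens):
    """True only if the dependency ROOT is tagged VERB (nouns are not accepted)."""
    return any(t.dep_ == "ROOT" and t.pos_ == "VERB" for t in tokens)

def has_wh_word(tokens):
    """True if any token carries a Penn Treebank Wh-tag (WDT, WP, WP$, WRB)."""
    return any(t.tag_ in ("WDT", "WP", "WP$", "WRB") for t in tokens)

# Hand-annotated examples (not parser output)
command = [
    Tok("Design", "ROOT", "VERB", "VB"),
    Tok("a", "det", "DET", "DT"),
    Tok("bridge", "dobj", "NOUN", "NN"),
]
title = [
    Tok("Bridge", "compound", "NOUN", "NN"),
    Tok("design", "ROOT", "NOUN", "NN"),
]
print(has_verb_root(command), has_verb_root(title))
```

With spaCy loaded, the same functions can be called as `has_verb_root(nlp(question))`, since iterating a `Doc` yields tokens with these attributes.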

In [None]:
def is_semantically_valid(question):
    """Check if the question has a valid linguistic structure."""
    doc = nlp(question)
    
    # 1. Length check
    if len(doc) < 4:
        return False
        
    # 2. ROOT check: the ROOT should usually be a VERB (Design, Explain);
    # otherwise the sentence should contain a Wh-word (Who, What, Where...)
    has_valid_root = False
    has_wh_word = False
    
    for token in doc:
        if token.dep_ == 'ROOT' and token.pos_ in ('VERB', 'NOUN', 'PROPN'):
            # NOUN/PROPN allowed because titles/definitions are sometimes parsed with a noun ROOT
            has_valid_root = True
        if token.tag_.startswith('W'): # Wh-determiner, Wh-pronoun, etc.
            has_wh_word = True
            
    return has_valid_root or has_wh_word

def filter_questions(questions):
    valid = []
    for q in questions:
        if is_semantically_valid(q):
            valid.append(q)
    return valid

## 4. Main Generation Loop

We iterate through the classes that need augmentation.
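
One way to decide whether two generated questions count as the same is to compare them after normalising case, punctuation, and whitespace. The sketch below is an assumption about how such a check might be written; `dedup_key` and `add_unique` are hypothetical helpers, and the example questions are made up.

```python
import re

def dedup_key(question):
    """Normalise case, punctuation, and whitespace so near-identical questions collide."""
    lowered = question.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())

def add_unique(candidates, seen, limit):
    """Return up to `limit` candidates whose key is not in `seen`; updates `seen` in place."""
    kept = []
    for q in candidates:
        key = dedup_key(q)
        if key and key not in seen and len(kept) < limit:
            seen.add(key)
            kept.append(q)
    return kept

seen = set()
first = add_unique(["Design a bridge.", "design a  bridge", "Compare two poems."], seen, limit=5)
second = add_unique(["Compare two poems!", "Critique the essay."], seen, limit=5)
print(first)   # ['Design a bridge.', 'Compare two poems.']
print(second)  # ['Critique the essay.']
```

Keeping one `seen` set per class across batches means a question repeated in a later batch is rejected as well.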

In [None]:
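generated_data = []
MAX_STALLED_BATCHES = 5  # Give up on a class after this many consecutive batches add nothing

for label, amount_needed in needed.items():
    print(f"\nGenerating {amount_needed} samples for '{label}'...")
    
    collected = 0
    stalled = 0
    seen = set()  # Lowercased questions already kept for this class
    pbar = tqdm(total=amount_needed)
    
    while collected < amount_needed and stalled < MAX_STALLED_BATCHES:
        # Request a fixed-size batch; the loop repeats until enough questions pass the filter
        raw_batch = generate_batch(label, count=10)
        if not raw_batch:
            stalled += 1
            time.sleep(2) # Backoff if error
            continue
            
        # Semantic Filter
        valid_batch = filter_questions(raw_batch)
        
        # Add unique ones
        before = collected
        for q in valid_batch:
            key = q.lower()
            if collected < amount_needed and key not in seen:
                seen.add(key)
                generated_data.append({'question': q, 'level': label})
                collected += 1
                pbar.update(1)
        stalled = 0 if collected > before else stalled + 1
    
    pbar.close()
    if collected < amount_needed:
        print(f"Stopped early for '{label}': {collected}/{amount_needed} collected.")
    
print(f"\nTotal Generated: {len(generated_data)}")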

## 5. Export and Merge

In [None]:
if generated_data:
    df_aug = pd.DataFrame(generated_data)
    
    # Save Augmentation Only
    df_aug.to_csv("../data/raw/generated_only.csv", index=False)
    
    # Merge with Original
    df_final = pd.concat([df, df_aug]).sample(frac=1, random_state=42).reset_index(drop=True)
    
    df_final.to_csv(OUTPUT_PATH, index=False)
    print(f"\nSaved Balanced Dataset to: {OUTPUT_PATH}")
    print(f"Final Size: {len(df_final)}")
    print(df_final['level'].value_counts())
else:
    print("No data generated.")