# Phoenix Context Collapse Parser (CCP) v2 — Fine-Tuning Notebook

This notebook fine-tunes a small language model to become the **Context Collapse Parser** — a domain-specific intent parser that translates ambiguous GTM prompts into structured **GTM Intent IR** (JSON).

## v2 Changes: Chain-of-Thought Training

**Why Chain-of-Thought?** Training on JSON output alone risks teaching the model to reproduce the format without learning the GTM semantics behind each field. v2 forces the model to **reason first, then structure**:

```
<reasoning>
- 'my accounts' indicates personally assigned accounts
- 'best' is ambiguous - could mean highest ARR, best fit, or most engaged
- Assumed sales rep role from possessive language
</reasoning>

<intent_ir>
{"intent_type": "account_discovery", ...}
</intent_ir>
```

**What CCP does:**
- Infers implied GTM context (role, motion, ICP, geography, time horizon)
- Outputs explicit reasoning chain (for interpretability)
- Outputs structured JSON with confidence scores
- Enables downstream LLMs to execute reliably via Phoenix MCP tools
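
To make the target concrete, here is a rough sketch of what a complete GTM Intent IR might look like for the prompt "Show me my best accounts". The field names are taken from the CCP system prompt defined in Cell 8; every value is a hypothetical illustration, not output from a trained model.

```python
import json

# Hypothetical Intent IR for "Show me my best accounts".
# Field names follow the CCP system prompt; all values are illustrative.
example_ir = {
    "intent_type": "account_discovery",
    "motion": "expansion",
    "role_assumption": "sales_rep",
    "account_scope": "existing",
    "icp_selector": "default",
    "icp_resolution_required": True,
    "geography_scope": None,
    "time_horizon": "this_quarter",
    "output_format": "list",
    "confidence_scores": {
        "intent_type": 0.8,
        "role_assumption": 0.7,
        "time_horizon": 0.4,  # no time signal in the prompt
    },
    "assumptions_applied": ["'my accounts' means accounts assigned to the user"],
    "clarification_needed": False,
}

# Serialize the way it would appear inside <intent_ir>...</intent_ir>
print(json.dumps(example_ir, indent=2))
```

A low score on an inferred field (here `time_horizon`) is how the IR is meant to surface ambiguity to downstream tools.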

**Environment:**
- macOS M-series with Metal/MPS
- Python 3.11
- QLoRA fine-tuning (4-bit quantization)

**See:** `PRD.md` for full architecture and `gtm_domain_knowledge.md` for training context.

## Cell 1 — Environment sanity check


In [None]:
import torch
import platform

print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())


## Cell 2 — CCP Configuration

In [None]:
# ===== CCP CONFIG =====

# Base model — recommend ~3B param model for fast inference
# Options: "microsoft/phi-2", "mistralai/Mistral-7B-Instruct-v0.2", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
BASE_MODEL_PATH = "./models/phi-2"

# Training dataset (JSONL with gtm_prompt -> reasoning -> intent_ir triplets)
# Uses Chain-of-Thought format to ensure model learns GTM semantics, not just JSON formatting
DATASET_PATH = "./data/ccp_training_with_reasoning.jsonl"

# Output directory for CCP LoRA adapter
OUTPUT_DIR = "./ccp-adapter"

# ===== GTM INTENT IR SCHEMA VERSION =====
IR_SCHEMA_VERSION = "2.0.0"  # v2 adds Chain-of-Thought reasoning

# ===== TRAINING PARAMS =====
NUM_EPOCHS = 3
LEARNING_RATE = 2e-4
BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
MAX_SEQ_LENGTH = 1536  # Increased for reasoning chains

## Cell 3 — Imports


In [None]:
import json
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model
import transformers

## Cell 4 — Load tokenizer


In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    BASE_MODEL_PATH,
    local_files_only=True,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


## Cell 5 — Load model (QLoRA, Metal-safe)

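⚠️ Important:
- We **do NOT use fp16** on macOS; compute stays in float32
- 4-bit quantization relies on bitsandbytes, which has historically targeted CUDA GPUs. Confirm that the installed bitsandbytes build supports MPS before running this cell; if it does not, drop `quantization_config` and load the model unquantized (plain LoRA)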
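
The motivation for 4-bit loading is mostly arithmetic. The sketch below estimates weight memory only, assuming roughly 2.7B parameters for phi-2; it ignores quantization constants, activations, LoRA weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PHI2_PARAMS = 2.7e9  # assumption: approximate parameter count of phi-2

fp32_gb = weight_memory_gb(PHI2_PARAMS, 32)
nf4_gb = weight_memory_gb(PHI2_PARAMS, 4)
print(f"float32 weights: ~{fp32_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.2f} GB")
```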


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    quantization_config=bnb_config,
    device_map={"": "mps"},
    local_files_only=True,
)

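model.config.use_cache = False  # the KV cache is not used with gradient checkpointing
model.gradient_checkpointing_enable()
# With a frozen base model, checkpointed blocks need inputs that require grad
# so that gradients can reach the LoRA weights
model.enable_input_require_grads()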


## Cell 6 — LoRA configuration


In [None]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
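
As a sanity check on what `print_trainable_parameters()` reports, the LoRA parameter count can be estimated by hand. The sketch below assumes phi-2 style dimensions (hidden size 2560, 32 layers, square `q_proj` and `v_proj`); these numbers are assumptions rather than values read from the checkpoint, so treat the total as an estimate.

```python
def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumptions (not read from the checkpoint): phi-2 style dimensions
HIDDEN_SIZE = 2560
NUM_LAYERS = 32
TARGETS_PER_LAYER = 2  # q_proj and v_proj

per_module = lora_param_count(r=16, d_in=HIDDEN_SIZE, d_out=HIDDEN_SIZE)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
print(f"{per_module:,} per module, {total:,} trainable LoRA parameters")
```

If the reported number differs substantially, the base model's layer names or shapes probably differ from the assumptions above.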


## Cell 7 — Load dataset


In [None]:
dataset = load_dataset(
    "json",
    data_files=DATASET_PATH,
    split="train"
)
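
Before formatting, it can help to confirm that every JSONL record carries the three v2 fields. The following is a minimal sketch using only the standard library; `find_bad_records` is a hypothetical helper, and it treats `gtm_prompt`, `reasoning`, and `intent_ir` as required even though Cell 8 tolerates legacy records.

```python
import json

REQUIRED_KEYS = ("gtm_prompt", "reasoning", "intent_ir")

def find_bad_records(lines):
    """Return (line_number, problem) pairs for records that lack v2 fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if not record.get(key)]
        if missing:
            problems.append((number, f"missing or empty: {', '.join(missing)}"))
    return problems

# Two hypothetical records: one complete, one without a reasoning chain
sample = [
    '{"gtm_prompt": "Show me my best accounts", "reasoning": "- possessive implies rep", "intent_ir": {"intent_type": "account_discovery"}}',
    '{"gtm_prompt": "accounts", "intent_ir": {"intent_type": "account_discovery"}}',
]
print(find_bad_records(sample))
```

To check the real file, pass an open file handle instead of the sample list, for example `find_bad_records(open(DATASET_PATH))`.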


## Cell 8 — GTM Intent IR Schema

Define the structured output schema that CCP must produce.

In [None]:
# GTM Intent IR Schema (v2 - with Chain-of-Thought)
GTM_INTENT_IR_SCHEMA = {
    "intent_type": [
        "account_discovery", "pipeline_analysis", "expansion_identification",
        "churn_risk_assessment", "lead_prioritization", "territory_planning",
        "forecast_review", "competitive_analysis", "engagement_summary"
    ],
    "motion": ["outbound", "inbound", "expansion", "renewal", "churn_prevention"],
    "role_assumption": ["sales_rep", "sales_manager", "revops", "marketing", "cs", "exec"],
    "account_scope": ["net_new", "existing", "churned", "all"],
    "time_horizon": ["immediate", "this_week", "this_month", "this_quarter", "this_year", "custom"],
    "output_format": ["list", "summary", "detailed", "export", "visualization"],
}

# System prompt for CCP v2 - Chain-of-Thought + JSON
CCP_SYSTEM_PROMPT = """You are the Phoenix Context Collapse Parser (CCP). Your job is to transform ambiguous GTM (Go-To-Market) prompts into structured GTM Intent IR.

IMPORTANT: You must REASON about the request before producing structured output. This ensures you understand the GTM semantics, not just the JSON format.

Given a user's GTM request:
1. First, analyze the prompt in a <reasoning> section:
   - Identify explicit signals (keywords, jargon, metrics mentioned)
   - Infer implicit context (role, motion, time horizon)
   - Note any ambiguities that affect confidence
   - Explain WHY you chose each field value

2. Then, output the structured intent in an <intent_ir> section as JSON with:
   - intent_type: The primary intent category
   - motion: The GTM motion (outbound, expansion, renewal, etc.)
   - role_assumption: Inferred user role
   - account_scope: Which accounts (net_new, existing, all)
   - icp_selector: Which ICP to apply (default, or specific product/segment)
   - icp_resolution_required: true if ICP needs downstream resolution
   - geography_scope: Geographic filter if mentioned (null if global)
   - time_horizon: Time scope for the request
   - output_format: How results should be presented
   - confidence_scores: 0.0-1.0 confidence for each inferred field
   - assumptions_applied: List of assumptions made
   - clarification_needed: true if request is too ambiguous

Format your response EXACTLY as:
<reasoning>
[Your analysis here]
</reasoning>

<intent_ir>
[Valid JSON here]
</intent_ir>"""


def format_ccp_example(example):
    """Format training example for CCP v2: GTM prompt -> Reasoning -> Intent IR JSON"""
    prompt = example["gtm_prompt"].strip()

    # Get reasoning chain (required in v2). Keys missing from a JSONL record
    # are loaded as None, so guard before calling .strip()
    reasoning = (example.get("reasoning") or "").strip()
    if not reasoning:
        # Fallback for legacy data without reasoning
        reasoning = "- Analyzing prompt for GTM signals\n- Inferring context from keywords"

    # Build the IR output
    ir_json = example.get("intent_ir")
    if ir_json is not None:
        if isinstance(ir_json, str):
            ir_output = ir_json.strip()
        else:
            ir_output = json.dumps(ir_json, indent=2)
    else:
        # Legacy format compatibility
        ir_output = (example.get("response") or "{}").strip()
    
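    # Format as instruction -> reasoning -> JSON output (Chain-of-Thought).
    # Note: <s>, </s>, and [INST] follow the Mistral instruct template; tokenizers
    # without those special tokens (such as phi-2's) treat them as plain text.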
    text = f"""<s>[INST] {CCP_SYSTEM_PROMPT}

User request: {prompt} [/INST]
<reasoning>
{reasoning}
</reasoning>

<intent_ir>
{ir_output}
</intent_ir></s>"""
    
    return {"text": text}

dataset = dataset.map(format_ccp_example)

## Cell 9 — Tokenization


In [None]:
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        padding=False,
    )

tokenized_dataset = dataset.map(
    tokenize,
    remove_columns=dataset.column_names
)
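
Because `truncation=True` silently cuts long examples, a truncated record can lose its closing `</intent_ir>` tag. The sketch below shows one way to flag such examples before training. `find_truncated` is a hypothetical helper, and the whitespace word counter is only a stand-in for the real tokenizer.

```python
from typing import Callable, Iterable

def find_truncated(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_length: int,
) -> list[int]:
    """Return indices of texts whose token count exceeds max_length."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_length]

# Stand-in counter for illustration only: whitespace-separated words.
# With the real tokenizer, use: lambda t: len(tokenizer(t)["input_ids"])
def word_count(text: str) -> int:
    return len(text.split())

texts = ["short example", "a much longer example with many more words in it"]
print(find_truncated(texts, word_count, max_length=5))
```

Any index it returns points at an example worth shortening, or a reason to raise `MAX_SEQ_LENGTH`.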

## Cell 10 — Data collator


In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)


## Cell 11 — Training arguments


In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    report_to="none",
    optim="adamw_torch",
    bf16=False,
    fp16=False,
)
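
With `BATCH_SIZE = 1` and `GRAD_ACCUM_STEPS = 8`, each optimizer step sees 8 examples. The sketch below works through the resulting step counts for a hypothetical dataset of 1,000 examples; the dataset size is an assumption, and the exact rounding of steps per epoch may differ slightly between transformers versions.

```python
import math

def training_schedule(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> dict:
    """Approximate optimizer-step counts for a single-device run."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

# Hypothetical dataset of 1,000 examples with the notebook's settings
schedule = training_schedule(num_examples=1000, batch_size=1, grad_accum=8, epochs=3)
print(schedule)
```

In this hypothetical the run finishes in fewer steps than `save_steps=500`, so no intermediate checkpoint would be written; lower `save_steps` if mid-run checkpoints matter.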


## Cell 12 — Trainer


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)


## Cell 13 — Train CCP

Training a ~3B model with QLoRA on GTM intent parsing.
Expect 1-3 hours depending on dataset size.

In [None]:
trainer.train()


## Cell 14 — Save adapter

This saves the **LoRA adapter weights only**. The base model is not written and must be loaded separately at inference time.


In [None]:
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)


## Cell 15 — CCP Inference & Validation

Test the trained CCP model with GTM prompts and validate JSON output.
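
The extraction step can be tried without a model. The sketch below applies the same tag-based regular expressions used in the cell that follows to a hand-written response; `extract_sections` is a hypothetical helper, and the sample response is illustrative rather than real model output.

```python
import json
import re

def extract_sections(response: str) -> tuple[str, str]:
    """Return (reasoning, intent_ir_json); a section is '' when its tags are absent."""
    def section(tag: str) -> str:
        match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", response, re.DOTALL)
        return match.group(1) if match else ""
    return section("reasoning"), section("intent_ir")

sample_response = """<reasoning>
- 'my accounts' indicates personally assigned accounts
</reasoning>

<intent_ir>
{"intent_type": "account_discovery", "clarification_needed": false}
</intent_ir>"""

reasoning, ir_json = extract_sections(sample_response)
intent_ir = json.loads(ir_json)
print(reasoning)
print(intent_ir["intent_type"])
```

A response cut off before `</intent_ir>` yields an empty JSON section here, which is the typical symptom of `max_new_tokens` being too small.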

In [None]:
import re

def parse_gtm_intent(user_prompt: str, max_new_tokens: int = 800) -> dict:
    """
    Run CCP v2 inference: GTM prompt -> Reasoning + Intent IR JSON
    
    Returns a dict with:
    - reasoning: The model's reasoning chain (for interpretability)
    - intent_ir: The structured intent (for downstream use)
    - _valid: Whether the JSON parsed successfully
    - _raw: Raw model output for debugging
    """
    input_text = f"<s>[INST] {CCP_SYSTEM_PROMPT}\n\nUser request: {user_prompt} [/INST]\n"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    model.eval()  # training leaves the model in train mode; disable LoRA dropout
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,  # Low temp for structured output
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens. This avoids splitting on
    # "[/INST]", which breaks if the user prompt itself contains that marker.
    response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    # Remove trailing </s> if present
    response = response.replace("</s>", "").strip()
    
    # Extract reasoning section
    reasoning_match = re.search(r'<reasoning>\s*(.*?)\s*</reasoning>', response, re.DOTALL)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    
    # Extract intent_ir section
    ir_match = re.search(r'<intent_ir>\s*(.*?)\s*</intent_ir>', response, re.DOTALL)
    json_str = ir_match.group(1).strip() if ir_match else response
    
    # Validate and parse JSON
    try:
        intent_ir = json.loads(json_str)
        return {
            "reasoning": reasoning,
            "intent_ir": intent_ir,
            "_valid": True,
            "_raw": response
        }
    except json.JSONDecodeError as e:
        return {
            "reasoning": reasoning,
            "intent_ir": {},
            "_valid": False,
            "_error": str(e),
            "_raw": response
        }


def validate_intent_ir(result: dict) -> list[str]:
    """Validate IR against schema, return list of issues"""
    issues = []
    
    if not result.get("_valid", False):
        issues.append(f"Invalid JSON: {result.get('_error', 'unknown')}")
        return issues
    
    ir = result.get("intent_ir", {})
    
    # Check required fields
    required = ["intent_type", "motion", "role_assumption", "account_scope"]
    for field in required:
        if field not in ir:
            issues.append(f"Missing required field: {field}")
    
    # Validate enum values
    for field, valid_values in GTM_INTENT_IR_SCHEMA.items():
        if field in ir and ir[field] not in valid_values:
            issues.append(f"Invalid {field}: {ir[field]} (valid: {valid_values})")
    
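    # Check confidence scores (must be an object of numbers in [0, 1])
    if "confidence_scores" in ir:
        scores = ir["confidence_scores"]
        if not isinstance(scores, dict):
            issues.append("confidence_scores must be an object mapping field -> score")
            scores = {}
        for field, score in scores.items():
            # bool is a subclass of int, so reject it explicitly
            if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
                issues.append(f"Invalid confidence score for {field}: {score}")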
    
    # Check reasoning quality (new in v2)
    reasoning = result.get("reasoning", "")
    if not reasoning:
        issues.append("Missing reasoning chain - model should explain its inferences")
    elif len(reasoning) < 50:
        issues.append("Reasoning chain too short - may indicate superficial analysis")
    
    return issues


# Standard GTM test prompts
TEST_PROMPTS = [
    "Show me my best accounts",
    "Which deals are at risk this quarter?",
    "Find me companies like Acme Corp",
    "I need to hit my number, what should I focus on?",
    "Give me expansion opportunities in EMEA",
]

# Adversarial test prompts (should trigger high uncertainty or clarification)
ADVERSARIAL_PROMPTS = [
    "accounts",  # Too minimal - should need clarification
    "Show me the zorbax metrics",  # Made-up jargon - should show uncertainty  
    "What's the weather like?",  # Non-GTM - should flag as wrong domain
    "Make me a sandwich",  # Completely unrelated - should fail gracefully
]

print("=" * 60)
print("CCP v2 INFERENCE TEST - Chain-of-Thought")
print("=" * 60)

print("\n### STANDARD GTM PROMPTS ###\n")
for prompt in TEST_PROMPTS:
    print(f"Prompt: {prompt}")
    print("-" * 40)
    
    result = parse_gtm_intent(prompt)
    issues = validate_intent_ir(result)
    
    # Show reasoning first (key for v2)
    print("REASONING:")
    print(result.get("reasoning", "(no reasoning)"))
    print()
    
    if result.get("_valid"):
        # Pretty print the IR
        print("INTENT IR:")
        print(json.dumps(result["intent_ir"], indent=2))
    else:
        print(f"[INVALID JSON] {result.get('_raw', '')[:200]}")
    
    if issues:
        print(f"\nValidation issues: {issues}")
    print("=" * 60)

print("\n### ADVERSARIAL PROMPTS (Should Show Uncertainty) ###\n")
for prompt in ADVERSARIAL_PROMPTS:
    print(f"Prompt: {prompt}")
    print("-" * 40)
    
    result = parse_gtm_intent(prompt)
    
    print("REASONING:")
    print(result.get("reasoning", "(no reasoning)"))
    print()
    
    if result.get("_valid"):
        ir = result["intent_ir"]
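        # Check if model appropriately flagged uncertainty
        clarification = ir.get("clarification_needed", False)
        scores = ir.get("confidence_scores")
        if not isinstance(scores, dict):
            scores = {}
        # Skip non-numeric scores so a malformed IR cannot raise TypeError
        low_confidence = any(
            isinstance(score, (int, float)) and score < 0.5
            for score in scores.values()
        )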
        
        if clarification or low_confidence:
            print("[GOOD] Model showed appropriate uncertainty")
        else:
            print("[WARNING] Model may be over-confident on ambiguous input")
        
        print("INTENT IR:")
        print(json.dumps(ir, indent=2))
    else:
        print(f"[INVALID] {result.get('_raw', '')[:200]}")
    print("=" * 60)

## Cell 16 — Save CCP Adapter

Re-saves the LoRA adapter (as in Cell 14) and writes `ccp_metadata.json` next to it, recording the schema version, base model, and output format.

In [None]:
import os

# Save adapter
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Save CCP v2 metadata (includes Chain-of-Thought info)
ccp_metadata = {
    "schema_version": IR_SCHEMA_VERSION,
    "base_model": BASE_MODEL_PATH,
    "training_approach": "chain_of_thought",
    "output_format": {
        "reasoning": "<reasoning>...</reasoning>",
        "intent_ir": "<intent_ir>{JSON}</intent_ir>"
    },
    "intent_types": GTM_INTENT_IR_SCHEMA["intent_type"],
    "motions": GTM_INTENT_IR_SCHEMA["motion"],
    "role_assumptions": GTM_INTENT_IR_SCHEMA["role_assumption"],
    "adversarial_test_cases": ADVERSARIAL_PROMPTS,
}

with open(os.path.join(OUTPUT_DIR, "ccp_metadata.json"), "w") as f:
    json.dump(ccp_metadata, f, indent=2)

print(f"CCP v2 adapter saved to {OUTPUT_DIR}")
print(f"Schema version: {IR_SCHEMA_VERSION}")
print("Training approach: Chain-of-Thought + JSON")
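
A consumer of the adapter can use the metadata file to guard against schema drift. The sketch below assumes the consumer only cares about the major version of `schema_version`; `is_compatible` is a hypothetical helper, and the demo writes a throwaway file shaped like `ccp_metadata.json` rather than reading the real one.

```python
import json
import tempfile
from pathlib import Path

def is_compatible(metadata_path: str, expected_major: int) -> bool:
    """True when the adapter's schema major version matches what the consumer expects."""
    metadata = json.loads(Path(metadata_path).read_text())
    major = int(str(metadata["schema_version"]).split(".")[0])
    return major == expected_major

# Demo with a throwaway metadata file shaped like the one written above
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ccp_metadata.json"
    path.write_text(json.dumps({"schema_version": "2.0.0"}))
    v2_ok = is_compatible(str(path), expected_major=2)
    v1_ok = is_compatible(str(path), expected_major=1)
print(v2_ok, v1_ok)
```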