# AAIPL Fine-Tuning Pipeline: Qwen3-4B

**Target: Win the league on MI300X (192GB). ~30 min total.**

| Phase | Time |
|-------|------|
| 0. Copy model | 2 min |
| 1. Generate 400 MCQs (100/topic) | 12 min |
| 2. Fine-tune A-Agent | 5 min |
| 3. Fine-tune Q-Agent | 5 min |
| 4. Test + Push | 5 min |

**Strategy:**
- **Adaptive verification**: 2-way verify for Seating/Family/Series (works well); skip for Syllogisms (model can't self-solve)
- **Answer hint rotation**: balanced A/B/C/D
- **Simple Syllogisms prompt**: high JSON success rate
- **Robust JSON extraction**: multi-strategy parsing + auto-fix

**Constraints:** Q-Agent <13s, A-Agent <9s, ≥50% filter pass rate.

---
## Phase 0: Setup (Copy the Base Model)

In [1]:
# Find the Qwen3-4B snapshot hash
!ls /root/.cache/huggingface/models--Qwen--Qwen3-4B/snapshots/

1cfa9a7208912126459214e8b04321603b3df60c


In [2]:
# Copy the base model to hf_models/ (dereference symlinks with -L)
!mkdir -p ./hf_models/Qwen3-4B
!cp -rL /root/.cache/huggingface/models--Qwen--Qwen3-4B/snapshots/*/. ./hf_models/Qwen3-4B/
!ls ./hf_models/Qwen3-4B/

LICENSE				  model-00002-of-00003.safetensors
README.md			  model-00003-of-00003.safetensors
config.json			  model.safetensors.index.json
generation_config.json		  tokenizer.json
merges.txt			  tokenizer_config.json
model-00001-of-00003.safetensors  vocab.json
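
The listing shows three shards, but it does not confirm that every shard named in `model.safetensors.index.json` was actually copied. A minimal sketch of that check follows, assuming the usual sharded-checkpoint layout in which the index file holds a `weight_map` from tensor names to shard filenames. `missing_shards` is a hypothetical helper, not part of the notebook.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames named in the index but absent from model_dir."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # weight_map maps each tensor name to the shard file that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

After a complete copy, `missing_shards("./hf_models/Qwen3-4B")` should return an empty list.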


In [3]:
# Quick sanity check: load and verify the base model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./hf_models/Qwen3-4B",
    max_seq_length=1024,
    dtype=None,
    load_in_4bit=False,
    device_map="auto",
)
print(f"Model: {model.config._name_or_path}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")
del model, tokenizer
import gc, torch
gc.collect()
torch.cuda.empty_cache()
print("Base model verified and unloaded.")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 08:35:46 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled 

[2026-02-15 08:35:50] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model: ./hf_models/Qwen3-4B
Parameters: 4.0B
Base model verified and unloaded.


---
## Phase 1: Generate Synthetic Training Data

Qwen3-4B as teacher → **400 MCQs** (100/topic) is the plan; the run recorded below sets `QUESTIONS_PER_TOPIC = 50`, so it requests 200.

- `enable_thinking=False` to prevent `<think>` tags
- **Adaptive verification**: 2-way verify for Seating/Family/Series, skip for Syllogisms
- Answer hint rotation for balanced A/B/C/D distribution
- Robust JSON extraction (markdown blocks, brace matching, auto-fix)
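
The extraction order named in the last bullet can be tried on sample strings with a condensed stand-in. This is a sketch under stated assumptions, not the notebook's own function: `extract_json_sketch` and `FENCE` are hypothetical names, and only the trailing-comma repair is included.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def extract_json_sketch(raw: str):
    """Try a fenced block first, then the outermost braces, each with a light repair."""
    candidates = []
    # Strategy 1: contents of the first fenced block, with an optional "json" tag.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, flags=re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1).strip())
    # Strategy 2: everything from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}") + 1
    if 0 <= start < end:
        candidates.append(raw[start:end])
    for cand in candidates:
        # Strategy 3 (second pass): drop trailing commas before "}" or "]".
        for text in (cand, re.sub(r",\s*([}\]])", r"\1", cand)):
            try:
                return json.loads(text)
            except json.JSONDecodeError:
                continue
    return None


print(extract_json_sketch('Sure! {"answer": "C",} thanks'))
print(extract_json_sketch("no json here"))
```

Returning `None` instead of raising lets a caller count the response as a failed attempt and move on.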

In [4]:
# ========== LOAD QWEN3-4B AS TEACHER + ADAPTIVE VERIFICATION ==========
import json, time, random, re, gc, torch
from pathlib import Path
from collections import Counter
from unsloth import FastLanguageModel

TEACHER_PATH = "./hf_models/Qwen3-4B"

teacher_model, teacher_tokenizer = FastLanguageModel.from_pretrained(
    model_name=TEACHER_PATH,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,
    device_map="auto",
    trust_remote_code=True,
)
FastLanguageModel.for_inference(teacher_model)

if teacher_tokenizer.pad_token is None:
    teacher_tokenizer.pad_token = teacher_tokenizer.eos_token
    teacher_tokenizer.pad_token_id = teacher_tokenizer.eos_token_id
teacher_tokenizer.padding_side = "left"

BATCH_SIZE = 32

def query_teacher_batch(system_prompts, user_prompts, temperature=0.7, max_tokens=512):
    """Batched inference with Qwen3 <think> tag handling."""
    messages_list = []
    for sys_p, usr_p in zip(system_prompts, user_prompts):
        messages_list.append([
            {"role": "system", "content": sys_p},
            {"role": "user", "content": usr_p}
        ])
    
    texts = [teacher_tokenizer.apply_chat_template(
        m, tokenize=False, add_generation_prompt=True,
        enable_thinking=False
    ) for m in messages_list]
    
    inputs = teacher_tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=1536
    ).to(teacher_model.device)
    
    with torch.no_grad():
        outputs = teacher_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=teacher_tokenizer.pad_token_id,
        )
    
    input_len = inputs["input_ids"].shape[1]
    responses = []
    for output in outputs:
        raw = teacher_tokenizer.decode(output[input_len:], skip_special_tokens=True)
        raw = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
        responses.append(raw)
    return responses

def verify_answer(question_text: str, choices: list, num_rounds: int = 2) -> tuple:
    """Ask the teacher to solve an MCQ multiple times. Returns (majority_answer, confidence, reasoning).
    Returns (None, 0.0, '') if no majority."""
    choices_str = " ".join(choices)
    sys_p = "You are an expert. Answer the MCQ. Output ONLY JSON: {\"answer\": \"A/B/C/D\", \"reasoning\": \"brief\"}"
    usr_p = f"Question: {question_text}\nChoices: {choices_str}\n\nOutput JSON only."
    
    responses = query_teacher_batch([sys_p] * num_rounds, [usr_p] * num_rounds, temperature=0.3, max_tokens=150)
    
    answers = []
    reasonings = []
    for raw in responses:
        raw = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
        try:
            start = raw.find('{')
            end = raw.rfind('}') + 1
            if start >= 0 and end > start:
                parsed = json.loads(raw[start:end])
                if "answer" in parsed:
                    ans = str(parsed["answer"]).strip()[0].upper()
                    if ans in "ABCD":
                        answers.append(ans)
                        reasonings.append(str(parsed.get("reasoning", "")))
        except (json.JSONDecodeError, IndexError, KeyError):
            pass
    
    if not answers:
        return None, 0.0, ""
    
    counts = Counter(answers)
    majority_ans, majority_count = counts.most_common(1)[0]
    confidence = majority_count / len(answers)
    
    best_reasoning = ""
    for a, r in zip(answers, reasonings):
        if a == majority_ans and len(r) > len(best_reasoning):
            best_reasoning = r
    
    return majority_ans, confidence, best_reasoning

# Topics that should use verification (model CAN self-solve these)
VERIFY_TOPICS = {"Seating Arrangements (Linear, Circular)", "Family tree logic", "Mixed Series (Alphanumeric)"}
# Syllogisms: NO verification (model always defaults to "D")
SKIP_VERIFY_TOPICS = {"Syllogisms"}

# Quick test
t0 = time.time()
batch_test = query_teacher_batch(
    ["You are helpful."] * BATCH_SIZE,
    [f"What is {i+1} + {i+1}?" for i in range(BATCH_SIZE)],
    max_tokens=20
)
t1 = time.time()
print(f"Batch {BATCH_SIZE} in {t1-t0:.1f}s ({(t1-t0)/BATCH_SIZE:.2f}s each)")
print(f"Verify topics: {VERIFY_TOPICS}")
print(f"Skip verify: {SKIP_VERIFY_TOPICS}")
print(f"GPU: {torch.cuda.memory_allocated()/1024**3:.1f} GiB")

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


[2026-02-15 08:36:01] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  out = torch_matmul(X, W.t(), out = out)


Batch 32 in 9.6s (0.30s each)
Verify topics: {'Family tree logic', 'Mixed Series (Alphanumeric)', 'Seating Arrangements (Linear, Circular)'}
Skip verify: {'Syllogisms'}
GPU: 10.2 GiB
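
The vote inside `verify_answer` is easier to reason about in isolation. The sketch below reproduces only the counting step; `majority_confidence` is a hypothetical name. It shows that with two rounds the confidence can only be 1.0 or 0.5, so a caller that rejects on `confidence < 0.5` still accepts a 1-1 split.

```python
from collections import Counter
from typing import Optional


def majority_confidence(answers: list[str]) -> tuple[Optional[str], float]:
    """Return the most common letter and its share of the parsed answers."""
    if not answers:
        return None, 0.0
    letter, count = Counter(answers).most_common(1)[0]
    return letter, count / len(answers)


print(majority_confidence(["B", "B"]))  # both rounds agree
print(majority_confidence(["B", "C"]))  # rounds disagree; the first-seen letter wins the tie
print(majority_confidence(["C"]))       # only one round produced parseable JSON
```

If split votes should be rejected, one option is to require a confidence of 1.0, or to use an odd number of rounds so that a real majority exists.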


In [8]:
# ========== TOPICS CONFIG (50 per topic = 200 total for this run; the plan targets 100 per topic) ==========

QUESTIONS_PER_TOPIC = 50

TOPICS_CONFIG = {
    "Syllogisms": {
        "count": QUESTIONS_PER_TOPIC,
        "parent": "Logical Reasoning",
        "verify": False,  # Model can't self-solve syllogisms: always says D
        "system": "You create syllogism MCQ problems. Output ONLY valid JSON, no other text.",
        "prompt_template": """Create a syllogism MCQ with {num_statements} statements and {num_conclusions} conclusions.
Use quantifiers: All, Some, No, Some...not.
The correct answer MUST be "{answer_hint}".

Output ONLY this JSON:
{{"topic": "Logical Reasoning/Syllogisms", "question": "Statement I: ...\\nStatement II: ...\\nConclusion I: ...\\nConclusion II: ...", "choices": ["A) Only conclusion I follows", "B) Only conclusion II follows", "C) Both I and II follow", "D) Neither I nor II follows"], "answer": "{answer_hint}", "explanation": "brief reason"}}"""
    },
    "Seating Arrangements (Linear, Circular)": {
        "count": QUESTIONS_PER_TOPIC,
        "parent": "Puzzles",
        "verify": True,  # 2-way verification works well here
        "system": "You create seating arrangement MCQ puzzles. Output ONLY valid JSON, no other text.",
        "prompt_template": """Create a {arrangement_type} seating arrangement MCQ with {num_people} people.
Include positional constraints and facing directions. The correct answer is "{answer_hint}".

Output ONLY this JSON:
{{"topic": "Puzzles/Seating Arrangements (Linear, Circular)", "question": "full question with constraints", "choices": ["A) option1", "B) option2", "C) option3", "D) option4"], "answer": "{answer_hint}", "explanation": "step-by-step deduction"}}"""
    },
    "Family tree logic": {
        "count": QUESTIONS_PER_TOPIC,
        "parent": "Blood Relations and Family Tree",
        "verify": True,
        "system": "You create blood relation MCQ puzzles. Output ONLY valid JSON, no other text.",
        "prompt_template": """Create a blood relation MCQ with a chain of {num_relations} family relationships.
Use indirect descriptions. The correct answer is "{answer_hint}".

Output ONLY this JSON:
{{"topic": "Blood Relations and Family Tree/Family tree logic", "question": "full question", "choices": ["A) relation1", "B) relation2", "C) relation3", "D) relation4"], "answer": "{answer_hint}", "explanation": "step-by-step chain"}}"""
    },
    "Mixed Series (Alphanumeric)": {
        "count": QUESTIONS_PER_TOPIC,
        "parent": "Series and Patterns",
        "verify": True,
        "system": "You create number/letter series MCQ problems. Output ONLY valid JSON, no other text.",
        "prompt_template": """Create a {series_type} series MCQ with {num_elements} elements using a {pattern_type} pattern.
The correct answer is "{answer_hint}".

Output ONLY this JSON:
{{"topic": "Series and Patterns/Mixed Series (Alphanumeric)", "question": "series question", "choices": ["A) opt1", "B) opt2", "C) opt3", "D) opt4"], "answer": "{answer_hint}", "explanation": "pattern explanation"}}"""
    }
}

# Answer distribution tracker
answer_counters = {topic: {"A": 0, "B": 0, "C": 0, "D": 0} for topic in TOPICS_CONFIG}

def get_answer_hint(topic: str) -> str:
    """Return the least-used answer letter for balanced distribution."""
    counts = answer_counters[topic]
    min_count = min(counts.values())
    least_used = [l for l, c in counts.items() if c == min_count]
    return random.choice(least_used)

def randomize_params(topic):
    params = {"answer_hint": get_answer_hint(topic)}
    if topic == "Syllogisms":
        params.update({"num_statements": random.choice([2, 3]), "num_conclusions": random.choice([2, 3])})
    elif topic == "Seating Arrangements (Linear, Circular)":
        params.update({"arrangement_type": random.choice(["linear", "circular"]), "num_people": random.choice([5, 6, 7, 8])})
    elif topic == "Family tree logic":
        params.update({"num_relations": random.choice([3, 4, 5, 6])})
    elif topic == "Mixed Series (Alphanumeric)":
        params.update({
            "series_type": random.choice(["alphanumeric", "number", "mixed"]),
            "num_elements": random.choice([5, 6, 7]),
            "pattern_type": random.choice(["arithmetic", "alternating", "geometric", "fibonacci"])
        })
    return params

print(f"Total to generate: {sum(t['count'] for t in TOPICS_CONFIG.values())}")
for t, c in TOPICS_CONFIG.items():
    print(f"  {t}: {c['count']} | verify={c['verify']}")

Total to generate: 200
  Syllogisms: 50 | verify=False
  Seating Arrangements (Linear, Circular): 50 | verify=True
  Family tree logic: 50 | verify=True
  Mixed Series (Alphanumeric): 50 | verify=True
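
The least-used rotation in `get_answer_hint` balances the letters only when each accepted answer equals the hint it was given. A small simulation of that ideal case follows; `least_used_letter` is a hypothetical stand-in that takes the counter explicitly so it can run on its own.

```python
import random


def least_used_letter(counts: dict[str, int], rng: random.Random) -> str:
    """Pick uniformly among the letters with the lowest count so far."""
    lowest = min(counts.values())
    return rng.choice([letter for letter, c in counts.items() if c == lowest])


rng = random.Random(0)
counts = {"A": 0, "B": 0, "C": 0, "D": 0}
for _ in range(40):
    hint = least_used_letter(counts, rng)
    counts[hint] += 1  # assumption: the accepted answer always equals the hint
print(counts)  # every letter ends at 10
```

In the notebook the counter is updated with the final answer, which verification may overwrite, so the real distribution can drift away from this ideal.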


In [9]:
# ========== GENERATE TRAINING DATA: ADAPTIVE VERIFICATION ==========
Path("training_data").mkdir(exist_ok=True)
all_a_agent_data = []
all_q_agent_data = []

# --- Robust JSON extraction ---
def extract_json(raw: str) -> dict:
    """Parse the first JSON object found in raw; return None if every strategy fails."""
    raw = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
    
    if '```json' in raw:
        try:
            block = raw.split('```json')[1].split('```')[0].strip()
            return json.loads(block)
        except (json.JSONDecodeError, IndexError):
            pass
    if '```' in raw:
        try:
            block = raw.split('```')[1].split('```')[0].strip()
            if block.startswith('json'):
                block = block[4:].strip()
            return json.loads(block)
        except (json.JSONDecodeError, IndexError):
            pass
    
    start = raw.find('{')
    end = raw.rfind('}') + 1
    if start >= 0 and end > start:
        candidate = raw[start:end]
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
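        # Last-resort repairs: swap single quotes for double quotes and drop trailing commas.
        # The quote swap also rewrites apostrophes inside string values, so it can still fail.
        fixed = candidate.replace("'", '"')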
        fixed = re.sub(r',\s*}', '}', fixed)
        fixed = re.sub(r',\s*]', ']', fixed)
        try:
            return json.loads(fixed)
        except json.JSONDecodeError:
            pass
    return None

# --- Validation with auto-fix ---
def validate_and_fix(parsed: dict) -> tuple:
    if not isinstance(parsed, dict):
        return False, "Not a dict"
    for key in ["question", "choices", "answer"]:
        if key not in parsed:
            return False, f"Missing: {key}"
    if not isinstance(parsed["choices"], list) or len(parsed["choices"]) != 4:
        return False, "Need 4 choices"
    
    labels = ["A", "B", "C", "D"]
    fixed = []
    for i, c in enumerate(parsed["choices"]):
        if not isinstance(c, str) or len(c.strip()) < 1:
            return False, f"Empty choice {i}"
        c = c.strip()
        if len(c) < 2 or c[1] != ')' or c[0].upper() not in "ABCD":
            c = f"{labels[i]}) {c}"
        if c[0].upper() != labels[i]:
            text = c[3:].strip() if len(c) > 3 and c[1] == ')' else c
            c = f"{labels[i]}) {text}"
        fixed.append(c)
    parsed["choices"] = fixed
    
    texts = [c[3:].strip().lower() for c in fixed if len(c) > 3]
    if len(set(texts)) < 3:
        return False, "Too many duplicate choices"
    
    ans = str(parsed["answer"]).strip()
    if len(ans) >= 1 and ans[0].upper() in "ABCD":
        parsed["answer"] = ans[0].upper()
    else:
        return False, f"Bad answer: {ans}"
    
    if not isinstance(parsed["question"], str) or len(parsed["question"].strip()) < 10:
        return False, "Question too short"
    
    if "explanation" not in parsed or not parsed.get("explanation") or len(str(parsed.get("explanation", ""))) < 3:
        parsed["explanation"] = "Analyze systematically to find the correct answer."
    
    return True, "OK"

# --- Dedup ---
def is_unique(new_q: str, existing: list, threshold=0.8) -> bool:
    words = set(new_q.lower().split())
    if len(words) < 3:
        return True
    for eq in existing[-80:]:
        ew = set(eq.lower().split())
        if not ew: continue
        overlap = len(words & ew) / max(len(words | ew), 1)
        if overlap > threshold:
            return False
    return True

# ===== MAIN GENERATION LOOP =====
existing_qs = []
gen_start = time.time()
verified_count = 0
skipped_low_conf = 0
direct_accept = 0

for topic, config in TOPICS_CONFIG.items():
    use_verify = config.get("verify", False)
    mode = "2-way VERIFY" if use_verify else "DIRECT (trust hint)"
    
    print(f"\n{'='*60}")
    print(f"Generating {config['count']} for: {topic} [{mode}]")
    print(f"{'='*60}")
    
    topic_data = []
    fails = 0
    max_attempts = config["count"] * 6
    attempts = 0
    
    while len(topic_data) < config["count"] and attempts < max_attempts:
        needed = min(BATCH_SIZE, config["count"] - len(topic_data) + 8)
        batch_sys = [config["system"]] * needed
        batch_usr = [config["prompt_template"].format(**randomize_params(topic)) for _ in range(needed)]
        
        responses = query_teacher_batch(batch_sys, batch_usr, temperature=0.7, max_tokens=512)
        attempts += len(responses)
        
        for raw in responses:
            if len(topic_data) >= config["count"]:
                break
            
            parsed = extract_json(raw)
            if parsed is None:
                fails += 1
                continue
            
            valid, reason = validate_and_fix(parsed)
            if not valid:
                fails += 1
                continue
            
            if not is_unique(parsed["question"], existing_qs):
                fails += 1
                continue
            
            # === ADAPTIVE VERIFICATION ===
            if use_verify:
                # 2-way verification for Seating/Family/Series
                v_ans, v_conf, v_reasoning = verify_answer(parsed["question"], parsed["choices"], num_rounds=2)
                
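                if v_ans is None or v_conf < 0.5:
                    # Verification failed: still accept the question with the generator's
                    # answer (the hint) rather than discard it.
                    # Note: with num_rounds=2 the confidence is either 1.0 or 0.5, so
                    # `v_conf < 0.5` never fires; a 1-1 split falls through to the else branch.
                    skipped_low_conf += 1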
                else:
                    # Use verified answer + better reasoning
                    parsed["answer"] = v_ans
                    if v_reasoning and len(v_reasoning) > len(str(parsed.get("explanation", ""))):
                        parsed["explanation"] = v_reasoning
                    verified_count += 1
            else:
                # Syllogisms: trust the generator's answer (= the hint we gave it)
                direct_accept += 1
            
            existing_qs.append(parsed["question"])
            answer = parsed["answer"]
            explanation = str(parsed.get("explanation", "Solve step by step."))[:400]
            answer_counters[topic][answer] += 1
            
            # A-Agent training example
            choices_str = " ".join(parsed["choices"])
            all_a_agent_data.append({"conversations": [
                {"role": "user", "content": f"Question: {parsed['question']}\nChoices: {choices_str}\n\nSolve step by step and output JSON: {{\"answer\": \"<letter>\", \"reasoning\": \"<brief>\"}}"},
                {"role": "assistant", "content": json.dumps({"answer": answer, "reasoning": explanation})}
            ]})
            
            # Q-Agent training example
            full_topic = f"{config['parent']}/{topic}"
            all_q_agent_data.append({"conversations": [
                {"role": "user", "content": f"Generate a difficult MCQ on topic: {full_topic}. Output ONLY valid JSON."},
                {"role": "assistant", "content": json.dumps({"topic": full_topic, "question": parsed["question"], "choices": parsed["choices"], "answer": answer, "explanation": explanation})}
            ]})
            
            topic_data.append(parsed)
        
        elapsed = time.time() - gen_start
        rate = len(existing_qs) / elapsed * 60 if elapsed > 0 else 0
        dist = answer_counters[topic]
        dist_str = "/".join(f"{dist[l]}" for l in "ABCD")
        print(f"  [{topic}] {len(topic_data)}/{config['count']} | {rate:.0f} q/min | fails: {fails} | A/B/C/D: {dist_str}")
    
    safe = topic.replace(' ', '_').replace('/', '_').replace('(', '').replace(')', '')
    with open(f"training_data/{safe}.json", 'w') as f:
        json.dump(topic_data, f, indent=2)
    print(f"  DONE: {len(topic_data)} / {attempts} attempts ({fails} fails)")

# Save combined
with open("training_data/a_agent_train.json", 'w') as f:
    json.dump(all_a_agent_data, f, indent=2)
with open("training_data/q_agent_train.json", 'w') as f:
    json.dump(all_q_agent_data, f, indent=2)

total = time.time() - gen_start
print(f"\n{'='*60}")
print(f"DONE in {total/60:.1f} min | Total: {len(all_a_agent_data)}")
print(f"Verified (Seating/Family/Series): {verified_count}")
print(f"Direct accept (Syllogisms): {direct_accept}")
print(f"Low-conf fallback: {skipped_low_conf}")
print(f"Answer distribution:")
for label in "ABCD":
    t = sum(answer_counters[tp][label] for tp in TOPICS_CONFIG)
    print(f"  {label}: {t}")
print(f"{'='*60}")


Generating 50 for: Syllogisms [DIRECT (trust hint)]
  [Syllogisms] 9/50 | 188 q/min | fails: 23 | A/B/C/D: 3/2/1/3
  [Syllogisms] 9/50 | 93 q/min | fails: 55 | A/B/C/D: 3/2/1/3
  [Syllogisms] 11/50 | 75 q/min | fails: 85 | A/B/C/D: 3/2/3/3
  [Syllogisms] 15/50 | 75 q/min | fails: 113 | A/B/C/D: 3/6/3/3
  [Syllogisms] 17/50 | 68 q/min | fails: 143 | A/B/C/D: 3/6/3/5
  [Syllogisms] 17/50 | 57 q/min | fails: 175 | A/B/C/D: 3/6/3/5
  [Syllogisms] 17/50 | 49 q/min | fails: 207 | A/B/C/D: 3/6/3/5
  [Syllogisms] 18/50 | 45 q/min | fails: 238 | A/B/C/D: 4/6/3/5
  [Syllogisms] 18/50 | 40 q/min | fails: 270 | A/B/C/D: 4/6/3/5
  [Syllogisms] 18/50 | 36 q/min | fails: 302 | A/B/C/D: 4/6/3/5
  DONE: 18 / 320 attempts (302 fails)

Generating 50 for: Seating Arrangements (Linear, Circular) [2-way VERIFY]


  [Seating Arrangements (Linear, Circular)] 22/50 | 29 q/min | fails: 10 | A/B/C/D: 0/5/7/10
  [Seating Arrangements (Linear, Circular)] 43/50 | 27 q/min | fails: 21 | A/B/C/D: 3/11/10/19


  [Seating Arrangements (Linear, Circular)] 50/50 | 26 q/min | fails: 24 | A/B/C/D: 3/13/10/24
  DONE: 50 / 79 attempts (24 fails)

Generating 50 for: Family tree logic [2-way VERIFY]
  [Family tree logic] 19/50 | 28 q/min | fails: 13 | A/B/C/D: 8/7/2/2
  [Family tree logic] 28/50 | 29 q/min | fails: 36 | A/B/C/D: 11/11/4/2


  [Family tree logic] 41/50 | 30 q/min | fails: 53 | A/B/C/D: 12/16/7/6


  [Family tree logic] 47/50 | 29 q/min | fails: 64 | A/B/C/D: 12/19/7/9


  [Family tree logic] 50/50 | 29 q/min | fails: 66 | A/B/C/D: 13/19/9/9
  DONE: 50 / 122 attempts (66 fails)

Generating 50 for: Mixed Series (Alphanumeric) [2-way VERIFY]
  [Mixed Series (Alphanumeric)] 13/50 | 30 q/min | fails: 19 | A/B/C/D: 12/0/1/0
  [Mixed Series (Alphanumeric)] 20/50 | 31 q/min | fails: 44 | A/B/C/D: 17/2/1/0
  [Mixed Series (Alphanumeric)] 23/50 | 30 q/min | fails: 73 | A/B/C/D: 19/3/1/0
  [Mixed Series (Alphanumeric)] 24/50 | 30 q/min | fails: 104 | A/B/C/D: 20/3/1/0
  [Mixed Series (Alphanumeric)] 25/50 | 30 q/min | fails: 135 | A/B/C/D: 21/3/1/0
  [Mixed Series (Alphanumeric)] 28/50 | 30 q/min | fails: 164 | A/B/C/D: 23/4/1/0
  [Mixed Series (Alphanumeric)] 29/50 | 29 q/min | fails: 193 | A/B/C/D: 24/4/1/0


  [Mixed Series (Alphanumeric)] 31/50 | 29 q/min | fails: 220 | A/B/C/D: 26/4/1/0


  [Mixed Series (Alphanumeric)] 32/50 | 29 q/min | fails: 246 | A/B/C/D: 27/4/1/0


  [Mixed Series (Alphanumeric)] 32/50 | 28 q/min | fails: 272 | A/B/C/D: 27/4/1/0
  DONE: 32 / 304 attempts (272 fails)

DONE in 5.3 min | Total: 150
Verified (Seating/Family/Series): 127
Direct accept (Syllogisms): 18
Low-conf fallback: 5
Answer distribution:
  A: 47
  B: 42
  C: 23
  D: 38


---
## Validate Generated Data & Unload Teacher

Quick sanity checks, then free GPU for fine-tuning.

In [10]:
# ========== VALIDATE DATA + UNLOAD TEACHER ==========
import json

print("=" * 60)
print("DATA VALIDATION REPORT")
print("=" * 60)

for name in ["a_agent_train.json", "q_agent_train.json"]:
    with open(f"training_data/{name}") as f:
        data = json.load(f)
    print(f"\n{name}: {len(data)} examples")
    
    errors = 0
    for i, item in enumerate(data):
        convos = item.get("conversations", [])
        if len(convos) != 2: errors += 1; continue
        if convos[0]["role"] != "user" or convos[1]["role"] != "assistant": errors += 1; continue
        try:
            parsed = json.loads(convos[1]["content"])
            if parsed.get("answer") not in ("A", "B", "C", "D"): errors += 1
        except json.JSONDecodeError:
            errors += 1
    
    print(f"  {'✅ All valid!' if errors == 0 else f'⚠️ {errors} errors'}")

# Answer distribution
print("\nAnswer Distribution:")
with open("training_data/a_agent_train.json") as f:
    a_data = json.load(f)
counts = {"A": 0, "B": 0, "C": 0, "D": 0}
for item in a_data:
    ans = json.loads(item["conversations"][1]["content"]).get("answer", "?")
    if ans in counts: counts[ans] += 1
total = sum(counts.values())
for letter, count in counts.items():
    pct = count / total * 100 if total > 0 else 0
    bar = "█" * int(pct / 2)
    print(f"  {letter}: {count:3d} ({pct:4.1f}%) {bar}")

# Per-topic
print("\nPer-Topic:")
for label, fname in [("Syllogisms", "Syllogisms.json"), ("Seating", "Seating_Arrangements_Linear,_Circular.json"),
                     ("Family", "Family_tree_logic.json"), ("Series", "Mixed_Series_Alphanumeric.json")]:
    path = Path("training_data") / fname
    if path.exists():
        with open(path) as f: td = json.load(f)
        tc = {"A": 0, "B": 0, "C": 0, "D": 0}
        for q in td:
            if q.get("answer") in tc: tc[q["answer"]] += 1
        dist = " | ".join(f"{l}:{tc[l]}" for l in "ABCD")
        print(f"  {label}: {len(td)} | {dist}")
    else:
        print(f"  {label}: MISSING")

# Cross-check
with open("training_data/q_agent_train.json") as f:
    q_data = json.load(f)
mismatches = sum(1 for a, q in zip(a_data, q_data)
    if json.loads(a["conversations"][1]["content"])["answer"] != json.loads(q["conversations"][1]["content"])["answer"])
print(f"\nA/Q consistency: {'✅ All match' if mismatches == 0 else f'❌ {mismatches} mismatches'}")

print("=" * 60)

# Unload teacher
del teacher_model, teacher_tokenizer
gc.collect()
torch.cuda.empty_cache()
print(f"GPU freed: {torch.cuda.mem_get_info()[0]/1024**3:.1f} GiB available")

DATA VALIDATION REPORT

a_agent_train.json: 150 examples
  ✅ All valid!

q_agent_train.json: 150 examples
  ✅ All valid!

Answer Distribution:
  A:  47 (31.3%) ███████████████
  B:  42 (28.0%) ██████████████
  C:  23 (15.3%) ███████
  D:  38 (25.3%) ████████████

Per-Topic:
  Syllogisms: 18 | A:4 | B:6 | C:3 | D:5
  Seating: 50 | A:3 | B:13 | C:10 | D:24
  Family: 50 | A:13 | B:19 | C:9 | D:9
  Series: 32 | A:27 | B:4 | C:1 | D:0

A/Q consistency: ✅ All match
GPU freed: 240.6 GiB available
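
The aggregate distribution hides how uneven individual topics are. The sketch below takes the per-topic counts from the report and flags any topic where one letter dominates. `dominant_share` and the 0.5 cutoff are hypothetical choices for illustration, not values from the notebook.

```python
# Per-topic answer counts copied from the validation report.
per_topic = {
    "Syllogisms": {"A": 4, "B": 6, "C": 3, "D": 5},
    "Seating": {"A": 3, "B": 13, "C": 10, "D": 24},
    "Family": {"A": 13, "B": 19, "C": 9, "D": 9},
    "Series": {"A": 27, "B": 4, "C": 1, "D": 0},
}


def dominant_share(counts: dict[str, int]) -> tuple[str, float]:
    """Return the most frequent letter and its fraction of the topic's examples."""
    letter = max(counts, key=counts.get)
    return letter, counts[letter] / sum(counts.values())


SKEW_THRESHOLD = 0.5  # hypothetical cutoff for "one letter dominates"
for topic, counts in per_topic.items():
    letter, share = dominant_share(counts)
    flag = "SKEWED" if share > SKEW_THRESHOLD else "ok"
    print(f"{topic}: {letter} = {share:.0%} [{flag}]")
```

Only Series crosses the cutoff here (27 of 32 answers are A), which suggests a student model trained on this data could learn to favor A on series questions.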


---
## Phase 2: Fine-Tune A-Agent

A-Agent solves MCQs, which is critical for the elimination round and defense.

In [11]:
import os
import json
import torch
from datasets import Dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, standardize_sharegpt, train_on_responses_only
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq

print("Libraries loaded.")

Libraries loaded.


In [12]:
# ========== LOAD A-AGENT TRAINING DATA ==========
with open("training_data/a_agent_train.json", 'r') as f:
    a_data = json.load(f)

print(f"Loaded {len(a_data)} A-Agent training examples")
a_dataset = Dataset.from_list(a_data)
print(f"Dataset: {a_dataset}")
print(f"Sample:\n{json.dumps(a_dataset[0], indent=2)[:500]}")

Loaded 150 A-Agent training examples
Dataset: Dataset({
    features: ['conversations'],
    num_rows: 150
})
Sample:
{
  "conversations": [
    {
      "content": "Question: Statement I: All cats are animals.\nStatement II: Some animals are mammals.\nConclusion I: Some cats are mammals.\nConclusion II: All animals are mammals.\nChoices: A) Only conclusion I follows B) Only conclusion II follows C) Both I and II follow D) Neither I nor II follows\n\nSolve step by step and output JSON: {\"answer\": \"<letter>\", \"reasoning\": \"<brief>\"}",
      "role": "user"
    },
    {
      "content": "{\"answer\": \"D\",


In [13]:
# ========== LOAD QWEN3-4B FOR A-AGENT FINE-TUNING ==========
max_seq_length = 1024  # MCQ data is short: 1024 is plenty, saves VRAM & time
dtype = torch.bfloat16  # ROCm compatible
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./hf_models/Qwen3-4B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
print("Qwen3-4B loaded.")

# Add LoRA adapters: all projection layers, rank 64
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=128,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
print("LoRA adapters added (r=64, alpha=128).")

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


`torch_dtype` is deprecated! Use `dtype` instead!
[2026-02-15 08:43:40] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Qwen3-4B loaded.


Unsloth 2025.10.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


LoRA adapters added (r=64, alpha=128).
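
The adapter size can be checked by hand. Each LoRA-wrapped linear layer of shape `in -> out` adds `r * (in + out)` weights. The sketch below assumes the Qwen3-4B dimensions (hidden size 2560, 32 query heads and 8 key/value heads of dimension 128, MLP width 9728); these values are not printed anywhere in this notebook, so treat them as assumptions and verify them against `./hf_models/Qwen3-4B/config.json`.

```python
# LoRA adds two low-rank matrices per wrapped linear layer:
# A (r x in_features) and B (out_features x r), i.e. r * (in + out) weights.
R = 64
NUM_LAYERS = 36  # matches "patched 36 layers" in the log above

# Assumed Qwen3-4B shapes (verify against config.json).
HIDDEN = 2560
Q_OUT = 32 * 128   # num_attention_heads * head_dim
KV_OUT = 8 * 128   # num_key_value_heads * head_dim
MLP = 9728         # intermediate_size

# (in_features, out_features) for every module named in target_modules
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(shapes, r, num_layers):
    """Total LoRA weights across all layers for the given module shapes."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

total = lora_params(SHAPES, R, NUM_LAYERS)
print(f"{total:,}")  # 132,120,576
```

Under these assumptions the result agrees with the `Trainable parameters = 132,120,576` line that Unsloth prints in the training banner.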


In [14]:
# ========== PREPARE A-AGENT DATASET ==========
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = []
    for convo in examples["conversations"]:
        if isinstance(convo, list) and all(isinstance(m, dict) for m in convo):
            texts.append(tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False))
    return {"text": texts}

a_dataset = standardize_sharegpt(a_dataset)
a_dataset = a_dataset.map(formatting_prompts_func, batched=True, remove_columns=a_dataset.column_names)
a_dataset = a_dataset.filter(lambda x: len(x["text"].strip()) > 0)
print(f"Prepared {len(a_dataset)} A-Agent examples")
if len(a_dataset) > 0:
    print(f"Sample: {a_dataset['text'][0][:200]}...")

num_proc must be <= 150. Reducing num_proc to 150 for dataset of size 150.


Unsloth: Standardizing formats (num_proc=150):   0%|          | 0/150 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Filter:   0%|          | 0/150 [00:00<?, ? examples/s]

Prepared 150 A-Agent examples
Sample: <|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Question: Statement I: All cats are animals.
Statement II: Some animals are mammals.
...


In [16]:
# ========== TRAIN A-AGENT ==========
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=a_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    packing=False,
    args=SFTConfig(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=2,
        learning_rate=2e-4,
        logging_steps=5,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="a_agent_training_output",
        report_to="none",
        bf16=True,
        dataloader_pin_memory=False,
        remove_unused_columns=True,
        gradient_checkpointing=True,
        dataloader_num_workers=0,
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)

FastLanguageModel.for_training(model)
print("Starting A-Agent training...")
trainer_stats = trainer.train()
print(f"A-Agent done! Loss: {trainer_stats.training_loss:.4f}")

Unsloth: Tokenizing ["text"] (num_proc=64):   0%|          | 0/150 [00:00<?, ? examples/s]

Map (num_proc=64):   0%|          | 0/150 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


Starting A-Agent training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 150 | Num Epochs = 2 | Total steps = 6
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 2 x 1) = 64
 "-____-"     Trainable parameters = 132,120,576 of 4,154,588,672 (3.18% trained)


Unsloth: Will smartly offload gradients to save VRAM!
A-Agent done! Loss: 0.6923
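
The `Total steps = 6` figure in the banner follows from the batch settings. The arithmetic below is a minimal sketch that assumes the trainer rounds a partial accumulation group up to a full optimizer step, which is consistent with the logged value.

```python
import math

num_examples = 150
per_device_batch = 32
grad_accum = 2
epochs = 2
warmup_steps = 5

# 150 examples in batches of 32 -> 5 batches (the last one is partial)
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# 5 batches grouped in pairs -> 3 optimizer updates per epoch (assumed rounding up)
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = updates_per_epoch * epochs
steps_after_warmup = max(total_steps - warmup_steps, 0)

print(f"optimizer steps: {total_steps}")              # optimizer steps: 6
print(f"steps after warmup: {steps_after_warmup}")    # steps after warmup: 1
```

With `warmup_steps=5`, five of the six optimizer steps fall inside the warmup ramp, so the run spends very little time near the configured `learning_rate=2e-4`. With `logging_steps=5`, at most one loss value is logged per run. Both settings may be worth revisiting for a dataset this small.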


In [18]:
# ========== SAVE A-AGENT MODEL ==========
import gc

a_merged_path = "./hf_models/a_agent_finetuned"
print(f"Saving A-Agent to {a_merged_path}...")
model.save_pretrained_merged(a_merged_path, tokenizer, save_method="merged_16bit")
print("A-Agent saved.")

# Free GPU memory
del model, trainer
gc.collect()
torch.cuda.empty_cache()
print("GPU memory freed.")

Saving A-Agent to ./hf_models/a_agent_finetuned...
Detected local model directory: /workspace/AAIPL/hf_models/Qwen3-4B


Unsloth: Preparing safetensor model files: 100%|██████████| 3/3 [00:00<00:00, 34568.44it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 3/3 [00:14<00:00,  4.96s/it]


Unsloth: Merge process complete. Saved to `/workspace/AAIPL/hf_models/a_agent_finetuned`
A-Agent saved.
GPU memory freed.


---
## Phase 3: Fine-Tune Q-Agent

The Q-Agent generates hard MCQs and scores whenever the opponent's A-Agent answers incorrectly.

In [23]:
# ========== LOAD Q-AGENT TRAINING DATA ==========
with open("training_data/q_agent_train.json", 'r') as f:
    q_data = json.load(f)

print(f"Loaded {len(q_data)} Q-Agent training examples")
q_dataset = Dataset.from_list(q_data)
print(f"Sample:\n{json.dumps(q_dataset[0], indent=2)[:500]}")

Loaded 150 Q-Agent training examples
Sample:
{
  "conversations": [
    {
      "content": "Generate a difficult MCQ on topic: Logical Reasoning/Syllogisms. Output ONLY valid JSON.",
      "role": "user"
    },
    {
      "content": "{\"topic\": \"Logical Reasoning/Syllogisms\", \"question\": \"Statement I: All cats are animals.\\nStatement II: Some animals are mammals.\\nConclusion I: Some cats are mammals.\\nConclusion II: All animals are mammals.\", \"choices\": [\"A) Only conclusion I follows\", \"B) Only conclusion II follows\", \"C)


In [24]:
# ========== LOAD FRESH QWEN3-4B FOR Q-AGENT ==========
max_seq_length = 1024  # MCQ data is short; 1024 saves time

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./hf_models/Qwen3-4B",
    max_seq_length=max_seq_length,
    dtype=torch.bfloat16,
    load_in_4bit=False,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=128,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
print("Fresh Qwen3-4B loaded for Q-Agent (r=64).")

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


[2026-02-15 08:50:03] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Fresh Qwen3-4B loaded for Q-Agent (r=64).


In [25]:
# ========== PREPARE Q-AGENT DATASET ==========
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

q_dataset = standardize_sharegpt(q_dataset)
q_dataset = q_dataset.map(formatting_prompts_func, batched=True, remove_columns=q_dataset.column_names)
q_dataset = q_dataset.filter(lambda x: len(x["text"].strip()) > 0)

print(f"Prepared {len(q_dataset)} Q-Agent training examples")

num_proc must be <= 150. Reducing num_proc to 150 for dataset of size 150.


Unsloth: Standardizing formats (num_proc=150):   0%|          | 0/150 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

Filter:   0%|          | 0/150 [00:00<?, ? examples/s]

Prepared 150 Q-Agent training examples


In [26]:
# ========== TRAIN Q-AGENT ==========
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=q_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    packing=False,
    args=SFTConfig(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=2,
        learning_rate=2e-4,
        logging_steps=5,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="q_agent_training_output",
        report_to="none",
        bf16=True,
        dataloader_pin_memory=False,
        remove_unused_columns=True,
        gradient_checkpointing=True,
        dataloader_num_workers=0,
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)

FastLanguageModel.for_training(model)
print("Starting Q-Agent training...")
trainer_stats = trainer.train()
print(f"Q-Agent done! Loss: {trainer_stats.training_loss:.4f}")

Unsloth: Tokenizing ["text"] (num_proc=64):   0%|          | 0/150 [00:00<?, ? examples/s]

Map (num_proc=64):   0%|          | 0/150 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


Starting Q-Agent training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 150 | Num Epochs = 2 | Total steps = 6
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 2 x 1) = 64
 "-____-"     Trainable parameters = 132,120,576 of 4,154,588,672 (3.18% trained)


Step,Training Loss
5,0.6577


Q-Agent done! Loss: 0.6028
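
Both training cells wrap the trainer with `train_on_responses_only`, so the loss is computed on the assistant turn and not on the prompt. The toy sketch below illustrates that idea on plain integer lists. It is not Unsloth's implementation; `find_all` and `mask_to_responses` are hypothetical helpers, and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def find_all(seq, pattern):
    """Return every start index where pattern occurs in seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_to_responses(token_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant turns."""
    labels = [IGNORE_INDEX] * len(token_ids)
    instruction_starts = find_all(token_ids, instruction_ids)
    for start in find_all(token_ids, response_ids):
        begin = start + len(response_ids)  # first token after the response marker
        # The turn ends at the next instruction marker, or at the end of the sequence.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(token_ids)
        labels[begin:end] = token_ids[begin:end]
    return labels

# Toy ids: 1 = user marker, 2 = assistant marker, everything else is content.
tokens = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_to_responses(tokens, instruction_ids=[1], response_ids=[2])
print(labels)
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

In the real pipeline the markers are the tokenized forms of `<|im_start|>user\n` and `<|im_start|>assistant\n`, so a mismatch between those strings and the chat template would leave every label masked.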


In [27]:
# ========== SAVE Q-AGENT MODEL ==========
q_merged_path = "./hf_models/q_agent_finetuned"
print(f"Saving Q-Agent to {q_merged_path}...")
model.save_pretrained_merged(q_merged_path, tokenizer, save_method="merged_16bit")
print("Q-Agent saved.")

del model, trainer
gc.collect()
torch.cuda.empty_cache()
print("Both fine-tuned models saved.")

Saving Q-Agent to ./hf_models/q_agent_finetuned...
Detected local model directory: /workspace/AAIPL/hf_models/Qwen3-4B


Unsloth: Preparing safetensor model files: 100%|██████████| 3/3 [00:00<00:00, 14429.94it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 3/3 [00:13<00:00,  4.39s/it]


Unsloth: Merge process complete. Saved to `/workspace/AAIPL/hf_models/q_agent_finetuned`
Q-Agent saved.
Both fine-tuned models saved.


---
## Phase 4: Update Model Paths and Test

Point `question_model.py` and `answer_model.py` to the fine-tuned models.

In [40]:
# ========== UPDATE MODEL PATHS + FIX QWEN3 <think> TAGS + ROBUST JSON ==========
import re as _re

for fname, new_path in [("agents/question_model.py", "q_agent_finetuned"),
                         ("agents/answer_model.py", "a_agent_finetuned")]:
    code = open(fname, "r").read()
    
    # 1. Fix MODEL_PATH (handle both old and already-updated paths)
    code = _re.sub(
        r'MODEL_PATH\s*=\s*str\(Path\(__file__\)\.parent\.parent\s*/\s*"hf_models"\s*/\s*"[^"]+"\)',
        f'MODEL_PATH = str(Path(__file__).parent.parent / "hf_models" / "{new_path}")',
        code
    )
    
    # 2. Add enable_thinking=False if missing
    if "enable_thinking=False" not in code:
        code = code.replace(
            "add_generation_prompt=True,\n            )",
            "add_generation_prompt=True,\n                enable_thinking=False,\n            )"
        )
    
    # 3. Add <think> tag stripping if missing
    if "re.sub(r'<think>" not in code:
        if "import re\n" not in code:
            code = code.replace("import time\n", "import re\nimport time\n", 1)
        code = code.replace(
            '.strip("\\n")\n            batch_outs.append(content)',
            '.strip("\\n")\n            # Strip Qwen3 <think> tags if present\n            content = re.sub(r\'<think>.*?</think>\', \'\', content, flags=re.DOTALL).strip()\n            batch_outs.append(content)'
        )
    
    open(fname, "w").write(code)
    print(f"✅ {fname} -> {new_path}")

# Verify the changes
for fname in ["agents/question_model.py", "agents/answer_model.py"]:
    code = open(fname).read()
    has_path = "finetuned" in code
    has_think = "enable_thinking=False" in code
    has_strip = "re.sub(r'<think>" in code  # same marker the patch step checks for
    print(f"  {fname}: path={has_path} thinking={has_think} strip={has_strip}")

✅ agents/question_model.py -> q_agent_finetuned
✅ agents/answer_model.py -> a_agent_finetuned
  agents/question_model.py: path=True thinking=True strip=True
  agents/answer_model.py: path=True thinking=True strip=True
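
The stripping step that the patch injects can be tried in isolation. The sketch below uses the same regular expression as the patched line and adds one assumption of its own: if generation is cut off inside an unclosed `<think>` block, everything from that tag onward is dropped. The patched agent code does not do this, and `strip_think` is a hypothetical helper name.

```python
import re

# Same pattern as the line injected into the agent files.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(content: str) -> str:
    """Remove closed <think>...</think> blocks, then any dangling unclosed one."""
    content = THINK_BLOCK.sub("", content)
    # Assumption: a truncated, unclosed <think> block carries no usable answer,
    # so drop it through the end of the string.
    content = re.sub(r"<think>.*\Z", "", content, flags=re.DOTALL)
    return content.strip()

raw = '<think>\nLet me check each option.\n</think>\n\n{"answer": "B"}'
print(strip_think(raw))  # {"answer": "B"}
```

Stripping is a fallback. Passing `enable_thinking=False` to the chat template, as the patch also does, should prevent most `<think>` blocks from being generated in the first place.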


In [34]:
# ========== TEST Q-AGENT ==========
!python -m agents.question_agent \
    --output_file "outputs/questions.json" \
    --num_questions 10 \
    --batch_size 5 \
    --verbose

  torch._C._cuda_init()
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 09:17:11 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE_TORCH}:{i}") for i in range(n_gpus)])
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfl

In [35]:
# ========== CHECK Q-AGENT FILTER PASS RATE ==========
import json

with open("outputs/questions.json", "r") as f:
    questions = json.load(f)
with open("outputs/filtered_questions.json", "r") as f:
    filtered = json.load(f)

pass_rate = len(filtered) / max(len(questions), 1) * 100
print(f"Raw questions: {len(questions)}")
print(f"Passed filter: {len(filtered)}")
print(f"Pass rate: {pass_rate:.1f}%")
if pass_rate < 50:
    print("CRITICAL: Below 50% = DISQUALIFIED. Re-train or adjust prompts.")
else:
    print("Filter pass rate OK.")

Raw questions: 10
Passed filter: 2
Pass rate: 20.0%
CRITICAL: Below 50% = DISQUALIFIED. Re-train or adjust prompts.
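
A 20% pass rate says that questions were rejected but not why. One way to narrow it down is to tally a failure reason per raw question. The sketch below is only a starting point: the league's actual filter rules are not shown in this notebook, so the required keys and the A-D checks are assumptions based on the training sample above, and `failure_reason` and `tally_failures` are hypothetical helpers.

```python
from collections import Counter

REQUIRED_KEYS = ("topic", "question", "choices", "answer")  # assumed schema

def failure_reason(item):
    """Return None if the item looks well-formed, else a short reason string."""
    if not isinstance(item, dict):
        return "not a JSON object"
    missing = [k for k in REQUIRED_KEYS if k not in item]
    if missing:
        return "missing keys: " + ", ".join(missing)
    choices = item["choices"]
    if not isinstance(choices, list) or len(choices) != 4:
        return "choices is not a list of 4"
    if [str(c).strip()[:1].upper() for c in choices] != ["A", "B", "C", "D"]:
        return "choices not labelled A-D"
    if str(item["answer"]).strip()[:1].upper() not in {"A", "B", "C", "D"}:
        return "answer is not A-D"
    return None

def tally_failures(items):
    """Count how many items fall under each failure reason (or 'ok')."""
    return Counter(failure_reason(it) or "ok" for it in items)

# Made-up sample; in the notebook, pass the `questions` list loaded above.
sample = [
    {"topic": "t", "question": "q",
     "choices": ["A) 1", "B) 2", "C) 3", "D) 4"], "answer": "B"},
    {"topic": "t", "question": "q", "choices": ["A) 1", "B) 2"], "answer": "A"},
    "unparsed model text",
]
for reason, count in tally_failures(sample).most_common():
    print(f"{count:3d}  {reason}")
```

If most failures turn out to be structural, the prompt or the JSON extraction is the likelier culprit; if the structure is fine, the filter may be rejecting on content instead.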


In [41]:
# ========== TEST A-AGENT ==========
!python -m agents.answer_agent \
    --input_file "outputs/filtered_questions.json" \
    --output_file "outputs/answers.json" \
    --batch_size 5 \
    --verbose

  torch._C._cuda_init()
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
#### Unsloth: `hf_xet==1.1.10` and `ipykernel>6.30.1` breaks progress bars. Disabling for now in XET.
#### Unsloth: To re-enable progress bars, please downgrade to `ipykernel==6.30.1` or wait for a fix to
https://github.com/huggingface/xet-core/issues/526
INFO 02-15 09:25:55 [__init__.py:225] Automatically detected platform rocm.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE_TORCH}:{i}") for i in range(n_gpus)])
==((====))==  Unsloth 2025.10.9: Fast Qwen3 patching. Transformers: 4.56.2. vLLM: 0.11.1rc3.dev39+gf417746ad.rocm700.
   \\   /|    . Num GPUs = 1. Max memory: 255.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0a0+git1c57644. ROCm Toolkit: 7.0.51831-a3e329ad8. Triton: 3.4.0
\        /    Bfl

In [36]:
# ========== CALCULATE SCORES ==========
with open("outputs/filtered_questions.json", "r") as f:
    fq = json.load(f)
with open("outputs/filtered_answers.json", "r") as f:
    fa = json.load(f)

N = len(fq)
correct = 0
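for q, a in zip(fq, fa):
    # Compare first letters only; slicing avoids an IndexError on empty or missing answers
    expected = str(q.get('answer') or '').strip()[:1].upper()
    given = str((a or {}).get('answer') or '').strip()[:1].upper()
    if expected and expected == given:
        correct += 1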

accuracy = correct * 100 / max(N, 1)
print(f"{'='*50}")
print(f"Questions: {N}")
print(f"Correct answers: {correct}")
print(f"A-Agent accuracy: {accuracy:.1f}%")
print(f"Q-Agent score (if opponent had same accuracy): {100-accuracy:.1f}%")
print(f"{'='*50}")

FileNotFoundError: [Errno 2] No such file or directory: 'outputs/filtered_answers.json'
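
The error above only says that `outputs/filtered_answers.json` did not exist when the scoring cell ran. The execution counters suggest one possible cause: the scoring cell is `In [36]`, while the A-Agent test cell is `In [41]`, so scoring was last executed before the most recent A-Agent run. A small guard makes this kind of failure easier to read. This is a minimal sketch, and `load_json_or_report` is a hypothetical helper.

```python
import json
from pathlib import Path

def load_json_or_report(path):
    """Load a JSON file, or print what the directory contains and return None."""
    path = Path(path)
    if path.is_file():
        with path.open("r") as f:
            return json.load(f)
    print(f"Missing: {path}")
    if path.parent.is_dir():
        siblings = sorted(p.name for p in path.parent.iterdir())
        print(f"Files in {path.parent}/: {siblings}")
    else:
        print(f"Directory {path.parent}/ does not exist")
    return None

fa = load_json_or_report("outputs/filtered_answers.json")
if fa is None:
    print("Skipping scoring until the A-Agent run produces this file.")
```

Listing the sibling files shows at a glance whether the A-Agent wrote `answers.json` but no filtered variant, or wrote nothing at all.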

---
## Phase 5: Push to GitHub
Push the code (NOT `hf_models/`) to GitHub before the deadline.

In [39]:
!bash git.sh

UnboundLocalError: cannot access local variable 'child' where it is not associated with a value