# 🧷🔐 Guardrailed Ensemble MAP@3  

> **Changelog (this fork):**  
> • Per‑model *family‑guardrailed decoding* (filter to `True_` / `False_` before writing base results)  
> • Ensemble enforces same family filtering + safe backfill  
> • Threaded, staggered multi‑GPU inference (Qwen & DeepSeek)  
> • Full probability artifacts (`prob_0..24`, `top_classes`) per model  
> • Standardized prompts (“Correct? Yes/No”) and consistent tokenization. The discussion on this topic sits right at the top of the forum, yet its guidance is not always enforced: https://www.kaggle.com/competitions/map-charting-student-math-misunderstandings/discussion/589400

> *Based on original by @kishanvavdara — see [original notebook](https://www.kaggle.com/code/kishanvavdara/ensemble-gemma-qwen-deepseek)*  

## 1) Competition: task & metric  

You’re working on the **MAP: Charting Student Math Misunderstandings** competition.  

**Task:** Given a question, the student’s MC answer, and their explanation, you must output a **top‑3** `Category:Misconception` prediction (space‑separated). Labels include a **family prefix** (`True_` or `False_`) + a coarse label (e.g. `Misconception`, `Neither`, `Correct`).  

**Evaluation:** **MAP@3** — so ordering matters.  
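
To make the effect of ordering concrete, here is a minimal sketch of MAP@3 under the assumption that each row has exactly one true label. The function name `map_at_3` and the toy labels (including the `Hypothetical` misconception) are illustrative only and are not taken from the competition's scoring code.

```python
def map_at_3(y_true, y_pred):
    """Mean average precision at 3, assuming one true label per row."""
    total = 0.0
    for truth, preds in zip(y_true, y_pred):
        # Only the first three predictions count; a hit at rank r scores 1/r.
        for rank, label in enumerate(preds[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(y_true)


truth = ["True_Correct:NA", "False_Neither:NA", "False_Misconception:WNB"]
preds = [
    ["True_Correct:NA", "True_Neither:NA", "True_Misconception:WNB"],        # hit at rank 1
    ["False_Misconception:WNB", "False_Neither:NA", "False_Correct:NA"],     # hit at rank 2
    ["False_Neither:NA", "False_Correct:NA", "False_Misconception:Hypothetical"],  # miss
]
score = map_at_3(truth, preds)
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```

Moving the correct label from rank 2 to rank 1 in the second row would raise that row's contribution from 0.5 to 1.0, which is why the ensemble's sort order matters as much as its candidate set.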

**Data layout:** `train.csv` has true `Category` + `Misconception`, `test.csv` doesn’t. You must map each `row_id` to 3 predictions.  

> This fork’s design emphasizes **consistency and interpretability**, not hyper‑tweaked dominance.


## 2) Pipeline overview  

### 2.1 Family heuristic & pre-processing  
- From `train`, extract a “canonical correct MC answer per QuestionId” among `True_`‑labeled rows; merge to `test` to define `is_correct ∈ {0,1}`.  
- Map that to a **family prefix**: `True_` if `is_correct=1` else `False_`.  
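
The two steps above can be sketched on toy data. The frames below are made-up stand-ins for `train.csv` and `test.csv`, and the `family` column name is an assumption for illustration; the logic mirrors the merge used in the scripts further down.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are invented for illustration).
train = pd.DataFrame({
    "QuestionId": [1, 1, 1, 2, 2],
    "MC_Answer": ["A", "A", "B", "C", "D"],
    "Category": ["True_Correct", "True_Neither", "False_Misconception",
                 "True_Correct", "False_Neither"],
})
test = pd.DataFrame({
    "row_id": [100, 101, 102],
    "QuestionId": [1, 1, 2],
    "MC_Answer": ["A", "B", "C"],
})

# Most frequent MC answer among True_ rows, one per question.
true_rows = train[train["Category"].str.startswith("True_")]
correct = (
    true_rows.groupby(["QuestionId", "MC_Answer"]).size().reset_index(name="c")
    .sort_values("c", ascending=False)
    .drop_duplicates("QuestionId")[["QuestionId", "MC_Answer"]]
    .assign(is_correct=1)
)

# Unmatched (QuestionId, MC_Answer) pairs get is_correct = 0, i.e. the False_ family.
test = test.merge(correct, on=["QuestionId", "MC_Answer"], how="left")
test["is_correct"] = test["is_correct"].fillna(0).astype(int)
test["family"] = test["is_correct"].map({1: "True_", 0: "False_"})
print(test[["row_id", "is_correct", "family"]])
```

Note that any answer not seen among `True_` rows in train, including a question that never appears in train, falls into `False_` by default.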

### 2.2 Prompting & tokenization  
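- Use short, deterministic prompt templates: the Gemma script inserts “Correct? Yes/No”, while the Qwen3/DeepSeek script inserts a full sentence (“This answer is correct.” or its negation).  
- Tokenize with `max_length=256` and truncation; batches are padded by a `DataCollatorWithPadding` (the Gemma script also pads to `max_length` at tokenization time).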

### 2.3 Inference per model  
- **Gemma‑2‑9B + LoRA**: load base + adapter; may see `score.weight` warnings if head uninitialized.  
- **Qwen3‑8B & DeepSeekMath‑7B**: run in parallel threads on separate GPUs with small batches and staggered start delays.  
- **Guardrailed decoding**: filter the ranked classes to the row’s family prefix, dedupe, then backfill to exactly 3 using family-safe placeholders (`{family}Neither:NA`, then `True_Correct:NA` for the `True_` family).
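
As a compact restatement of the guardrailed decoding step, the sketch below pulls the filter, dedupe, and backfill logic into one helper. The name `guardrail_top3` and the toy label set are assumptions for illustration; the scripts inline the same steps rather than calling a function like this.

```python
def guardrail_top3(ranked_labels, prefix, valid_labels):
    """Keep family-consistent labels in rank order, then pad to exactly 3."""
    row = []
    for label in ranked_labels:
        if label.startswith(prefix) and label not in row:
            row.append(label)

    # Preferred fillers: Neither first, then Correct for the True_ family only.
    fillers = [f"{prefix}Neither:NA"]
    if prefix == "True_":
        fillers.append(f"{prefix}Correct:NA")

    for filler in fillers:
        if len(row) >= 3:
            break
        if filler in valid_labels and filler not in row:
            row.append(filler)

    # Last resort: repeat a valid label from the right family.
    if fillers[0] in valid_labels:
        fallback = fillers[0]
    else:
        fallback = next(l for l in sorted(valid_labels) if l.startswith(prefix))
    while len(row) < 3:
        row.append(fallback)
    return row[:3]


valid = {"True_Correct:NA", "True_Neither:NA",
         "False_Neither:NA", "False_Misconception:WNB"}
ranked = ["False_Misconception:WNB", "True_Correct:NA", "False_Neither:NA"]

true_row = guardrail_top3(ranked, "True_", valid)
false_row = guardrail_top3(ranked, "False_", valid)
print(true_row)   # ['True_Correct:NA', 'True_Neither:NA', 'True_Neither:NA']
print(false_row)  # ['False_Misconception:WNB', 'False_Neither:NA', 'False_Neither:NA']
```

With the full 65-class ranking as input, the filter alone normally yields three or more labels, so the backfill mainly guards against truncated candidate lists.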

### 2.4 Ensemble & blending  
- Each model outputs a probability artifact (with `prob_0..24`, `top_classes`).  
- Merge them on `row_id`.  
- For each candidate class, compute:
  1. weighted sum of probabilities  
  2. agreement (fraction of models whose top-25 list contains the class)  
  3. max weighted prob  
- Score: `0.6·sum + 0.3·agreement + 0.1·max`  
- Final: filter to correct family prefix, sort, take top‑3, safe backfill as needed.
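
A small worked example of the blend may help. The helper `blend_scores`, the two-model setup, and the probabilities are invented for illustration; the 0.6/0.3/0.1 coefficients and the agreement term (votes divided by the number of models) follow the ensemble cell below.

```python
from collections import defaultdict


def blend_scores(model_probs, weights):
    """Score each class as 0.6*sum + 0.3*agreement + 0.1*max (weighted probs)."""
    n_models = len(model_probs)
    total = defaultdict(float)
    votes = defaultdict(int)
    best = defaultdict(float)
    for probs, w in zip(model_probs, weights):
        for label, p in probs.items():
            total[label] += w * p
            votes[label] += 1
            best[label] = max(best[label], w * p)
    return {
        label: 0.6 * total[label] + 0.3 * votes[label] / n_models + 0.1 * best[label]
        for label in total
    }


# Two hypothetical models with weights 1.2 and 1.0.
model_a = {"False_Neither:NA": 0.5, "False_Misconception:WNB": 0.25}
model_b = {"False_Neither:NA": 0.25, "True_Correct:NA": 0.5}
scores = blend_scores([model_a, model_b], weights=[1.2, 1.0])

# Family filter for a False_ row, then rank by score.
prefix = "False_"
ranked = [
    label
    for label, _ in sorted(scores.items(), key=lambda kv: -kv[1])
    if label.startswith(prefix)
]
print(ranked)  # ['False_Neither:NA', 'False_Misconception:WNB']
```

Here `True_Correct:NA` scores about 0.50 and would outrank `False_Misconception:WNB` (about 0.36) without the family filter; the row would then be backfilled to three entries as described in the last step.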


## 3) Remarks & pitfalls  

- The **Gemma head initialization warning** does not break inference, but indicates the classification head may not be well aligned — you’re better off persisting a tuned head.  
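- The **family heuristic** (the most frequent MC answer among `True_` rows in train) is approximate; it can occasionally mislabel a row, which restricts that row to the wrong family and may force fallback placeholders. Answers with no match in train default to `False_`.  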
- To reduce nondeterminism, **seed** `numpy`, `torch`, and Python’s `random`.  
- The artifact CSVs (probabilities + top_classes) are intentional: they enable **post hoc stacking, calibration, and introspection**.
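
For the seeding remark, a minimal sketch could look like the following. The helper name `seed_everything` and the default seed are assumptions; the scripts in this notebook do not currently call anything like it. `torch` is imported lazily so the sketch also runs where it is not installed.

```python
import random

import numpy as np


def seed_everything(seed=42):
    """Seed Python, NumPy and (if available) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    # Safe to call without a GPU; the seed is applied once CUDA is initialized.
    torch.cuda.manual_seed_all(seed)


seed_everything(123)
first = (random.random(), float(np.random.rand()))
seed_everything(123)
second = (random.random(), float(np.random.rand()))
print(first == second)  # True
```

Seeding does not make GPU kernels fully deterministic, so small floating-point differences between runs can remain even with fixed seeds.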

In [1]:
%%writefile gemma2_inference.py

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import os
from IPython.display import display, Math, Latex
import torch
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
from datasets import Dataset
import pandas as pd, numpy as np
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding
from peft import PeftModel
from scipy.special import softmax
from tqdm import tqdm


os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

lora_path = "/kaggle/input/gemma2-9b-it-cv945"
MAX_LEN = 256
# helpers
def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']}\n"
        f"Answer: {row['MC_Answer']}\n"
        f"Correct? {x}\n"
        f"Student Explanation: {row['StudentExplanation']}"
    )

# Tokenization function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_LEN)



le = LabelEncoder()

train = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/train.csv')

train.Misconception = train.Misconception.fillna('NA')
train['target'] = train.Category+":"+train.Misconception
train['label'] = le.fit_transform(train['target'])
target_classes = le.classes_
n_classes = len(target_classes)
print(f"Train shape: {train.shape} with {n_classes} target classes")
idx = train.apply(lambda row: row.Category.split('_')[0],axis=1)=='True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId','MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c',ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId','MC_Answer']]
correct['is_correct'] = 1

# Prepare test data
test = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/test.csv')
test = test.merge(correct, on=['QuestionId','MC_Answer'], how='left')
test.is_correct = test.is_correct.fillna(0)
test['text'] = test.apply(format_input, axis=1)


# load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(lora_path)
model = AutoModelForSequenceClassification.from_pretrained(
    "/kaggle/input/gemma2-9b-it-bf16",
    num_labels=n_classes,
    torch_dtype=torch.float16,
    device_map="auto",
)

model = PeftModel.from_pretrained(model, lora_path)
model.eval()

# Tokenize dataset
ds_test = Dataset.from_pandas(test[['text']])
ds_test = ds_test.map(tokenize, batched=True, remove_columns=['text'])

# Create data collator for efficient batching with padding
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    max_length=MAX_LEN,  
    return_tensors="pt")

dataloader = DataLoader(
    ds_test,
    batch_size=8,  
    shuffle=False,
    collate_fn=data_collator,
    pin_memory=True,  
    num_workers=2     
)

# Fast inference loop
all_logits = []
device = next(model.parameters()).device

with torch.no_grad():
    for batch in tqdm(dataloader, desc="Inference"):
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(**batch)
        logits = outputs.logits
        
        # Cast the float16 logits to float32, then move to CPU and store
        all_logits.append(logits.float().cpu().numpy())

# Concatenate all logits
predictions = np.concatenate(all_logits, axis=0)

# Convert to probs
probs = softmax(predictions, axis=1)

# Get top predictions (all 65 classes ranked)
top_indices = np.argsort(-probs, axis=1)

# Decode to class names
flat_indices = top_indices.flatten()
decoded_labels = le.inverse_transform(flat_indices)
top_labels = decoded_labels.reshape(top_indices.shape)

# Build per-row family prefix from test.is_correct (1 -> True_, 0 -> False_)
fam_prefix = np.where(test.is_correct.values == 1, "True_", "False_")
valid_labels = set(target_classes)

filtered_top3 = []
for i in range(len(test)):
    pref = fam_prefix[i]
    row = [lab for lab in top_labels[i, :] if lab.startswith(pref)]

    # de-dup while preserving order
    seen = set(); row = [x for x in row if not (x in seen or seen.add(x))]

    # backfill safely from allowed labels (prefer Neither, then Correct when True_)
    fillers = [f"{pref}Neither:NA", f"{pref}Correct:NA"] if pref == "True_" else [f"{pref}Neither:NA"]
    for f in fillers:
        if len(row) >= 3: break
        if f in valid_labels and f not in row:
            row.append(f)
    while len(row) < 3:
        # last resort: keep repeating a valid family-safe filler that's allowed
        f = fillers[0] if fillers and fillers[0] in valid_labels else next(l for l in valid_labels if l.startswith(pref))
        row.append(f)

    filtered_top3.append(" ".join(row[:3]))

joined_preds = filtered_top3  # <— replace previous joined_preds

# Create submission (top 3)
#joined_preds = [" ".join(row[:3]) for row in top_labels]

sub = pd.DataFrame({
    "row_id": test.row_id.values,
    "Category:Misconception": joined_preds
})
sub.to_csv("submission_gemma.csv", index=False)

prob_data = []
for i in range(len(test)):
    prob_dict = {f"prob_{j}": probs[i, top_indices[i, j]] for j in range(25)}  # Top 25
    prob_dict['row_id'] = test.row_id.values[i]
    prob_dict['top_classes'] = " ".join(top_labels[i, :25])  # Top 25 class names
    prob_data.append(prob_dict)

prob_df = pd.DataFrame(prob_data)
prob_df.to_csv("submission_gemma_prob.csv", index=False)

Writing gemma2_inference.py


In [2]:
%%writefile qwen3_deepseek_inference.py

# Run DeepSeek and Qwen3 inference in parallel, one model per GPU
import os
import torch
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset
import threading
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
from scipy.special import softmax
from tqdm import tqdm
import time

os.environ["TOKENIZERS_PARALLELISM"] = "false"


train = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/train.csv')
test  = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/test.csv')

model_paths = [
    "/kaggle/input/deekseepmath-7b-map-competition/MAP_EXP_09_FULL",
   "/kaggle/input/qwen3-8b-map-competition/MAP_EXP_16_FULL"]

def format_input(row):
    x = "This answer is correct."
    if not row['is_correct']:
        x = "This is answer is incorrect."
    return (
        f"Question: {row['QuestionText']}\n"
        f"Answer: {row['MC_Answer']}\n"
        f"{x}\n"
        f"Student Explanation: {row['StudentExplanation']}")


le = LabelEncoder()
train.Misconception  = train.Misconception.fillna('NA')
train['target']   = train.Category + ':' +train.Misconception
train['label']    = le.fit_transform(train['target'])

n_classes = len(le.classes_)
print(f"Train shape: {train.shape} with {n_classes} target classes")
idx = train.apply(lambda row: row.Category.split('_')[0],axis=1)=='True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId','MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c',ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId','MC_Answer']]
correct['is_correct'] = 1

test = test.merge(correct, on=['QuestionId','MC_Answer'], how='left')
test.is_correct = test.is_correct.fillna(0)
test['text'] = test.apply(format_input,axis=1)
ds_test = Dataset.from_pandas(test)


def run_inference_on_gpu(model_path, gpu_id, test_data, output_name):
    """Run inference for one model on one GPU"""
    
    device = f"cuda:{gpu_id}"
    print(f"Loading {output_name} on {device}...")
    
    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path, 
        device_map=device, 
        torch_dtype=torch.float16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.config.pad_token_id = tokenizer.pad_token_id
    model.eval()
    
    # Tokenize function
    def tokenize(batch):
        return tokenizer(batch["text"], 
                        truncation=True,
                        max_length=256)
    
    ds_test = Dataset.from_pandas(test_data[['text']])
    ds_test = ds_test.map(tokenize, batched=True, remove_columns=['text'])
    
    # Data collator
    data_collator = DataCollatorWithPadding(
        tokenizer=tokenizer,
        padding=True,
        return_tensors="pt"
    )
    
    # DataLoader
    dataloader = DataLoader(
        ds_test,
        batch_size=4,
        shuffle=False,
        collate_fn=data_collator,
        pin_memory=True,
        num_workers=0
    )
    
    # Inference
    all_logits = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc=f"{output_name}"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            all_logits.append(outputs.logits.float().cpu().numpy())
    
    predictions = np.concatenate(all_logits, axis=0)
    
    # Process results
    probs = softmax(predictions, axis=1)
    top_indices = np.argsort(-probs, axis=1)
    
    # Decode labels
    flat_indices = top_indices.flatten()
    decoded_labels = le.inverse_transform(flat_indices)
    top_labels = decoded_labels.reshape(top_indices.shape)
    
    filtered_top3 = []

    # Build per-row family prefix from test.is_correct (1 -> True_, 0 -> False_)
    fam_prefix = np.where(test_data.is_correct.values == 1, "True_", "False_")
    valid_labels = set(le.classes_)  # all labels seen in train
    for i in range(len(test_data)):
        pref = fam_prefix[i]
        row = [lab for lab in top_labels[i, :] if lab.startswith(pref)]
    
        # de-dup while preserving order
        seen = set(); row = [x for x in row if not (x in seen or seen.add(x))]
    
        # backfill safely from allowed labels (prefer Neither, then Correct when True_)
        fillers = [f"{pref}Neither:NA", f"{pref}Correct:NA"] if pref == "True_" else [f"{pref}Neither:NA"]
        for f in fillers:
            if len(row) >= 3: break
            if f in valid_labels and f not in row:
                row.append(f)
        while len(row) < 3:
            # last resort: keep repeating a valid family-safe filler that's allowed
            f = fillers[0] if fillers and fillers[0] in valid_labels else next(l for l in valid_labels if l.startswith(pref))
            row.append(f)
    
        filtered_top3.append(" ".join(row[:3]))
    
    joined_preds = filtered_top3  # <— replace previous joined_preds

    sub = pd.DataFrame({
        "row_id": test_data.row_id.values,
        "Category:Misconception": joined_preds
    })
    sub.to_csv(f"submission_{output_name}.csv", index=False)
    
    # Save probabilities for ensemble
    prob_data = []
    for i in range(len(predictions)):
        prob_dict = {f"prob_{j}": probs[i, top_indices[i, j]] for j in range(25)}
        prob_dict['row_id'] = test_data.row_id.values[i]
        prob_dict['top_classes'] = " ".join(top_labels[i, :25])
        prob_data.append(prob_dict)
    
    prob_df = pd.DataFrame(prob_data)
    prob_df.to_csv(f"submission_{output_name}_probabilities.csv", index=False)
    
    print(f"{output_name} completed - saved submission and probabilities")
    
    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()

print("Starting multi-GPU inference...")
start_time = time.time()

threads = []
gpu_assignments = [
    (model_paths[0], 0, "deepseek"),
    (model_paths[1], 1, "qwen3"),
]

# Start threads
for model_path, gpu_id, name in gpu_assignments:
    if gpu_id < torch.cuda.device_count():  
        thread = threading.Thread(
            target=run_inference_on_gpu,
            args=(model_path, gpu_id, test, name)
        )
        threads.append(thread)
        thread.start()
        time.sleep(10)  # Stagger starts to avoid memory issues

# Wait for completion
for thread in threads:
    thread.join()

end_time = time.time()
print(f"Completed in {end_time - start_time:.2f} seconds!")

Writing qwen3_deepseek_inference.py


## Run inference

In [3]:
import time 
!python /kaggle/working/gemma2_inference.py
time.sleep(10)
!python /kaggle/working/qwen3_deepseek_inference.py

2025-10-15 07:33:34.001152: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760513614.383102      39 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760513614.494998      39 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Train shape: (36696, 9) with 65 target classes
Loading checkpoint shards: 100%|██████████████████| 4/4 [01:43<00:00, 25.98s/it]
Some weights of Gemma2ForSequenceClassification were not initialized from the model checkpoint at /kaggle/input/gemma2-9b-it-bf16 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|███████

## Ensemble 

In [4]:
import pandas as pd
import numpy as np
from collections import defaultdict

# -------------------------
# Build family map
# -------------------------
train = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/train.csv')
test_df = pd.read_csv('/kaggle/input/map-charting-student-math-misunderstandings/test.csv')

train['is_true'] = train['Category'].str.startswith('True')
correct = (train[train.is_true]
           .assign(c=lambda df: df.groupby(['QuestionId','MC_Answer']).MC_Answer.transform('count'))
           .sort_values('c', ascending=False)
           .drop_duplicates(['QuestionId'])[['QuestionId','MC_Answer']])
correct['is_correct'] = 1

fam_map = (test_df.merge(correct, on=['QuestionId','MC_Answer'], how='left')
                  .assign(is_correct=lambda df: df.is_correct.fillna(0).astype(int))
                  .set_index('row_id')['is_correct']
                  .map({1: 'True_', 0: 'False_'}).to_dict())

# -------------------------
# Ensemble
# -------------------------
def extract_class_probabilities(row, model_suffix='', top_k=25):
    """Extract class names and probabilities from a row"""
    classes_col = f'top_classes{model_suffix}'
    if classes_col in row:
        classes = row[classes_col].split(' ')[:top_k]
    else:
        return {}
    class_probs = {}
    for i in range(min(top_k, len(classes))):
        prob_col = f'prob_{i}{model_suffix}'
        if prob_col in row:
            class_probs[classes[i]] = row[prob_col]
    return class_probs


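def ensemble_with_disagreement_handling(prob_files, model_weights=None, top_k=3):
    n_models = len(prob_files)
    # Fall back to equal weights so the default model_weights=None does not fail
    if model_weights is None:
        model_weights = [1.0] * n_models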
    prob_dfs = []
    final_predictions = []
    
    for file_path in prob_files:
        df = pd.read_csv(file_path)
        prob_dfs.append(df)
    
    # Merge on row_id
    merged_df = prob_dfs[0]
    for i, df in enumerate(prob_dfs[1:], 1):
        merged_df = pd.merge(merged_df, df, on='row_id', suffixes=('', f'_model{i+1}'))
      
    for idx, row in merged_df.iterrows():
        pref = fam_map[row['row_id']]  # family for this row
        
        # Extract probabilities from each model
        all_class_probs = []
        for i in range(n_models):
            suffix = f'_model{i+1}' if i > 0 else ''
            class_probs = extract_class_probabilities(row, suffix, top_k=25)
            all_class_probs.append(class_probs)
        
        # Get all unique classes
        all_classes = set()
        for class_probs in all_class_probs:
            all_classes.update(class_probs.keys())
        
        # Calculate scores
        class_votes = defaultdict(int)
        class_total_prob = defaultdict(float)
        class_max_prob = defaultdict(float)
        
        for i, class_probs in enumerate(all_class_probs):
            weight = model_weights[i]
            for class_name, prob in class_probs.items():
                class_votes[class_name] += 1
                class_total_prob[class_name] += prob * weight
                class_max_prob[class_name] = max(class_max_prob[class_name], prob * weight)
        
        final_scores = {}
        for class_name in all_classes:
            base_score = class_total_prob[class_name]
            agreement_bonus = class_votes[class_name] / n_models
            confidence_bonus = class_max_prob[class_name]
            final_scores[class_name] = (
                base_score * 0.6 +
                agreement_bonus * 0.3 +
                confidence_bonus * 0.1
            )
        
        # -------------------------
        # Family filter
        # -------------------------
        final_scores = {k: v for k, v in final_scores.items() if k.startswith(pref)}
        
        # Sort and get top-k
        sorted_classes = sorted(final_scores.items(), key=lambda x: -x[1])
        top_classes = [class_name for class_name, _ in sorted_classes[:top_k]]
        
        # Backfill if < 3
        fillers = [f"{pref}Neither:NA"] + ([f"{pref}Correct:NA"] if pref == "True_" else [])
        for f in fillers:
            if len(top_classes) >= 3: break
            if f not in top_classes:
                top_classes.append(f)
        while len(top_classes) < 3:
            top_classes.append(fillers[0])
        
        final_predictions.append(' '.join(top_classes))
    
    return final_predictions


# -------------------------
# Run ensemble
# -------------------------
w1, w2, w3 = 1.2, 1.0, 0.8
prob_files = [
    '/kaggle/working/submission_deepseek_probabilities.csv',
    '/kaggle/working/submission_gemma_prob.csv',
    '/kaggle/working/submission_qwen3_probabilities.csv'
]

predictions = ensemble_with_disagreement_handling(
    prob_files, 
    model_weights=[w1, w2, w3],  
    top_k=3
)

submission = pd.DataFrame({
    'row_id': test_df.row_id.values,
    'Category:Misconception': predictions
})

submission.to_csv('submission.csv', index=False)
print(submission.head())

   row_id                             Category:Misconception
0   36696  True_Correct:NA True_Neither:NA True_Misconcep...
1   36697  False_Misconception:WNB False_Neither:NA False...
2   36698  True_Neither:NA True_Correct:NA True_Misconcep...
