# Commonsense MCQA

This notebook benchmarks steering methods on the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) dataset. We compare the unsteered baseline, few-shot steering with varying numbers of examples, and a LoRA adapter trained with DPO. Using `ControlSpec`, we sweep over the number of few-shot examples to find the smallest number at which few-shot steering outperforms DPO-LoRA.

## Setup

In [None]:
import os
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import transformers
from datasets import Dataset
from peft import PeftType

from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils import flatten_profiles, get_param_values

transformers.logging.set_verbosity_error()
os.chdir("./examples/notebooks/benchmark_commonsense_mcqa/")

## Building the use case

The use case of interest has already been constructed via the [use case](../../../docs/tutorials/add_new_use_case.md) tutorial and is available at `aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py`.

In [None]:
commonsense_mcqa = CommonsenseMCQA(
    evaluation_data="evaluation_qa.jsonl",
    evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
    num_shuffling_runs=20,
    num_samples=50
)

Two custom metrics have been created for the use case: `MCQAAccuracy`, which measures per-question accuracy across shuffled runs, and `MCQAPositionalBias`, which measures positional bias as the deviation of the chosen answer positions from the uniform distribution across runs. To support these statistics, the use case accepts a keyword argument `num_shuffling_runs` that sets how many times each question is presented to the (steered) model, each time with the choices in a randomized order.
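To make the two statistics concrete, the sketch below computes them for a single question from a hypothetical record of shuffled runs. It is not the implementation of `MCQAAccuracy` or `MCQAPositionalBias`; the function names, the choice of mean absolute deviation as the distance from uniform, and the toy run data are all assumptions made for illustration.

```python
from collections import Counter


def positional_bias(chosen_positions, num_choices):
    """Mean absolute deviation of chosen-position frequencies from uniform.

    Assumed definition for illustration; the library metric may differ.
    """
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    uniform = 1.0 / num_choices
    deviations = [
        abs(counts.get(pos, 0) / total - uniform) for pos in range(num_choices)
    ]
    return sum(deviations) / num_choices


def question_accuracy(chosen_positions, correct_positions):
    """Fraction of shuffled runs in which the chosen slot held the right answer."""
    hits = sum(c == t for c, t in zip(chosen_positions, correct_positions))
    return hits / len(chosen_positions)


# Hypothetical outcome of 4 shuffled runs on one 5-choice question:
# the model picks slot 0 three times, regardless of where the answer sits.
chosen = [0, 0, 0, 2]
correct = [0, 3, 1, 2]

bias = positional_bias(chosen, num_choices=5)
acc = question_accuracy(chosen, correct)
print(f"accuracy={acc:.2f}, positional_bias={bias:.2f}")
```

A model that always answers from the same slot scores well only when the correct answer happens to land there, which is why shuffling the choices separates knowledge from position preference.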

## Preparing the steering data

The benchmark uses steering data consisting of triples `(question, answer_chosen, answer_rejected)` extracted from the CommonsenseQA dataset.

In [None]:
with open("steer_qa.jsonl", "r") as f:
    steering_data = [json.loads(line) for line in f]

len(steering_data), steering_data[0]

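For the `FewShot` control, we need to create two example pools: a positive pool that pairs each question with its chosen answer, and a negative pool that pairs it with its rejected answer.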

In [None]:
positive_pool = [{"question": row["question"], "answer": row["answer_chosen"]} for row in steering_data]
negative_pool = [{"question": row["question"], "answer": row["answer_rejected"]} for row in steering_data]

len(positive_pool), len(negative_pool)
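As a rough picture of what a random selector does with such a pool, the sketch below samples `k` examples and prepends them to a new question. The `Q:`/`A:` template, the `build_few_shot_prompt` helper, and the toy pool are hypothetical; the notebook does not show the prompt format that `FewShot` actually uses.

```python
import random


def build_few_shot_prompt(question, pool, k, seed=0):
    """Sample k examples from the pool and format them ahead of the question.

    Hypothetical helper; not part of aisteer360.
    """
    rng = random.Random(seed)
    examples = rng.sample(pool, k=min(k, len(pool)))
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Toy pool with the same keys as positive_pool above (made-up content)
toy_pool = [
    {"question": "Where would you store milk?", "answer": "refrigerator"},
    {"question": "What do you use to cut paper?", "answer": "scissors"},
    {"question": "Where do you borrow books?", "answer": "library"},
]

prompt = build_few_shot_prompt("Where do fish live?", toy_pool, k=2)
print(prompt)
```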

## Defining the controls

### FewShot with ControlSpec

Instead of using a fixed number of examples, we use `ControlSpec` to sweep over different values of `k_positive`. We fix `k_negative=0` to isolate the effect of positive examples.

In [None]:
few_shot_spec = ControlSpec(
    control_cls=FewShot,
    params={
        "selector_name": "random",
        "positive_example_pool": positive_pool,
        "negative_example_pool": negative_pool,
        "k_negative": 0,
    },
    vars=[{"k_positive": k} for k in [1, 5, 10, 25]],
    name="FewShot",
)
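One way to picture the sweep is as one concrete configuration per entry in `vars`, with the shared `params` held fixed. The sketch below uses plain dictionaries and assumes each `vars` entry is merged over `params`; how `ControlSpec` expands the sweep internally is not shown in this notebook.

```python
# Shared settings, mirroring the fixed params above (example pools omitted for brevity)
base_params = {"selector_name": "random", "k_negative": 0}

# One entry per swept value of k_positive
variations = [{"k_positive": k} for k in [1, 5, 10, 25]]

# Assumed expansion: each variation overrides or extends the shared params
configs = [{**base_params, **variation} for variation in variations]

for config in configs:
    print(config)
```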

### DPO with LoRA (fixed control)

The DPO-LoRA control serves as our target to beat. It uses the same steering data to fine-tune a LoRA adapter.

In [None]:
train_ds = Dataset.from_list([
    {"prompt": row["question"], "chosen": row["answer_chosen"], "rejected": row["answer_rejected"]}
    for row in steering_data
])

dpo_lora = DPO(
    train_dataset=train_ds,
    output_dir="trl_models/Qwen2.5-1.5B-DPO-Lora-Steer",
    per_device_train_batch_size=4,
    num_train_epochs=2,
    learning_rate=1e-6,
    beta=0.1,
    loss_type="sigmoid",
    max_length=1024,
    max_prompt_length=512,
    disable_dropout=True,
    logging_steps=100,
    save_strategy="no",
    report_to="none",
    seed=123,
    use_peft=True,
    peft_type=PeftType.LORA,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    adapter_name="dpo",
    merge_lora_after_train=False,
)
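For reference, `loss_type="sigmoid"` with `beta=0.1` corresponds to the standard DPO objective: the negative log-sigmoid of `beta` times the difference between the policy's and the reference model's preference for the chosen answer over the rejected one. The sketch below evaluates that loss for a single pair; the function name and the log-probability values are made up for illustration, and this is not the TRL implementation.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Before training the policy equals the reference, so the margin is 0
loss_at_init = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0)

# Illustrative values after the policy shifts toward the chosen answer
loss_after = dpo_sigmoid_loss(-10.0, -18.0, -12.0, -15.0)

print(f"initial loss={loss_at_init:.4f}, later loss={loss_after:.4f}")
```

The loss starts at `log(2)` and falls as the policy assigns relatively more probability to the chosen answer than the reference does; a larger `beta` makes the same margin count for more.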

## Running the benchmark

The benchmark compares:
- **baseline**: Unsteered model
- **few_shot_sweep**: FewShot with varying `k_positive` (1, 5, 10, 25)
- **dpo_lora**: DPO-trained LoRA adapter

In [None]:
benchmark = Benchmark(
    use_case=commonsense_mcqa,
    base_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
    steering_pipelines={
        "baseline": [],
        "few_shot_sweep": [few_shot_spec],
        "dpo_lora": [dpo_lora],
    },
    gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
    device_map="auto"
)

profiles = benchmark.run()

In [None]:
benchmark.export(profiles, save_dir="./profiles/")

## Analysis

We use the evaluation utilities to process and visualize the benchmark results.

### Flatten and summarize results

In [None]:
runs_df = flatten_profiles(
    profiles,
    metric_accessors={
        "accuracy": ("MCQAAccuracy", "question_mean"),
        "positional_bias": ("MCQAPositionalBias", "mean"),
    }
)
runs_df["k_positive"] = get_param_values(runs_df, "FewShot", "k_positive")

runs_df[["pipeline", "k_positive", "accuracy", "positional_bias"]].round(3)

### Compare FewShot vs DPO-LoRA

Find the minimum number of positive examples needed for FewShot to outperform DPO-LoRA.

In [None]:
dpo_accuracy = runs_df[runs_df["pipeline"] == "dpo_lora"]["accuracy"].iloc[0]
baseline_accuracy = runs_df[runs_df["pipeline"] == "baseline"]["accuracy"].iloc[0]
few_shot_df = runs_df[runs_df["pipeline"] == "few_shot_sweep"].sort_values("k_positive")

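# Find the smallest k_positive whose accuracy strictly exceeds DPO-LoRA (None if no setting does)
beats_dpo = few_shot_df[few_shot_df["accuracy"] > dpo_accuracy]
min_k = int(beats_dpo["k_positive"].min()) if not beats_dpo.empty else None
print(f"Minimum k_positive that beats DPO-LoRA: {min_k}")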

few_shot_df[["k_positive", "accuracy", "positional_bias"]].round(3)

### Accuracy scaling

The following plot shows how FewShot accuracy scales with the number of positive examples, compared to the DPO-LoRA and baseline reference lines.

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(few_shot_df["k_positive"], few_shot_df["accuracy"], "o-", markersize=8, label="FewShot")
ax.axhline(dpo_accuracy, color="red", linestyle="--", linewidth=2, label="DPO-LoRA")
ax.axhline(baseline_accuracy, color="gray", linestyle=":", linewidth=2, label="Baseline")
ax.set_xlabel("Number of Positive Examples (k_positive)")
ax.set_ylabel("Accuracy")
ax.set_ylim(0, 1.1)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Positional bias

We also examine how positional bias changes with the number of examples.

In [None]:
dpo_bias = runs_df[runs_df["pipeline"] == "dpo_lora"]["positional_bias"].iloc[0]
baseline_bias = runs_df[runs_df["pipeline"] == "baseline"]["positional_bias"].iloc[0]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(few_shot_df["k_positive"], few_shot_df["positional_bias"], "s-", markersize=8, color="orange", label="FewShot")
ax.axhline(dpo_bias, color="red", linestyle="--", linewidth=2, label="DPO-LoRA")
ax.axhline(baseline_bias, color="gray", linestyle=":", linewidth=2, label="Baseline")
ax.set_xlabel("Number of Positive Examples (k_positive)")
ax.set_ylabel("Positional Bias")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Summary table

The summary table below ranks all methods by accuracy.

In [None]:
summary_data = [
    {"Method": "Baseline", "Accuracy": baseline_accuracy, "Positional Bias": baseline_bias},
    {"Method": "DPO-LoRA", "Accuracy": dpo_accuracy, "Positional Bias": dpo_bias},
]
for _, row in few_shot_df.iterrows():
    summary_data.append({
        "Method": f"FewShot (k={int(row['k_positive'])})",
        "Accuracy": row["accuracy"],
        "Positional Bias": row["positional_bias"],
    })

summary_df = pd.DataFrame(summary_data).sort_values("Accuracy", ascending=False)
summary_df.style.format({"Accuracy": "{:.1%}", "Positional Bias": "{:.3f}"}).background_gradient(subset=["Accuracy"], cmap="RdYlGn")

## Takeaways

The results show that FewShot can match or exceed DPO-LoRA accuracy given enough positive examples, without requiring any fine-tuning; `min_k` from the comparison above is the smallest `k_positive` in the sweep at which it does so. This makes FewShot a simpler alternative when quick iteration is needed.