# Studying Steering Side-Effects

In this notebook, we study a model's instruction-following ability across a range of instruction types. We also examine whether steering the model toward better instruction following affects its overall response quality.

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import logging as hf_logging

from aisteer360.algorithms.state_control.pasta.control import PASTA
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.evaluation.use_cases.instruction_following import InstructionFollowing
from aisteer360.evaluation.metrics.custom.instruction_following.strict_instruction import StrictInstruction
from aisteer360.evaluation.metrics.generic.reward_score import RewardScore
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils import (
    flatten_profiles,
    summarize_by_config,
    get_param_values,
    build_per_example_df,
    to_jsonable,
)
from aisteer360.evaluation.utils.viz_utils import (
    create_tradeoff_figure,
    plot_metric_heatmap,
    plot_comparison_bars,
    plot_pareto_frontier,
    plot_tradeoff_scatter,
)

hf_logging.set_verbosity_error()
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

## Data preparation

A model can be prompted with innumerable types of instructions. To understand its instruction-following ability in general, we explore model behavior across the wide range of instruction types organized by the `IFEval` dataset. For this study, we use our modified version of the IFEval dataset, termed `Split-IFEval`, in which the instructions are explicitly extracted from the prompt (this makes it easier to create interventions that rely directly on these tokens).

In [None]:
ifeval = load_dataset("ibm-research/Split-IFEval")
ifeval_df = ifeval["train"].to_pandas()

cols = ["instructions", "instruction_id_list", "kwargs"]
for col in cols:
    ifeval_df[col] = ifeval_df[col].apply(
        lambda x: x.tolist() if isinstance(x, np.ndarray) else x
    )

ifeval_df

As the `instruction_id_list` column shows, a prompt can in general contain several instructions. We'll focus on the prompts that contain a single instruction.

In [None]:
ifeval_df["num_instructions"] = ifeval_df["instruction_id_list"].apply(len)
single_instr_df = ifeval_df[ifeval_df["num_instructions"] == 1].copy()
single_instr_df["instruction_id"] = single_instr_df["instruction_id_list"].apply(lambda ids: ids[0])
instruction_group_sizes = (
    single_instr_df["instruction_id"]
    .value_counts()
    .rename_axis("instruction_id")
    .reset_index(name="count")
)

instruction_group_sizes.head(20)

We'll study the following instruction types:

- `keywords:forbidden_words`: the response must not use any word from a specified forbidden list.
- `detectable_format:number_highlighted_sections`: the response must contain at least a specified number of highlighted sections, marked with a defined markup pattern.
- `language:response_language`: the entire response must be written in a specified target language.
- `startend:end_checker`: the response must end with an exact required phrase, with nothing following it.
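
To make two of these checks concrete, the sketch below implements hypothetical checkers for the forbidden-words and end-phrase instructions. The function names and the matching rules (case-insensitive whole-word matching, trailing whitespace ignored) are assumptions for illustration; they are not taken from the IFEval or `aisteer360` implementations.

```python
import re


def avoids_forbidden_words(response: str, forbidden_words: list[str]) -> bool:
    """Return True if none of the forbidden words appears as a whole word."""
    for word in forbidden_words:
        # Assumption: matching is case-insensitive and on whole words only.
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return False
    return True


def ends_with_phrase(response: str, end_phrase: str) -> bool:
    """Return True if the response ends with the phrase (trailing whitespace ignored)."""
    return response.rstrip().endswith(end_phrase)


print(avoids_forbidden_words("The sky is blue.", ["green", "red"]))
print(ends_with_phrase("Done. Is there anything else?\n", "Is there anything else?"))
```

Both calls print `True`. A response such as "A Green field" would fail the first check, and any text after the required phrase would fail the second.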

In [None]:
instruction_types = [
    "keywords:forbidden_words",
    "detectable_format:number_highlighted_sections",
    "language:response_language",
    "startend:end_checker",
]

filtered_df = single_instr_df[
    single_instr_df["instruction_id"].isin(instruction_types)
].copy()

balanced_filtered = (
    filtered_df.groupby("instruction_id")
    .apply(lambda g: g.sample(min(len(g), 12), random_state=123))
    .reset_index(drop=True)
)

balanced_filtered
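
The capped per-group sampling above can be checked on a toy frame. The sketch below uses made-up data and a cap of 3 instead of 12; it iterates over the groups and concatenates the samples, which is one equivalent way to write the same idea without `groupby.apply`.

```python
import pandas as pd

# Toy frame with unbalanced groups (hypothetical data).
toy = pd.DataFrame({
    "instruction_id": ["a"] * 5 + ["b"] * 2 + ["c"] * 1,
    "prompt": [f"prompt {i}" for i in range(8)],
})

CAP = 3  # per-group cap; the notebook uses 12

# Sample at most CAP rows from each group; small groups are kept whole.
parts = [
    group.sample(n=min(len(group), CAP), random_state=123)
    for _, group in toy.groupby("instruction_id")
]
balanced_toy = pd.concat(parts, ignore_index=True)

print(balanced_toy["instruction_id"].value_counts().sort_index().to_dict())
# {'a': 3, 'b': 2, 'c': 1}
```

Groups smaller than the cap keep all of their rows, so the result is balanced only up to the size of the smallest group.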

Evaluation data takes the form of a prompt (including instructions), the specific instructions (separated from the prompt), the IDs of the instructions, and any associated kwargs for the instructions.

In [None]:
evaluation_data = [
    {
        "prompt": row["prompt"],
        "instructions": to_jsonable(row["instructions"]),
        "instruction_id_list": to_jsonable(row["instruction_id_list"]),
        "kwargs": to_jsonable(row["kwargs"]),
    }
    for _, row in balanced_filtered.iterrows()
]

len(evaluation_data), evaluation_data[0]

## Defining the benchmark

We use the `ControlSpec` class to sweep the steering strength `alpha`. The steered layers (`head_config`) and the remaining PASTA settings are held fixed throughout the sweep.

In [None]:
pasta_spec = ControlSpec(
    control_cls=PASTA,
    params={
        "head_config": list(range(8, 24)),
        "scale_position": "include",
    },
    vars=[
        {"alpha": 5.0},
        {"alpha": 10.0},
        {"alpha": 20.0},
        {"alpha": 30.0},
        {"alpha": 40.0},
        {"alpha": 50.0},
    ],
    name="PASTA",
)
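
Conceptually, the spec above describes six configurations that share the same fixed parameters and differ only in `alpha`. The sketch below shows one plausible reading of that expansion in plain Python. The helper `expand_sweep` is hypothetical, and how `ControlSpec` resolves `params` and `vars` internally is an assumption here.

```python
from typing import Any


def expand_sweep(params: dict[str, Any], vars_list: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge the fixed params with each override to get one config per sweep point."""
    return [{**params, **override} for override in vars_list]


fixed = {"head_config": list(range(8, 24)), "scale_position": "include"}
sweep = [{"alpha": a} for a in (5.0, 10.0, 20.0, 30.0, 40.0, 50.0)]

configs = expand_sweep(fixed, sweep)
print(len(configs))          # 6
print(configs[0]["alpha"])   # 5.0
```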

The instruction-following use case is initialized with two metrics: `StrictInstruction`, which measures instruction compliance, and `RewardScore`, which scores response quality with a reward model. We study the trade-off between these two metrics.

In [None]:
instruction_following = InstructionFollowing(
    evaluation_data=evaluation_data,
    evaluation_metrics=[
        StrictInstruction(),
        RewardScore(
            model_or_id="OpenAssistant/reward-model-deberta-v3-large-v2",
            score_transform="identity",
            batch_size=8,
            max_length=1024,
            return_logits=False,
        )
    ],
)
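
IFEval-style strict evaluation usually reports two accuracies: a prompt-level accuracy (all instructions in a prompt were followed) and an instruction-level accuracy (fraction of individual instructions followed). The sketch below computes both from per-prompt boolean results. It assumes that `StrictInstruction` follows these IFEval definitions, and the function name is hypothetical.

```python
def strict_accuracies(per_prompt_results: list[list[bool]]) -> tuple[float, float]:
    """Compute (prompt-level, instruction-level) accuracy.

    per_prompt_results[i][j] is True if instruction j of prompt i was followed.
    """
    n_prompts = len(per_prompt_results)
    n_instructions = sum(len(r) for r in per_prompt_results)
    # A prompt counts as correct only if every one of its instructions was followed.
    prompt_acc = sum(all(r) for r in per_prompt_results) / n_prompts
    # Each instruction counts on its own, regardless of which prompt it belongs to.
    instr_acc = sum(sum(r) for r in per_prompt_results) / n_instructions
    return prompt_acc, instr_acc


multi = [[True, False], [True, True], [False]]
print(strict_accuracies(multi))

single = [[True], [False], [True], [True]]
print(strict_accuracies(single))  # (0.75, 0.75)
```

With exactly one instruction per prompt, as in this notebook, the two accuracies coincide under these definitions.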

The benchmark can then be defined on two steering pipelines: the baseline (unsteered) model, and the above `pasta_spec`.

In [None]:
benchmark = Benchmark(
    use_case=instruction_following,
    base_model_name_or_path=MODEL_NAME,
    steering_pipelines={
        "baseline": [],
        "pasta_alpha_sweep": [pasta_spec],
    },
    runtime_overrides={
        "PASTA": {"substrings": "instructions"},
    },
    gen_kwargs={
        "max_new_tokens": 128,
        "do_sample": True,
        "output_attentions": True,
    },
    hf_model_kwargs={
        "attn_implementation": "eager",
    },
    device_map="auto",
    num_trials=5
)

Running the benchmark yields the profiles across the baseline and the full set of configurations in the spec.

In [None]:
profiles = benchmark.run()

## Analysis

We now use the built-in utilities from `aisteer360.evaluation.utils` to process and visualize the profiles.

### Flattening profiles

The `flatten_profiles` utility converts the nested benchmark output into a flat DataFrame with one row per run. We specify which metrics to extract via `metric_accessors`.

In [None]:
# flatten profiles with metric extraction
runs_df = flatten_profiles(
    profiles,
    metric_accessors={
        "strict_prompt_acc": ("StrictInstruction", "strict_prompt_accuracy"),
        "strict_instr_acc": ("StrictInstruction", "strict_instruction_accuracy"),
        "mean_reward": ("RewardScore", "mean_reward"),
    }
)

# extract the swept alpha parameter as a column
runs_df["alpha"] = get_param_values(runs_df, "PASTA", "alpha")

display(
    runs_df
    .drop(columns=["_run", "params"])
    .sort_values(["alpha", "trial_id"], na_position="last")
    .reset_index(drop=True)
)
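
To illustrate what flattening involves, the sketch below rebuilds the idea with plain pandas on toy profiles. The nested layout (pipeline name mapping to a list of runs with `trial_id`, `params`, and `evaluations`) mirrors how the profiles are accessed later in this notebook; the values are made up, and the `flatten` helper is a hypothetical stand-in, not the library's implementation.

```python
import pandas as pd

# Toy profiles: pipeline name -> list of runs (all values are made up).
toy_profiles = {
    "baseline": [
        {"trial_id": 0, "params": {},
         "evaluations": {"RewardScore": {"mean_reward": 0.4}}},
    ],
    "pasta_alpha_sweep": [
        {"trial_id": 0, "params": {"PASTA": {"alpha": 5.0}},
         "evaluations": {"RewardScore": {"mean_reward": 0.3}}},
        {"trial_id": 0, "params": {"PASTA": {"alpha": 10.0}},
         "evaluations": {"RewardScore": {"mean_reward": 0.2}}},
    ],
}


def flatten(profiles: dict, accessors: dict) -> pd.DataFrame:
    """Build one row per run, pulling out the metrics named in `accessors`."""
    rows = []
    for pipeline, runs in profiles.items():
        for run in runs:
            row = {"pipeline": pipeline, "trial_id": run["trial_id"]}
            # The baseline has no PASTA params, so its alpha is missing.
            row["alpha"] = (run.get("params") or {}).get("PASTA", {}).get("alpha")
            for col, (metric, key) in accessors.items():
                row[col] = run["evaluations"][metric][key]
            rows.append(row)
    return pd.DataFrame(rows)


toy_runs = flatten(toy_profiles, {"mean_reward": ("RewardScore", "mean_reward")})
print(toy_runs)
```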

### Summarizing by configuration

The `summarize_by_config` utility aggregates metrics across trials, computing mean and std for each configuration.

In [None]:
# summarize across trials
summary = summarize_by_config(
    runs_df,
    metric_cols=["strict_prompt_acc", "strict_instr_acc", "mean_reward"],
    group_cols=["pipeline", "config_id", "alpha"],
)

# add a readable config label
summary["config"] = summary["alpha"].apply(
    lambda a: "baseline" if pd.isna(a) else f"alpha={a}"
)

display(summary[[
    "config", "alpha", "n_trials",
    "strict_prompt_acc_mean", "strict_prompt_acc_std",
    "mean_reward_mean", "mean_reward_std"
]].sort_values("alpha", na_position="last").round(4))
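
The aggregation step can be approximated with a plain pandas `groupby`. The sketch below uses toy numbers and follows the `<metric>_mean` / `<metric>_std` / `n_trials` column naming seen above. It is not the library's implementation. One detail worth noting: the baseline has a missing `alpha`, so the grouping needs `dropna=False` to keep it.

```python
import numpy as np
import pandas as pd

# Toy per-run results (made-up values): two trials per configuration.
toy_runs = pd.DataFrame({
    "pipeline": ["baseline", "baseline", "pasta", "pasta"],
    "alpha": [np.nan, np.nan, 5.0, 5.0],
    "mean_reward": [0.40, 0.60, 0.20, 0.40],
})

toy_summary = (
    toy_runs
    # dropna=False keeps the baseline group, whose alpha is NaN
    .groupby(["pipeline", "alpha"], dropna=False)
    .agg(
        n_trials=("mean_reward", "count"),
        mean_reward_mean=("mean_reward", "mean"),
        mean_reward_std=("mean_reward", "std"),
    )
    .reset_index()
)
print(toy_summary)
```

Without `dropna=False`, pandas silently drops rows whose group key is NaN, and the baseline would disappear from the summary.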

### Tradeoff visualization

The `create_tradeoff_figure` utility generates a 3-panel figure showing:

1. Instruction following rate vs steering strength
2. Output quality (reward) vs steering strength
3. Scatter plot of reward vs accuracy

In [None]:
fig = create_tradeoff_figure(
    summary,
    x_metric="strict_prompt_acc",
    y_metric="mean_reward",
    sweep_col="alpha",
    baseline_pipeline="baseline",
    x_label="Strict Prompt Accuracy",
    y_label="Mean Reward Score",
    sweep_label="Alpha",
    save_path="pasta_tradeoff_analysis.png",
)

### Per-example analysis

The `build_per_example_df` utility extracts per-example results from a single run, making it easy to analyze individual cases.

In [None]:
def get_run_by_config(runs_df: pd.DataFrame, pipeline: str, alpha=None, trial_id: int = 0):
    """Get a specific run from the flattened DataFrame."""
    if pipeline == "baseline":
        mask = (runs_df["pipeline"] == "baseline") & (runs_df["trial_id"] == trial_id)
    else:
        mask = (runs_df["pipeline"] == pipeline) & (runs_df["alpha"] == alpha) & (runs_df["trial_id"] == trial_id)
    return runs_df.loc[mask, "_run"].iloc[0]

In [None]:
pasta_summary = summary[summary["config"] != "baseline"]
strongest_alpha = pasta_summary["alpha"].max()  # largest alpha = strongest steering

baseline_run = get_run_by_config(runs_df, "baseline")
strong_run = get_run_by_config(runs_df, "pasta_alpha_sweep", strongest_alpha)

baseline_ex = build_per_example_df(
    baseline_run,
    generation_fields=["prompt", "response", "instruction_id_list"],
    metric_lists={
        "followed": ("StrictInstruction", "follow_all_instructions"),
        "reward": ("RewardScore", "rewards"),
    }
)
strong_ex = build_per_example_df(
    strong_run,
    generation_fields=["prompt", "response", "instruction_id_list"],
    metric_lists={
        "followed": ("StrictInstruction", "follow_all_instructions"),
        "reward": ("RewardScore", "rewards"),
    }
)

# find cases where steering fixed instruction following
comparison = baseline_ex[["idx", "followed", "reward"]].merge(
    strong_ex[["idx", "followed", "reward"]],
    on="idx", suffixes=("_base", "_strong")
)
fixed = comparison[(~comparison["followed_base"]) & (comparison["followed_strong"])].copy()
fixed["reward_delta"] = fixed["reward_strong"] - fixed["reward_base"]

fixed.sort_values("reward_delta")[["idx", "reward_base", "reward_strong", "reward_delta"]]
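
The same merge can also surface the opposite side effect: examples that the baseline handled correctly but the steered model did not. The sketch below labels each example's transition on made-up toy data. The `transition` column and the "fixed"/"broken" labels are illustrative names, not part of the library.

```python
import pandas as pd

# Toy per-example results (made-up values).
base = pd.DataFrame({
    "idx": [0, 1, 2, 3],
    "followed": [False, True, False, True],
    "reward": [0.5, 0.7, 0.2, 0.6],
})
steered = pd.DataFrame({
    "idx": [0, 1, 2, 3],
    "followed": [True, True, False, False],
    "reward": [0.3, 0.6, 0.1, 0.2],
})

merged = base.merge(steered, on="idx", suffixes=("_base", "_strong"))

# Cast to bool so that `~` is a logical NOT even if a column arrives as object dtype.
followed_base = merged["followed_base"].astype(bool)
followed_strong = merged["followed_strong"].astype(bool)

merged["transition"] = "unchanged"
merged.loc[~followed_base & followed_strong, "transition"] = "fixed"
merged.loc[followed_base & ~followed_strong, "transition"] = "broken"
merged["reward_delta"] = merged["reward_strong"] - merged["reward_base"]

print(merged[["idx", "transition", "reward_delta"]])
```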

## Additional plots

### Per-instruction-type breakdown

We can use `plot_metric_heatmap` to visualize performance across instruction types and steering strengths.

In [None]:
def extract_per_instruction_results(profiles, evaluation_data):
    """Break down results by instruction type and alpha."""
    rows = []

    for pipeline_name, runs in profiles.items():
        for run in runs:
            alpha = (run.get("params", {}) or {}).get("PASTA", {}).get("alpha", None)
            if pipeline_name == "baseline":
                alpha = 0.0

            generations = run["generations"]
            followed_list = run["evaluations"]["StrictInstruction"]["follow_all_instructions"]
            rewards = run["evaluations"]["RewardScore"]["rewards"]

            for gen, followed, reward in zip(generations, followed_list, rewards):
                instr_id = gen["instruction_id_list"][0] if gen.get("instruction_id_list") else None
                rows.append({
                    "alpha": alpha,
                    "instruction_type": instr_id.split(":")[-1] if instr_id else None,
                    "followed": followed,
                    "reward": reward,
                    "trial_id": run["trial_id"],
                })

    return pd.DataFrame(rows)

per_instr_df = extract_per_instruction_results(profiles, evaluation_data)

# aggregate by instruction type and alpha
instr_summary = (
    per_instr_df
    .groupby(["instruction_type", "alpha"])
    .agg(
        follow_rate=("followed", "mean"),
        mean_reward=("reward", "mean"),
        n=("followed", "count")
    )
    .reset_index()
)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

follow_pivot = instr_summary.pivot(index="instruction_type", columns="alpha", values="follow_rate")
reward_pivot = instr_summary.pivot(index="instruction_type", columns="alpha", values="mean_reward")

plot_metric_heatmap(
    follow_pivot,
    ax=axes[0],
    title="instruction following by type & alpha",
    xlabel="alpha (0 = baseline)",
    vmin=0, vmax=1,
    cbar_label="follow rate",
)

plot_metric_heatmap(
    reward_pivot,
    ax=axes[1],
    title="response quality by type & alpha",
    xlabel="alpha (0 = baseline)",
    fmt=".1f",
    cbar_label="reward",
)

plt.tight_layout()
plt.show()

### Which instructions benefit from steering?

We use `plot_comparison_bars` to compare baseline vs steered performance.

In [None]:
baseline_by_type = instr_summary[instr_summary["alpha"] == 0.0].set_index("instruction_type")
best_alpha = 10.0
steered_by_type = (
    instr_summary[instr_summary["alpha"] == best_alpha]
    .set_index("instruction_type")
    .reindex(baseline_by_type.index)  # align rows with the baseline before subtracting .values
)

comparison_by_type = pd.DataFrame({
    "instruction_type": baseline_by_type.index,
    "follow_delta": steered_by_type["follow_rate"].values - baseline_by_type["follow_rate"].values,
    "reward_delta_scaled": (steered_by_type["mean_reward"].values - baseline_by_type["mean_reward"].values) / 3,
}).sort_values("follow_delta", ascending=False)

plot_comparison_bars(
    comparison_by_type,
    metric_cols=["follow_delta", "reward_delta_scaled"],
    group_col="instruction_type",
    title=f"change from baseline (alpha={best_alpha})",
    ylabel="delta from baseline",
    colors=["steelblue", "coral"],
)
plt.show()

### Efficiency frontier

We can overlay a Pareto frontier on the scatter plot using `plot_pareto_frontier`.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

# filter to non-baseline for scatter
pasta_summary = summary[summary["config"] != "baseline"]
baseline_rows = summary[summary["config"] == "baseline"]
baseline_row = baseline_rows.iloc[0] if not baseline_rows.empty else None

# plot scatter with tradeoff
plot_tradeoff_scatter(
    pasta_summary,
    x_metric="strict_prompt_acc",
    y_metric="mean_reward",
    color_col="alpha",
    label_col="config",
    baseline_row=baseline_row,
    ax=ax,
    title="quality–compliance tradeoff with pareto frontier",
    xlabel="strict prompt accuracy",
    ylabel="mean reward score",
)

# overlay Pareto frontier
plot_pareto_frontier(
    pasta_summary,
    x_metric="strict_prompt_acc",
    y_metric="mean_reward",
    ax=ax,
    maximize_x=True,
    maximize_y=True,
)

plt.tight_layout()
plt.show()
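
For reference, a Pareto frontier with both metrics maximized consists of the configurations that no other configuration matches or beats on both axes at once. The sketch below computes it for toy (accuracy, reward) pairs. The `pareto_front` helper and the points are illustrative, and `plot_pareto_frontier` may compute the frontier differently.

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points not dominated by any other point (both coordinates maximized)."""
    front = []
    for i, (x, y) in enumerate(points):
        # A point is dominated if another point is at least as good on both
        # axes and strictly better on at least one.
        dominated = any(
            (ox >= x and oy >= y) and (ox > x or oy > y)
            for j, (ox, oy) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((x, y))
    return sorted(front)


# Toy (strict accuracy, mean reward) pairs; the values are made up.
toy_points = [(0.50, 0.9), (0.60, 0.8), (0.55, 0.7), (0.70, 0.4), (0.65, 0.3)]
print(pareto_front(toy_points))
# [(0.5, 0.9), (0.6, 0.8), (0.7, 0.4)]
```

Here `(0.55, 0.7)` is dominated by `(0.60, 0.8)` and `(0.65, 0.3)` by `(0.70, 0.4)`, so neither appears on the frontier.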

## Takeaways

PASTA steering can improve instruction following, but the best `alpha` depends on how much loss in response quality is acceptable. For this model and task, moderate steering (α ∈ [10, 20]) typically offers the best balance between the two.