# Choosing a Steering Efficiency Metric

The goal of this notebook is to validate some intuitions and design choices about the steering metric. 

We focus primarily on the logit difference as our downstream metric. It has been used in prior work as a measure of indirect effect [Linear Representation Hypothesis, Sparse Feature Circuits].

Caveats:
- Previous work on steering vectors uses the average key probability instead, so we compare against this metric.
- The steering metric doesn't differentiate examples by base propensity, so we investigate whether steering behaves substantially differently for examples with low and high base propensity.
- The steering metric doesn't differentiate positive from negative steering performance, so we investigate whether this matters.
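
As a rough illustration of how the two candidate metrics relate, the sketch below computes the logit difference and a probability-style metric for one example. It takes "key probability" to mean the probability of the positive answer token renormalised over the two answer tokens; this is an assumption, as are the helper names and the logit values. Under that reading the probability is the sigmoid of the logit difference, so it saturates near 0 and 1 while the logit difference keeps moving.

```python
import numpy as np


def logit_diff(pos_logit: float, neg_logit: float) -> float:
    """Logit of the positive answer token minus the logit of the negative one."""
    return pos_logit - neg_logit


def two_way_prob(pos_logit: float, neg_logit: float) -> float:
    """Probability of the positive token, renormalised over the two answer
    tokens only (hypothetical helper; equals sigmoid(logit_diff))."""
    return float(1.0 / (1.0 + np.exp(-(pos_logit - neg_logit))))


# Hypothetical logits for a single example at three multipliers.
for multiplier, pos, neg in [(-1.0, 1.0, 5.0), (0.0, 3.0, 3.0), (1.0, 5.0, 1.0)]:
    print(multiplier, logit_diff(pos, neg), round(two_way_prob(pos, neg), 3))
```

Because the softmax normaliser cancels, the logit difference also equals the difference of the two tokens' log-probabilities.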




In [None]:
from repepo.steering.sweeps.constants import (
    ALL_ABSTRACT_CONCEPT_DATASETS,
    ALL_TOKEN_CONCEPT_DATASETS, 
    ALL_LANGUAGES,
    ALL_LLAMA_7B_LAYERS,
    ALL_MULTIPLIERS
)

from repepo.steering.sweeps.configs import (
    get_abstract_concept_config,
    get_token_concept_config
)

from repepo.steering.run_sweep import (
    run_sweep, 
    load_sweep_results
)

from repepo.steering.plots.utils import (
    get_config_fields,
    make_results_df
)

In [None]:
# Define the sweep to run over. 

from itertools import product

debug_setting = {
    "datasets": ["power-seeking-inclination"],
    "layers": [13],
    "multipliers": [-1.0, 0.0, 1.0]
}


def iter_config(setting):
    for dataset, layer, multiplier in product(
        setting["datasets"], 
        setting["layers"], 
        setting["multipliers"]
    ):
        yield get_abstract_concept_config(
            dataset=dataset,
            layer=layer,
            multiplier=multiplier
        )


In [None]:
# Optionally, run the sweep and load results. 
# If sweep was already run, set RUN = False.
RUN = False

configs = list(iter_config(debug_setting))
if RUN:
    run_sweep(configs, force_rerun_apply=True)

results = load_sweep_results(configs)

In [None]:
# Construct a DataFrame from the results.
df = make_results_df(results)
print(len(df))
df.head()

In [None]:
# Plot the change in positive and negative token log-probabilities for one example.

import seaborn as sns 
import matplotlib.pyplot as plt

def plot(df):
    example = df.iloc[0]
    df = df[df["test_positive_example.text"] == example["test_positive_example.text"]]
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    # Plot positive token logprob, negative token logprob.
    sns.lineplot(data=df, x="multiplier", y="test_positive_token.logprob", label="Positive logprob", ax=ax)
    sns.lineplot(data=df, x="multiplier", y="test_negative_token.logprob", label="Negative logprob", ax=ax)

plot(df)

In [None]:
# Plot the change in positive token logit and negative token logit for one example. 

import seaborn as sns 
import matplotlib.pyplot as plt

def plot(df):
    example = df.iloc[0]
    df = df[df["test_positive_example.text"] == example["test_positive_example.text"]]
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    # Plot positive token logit, negative token logit.
    sns.lineplot(data=df, x="multiplier", y="test_positive_token.logit", label="Positive logit", ax=ax)
    sns.lineplot(data=df, x="multiplier", y="test_negative_token.logit", label="Negative logit", ax=ax)
    # Also plot the logit mean.
    sns.lineplot(data=df, x="multiplier", y="test_positive_token.logit_mean", label="Positive logit mean", ax=ax)
    # Alternative: plot the logit mean for both tokens.
    # sns.lineplot(data=df, x="multiplier", y="test_positive_token.logit_mean", label="Logit mean", ax=ax)
    # sns.lineplot(data=df, x="multiplier", y="test_negative_token.logit_mean", label="Logit mean", ax=ax)

plot(df)

In [None]:
import pandas as pd
import numpy as np

def calculate_steering_efficiency(
    df: pd.DataFrame, 
    base_metric_name: str = "logit_diff"
):
    df = df.copy()
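    # Group by example: every config field except the multiplier, plus the example text.
    # Build a new list so that we don't mutate whatever get_config_fields() returns.
    fields_to_group_by = [f for f in get_config_fields() if f != "multiplier"]
    fields_to_group_by += ["test_positive_example.text"]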

    grouped = df.groupby(fields_to_group_by)

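    def fit_linear_regression(df: pd.DataFrame):
        # Fit a linear regression of the base metric on the multiplier.
        # Return the slope and the sum of squared residuals of the fit.
        x = df["multiplier"].to_numpy()
        y = df[base_metric_name].to_numpy()
        (slope, intercept), residuals, _, _, _ = np.polyfit(x, y, 1, full=True)
        # np.polyfit returns an empty residuals array when there are two or
        # fewer points or the fit is rank-deficient; report NaN in that case.
        residual = residuals.item() if residuals.size else np.nan
        # Return a dataframe with the slope and residual
        return pd.DataFrame({
            "slope": [slope],
            "residual": [residual]
        })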

    # Apply a linear-fit to each group using grouped.apply
    slopes = grouped.apply(fit_linear_regression, include_groups = False)
    df = df.merge(slopes, on=fields_to_group_by, how='left')
    return df 

df = calculate_steering_efficiency(df)
print(len(df))

# Scatter plot of the slopes and residuals
fig, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=df, x="slope", y="residual", ax=ax)

In [None]:
from IPython.display import display, HTML

pd.set_option('display.max_colwidth', None)

def pretty_print(df):
    return display( HTML( df.to_html().replace("\\n","<br>") ) )

# Print the top 5 examples by slope. 
def print_top_k_by_slope(df, k: int = 5):
    df = df.copy()
    df = df.sort_values("slope", ascending=False)
    df = df[['test_positive_example.text', 
         'slope', 
         'residual', 
         'logit_diff', 
         'test_positive_token.logit', 
         'test_negative_token.logit'
    ]]
    df = df.drop_duplicates(subset=['test_positive_example.text'])
    pretty_print(df.head(k))

print_top_k_by_slope(df)

A negative logit diff here means that the model was initially going to give the wrong answer, but was able to give the right answer after steering.

In [None]:
sns.scatterplot(data=df, x="slope", y="residual", hue="logit_diff", palette="icefire")

Remarks:
- Here, we want the steering efficiency (slope) to be high and the residual to be low, i.e. the bottom-right corner of the plot is best.
- We observe that the best steering occurs for examples where the logit difference was around 0, i.e. the model was already uncertain.

## What does high residual look like?  

Intuitively, a high residual means that the steering effect is not linear in the multiplier. We visualize logit diff vs multiplier for the examples with the 3 highest and the 3 lowest residuals.
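
To make the slope and residual concrete, here is a minimal sketch on synthetic data (the response curves are made up for illustration and are not taken from the sweep). It fits a line to a perfectly linear response and to a saturating one; `slope_and_residual` is a hypothetical helper that computes the sum of squared residuals directly from the fitted line.

```python
import numpy as np


def slope_and_residual(multipliers, metric):
    """Least-squares line through (multiplier, metric).

    Returns the slope and the sum of squared residuals of the fit.
    """
    x = np.asarray(multipliers, dtype=float)
    y = np.asarray(metric, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = float(np.sum((y - (slope * x + intercept)) ** 2))
    return float(slope), residual


multipliers = np.linspace(-2.0, 2.0, 9)
# Hypothetical response: steering shifts the logit diff linearly.
linear_response = 1.5 * multipliers + 0.3
# Hypothetical response: the effect flattens out at large |multiplier|.
saturating_response = 3.0 * np.tanh(multipliers)

lin_slope, lin_res = slope_and_residual(multipliers, linear_response)
sat_slope, sat_res = slope_and_residual(multipliers, saturating_response)
print(f"linear:     slope={lin_slope:.2f}, residual={lin_res:.3f}")
print(f"saturating: slope={sat_slope:.2f}, residual={sat_res:.3f}")
```

Both curves get a positive slope, but only the saturating one has a residual clearly above zero, which is the situation the plot in this section is meant to surface.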

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

def plot(df):   
    # Select top 3 and bottom 3 examples by residual
    temp_df = df.copy()
    temp_df = temp_df[temp_df['multiplier'] == 0]
    temp_df = temp_df.sort_values("residual", ascending=False)
    # Assert no duplicates
    assert temp_df['test_positive_example.text'].is_unique
    # temp_df = temp_df.drop_duplicates(subset=['test_positive_example.text'])
    top_3 = temp_df.head(3)['test_positive_example.text']
    bottom_3 = temp_df.tail(3)['test_positive_example.text']
    # select middle 3 examples
    # middle_3 = temp_df.iloc[(len(temp_df) // 2 - 1):(len(temp_df) // 2 + 2)]['test_positive_example.text']
    combined = pd.concat([top_3, bottom_3])

    # Create 'group' category for whether the example is top or bottom.
    # Work on a copy so that the caller's DataFrame is not modified.
    df = df.copy()
    df['group'] = 'other'
    df.loc[df['test_positive_example.text'].isin(top_3), 'group'] = 'top'
    # df.loc[df['test_positive_example.text'].isin(middle_3), 'group'] = 'middle'
    df.loc[df['test_positive_example.text'].isin(bottom_3), 'group'] = 'bottom'

    # Filter df by the selected examples
    df = df[df['test_positive_example.text'].isin(combined)]
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    # Plot logit diff. 
    sns.lineplot(data=df, x="multiplier", y="logit_diff", hue="group", ax=ax)

plot(df)

Remarks: 
- Even when the residual is high, the effect still appears to be monotonic.
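
The monotonicity remark is read off the plot by eye. One way to check it programmatically is sketched below, assuming one row per (example, multiplier) pair. The toy DataFrame and its `example` column are hypothetical stand-ins for the results DataFrame and `test_positive_example.text`.

```python
import numpy as np
import pandas as pd


def is_monotonic_increasing(group: pd.DataFrame) -> bool:
    """True if logit_diff never decreases as the multiplier increases."""
    ordered = group.sort_values("multiplier")
    return bool(np.all(np.diff(ordered["logit_diff"].to_numpy()) >= 0))


# Toy data: example "a" is monotonic, example "b" dips at multiplier 0.
toy = pd.DataFrame({
    "example": ["a"] * 3 + ["b"] * 3,
    "multiplier": [-1.0, 0.0, 1.0] * 2,
    "logit_diff": [-2.0, 0.5, 1.0, 0.2, -0.1, 0.4],
})

monotonic = {name: is_monotonic_increasing(g) for name, g in toy.groupby("example")}
print(monotonic)
```

With only three multipliers, as in the debug setting, this check is weak; a denser sweep would make it more informative.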

# 1. Investigating the effect of base propensity

The steering metric doesn't differentiate examples by base propensity.
As a result, we investigate whether steering behaves substantially differently for examples with low and high base propensity.

First, let's plot the base propensity of all examples in the dataset. Here, we use the logit difference at multiplier 0 as a measure of propensity.

In [None]:
# Plot histogram of logit diff
fig, ax = plt.subplots(figsize=(8, 8))
sns.histplot(data=df[df['multiplier'] == 0], x="logit_diff", bins=20, ax=ax)
# Add vertical line at 0
ax.axvline(0, color='black', linestyle='--')

Let's pick the top 3, middle 3, and bottom 3 examples and plot the logit diff vs multiplier. 

In [None]:
# Plot logit diff vs multiplier for examples grouped by base propensity.

import seaborn as sns 
import matplotlib.pyplot as plt

def plot(df):   
    """ Plot the propensity vs multiplier for the top 3, middle 3, and bottom 3 examples by base propensity """
    # Select top 3 and bottom 3 examples by logit diff at zero multiplier
    temp_df = df.copy()
    temp_df = temp_df[temp_df['multiplier'] == 0]
    temp_df = temp_df.sort_values("logit_diff", ascending=False)
    # Assert no duplicates
    assert temp_df['test_positive_example.text'].is_unique
    # temp_df = temp_df.drop_duplicates(subset=['test_positive_example.text'])
    top_3 = temp_df.head(3)['test_positive_example.text']
    bottom_3 = temp_df.tail(3)['test_positive_example.text']
    # select middle 3 examples
    middle_3 = temp_df.iloc[(len(temp_df) // 2 - 1):(len(temp_df) // 2 + 2)]['test_positive_example.text']
    combined = pd.concat([top_3, middle_3, bottom_3])

    # Create 'group' category for whether the example is top, middle or bottom.
    # Work on a copy so that the caller's DataFrame is not modified.
    df = df.copy()
    df['group'] = 'other'
    df.loc[df['test_positive_example.text'].isin(top_3), 'group'] = 'top'
    df.loc[df['test_positive_example.text'].isin(middle_3), 'group'] = 'middle'
    df.loc[df['test_positive_example.text'].isin(bottom_3), 'group'] = 'bottom'

    # Filter df by the selected examples
    df = df[df['test_positive_example.text'].isin(combined)]

    # Print the average slope within group
    print(df.groupby('group')['slope'].mean())
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    # Plot logit diff. 
    sns.lineplot(data=df, x="multiplier", y="logit_diff", hue="group", ax=ax)

plot(df)

Remarks: 
- Steering seems to "work best" when the base propensity is close to 0, reflecting that the model was uncertain.
- Steering does not seem capable of influencing the model's behaviour at very high or very low base propensity.
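
These remarks rest on nine hand-picked examples. A coarse way to summarise the same trend over all examples is to bin the base propensity and average the slope per bin, as sketched below. The data here is synthetic, with the "slope peaks near zero" pattern deliberately built in, so it shows the mechanics only and says nothing about the real sweep. The bin edges at ±2 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for one row per example: base logit diff and fitted slope.
base = rng.normal(0.0, 3.0, size=300)
slope = 2.0 * np.exp(-(base / 2.0) ** 2) + rng.normal(0.0, 0.05, size=300)
toy = pd.DataFrame({"base_logit_diff": base, "slope": slope})

# Bin examples by base propensity (bin edges are arbitrary).
toy["propensity_bin"] = pd.cut(
    toy["base_logit_diff"],
    bins=[-np.inf, -2.0, 2.0, np.inf],
    labels=["low", "near zero", "high"],
)

# Mean slope and number of examples per bin.
summary = toy.groupby("propensity_bin", observed=True)["slope"].agg(["mean", "count"])
print(summary)
```

On the real results, the same grouping could be applied to the rows at multiplier 0, which already carry one `slope` value per example.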

A better way to visualize this is a scatter plot of propensity vs base propensity.

In [None]:
import numpy as np
import seaborn as sns
import seaborn.algorithms
from statsmodels.nonparametric.smoothers_lowess import lowess


def regplot_lowess_ci(data, x, y, ci_level, n_boot, ax, **kwargs):
    x_ = data[x].to_numpy()
    y_ = data[y].to_numpy()
    x_grid = np.linspace(start=x_.min(), stop=x_.max(), num=1000)

    def reg_func(_x, _y):
        return lowess(exog=_x, endog=_y, xvals=x_grid)

    beta_boots = seaborn.algorithms.bootstrap(
        x_, y_,
        func=reg_func,
        n_boot=n_boot,
    )
    err_bands = sns.utils.ci(beta_boots, ci_level, axis=0)
    y_plt = reg_func(x_, y_)

    sns.lineplot(x=x_grid, y=y_plt, ax = ax, **kwargs)
    sns.scatterplot(x=x_, y=y_, ax = ax, **kwargs)
    ax.fill_between(x_grid, *err_bands, alpha=.15, **kwargs)
    return ax

#### Fig 1.1

In [None]:
# Plot propensity and steering efficiency (slope) against base propensity.

import seaborn as sns 
import matplotlib.pyplot as plt

def plot(df):   
    # Get "base propensity", which is the logit_diff at multiplier = 0
    base_propensity = df[df['multiplier'] == 0]
    base_propensity = base_propensity[['test_positive_example.text', 'logit_diff']]
    # Rename logit diff to base logit diff
    base_propensity = base_propensity.rename(columns={'logit_diff': 'base_logit_diff'})
    assert base_propensity['test_positive_example.text'].is_unique
    # Merge the base propensity into df
    df = df.merge(base_propensity, on='test_positive_example.text', how='left')

    # Scatter plot of propensity vs base propensity with hue as multiplier
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    palette = sns.diverging_palette(250, 30, l=65, center="dark", as_cmap=True)
    sns.scatterplot(data=df, y="logit_diff", x="base_logit_diff", hue="multiplier", ax=ax, palette=palette)

    # Plot slope vs base propensity. 
    # We use a nonparametric fit with lowess smoother to visualize the nonlinear relationship
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    regplot_lowess_ci(data=df, x="base_logit_diff", y="slope", ci_level=99, n_boot=100, ax=ax)

    # # Plot nonparametric fit w/ lowess smoother. 
    # sns.regplot(data=df, x="base_logit_diff", y="slope", ax=ax,
    #        lowess=True, line_kws={"color": "C1"})
    
plot(df)

Remarks:
- Steering efficiency seems to be significantly lower when base propensity is high.
- One potential conclusion from this: steering cannot adjust a model's behaviour if it is already highly confident.

Alternative hypotheses which need to be ruled out before making this claim:
- The steering vector extracted from all samples is simply out-of-distribution (OoD) w.r.t. the examples with high base propensity. To investigate this, we need to figure out (i) whether the steering vector for high base propensity examples is significantly different from that for general examples, and (ii) if so, whether steering with the high-propensity vector works on the high-propensity examples.
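
Question (i) could be approached by comparing the two vectors with cosine similarity. The sketch below assumes a steering vector is the mean difference between positive and negative activations, which this notebook does not confirm. The activations are random placeholders and `high_propensity` is a hypothetical mask.

```python
import numpy as np


def mean_difference_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Assumed extraction method: mean positive minus mean negative activation."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
hidden_size = 64

# Placeholder activations sharing one underlying direction plus noise.
direction = rng.normal(size=hidden_size)
pos_acts = rng.normal(size=(200, hidden_size)) + direction
neg_acts = rng.normal(size=(200, hidden_size)) - direction

# Hypothetical mask selecting the high-base-propensity examples.
high_propensity = np.arange(200) < 40

v_all = mean_difference_vector(pos_acts, neg_acts)
v_high = mean_difference_vector(pos_acts[high_propensity], neg_acts[high_propensity])
print(f"cosine similarity: {cosine_similarity(v_all, v_high):.3f}")
```

A similarity close to 1 would argue against the OoD explanation; a low value would motivate step (ii).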

# Conclusion

We set out to answer 3 questions:
- How does the logit difference compare with the average key probability used in previous work on steering vectors?
- Does steering behave substantially differently for examples with low and high base propensity?
- Does it matter that the steering metric doesn't differentiate positive from negative steering performance?


We conclude:
- TODO: compare with average key probability (maybe not so important?)
- Base propensity seems to be an important factor in determining the steerability of an example.
- There is circumstantial evidence (Fig 1.1) that steerability is similar in both directions.
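
The directional question could be tested more directly by fitting separate slopes to the non-positive and non-negative multipliers of each example. The sketch below is one possible approach, not something done in this notebook. `directional_slopes` is a hypothetical helper and the response curve is synthetic.

```python
import numpy as np


def directional_slopes(multipliers, metric):
    """Fit one line to multipliers <= 0 and another to multipliers >= 0.

    Returns (negative_side_slope, positive_side_slope). Each side needs at
    least two distinct multipliers.
    """
    x = np.asarray(multipliers, dtype=float)
    y = np.asarray(metric, dtype=float)
    neg_side = x <= 0
    pos_side = x >= 0
    neg_slope = np.polyfit(x[neg_side], y[neg_side], 1)[0]
    pos_slope = np.polyfit(x[pos_side], y[pos_side], 1)[0]
    return float(neg_slope), float(pos_slope)


# Synthetic example that is easier to steer upwards than downwards.
multipliers = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
metric = np.where(multipliers < 0, 0.5 * multipliers, 2.0 * multipliers)

neg_slope, pos_slope = directional_slopes(multipliers, metric)
print(f"negative side: {neg_slope:.2f}, positive side: {pos_slope:.2f}")
```

With the debug multipliers `[-1.0, 0.0, 1.0]` each side has exactly two points, so the two slopes are well defined but carry no residual information.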

Caveats of this study:
- We only studied one dataset; we need to check whether the trends hold on more data.