## Steering Vectors: Activation Engineering in GPT-2

This notebook implements **activation steering**, a technique for controlling a model's behavior and predictions by intervening in the residual stream during inference. A good reference I came across for this is [Implementing Activation Steering](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering) on LessWrong.

**Core Concept:**
$$ h_{steered} = h_{original} + \alpha \cdot \vec{v}_{concept} $$

1. **Extract** a "sentiment direction" using contrastive prompt pairs
2. **Visualize** this direction using PCA (building on our previous work)
3. **Intervene** with PyTorch hooks to steer generation towards positive or negative sentiment
4. **Validate** the effect quantitatively
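
The core update rule above can be sketched on toy data. This is only an illustration under stated assumptions, not the notebook's actual intervention: NumPy arrays stand in for GPT-2's residual-stream tensors, the names `steer`, `hidden`, `direction`, and `alpha` are hypothetical, and the same vector is assumed to be added at every token position.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction, broadcast over token positions.

    hidden:    (seq_len, d_model) array standing in for residual-stream activations
    direction: (d_model,) concept vector (hypothetical stand-in for v_concept)
    alpha:     scalar steering strength; a negative value steers the opposite way
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    if hidden.shape[-1] != direction.shape[-1]:
        raise ValueError("direction must have the same size as the hidden dimension")
    # Broadcasting adds the same scaled vector to every row (token position).
    return hidden + alpha * direction

# Toy sizes: 3 token positions, hidden size 4 (GPT-2 small uses 768).
hidden = np.zeros((3, 4))
direction = np.array([1.0, 0.0, -1.0, 0.0])

steered_pos = steer(hidden, direction, alpha=2.0)   # push along the direction
steered_neg = steer(hidden, direction, alpha=-2.0)  # push against it
print(steered_pos[0])
print(steered_neg[0])
```

The sign of `alpha` selects which way to push along the direction, and its magnitude controls how strongly.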

In [1]:
# imports
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

Running on: cuda


In [2]:
# Load pretrained GPT-2 (small) and its tokenizer; eval() disables dropout for inference
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## Data Prep (Contrastive Pairs)

We need pairs of prompts that differ *only* in the target concept (sentiment).

**Critical:** We must ensure both prompts tokenize to the same length so we can subtract their residual streams directly!
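
To see why matching lengths matter, here is a minimal sketch on random stand-in data. It assumes each prompt's activations are stored as a `(seq_len, d_model)` array. The helper name `pair_difference` and the choice to average the difference over token positions are illustrative assumptions, not necessarily what the later cells do.

```python
import numpy as np

def pair_difference(act_pos, act_neg):
    """Position-wise difference between two activation arrays of identical shape."""
    if act_pos.shape != act_neg.shape:
        # An explicit check is safer than relying on NumPy: some shape
        # mismatches would broadcast silently instead of raising an error.
        raise ValueError(f"shape mismatch: {act_pos.shape} vs {act_neg.shape}")
    return act_pos - act_neg

rng = np.random.default_rng(0)
d_model = 8  # toy hidden size (GPT-2 small uses 768)

# Stand-ins for the activations of one 6-token positive/negative pair.
act_pos = rng.normal(size=(6, d_model))
act_neg = rng.normal(size=(6, d_model))

diff = pair_difference(act_pos, act_neg)  # shape (6, d_model)
direction = diff.mean(axis=0)             # assumption: average over positions

# A pair with different token counts cannot be subtracted position by position.
act_short = rng.normal(size=(5, d_model))
try:
    pair_difference(act_pos, act_short)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(diff.shape, direction.shape, mismatch_rejected)
```

When lengths match, every row of the difference compares the same token position in both prompts, so the positions where the prompts share identical tokens differ only through the contrasted word.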

In [3]:
def get_sentiment_pairs():
    """
    Returns list of (positive, negative) prompt tuples.

    Ref: https://www.kaggle.com/code/shakka/sentiment-analysis-using-contrastive-learning
    """

    return [
        ("I think this movie is amazing", "I think this movie is terrible"),
        ("The food at this restaurant is delicious", "The food at this restaurant is disgusting"),
        ("I am feeling very happy today", "I am feeling very sad today"),
        ("The product quality is excellent", "The product quality is awful"),
        ("My experience was wonderful", "My experience was horrible"),
        ("He is a very kind person", "He is a very mean person"),
        ("The weather is beautiful", "The weather is nasty"),
        ("This solution is perfect", "This solution is useless")
    ]

In [4]:
pairs = get_sentiment_pairs()

In [5]:
# quick verification on token lengths
print(f"{'Positive':<40} | {'Negative':<40} | {'Match?'}")
print("-"*90)
valid_pairs = []
for pos, neg in pairs:
    tok_pos = tokenizer(pos, return_tensors='pt')['input_ids'][0]
    tok_neg = tokenizer(neg, return_tensors='pt')['input_ids'][0]
    match = len(tok_pos) == len(tok_neg)
    print(f"{pos:<40} | {neg:<40} | {match} ({len(tok_pos)} vs {len(tok_neg)})")
    if match:
        valid_pairs.append((pos, neg))

print(f"\nUsing {len(valid_pairs)} valid contrastive pairs")

Positive                                 | Negative                                 | Match?
------------------------------------------------------------------------------------------
I think this movie is amazing            | I think this movie is terrible           | True (6 vs 6)
The food at this restaurant is delicious | The food at this restaurant is disgusting | True (7 vs 7)
I am feeling very happy today            | I am feeling very sad today              | True (6 vs 6)
The product quality is excellent         | The product quality is awful             | True (5 vs 5)
My experience was wonderful              | My experience was horrible               | True (4 vs 4)
He is a very kind person                 | He is a very mean person                 | True (6 vs 6)
The weather is beautiful                 | The weather is nasty                     | True (4 vs 4)
This solution is perfect                 | This solution is useless                 | True (4 vs 4)

Using 8 valid 