## Steering Vectors: Activation Engineering in GPT-2

This notebook implements **activation steering**, a technique for controlling a model's behavior and predictions by adding a vector to the residual stream during inference. A good reference I came across is [this LessWrong post](https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering).

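**Core Concept:**
$$ h_{\text{steered}} = h_{\text{original}} + \alpha \cdot \vec{v}_{\text{concept}} $$

Here $h_{\text{original}}$ is the residual-stream activation at the chosen layer, $\vec{v}_{\text{concept}}$ is the direction associated with the concept (sentiment, in this notebook), and $\alpha$ is a scalar that sets the strength and sign of the intervention.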
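
To make the update rule concrete, here is a minimal sketch on made-up tensors. The shapes, the random concept vector, and the value of `alpha` are all assumptions for illustration; nothing here comes from GPT-2 yet.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: batch of 1, 4 token positions, hidden size 8.
h_original = torch.randn(1, 4, 8)

# Hypothetical concept vector, scaled to unit length so alpha alone sets the step size.
v_concept = torch.randn(8)
v_concept = v_concept / v_concept.norm()

alpha = 5.0  # steering strength (assumed value)

# Broadcasting adds the same vector at every token position.
h_steered = h_original + alpha * v_concept

# The component along the concept direction moves by alpha at each position.
shift = (h_steered - h_original) @ v_concept
print(shift)
```

Because `v_concept` has unit norm, a negative `alpha` pushes the activation the same distance in the opposite direction.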

The notebook proceeds in four steps:

1. **Extract** a "sentiment direction" using contrastive prompt pairs
2. **Visualize** this direction using PCA (building on our previous work)
3. **Intervene** with PyTorch hooks to steer generation towards positive or negative sentiment
4. **Validate** the effect quantitatively
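
Steps 1 and 3 can be previewed on a toy module before touching GPT-2. This is only a sketch under stated assumptions: the `nn.Linear` layer stands in for a transformer block, the activations are random, the direction is taken as a difference of means (one common choice, not necessarily the one used later), and `make_steering_hook` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 8

# Toy stand-in for one transformer block (hypothetical; not GPT-2).
layer = nn.Linear(hidden, hidden)

# Step 1 (extract): difference of mean activations between two contrastive sets.
pos_acts = torch.randn(16, hidden) + 1.0  # made-up "positive" activations
neg_acts = torch.randn(16, hidden) - 1.0  # made-up "negative" activations
direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
direction = direction / direction.norm()

def make_steering_hook(vector, alpha):
    """Build a forward hook that adds alpha * vector to the module output."""
    def hook(module, inputs, output):
        # A non-None return value from a forward hook replaces the output.
        # Some modules return a tuple whose first element is the hidden state.
        if isinstance(output, tuple):
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector
    return hook

# Step 3 (intervene): compare outputs with and without the hook attached.
x = torch.randn(1, 4, hidden)
with torch.no_grad():
    baseline = layer(x)
    handle = layer.register_forward_hook(make_steering_hook(direction, alpha=3.0))
    steered = layer(x)
    handle.remove()  # always detach the hook so later calls are unaffected
    restored = layer(x)
```

Keeping the handle returned by `register_forward_hook` and calling `remove()` matters in a notebook, since a forgotten hook silently keeps steering every later forward pass.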

In [None]:
# imports
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

In [None]:
# Load the pretrained GPT-2 (small) model and its tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout