# trying to steer using TransformerLens

april 26 

In [1]:
from transformer_lens import HookedTransformer
import transformer_lens
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import transformer_lens.utils as utils
import hashlib
import yaml
import pickle
import numpy as np
import matplotlib.pyplot as plt



In [2]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7429c47aac80>

In [3]:
device = utils.get_device()

reference_model_path = 'meta-llama/Llama-3.1-8B'
baseline_model_path = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

baseline_model_hf = AutoModelForCausalLM.from_pretrained(baseline_model_path, torch_dtype=torch.bfloat16)
baseline_model_tokenizer = AutoTokenizer.from_pretrained(baseline_model_path)

model = HookedTransformer.from_pretrained_no_processing(
    reference_model_path,
    hf_model=baseline_model_hf,
    tokenizer=baseline_model_tokenizer,
    device=device,
    move_to_device=True,
)


Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.20it/s]


Loaded pretrained model meta-llama/Llama-3.1-8B into HookedTransformer


In [4]:
weight_vectors = np.load('layer_analyses/combined_delta_vectors.npy')

In [5]:
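def produce_layer_resid_pre_steering_hook(steering_vector, scale=1.0):
    """Build a hook that adds scale * steering_vector to the last position of resid_pre."""
    def resid_pre_hook(value, hook):
        # value: [batch, pos, d_model]. Only the final position is steered;
        # .to(value) matches the activation's device and dtype.
        value[:, -1, :] += steering_vector.to(value) * scale
        return value

    return resid_pre_hook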
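A quick shape check of what the hook does, as a toy sketch with NumPy arrays standing in for torch tensors (the function name `steer_last_position` and the sizes are made up for illustration). It shows that only the last position changes and that the edit happens in place. If generation runs with a KV cache, which I believe is the default for `HookedTransformer.generate`, then after the first forward pass `value` holds only the newest token, so "last position" means every generated token plus the final prompt token.

```python
import numpy as np

def steer_last_position(value, steering_vector, scale=1.0):
    # value: [batch, pos, d_model]; steering_vector: [1, d_model]
    # Broadcasts the vector over the batch and edits only the last position.
    value[:, -1, :] += steering_vector * scale
    return value

batch, pos, d_model = 2, 3, 4          # toy sizes, not the real model's
value = np.zeros((batch, pos, d_model))
vec = np.ones((1, d_model))

out = steer_last_position(value, vec, scale=-0.5)

print(out[0])          # earlier positions stay zero, last row is shifted
print(out is value)    # the activation is modified in place
```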

In [6]:
def set_steering_vectors(
    model,
    target_layer_indices,
    weight_vectors,
    scale=1.0,
):
    """
    Add steering vectors to resid_pre at the specified layer indices.
    Clears any hooks already on the model first (model.reset_hooks()).

    Args:
        model (HookedTransformer): The model to modify.
        target_layer_indices (list of int): The layer indices to add steering vectors to.
        weight_vectors (sequence of torch.Tensor): Steering vectors, indexed by block index.
        scale (float): The scaling factor for the steering vectors.
    """
    model.reset_hooks()
    for block_idx, block in enumerate(model.blocks):
        if block_idx in target_layer_indices:
            # Get the steering vector for this block
            steering_vector = weight_vectors[block_idx]
            # Create the hook function
            resid_pre_hook = produce_layer_resid_pre_steering_hook(steering_vector, scale)
            # Add the hook to the model
            model.add_hook(
                utils.get_act_name('resid_pre', block_idx),
                resid_pre_hook
            )


In [7]:
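# Index 0 is a None placeholder, so weight_vectors[i] sits at steering_vectors[i + 1]; each vector is negated.
steering_vectors = [None] + [-torch.tensor(w[None, :], device=device) for w in weight_vectors]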

In [8]:
steering_vectors[3].shape

torch.Size([1, 4096])

In [9]:
set_steering_vectors(
    model,
    target_layer_indices=[1, 2, 3, 4],
    weight_vectors=steering_vectors,
    scale=-1,
)
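Worth spelling out the bookkeeping here, since there are two sign flips and an index shift. A toy sketch (random stand-in for `combined_delta_vectors.npy`, made-up sizes): the list is built with negated vectors, then `scale=-1` negates again, so what is actually added at block `k` is `+weight_vectors[k - 1]`. Block 0 maps to `None`, so targeting layer 0 with this list would fail inside the hook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vectors, d_model = 4, 8                      # toy sizes
weight_vectors = rng.normal(size=(n_vectors, d_model))

# Same construction as the notebook cell, with NumPy instead of torch.
steering_vectors = [None] + [-w[None, :] for w in weight_vectors]

scale = -1
layer = 3
added = steering_vectors[layer] * scale        # what the hook adds at block 3

# Net effect: the un-negated delta vector one index earlier.
print(np.allclose(added, weight_vectors[layer - 1][None, :]))
print(steering_vectors[0] is None)
```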

In [10]:
chat = [
  {"role": "user", "content": "What's the fifth prime?"},
]

prompt = model.tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)


In [11]:
test_output = model.generate(
    prompt,
    max_new_tokens=256,
    do_sample=False,
)

  0%|          | 0/256 [00:00<?, ?it/s]

 68%|██████▊   | 173/256 [00:12<00:06, 13.39it/s]


In [12]:
print(test_output)

<｜User｜>What's the fifth prime?<｜Assistant｜><think>
To determine the fifth prime number, I'll start by listing the prime numbers in order.

First, I know that 2 is the smallest and only even prime number.

Next, 3 is the second prime number.

Then, 5 comes after 3 as the third prime.

Following that, 7 is the fourth prime number.

Finally, 11 is the fifth prime number.
</think>

To find the **fifth prime number**, let's list the prime numbers in order until we reach the fifth one.

1. **2** - The first prime number.
2. **3** - The second prime number.
3. **5** - The third prime number.
4. **7** - The fourth prime number.
5. **11** - The fifth prime number.

So, the fifth prime number is \(\boxed{11}\).


# Experiments 

**initial model**
there exist heads which toggle thinking early on 

these signals 



1) which layer is most important? 

2) try steering with the OV circuit of head (0, 22), i.e. layer 0, head 22, applied to the <think> token embedding

In [17]:
model.W_O.shape

torch.Size([32, 32, 128, 4096])

In [18]:
target_o = model.W_O[0, 22]

In [36]:
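# OV circuit of layer 0, head 22 applied to the embedding of token 128013
# (taken to be <think>, per the note above); skips the pre-attention norm.
new_steering_vector = model.W_E[128013] @ model.W_V[0, 22] @ model.W_O[0, 22]

# Normalise to unit length so `scale` sets the magnitude directly.
new_steering_vector /= torch.linalg.norm(new_steering_vector)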
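For reference, the same computation on toy matrices, to make the shapes explicit. Everything here is random and the sizes and `token_id` are placeholders; it is only a sketch of the linear map, embedding → value projection → output projection. Like the cell above, it ignores the norm applied before attention and any biases, so it approximates what the head would write into the residual stream when attending to that token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model, d_head = 10, 16, 4     # toy sizes, not Llama's
W_E = rng.normal(size=(d_vocab, d_model))  # token embeddings
W_V = rng.normal(size=(d_model, d_head))   # one head's value projection
W_O = rng.normal(size=(d_head, d_model))   # the same head's output projection

token_id = 7                                # placeholder for the <think> id

# [d_model] @ [d_model, d_head] @ [d_head, d_model] -> [d_model]
ov_vector = W_E[token_id] @ W_V @ W_O
ov_unit = ov_vector / np.linalg.norm(ov_vector)

print(ov_unit.shape, np.linalg.norm(ov_unit))
```

Because the result passes through a `d_head`-dimensional bottleneck, every head's OV output lives in a subspace of at most `d_head` dimensions of the residual stream.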

In [37]:
# Reuse the same unit vector for every layer index.
new_steering_vectors = [new_steering_vector.reshape((1, 4096)) for _ in range(32)]

tensor([0.0001], device='cuda:0')

In [50]:
set_steering_vectors(
    model,
    target_layer_indices=[0],
    weight_vectors=new_steering_vectors,
    scale=-.5,
)

In [51]:
test_output = model.generate(
    prompt,
    max_new_tokens=256,
    do_sample=False,
)

  0%|          | 0/256 [00:00<?, ?it/s]

100%|██████████| 256/256 [00:18<00:00, 13.66it/s]


In [52]:
print(test_output)

<｜User｜>What's the fifth prime?<｜Assistant｜><think>
Alright, so I need to figure out what the fifth prime number is. Hmm, primes are numbers greater than 1 that have no positive divisors other than 1 and themselves. Let me start by recalling the order of prime numbers. I think the first few primes are 2, 3, 5, 7, and then... hmm, is it 11? Wait, let me make sure I'm not making a mistake here. 

Okay, starting from the smallest prime, which is 2. Yeah, 2 is definitely prime because it's only divisible by1 and itself. Then comes3. Is3 prime? Yeah, because it's not divisible by2 or any other number except1 and3. Next is4, but wait,4 is divisible by2, so that's not prime. Then5, which is prime because it's only divisible by1 and5. 

Now, moving on,6 is divisible by2 and3, so not prime.7 comes next. Is7 prime? Let me check. 7 divided by2? No, it's 3.5? No, because7 divided by5 is1.2? No, same as above. So, yeah,7 is prime. 

So


Q: can we use one of the LDA directions to steer think vs non-think?

LDA is think vs non-think in later layers

and less significant components control turning this on early on

---------

how does stop-think get triggered?

difference-in-means vector vs LDA direction
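A small sketch of how the two candidate directions differ, on synthetic data (two Gaussian classes with a shared, non-isotropic covariance; none of this is the real activation data). The difference in means is just `mu_1 - mu_0`; the two-class LDA direction reweights it by the inverse within-class scatter, so the two agree only when the within-class covariance is close to isotropic.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                         # toy dimension
A = rng.normal(size=(d, d))
cov = A @ A.T + np.eye(d)                     # shared, correlated covariance

# Synthetic stand-ins for "think" and "non-think" activations.
X_think = rng.multivariate_normal(np.zeros(d), cov, size=500)
X_nonthink = rng.multivariate_normal(rng.normal(size=d), cov, size=500)

# Direction 1: difference in class means.
mean_diff = X_nonthink.mean(axis=0) - X_think.mean(axis=0)

# Direction 2: two-class LDA, S_w^{-1} (mu_1 - mu_0), with S_w the
# within-class scatter (here the sum of the two class covariances).
S_w = np.cov(X_think, rowvar=False) + np.cov(X_nonthink, rowvar=False)
lda_dir = np.linalg.solve(S_w, mean_diff)

cosine = mean_diff @ lda_dir / (np.linalg.norm(mean_diff) * np.linalg.norm(lda_dir))
print(f"cosine(mean_diff, lda_dir) = {cosine:.3f}")
```

The cosine is always positive (since `S_w` is positive definite) but can be well below 1, which is one reason steering with the two directions might behave differently.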



# ideas 

 - when we injected ov(0, 22).think at layer 0, the network thought, tried to end thinking, and did so repeatedly  

what if we look at the logit of </think> after adding in different ov.think vectors from different heads?

which one contributed something that's most like "think"
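One cheap way to start on this, sketched with random toy weights: take each head's unit-norm OV output for the think token and project it straight onto the unembedding column for the end-think token. This is only the direct path (no later layers, no final norm), so it is an assumption-heavy first pass rather than the full "run the model and read the logit" experiment. All sizes and the token ids `think_id` / `end_think_id` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_vocab, d_model, d_head = 4, 12, 16, 4    # toy sizes
W_E = rng.normal(size=(d_vocab, d_model))           # embeddings
W_V = rng.normal(size=(n_heads, d_model, d_head))   # per-head value projections
W_O = rng.normal(size=(n_heads, d_head, d_model))   # per-head output projections
W_U = rng.normal(size=(d_model, d_vocab))           # unembedding

think_id, end_think_id = 3, 4                       # placeholder token ids

# Per-head OV output for the think embedding: [n_heads, d_model]
ov = np.einsum("d,hdk,hkm->hm", W_E[think_id], W_V, W_O)
ov /= np.linalg.norm(ov, axis=-1, keepdims=True)    # unit norm per head

# Direct contribution of each head's vector to the end-think logit.
logit_contrib = ov @ W_U[:, end_think_id]           # [n_heads]

# Heads sorted from most to least end-think-promoting.
ranking = np.argsort(-logit_contrib)
print(ranking, logit_contrib[ranking])
```

Swapping `end_think_id` for `think_id` in the last projection would give the analogous ranking for "most like think".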

- by what mechanism does the model try to activate </think>?

once it's finished thinking, it tries to stop

what do the 

In [None]:
<\think>