# Activation addition in Llama with custom wrappers

This notebook shows how to extract and manipulate the internal activations of a Llama Transformer model. All you need is access to a trained model: either download it locally and update `model_path` accordingly, or access it via Hugging Face with an [authentication token](https://huggingface.co/docs/hub/security-tokens).

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import torch

In [2]:
# modify to your current working directory (the directory that contains this notebook)
cwd = "playing_with_transformers"

# enter your authentication token from huggingface and press enter to access the models
auth_token = input()

In [3]:
# import my modules
import sys
import importlib
# add the modules directory (inside the working directory) to the import path
sys.path.append(os.path.join(cwd, "modules"))
import wrapping

importlib.reload(wrapping)

from wrapping import WrappedModel


In [46]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"device: {device}")

model_name = "llama-7b"
model_path = f"huggyllama/{model_name}"

# load model
model = AutoModelForCausalLM.from_pretrained(model_path, token=auth_token).to(device)
model.eval()

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, token=auth_token)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' 

num_hidden_layers = model.config.num_hidden_layers
hidden_size = model.config.hidden_size

device: cuda



In [48]:
layer_id = 5
# model before wrapping
model.model.layers[layer_id]

LlamaDecoderLayer(
  (self_attn): LlamaAttention(
    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (mlp): LlamaMLP(
    (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
    (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
    (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
    (act_fn): SiLUActivation()
  )
  (input_layernorm): LlamaRMSNorm()
  (post_attention_layernorm): LlamaRMSNorm()
)

In [49]:
# before wrapping, the layer has no `output` attribute
hasattr(model.model.layers[layer_id], "output") 

False

### Wrapping

We want access to internal activations, and wrapping the model provides it: the wrapper class adds this functionality to any layer or block in the model. Calling `WrappedModel(model, tokenizer)` does not wrap any block or layer by itself; that happens only when you call `wrapped_model.wrap_block(layer_id, block_name=block_name)`. To wrap a different block, first call `wrapped_model.unwrap()`, which unwraps the complete model.

In [50]:
layer_id = 5
block_name = "decoder_block"

# WRAP MODEL
# create wrapped model
wrapped_model = WrappedModel(model, tokenizer)
# make sure nothing is wrapped from previous runs
wrapped_model.unwrap()
# wrap the block you want to wrap
wrapped_model.wrap_block(layer_id, block_name=block_name)
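To make the idea of wrapping concrete, here is a minimal sketch of what such a wrapper could look like. It is not the code from `modules/wrapping.py`: the class `ToyWrappedBlock`, its `to_add` attribute, and the toy stack of linear layers are assumptions made for illustration only.

```python
import torch
from torch import nn


class ToyWrappedBlock(nn.Module):
    """Hypothetical stand-in for the notebook's WrappedBlock (not the real code)."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block   # the original layer, kept as a submodule
        self.output = None   # filled on every forward pass
        self.to_add = None   # optional vector to add to the block's output

    def forward(self, *args, **kwargs):
        out = self.block(*args, **kwargs)
        # Some layers return a tuple (hidden states first), others a plain tensor.
        hidden = out[0] if isinstance(out, tuple) else out
        if self.to_add is not None:
            hidden = hidden + self.to_add  # broadcasts over batch and sequence
        self.output = hidden.detach()
        if isinstance(out, tuple):
            return (hidden,) + out[1:]
        return hidden


# Toy "model": a stack of linear layers instead of Llama decoder layers.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

# Wrap: replace one layer in place with a wrapper around it.
layers[1] = ToyWrappedBlock(layers[1])

x = torch.randn(1, 4, 8)
for layer in layers:
    x = layer(x)
print(layers[1].output.shape)  # torch.Size([1, 4, 8])

# Unwrap: put the original layer back.
layers[1] = layers[1].block
```

Because the wrapper keeps the original layer as a submodule, printing a wrapped layer shows the original block nested inside the wrapper.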

In [51]:
# model after wrapping
wrapped_model.model.model.layers[layer_id] 

WrappedBlock(
  (block): LlamaDecoderLayer(
    (self_attn): LlamaAttention(
      (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
      (rotary_emb): LlamaRotaryEmbedding()
    )
    (mlp): LlamaMLP(
      (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
      (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
      (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
      (act_fn): SiLUActivation()
    )
    (input_layernorm): LlamaRMSNorm()
    (post_attention_layernorm): LlamaRMSNorm()
  )
)

In [52]:
# the `output` attribute now exists, but it is still empty because we have not run the model yet
hasattr(model.model.layers[layer_id], "output") 

True

### Getting the internal representations

Once a block is wrapped, running the model saves the output of the selected `layer_id` and `block_name`. This happens in the class `WrappedBlock`, which you can find in `modules/wrapping.py`: it stores the block's output in `WrappedBlock.output`.
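For comparison, PyTorch's built-in forward hooks can capture a module's output without replacing the module. The sketch below is an alternative technique, not what `WrappedBlock` does; the names `captured`, `save_output`, and `toy_block` are illustrative, and a small linear layer stands in for a decoder block.

```python
import torch
from torch import nn

captured = {}


def save_output(module, inputs, output):
    # Some layers return a tuple (hidden states first), others a plain tensor.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["output"] = hidden.detach()


toy_block = nn.Linear(8, 8)  # stand-in for a decoder block
handle = toy_block.register_forward_hook(save_output)

with torch.no_grad():
    toy_block(torch.randn(1, 3, 8))

# Remove the hook so that later forward passes no longer overwrite the capture.
handle.remove()

print(captured["output"].shape)  # torch.Size([1, 3, 8])
```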

In [53]:
prompt1 = "Love"
prompt2 = "Hate"

In [54]:
wrapped_model.run_prompt(prompt1)
love_vec = wrapped_model.get_activations(layer_id, block_name=block_name)
print(f"shape of encoded love_vec: {love_vec.shape}")

shape of encoded love_vec: torch.Size([1, 2, 4096])


In [55]:
wrapped_model.run_prompt(prompt2)
hate_vec = wrapped_model.get_activations(layer_id, block_name=block_name)
print(f"shape of encoded hate_vec: {hate_vec.shape}")

shape of encoded hate_vec: torch.Size([1, 3, 4096])


### Determining a direction

The encoded representations can have different sequence lengths. For example, encoding "Love" gives two tokens: the start-of-sentence token and the token for "Love". "Hate" gives three tokens: the start-of-sentence token and two tokens that together correspond to "Hate".

There are several ways to determine the difference between the two representations.
Let's keep it simple: take the representation of the last token of each prompt and subtract one from the other.

In [57]:
love_direction = love_vec[0, -1] - hate_vec[0, -1]

# normalize
love_direction /= love_direction.norm()

hate_direction = -love_direction
print(f"shape of love_direction: {love_direction.shape}")

shape of love_direction: torch.Size([4096])
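The last-token difference is only one of the possible choices. As a sketch of another, the block below also averages over all tokens after the start-of-sentence token before taking the difference. It uses random toy tensors with a hidden size of 8 instead of real activations, and the names `vec_a`, `vec_b`, and `unit` are illustrative.

```python
import torch

torch.manual_seed(0)
hidden_size = 8

# Toy activations with different sequence lengths: [batch, tokens, hidden].
vec_a = torch.randn(1, 2, hidden_size)
vec_b = torch.randn(1, 3, hidden_size)


def unit(v):
    """Scale a vector to unit length."""
    return v / v.norm()


# Option 1: difference of the last-token representations.
last_token_dir = unit(vec_a[0, -1] - vec_b[0, -1])

# Option 2: difference of the mean over all tokens except the first one
# (position 0 plays the role of the start-of-sentence token here).
mean_dir = unit(vec_a[0, 1:].mean(dim=0) - vec_b[0, 1:].mean(dim=0))

# Both options give a vector of size hidden_size, regardless of prompt length.
print(last_token_dir.shape, mean_dir.shape)
print(f"cosine similarity: {torch.dot(last_token_dir, mean_dir).item():.3f}")
```

Both variants remove the dependence on sequence length. Which one steers generation better is an empirical question that this sketch does not answer.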


### Generating new text with activation addition

Let's apply our direction to a sentence and see how it changes the generated text. We first tell the model which direction to add and then call the generate function. The current implementation (`token_pos=None`) adds the direction to all tokens of the input and to all newly generated tokens.

In [58]:
# set the scaled love direction as the vector to be added
coeff = 10
wrapped_model.set_to_add(layer_id, coeff*love_direction, block_name=block_name)
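The effect of adding a direction to every token position can be illustrated on a toy tensor. The helper `add_direction` below is hypothetical and not part of the `wrapping` module. Its `token_pos=None` branch mirrors the behaviour described above, while the integer `token_pos` branch is only an assumption about what a position-specific variant might do.

```python
import torch


def add_direction(hidden, direction, token_pos=None):
    """Hypothetical helper. hidden: [batch, seq, hidden]; direction: [hidden]."""
    steered = hidden.clone()
    if token_pos is None:
        # Broadcasting adds the same vector at every batch and token position.
        steered += direction
    else:
        # Assumed variant: add the vector at a single token position only.
        steered[:, token_pos] += direction
    return steered


hidden = torch.zeros(1, 4, 8)  # toy activations: 1 prompt, 4 tokens, hidden size 8
direction = torch.ones(8)
coeff = 10

all_pos = add_direction(hidden, coeff * direction)
last_pos = add_direction(hidden, coeff * direction, token_pos=-1)

# First hidden dimension at each of the 4 token positions.
print(all_pos[0, :, 0].tolist())   # [10.0, 10.0, 10.0, 10.0]
print(last_pos[0, :, 0].tolist())  # [0.0, 0.0, 0.0, 10.0]
```

Since the direction is normalized to unit length, `coeff` directly sets the norm of the vector that is added at each position.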

In [59]:
max_new_tokens = 20
sentence = "I think dogs are "

print("Generating love <3\n")
generated = wrapped_model.generate(sentence, max_new_tokens=max_new_tokens)
print(generated[0])

Generating love <3



I think dogs are 100% pure and unconditional love.
I think dogs are 100


In [60]:
wrapped_model.set_to_add(layer_id, coeff*hate_direction, block_name=block_name)
print("Generating hate :(\n")
generated = wrapped_model.generate(sentence, max_new_tokens=max_new_tokens)
print(generated[0])

Generating hate :(



I think dogs are 100% evil.
I think they are the most evil thing on the planet.

