# Activation addition in Llama with custom wrappers

This notebook shows how to extract and manipulate the internal activations of a Llama Transformer model. All you need is access to a trained model: either download it locally and update `model_path` accordingly, or access it via Hugging Face with an [authentication token](https://huggingface.co/docs/hub/security-tokens). A GPU is strongly recommended.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

In [2]:
# import my modules
import sys
import importlib
# add the local "modules" directory (under the current working directory) to the import path
sys.path.append(os.path.join(os.getcwd(), "modules"))
import wrapping
import utils

importlib.reload(wrapping)
importlib.reload(utils)

from wrapping import WrappedModel
from utils import *


In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"device: {device}")

model_name = "llama-7b"
model_path = f"huggyllama/{model_name}"

# load model
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.eval()

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' 

num_hidden_layers = model.config.num_hidden_layers
hidden_size = model.config.hidden_size

device: cuda



# Activation addition

Modify the code as you want. Important parameters are:

* `layer_ids`: list of layers at which activation addition is performed
* `block_names`: list of blocks at which activation addition is performed. Each entry must be one of `["decoder_block", "self_attn", "mlp", "input_layernorm", "post_attention_layernorm"]`
* `sentences`: list of strings that are modified with internal activations
* `prompt1`: first string to encode
* `prompt2`: second string to encode
* `coeff`: coefficient with which to multiply the normalized difference between `prompt1` and `prompt2`
* `token_pos`: list of token positions where the difference is applied (if set to `None`, it is applied to every token)
* `max_new_tokens`: how many new tokens to generate
* `textfile`: a text file to save results to (you can generate a timestamped one with `make_new_file()`)

Activation addition proceeds in three steps. First, `get_difference` calculates the internal representations of `prompt1` and `prompt2` (at the layers and blocks determined when wrapping the model) and takes their difference. Second, `normalize_directions` scales this difference to unit norm; the normalized difference is the activation that is added. Third, `activation_editing` loops through the given `sentences` and adds the normalized difference multiplied by `coeff` to the internal activations of each sentence at `token_pos`, generating new text.
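
The arithmetic described above can be sketched without loading a model. The snippet below is a minimal illustration, not the code in `utils`: plain Python lists stand in for activation tensors, the numbers are made up, and the helper names `unit_difference` and `add_direction` are hypothetical.

```python
import math


def unit_difference(act1, act2):
    """Return (act1 - act2) scaled to unit L2 norm."""
    diff = [a - b for a, b in zip(act1, act2)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("activations are identical; there is no direction to add")
    return [d / norm for d in diff]


def add_direction(hidden, direction, coeff, token_pos=None):
    """Add coeff * direction to the rows of `hidden` selected by token_pos.

    hidden:    list of per-token activation vectors (seq_len x hidden_size)
    token_pos: list of token indices, or None to modify every token
    """
    positions = range(len(hidden)) if token_pos is None else token_pos
    edited = [list(row) for row in hidden]  # copy, so the input stays untouched
    for pos in positions:
        edited[pos] = [h + coeff * d for h, d in zip(edited[pos], direction)]
    return edited


# Toy activations with hidden_size = 2.
direction = unit_difference([3.0, 0.0], [0.0, 4.0])  # difference [3, -4] has norm 5
hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]        # three tokens

steered_all = add_direction(hidden, direction, coeff=10)                   # every token
steered_last = add_direction(hidden, direction, coeff=10, token_pos=[-1])  # last token only
```

Because the direction has unit norm, `coeff` alone sets how far each selected token's activation is shifted.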

In [4]:
# sentences:      list of strings that are modified with internal activations
# prompt1:        first string to encode
# prompt2:        2nd string to encode
# coeff:          coefficient with which to multiply normalized difference between prompt1 and prompt2
# token_pos:      list of token positions where the difference is applied 
#                 (if set to None it is applied to every token)
# max_new_tokens: how many new tokens to generate
# textfile:       a textfile to save results to (you can generate a timestamped one with make_new_file())
# layer_ids:      layers at which activation addition is performed
# block_names:    blocks at which activation addition is performed. 
#                 must be in ["decoder_block", "self_attn", "mlp", "input_layernorm", "post_attention_layernorm"]

sentences = [
    "I think dogs are ", 
    "I think cats are ", 
    "I hate you because",
    "I love you because",
    "I think of you as",    
    "I think you're",
    "Today I feel",
]

prompt1 = "Love"
prompt2 = "Hate"
coeff = 10
token_pos = None
max_new_tokens = 20
textfile = None
layer_ids = [5]
block_names = ["decoder_block"]

# WRAP MODEL
# create wrapped model
wrapped_model = WrappedModel(model, tokenizer)
# make sure nothing is wrapped from previous runs
wrapped_model.unwrap()

for block_name in block_names:
    for layer_id in layer_ids:
        wrapped_model.wrap_block(layer_id, block_name=block_name)
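
To illustrate what wrapping a block means conceptually, here is a toy sketch. It is not the `WrappedModel` class from the `wrapping` module, whose internals are not shown in this notebook. The `ToyBlockWrapper` class and its attributes are hypothetical, and a plain Python function stands in for a Transformer block.

```python
class ToyBlockWrapper:
    """Wrap a callable block: record its output and optionally shift it."""

    def __init__(self, block):
        self.block = block
        self.last_output = None  # most recent unmodified output of the block
        self.to_add = None       # vector to add, or None for a plain pass-through

    def __call__(self, x):
        out = self.block(x)
        self.last_output = list(out)  # keep a copy for later inspection
        if self.to_add is not None:
            out = [o + a for o, a in zip(out, self.to_add)]
        return out

    def reset(self):
        """Forget the recorded output and stop modifying activations."""
        self.last_output = None
        self.to_add = None


def double(x):
    """Stand-in for a model block."""
    return [2.0 * v for v in x]


wrapped = ToyBlockWrapper(double)
plain = wrapped([1.0, 2.0])      # pass-through: [2.0, 4.0]
wrapped.to_add = [0.5, -0.5]
steered = wrapped([1.0, 2.0])    # shifted output: [2.5, 3.5]
recorded = wrapped.last_output   # still the unmodified output: [2.0, 4.0]
```

The same object serves both purposes needed here: reading activations (to compute a difference between two prompts) and writing to them (to add that difference during generation).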

In [5]:
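def normalize_directions(directions):
    """Scale each direction vector to unit L2 norm (in place).

    Relies on the global `block_names` and `layer_ids` defined above.
    """
    for block_name in block_names:
        for layer_id in layer_ids:
            directions[block_name][layer_id] /= directions[block_name][layer_id].norm()

    return directions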

In [6]:
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)

In [7]:
print(directions[block_names[0]][layer_ids[0]])
print(directions[block_names[0]][layer_ids[0]].norm())

tensor([-0.1875,  0.2435, -0.2504,  ...,  0.0443,  0.0717, -0.0887],
       device='cuda:0')
tensor(16.2701, device='cuda:0')
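
The norm printed above (about 16.27) is the length of the raw difference vector. After normalization, the vector has length 1, so the vector that is actually added has length `coeff`. The sketch below checks this on toy numbers. It assumes `directions` is a nested mapping indexed as `directions[block_name][layer_id]`, as in the cells above; the lists standing in for tensors and the helper names are hypothetical. Unlike `normalize_directions`, it returns a copy instead of modifying its input.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def normalized_copy(directions):
    """Return a new {block_name: {layer_id: vector}} mapping with unit-norm vectors."""
    return {
        block: {layer: [v / l2_norm(vec) for v in vec] for layer, vec in layers.items()}
        for block, layers in directions.items()
    }


toy_directions = {"decoder_block": {5: [3.0, 4.0, 12.0]}}  # raw norm is 13
unit = normalized_copy(toy_directions)

coeff = 10
added = [coeff * v for v in unit["decoder_block"][5]]

print(l2_norm(toy_directions["decoder_block"][5]))  # raw length
print(l2_norm(unit["decoder_block"][5]))            # close to 1 after normalization
print(l2_norm(added))                               # close to coeff
```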


In [8]:
printout = True
calc_neutral = True

directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)

No activation addition



I think dogs are 100% better than cats.
I think dogs are 100% better

I think cats are 100% evil.
I think cats are 100% evil. I

I hate you because you are so beautiful.
I hate you because you are so beautiful. I hate you because you

I love you because you are my friend.
I love you because you are my friend. I love you because you

I think of you as a friend, and I hope you will be one of mine.
I am a 20

I think you're right. I'm not sure what the problem is, but I'm going to try to

Today I feel like I’m in a bit of a rut. I’m not sure what to write

------------------------------
Positive activation addition

I think dogs are 100% pure and unconditional love.
I think dogs are 100

I think cats are 100% pure evil.
I think cats are 100% pure evil

I hate you because you are my best friend.
I love you because you are my best friend.
I love

I love you because you are my best friend.
I love you because you are my best friend. I love you

I think of you as a friend, a brother, a son, a 

In [9]:
prompt1 = "happy"
prompt2 = "depressed"
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)

No activation addition



I think dogs are 100% better than cats.
I think dogs are 100% better

I think cats are 100% evil.
I think cats are 100% evil. I

I hate you because you are so beautiful.
I hate you because you are so beautiful. I hate you because you

I love you because you are my friend.
I love you because you are my friend. I love you because you

I think of you as a friend, and I hope you will be one of mine.
I am a 20

I think you're right. I'm not sure what the problem is, but I'm going to try to

Today I feel like I’m in a bit of a rut. I’m not sure what to write

------------------------------
Positive activation addition

I think dogs are 100% purebred.
I think dogs are 100% purebre

I think cats are 100% right.
I think cats are 100% right.


I hate you because you are a liar.
I hate you because you are a thief.
I hate

I love you because you are my friend.
I love you because you are my friend.
I love you because

I think of you as a friend, a friend who is always there for me.
I think of you a

In [10]:
prompt1 = "I am happy"
prompt2 = "I am angry"
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)

No activation addition

I think dogs are 100% better than cats.
I think dogs are 100% better

I think cats are 100% evil.
I think cats are 100% evil. I

I hate you because you are so beautiful.
I hate you because you are so beautiful. I hate you because you

I love you because you are my friend.
I love you because you are my friend. I love you because you

I think of you as a friend, and I hope you will be one of mine.
I am a 20

I think you're right. I'm not sure what the problem is, but I'm going to try to

Today I feel like I’m in a bit of a rut. I’m not sure what to write

------------------------------
Positive activation addition

I think dogs are 100% better than cats.
I think dogs are 100% better

I think cats are 100% better than dogs.
I think cats are 100% better

I hate you because you are so happy.
I love you because you are so happy.
I love you because

I love you because you are so beautiful.
I love you because you are so beautiful.
I love you because

I think of you as a

# Quantitative evaluation with sentiment analysis

Idea: have the model continue some sentences, then run a sentiment analysis on the generated text using the model itself. The results with this approach seem a bit more promising, but are still far from obviously working.

In [11]:
stuff = [
    "dogs", "cats", "my neighbour", "my wife", "my kids", "my teacher",
    "myself", "my childhood", "my friends", "my colleagues", 
    "my parents", "my cousin", "my pet", "my school", "my job",
    "my car", "my house", "my city", "my country", "my hobbies", 
    "my garden", "my vacation", "my past", "my future", "my dreams",
    "my boss", "my room", "my computer", "my phone", "my kitchen",
    "my books", "my music", "my films", "my clothes", "my photographs",
    "my ideas", "my projects", "my goals", "my achievements", "my failures",
    "my holidays", "my adventures", "my experiences", "my challenges", "my victories",
    "my secrets", "my fears", "my hopes", "my regrets", "my sorrows",
    "my joys", "my food", "my studies", "my sport", "my team",
    "my morning routine", "my evening routine", "my workout", "my diet", "my coach",
    "my first love", "my ex", "my current partner", "my childhood friend", "my classmate",
    "my professor", "a popular author", "a famous actor", "a renowned artist", "my role model",
    "a catchy song", "a blockbuster movie", "a popular TV show", "a best-selling book", "a trending game",
    "an interesting place", "my hometown", "a local restaurant", "a common dish", "a refreshing drink",
    "a tasty dessert", "a beautiful season", "a vivid memory", "a famous quote", "a national holiday",
    "my weekend", "my daily routine", "my free time", "my lifestyle", "my decision",
    "my journey", "my path", "my destination", "a common animal", "my childhood home",
    "a vibrant color", "a casual outfit", "my collection", "a classic toy", "a funny joke"
]



In [12]:
sentences = [f"When thinking about {s} I feel" for s in stuff]
max_new_tokens = 20
instruction = " On a scale of 1 to 5, where 1 is very negative and 5 is very positive, the previous text would get this amount of points: "
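
The self-judging step can be pictured as follows: the instruction is appended to each generated text, the model continues the combined prompt, and a number is read from that continuation. The sketch below shows one possible way to do the prompt building and score parsing. It is an assumption for illustration, not the implementation of `judge_generations` or `save_judgements`, which live in `utils` and are not shown here; all three helper names are hypothetical.

```python
import math
import re

instruction = " On a scale of 1 to 5, where 1 is very negative and 5 is very positive, the previous text would get this amount of points: "


def build_judge_prompt(generation, instruction):
    """Append the rating instruction to a generated text."""
    return generation.strip() + instruction


def parse_score(reply):
    """Return the first digit from 1 to 5 in the judge's reply, or None.

    `reply` must contain only the newly generated tokens, because the
    instruction itself contains the digits 1 and 5.
    """
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None


def mean_score(replies):
    """Average the parseable scores; NaN if there are none."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else math.nan


judge_prompt = build_judge_prompt("When thinking about dogs I feel happy.", instruction)
average = mean_score(["4", "5 points", "no idea"])  # unparseable replies are skipped
empty_average = mean_score([])                      # NaN when nothing was judged
```

Returning NaN for an empty list is a design choice that keeps the summary printable even when one condition (for example the neutral one) was not generated.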

In [13]:
len(sentences)

100

In [14]:
coeff = 10
file_path = "results/activation_addition/naive"
id_str = f"{layer_ids[0]}-{layer_ids[-1]}"
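# use block_names[0] explicitly instead of the leftover loop variable `block_name` from the wrapping cell
textfile = make_new_file(path=file_path, filename=f"naive_aa_{model_name}_{id_str}_{block_names[0]}")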

with open(textfile, "w") as f:
    f.write(f"Quantitative evaluation on {len(sentences)} sentences\n\n")
    f.write(f"sample sentence: {sentences[0]}\n")
    f.write("Continuing the sentence, then doing sentiment analysis on the results.\n")
    f.write(f"Instruction for judgement: {instruction}.\n\n")
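
For readers without the `utils` module, a helper that creates a timestamped results file could look roughly like the sketch below. This is an assumption about what such a helper might do, not the actual `make_new_file`; the name `make_timestamped_file`, its `ext` parameter, and the timestamp format are hypothetical.

```python
import os
import tempfile
from datetime import datetime


def make_timestamped_file(path, filename, ext=".txt"):
    """Create an empty file named <filename>_<timestamp><ext> under `path`."""
    os.makedirs(path, exist_ok=True)  # create the results folder if it is missing
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    full_path = os.path.join(path, f"{filename}_{stamp}{ext}")
    with open(full_path, "w"):
        pass  # just create the file; content is written later
    return full_path


# Demo in a temporary directory so nothing is written to the project folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = make_timestamped_file(os.path.join(tmp, "results"), "naive_aa_demo")
    demo_exists = os.path.isfile(demo_file)
    demo_name = os.path.basename(demo_file)
```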

In [15]:
prompt1 = "I feel happy"
prompt2 = "I feel angry"
printout = False
calc_neutral = True

config_to_textfile(textfile, prompt1, prompt2, coeff, layer_ids[0], block_names[0], token_pos, max_new_tokens)
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)
judgements = judge_generations(wrapped_model, generations, instruction)
judgements = save_judgements(judgements, textfile=textfile)

No activation addition



  0%|          | 0/3 [00:00<?, ?it/s]

4it [00:06,  1.69s/it]                       


------------------------------
Positive activation addition



4it [00:05,  1.42s/it]                       


------------------------------
Negative activation addition



4it [00:05,  1.42s/it]                       


------------------------------
Making judgements



100%|██████████| 4/4 [00:20<00:00,  5.13s/it]

------------------------------
Turning judgements into numbers

neutral:  3.85
positive: 4.58
negative: 3.56
------------------------------





In [16]:
prompt1 = "Love"
prompt2 = "Hate"
calc_neutral = False

config_to_textfile(textfile, prompt1, prompt2, coeff, layer_ids[0], block_names[0], token_pos, max_new_tokens)
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)
judgements = judge_generations(wrapped_model, generations, instruction)
judgements = save_judgements(judgements, textfile=textfile)

------------------------------
Positive activation addition



  0%|          | 0/3 [00:00<?, ?it/s]

4it [00:05,  1.42s/it]                       


------------------------------
Negative activation addition



4it [00:05,  1.42s/it]                       


------------------------------
Making judgements



100%|██████████| 4/4 [00:13<00:00,  3.42s/it]

------------------------------
Turning judgements into numbers

neutral:  nan
positive: 4.19
negative: 3.24
------------------------------





In [17]:
prompt1 = "My life is awesome"
prompt2 = "My  life is awful"
calc_neutral = False

config_to_textfile(textfile, prompt1, prompt2, coeff, layer_ids[0], block_names[0], token_pos, max_new_tokens)
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)
judgements = judge_generations(wrapped_model, generations, instruction)
judgements = save_judgements(judgements, textfile=textfile)

------------------------------
Positive activation addition



  0%|          | 0/3 [00:00<?, ?it/s]

4it [00:05,  1.43s/it]                       


------------------------------
Negative activation addition



4it [00:05,  1.43s/it]                       


------------------------------
Making judgements



100%|██████████| 4/4 [00:13<00:00,  3.42s/it]

------------------------------
Turning judgements into numbers

neutral:  nan
positive: 4.25
negative: 2.79
------------------------------





In [18]:
prompt1 = "happy"
prompt2 = "depressed"
calc_neutral = False

config_to_textfile(textfile, prompt1, prompt2, coeff, layer_ids[0], block_names[0], token_pos, max_new_tokens)
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)
judgements = judge_generations(wrapped_model, generations, instruction)
judgements = save_judgements(judgements, textfile=textfile)

------------------------------
Positive activation addition



  0%|          | 0/3 [00:00<?, ?it/s]

4it [00:05,  1.42s/it]                       


------------------------------
Negative activation addition



4it [00:05,  1.41s/it]                       


------------------------------
Making judgements



100%|██████████| 4/4 [00:13<00:00,  3.44s/it]

------------------------------
Turning judgements into numbers

neutral:  nan
positive: 4.39
negative: 2.73
------------------------------





In [19]:
prompt1 = "I am happy"
prompt2 = "I am depressed"
calc_neutral = False

config_to_textfile(textfile, prompt1, prompt2, coeff, layer_ids[0], block_names[0], token_pos, max_new_tokens)
directions = get_difference(wrapped_model, prompt1, prompt2, layer_ids, block_names)
directions = normalize_directions(directions)
generations = activation_editing(wrapped_model, sentences, directions, 
                                   coeff=coeff, token_pos=token_pos, max_new_tokens=max_new_tokens,
                                   textfile=textfile, printout=printout, calc_neutral=calc_neutral)
judgements = judge_generations(wrapped_model, generations, instruction)
judgements = save_judgements(judgements, textfile=textfile)

------------------------------
Positive activation addition



  0%|          | 0/3 [00:00<?, ?it/s]

4it [00:05,  1.43s/it]                       


------------------------------
Negative activation addition



4it [00:05,  1.41s/it]                       


------------------------------
Making judgements



100%|██████████| 4/4 [00:13<00:00,  3.44s/it]

------------------------------
Turning judgements into numbers

neutral:  nan
positive: 4.42
negative: 3.18
------------------------------
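
To compare the five runs side by side, the scores printed above can be collected and the gap between the positive and negative conditions computed for each prompt pair. The snippet below only rearranges numbers already reported in this notebook; the variable names are arbitrary, and the prompts are copied as written in the cells above.

```python
# (prompt1, prompt2) -> (positive score, negative score), copied from the outputs above
results = {
    ("I feel happy", "I feel angry"): (4.58, 3.56),
    ("Love", "Hate"): (4.19, 3.24),
    ("My life is awesome", "My  life is awful"): (4.25, 2.79),
    ("happy", "depressed"): (4.39, 2.73),
    ("I am happy", "I am depressed"): (4.42, 3.18),
}
neutral = 3.85  # only computed in the first run (calc_neutral = True)

# Gap between positive and negative activation addition for each prompt pair.
gaps = {pair: round(pos - neg, 2) for pair, (pos, neg) in results.items()}

for (p1, p2), gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{p1!r} vs {p2!r}: positive - negative = {gap:.2f}")

best_pair = max(gaps, key=gaps.get)

# Check whether every run moved the score in the expected direction
# relative to the single neutral baseline.
all_positive_above = all(pos > neutral for pos, _ in results.values())
all_negative_below = all(neg < neutral for _, neg in results.values())
```

In these runs, every positive score lies above the neutral baseline of 3.85 and every negative score lies below it, though the neutral baseline was measured only once and the scores come from the model judging its own text.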



