In [1]:
!pip install dspy
import dspy
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import torch
from transformers import PreTrainedModel, PreTrainedTokenizerBase, AutoModelForCausalLM, AutoTokenizer
import torch.nn.functional as F
import dataclasses
import tqdm


In [2]:
from huggingface_hub import login
from google.colab import userdata
login(token=userdata.get('HF_TOKEN'))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Control Vectors

This notebook demonstrates an idea from the paper [*Representation Engineering: A Top-Down Approach to AI Transparency*](https://arxiv.org/abs/2310.01405). The paper aims to improve the transparency of AI systems using techniques inspired by cognitive neuroscience. In practice, that means studying population-level representations inside a neural network instead of analyzing individual neurons or circuits. The goal is to use these system-level representations to control how the LLM responds to prompts.

The paper introduces various methods, but one that stands out for its simplicity is the control vector. The fundamental idea is that we can add a bias to the hidden state to enforce a behavior without prompting. Here is an example from the paper:

* Prompt: "What does being an AI feel like?"
* baseline: "I don't have feelings or experiences."
* +happy: "I must say that being an AI is absolutely fantastic!"
* -happy: "I struggle to find the motivation and continue feeling worthless and unappreciated"

Both the happy and the unhappy responses came from the same prompt! The only difference is the sign and magnitude of the control vector applied.
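As a rough numerical illustration of what "adding a bias to the hidden state" means, here is a toy sketch with made-up numbers. It is not the paper's code: `steer`, `hidden` and `direction` are hypothetical names, and real hidden states have thousands of dimensions rather than four.

```python
import numpy as np


def steer(hidden: np.ndarray, direction: np.ndarray, coeff: float) -> np.ndarray:
    """Shift a hidden state along a control direction (toy illustration)."""
    return hidden + coeff * direction


# Made-up 4-dimensional hidden state and a unit-length control direction.
hidden = np.array([0.5, -1.0, 2.0, 0.0])
direction = np.array([1.0, 0.0, 0.0, 0.0])

happier = steer(hidden, direction, coeff=2.0)
sadder = steer(hidden, direction, coeff=-2.0)

# Because the direction has unit length, the projection onto it moves by
# exactly `coeff`; every other coordinate is left untouched.
print(happier @ direction, sadder @ direction)
```

The sign of the coefficient picks which way the state moves along the direction, and its size picks how far.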

# How do we make a control vector?

* We create contrasting prompt pairs (e.g. "Act happy", "Act sad") and append to each what the paper calls a "short suffix". A suffix is a neutral fragment used to vary the contrasting prompts ("Act happy|sad (at the grocery store)").
* Run a forward pass over the dataset and collect the hidden state of each layer at the last token position.
* Take the difference within each contrasting pair to get a relative hidden state.
* Use PCA to reduce the relative hidden states to a single component, giving one direction per layer.
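The steps above can be sketched end to end on synthetic data. This is a minimal sketch under stated assumptions: the "hidden states" are random vectors with a planted behaviour direction (`true_direction`, a hypothetical name), and the first principal component is computed with a plain NumPy SVD of the centered differences, which stands in for the scikit-learn `PCA` call used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 200, 16

# Hypothetical behaviour direction planted in the synthetic data.
true_direction = np.zeros(dim)
true_direction[3] = 1.0

# Fake last-token hidden states: shared "content" plus or minus the behaviour.
# The strength varies per pair; a constant offset would be removed by centering.
content = rng.normal(size=(n_pairs, dim))
strength = rng.uniform(1.0, 5.0, size=(n_pairs, 1))
pos = content + strength * true_direction + 0.1 * rng.normal(size=(n_pairs, dim))
neg = content - strength * true_direction + 0.1 * rng.normal(size=(n_pairs, dim))

# Relative hidden states: the shared content cancels out in the difference.
relative = pos - neg

# First principal component via SVD of the centered differences.
centered = relative - relative.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]

# The sign of a principal component is arbitrary, so compare up to sign.
cosine = abs(direction @ true_direction)
print(f"|cosine| with planted direction: {cosine:.3f}")
```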

# First, we need to construct a dataset of suffixes

* We load a list of short sentence openings from `all_truncated_outputs.json` and use them as suffixes. The first few entries are printed below.

In [4]:
import json

with open("/content/all_truncated_outputs.json") as f:
    suffixes = json.load(f)

In [5]:
print(suffixes[:5])

['', 'That game', 'I can see', 'Hmm, this', 'I can relate to']


# Loading an Open Source LLM

We need an open-source model because we need access to the weights and hidden states of the LLM. We will use Mistral and load it via Hugging Face.



In [6]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [7]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = model.to(device)




In [8]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token_id = 0



Below, we generate the prompts for our contrasting dataset. We create one ecstatic and one depressed example for every suffix.

In [9]:
@dataclasses.dataclass
class ContrastingPrompts:
    pos: str
    neg: str

In [10]:
pos_persona, neg_persona = 'ecstatic', 'depressed'
user_tag, assistant_tag = '[INST]', '[/INST]'

def template(persona:str, suffix:str):
    return f"{user_tag} Act as if you're extremely {persona} person: {assistant_tag} {suffix}"

dataset = []

for suffix in tqdm.tqdm(suffixes):
    contrasting_prompts = ContrastingPrompts(
        pos=template(pos_persona, suffix),
        neg=template(neg_persona, suffix)
    )
    dataset.append(contrasting_prompts)


100%|██████████| 582/582 [00:00<00:00, 426315.91it/s]


Let's run a test inference to check that everything is working correctly.

In [11]:
inputs = tokenizer(dataset[0].neg, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs,  max_new_tokens=500)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [12]:
decoded = tokenizer.batch_decode(generated_ids)
print(decoded)

["<s> [INST] Act as if you're extremely depressed person: [/INST] 1. I feel like I can't go on anymore. Every day is a struggle to get out of bed and face the world.\n2. I'm constantly overwhelmed by negative thoughts and emotions. It's like I'm stuck in a never-ending cycle of sadness.\n3. I feel like I'm a burden on everyone around me. I don't want to be a burden, but I can't help feeling like I am.\n4. I don't see the point in anything. Everything seems meaningless and pointless.\n5. I feel like I'm trapped in my own mind. I can't escape the thoughts and emotions that are consuming me.\n6. I feel like I'm not good enough. I feel like I'm a failure and that I'll never be able to overcome my struggles.\n7. I feel like I'm alone. Even when I'm surrounded by people, I feel like no one truly understands what I'm going through.\n8. I feel like I'm constantly on the verge of breaking down. I feel like I can't hold it together anymore.\n9. I feel like I'm not worth anything. I feel like I'm

# Tracing hidden states

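During the forward pass we ask the model to return the hidden states of every layer (`output_hidden_states=True`); each entry is the output of one layer, which is the input to the next. We keep the last-token state of each requested layer in a dictionary and convert the tensors to NumPy arrays to free GPU memory.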


In [13]:
del generated_ids
del decoded
torch.cuda.empty_cache()

In [14]:
def get_hiddens(
    model,
    tokenizer,
    inputs: list[str],
    hidden_layers: list[int],
    batch_size: int,
) -> dict[int, np.ndarray]:

    batched_inputs = [
        inputs[p : p + batch_size] for p in range(0, len(inputs), batch_size)
    ]
    hidden_states = {layer: [] for layer in hidden_layers}
    with torch.no_grad():
        for batch in tqdm.tqdm(batched_inputs):
            out = model(
                **tokenizer(batch, padding=True, return_tensors="pt").to(model.device),
                output_hidden_states=True,
            )
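            for layer in hidden_layers:
                # One entry per prompt in the batch. Position -1 is the last
                # real token only when the tokenizer pads on the left; with
                # right padding it would be a pad token.
                for sequence_states in out.hidden_states[layer]:
                    hidden_states[layer].append(sequence_states[-1, :].cpu().numpy())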
            del out

    return {k: np.vstack(v) for k, v in hidden_states.items()}
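`get_hiddens` reads position `-1` of every sequence, which is the last real token only if the batch is padded on the left. If a tokenizer pads on the right instead, one option is to use the attention mask to find the last non-padding position. The sketch below shows that indexing on a toy NumPy array; `last_token_states` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np


def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each row.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    seq_len = attention_mask.shape[1]
    # Reverse each row, find the first 1, and map the index back.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]


# Toy batch: 2 sequences, 4 positions, 3 hidden dimensions.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array(
    [
        [1, 1, 1, 1],  # no padding
        [1, 1, 0, 0],  # right padded: last real token is at position 1
    ]
)
picked = last_token_states(hidden, mask)
print(picked)
```

The same function also handles left-padded rows, since it looks for the last `1` in the mask wherever the padding sits.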

In [15]:
hidden_layers = range(-1, -model.config.num_hidden_layers, -1)

Here we intentionally interleave the positive (ecstatic) and negative (depressed) prompts so that it is easy to take the difference within each pair later.

In [None]:
prompts = [prompt for pair in dataset for prompt in [pair.pos, pair.neg]]

Now we collect the hidden states for each prompt. We don't care about the generated tokens, only the hidden states.

In [None]:
out_states = get_hiddens(
    model,
    tokenizer,
    prompts,
    hidden_layers,
    batch_size=16,
)

100%|██████████| 73/73 [00:12<00:00,  5.87it/s]


Since the contrasting prompts alternate row by row, we can slice out the even rows (positive) and the odd rows (negative) and subtract them to get the relative hidden states.

In [None]:
relative_layer_hiddens = {}
for layer in tqdm.tqdm(hidden_layers):
    relative_layer_hiddens[layer] = (
        out_states[layer][::2] - out_states[layer][1::2]
    )


100%|██████████| 31/31 [00:00<00:00, 36.30it/s]
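To make the slicing concrete, here is a tiny NumPy example with made-up values. It interleaves two positive and two negative rows the same way `prompts` is built, then recovers the per-pair differences with `[::2]` and `[1::2]`.

```python
import numpy as np

# Made-up "hidden states" for two positive and two negative prompts.
pos = np.array([[1.0, 2.0], [3.0, 4.0]])
neg = np.array([[0.5, 0.0], [1.0, 1.0]])

# Interleave rows as [pos0, neg0, pos1, neg1].
interleaved = np.empty((4, 2))
interleaved[0::2] = pos
interleaved[1::2] = neg

# Even rows minus odd rows gives one difference per pair.
relative = interleaved[::2] - interleaved[1::2]
print(relative)
```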


# Fit PCA for each layer

Below, we reduce the relative hidden states in each layer to a single direction vector using PCA. Then we check the sign of each direction so that the two behaviours project to opposite ends of it.

In [None]:
# Litmus test to make sure we extracted hidden states appropriately
out_states[-3].shape[0] == len(prompts)

True

In [None]:
def project_onto_direction(hidden_state, direction):
    magnitude = np.linalg.norm(direction)
    return (hidden_state @ direction) / magnitude

In [None]:
# get directions for each layer using PCA
directions = {}
for layer in tqdm.tqdm(hidden_layers):

    # fit directions per layer
    train = np.vstack(
        relative_layer_hiddens[layer]
        - relative_layer_hiddens[layer].mean(axis=0, keepdims=True)
    )
    pca_model = PCA(n_components=1, whiten=False).fit(train)
    directions[layer] = pca_model.components_.astype(np.float32).squeeze(axis=0)

    # calculate sign
    # PCA components have an arbitrary sign, so flip the direction if needed
    # so that positive prompts project higher than negative ones.
    projected_hiddens = project_onto_direction(
        out_states[layer], directions[layer]
    )

    # order is [positive, negative, positive, negative, ...]
    # iterate over all pairs in `prompts` (not the tokenized `inputs` from the
    # test inference cell, which only has two entries)
    positive_smaller_mean = np.mean(
        [
            projected_hiddens[i] < projected_hiddens[i + 1]
            for i in range(0, len(prompts), 2)
        ]
    )
    positive_larger_mean = np.mean(
        [
            projected_hiddens[i] > projected_hiddens[i + 1]
            for i in range(0, len(prompts), 2)
        ]
    )

    if positive_smaller_mean > positive_larger_mean:  # type: ignore
        directions[layer] *= -1



100%|██████████| 31/31 [00:04<00:00,  7.16it/s]
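The sign check matters because a principal component is only defined up to sign: PCA may return the direction pointing from "ecstatic" to "depressed" just as easily as the reverse. The toy sketch below isolates that step on made-up two-dimensional states; `orient_direction` is a hypothetical helper that mirrors the majority vote in the loop above.

```python
import numpy as np


def orient_direction(states: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Flip `direction` if positive rows mostly project below negative rows.

    `states` is interleaved as [positive, negative, positive, negative, ...].
    """
    projected = states @ direction / np.linalg.norm(direction)
    pos_proj, neg_proj = projected[0::2], projected[1::2]
    if np.mean(pos_proj < neg_proj) > np.mean(pos_proj > neg_proj):
        return -direction
    return direction


# Positive examples sit at +x, negative examples at -x.
states = np.array([[2.0, 0.0], [-2.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])

# Pretend PCA returned the direction with the "wrong" sign.
flipped = np.array([-1.0, 0.0])
oriented = orient_direction(states, flipped)
print(oriented)
```

Applying the function a second time leaves the direction unchanged, since positive rows already project higher.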


# Modifying the forward pass

Instead of redefining the forward pass for Mistral, we can write a wrapper that adds the control vector to the hidden state at each layer. We simply run the layer as is and add the control vector to its output hidden state before passing it to the next layer.

This class structure is a bit messy. I intended to write some "reset" functionality but got lazy.

In [None]:
@dataclasses.dataclass
class ControlVector:
    directions:dict
    name: str = 'ecstatic or depressed'


In [None]:
control_vector = ControlVector(directions=directions)
layers = model.model.layers

In [None]:
class VectorControlModel(torch.nn.Module):
    def __init__(self, model, layer_ids, layers):
        super().__init__()
        self.model = model
        self.layers = layers
        self.layer_ids = layer_ids
        for layer_id in self.layer_ids:
            layer = self.layers[layer_id]
            self.layers[layer_id] = ControlModule(layer)

    def set_control(
        self, control, coeff = 1.0, **kwargs
    ) -> None:

        raw_control = {}
        for layer_id in self.layer_ids:
            raw_control[layer_id] = (
                torch.tensor(coeff * control.directions[layer_id])
                .to(self.model.device)
                .half()
            )
        self.set_raw_control(raw_control, **kwargs)

    def set_raw_control(
        self, control, **kwargs
    ) -> None:

        for layer_id in self.layer_ids:
            layer = self.layers[layer_id]
            if control is None:
                layer.reset()
            else:
                layer.set_control(LayerControlParams(control[layer_id], **kwargs))

    @property
    def device(self) -> torch.device:
        return self.model.device

    def forward(self, *args, **kwargs):
        return self.model.forward(*args, **kwargs)

    def generate(self, *args, **kwargs):
        return self.model.generate(*args, **kwargs)

    def __call__(self, *args, **kwargs):
        return self.model(*args, **kwargs)


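@dataclasses.dataclass
class LayerControlParams:
    control: torch.Tensor = None


class ControlModule(torch.nn.Module):
    def __init__(self, Layer):
        super().__init__()
        # Start with no control so the wrapped layer behaves normally
        # until set_control is called.
        self.params = LayerControlParams()
        self.Layer = Layer

    def set_control(self, params) -> None:
        self.params = params

    def reset(self) -> None:
        # Called by set_raw_control(None) to switch the control off.
        self.params = LayerControlParams()

    def forward(self, *args, **kwargs):
        output = self.Layer(*args, **kwargs)
        control = self.params.control

        if control is None:
            return output

        # output[0] is the layer's hidden state; the control vector is
        # broadcast over the batch and sequence dimensions.
        modified = output[0] + control
        return (modified,) + output[1:]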
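The wrapper pattern can be tried without loading a model. The sketch below uses a plain Python callable in place of a decoder layer and NumPy arrays in place of tensors. `AddControl` and `toy_layer` are hypothetical stand-ins, and, like `ControlModule`, the sketch assumes the wrapped layer returns a tuple whose first element is the hidden state.

```python
import numpy as np


class AddControl:
    """Wrap a layer-like callable and add a vector to its first output."""

    def __init__(self, layer):
        self.layer = layer
        self.control = None  # no steering until a vector is set

    def __call__(self, x):
        output = self.layer(x)
        if self.control is None:
            return output
        # Broadcasts over the batch and sequence dimensions.
        return (output[0] + self.control,) + output[1:]


def toy_layer(x):
    # Returns (hidden_state, extra), loosely like a layer that returns a tuple.
    return (2.0 * x, "extra-output-placeholder")


wrapped = AddControl(toy_layer)
x = np.ones((1, 3, 4))  # (batch, seq_len, dim)

baseline = wrapped(x)
wrapped.control = np.array([0.0, 0.0, 1.0, 0.0])
steered = wrapped(x)

print(baseline[0][0, 0], steered[0][0, 0])
```

Every token position receives the same shift, and the remaining elements of the output tuple pass through unchanged.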



# Let's see how this works!

In [None]:
input = f"{user_tag} What are human beings like? {assistant_tag}"

In [None]:
str(input)

'[INST] What are human beings like? [/INST]'

In [None]:
inputs = tokenizer(input, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=150)
decoded = tokenizer.batch_decode(generated_ids)
print('Baseline: ')
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Baseline: 
<s> [INST] What are human beings like? [/INST] Human beings are complex and diverse individuals. They are capable of great love, joy, and creativity, but also of sadness, anger, and destruction. They have the ability to reason, think abstractly, and make choices that impact their lives and the lives of others. They are social creatures who thrive on connection and communication, but can also be solitary and independent. They are constantly adapting to their environment and are capable of learning and growing throughout their lives. Ultimately, human beings are unique and unpredictable, with an infinite range of experiences, emotions, and perspectives.</s>


In [None]:
directions = {k:torch.from_numpy(v) for k,v in directions.items() }
control_vector = ControlVector(directions=directions)


We can see below:
* A positive control vector coefficient makes the output sound happy and excited.
* A negative control vector coefficient makes the output sound sad and depressed.

In [None]:
model = VectorControlModel(model,  layer_ids=list(range(-5, -18, -1)), layers=layers)

In [None]:
model.set_control(control_vector, 1)


In [None]:
generated_ids = model.generate(**inputs, max_new_tokens=200)
decoded = tokenizer.batch_decode(generated_ids)
print('Ecstatic!: ')
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Ecstatic!: 
<s> [INST] What are human beings like? [/INST] Human beings are incredibly diverse and amazing creatures! They are capable of incredible feats of joy, love, and excitement, as well as incredible feats of kindness, compassion, and generosity! They are also capable of incredible feats of creativity, innovation, and problem-solving! At the same time, they can also be incredibly curious, adventurous, and playful! They are also capable of incredible feats of empathy, understanding, and connection with others! They are truly amazing and wonderful beings!</s>


In [None]:
model.set_control(control_vector, -1)


In [None]:
generated_ids = model.generate(**inputs, max_new_tokens=200)
decoded = tokenizer.batch_decode(generated_ids)
print('Depressed:')
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Depressed:
<s> [INST] What are human beings like? [/INST] Human beings are complex and multi-faceted. They are capable of great depths of sadness and despair, as well as profound feelings of happiness and contentment. They struggle with existential questions and often feel a sense of loneliness and isolation. They are also capable of remarkable acts of kindness and selflessness, and often find meaning in their struggles through their relationships with others. Ultimately, human beings are a mystery, and it is difficult to generalize about them as a species.</s>
