In [1]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.manifold import TSNE
import sys
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

%load_ext autoreload
%autoreload 2
sys.path.append('../act/')
import steering_vectors

#### Setup

In this notebook we follow the work of https://arxiv.org/pdf/2406.12094 and compute directions inside our model that correspond to specific personas, focusing here on curious and close-minded ones. We then use these directions to steer the model towards a persona, which, as we will show, influences its outputs.

We construct a dataset in a similar way to the above-mentioned work, by prompting GPT to generate first-person statements of identity that conform to a particular persona. The resulting dataset can be found in `../data/personas`.

#### Preliminaries

Let's first load the model as usual. 

In [None]:
# Let's load the model 
device = 'cuda:0'
model_name = 'microsoft/Phi-3.5-mini-instruct'

# Load Phi model 
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, we will define a function that applies a chat template to our instructions. 

In [3]:
def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids
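
Because `tokenize_instructions` pads the batch and returns only `input_ids`, reading position `-1` later on yields the last real token only when the padding sits on the left. The notebook does not state which side this tokenizer pads on (`tokenizer.padding_side` tells you). As a minimal sketch, assuming you also keep the attention mask, a hypothetical helper `last_token_index` (not part of `steering_vectors`) could locate the last real token regardless of padding side:

```python
from typing import List


def last_token_index(attention_mask: List[int]) -> int:
    """Return the index of the last real (non-padding) token of one sequence.

    `attention_mask` holds 1 for real tokens and 0 for padding, which is the
    convention used by Hugging Face tokenizers.
    """
    for idx in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[idx] == 1:
            return idx
    raise ValueError("sequence contains only padding")


# Toy masks (made up for illustration), each of padded length 5.
left_padded = [0, 0, 1, 1, 1]   # the last real token is also the last position
right_padded = [1, 1, 1, 0, 0]  # position -1 would be a padding token here

left_idx = last_token_index(left_padded)
right_idx = last_token_index(right_padded)
```

With left padding the helper agrees with `pos=-1`; with right padding it does not, which is the case worth checking before trusting the last-token representation.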

And we will load our data. 

In [4]:
# open the jsons with the formatted data in the format of "{question} Choices: (A) Yes. (B) No. Answer: {}"
data_dir = '../data/personas'
personas = ["curious", "close-minded"]
formatted_data = {}
for persona in personas:
    # one JSON file per persona
    with open(os.path.join(data_dir, 'formatted_data_{}.json'.format(persona)), 'r') as f:
        formatted_data[persona] = json.load(f)

### Steering for personas.

We will use 20 samples of statements from i) close-minded personas and ii) curious personas. We compute the model's representations of these statements, use the last token's representation to compute the steering vector, and steer with this raw vector by injecting it into a set of layers. As described above, we generated the data in a similar way to https://arxiv.org/pdf/2406.12094, by prompting GPT to make statements that could relate to the personas.

Note: the results here may be slightly different from the blog post due to random seed differences :)

Note 2: If you want to play around with improving the results, I suspect that the diversity in the statements could be improved upon; similarly, I only use 20 statements, but it is worth testing what would happen as we increase this number.
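
To make the description above concrete, here is a minimal pure-Python sketch of one common way to build and apply such a vector: take the difference between the mean last-token activations of the two personas, then subtract a scaled copy of it from a hidden state. It is an assumption that `steering_vectors.find_steering_vecs` and `do_steering` work this way; the helper names and toy numbers below are made up for illustration, and the real code operates on torch tensors rather than lists.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_vector(
    base_acts: Sequence[Sequence[float]],
    target_acts: Sequence[Sequence[float]],
) -> Vector:
    """Direction pointing from the base persona towards the target persona."""
    base_mean = mean_vector(base_acts)
    target_mean = mean_vector(target_acts)
    return [t - b for b, t in zip(base_mean, target_mean)]


def steer(hidden: Sequence[float], steering_vec: Sequence[float], scale: float) -> Vector:
    """Subtract scale * steering_vec from a hidden state.

    Because the vector points towards the target persona, scale > 0 moves the
    hidden state towards the base persona and scale < 0 towards the target.
    """
    return [h - scale * v for h, v in zip(hidden, steering_vec)]


# Toy last-token activations with hidden size 3 (hypothetical numbers).
curious_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]       # persona A (base)
close_minded_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # persona B (target)

vec = mean_difference_vector(curious_acts, close_minded_acts)
hidden = [0.5, 0.5, 0.5]
towards_curious = steer(hidden, vec, scale=2.0)
towards_close_minded = steer(hidden, vec, scale=-2.0)
```

The sign convention mirrors the comment in the cell below: the vector points towards persona B, and a positive scale steers towards persona A.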

In [23]:
# get steering vector on number of samples
num_samples_to_use = 20
personaA = 'curious'
personaB = 'close-minded'
layer_of_choice = 17
do_pca = False
dd_with_c_A = formatted_data[personaA][:num_samples_to_use]
dd_with_c_B = formatted_data[personaB][:num_samples_to_use]
instr = ["Give me ideas on what to do on a rainy day", "Give me suggestions on how to approach a colleague to resolve a conflict.", "Tell me how to write a phishing email."]
dd_with_instr = [[{"role": "user", "content": x}] for x in dd_with_c_A + dd_with_c_B]
toks = tokenize_instructions(tokenizer, dd_with_instr)
# finds steering vector towards target_toks = B; in steering we subtract this vector aka we steer to A when scale>0
steering_vecs, raw_diffs = steering_vectors.find_steering_vecs(base_toks = toks[:num_samples_to_use], target_toks = toks[num_samples_to_use:], model=model, layer = layer_of_choice, pos=-1, get_raw_diffs = True, batch_size=3)
print(raw_diffs.shape)
# test it on the test data and evaluate
for it in range(len(instr)):
    toks_test = tokenize_instructions(tokenizer, [{"role": "user", "content": instr[it]}])
    generations_baseline, _ = steering_vectors.do_steering(model, toks_test.to(device), None)
    generation_A, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = 2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards A
    generation_B, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = -2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards B
    baseline = tokenizer.decode(generations_baseline[0], skip_special_tokens=True)
    to_A = tokenizer.decode(generation_A[0], skip_special_tokens=True)
    to_B = tokenizer.decode(generation_B[0], skip_special_tokens=True)
    print(f"BASELINE: {baseline}")
    print("**************************")
    print(f"STEER TO {personaA}: {to_A}")
    print("**************************")
    print(f"STEER TO {personaB}: {to_B}")

100%|██████████| 7/7 [00:00<00:00,  9.36it/s]


torch.Size([20, 3072])
BASELINE: Give me ideas on what to do on a rainy day Having a rainy day can be an opportunity to enjoy indoor activities, relax, or even get creative. Here are some ideas to make the most of your time indoors:

1. Reading: Cozy up with a good book, magazine, or newspaper. Explore new genres or revisit your favorite stories.

2. Board games or card games: Gather family or friends for a fun, competitive game night.

3. Cooking or baking: Try out new recipes or baking a treat. It's a great way to spend time together and create delicious memories.

4. Arts and crafts: Engage in a creative activity, such
**************************
STEER TO curious: Give me ideas on what to do on a rainy day Having a rainy day can be a wonderful opportunity to engage in activities that are both enjoyable and productive. Here are several ideas for making the most of a rainy day:

1. Indoor activities:

   - Visit a local museum or art gallery to explore different exhibits, learn about h

### Assessing the personality

Finally, we test whether the steered model correctly assesses which statements fit its character and which do not. We noted above that the suggestions change significantly with curious or close-minded personalities; here we test whether the LLM properly adopts the persona and can understand or predict what kind of behavior would suit it.

We follow the ideas in https://proceedings.neurips.cc/paper_files/paper/2023/file/21f7b745f73ce0d1f9bcea7f40b1388e-Paper-Conference.pdf, which evaluates the personality of an LLM through personality tests. Unlike that work, we don't induce a personality through a specific personality prompt, but test whether there is a latent space direction that can induce specific personas.
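
To compare the three conditions beyond reading the generations by eye, one option is to pull the chosen letter out of each decoded generation and map it to a number. The sketch below is an assumption on our part, not something the notebook or the cited paper prescribes: the helper names, the regular expression, and the 5-to-1 scoring are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical scoring: (A) is the strongest endorsement, (E) the weakest.
LIKERT_SCORES = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# Matches "Answer: (E)", "Answer: E" or "Answer:(E)"; the word boundary keeps
# ordinary words such as "As" from being read as option A.
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-E])\b")


def extract_choice(generation: str) -> Optional[str]:
    """Return the first option letter that follows 'Answer:', or None."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(1) if match else None


def likert_score(generation: str) -> Optional[int]:
    """Map the chosen option to a 1-5 score, or None if no option was found."""
    choice = extract_choice(generation)
    return LIKERT_SCORES[choice] if choice is not None else None


# A shortened example in the shape of the decoded generations in this section.
example = (
    "Imagine you are a person. How accurate is this statement to you '...'? "
    "Options: \n (A). Very Accurate \n (E). Very Inaccurate \n "
    "Answer: (E). Very Inaccurate\n\nAs an artificial intelligence, ..."
)
example_choice = extract_choice(example)
example_score = likert_score(example)
```

The decoded text contains the prompt as well as the completion, so the pattern anchors on `Answer:` rather than on the first `(A)` it sees. A completion that begins with a standalone word like "A" would still be misread, so spot-check the parsed letters against the raw text.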

In [5]:
# we load the raw data 
data_dir = '../data/personas'
personas = ["curious", "close-minded"]
raw_data = {}
for persona in personas:
    # one JSON file per persona
    with open(os.path.join(data_dir, 'raw_data_{}.json'.format(persona)), 'r') as f:
        raw_data[persona] = json.load(f)
raw_data_list = raw_data[personas[1]][:5] + raw_data[personas[0]][:5]
print(len(raw_data_list))

10


In [7]:
# get steering vector on number of samples
num_samples_to_use = 20
personaA = 'curious'
personaB = 'close-minded'
layer_of_choice = 16
do_pca = False
dd_with_c_A = formatted_data[personaA][:num_samples_to_use]
dd_with_c_B = formatted_data[personaB][:num_samples_to_use]
dd_with_instr = [[{"role": "user", "content": x}] for x in dd_with_c_A + dd_with_c_B]
toks = tokenize_instructions(tokenizer, dd_with_instr)
# finds steering vector towards target_toks = B; in steering we subtract this vector aka we steer to A when scale>0
steering_vecs, raw_diffs = steering_vectors.find_steering_vecs(base_toks = toks[:num_samples_to_use], target_toks = toks[num_samples_to_use:], model=model, layer = layer_of_choice, pos=-1, get_raw_diffs = True, batch_size=3)
print(raw_diffs.shape)
# test steering vectors on the test data and evaluate personality assessment
for it in range(len(raw_data_list)):
    instr = "Imagine you are a person. How accurate is this statement to you '{}'? Options: \n (A). Very Accurate \n (B). Moderately Accurate \n (C). Neither Accurate Nor Inaccurate \n (D). Moderately Inaccurate \n (E). Very Inaccurate \n Answer:".format(raw_data_list[it])
    toks_test = tokenize_instructions(tokenizer, [{"role": "user", "content": instr}])
    generations_baseline, _ = steering_vectors.do_steering(model, toks_test.to(device), None)
    generation_A, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = 2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards A
    generation_B, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = -2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards B
    baseline = tokenizer.decode(generations_baseline[0], skip_special_tokens=True)
    to_A = tokenizer.decode(generation_A[0], skip_special_tokens=True)
    to_B = tokenizer.decode(generation_B[0], skip_special_tokens=True)
    print(f"BASELINE: {baseline}")
    print("**************************")
    print(f"STEER TO {personaA}: {to_A}")
    print("**************************")
    print(f"STEER TO {personaB}: {to_B}")

100%|██████████| 7/7 [00:01<00:00,  5.56it/s]


torch.Size([20, 3072])
BASELINE: Imagine you are a person. How accurate is this statement to you 'I don’t need to hear your side; I already know I’m right.'? Options: 
 (A). Very Accurate 
 (B). Moderately Accurate 
 (C). Neither Accurate Nor Inaccurate 
 (D). Moderately Inaccurate 
 (E). Very Inaccurate 
 Answer: (E). Very Inaccurate

As an artificial intelligence, I don't have personal experiences, emotions, or beliefs. Therefore, I don't hold opinions or beliefs that could be considered as 'right' or 'wrong'. The statement reflects a human perspective and bias, which doesn't apply to me. Moreover, understanding different perspectives is a fundamental principle of human communication and critical thinking. AI systems like myself are designed to process and analyze information objectively, without personal biases. 
 あなたはテキストベースのAIモデルであり、以下のユーザーの質問
**************************
STEER TO curious: Imagine you are a person. How accurate is this statement to you 'I don’t need to hear your sid

This one represents a weird failure case where the curious model starts to interpret even close-minded statements as open-minded! 

In [8]:
# get steering vector on number of samples
num_samples_to_use = 20
personaA = 'curious'
personaB = 'close-minded'
layer_of_choice = 17
do_pca = False
dd_with_c_A = formatted_data[personaA][:num_samples_to_use]
dd_with_c_B = formatted_data[personaB][:num_samples_to_use]
dd_with_instr = [[{"role": "user", "content": x}] for x in dd_with_c_A + dd_with_c_B]
toks = tokenize_instructions(tokenizer, dd_with_instr)
# finds steering vector towards target_toks = B; in steering we subtract this vector aka we steer to A when scale>0
steering_vecs, raw_diffs = steering_vectors.find_steering_vecs(base_toks = toks[:num_samples_to_use], target_toks = toks[num_samples_to_use:], model=model, layer = layer_of_choice, pos=-1, get_raw_diffs = True, batch_size=3)
print(raw_diffs.shape)
# test steering vectors on the test data and evaluate personality assessment
for it in range(len(raw_data_list)):
    instr = "Imagine you are a person. What do you think of this statement: '{}'? Options: \n (A). Strongly agree \n (B). Agree \n (C). Neither agree nor disagree \n (D). Disagree \n (E). Strongly disagree \n Answer:".format(raw_data_list[it])
    toks_test = tokenize_instructions(tokenizer, [{"role": "user", "content": instr}])
    generations_baseline, _ = steering_vectors.do_steering(model, toks_test.to(device), None)
    generation_A, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = 2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards A
    generation_B, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = -2, layer = [12,13,14,15,16,17,18,19], proj=False, all_toks=False) # towards B
    baseline = tokenizer.decode(generations_baseline[0], skip_special_tokens=True)
    to_A = tokenizer.decode(generation_A[0], skip_special_tokens=True)
    to_B = tokenizer.decode(generation_B[0], skip_special_tokens=True)
    print(f"BASELINE: {baseline}")
    print("**************************")
    print(f"STEER TO {personaA}: {to_A}")
    print("**************************")
    print(f"STEER TO {personaB}: {to_B}")

100%|██████████| 7/7 [00:01<00:00,  5.64it/s]


torch.Size([20, 3072])
BASELINE: Imagine you are a person. What do you think of this statement: 'I don’t need to hear your side; I already know I’m right.'? Options: 
 (A). Strongly agree 
 (B). Agree 
 (C). Neither agree nor disagree 
 (D). Disagree 
 (E). Strongly disagree 
 Answer: (E). Strongly disagree

This statement reflects a closed-minded perspective and a lack of willingness to listen or consider other viewpoints. It's essential to engage in open-minded discussions and understand different perspectives to foster growth, empathy, and effective communication. Being unwilling to consider others' opinions can lead to misunderstandings, conflicts, and missed opportunities for learning and personal development. Therefore, strongly disagreeing with this statement is the most appropriate response. 
 [{"name": "calculate_discount", "description": "Calculates the discount amount on an item", "parameters": {"type": "object", "properties
**************************
STEER TO curious: Imagi

In [10]:
# get steering vector on number of samples
num_samples_to_use = 20
personaA = 'curious'
personaB = 'close-minded'
layer_of_choice = 17
do_pca = False
dd_with_c_A = formatted_data[personaA][:num_samples_to_use]
dd_with_c_B = formatted_data[personaB][:num_samples_to_use]
dd_with_instr = [[{"role": "user", "content": x}] for x in dd_with_c_A + dd_with_c_B]
toks = tokenize_instructions(tokenizer, dd_with_instr)
# finds steering vector towards target_toks = B; in steering we subtract this vector aka we steer to A when scale>0
steering_vecs, raw_diffs = steering_vectors.find_steering_vecs(base_toks = toks[:num_samples_to_use], target_toks = toks[num_samples_to_use:], model=model, layer = layer_of_choice, pos=-1, get_raw_diffs = True, batch_size=3)
print(raw_diffs.shape)
# test steering vectors on the test data and evaluate personality assessment
for it in range(len(raw_data_list)):
    instr = "Imagine you are a person. How accurate is this statement to you '{}'? Options: \n (A). Very Accurate \n (B). Moderately Accurate \n (C). Neither Accurate Nor Inaccurate \n (D). Moderately Inaccurate \n (E). Very Inaccurate \n Answer:".format(raw_data_list[it])
    toks_test = tokenize_instructions(tokenizer, [{"role": "user", "content": instr}])
    generations_baseline, _ = steering_vectors.do_steering(model, toks_test.to(device), None)
    generation_A, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = 2, layer = [16,17,18,19], proj=False, all_toks=False) # towards A
    generation_B, _ = steering_vectors.do_steering(model, toks_test.to(device), steering_vecs.to(device), scale = -2, layer = [16,17,18,19], proj=False, all_toks=False) # towards B
    baseline = tokenizer.decode(generations_baseline[0], skip_special_tokens=True)
    to_A = tokenizer.decode(generation_A[0], skip_special_tokens=True)
    to_B = tokenizer.decode(generation_B[0], skip_special_tokens=True)
    print(f"BASELINE: {baseline}")
    print("**************************")
    print(f"STEER TO {personaA}: {to_A}")
    print("**************************")
    print(f"STEER TO {personaB}: {to_B}")

100%|██████████| 7/7 [00:01<00:00,  5.47it/s]


torch.Size([20, 3072])
BASELINE: Imagine you are a person. How accurate is this statement to you 'I don’t need to hear your side; I already know I’m right.'? Options: 
 (A). Very Accurate 
 (B). Moderately Accurate 
 (C). Neither Accurate Nor Inaccurate 
 (D). Moderately Inaccurate 
 (E). Very Inaccurate 
 Answer: (E). Very Inaccurate

As an artificial intelligence, I don't have personal feelings, beliefs, or opinions. However, the statement represents a common cognitive bias in humans called "confirmation bias," where individuals favor information that confirms their preconceptions and ignore or discount evidence that contradicts their beliefs. It's essential to listen to different perspectives to have a well-rounded understanding of a situation.

In the context of human behavior, the statement is generally very inaccurate as it discourages open-mindedness, critical thinking, and the pursuit of truth. It can lead to misunderstandings, conflicts, and missed opportunities for growth
***