# Forward Reachability Analysis

In this notebook we look at the forward reachability results, where we search
for a set of prompts $U = [\mathbf u_1, \mathbf u_2, \dots]$ that maximizes the
size of the reachable output set $\mathcal R(\mathbf x_0) = \{y_i \mid y_i =
\arg\max_{y} P_{LM}(y \mid \mathbf u_i + \mathbf x_0)\}$, where $+$ denotes
concatenation of token sequences.
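
To make the definition concrete, here is a minimal sketch of the bookkeeping behind $\mathcal R(\mathbf x_0)$. The function `toy_argmax_next_token` is a hypothetical stand-in for $\arg\max_y P_{LM}(y \mid \mathbf u_i + \mathbf x_0)$ and is not the real model; `reachable_set` is likewise an illustrative name and is not taken from `greedy_forward_single.py`.

```python
from typing import Callable, Dict, List


def reachable_set(
    U: List[List[int]],
    x_0: List[int],
    argmax_next_token: Callable[[List[int]], int],
) -> Dict[int, List[int]]:
    """Map each reached output y to the first prompt u that produced it."""
    y_to_u: Dict[int, List[int]] = {}
    for u in U:
        y = argmax_next_token(u + x_0)  # u is prepended to x_0
        y_to_u.setdefault(y, u)         # keep the first u that reaches y
    return y_to_u


def toy_argmax_next_token(ids: List[int]) -> int:
    # Hypothetical stand-in for the language model: a deterministic
    # function of the token ids, used only to exercise the bookkeeping.
    return sum(ids) % 5


U = [[0], [5], [1, 2], [7]]
x_0 = [3, 4]
R = reachable_set(U, x_0, toy_argmax_next_token)
print(len(R), R)
```

In this toy setup the prompts `[0]` and `[5]` reach the same output, so four prompts yield a reachable set of size three. Maximizing $|\mathcal R(\mathbf x_0)|$ amounts to finding prompts that land on outputs not yet in the dictionary.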

The data for the first simple test was generated with:

```bash
python3 scripts/greedy_forward_single.py \
    --model meta-llama/Meta-Llama-3-8B \
    --x_0 "helloworld1" \
    --output_dir results/helloworld1 \
    --max_iters 100 \
    --max_parallel 100 \
    --pool_size 100 \
    --rand_pool \
    --push 0.1 \
    --pull 1.0 \
    --frac_ext 0.2 
```

The script call with the "Roger Federer" prompt:
```bash
python3 scripts/greedy_forward_single.py \
    --model meta-llama/Meta-Llama-3-8B \
    --x_0 "Roger Federer is the greatest" \
    --output_dir results/greatest2 \
    --max_iters 5 \
    --max_parallel 100 \
    --pool_size 100 \
    --rand_pool \
    --push 0.1 \
    --pull 1.0 \
    --frac_ext 0.02
```

## Todo
 - [x] Load hello world results from `helloworld1` -- make a function to grab 
 `args.json`, `Y_to_U.json`, `R_t.json`, `U_t.json`. 
 - [x] Load the model and tokenizer based on `args.json` (under `"model"` is the 
 HF model name). 
 - [x] Visualize `Y_to_U.json` as decoded strings (not token ids). This is a dict
 mapping from the integer token id of a given reachable $y$ to its 
 corresponding $\mathbf u$ (list of token ids) that steers the model to $y$ given 
 `x_0` (you can find the string version of `x_0` in `args.json` under `"x_0"`).
 - [ ] Run the ids of `[u_i + x_0]` (concatenated) through the model and cache 
 the last layer activations to disk. Similar to
 [get_value_reps.py](https://github.com/amanb2000/Emo_LLM/blob/main/scripts/get_value_reps.py)
 from the LLM emotion project. Store these in the `helloworld1` results folder,
 and make it programmatic/functionalized (pipeline should work for arbitrary
 results folder).
 - [ ] Create PCA plot based on the final layer/final token representation. Label with 
 the reached output (show this if you hover). 
 - [ ] Create PCA plot based on the final token/all layers representation. Label with the reached output (show this if you hover). 
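
One way to keep the caching step programmatic for an arbitrary results folder is a small load-or-compute helper. The sketch below is only an illustration: `load_or_compute` and `fake_compute` are hypothetical names, and the cached payload is a tiny placeholder, not real activations.

```python
import json
import os
import tempfile
from typing import Any, Callable


def load_or_compute(results_folder: str, filename: str,
                    compute_fn: Callable[[], Any]) -> Any:
    """Return the cached JSON artifact if present, else compute and cache it."""
    path = os.path.join(results_folder, filename)
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    result = compute_fn()
    os.makedirs(results_folder, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def fake_compute():
    # Placeholder for the expensive model forward passes.
    calls.append(1)
    return [{"y": 284, "final_token_rep": [[0.0, 1.0]]}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(tmp, "final_token_reps.json", fake_compute)
    second = load_or_compute(tmp, "final_token_reps.json", fake_compute)

print(first == second, len(calls))
```

The second call reads the file written by the first, so the expensive step runs only once per results folder.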

We want to use plotly and export the plots as HTML files for easy consumption. 
PCA should generally go to dimension 3, and we should have one function for making 
nice 3D scatter plots of the PCA and another to make 2D scatter plots. 






In [39]:
import json
import os
from tqdm import tqdm 
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
def load_results(results_folder):
    """
    Load the results from the specified folder.
    
    Args:
        results_folder (str): Path to the results folder.
        
    Returns:
        tuple: (args, Y_to_U, R_t, U_t, x_0_ids), where x_0_ids is -1 if
        x_0_ids.json is not present in the folder.
    """
    with open(os.path.join(results_folder, "args.json"), "r") as f:
        args = json.load(f)
        
    with open(os.path.join(results_folder, "Y_to_U.json"), "r") as f:
        Y_to_U_ = json.load(f)
        # convert all keys to ints
        Y_to_U = {int(k): v for k, v in Y_to_U_.items()}
        
    with open(os.path.join(results_folder, "R_t.json"), "r") as f:
        R_t = json.load(f)
        
    with open(os.path.join(results_folder, "U_t.json"), "r") as f:
        U_t = json.load(f)

    # check if x_0_ids.json is in the results folder. If it's there, 
    # load it. If not, set x_0_ids to the sentinel value -1
    if os.path.exists(os.path.join(results_folder, "x_0_ids.json")):
        with open(os.path.join(results_folder, "x_0_ids.json"), "r") as f:
            x_0_ids = json.load(f)
    else:
        x_0_ids = -1

        
    return args, Y_to_U, R_t, U_t, x_0_ids

def check_data_consistency(args, Y_to_U, R_t, U_t, x_0_ids):
    """
    Check the consistency of the loaded data.
    
    Args:
        args (dict): The args dictionary.
        Y_to_U (dict): The Y_to_U dictionary.
        R_t (list): The R_t list.
        U_t (list): The U_t list.
        x_0_ids (list or int): Token ids of x_0, or -1 if they were not saved.
    """
    assert len(R_t) == len(U_t), "R_t and U_t should have the same length."
    assert args["model"] is not None, "Model name should be specified in args."
    assert args["x_0"] is not None, "x_0 should be specified in args."
    if x_0_ids == -1: 
        print("x_0_ids not found. Skipping consistency check.")

    
# Example usage
results_folder = "helloworld1"
args, Y_to_U, R_t, U_t, x_0_ids = load_results(results_folder)
check_data_consistency(args, Y_to_U, R_t, U_t, x_0_ids)

x_0_ids not found. Skipping consistency check.


In [3]:
def decode_Y_to_U(Y_to_U, tokenizer):
    """
    Decode the token IDs in Y_to_U using the tokenizer.
    
    Args:
        Y_to_U (dict): The Y_to_U dictionary.
        tokenizer (AutoTokenizer): The loaded tokenizer.
        
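    Returns:
        dict: A new dictionary with decoded strings for Y and U. Distinct
        token ids that decode to the same string share one key, so the
        result can have fewer entries than Y_to_U.
    """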
    Y_to_U_str = {}
    for y, u_list in Y_to_U.items():
        y_str = tokenizer.decode(y)
        u_str_list = [tokenizer.decode(u) for u in u_list]
        Y_to_U_str[y_str] = u_str_list
    return Y_to_U_str

In [4]:
x_0 = args['x_0']
print("x_0: ", x_0)
print("Model: ", args['model'])


print("\nLength of Y_to_U: ", len(Y_to_U))
print("First 5 elements of Y_to_U: ", [{list(Y_to_U.keys())[i]: Y_to_U[list(Y_to_U.keys())[i]]} for i in range(5)])

x_0:  helloworld1
Model:  meta-llama/Meta-Llama-3-8B

Length of Y_to_U:  424
First 5 elements of Y_to_U:  [{284: [0]}, {198: [5]}, {11: [58665, 81319]}, {2021: [75099, 81319]}, {8: [2432, 81319]}]


In [5]:
if 'add_special_tokens' not in args.keys() or args['add_special_tokens']:
    print("Add special tokens is set to True")
    add_special_tokens = True
else: 
    print("Add special tokens is set to False")
    add_special_tokens = False
tokenizer = AutoTokenizer.from_pretrained(args['model'], add_special_tokens=add_special_tokens)

# use the EOS token as the pad token 
tokenizer.pad_token = tokenizer.eos_token

Add special tokens is set to True


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [29]:
model = AutoModelForCausalLM.from_pretrained(args['model'], device_map="auto")
model.half()

Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.51s/it]


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head)

In [30]:
Y_to_U_str = decode_Y_to_U(Y_to_U, tokenizer)

# if x_0_ids is the -1 sentinel, we need to encode x_0 into token ids
if isinstance(x_0_ids, int) and x_0_ids == -1:
    print("x_0_ids is -1. Encoding x_0.\n")
    x_0_ids = tokenizer.encode(x_0, return_tensors="pt", add_special_tokens=add_special_tokens).to(model.device)
elif isinstance(x_0_ids, list): 
    print("x_0_ids is a list. Converting to tensor.\n")
    x_0_ids = torch.tensor(x_0_ids)
    # add a singleton batch dimension at dim 0 
    x_0_ids = x_0_ids.unsqueeze(0).to(model.device)
else: 
    assert isinstance(x_0_ids, torch.Tensor), "x_0_ids should be a tensor, a list, or -1."
print("Length of Y_to_U_str:", len(Y_to_U_str))
print("First 5 elements of Y_to_U_str:", list(Y_to_U_str.items())[:5])
print("x_0: ", x_0)
print("x_0_ids: ", x_0_ids)

Length of Y_to_U_str: 422
First 5 elements of Y_to_U_str: [(' =', ['!']), ('\n', ['&']), (',', [' },', ' Kurdistan']), ('.Text', [' textbox', 'Abb']), (')', [')(', 'Abb'])]
x_0:  helloworld1
x_0_ids:  tensor([[128000,     71,  96392,     16]], device='cuda:0')
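
The decoded dictionary has 422 entries while `Y_to_U` has 424. Because `decode_Y_to_U` keys its result by the decoded string, any two token ids that decode to the same text overwrite each other. The sketch below reproduces the effect with a hypothetical `toy_decode` (not the Llama tokenizer) and shows one possible way to keep every entry by keying on the token id instead; `decode_keep_ids` is an illustrative name.

```python
from typing import Callable, Dict, List, Tuple


def decode_keep_ids(
    Y_to_U: Dict[int, List[int]],
    decode: Callable[[int], str],
) -> Dict[int, Tuple[str, List[str]]]:
    """Key by token id so ids with identical decoded text do not collide."""
    return {y: (decode(y), [decode(u) for u in u_list])
            for y, u_list in Y_to_U.items()}


# Hypothetical decoder: ids 2 and 3 both render as the replacement character,
# as can happen with byte-level tokens that are not valid UTF-8 on their own.
toy_vocab = {0: "!", 1: " =", 2: "\ufffd", 3: "\ufffd"}
toy_decode = toy_vocab.__getitem__

toy_Y_to_U = {1: [0], 2: [0], 3: [0]}
by_string = {toy_decode(y): u for y, u in toy_Y_to_U.items()}  # collides
by_id = decode_keep_ids(toy_Y_to_U, toy_decode)                # does not

print(len(toy_Y_to_U), len(by_string), len(by_id))
```

Keying by id keeps the decoded view the same length as `Y_to_U`, which matters if the decoded strings are later matched back to per-prompt data by position or count.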


In [42]:
# sanity check: make sure the recorded y is the argmax next token for [u + x_0]
def check_argmax(x_0_ids, u_list, model, tokenizer): 
    """
    x_0_ids: [1, seq_len] torch tensor on model device
    u_list: list of token IDs (1-dim list)
    model: the model to use for prediction
    tokenizer: the tokenizer, used for its pad_token_id
    Returns the argmax next-token id (Python int) at the last position.
    """
    u_tensor = torch.tensor(u_list).unsqueeze(0).to(model.device)
    input_ids = torch.cat([u_tensor, x_0_ids], dim=-1)
    attn_mask = input_ids != tokenizer.pad_token_id

    with torch.no_grad():  # inference only, no gradients needed
        output = model(input_ids=input_ids, attention_mask=attn_mask)
    logits = output.logits
    argmax = torch.argmax(logits, dim=-1)
    return argmax[0, -1].item()

# check the argmax for the first num_to_check elements of Y_to_U
num_errors = 0
num_to_check = 10
for y, u_list in tqdm(list(Y_to_U.items())[:num_to_check]):
    argmax = check_argmax(x_0_ids, u_list, model, tokenizer)
    if y != argmax: 
        num_errors += 1
        print(f"[ERR] y: {y}, u_list: {u_list}, argmax: {argmax}\n")
print("Errors: ", num_errors, " of ", num_to_check)

100%|██████████| 10/10 [00:26<00:00,  2.63s/it]

Errors:  0  of  10





## Get Final Token Reps

In this section we will use a similar technique to
[get_value_reps.py](https://github.com/amanb2000/Emo_LLM/blob/main/scripts/get_value_reps.py).


```python 
def get_final_token_reps():
    """
    Get the final token representations for the given Y_to_U (dict[int, List[int]]).
    x_0_ids should be a tensor of shape [1, seq_len] on the model device.

    Returns: 
        final_token_reps: List[Dict[str, Any]] with dict_keys(['y', 'u_list',
        'y_str', 'u_str_list', 'final_token_rep']) where 'final_token_rep' 
        is a list of length num_layers + 1 (embedding output plus one entry per layer), 
        with each element being a list of length hidden_size 
        corresponding to the final token reps at the given layer. 
    """
```

In [44]:
Y_to_U.items()

dict_items([(284, [0]), (198, [5]), (11, [58665, 81319]), (2021, [75099, 81319]), (8, [2432, 81319]), (13, [80519, 81319]), (29, [59031, 81319]), (3592, [85802, 81319]), (92, [25097, 81319]), (1, [70443, 81319]), (5, [15011, 81319]), (518, [93933, 81319]), (271, [19884, 81319]), (498, [52769, 81319]), (49192, [95510, 81319]), (662, [70617, 81319]), (4924, [73488, 81319]), (60, [65956, 81319]), (9205, [66937, 81319]), (317, [15509, 81319]), (6, [62500, 81319]), (5747, [69372, 81319]), (397, [2043, 81319]), (14, [51148, 81319]), (1882, [2965, 81319]), (3996, [50597, 81319]), (340, [24197, 81319]), (62, [33499, 81319]), (3089, [37114, 81319]), (2247, [92366, 81319]), (91, [106823, 81319]), (2628, [81097, 81319]), (663, [51610, 81319]), (7356, [18761, 81319]), (26, [78682, 81319]), (524, [51260, 81319]), (46636, [46636, 81319]), (4194, [109485, 81319]), (34208, [82363, 81319]), (64, [72441, 81319]), (9297, [80136, 81319]), (909, [91859, 81319]), (16378, [65049, 81319]), (1359, [88973, 8131

In [94]:
def get_final_token_reps(Y_to_U, x_0_ids, model, tokenizer, max_debug=-1):
    """
    Get the final token representations for the given Y_to_U (dict[int, List[int]]).
    x_0_ids should be a tensor of shape [1, seq_len] on the model device.

    Returns: 
        final_token_reps: List[Dict[str, Any]] with dict_keys(['y', 'u_list',
        'y_str', 'u_str_list', 'final_token_rep']) where 'final_token_rep' 
        is a list of length num_layers + 1 (embedding output plus one entry per layer), 
        with each element being a list of length hidden_size 
        corresponding to the final token reps at the given layer. 
    """
    final_token_reps = []
    cnt = 0 
    for y, u_list in tqdm(Y_to_U.items()):
        u_tensor = torch.tensor(u_list).unsqueeze(0).to(model.device)
        input_ids = torch.cat([u_tensor, x_0_ids], dim=-1)
        attn_mask = input_ids != tokenizer.pad_token_id
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attn_mask, output_hidden_states=True)
            hidden_states = outputs.hidden_states # tuple of length num_layers + 1 
            # (the embedding output, then the output of each decoder layer).
            # each hidden_states[i] is a tensor of shape [batch_size, seq_len, hidden_size]
            # since we are passing a single input, batch_size is 1 and 
            #   hidden_states[i][0] has shape [seq_len, hidden_size]
            # we want to grab the final token representations for each layer

            final_token_rep = []
            for i in range(len(hidden_states)):
                final_token_rep.append(hidden_states[i][0][-1, :].cpu().numpy().tolist())

        final_token_reps.append({
            "y": y,
            "u_list": u_list,
            "y_str": tokenizer.decode(y),
            "u_str_list": [tokenizer.decode(u) for u in u_list],
            "final_token_rep": final_token_rep
        })
        cnt += 1
        if max_debug > 0 and cnt >= max_debug:
            break
    return final_token_reps

# check if it exists yet 
final_tok_reps_path = os.path.join(results_folder, "final_token_reps.json")

if os.path.exists(final_tok_reps_path):
    print(f"final_token_reps.json already exists. Loading final token reps from disk at {final_tok_reps_path}")
    # load from disk 
    with open(final_tok_reps_path, "r") as f:
        final_token_reps = json.load(f)

else: 
    final_token_reps = get_final_token_reps(Y_to_U, x_0_ids, model, tokenizer)

    # Save the final token representations to disk
    print("Saving final token representations to disk...")
    with open(final_tok_reps_path, "w") as f:
        json.dump(final_token_reps, f)
    print("Done.")



final_token_reps.json already exists. Loading final token reps from disk at helloworld1/final_token_reps.json


In [96]:
final_token_reps[0].keys()

dict_keys(['y', 'u_list', 'y_str', 'u_str_list', 'final_token_rep'])

In [97]:
len(final_token_reps[0]['final_token_rep']) # 33 = embedding output + 32 decoder layers

33

In [99]:
len(final_token_reps[0]['final_token_rep'][0]) # 4096 = hidden size of the final token rep at that layer

4096

## PCA on Token Representations

In this section we will perform PCA and visualize the tokens in `final_token_reps`, 
a list of dicts. Each dict corresponds to one prompt-output-token_rep set. 
We will need to extract the token reps and merge them all into one numpy array, 
maintaining the same order as in `final_token_reps`. The token reps for entry `i` 
are stored in `final_token_reps[i]['final_token_rep']`, one per layer. 

We will begin with `final_token_reps[i]['final_token_rep'][-1]` for all `i` in 
`range(len(final_token_reps))`. This is the final-layer, final-token representation 
(i.e., directly before the logit readout layer), so we should see pretty clear 
delineation between different output representations.
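
As a sketch of the data layout described above, the snippet below stacks per-prompt representations into one array of shape `(N, n_layers, d_model)`, selects the last layer, and projects it onto 3 principal components using NumPy's SVD. It uses small random data in place of the real `final_token_reps`, and `pca_svd` is an illustrative helper, not the scikit-learn `PCA` used in the cells of this notebook.

```python
import numpy as np


def pca_svd(X: np.ndarray, n_components: int = 3):
    """PCA via SVD of the mean-centered data.

    Returns (projection, explained_variance_ratio).
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions, sorted by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_components].T
    var = S ** 2
    return proj, var[:n_components] / var.sum()


rng = np.random.default_rng(0)
# Stand-in for final_token_reps: 20 prompts, 5 layers, hidden size 16.
fake_reps = [{"final_token_rep": rng.normal(size=(5, 16)).tolist()}
             for _ in range(20)]

# Order is preserved, so row i still corresponds to fake_reps[i].
all_reps = np.array([d["final_token_rep"] for d in fake_reps])  # (20, 5, 16)
final_layer = all_reps[:, -1, :]                                # (20, 16)
proj, ratio = pca_svd(final_layer, n_components=3)
print(all_reps.shape, final_layer.shape, proj.shape)
```

Keeping the full `(N, n_layers, d_model)` array around makes it straightforward to repeat the projection for any other layer by changing the index on the second axis.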

In [118]:
import numpy as np
from sklearn.decomposition import PCA
import plotly.express as px

def extract_final_layer_reps(final_token_reps):
    final_layer_reps = []
    y_labels = []
    for rep_dict in final_token_reps:
        final_layer_reps.append(rep_dict['final_token_rep'][-1])
        y_labels.append(f'y_ids: {rep_dict["y"]} -- y_str: {rep_dict["y_str"]} -- u_str_list: {rep_dict["u_str_list"]}')
    return np.array(final_layer_reps), y_labels

def perform_pca(final_layer_reps, n_components=3):
    pca = PCA(n_components=n_components)
    pca_result = pca.fit_transform(final_layer_reps)
    print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
    return pca_result

def plot_pca_results(pca_result, y_labels, final_token_reps):
    """
    Plot the PCA results.
    Args:
        pca_result (np.ndarray): The PCA result from `perform_pca()` raw from pca.fit_transform().
        y_labels (List[str]): The labels for each data point (contains u, y, string versions).
        final_token_reps (List[Dict[str, Any]]): The final token reps, with dict_keys(['y', 'u_list',
        'y_str', 'u_str_list', 'final_token_rep']) 
    final_token_reps is used to color the dots based on the length of the u_str_list.
    """
    u_lengths = [len(rep_dict['u_str_list']) for rep_dict in final_token_reps]
    
    if pca_result.shape[1] == 3:
        fig = px.scatter_3d(x=pca_result[:, 0], y=pca_result[:, 1], z=pca_result[:, 2],
                            color=u_lengths, title='PCA Visualization of Final Layer Token Representations',
                            labels={'x': 'PC1', 'y': 'PC2', 'z': 'PC3', 'color': 'U Length'},
                            hover_data={'y_label': y_labels, 'u_str_list': [rep_dict['u_str_list'] for rep_dict in final_token_reps]})
    else:
        fig = px.scatter(x=pca_result[:, 0], y=pca_result[:, 1],
                         color=u_lengths, title='PCA Visualization of Final Layer Token Representations',
                         labels={'x': 'PC1', 'y': 'PC2', 'color': 'U Length'},
                         hover_data={'y_label': y_labels, 'u_str_list': [rep_dict['u_str_list'] for rep_dict in final_token_reps]})
        
    # override the title to include x_0 and |R| 
    fig.update_layout(title_text="PCA Visualization of Final Layer Token Representations -- x_0 = " + args['x_0'] + " |R| = " + str(len(R_t)))

    # set marker size and outline 
    fig.update_traces(marker=dict(size=5,
                                  line=dict(width=2,
                                            color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
    
    fig.write_html(os.path.join(results_folder, "pca_final_layer.html"))
    fig.show()

In [119]:
# Extract final layer token representations
final_layer_reps, y_labels = extract_final_layer_reps(final_token_reps)

print("Final layer reps shape (|R|, d_model): ", final_layer_reps.shape)
print("y_labels length: ", len(y_labels))
print("y_labels example: ", y_labels[0:3])

Final layer reps shape (|R|, d_model):  (424, 4096)
y_labels length:  424
y_labels example:  ["y_ids: 284 -- y_str:  = -- u_str_list: ['!']", "y_ids: 198 -- y_str: \n -- u_str_list: ['&']", "y_ids: 11 -- y_str: , -- u_str_list: ['(av', 'Abb']"]


In [120]:
# Perform PCA
pca_result = perform_pca(final_layer_reps, n_components=3)

# Plot PCA results
plot_pca_results(pca_result, y_labels, final_token_reps)

Explained variance ratio: [0.14734826 0.07061548 0.06241534]


In [117]:
u_lengths = [len(rep_dict['u_str_list']) for rep_dict in final_token_reps]
# histogram of u_lengths 
fig = px.histogram(x=u_lengths, title='Histogram of U Lengths')
fig.write_html(os.path.join(results_folder, "histogram_u_lengths.html"))
fig.show()