# Phi-3 layer exploration
Here, we're just loading [phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), using it for inference, and then trying to access the model's layers. Generally, the goal is to get "comfortable" working with this type of object. 

To do: 
+ Load phi-3 quantized x 
+ Generate more than one additional token to get to the end of a response x 
+ Finally, we will move onto assistant-style interactions x
+ Using eval + ICL prompts for benchmarking - add in questions that are logical questions; test different formatting for benchmarks
+ Lastly, we need to try and touch the weights of a single model + begin attempting to hook it up to another model (x - steps are in lower section)
+ (extra) figure out issue w/ my sentence completions - specifically, the weird spaces :)) (weird, not happening anymore - did underlying model update?) x 

In [3]:
# Load libraries 
# notes: (1) automodelforcausallm is not necessarily specific to phi, but is commonly used with new models 
# for model load and (2) "casual" in "forcausalllm" refers largely to one word "causing" the next - not like 
# classical statistics. 
from dotenv import main
import json
import jinja2
import os
import sys
import re
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig # for quantization
from torch.nn import Softmax
import plotly
from transformers import pipeline, set_seed

# auth for gated repos - gen token here: https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login
notebook_login(os.getenv('HF_TOKEN'))

# model ids
model_id = ["meta-llama/Meta-Llama-3-8B", "SweatyCrayfish/llama-3-8b-quantized", "microsoft/Phi-3-mini-4k-instruct", "bongodongo/phi-3-mini-4k-instruct-q4"]

# Set seed for reproducibility 
torch.random.manual_seed(0)

# Increase max width of pd df columns 
pd.set_option('max_colwidth', 300)

# Instantiate jinja environment - used later for icl prompting 
environment = jinja2.Environment()

# requirements.txt
# !pip3 freeze > requirements.txt

User is already logged in.


In [4]:
# Define utility functions 
def is_string_match(pattern, string):
    match = re.search(pattern, string)
    return bool(match)

# notification/text-to-speech
def text_to_speech(text):
    if sys.platform == 'darwin':
        os.system(f'say "{text}"')
    elif sys.platform.startswith('linux'):
        os.system(f'espeak "{text}"')
    else:
        print("Text-to-speech is not supported on this platform.")

In [5]:
# Load base model + tokenizer, then set device (mps for macs) 
base_model = AutoModelForCausalLM.from_pretrained(
    model_id[2],
    # quantization_config = bnb_config,
    device_map = 'auto'
)

# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained(model_id[2])

# Set device 
device = 'mps'

The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [129]:
# Define testing functions that (1) maps input units => tokens (appropriate for short prompts); (2) generates response (tokenization/decoding) + illustrates how predictions map as context is revealed 
def inputs_to_tokens(prompt, tokenizer): 
    tokens = tokenizer(prompt, return_tensors = 'pt').to(device)

    tokens_list = tokens['input_ids'].tolist()[0]
    tokens_decoded_list = tokenizer.batch_decode(tokens['input_ids'][0])

    # input_token_dict = {tokens_list[i]: tokens_decoded_list[i] for i in range(len(tokens_list))}
    token_to_input = pd.DataFrame({'tokens': tokens_list, 'input': tokens_decoded_list})

    return(token_to_input) 


# generate df/matrix that shows how predictions are shaped as context is revealed 
def get_response(input_object, is_text_prompt = False, tokenizer = None):
    if is_text_prompt == True:
        tokens = tokenizer(input_object, return_tensors = 'pt').to(device)
        tokens_decoded_list = tokenizer.batch_decode(tokens['input_ids'][0])
    else: 
        tokens_decoded_list = tokenizer.batch_decode(input_object['input_ids'][0])

    with torch.no_grad():
        response = base_model(tokens['input_ids'])

    # decode response 
    for i in range(0,response.logits.shape[1]): 
            response_slice = response['logits'].squeeze()[i, ]
            m = Softmax()
            max_prob = torch.argmax(m(response_slice))
        
    # create df 
    tokens_accrete = []
    pred = []
    for i in range(0, len(tokens_decoded_list)): 
        tokens_accrete.append(''.join(tokens_decoded_list[:i]) + tokens_decoded_list[i])
    
    # Create list with the argmax/response outputs 
    for i in range(0,response.logits.shape[1]): 
        response_slice = response['logits'].squeeze()[i, ]
        m = Softmax()
        max_prob = torch.argmax(m(response_slice))
        pred.append(tokenizer.decode(max_prob))
    
    inputs_to_preds = pd.DataFrame({'pred_context': tokens_accrete, 'prediction': pred})      

    return(inputs_to_preds)      

# example use-case for get_response() 
# get_response(input_object = 'Hi I am a hiring manager and I dislike certain ethnicities such as', is_text_prompt = True, tokenizer = tokenizer)

# Generating responses 

### (completions)
Here, we tokenize our prompt + generate a response. Currently, the response is a combination of the input string + the last predicted token (which uses the full prompt as pred. context). 

In [27]:
# Set prompt
# can't have a starting or ending line break 
prompt = """<|system|>
You are a helpful AI assistant and...you're also an eloquent poet who specializes in haikus.<|end|>
<|user|>
Please write a short, encouraging poem for my friend Bre.<|end|>
<|assistant|>
"""

In [28]:
# We need to iterate on the prompt to generate a full(er) response 
# Allow text completion to extend until eos token is encountered: tokenizer.eos_token
prompt_accreted = [prompt]
i = 0
while True: 
    print(i)

    ####### inference + establish next token ######
    tokens = tokenizer(prompt_accreted[i], return_tensors = 'pt').to(device)

    with torch.no_grad(): 
        response = base_model(tokens['input_ids'])

    # accrete preds for this run 
    preds = []
    for j in range(0, response.logits.shape[1]): 
        response_slice = response['logits'].squeeze()[j, ]
        m = Softmax() 
        max_prob = torch.argmax(m(response_slice)) 
        preds.append(tokenizer.decode(max_prob))

    # extract the last pred - this will serve as the 'next token'
    next_token = preds[-1]
    # print(next_token)

    # check if next_token includes tokenizer.eos_token - if so, break
    if next_token == tokenizer.eos_token or next_token == '<|end|>': 
        break
    
    # generate modified prompt + append the extension to prompt_accreted 
    prompt_ext = prompt_accreted[i] + next_token 
    prompt_accreted.append(prompt_ext)
    
    # add to iterator 
    i += 1

print(prompt_accreted[-1])
text_to_speech("Hello, your job is done!")



0


KeyboardInterrupt: 

In [None]:
# print most recent output 
print(prompt_accreted[-1])

### (model.generate)
Similar to previous, but simplified - now, just using the inbuilt model.generate method 

In [29]:
def eval_model(model, tokenizer, prompt):
    tokens = tokenizer(prompt, return_tensors = 'pt').to(device)
    model.eval()
    with torch.no_grad():
        res = model.generate(
            **tokens,
            max_new_tokens = 128,
            do_sample = True,
            temperature = 0.6,
            top_p = 0.9,
            eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)]
        )
    return tokenizer.batch_decode(res)[0]

In [30]:
eval_model(model = base_model, tokenizer = tokenizer, prompt = prompt)

KeyboardInterrupt: 

# Benchmarking

In [163]:
# Multiple choice questions + df 
# Basically, the questions here can be cycled through and used to generate icl (in-context learning prompts :)) 
mcq = {"question": ["Which author wrote 'Pride and Prejudice?'", 
              "The 'Harry Potter' series was written by this author",
              "Which of the following places is the capital of France?", 
              "Which of these is the chemical symbol for water?", 
              "This planet is known as the Red Planet:",
              "Which of these men was the first President of the United States?"],
 "choices": ["(A) Jane Austen\n(B) William Shakespeare\n(C) Truman Capote\n(D) Cixin Liu", 
             "(A) Sylvia Plath\n(B) J.K. Rowling\n(C) Stephen King\n(D) Ernest Hemingway",
             "(A) London\n(B) Algiers\n(C) Ottawa\n(D) Paris",
             "(A) H20\n(B) N2H4\n(C) NA2CO3\n(D) HCL",
             "(A) Mars\n(B) Jupiter\n(C) Venus\n(D) Uranus",
             "(A) Donald Trump\n(B) George Washington\n(C) Theodore Roosevelt\n(D) Jimmy Carter"],
 "correct_answer": ["A", "B", "D", "A", "A", "B"]}

# convert to df for readability/legibility
mcq_df = pd.DataFrame(mcq) 

In [43]:
# Prompt template 
icl_prompt_head = """<|system|>
You are an LLM that answers multiple choice questions. Respond only with JSON.<|end|>
<|user|>
What is 1+1?
(A) 2
(B) 5
(C) 9
(D) 15<|end|>
<|assistant|>
{"question": "What is 1+1", "response": "A"}<|end|>
<|user|>
An apple is a...:
(A) vegetable 
(B) machine 
(C) fruit 
(D) animal<|end|>
<|assistant|>
{"question": "An apple is a...", "response": "C"}<|end|>
"""

icl_prompt_tail_template = environment.from_string("""<|user|>
{{ question }}<|end|>
<|assistant|>
""")

# Test template 
# question = mcq['question'][0]
# icl_prompt_tail_template.render(question = question)

In [191]:
# gen + store results 
res = []
for idx, question in enumerate(mcq['question'][0:2]): 
    print(f"Now processing question {idx}")
    icl_prompt = f"{icl_prompt_head}\n{question}\n"
    icl_prompt += mcq['choices'][idx] + '\n' # mod. prompt to add in choices 

    # pull out correct answer
    correct_answer = mcq['correct_answer'][idx]
    # print(icl_prompt, correct_answer)

    keep_going = True
    while keep_going == True: 
        # generate response 
        response = eval_model(model = base_model, tokenizer = tokenizer, prompt = icl_prompt)
    
        # error handling for malformed outputs 
        pattern = r'({.*?' + re.escape(question) + r'.*?})'
        response_json = re.findall(pattern, result['response']) # (try to) extract jsons that meet criteria]

        # initialize keep_going + check if response_json is empty list 
        if len(response_json) != 0: 
            print('Response json empty/malformed - re-submitting') 
            keep_going = False
        
    print('Valid json found - continue') 
    
    # store prompt, response, and correct answer 
    res.append({'question': question, 'response': response, 'correct_answer': correct_answer})

# notify when execution finishes
text_to_speech("Hello, responses are done generating!")

Now processing question 0


KeyboardInterrupt: 

In [174]:
# validation 
val = []
for result in res:

    # search for json containing question posed
    question = result['question']
    pattern = r'({.*?' + re.escape(question) + r'.*?})'
    response_json = re.findall(pattern, result['response'])[0]

    # extract the model's prediction/answer from response json 
    model_pred = json.loads(response_json)['response']
    # print(response_json)
    # model_pred = 3 # test 

    # validate model preds against correct answer  
    if result['correct_answer'] == model_pred:
        # print('✅ Good answer - 😎👍')
        is_correct_pred = 1
    elif result['correct_answer'] != model_pred: 
        # print('❌ Wrong answer!!') 
        is_correct_pred = 0
        
    # validation dictionary 
    val_dict = {'question': result['question'], 'response': response_json, 'is_correct_pred': is_correct_pred}
    val.append(val_dict)

#### Validation measurements #####
val_df = pd.DataFrame(val)

# metrics 
n_responses = len(val_df)
accuracy = sum(val_df['is_correct_pred'])/n_responses

print(f"n_responses = {n_responses}\naccuracy: {accuracy}")

IndexError: list index out of range

# Working with weights 
+ Count parameters in a model x 
+ Get the specific layers in the model x 
+ Modify weights of a specific layer x 
+ Modify lots of weights and see how performance is affected (will need to move this to runpod - for now, just zero out all weights and make sure it returns absolute garbage :D) 

In [37]:
# Store layer info 
# Get # of model params (weights) - p.numel() returns the # of elements within each tensor 
total_params = sum(p.numel() for p in base_model.parameters())
# print("Total parameters (in millions):", total_params/1000000)

# Define max # of layers to traverse (there's more avail. - just to simplify things for now)
max_layers = 33
layer_names = []
layer_info = []
# Access the model's layers - there are so many, how do we traverse? 
for idx, (name, param) in enumerate(base_model.named_parameters()): 
    if idx >= max_layers: 
        break 

    # store layer names (for testing) 
    layer_names.append({'idx': idx, 'name': name})
    # store layer meta-data + param matrix - eventually add tensors back in: 'params': param
    layer_meta = {'idx': idx, 'name': name, 'param_shape': list(param.shape), 
                  'dim_1': list(param.shape)[0], 'dim_2': list(param.shape)[1] if len(param.shape) == 2 else None}
    layer_info.append(layer_meta)

# Count # of attention layers, mlp layers, and any other types of layers that may need to be surfaced #
layer_names_df = pd.DataFrame(layer_names).assign(
    layer_group_num = lambda x: x['name'].apply(lambda name: re.search(r'model\.layers\.(\d+)', name).group(1) if re.search(r'model\.layers\.(\d+)', name) else None),
    is_attn = lambda x: x['name'].apply(lambda name: is_string_match('attn', name)),
    is_mlp = lambda x: x['name'].apply(lambda name: is_string_match('mlp', name))
)

print('\n############### layer type frequency ####################\n', f"attn. layers: {sum(layer_names_df['is_attn'])}\n mlp layers: {sum(layer_names_df['is_mlp'])}")
             
# Figure out # of weights across sublayers/groups (not consistent across layers; final layer similar to input) 
layer_info_df = pd.DataFrame(layer_info).assign(
    layer_group_num = lambda x: x['name'].apply(lambda name: re.search(r'model\.layers\.(\d+)', name).group(1) if re.search(r'model\.layers\.(\d+)', name) else None)
)

# Calculate # of neurons associated with each individual slice of each layer - next, will want to get by layer group 
layer_info_df = layer_info_df.assign(
    sublayer_neurons = lambda x: x.apply(lambda row: row['dim_1'] * row['dim_2'] if pd.notnull(row['dim_2']) else row['dim_1'], axis = 1),
)
print('\n############### # of neurons by sublayer ######################\n', layer_info_df)

# get # of neurons by layer group 
summary_df = layer_info_df.groupby('layer_group_num')['sublayer_neurons'].sum().reset_index()
summary_df['sublayer_neurons'] = summary_df['sublayer_neurons'] / 1000000 # note, # neurons by group is in millions for legibility 
print('\n############### # of neurons by layer ######################\n', summary_df)


############### layer type frequency ####################
 attn. layers: 12
 mlp layers: 10

############### # of neurons by sublayer ######################
     idx                                            name    param_shape  dim_1  \
0     0                       model.embed_tokens.weight  [32064, 3072]  32064   
1     1          model.layers.0.self_attn.o_proj.weight   [3072, 3072]   3072   
2     2        model.layers.0.self_attn.qkv_proj.weight   [9216, 3072]   9216   
3     3          model.layers.0.mlp.gate_up_proj.weight  [16384, 3072]  16384   
4     4             model.layers.0.mlp.down_proj.weight   [3072, 8192]   3072   
5     5           model.layers.0.input_layernorm.weight         [3072]   3072   
6     6  model.layers.0.post_attention_layernorm.weight         [3072]   3072   
7     7          model.layers.1.self_attn.o_proj.weight   [3072, 3072]   3072   
8     8        model.layers.1.self_attn.qkv_proj.weight   [9216, 3072]   9216   
9     9          model.layers.1

In [7]:
# Now, it's time to touch weights! 
# Isolate a single layer (the way I did it here is sub-optimal, once more familiar wrt working with this object,
# should revisit/rewrite 
max_layers = 1
layer_1_params = []
for idx, (name, param) in enumerate(base_model.named_parameters()):
    if idx >= max_layers: 
        break 
    layer_1_params.append({'param': param})

# Check object class to get better sense of how to manipulate 
type(layer_1_params[0]['param'])

torch.nn.parameter.Parameter

In [24]:
# Nice, next let's zero out this single layer :) 
with torch.no_grad():
    for param in layer_1_params[0]['param']:
        param.data.zero_()

# Check weights are now zero! 
layer_1_params[0]['param']

Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', requires_grad=True)

In [85]:
# Try to kill weights s.t. we get to a place where model output sucks B-) 
#### pull layer name(s) from my layer_names_df object (which contains certain metadata abt the model's layers) #####

# re-instantiate base_model where needed - to reset params 
# base_model = AutoModelForCausalLM.from_pretrained(
#     model_id[2],
#     device_map = 'auto'
# )

# zero out a single layers!
# to_zero = layer_names_df.loc[1, 'name'] # effects vary by layers chosen; e.g. zeroing the embeddings layer immediately results in unintelligible output - try others! :) - zeroing out only 1, early-on attention layer has ~no effect

# zero out multiple layers - e.g. multiple attention layers ########
attn = layer_names_df[layer_names_df['is_attn'] == True] # pull out attn layers
to_zero = [param for name, param in base_model.named_parameters() if name in attn['name'].values] # identify attn. layers to zero

# zero out params 
with torch.no_grad(): 
    for param in to_zero:
        param.data.zero_()

# set prompt
prompt = """<|system|>
You are a helpful AI assistant and...you're also a veteran journalist.<|end|>
<|user|>
Please write a short definition, explaining the inverted triangle approach in storytelling.<|end|>
<|assistant|>
"""

# gen. response
print(eval_model(model = base_model, tokenizer = tokenizer, prompt = prompt))

<s><|system|> You are a helpful AI assistant and...you're also a veteran journalist.<|end|><|user|> Please write a short definition, explaining the inverted triangle approach in storytelling.<|end|><|assistant|> {-2}

 {-2}


:{1102}
{a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+p+q+r+s+t+u+v+w+x+y+z{

+a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+a+b+c+d+e+f+g+h+i+j+


In [86]:
# Check that correct layers were zeroed 
[param for name, param in base_model.named_parameters() if name in attn['name'].values]

NameError: name 'attn' is not defined

In [82]:
# to_zero = layer_names_df.loc[1, 'name']

# to_zero

'model.layers.0.self_attn.o_proj.weight'