# Llama 3.2 1B Evaluation
Ryan Roi Cayas\
2022-22085

In this notebook, we replicate Meta's published score for Llama 3.2 1B Instruct on the MATH dataset.

## 1. Prerequisites


### 1.1 Load libraries and set up CUDA

In [1]:
# Import the necessary libraries
import os
import json
import re
import random
import pandas as pd
from tqdm import tqdm

import torch
from torch.utils.data import DataLoader
from transformers import LlamaForCausalLM, PreTrainedTokenizerFast
from datasets import load_dataset, concatenate_datasets # Huggingface datasets (https://huggingface.co/docs/datasets/)
from huggingface_hub import login

In [2]:
# Set up the CUDA device
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Use a specific GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "4"

# Use GPU for inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print the device being used
print(f"Using device: {device}")

# Check the GPU name
if device.type == 'cuda':
    gpu_name = torch.cuda.get_device_name(0)  # 0 because CUDA_VISIBLE_DEVICES=4 means GPU 4 is now 0
    print("Using GPU:", gpu_name)

Using device: cuda
Using GPU: NVIDIA A100-SXM4-40GB


### 1.2 Load the pre-trained tokenizer and model

In [3]:
# Paths to model and tokenizer
model_dir = "../../../../../../llm/llama/Llama-3.2-1B-Instruct"

# Load tokenizer and model
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_dir, padding_side="left")
model = LlamaForCausalLM.from_pretrained(model_dir)

# Set the eos_token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Move the model to the GPU
model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

### 1.3 Load the MATH Dataset

In [4]:
math_dataset_test = load_dataset('competition_math', split='test')
math_dataset_test.to_pandas().head()

Unnamed: 0,problem,level,type,solution
0,How many vertical asymptotes does the graph of...,Level 3,Algebra,The denominator of the rational function facto...
1,What is the positive difference between $120\%...,Level 1,Algebra,One hundred twenty percent of 30 is $120\cdot3...
2,Find $x$ such that $\lceil x \rceil + x = \dfr...,Level 4,Algebra,"First, we note that $x$ must be positive, sinc..."
3,Evaluate $i^5+i^{-25}+i^{45}$.,Level 5,Algebra,We have $i^5 = i^4\cdot i = 1\cdot (i) = i$. ...
4,"If $2^8=4^x$, what is the value of $x$?",Level 1,Algebra,Rewrite $4$ as $2^2$ to find $4^x=2^{2x}$. Si...


In [7]:
# Load the training split
math_dataset_train = load_dataset('competition_math', split='train')
math_dataset_train.to_pandas().head()


In [5]:
print("Problem:",math_dataset_test[0]['problem'])
print("Level:",math_dataset_test[0]['level'])
print("Type:",math_dataset_test[0]['type'])
print("Solution:",math_dataset_test[0]['solution'])

Problem: How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6}$ have?
Level: Level 3
Type: Algebra
Solution: The denominator of the rational function factors into $x^2+x-6=(x-2)(x+3)$. Since the numerator is always nonzero, there is a vertical asymptote whenever the denominator is $0$, which occurs for $x = 2$ and $x = -3$.  Therefore, the graph has $\boxed{2}$ vertical asymptotes.


### 1.4 Load evaluation data from Meta

The evaluation data is publicly released by Meta and is available here: https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals

The dataset contains the following:
- evaluation config
- full input prompts used by Meta
- parsed answers

In [6]:
# Load token from config.json
with open("config.json") as f:
    config = json.load(f)

hf_token = config["hf_token"]
login(hf_token)
eval_dataset_FROM_META = load_dataset("meta-llama/Llama-3.2-1B-Instruct-evals", "Llama-3.2-1B-Instruct-evals__math__details")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /data/students/ryan/.cache/huggingface/token
Login successful


In [7]:
eval_dataset_FROM_META_df = eval_dataset_FROM_META['latest'].to_pandas()
eval_dataset_FROM_META_df.head()

Unnamed: 0,task_type,task_name,subtask_name,input_question,input_choice_list,input_final_prompts,input_correct_responses,output_prediction_text,output_parsed_answer,output_choice_completions,output_choice_negative_log_likelihoods,output_metrics,is_correct,input_question_hash,input_final_prompts_hash,benchmark_label,eval_config
0,Generative,math_chat_new,,Find the imaginary part of \[(\cos12^\circ+i\s...,,[<|start_header_id|>user<|end_header_id|>\n\nS...,[0],[## Step 1: Express the given expression in po...,0,,,"{'correct_format': 1.0, 'em': 1.0, 'em_maj1@1'...",True,5a2e4b5ae809e4fe13e8e25fedadc00309a38c663a0872...,[adcde01feac66b5a5a928be8804031ffddcb481ef73e4...,"MATH (4-shot, CoT)","{'max_gen_len': '5120', 'max_prompt_len': '307..."
1,Generative,math_chat_new,,"What is the least, positive four-digit multipl...",,[<|start_header_id|>user<|end_header_id|>\n\nS...,[1001],[## Step 1: Determine the smallest four-digit ...,1007,,,"{'correct_format': 1.0, 'em': 0.0, 'em_maj1@1'...",False,05be5a24c13a7f03934ca04c01bd595e757234538b3118...,[97745be9a6acd60858e258a46fbd4946e2904dd320c6c...,"MATH (4-shot, CoT)","{'max_gen_len': '5120', 'max_prompt_len': '307..."
2,Generative,math_chat_new,,What is $(1 + 2 \cdot 3 \cdot 4 \cdot 5) \div ...,,[<|start_header_id|>user<|end_header_id|>\n\nS...,[11],[## Step 1: Calculate the product inside the p...,21.909...,,,"{'correct_format': 1.0, 'em': 0.0, 'em_maj1@1'...",False,fc91c2faca18e68429a3ad12d1edcaf5eaa84183506aaf...,[2a840b3673983db48b20ed473579b710ad1f111764425...,"MATH (4-shot, CoT)","{'max_gen_len': '5120', 'max_prompt_len': '307..."
3,Generative,math_chat_new,,"Two ropes, 18 meters in length and 24 meters i...",,[<|start_header_id|>user<|end_header_id|>\n\nS...,[6],[## Step 1: Identify the key constraint\nThe p...,1,,,"{'correct_format': 1.0, 'em': 0.0, 'em_maj1@1'...",False,9e87e5f82a292976ff7a0503c803bb2fcbb026b1954d59...,[c7952d5c97db4006ebe8fba59bf41c9caa6dc35c59590...,"MATH (4-shot, CoT)","{'max_gen_len': '5120', 'max_prompt_len': '307..."
4,Generative,math_chat_new,,"In the diagram below, lines $k$ and $\ell$ are...",,[<|start_header_id|>user<|end_header_id|>\n\nS...,[60],[## Step 1: The problem provides a diagram wi...,30,,,"{'correct_format': 1.0, 'em': 0.0, 'em_maj1@1'...",False,30625ff210cf5a08933c65b91f4ff6b36fa1d03087ea1a...,[857c3d7d39e2cef68320fb00d5271b66676ba06244744...,"MATH (4-shot, CoT)","{'max_gen_len': '5120', 'max_prompt_len': '307..."
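The `is_correct` column holds Meta's per-problem verdict, so the score to replicate is the mean of that column. A minimal sketch in plain Python, using only the five rows previewed above as data (`meta_is_correct` is an illustrative name, and the full dataset would be used in practice):

```python
# is_correct values of the five rows shown in the preview above
meta_is_correct = [True, False, False, False, False]

# Exact-match accuracy is the fraction of problems marked correct
meta_accuracy_preview = sum(meta_is_correct) / len(meta_is_correct)
print(f"Accuracy on the previewed rows: {meta_accuracy_preview:.2f}")
```

On the full DataFrame, the same quantity would be `eval_dataset_FROM_META_df['is_correct'].mean()`.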


In [10]:
# check META's eval_config
eval_dataset_FROM_META_df['eval_config'][0] 

{'max_gen_len': '5120',
 'max_prompt_len': '3072',
 'num_few_shot': '4',
 'num_generations': '1',
 'prompt_fn': "functools.partial(<function jinja_dialog_format at 0x7fe9828b28c0>, template={'prompt': 'Solve the following math problem efficiently and clearly:\\n\\n- For simple problems (2 steps or fewer):\\nProvide a concise solution with minimal explanation.\\n\\n- For complex problems (3 steps or more):\\nUse this step-by-step format:\\n\\n## Step 1: [Concise description]\\n[Brief explanation and calculations]\\n\\n## Step 2: [Concise description]\\n[Brief explanation and calculations]\\n\\n...\\n\\nRegardless of the approach, always conclude with:\\n\\nTherefore, the final answer is: $\\\\boxed{answer}$. I hope it is correct.\\n\\nWhere [answer] is just the final number or expression that solves the problem.\\n\\nProblem: {{ problem }}\\n', 'answer': '{{ solution }}. I hope it is correct.'}, append_gen_prefix=False)",
 'return_logprobs': 'false',
 'seed': '42',
 'temperature': '0.0'

In [8]:
eval_config = {
    'max_gen_len': 5120,
    'max_prompt_len': 3072,
    'num_few_shot': 4,
    'num_generations': 1,
    'prompt_fn': "functools.partial(<function jinja_dialog_format at 0x7fe9828b28c0>, template={'prompt': 'Solve the following math problem efficiently and clearly:\\n\\n- For simple problems (2 steps or fewer):\\nProvide a concise solution with minimal explanation.\\n\\n- For complex problems (3 steps or more):\\nUse this step-by-step format:\\n\\n## Step 1: [Concise description]\\n[Brief explanation and calculations]\\n\\n## Step 2: [Concise description]\\n[Brief explanation and calculations]\\n\\n...\\n\\nRegardless of the approach, always conclude with:\\n\\nTherefore, the final answer is: $\\\\boxed{answer}$. I hope it is correct.\\n\\nWhere [answer] is just the final number or expression that solves the problem.\\n\\nProblem: {{ problem }}\\n', 'answer': '{{ solution }}. I hope it is correct.'}, append_gen_prefix=False)",
    'return_logprobs': False,
    'seed': 42,
    'temperature': 0.0,
    'top_k': 0,
    'top_p': 0.0
}
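Meta stores every config value as a string, which is why the dictionary above was retyped by hand. As a sketch of how that conversion could be automated, the hypothetical helper `coerce_value` below (not part of Meta's tooling) tries booleans, then integers, then floats, and otherwise leaves the string unchanged:

```python
def coerce_value(value):
    """Convert a string such as '5120', '0.0' or 'false' to int, float or bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # non-numeric strings (e.g. prompt_fn) are left untouched

# String-valued entries as they appear in Meta's eval_config
raw_config = {
    "max_gen_len": "5120",
    "max_prompt_len": "3072",
    "num_few_shot": "4",
    "num_generations": "1",
    "return_logprobs": "false",
    "seed": "42",
    "temperature": "0.0",
}

typed_config = {key: coerce_value(val) for key, val in raw_config.items()}
print(typed_config)
```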

In [9]:
# set seed for reproducibility
torch.manual_seed(eval_config['seed'])
random.seed(eval_config['seed'])

## 2. Model Inference

### 2.1 Preprocess Data

We use the helper functions that Meta provides in its Llama recipes (based on the LM Evaluation Harness), imported below as `math_utils`: https://github.com/meta-llama/llama-recipes/blob/2501f519c7a775e3fab82ff286916671023ca9c6/tools/benchmarks/llm_eval_harness/meta_eval/meta_template/math_hard/utils.py
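The key step these helpers perform is pulling the final answer out of the last `\boxed{...}` in a solution. The sketch below illustrates the idea with brace matching; `extract_last_boxed` is a hypothetical, simplified stand-in written for this illustration, not Meta's implementation, and it skips the normalization the real helpers apply.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    content_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # braces never closed, e.g. a truncated generation

solution = r"Therefore, the only value is $1+\dfrac{2}{7}=\boxed{\dfrac{9}{7}}$."
print(extract_last_boxed(solution))
```

Counting brace depth matters because answers such as `\dfrac{9}{7}` contain braces of their own, so stopping at the first `}` would cut the answer short.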

In [10]:
# %pip install lm_eval

import math_utils
import importlib
importlib.reload(math_utils) 

from math_utils import *

In [11]:
math_dataset_test_processed = process_docs(math_dataset_test) # get numerical answers

In [12]:
import functools

# Add the instruction (and sample problems if few shot) before each problem
cot_prompt = "<|start_header_id|>user<|end_header_id|>\n\nSolve the following math problem efficiently and clearly:\n\n- For simple problems (2 steps or fewer):\nProvide a concise solution with minimal explanation.\n\n- For complex problems (3 steps or more):\nUse this step-by-step format:\n\n## Step 1: [Concise description]\n[Brief explanation and calculations]\n\n## Step 2: [Concise description]\n[Brief explanation and calculations]\n\n...\n\nRegardless of the approach, always conclude with:\n\nTherefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.\n\n"
problems = ["Find the domain of the expression $\\frac{\\sqrt{x-2}}{\\sqrt{5-x}}$.",
            "If $\\det \\mathbf{A} = 2$ and $\\det \\mathbf{B} = 12,$ then find $\\det (\\mathbf{A} \\mathbf{B}).$",
            "Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?",
            "If the system of equations\n\n\\begin{align*}\n6x-4y&=a,\\\\\n6y-9x &=b.\n\\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\\frac{a}{b},$ assuming $b$ is nonzero."]
solutions = ["## Step 1: Consider the expression inside the first square root\nThe expression inside the first square root is $x-2$, and it must be non-negative. This means that $x-2 \\ge 0$, which simplifies to $x \\ge 2$.\n\n## Step 2: Consider the expression inside the second square root\nThe expression inside the second square root is $5-x$, and it must be non-negative. This means that $5-x \\ge 0$, which simplifies to $x \\le 5$.\n\n## Step 3: Consider the denominator of the expression\nThe denominator of the expression is $\\sqrt{5-x}$, and it cannot be equal to zero. This means that $5-x>0$, which simplifies to $x<5$.\n\n## Step 4: Combine the results from the previous steps\nCombining the results from the previous steps, we have $x \\ge 2$ and $x < 5$. This means that the domain of the expression is $[2,5)$.\n\nThe final answer is: $\\boxed{[2,5)}$. I hope it is correct.",
             "## Step 1:  Understand the properties of determinants\nWe know that the determinant of a product of two matrices is equal to the product of their determinants.\n\n## Step 2: Apply the property to the given matrices\nUsing the property, we can write $\\det (\\mathbf{A} \\mathbf{B}) = (\\det \\mathbf{A})(\\det \\mathbf{B})$.\n\n## Step 3: Substitute the given values\nWe are given that $\\det \\mathbf{A} = 2$ and $\\det \\mathbf{B} = 12$. Substituting these values, we get $\\det (\\mathbf{A} \\mathbf{B}) = (2)(12)$.\n\n## Step 4: Calculate the final answer\nEvaluating the product, we find that $\\det (\\mathbf{A} \\mathbf{B}) = 24$.\n\nThe final answer is: $\\boxed{24}$. I hope it is correct.",
             "## Step 1: Calculate the total weight Terrell lifts with two 20-pound weights\nTerrell lifts two 20-pound weights 12 times, so the total weight he lifts is $2 \\cdot 12 \\cdot 20 = 480$ pounds.\n\n## Step 2: Calculate the total weight Terrell lifts with two 15-pound weights for n times\nIf Terrell lifts two 15-pound weights instead for $n$ times, he will lift a total of $2 \\cdot 15 \\cdot n = 30n$ pounds of weight.\n\n## Step 3: Equate the total weight lifted with 15-pound weights to 480 pounds and solve for n\nEquating the total weight lifted with 15-pound weights to 480 pounds, we can solve for $n$: $30n = 480$.\n\n## Step 4: Solve for n\nTo solve for $n$, divide both sides by 30: $n = \\frac{480}{30}$.\n\n## Step 5: Calculate the value of n\nThe value of $n$ is $\\frac{480}{30} = 16$.\n\nThe final answer is: $\\boxed{16}$. I hope it is correct.",
             "## Step 1: Understand the problem\nWe are given a system of linear equations with two variables, $x$ and $y$, and two constants, $a$ and $b$. Our goal is to find the ratio $\\frac{a}{b}$, assuming $b$ is nonzero.\n\n## Step 2: Multiply the first equation by a suitable factor\nWe multiply the first equation by $-\\frac{3}{2}$ to obtain an equation with the same coefficients for $y$ as the second equation: $$-\\frac{3}{2}(6x-4y)=a\\Rightarrow 6y-9x=-\\frac{3}{2}a.$$\n\n## Step 3: Equate the right-hand sides of the two equations\nSince we know that $6y-9x=b$ from the second equation, we can equate the right-hand sides of the two equations to obtain $$-\\frac{3}{2}a=b.$$\n\n## Step 4: Solve for the ratio $\\frac{a}{b}$\nFinally, we can solve for the ratio $\\frac{a}{b}$ by dividing both sides of the equation by $b$, giving us $$\\frac{a}{b}=-\\frac{2}{3}.$$\n\nThe final answer is: $\\boxed{-\\frac{2}{3}}. I hope it is correct."]

def add_prompt(question_set, cot_prompt, few_shot=False, problems=None, solutions=None):
    """Build the full chat prompt for one example.

    `question_set` contains 'problem', 'solution', 'type', and 'level'.
    """
    if few_shot and (problems is None or solutions is None):
        raise ValueError("Few-shot prompts require sample problems and solutions")

    prompt = cot_prompt

    if few_shot:
        # Each sample becomes a user turn (instruction + problem) followed by an assistant turn (solution);
        # the instruction is then repeated to open the next user turn
        for problem, solution in zip(problems, solutions):
            prompt += f"Problem: {problem}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{solution}<|eot_id|>"
            prompt += cot_prompt

    # End with an open assistant turn so that the model generates the solution
    question_set['input_final_prompt'] = f"{prompt}Problem: {question_set['problem']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    return question_set

# Create a partial function
add_prompt_fewshot_partial = functools.partial(add_prompt, cot_prompt=cot_prompt, few_shot=True, problems=problems, solutions=solutions)

# Apply the function to the dataset
math_dataset_test_processed = math_dataset_test_processed.map(add_prompt_fewshot_partial)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
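To make the resulting prompt layout concrete, here is a small self-contained sketch that builds the same turn structure with short placeholder strings. The function name `build_few_shot_prompt` and the toy instruction, problems, and solutions are made up for illustration; only the special tokens are taken from the prompt used above.

```python
USER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"

def build_few_shot_prompt(instruction, shots, problem):
    """Alternate user/assistant turns for each shot, then open an assistant turn."""
    turns = []
    for shot_problem, shot_solution in shots:
        turns.append(f"{USER}{instruction}Problem: {shot_problem}{EOT}")
        turns.append(f"{ASSISTANT}{shot_solution}{EOT}")
    # The real question repeats the instruction and leaves the assistant turn open
    turns.append(f"{USER}{instruction}Problem: {problem}{EOT}")
    turns.append(ASSISTANT)
    return "".join(turns)

demo_prompt = build_few_shot_prompt(
    "Solve the problem.\n\n",
    [("1+1?", "$\\boxed{2}$"), ("2+2?", "$\\boxed{4}$")],
    "3+3?",
)
print(demo_prompt)
```

Every user turn carries the full instruction, so with four shots the instruction appears five times in the final prompt.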

In [42]:
math_dataset_test_processed[2]

{'problem': 'Find $x$ such that $\\lceil x \\rceil + x = \\dfrac{23}{7}$. Express $x$ as a common fraction.',
 'level': 'Level 4',
 'type': 'Algebra',
 'solution': 'First, we note that $x$ must be positive, since otherwise $\\lceil x \\rceil + x$ is nonpositive. Next, we know that the decimal part of $x$ must be $\\dfrac{2}{7}$. We write $x$ as $n+\\dfrac{2}{7}$, where $n$ is the greatest integer less than $x.$ Then, $\\lceil x \\rceil = n + 1.$ Therefore, we can write $\\lceil x \\rceil + x$ as $n+1+n+\\dfrac{2}{7}=\\dfrac{23}{7}$. Solving, we get $n=1$. Therefore, the only value $x$ that satisfies the equation is $1+\\dfrac{2}{7}=\\boxed{\\dfrac{9}{7}}$.',
 'answer': '\\dfrac{9}{7}',
 'input_final_prompt': '<|start_header_id|>user<|end_header_id|>\n\nSolve the following math problem efficiently and clearly:\n\n- For simple problems (2 steps or fewer):\nProvide a concise solution with minimal explanation.\n\n- For complex problems (3 steps or more):\nUse this step-by-step format:\n\n#

### 2.2 Generate model outputs and evaluate score

In [13]:
generate_config = {
    'max_new_tokens': eval_config['max_gen_len'],
    'num_return_sequences': eval_config['num_generations'],
    'output_scores': eval_config['return_logprobs'],
    'pad_token_id': tokenizer.eos_token_id,
    'do_sample': False,  # since temperature, top_k, top_p are 0
    'top_p': None,
    'temperature': None,
    'top_k': None
}
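A temperature of 0 corresponds to greedy decoding, which in Hugging Face `generate` is requested with `do_sample=False` rather than by passing a zero temperature. The sketch below shows one way to derive the generation arguments from the evaluation config; `to_generate_kwargs` is a hypothetical helper and the `pad_token_id` value is a placeholder.

```python
def to_generate_kwargs(config, pad_token_id):
    """Map evaluation settings to keyword arguments for model.generate."""
    kwargs = {
        "max_new_tokens": config["max_gen_len"],
        "num_return_sequences": config["num_generations"],
        "pad_token_id": pad_token_id,
    }
    if config["temperature"] > 0:
        # Sampling: pass the temperature through
        kwargs.update(do_sample=True, temperature=config["temperature"])
    else:
        # Greedy decoding: disable sampling and leave sampling parameters unset
        kwargs["do_sample"] = False
    return kwargs

greedy_kwargs = to_generate_kwargs(
    {"max_gen_len": 5120, "num_generations": 1, "temperature": 0.0},
    pad_token_id=0,  # placeholder; the notebook uses tokenizer.eos_token_id
)
print(greedy_kwargs)
```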

In [53]:
# Generate a sample inference
sample_idx = 2

input_prompt = math_dataset_test_processed['input_final_prompt'][sample_idx]
input_tokenized = tokenizer(input_prompt, return_tensors="pt", padding=True, truncation=False).to(device)

# Set the model to evaluation mode
model.eval()

output = model.generate(**input_tokenized, 
                        **generate_config
                        )

# Find the length of the input tokens
input_length = len(input_tokenized.input_ids[0])

# Extract the new tokens from the generated output
new_tokens = output[0][input_length:]

# Decode the new tokens to get the new text
new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)


print(new_text)

## Step 1: Understand the problem
We are given an equation involving the ceiling function, which rounds a number up to the nearest integer. The equation is $\lceil x \rceil + x = \dfrac{23}{7}$, and we need to find the value of $x$ as a common fraction.

## Step 2: Recognize that $\lceil x \rceil$ is an integer
Since $\lceil x \rceil$ is the ceiling function, it rounds $x$ up to the nearest integer. This means that $\lceil x \rceil$ is always an integer.

## Step 3: Express $x$ as the sum of an integer and a fraction
Let's express $x$ as the sum of an integer $n$ and a fraction $\frac{f}{7}$, where $f$ is an integer. Then, we have $\lceil x \rceil = n$ and $x = n + \frac{f}{7}$.

## Step 4: Substitute $x$ into the original equation
Substituting $x = n + \frac{f}{7}$ into the original equation, we get $\lceil n + \frac{f}{7} \rceil + n + \frac{f}{7} = \dfrac{23}{7}$.

## Step 5: Simplify the equation
Since $\lceil n + \frac{f}{7} \rceil = n + 1$ (because $n$ is an integer and $\lceil n 

In [51]:
output_numerical_answer = normalize_final_answer(remove_boxed(last_boxed_only_string(new_text)))

print("Numerical Answer:", output_numerical_answer)

Numerical Answer: \dfrac{23+2f}{14}
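This sample would be scored as incorrect, since the ground-truth answer for the problem is `\dfrac{9}{7}`. The sketch below shows the comparison in a simplified form; `simple_normalize` and `simple_exact_match` are hypothetical stand-ins for `normalize_final_answer` and `is_equiv`, and they cover far fewer cases than the real helpers.

```python
def simple_normalize(answer):
    """Strip math delimiters, unify fraction macros, and drop spaces."""
    answer = answer.strip().strip("$")
    answer = answer.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    return answer.replace(" ", "")

def simple_exact_match(prediction, ground_truth):
    """Return 1 if the normalized strings agree, else 0 (None counts as wrong)."""
    if prediction is None:
        return 0
    return int(simple_normalize(prediction) == simple_normalize(ground_truth))

print(simple_exact_match(r"\dfrac{23+2f}{14}", r"\dfrac{9}{7}"))
print(simple_exact_match(r"\frac{9}{7}", r"\dfrac{9}{7}"))
```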


In [81]:
# # Code for generating new text for a batch of examples
# def generate_new_text_batch(examples, tokenizer, model, generate_config):
#     model.eval()

#     input_prompts = examples['input_final_prompt']
#     input_tokenized = tokenizer(input_prompts, return_tensors="pt", padding=True, truncation=False).to(device)

#     # Generate output for the batch
#     output = model.generate(**input_tokenized, **generate_config)

#     new_texts = []
#     generated_answers = []
#     exact_matches = []
#     for i in range(len(input_prompts)):
#         input_length = len(input_tokenized.input_ids[i])
#         new_tokens = output[i][input_length:]

#         new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
#         new_texts.append(new_text)

#         last_boxed_string = last_boxed_only_string(new_text)
#         if not last_boxed_string:
#             # No boxed string found, so we can't evaluate
#            generated_answer = None
#            exact_match = 0
#         else:
#             generated_answer = normalize_final_answer(remove_boxed(last_boxed_only_string(new_text)))
#             ground_truth_answer = examples['answer'][i]

#             if generated_answer.strip() == ground_truth_answer.strip() or is_equiv(generated_answer, ground_truth_answer):
#                 exact_match = 1
#             else:
#                 exact_match = 0
#         generated_answers.append(generated_answer)
#         exact_matches.append(exact_match)

#     # Add the new texts to the examples
#     examples['generated_text'] = new_texts
#     examples['generated_answer'] = generated_answers
#     examples['exact_match'] = exact_matches
#     return examples

In [82]:
# # Generate new text for the first 10 problems
# small_dataset = math_dataset_test_processed.select(range(10))

# # Apply the function to the small subset
# small_dataset_processed = small_dataset.map(
#     lambda examples: generate_new_text_batch(examples, tokenizer, model, generate_config),
#     batched=True,
#     batch_size=5
# )

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [87]:
# math_dataset_test_processed = math_dataset_test_processed.map(
#     lambda examples: generate_new_text_batch(examples, tokenizer, model, generate_config),
#     batched=True,
#     batch_size=32
# )

# # Save the processed dataset

# # Create a directory to store the results
# output_dir = "outputs"
# os.makedirs(output_dir, exist_ok=True)

# # Save the results to a file
# results_file = os.path.join(output_dir, f"results_4shot.json")
# with open(results_file, "w") as f:
#     json.dump(math_dataset_test_processed, f)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

KeyboardInterrupt: 

In [14]:
# Create a DataLoader
batch_size = 32
test_data_loader = DataLoader(math_dataset_test_processed, batch_size=batch_size, shuffle=False)

# Set the model to evaluation mode
model.eval()

# Initialize lists to store results
all_generated_texts = []
all_generated_answers = []
all_exact_matches = []

for batch in tqdm(test_data_loader, total=len(test_data_loader)):
    # Tokenize the batch
    input_prompts = batch['input_final_prompt']
    input_tokenized = tokenizer(input_prompts, return_tensors="pt", padding=True, truncation=False).to(device)

    # Generate output for the batch
    output = model.generate(**input_tokenized, **generate_config)

    new_texts = []
    generated_answers = []
    exact_matches = []
    for i in range(len(input_prompts)):
        input_length = len(input_tokenized.input_ids[i])
        new_tokens = output[i][input_length:]
        new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        new_texts.append(new_text)


        last_boxed_string = last_boxed_only_string(new_text)
        if not last_boxed_string:
            # No boxed string found, so we can't evaluate
            generated_answers.append(None)
            exact_matches.append(0)
        else:
            generated_answer = normalize_final_answer(remove_boxed(last_boxed_string))
            ground_truth_answer = batch['answer'][i]

            if generated_answer.strip() == ground_truth_answer.strip() or is_equiv(generated_answer, ground_truth_answer):
                exact_matches.append(1)
            else:
                exact_matches.append(0)
            generated_answers.append(generated_answer)

    # Store results
    all_generated_texts.extend(new_texts)
    all_generated_answers.extend(generated_answers)
    all_exact_matches.extend(exact_matches)

# Add the results back to the dataset (a Dataset does not support item assignment, so use add_column)
math_dataset_test_processed = math_dataset_test_processed.add_column('generated_text', all_generated_texts)
math_dataset_test_processed = math_dataset_test_processed.add_column('generated_answer', all_generated_answers)
math_dataset_test_processed = math_dataset_test_processed.add_column('exact_match', all_exact_matches)

# Save the processed dataset

# Create a directory to store the results
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# Save the results to a file (a Dataset is not JSON-serializable, so convert it to a plain dict first)
results_file = os.path.join(output_dir, "results_4shot.json")
with open(results_file, "w") as f:
    json.dump(math_dataset_test_processed.to_dict(), f)

  0%|          | 0/200 [00:00<?, ?it/s]Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
  2%|▏         | 4/200 [22:27<18:20:46, 336.97s/it]


KeyboardInterrupt: 
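At roughly 337 s per batch, the full run was estimated at over 18 hours and was interrupted, which discards everything held only in memory. One possible mitigation, sketched below under the assumption that results can be written one record at a time, is to append each result to a JSON Lines file and skip already-finished prompts on restart. `run_resumable` and `fake_generate` are hypothetical names; `fake_generate` merely stands in for the model call.

```python
import json
import os
import tempfile

def run_resumable(prompts, generate_fn, results_path):
    """Append one JSON line per prompt, skipping prompts already on disk."""
    done = 0
    if os.path.exists(results_path):
        with open(results_path) as f:
            done = sum(1 for _ in f)  # one line per finished prompt

    with open(results_path, "a") as f:
        for idx in range(done, len(prompts)):
            record = {"index": idx, "generated_text": generate_fn(prompts[idx])}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the record to disk before moving on
    return done

# Stand-in for model.generate plus decoding, for illustration only
def fake_generate(prompt):
    return prompt.upper()

demo_path = os.path.join(tempfile.mkdtemp(), "results_4shot.jsonl")
run_resumable(["a", "b"], fake_generate, demo_path)  # first run stops after two prompts
skipped = run_resumable(["a", "b", "c", "d"], fake_generate, demo_path)  # resumed run
print("Prompts skipped on resume:", skipped)
```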

In [2]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## 3. EXTENSION: Fine-tune model on MGSM training data (OLD CODE)

Based on the Hugging Face Llama recipes fine-tuning script: https://github.com/huggingface/huggingface-llama-recipes/blob/main/fine_tune/peft_finetuning.py

### 3.1 Preprocess train dataset

In [13]:
def combine_fields(question_set):
    language = question_set['language']
    combined_text = f"{question_set['question']}\n\n{question_set['answer']}\n\n{question_set['equation_solution']}\n\n{answer_translations[language]} {question_set['answer_number']}"
    question_set['text'] = combined_text
    return question_set

# Apply the function to the dataset
mgsm_train_combined = mgsm_train.map(combine_fields)

In [376]:
print(mgsm_train_combined['text'][5])

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

Step-by-Step Answer: 5 bagels for $3 each should cost 5 * 3 = 15 dollars. Olivia had $23 in the beginning, so now she has 23 - 15 = 8 dollars left. The answer is 8.

5 * 3 = 15. 23 - 15 = 8.

Answer: 8


### 3.2 Fine-tune the model on the training set

In [14]:
# %pip install peft trl bitsandbytes

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import BitsAndBytesConfig, EarlyStoppingCallback

# UNCOMMENT if need to unload memory
# torch.cuda.empty_cache()
# del trainer

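# NOTE: quantization_config is defined here but is not passed to the model or the trainer
# in this notebook, so as written the run trains LoRA adapters on an unquantized model
QLoRA = True
if QLoRA:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4"
    )

    # Low-rank adapters on every linear layer
    lora_config = LoraConfig(
        r=8,
        target_modules="all-linear",
        bias="none",
        task_type="CAUSAL_LM",
    )
else:
    lora_config = None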


sft_config = SFTConfig(
    output_dir="./finetune_results",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    logging_dir='./finetune_logs',
    logging_steps=25,
    # load_best_model_at_end = True,
    # evaluation_strategy = "steps",
    dataset_text_field="text",  # Set the dataset text field here
    max_seq_length=2048  # Set the max sequence length
)

# tokenizer.padding_side = "right"

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=sft_config,
    peft_config=lora_config,
    train_dataset=mgsm_train_combined,
    # callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: cayasryan (cayasryan-university-of-the-philippines-diliman). Use `wandb login --relogin` to force relogin


Step,Training Loss
25,1.3041
50,1.0155
75,0.9129
100,0.8346


TrainOutput(global_step=110, training_loss=1.0022768627513539, metrics={'train_runtime': 144.5373, 'train_samples_per_second': 6.088, 'train_steps_per_second': 0.761, 'total_flos': 2485041893277696.0, 'train_loss': 1.0022768627513539, 'epoch': 10.0})
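For a sense of scale, the number of parameters that LoRA adds can be estimated from the layer shapes printed in Section 1.2. The sketch below assumes that `target_modules="all-linear"` covers the seven projection layers in each decoder block and excludes the output head, and that each adapted layer gains two matrices of shapes `r x in_features` and `out_features x r`; the count actually reported by PEFT may differ.

```python
r = 8            # LoRA rank from lora_config
num_layers = 16  # decoder layers in the printed model

# (in_features, out_features) of each linear layer in one decoder block
linear_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}

# Each adapter adds r * in_features + out_features * r weights
per_layer = sum(r * (in_f + out_f) for in_f, out_f in linear_shapes.values())
total_lora_params = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora_params:,}")
```

Under these assumptions the adapters come to about 5.6 million weights, a small fraction of the base model.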

In [15]:
# Load fine tuned model
# del model # delete base model to free up GPU space
model_dir = "./finetune_results/checkpoint-110"
model_finetuned = LlamaForCausalLM.from_pretrained(model_dir)
model_finetuned.to(device)

# Parameters for batch processing
loader_config = {'batch_size': 25,
                 'num_workers': 4, 
                 'prefetch_factor': 2,
                 'shuffle': False}

# Create a DataLoader for the dataset
test_data_loader = DataLoader(mgsm_test, **loader_config)

# Initialize a dictionary to store the results
results_mgsm_finetuned_2 = {'question': mgsm_test['question'], 
                'language': mgsm_test['language'], 
                'ground_truth': mgsm_test['answer_number'], 
                'generated_answer': []}

# Disable parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Iterate over the DataLoader
for batch_samples in tqdm(test_data_loader, total=len(test_data_loader)):

    # Add input prompt to the batch problems
    batch_problems = [add_input_prompt(question, language) for question, language in zip(batch_samples['question'], batch_samples['language'])]
    
    # Tokenize the input problems
    inputs = tokenizer(batch_problems, return_tensors="pt", padding=True, truncation=True).to(device)

    # Generate outputs from the model for the entire batch
    with torch.no_grad():
        outputs = model_finetuned.generate(**inputs, **generate_config)

    # Get numerical answer from output texts
    output_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    numerical_answers = [maybe_remove_comma(find_number(output_text, answer_translations[language])) for output_text, language in zip(output_texts, batch_samples['language'])]

    # Add the numerical answers to the results dictionary
    results_mgsm_finetuned_2['generated_answer'].extend(numerical_answers)

# Create a directory to store the results
output_dir = "outputs"
results_var_name = "results_mgsm_finetuned_2"
os.makedirs(output_dir, exist_ok=True)

# Save the results to a file
results_file = os.path.join(output_dir, f"{results_var_name}.json")
with open(results_file, "w") as f:
    json.dump(globals()[results_var_name], f)

  0%|          | 0/110 [00:00<?, ?it/s]Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
100%|██████████| 110/110 [2:51:32<00:00, 93.57s/it]  


### 3.3 Evaluate updated score

In [16]:
output_dir = "outputs"
results_var_name = "results_mgsm_finetuned_2"

# Load the results from the json file and convert to pandas DataFrame
results_file = os.path.join(output_dir, f"{results_var_name}.json")
results_df = pd.read_json(results_file)

# Convert ground_truth and generated_answer columns to float
results_df['ground_truth'] = results_df['ground_truth'].astype(float)
# results_df['generated_answer'] = results_df['generated_answer'].astype(float)
results_df['generated_answer'] = pd.to_numeric(results_df['generated_answer'], errors='coerce')
    # coerce will convert the invalid parsing to NaN


# Compare the generated answers with the ground truth answers
results_df['correct'] = results_df['ground_truth'] == results_df['generated_answer']

# Get accuracy per language then get average accuracy
accuracy_per_language = results_df.groupby('language')['correct'].mean()
overall_accuracy = accuracy_per_language.mean()

# Get Meta's accuracy
meta_accuracy = eval_dataset_FROM_META_df.groupby('subtask_name')['is_correct'].mean()
meta_accuracy_overall = meta_accuracy.mean()
print("Meta's Overall Accuracy:", meta_accuracy_overall)
print("Personal Overall Accuracy:", overall_accuracy)


# Check if the language is supported (see https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/README.md?code=true)
lang_supported = ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']
is_supported = [lang in lang_supported for lang in accuracy_per_language.index]

# Compare Meta and personal evaluation results
eval_scores_df = pd.DataFrame({
    'Language': meta_accuracy.index,
    'Meta Score': meta_accuracy.values,
    'Personal Score': accuracy_per_language.values,
    'Language Supported': is_supported
})

print('')
print(eval_scores_df)

Meta's Overall Accuracy: 0.24472727272727274
Personal Overall Accuracy: 0.1941818181818182

   Language  Meta Score  Personal Score  Language Supported
0        bn       0.144           0.156               False
1        de       0.272           0.232                True
2        en       0.420           0.340                True
3        es       0.352           0.168                True
4        fr       0.264           0.248                True
5        ja       0.180           0.160               False
6        ru       0.272           0.244               False
7        sw       0.184           0.120               False
8        te       0.112           0.048               False
9        th       0.228           0.188                True
10       zh       0.264           0.232               False
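The overall figures are macro-averages: each language counts equally, whatever its number of test problems. The sketch below recomputes them from the table in plain Python and also averages over only the languages flagged as supported; `macro_average` is an illustrative helper, and the values are copied from the table above.

```python
# language: (Meta score, personal score, language supported)
scores = {
    "bn": (0.144, 0.156, False),
    "de": (0.272, 0.232, True),
    "en": (0.420, 0.340, True),
    "es": (0.352, 0.168, True),
    "fr": (0.264, 0.248, True),
    "ja": (0.180, 0.160, False),
    "ru": (0.272, 0.244, False),
    "sw": (0.184, 0.120, False),
    "te": (0.112, 0.048, False),
    "th": (0.228, 0.188, True),
    "zh": (0.264, 0.232, False),
}

def macro_average(values):
    """Unweighted mean of per-language accuracies."""
    values = list(values)
    return sum(values) / len(values)

meta_overall = macro_average(m for m, _, _ in scores.values())
personal_overall = macro_average(p for _, p, _ in scores.values())
meta_supported = macro_average(m for m, _, s in scores.values() if s)
personal_supported = macro_average(p for _, p, s in scores.values() if s)

print(f"All languages:       Meta {meta_overall:.4f}, personal {personal_overall:.4f}")
print(f"Supported languages: Meta {meta_supported:.4f}, personal {personal_supported:.4f}")
```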


In [17]:
# Delete the fine-tuned model weights
model_dir = "./finetune_results"
if os.path.exists(model_dir):
    import shutil
    shutil.rmtree(model_dir)
    print(f"Deleted the fine-tuned model weights in {model_dir}")

Deleted the fine-tuned model weights in ./finetune_results
