# Llama 3.2 1B Evaluation
Ryan Roi Cayas\
2022-22085

In this notebook, we attempt to replicate Meta's published scores for Llama 3.2 1B Instruct on the MGSM dataset.

## 1. Prerequisites


### 1.1 Load libraries and set-up CUDA

In [1]:
# Import the necessary libraries
import os
import json
import re
import random
import pandas as pd
from tqdm import tqdm

import torch
from torch.utils.data import DataLoader
from transformers import LlamaForCausalLM, PreTrainedTokenizerFast
from datasets import load_dataset, concatenate_datasets # Huggingface datasets (https://huggingface.co/docs/datasets/)
from huggingface_hub import login

In [2]:
# Set-up CUDA device
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" 
# use a specific GPU
os.environ["CUDA_VISIBLE_DEVICES"]="4"

# Use GPU for inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print the device being used
print(f"Using device: {device}")

# Check the GPU name
if device.type == 'cuda':
    gpu_name = torch.cuda.get_device_name(0)  # 0 because CUDA_VISIBLE_DEVICES=4 means GPU 4 is now 0
    print("Using GPU:", gpu_name)

Using device: cuda
Using GPU: NVIDIA A100-SXM4-40GB


### 1.2 Load the pre-trained tokenizer and model

In [3]:
# Paths to model and tokenizer
model_dir = "../../../../../llm/llama/Llama-3.2-1B-Instruct"

# Load tokenizer and model
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_dir, padding_side="left")
model = LlamaForCausalLM.from_pretrained(model_dir)

# Set the eos_token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Move the model to the GPU
model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

### 1.3 Load the MGSM Dataset

In [4]:
# Languages: English, Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu
languages = ["en","es","fr","de","ru","zh","ja","th","sw","bn","te"]

# Empty list to store datasets with added 'language' feature
train_datasets = []
test_datasets = []

# Load the datasets and add the 'language' feature
for lang in tqdm(languages, desc="Loading datasets"):
    # Load train and test datasets for the language
    train = load_dataset("juletxara/mgsm", lang, split="train")
    test = load_dataset("juletxara/mgsm", lang, split="test")
    
    # Add the 'language' feature to both train and test sets
    train = train.add_column("language", [lang] * len(train))
    test = test.add_column("language", [lang] * len(test))
    
    # Append the datasets with the 'language' feature to the lists
    train_datasets.append(train)
    test_datasets.append(test)

# Concatenate datasets from all languages into a single train and test dataset
mgsm_train = concatenate_datasets(train_datasets)

mgsm_test = concatenate_datasets(test_datasets)
mgsm_test = mgsm_test.remove_columns(["answer", "equation_solution"])

# Verify the structure
print(mgsm_train)
print(mgsm_test)


Loading datasets: 100%|██████████| 11/11 [00:39<00:00,  3.60s/it]

Dataset({
    features: ['question', 'answer', 'answer_number', 'equation_solution', 'language'],
    num_rows: 88
})
Dataset({
    features: ['question', 'answer_number', 'language'],
    num_rows: 2750
})





In [26]:
mgsm_train.to_pandas().head()

Unnamed: 0,question,answer,answer_number,equation_solution,language
0,Question: Roger has 5 tennis balls. He buys 2 ...,Step-by-Step Answer: Roger started with 5 ball...,11,5 + 6 = 11.,en
1,Question: There were nine computers in the ser...,Step-by-Step Answer: There are 4 days from mon...,29,4 * 5 = 20. 9 + 20 = 29.,en
2,Question: Leah had 32 chocolates and her siste...,Step-by-Step Answer: Leah had 32 chocolates an...,39,32 + 42 = 74. 74 - 35 = 39.,en
3,"Question: Shawn has five toys. For Christmas, ...",Step-by-Step Answer: He has 5 toys. He got 2 f...,9,5 + 2 = 7. 7 + 2 = 9.,en
4,Question: Michael had 58 golf balls. On tuesda...,Step-by-Step Answer: Michael started with 58 g...,33,58 - 23 = 35. 35 - 2 = 33.,en


In [23]:
mgsm_test.to_pandas().head()

Unnamed: 0,question,answer,answer_number,equation_solution,language
0,Janet’s ducks lay 16 eggs per day. She eats th...,,18,,en
1,A robe takes 2 bolts of blue fiber and half th...,,3,,en
2,Josh decides to try flipping a house. He buys...,,70000,,en
3,James decides to run 3 sprints 3 times a week....,,540,,en
4,"Every day, Wendi feeds each of her chickens th...",,20,,en


In [27]:
sample_idx = 0

print("Sample Problem:")
print("Question:", mgsm_train[sample_idx]['question'])
print("Answer:", mgsm_train[sample_idx]['answer'])
print("Answer Number:", mgsm_train[sample_idx]['answer_number'])
print("Equation_Solution:", mgsm_train[sample_idx]['equation_solution'])

Sample Problem:
Question: Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer: Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Answer Number: 11
Equation_Solution: 5 + 6 = 11.


### 1.4 Load evaluation data from Meta

The evaluation data is publicly released by Meta and is available here: https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals

The dataset contains the following:
- evaluation config
- user prompts added in every language
- parsed answers

In [5]:
# Load token from config.json
with open("config.json") as f:
    config = json.load(f)

hf_token = config["hf_token"]
login(hf_token)
eval_dataset_FROM_META = load_dataset("meta-llama/Llama-3.2-1B-Instruct-evals", "Llama-3.2-1B-Instruct-evals__mgsm__details")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /data/students/ryan/.cache/huggingface/token
Login successful


In [6]:
eval_dataset_FROM_META_df = eval_dataset_FROM_META['latest'].to_pandas()
eval_dataset_FROM_META_df.head()

Unnamed: 0,task_type,task_name,subtask_name,input_question,input_choice_list,input_final_prompts,input_correct_responses,output_prediction_text,output_parsed_answer,output_choice_completions,output_choice_negative_log_likelihoods,output_metrics,is_correct,input_question_hash,input_final_prompts_hash,benchmark_label,eval_config
0,Generative,mgsm_chat,te,ఆండ్రూ న్యూజెర్సీ నుంచి రోచెస్టర్‌కు ఒక రోడ్డు...,,[<|start_header_id|>user<|end_header_id|>\n\nఈ...,"[9, 9., 9.0, 9.0., 9.00, 9.00., 9, 9., 9.0, 9....",[రోచెస్టర్ నుంచి బస్సు ద్వారా ప్రయాణించడానికి ...,4.5,,,"{'em': 0.0, 'em_maj1@1': 0.0, 'f1': 0.0, 'f1_m...",False,cd94ca91efc39a41e4c4c4bc0c05402edf876bb9e3b947...,[c08c78b2914e6099ee463c3e4f593bf5017c36c1ed754...,MGSM,"{'max_gen_len': '2048', 'max_prompt_len': '614..."
1,Generative,mgsm_chat,bn,কার্লের প্রিয় খাবার হল চিজ্‌। তিনি এই সপ্তাহে ...,,[<|start_header_id|>user<|end_header_id|>\n\nএ...,"[31, 31., 31.0, 31.0., 31.00, 31.00., 31, 31.,...",[সপ্তাহে তিনি 2*7=<<2*7=14>>14টি স্যান্ডউইচ খে...,25.0,,,"{'em': 0.0, 'em_maj1@1': 0.0, 'f1': 0.0, 'f1_m...",False,6704115ad46deece36245043307151bd6fd83e02930329...,[1aec05a3e1d4a893e3b0d35624fb89159fc1c32561ec8...,MGSM,"{'max_gen_len': '2048', 'max_prompt_len': '614..."
2,Generative,mgsm_chat,bn,জ্যানেটের কাছে 22টি সবুজ কলম ও 10টি হলুদ কলম আ...,,[<|start_header_id|>user<|end_header_id|>\n\nএ...,"[98, 98., 98.0, 98.0., 98.00, 98.00., 98, 98.,...",[জ্যানেটের কাছে সবুজ কলমের মোট সংখ্যা হল 22টি ...,360.0,,,"{'em': 0.0, 'em_maj1@1': 0.0, 'f1': 0.0, 'f1_m...",False,c034cb29e28be89863d8c81696515eb46614beb50893ae...,[b5dc4fc9106d844acf8f278cb90244e5f226aa9043a2a...,MGSM,"{'max_gen_len': '2048', 'max_prompt_len': '614..."
3,Generative,mgsm_chat,bn,ব্রিনলি মিঃ বার্টের অঙ্কের ক্লাসে আছে। প্রত্যে...,,[<|start_header_id|>user<|end_header_id|>\n\nএ...,"[98, 98., 98.0, 98.0., 98.00, 98.00., 98, 98.,...",[প্রথম পাঁচটি পরীক্ষার স্কোর যোগ করলে আমরা পাই...,353.0,,,"{'em': 0.0, 'em_maj1@1': 0.0, 'f1': 0.0, 'f1_m...",False,0c7da335161864651f040a8c291e0e78071a221800dcd4...,[42d35a0e59b2dce67fa4f109447b0752cd28c92d5e570...,MGSM,"{'max_gen_len': '2048', 'max_prompt_len': '614..."
4,Generative,mgsm_chat,bn,মাইকেল বাইক চালাতে ভালোবাসেন। তিনি এটি এক সপ্ত...,,[<|start_header_id|>user<|end_header_id|>\n\nএ...,"[860, 860., 860.0, 860.0., 860.00, 860.00., 86...","[মাইকেল 5 বার বাইক চালান, তাই তিনি 5*25=<<5*25...",860.0,,,"{'em': 1.0, 'em_maj1@1': 1.0, 'f1': 1.0, 'f1_m...",True,f11366127c3e5d1b93ddb37334135fca18618c47e5b47e...,[0540c4c88eef5b44aaa61d9152e58f02465331eca313c...,MGSM,"{'max_gen_len': '2048', 'max_prompt_len': '614..."


In [188]:
# check META's eval_config
eval_dataset_FROM_META_df['eval_config'][0] 

{'max_gen_len': '2048',
 'max_prompt_len': '6144',
 'num_few_shot': '0',
 'num_generations': '1',
 'prompt_fn': "functools.partial(<function jinja_dialog_format at 0x7f0d7e4c0d30>, template={'prompt': '{{ native_prompt }}\\n\\n{{ input }}', 'answer': '{{ target }}'}, append_gen_prefix=False)",
 'return_logprobs': 'false',
 'seed': '42',
 'temperature': '0.0',
 'top_k': '0',
 'top_p': '0'}

In [7]:
eval_config = {
    'max_gen_len': 2048,
    'max_prompt_len': 6144,
    'num_few_shot': 0,
    'num_generations': 1,
    # 'prompt_fn': functools.partial(jinja_dialog_format, template={'prompt': '{{ native_prompt }}\\n\\n{{ input }}', 'answer': '{{ target }}'}, append_gen_prefix=False),
    'return_logprobs': False,
    'seed': 42,
    'temperature': 0.0,
    'top_k': 0,
    'top_p': 0.0
}
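Meta's config stores every value as a string, and the typed dictionary above was transcribed by hand. As a minimal sketch, the same conversion could be automated with a small helper. The name `parse_eval_config` is hypothetical and not part of any library, and the input here is a shortened copy of the config printed earlier.

```python
def parse_eval_config(raw: dict) -> dict:
    """Convert string-valued config entries to bool, int, or float where possible."""
    parsed = {}
    for key, value in raw.items():
        if value in ("true", "false"):
            parsed[key] = value == "true"
            continue
        try:
            parsed[key] = int(value)
        except ValueError:
            try:
                parsed[key] = float(value)
            except ValueError:
                # Leave non-numeric strings (e.g. prompt_fn) unchanged
                parsed[key] = value
    return parsed


# Shortened copy of the string-valued config shown above
raw_config = {
    "max_gen_len": "2048",
    "num_few_shot": "0",
    "return_logprobs": "false",
    "seed": "42",
    "temperature": "0.0",
    "top_p": "0",
}

typed_config = parse_eval_config(raw_config)
print(typed_config)
```

Note that `"0"` parses to the integer `0` rather than `0.0`; the two compare equal, so this does not affect how the values are used later.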

In [8]:
# set seed for reproducibility
torch.manual_seed(eval_config['seed'])
random.seed(eval_config['seed'])

In [44]:
eval_dataset_FROM_META_df["subtask_name"].value_counts()

subtask_name
te    250
bn    250
de    250
th    250
en    250
zh    250
fr    250
ja    250
ru    250
es    250
sw    250
Name: count, dtype: int64

### 1.5 Identify user prompts per language

We add the user prompts used by Meta.

In [93]:
# Check input_final_prompts for subtask_name == 'en'
is_english = eval_dataset_FROM_META_df["subtask_name"] == 'en'  # avoid shadowing the built-in filter()
input_prompt = eval_dataset_FROM_META_df[is_english]['input_final_prompts'].iloc[0]
input_prompt = str(input_prompt).replace("\\n", "\n")  # render escaped newlines as real line breaks

print(input_prompt) 

['<|start_header_id|>user<|end_header_id|>

Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".

Tracy used a piece of wire 4 feet long to support tomato plants in the garden. The wire was cut into pieces 6 inches long. How many pieces did she obtain?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

']
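The printed prompt follows the layout user header, instruction, blank line, question, which matches the `{{ native_prompt }}\n\n{{ input }}` template in Meta's config. Under that assumption, the per-language instruction could be recovered programmatically instead of being copied by hand. The helper name `extract_instruction` is hypothetical, and the prompt string below is a hand-built stand-in for one entry of `input_final_prompts`.

```python
USER_HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"


def extract_instruction(final_prompt: str, question: str) -> str:
    """Return the instruction text that sits between the user header and the question."""
    body = final_prompt
    if body.startswith(USER_HEADER):
        body = body[len(USER_HEADER):]
    end = body.find(question)
    if end == -1:
        raise ValueError("question not found in prompt")
    return body[:end].strip()


question = (
    "Tracy used a piece of wire 4 feet long to support tomato plants in the garden. "
    "The wire was cut into pieces 6 inches long. How many pieces did she obtain?"
)
final_prompt = (
    USER_HEADER
    + 'Solve this math problem. Give the reasoning steps before giving the final answer '
    + 'on the last line by itself in the format of "Answer:". Do not add anything other '
    + 'than the integer answer after "Answer:".'
    + "\n\n" + question
    + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

instruction = extract_instruction(final_prompt, question)
print(instruction)
```

Applying this to one row per `subtask_name` would yield one instruction per language, which avoids transcription errors in non-Latin scripts.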


In [9]:
user_header = "<|start_header_id|>user<|end_header_id|>"
assistant_header = "<|start_header_id|>assistant<|end_header_id|>"

user_prompts = {
    "en": 'Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".',
    "te": 'ఈ గణిత సమస్యను పరిష్కరించండి. చివరి సమాధానాన్ని ఇవ్వదానికి ముందు తర్కాత్మక అదుగులను ఇవ్వండి. చివరి పంక్తిలో మాత్రమే "సమాధానం:" అనే ఆకారంలో చివరి సమాధానాద్ని ఇవ్వండి సమాధానం: తర్వాత పూర్ణాంక సమాధానానికి తప్పించి ఎదేనా చేర్చవద్దు.',
    "bn": 'এই গণিতের সমস্যাটি সমাধান করুন। চূড়ান্ত উত্তর দেওয়ার আগে যুক্তিসম্পন্ন পদক্ষেপ প্রদান করুন। চূড়ান্ত উত্তরটি একক সংখ্যা হিসাবে "উত্তর:" এর পরে শেষ লাইনে দিন। "উত্তর:" এর পরে অন্য কিছু যুক্ত করবেন না।.',
    "de": 'Löse dieses Mathematikproblem. Gib die Schritte zur Begründung an, bevor du die endgültige Antwort in der letzten Zeile alleine im Format "Antwort:" gibst. Füge nichts anderes als die ganzzahlige Antwort nach "Antwort:" hinzu.',
    "th": 'แก้ปัญหาคณิตศาสตร์นี้ ให้ให้ขั้นตอนการใช้เหตุผลก่อนที่จะให้คำตอบสุดท้ายในบรรทัดสุดท้ายโดยอยู่ในรูปแบบ "คำตอบ:" ไม่ควรเพิ่มอะไรนอกจากคำตอบที่เป็นจำนวนเต็มหลังจาก "คำตอบ:"',
    "zh": '解决这个数学问题。在最后一行给出答案前，请提供推理步骤。最后一行应该以 "答案: " 的形式独立给出答案。在 "答案：" 后不要添加除整数答案之外的任何内容。',
    "fr": 'Résolvez ce problème de mathématiques. Donnez les étapes de raisonnement avant de fournir la réponse finale sur la dernière ligne elle-même dans le format de "Réponse:". N\'ajoutez rien d\'autre que la réponse entière après "Réponse:".',
    "ja": 'の数学の問題を解いてください。最終的な答えを出す前に、解答の推論過程を記述してください。そして最後の行には "答え:" の形式で答えを記述し、その後には整数の答え以外何も追加しないでください。',
    "ru": 'Решите эту математическую задачу. Объясните шаги рассуждения перед тем, как дать окончательный ответ в последней строке сам по себе в формате "Ответ:". Не добавляйте ничего, кроме целочисленного ответа после "Ответ:".',
    "es": 'Resuelve este problema matemático. Proporciona los pasos de razonamiento antes de dar la respuesta final en la última línea por sí misma en el formato de "Respuesta:". No añadas nada más que la respuesta entera después de "Respuesta:".',
    "sw": 'Suluhisha tatizo hili la hesabu. Toa hatua za mantiki kabla ya kutoa jibu la mwisho kwenye mstari wa mwisho peke yake katika muundo wa "Jibu:". Usiongeze chochote kingine isipokuwa jibu la integer baada ya "Jibu:".'    
}

answer_translations = {
    "en": "Answer:",
    "te": "సమాధానం:",
    "bn": "উত্তর:",
    "de": "Antwort:",
    "th": "คำตอบ:",
    "zh": "答案: ",
    "fr": "Réponse:",
    "ja": "答え:",
    "ru": "Ответ:",
    "es": "Respuesta:",
    "sw": "Jibu:"
}

# sample prompt for an English question
sample_idx = 0
question = mgsm_test[sample_idx]['question']
language = mgsm_test[sample_idx]['language']
sample_prompt = user_header + "\n\n" + user_prompts[language] + "\n\n" + question + "<|eot_id|>" + assistant_header + "\n\n"
print(sample_prompt)

<|start_header_id|>user<|end_header_id|>

Solve this math problem. Give the reasoning steps before giving the final answer on the last line by itself in the format of "Answer:". Do not add anything other than the integer answer after "Answer:".

Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [10]:
# defined in the cell above: user_header, assistant_header, user_prompts
def add_input_prompt(question: str, language: str) -> str:
    """Wrap a question in the Llama 3 chat format used in Meta's evaluation prompts.

    question: the MGSM problem text
    language: the MGSM language code, used as a key into user_prompts
    """

    return user_header + "\n\n" + user_prompts[language] + "\n\n" + question + "<|eot_id|>" + assistant_header + "\n\n"

## 2. Model Inference

### 2.1 Create parser for the model's final answer

In [11]:
# Adapted from Gemma's GSM8k Evaluation Code: https://github.com/google-deepmind/gemma/blob/main/colabs/gsm8k_eval.ipynb

def find_numbers(x: str) -> list[str]:
  """Finds all numbers in a string."""
  # Search for a number, possibly negative (hyphen), with thousands separators
  # (comma), and with a decimal point (period between digits).
  numbers = re.compile(
      r'-?[\d,]*\.?\d+',
      re.MULTILINE | re.DOTALL | re.IGNORECASE,
  ).findall(x)
  return numbers


def find_number(x: str,
                answer_delimiter: str = 'Answer:') -> str:
  """Finds the most relevant number in a string."""
  # If model uses the answer delimiter, then select the first number following
  # that format.
  if answer_delimiter in x:
    answer = x.split(answer_delimiter)[-1]
    numbers = find_numbers(answer)
    # print("Found delimiter!")
    # print(numbers)

    if numbers:
      return numbers[0]

  # In general, select the last number in the string.
  numbers = find_numbers(x)
  # print(numbers)

  if numbers:
    return numbers[-1]
  return ''


def maybe_remove_comma(x: str) -> str:
  # Example: 5,600 -> 5600
  return x.replace(',', '')
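To make the parser's behavior concrete, the sketch below runs a compact copy of `find_number` and `maybe_remove_comma` on a few hand-written strings. The regex flags are omitted in this copy because the pattern contains no `.` wildcard, anchors, or letters, so they do not change what it matches. The example strings are illustrative and are not model outputs.

```python
import re

NUMBER_PATTERN = re.compile(r'-?[\d,]*\.?\d+')


def find_number(text: str, answer_delimiter: str = 'Answer:') -> str:
    """Return the first number after the last delimiter, else the last number in the text."""
    if answer_delimiter in text:
        numbers = NUMBER_PATTERN.findall(text.split(answer_delimiter)[-1])
        if numbers:
            return numbers[0]
    numbers = NUMBER_PATTERN.findall(text)
    return numbers[-1] if numbers else ''


def maybe_remove_comma(text: str) -> str:
    return text.replace(',', '')


# Delimiter present: the first number after it is used, and the comma is stripped
with_delimiter = maybe_remove_comma(find_number("$155,000 - $80,000 = $75,000\n\nAnswer: $75,000"))

# Delimiter missing: fall back to the last number in the text
without_delimiter = maybe_remove_comma(find_number("The total is 18 dollars."))

# Language-specific delimiter with a negative decimal
german = maybe_remove_comma(find_number("Antwort: -3.5", "Antwort:"))

# No digits at all: an empty string, later counted as a missing answer
no_number = maybe_remove_comma(find_number("No digits here"))

print(with_delimiter, without_delimiter, german, repr(no_number))
```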
     

### 2.2 Generate model outputs

In [12]:
generate_config = {
    'max_new_tokens': eval_config['max_gen_len'],
    'num_return_sequences': eval_config['num_generations'],
    'output_scores': eval_config['return_logprobs'],
    'pad_token_id': tokenizer.eos_token_id,
    'do_sample': False,  # since temperature, top_k, top_p are 0
    'top_p': None,
    'temperature': None,
    'top_k': None
}

In [219]:
# Generate a sample inference
sample_idx = 2

input_prompt = add_input_prompt(mgsm_test[sample_idx]['question'], mgsm_test[sample_idx]['language'])
input_tokenized = tokenizer(input_prompt, return_tensors="pt", padding=True, truncation=False).to(device)

# Set the model to evaluation mode
model.eval()

output = model.generate(**input_tokenized, 
                        **generate_config
                        )
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Keep only the text after the "assistant" header (i.e. drop the prompt)
assistant_idx = output_text.find("assistant\n\n")
output_text = output_text[assistant_idx + len("assistant\n\n"):]

print(output_text)

To find the profit, first calculate the increased value of the house after repairs:

$80,000 + ($50,000 x 1.5) = $80,000 + $75,000 = $155,000

Then, subtract the original purchase price from the increased value to find the profit:

$155,000 - $80,000 = $75,000

Answer: $75,000


In [220]:
answer_delimiter = answer_translations[mgsm_test[sample_idx]['language']]
output_numerical_answer = maybe_remove_comma(find_number(output_text, answer_delimiter))

print("Numerical Answer:", output_numerical_answer)

Numerical Answer: 75000


In [310]:
# Set the model to evaluation mode
model.eval()

# Parameters for batch processing
loader_config = {'batch_size': 25,
                 'num_workers': 4, 
                 'prefetch_factor': 2,
                 'shuffle': False}

# Create a DataLoader for the dataset
test_data_loader = DataLoader(mgsm_test, **loader_config)

# Initialize a dictionary to store the results
results_mgsm = {'question': mgsm_test['question'], 
                'language': mgsm_test['language'], 
                'ground_truth': mgsm_test['answer_number'], 
                'generated_answer': []}

# Iterate over the DataLoader
for batch_samples in tqdm(test_data_loader, total=len(test_data_loader)):

    # Add input prompt to the batch problems
    batch_problems = [add_input_prompt(question, language) for question, language in zip(batch_samples['question'], batch_samples['language'])]
    
    # Tokenize the input problems
    inputs = tokenizer(batch_problems, return_tensors="pt", padding=True, truncation=True).to(device)

    # Generate outputs from the model for the entire batch
    with torch.no_grad():
        outputs = model.generate(**inputs, **generate_config)

    # Get numerical answer from output texts
    output_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    numerical_answers = [maybe_remove_comma(find_number(output_text, answer_translations[language])) for output_text, language in zip(output_texts, batch_samples['language'])]

    # Add the numerical answers to the results dictionary
    results_mgsm['generated_answer'].extend(numerical_answers)

# Create a directory to store the results
output_dir = "outputs"
results_var_name = "results_mgsm"
os.makedirs(output_dir, exist_ok=True)

# Save the results to a file
results_file = os.path.join(output_dir, f"{results_var_name}.json")
with open(results_file, "w") as f:
    json.dump(globals()[results_var_name], f)

100%|██████████| 110/110 [2:28:59<00:00, 81.27s/it] 
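One detail of the loop above is worth checking. For a decoder-only model, `model.generate` returns the prompt tokens followed by the new tokens, so each decoded string still contains the instruction, and the instruction itself mentions the answer delimiter. When the model's reply omits the delimiter, the parser therefore searches the text after the instruction, which begins with the question. The sketch below illustrates this with a hand-written string (not a real model output) and applies the same slicing used in the single-sample cell. The helper name `strip_prompt` is hypothetical. Whether this affects the scores reported here has not been verified.

```python
import re

NUMBER_PATTERN = re.compile(r'-?[\d,]*\.?\d+')


def find_number(text: str, answer_delimiter: str = 'Answer:') -> str:
    """Compact copy of the parser defined in section 2.1."""
    if answer_delimiter in text:
        numbers = NUMBER_PATTERN.findall(text.split(answer_delimiter)[-1])
        if numbers:
            return numbers[0]
    numbers = NUMBER_PATTERN.findall(text)
    return numbers[-1] if numbers else ''


def strip_prompt(decoded: str, marker: str = "assistant\n\n") -> str:
    """Keep only the text after the assistant header, if the header is present."""
    idx = decoded.find(marker)
    return decoded[idx + len(marker):] if idx != -1 else decoded


# Hand-written stand-in for a decoded output with special tokens removed
decoded = (
    'user\n\nSolve this math problem. Give the reasoning steps before giving the final '
    'answer on the last line by itself in the format of "Answer:". Do not add anything '
    'other than the integer answer after "Answer:".\n\n'
    "Janet's ducks lay 16 eggs per day. She eats 3 and bakes with 4. "
    "She sells the rest for $2 each. How much does she make daily?"
    "assistant\n\n"
    "She sells 9 eggs at $2 each, so she makes 18 dollars."
)

parsed_with_prompt = find_number(decoded)                   # picks a number from the question
parsed_without_prompt = find_number(strip_prompt(decoded))  # picks a number from the reply
print(parsed_with_prompt, parsed_without_prompt)
```

An alternative under the same assumption is to slice at the token level before decoding, for example `outputs[:, inputs["input_ids"].shape[1]:]`, which works with left padding because every prompt in the batch then has the same length.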


### 2.3 Evaluate score

We calculate the average exact match (maj@1) scores over all eleven languages in the MGSM dataset. \
Reference: https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/eval_details.md
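As a minimal sketch of the metric, the snippet below computes exact-match accuracy per language and then the unweighted mean over languages. It uses plain Python and toy records rather than the real results, and the function name `macro_average_accuracy` is hypothetical. A missing prediction is represented as NaN, which never equals the target and so counts as incorrect.

```python
from collections import defaultdict


def macro_average_accuracy(records):
    """records: iterable of (language, predicted_number, target_number) tuples."""
    per_language = defaultdict(list)
    for language, prediction, target in records:
        per_language[language].append(prediction == target)
    accuracy = {lang: sum(flags) / len(flags) for lang, flags in per_language.items()}
    overall = sum(accuracy.values()) / len(accuracy)
    return accuracy, overall


# Toy records for illustration only
toy_records = [
    ("en", 18.0, 18.0),
    ("en", 3.0, 4.0),
    ("sw", 5.0, 5.0),
    ("sw", float("nan"), 7.0),  # unparseable answer
    ("sw", 10.0, 12.0),
    ("sw", 1.0, 2.0),
]

accuracy, overall = macro_average_accuracy(toy_records)
print(accuracy, overall)
```

Because every MGSM language has the same number of test questions (250), the mean over languages equals the accuracy over all questions pooled together.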

In [314]:
# Check null values in generated answers
null_values = results_mgsm['generated_answer'].count('')
print(f"Number of null values in generated answers: {null_values}")

Number of null values in generated answers: 2


In [330]:
output_dir = "outputs"
results_var_name = "results_mgsm"

# Load the results from the json file and convert to pandas DataFrame
results_file = os.path.join(output_dir, f"{results_var_name}.json")
results_df = pd.read_json(results_file)

# Convert ground_truth and generated_answer columns to float
results_df['ground_truth'] = results_df['ground_truth'].astype(float)
# results_df['generated_answer'] = results_df['generated_answer'].astype(float)
results_df['generated_answer'] = pd.to_numeric(results_df['generated_answer'], errors='coerce')
    # coerce will convert the invalid parsing to NaN


# Compare the generated answers with the ground truth answers
results_df['correct'] = results_df['ground_truth'] == results_df['generated_answer']

# Get accuracy per language then get average accuracy
accuracy_per_language = results_df.groupby('language')['correct'].mean()
overall_accuracy = accuracy_per_language.mean()

# Get Meta's accuracy
meta_accuracy = eval_dataset_FROM_META_df.groupby('subtask_name')['is_correct'].mean()
meta_accuracy_overall = meta_accuracy.mean()
print("Meta's Overall Accuracy:", meta_accuracy_overall)
print("Personal Overall Accuracy:", overall_accuracy)


# Check if the language is supported (see https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/README.md?code=true)
lang_supported = ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']
is_supported = [lang in lang_supported for lang in accuracy_per_language.index]

# Compare Meta and personal evaluation results
eval_scores_df = pd.DataFrame({
    'Language': meta_accuracy.index,
    'Meta Score': meta_accuracy.values,
    'Personal Score': accuracy_per_language.values,
    'Language Supported': is_supported
})

print('')
print(eval_scores_df)

Meta's Overall Accuracy: 0.24472727272727274
Personal Overall Accuracy: 0.18763636363636363


   Language  Meta Score  Personal Score  Language Supported
0        bn       0.144           0.144               False
1        de       0.272           0.240                True
2        en       0.420           0.408                True
3        es       0.352           0.248                True
4        fr       0.264           0.212                True
5        ja       0.180           0.148               False
6        ru       0.272           0.244               False
7        sw       0.184           0.020               False
8        te       0.112           0.016               False
9        th       0.228           0.204                True
10       zh       0.264           0.180               False
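To see where the difference comes from, the per-language gaps can be ranked. The sketch below copies the scores from the table above into plain dictionaries and converts each gap into a number of questions, using the 250 test questions per language.

```python
# Scores copied from the table above
meta = {"bn": 0.144, "de": 0.272, "en": 0.420, "es": 0.352, "fr": 0.264, "ja": 0.180,
        "ru": 0.272, "sw": 0.184, "te": 0.112, "th": 0.228, "zh": 0.264}
personal = {"bn": 0.144, "de": 0.240, "en": 0.408, "es": 0.248, "fr": 0.212, "ja": 0.148,
            "ru": 0.244, "sw": 0.020, "te": 0.016, "th": 0.204, "zh": 0.180}

QUESTIONS_PER_LANGUAGE = 250

# Gap in accuracy and in number of questions, per language
gaps = {lang: round(meta[lang] - personal[lang], 3) for lang in meta}
gap_in_questions = {lang: round(gap * QUESTIONS_PER_LANGUAGE) for lang, gap in gaps.items()}

# Languages with the largest gaps first
largest_gaps = sorted(gaps, key=gaps.get, reverse=True)[:3]

for lang in largest_gaps:
    print(lang, gaps[lang], gap_in_questions[lang])
```

The gap is concentrated in a few languages (Swahili, Spanish, and Telugu), while Bengali matches Meta's score exactly.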


## 3. EXTENSION: Fine-tune model on MGSM training data

Reference: https://github.com/huggingface/huggingface-llama-recipes/blob/main/fine_tune/peft_finetuning.py

### 3.1 Preprocess train dataset

In [13]:
def combine_fields(question_set):
    language = question_set['language']
    combined_text = f"{question_set['question']}\n\n{question_set['answer']}\n\n{question_set['equation_solution']}\n\n{answer_translations[language]} {question_set['answer_number']}"
    question_set['text'] = combined_text
    return question_set

# Apply the function to the dataset
mgsm_train_combined = mgsm_train.map(combine_fields)

In [376]:
print(mgsm_train_combined['text'][5])

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

Step-by-Step Answer: 5 bagels for $3 each should cost 5 * 3 = 15 dollars. Olivia had $23 in the beginning, so now she has 23 - 15 = 8 dollars left. The answer is 8.

5 * 3 = 15. 23 - 15 = 8.

Answer: 8


### 3.2 Fine-tune the model on the training set

In [14]:
# %pip install peft trl bitsandbytes

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import BitsAndBytesConfig, EarlyStoppingCallback

# UNCOMMENT if need to unload memory
# torch.cuda.empty_cache()
# del trainer

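QLoRA = True
if QLoRA:
    # NOTE: quantization_config is defined here but is not passed to any call in this
    # notebook, so the model loaded in section 1.2 is not quantized; only LoRA is applied.
    quantization_config = BitsAndBytesConfig(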
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4"
    )
    
    lora_config = LoraConfig(
        r=8,
        target_modules="all-linear",
        bias="none",
        task_type="CAUSAL_LM",
    )
else:
    lora_config = None


sft_config = SFTConfig(
    output_dir="./finetune_results",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    logging_dir='./finetune_logs',
    logging_steps=25,
    # load_best_model_at_end = True,
    # evaluation_strategy = "steps",
    dataset_text_field="text",  # Set the dataset text field here
    max_seq_length=2048  # Set the max sequence length
)

# tokenizer.padding_side = "right"

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=sft_config,
    peft_config=lora_config,
    train_dataset=mgsm_train_combined,
    # callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: cayasryan (cayasryan-university-of-the-philippines-diliman). Use `wandb login --relogin` to force relogin


Step,Training Loss
25,1.3041
50,1.0155
75,0.9129
100,0.8346


TrainOutput(global_step=110, training_loss=1.0022768627513539, metrics={'train_runtime': 144.5373, 'train_samples_per_second': 6.088, 'train_steps_per_second': 0.761, 'total_flos': 2485041893277696.0, 'train_loss': 1.0022768627513539, 'epoch': 10.0})
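The step count and the size of the LoRA adapters can be checked with simple arithmetic. The sketch below assumes a single GPU, no gradient accumulation, and that the last partial batch is kept. For the adapter size it assumes that `target_modules="all-linear"` covers the seven linear layers in each decoder block and excludes the output layer, and that each adapted layer adds `r * (in_features + out_features)` parameters. The layer shapes are taken from the architecture printed in section 1.2.

```python
import math

# Training steps: 88 training samples, batch size 8, 10 epochs
num_samples, batch_size, num_epochs = 88, 8, 10
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

# (in_features, out_features) of each linear layer in one decoder block
linear_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}
rank, num_layers = 8, 16

# Each adapted layer adds two low-rank matrices: (in x r) and (r x out)
lora_params_per_layer = sum(rank * (i + o) for i, o in linear_shapes.values())
lora_params_total = lora_params_per_layer * num_layers

print(total_steps, lora_params_per_layer, lora_params_total)
```

The computed step count agrees with `global_step=110` in the training output, and the adapters amount to roughly 5.6 million trainable parameters under these assumptions.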

In [15]:
# Load fine tuned model
# del model # delete base model to free up GPU space
model_dir = "./finetune_results/checkpoint-110"
model_finetuned = LlamaForCausalLM.from_pretrained(model_dir)
model_finetuned.to(device)

# Parameters for batch processing
loader_config = {'batch_size': 25,
                 'num_workers': 4, 
                 'prefetch_factor': 2,
                 'shuffle': False}

# Create a DataLoader for the dataset
test_data_loader = DataLoader(mgsm_test, **loader_config)

# Initialize a dictionary to store the results
results_mgsm_finetuned_2 = {'question': mgsm_test['question'], 
                'language': mgsm_test['language'], 
                'ground_truth': mgsm_test['answer_number'], 
                'generated_answer': []}

# Disable parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Iterate over the DataLoader
for batch_samples in tqdm(test_data_loader, total=len(test_data_loader)):

    # Add input prompt to the batch problems
    batch_problems = [add_input_prompt(question, language) for question, language in zip(batch_samples['question'], batch_samples['language'])]
    
    # Tokenize the input problems
    inputs = tokenizer(batch_problems, return_tensors="pt", padding=True, truncation=True).to(device)

    # Generate outputs from the model for the entire batch
    with torch.no_grad():
        outputs = model_finetuned.generate(**inputs, **generate_config)

    # Get numerical answer from output texts
    output_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    numerical_answers = [maybe_remove_comma(find_number(output_text, answer_translations[language])) for output_text, language in zip(output_texts, batch_samples['language'])]

    # Add the numerical answers to the results dictionary
    results_mgsm_finetuned_2['generated_answer'].extend(numerical_answers)

# Create a directory to store the results
output_dir = "outputs"
results_var_name = "results_mgsm_finetuned_2"
os.makedirs(output_dir, exist_ok=True)

# Save the results to a file
results_file = os.path.join(output_dir, f"{results_var_name}.json")
with open(results_file, "w") as f:
    json.dump(globals()[results_var_name], f)

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
100%|██████████| 110/110 [2:51:32<00:00, 93.57s/it]  


### 3.3 Evaluate updated score

In [16]:
output_dir = "outputs"
results_var_name = "results_mgsm_finetuned_2"

# Load the results from the json file and convert to pandas DataFrame
results_file = os.path.join(output_dir, f"{results_var_name}.json")
results_df = pd.read_json(results_file)

# Convert ground_truth and generated_answer columns to float
results_df['ground_truth'] = results_df['ground_truth'].astype(float)
# results_df['generated_answer'] = results_df['generated_answer'].astype(float)
results_df['generated_answer'] = pd.to_numeric(results_df['generated_answer'], errors='coerce')
    # coerce will convert the invalid parsing to NaN


# Compare the generated answers with the ground truth answers
results_df['correct'] = results_df['ground_truth'] == results_df['generated_answer']

# Get accuracy per language then get average accuracy
accuracy_per_language = results_df.groupby('language')['correct'].mean()
overall_accuracy = accuracy_per_language.mean()

# Get Meta's accuracy
meta_accuracy = eval_dataset_FROM_META_df.groupby('subtask_name')['is_correct'].mean()
meta_accuracy_overall = meta_accuracy.mean()
print("Meta's Overall Accuracy:", meta_accuracy_overall)
print("Personal Overall Accuracy:", overall_accuracy)


# Check if the language is supported (see https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/README.md?code=true)
lang_supported = ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']
is_supported = [lang in lang_supported for lang in accuracy_per_language.index]

# Compare Meta and personal evaluation results
eval_scores_df = pd.DataFrame({
    'Language': meta_accuracy.index,
    'Meta Score': meta_accuracy.values,
    'Personal Score': accuracy_per_language.values,
    'Language Supported': is_supported
})

print('')
print(eval_scores_df)

Meta's Overall Accuracy: 0.24472727272727274
Personal Overall Accuracy: 0.1941818181818182

   Language  Meta Score  Personal Score  Language Supported
0        bn       0.144           0.156               False
1        de       0.272           0.232                True
2        en       0.420           0.340                True
3        es       0.352           0.168                True
4        fr       0.264           0.248                True
5        ja       0.180           0.160               False
6        ru       0.272           0.244               False
7        sw       0.184           0.120               False
8        te       0.112           0.048               False
9        th       0.228           0.188                True
10       zh       0.264           0.232               False
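The effect of fine-tuning is easier to read as a per-language change relative to the base-model run. The sketch below copies the "Personal Score" columns from the two result tables (section 2.3 and this section) and groups languages by the direction of the change.

```python
# "Personal Score" before fine-tuning (section 2.3) and after (this section)
base = {"bn": 0.144, "de": 0.240, "en": 0.408, "es": 0.248, "fr": 0.212, "ja": 0.148,
        "ru": 0.244, "sw": 0.020, "te": 0.016, "th": 0.204, "zh": 0.180}
finetuned = {"bn": 0.156, "de": 0.232, "en": 0.340, "es": 0.168, "fr": 0.248, "ja": 0.160,
             "ru": 0.244, "sw": 0.120, "te": 0.048, "th": 0.188, "zh": 0.232}

# Change in accuracy per language
change = {lang: round(finetuned[lang] - base[lang], 3) for lang in base}

improved = sorted(lang for lang, delta in change.items() if delta > 0)
declined = sorted(lang for lang, delta in change.items() if delta < 0)
unchanged = sorted(lang for lang, delta in change.items() if delta == 0)

print("Improved:", improved)
print("Declined:", declined)
print("Unchanged:", unchanged)
print("Largest gain:", max(change, key=change.get), max(change.values()))
print("Largest drop:", min(change, key=change.get), min(change.values()))
```

The small overall gain hides opposite movements: Swahili improves the most, while English and Spanish decline, so the average alone does not show which languages benefited.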


In [17]:
# Delete the fine-tuned model weights
model_dir = "./finetune_results"
if os.path.exists(model_dir):
    import shutil
    shutil.rmtree(model_dir)
    print(f"Deleted the fine-tuned model weights in {model_dir}")

Deleted the fine-tuned model weights in ./finetune_results
