In this project, we fine-tune the Llama 3 8B Instruct model to improve its reasoning on complex maths problems. We use the MetaMathQA-40K dataset for fine-tuning.

In [None]:
# Unsloth is used for its memory and speed optimizations. The dataset has 40k problems,
# which is very heavy for the free Colab tier, so buying 100 compute units is recommended.
!pip install unsloth

Collecting unsloth
  Downloading unsloth-2025.11.3-py3-none-any.whl.metadata (61 kB)
Collecting unsloth_zoo>=2025.11.4 (from unsloth)
  Downloading unsloth_zoo-2025.11.4-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33.post1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting bitsandbytes!=0.46.0,!=0.48.0,>=0.45.5 (from unsloth)
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting datasets!=4.0.*,!=4.1.0,<4.4.0,>=3.4.1 (from unsloth)
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.23.0,>=0.18.2 (from 

In [None]:
# A Hugging Face account with approved access to the Llama models is required before proceeding
from huggingface_hub import notebook_login
notebook_login()


In [None]:
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
max_seq_length = 2048
load_in_4bit = True # 4-bit QLoRA quantization
dtype = None # None lets Unsloth auto-detect the dtype (bfloat16 on GPUs that support it, otherwise float16)
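
Why 4-bit loading matters can be seen with a rough weight-memory estimate. The sketch below is an approximation only: it counts weights, ignores activations, optimizer state, and quantization constants, and treats every parameter as quantized, which real 4-bit loading typically does not do. The base parameter count is derived from the totals printed in the training banner later in this notebook (8,072,204,288 minus 41,943,040 LoRA parameters); `weight_gib` is a hypothetical helper.

```python
# Base model parameters: banner total minus the LoRA parameters added later.
BASE_PARAMS = 8_072_204_288 - 41_943_040


def weight_gib(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


bf16_gib = weight_gib(BASE_PARAMS, 16)  # full 16-bit weights
nf4_gib = weight_gib(BASE_PARAMS, 4)    # idealized 4-bit weights (lower bound)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB (lower bound)")
```

The 16-bit figure leaves little headroom on a GPU with about 22 GB of memory once gradients and activations are added, which is the motivation for `load_in_4bit = True`.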

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("The model is loaded successfully")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

The model is loaded successfully


Now, we will load the dataset and format it

In [None]:
# Load the MetaMathQA-40K dataset

from datasets import load_dataset
dataset = load_dataset("meta-math/MetaMathQA-40K", split = "train")
print(len(dataset))
print(dataset[0])
print(dataset.column_names)

README.md:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

MetaMathQA-40K.json:   0%|          | 0.00/31.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/40000 [00:00<?, ? examples/s]

40000
{'query': "Reggie's father gave him $48. Reggie bought 5 books, each of which cost x. Reggie has 38 money left. What is the value of unknown variable x?", 'response': "To solve this problem, we need to determine the value of x, which represents the cost of each book that Reggie bought.\nLet's break down the information given:\nAmount of money Reggie's father gave him: $48\nNumber of books Reggie bought: 5\nAmount of money Reggie has left: $38\nWe can set up the equation as follows:\nAmount of money Reggie's father gave him - (Number of books Reggie bought * Cost per book) = Amount of money Reggie has left\n$48 - (5 * x) = $38\nLet's simplify and solve for x:\n$48 - 5x = $38\nTo isolate x, we subtract $48 from both sides of the equation:\n$48 - $48 - 5x = $38 - $48\n-5x = -$10\nTo solve for x, we divide both sides of the equation by -5:\nx = -$10 / -5\nx = $2\nThe value of x is $2.\n#### 2\nThe answer is: 2", 'type': 'GSM_SV'}
['query', 'response', 'type']
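
The sample above shows that each response ends with a line of the form "The answer is: 2". A minimal sketch of pulling that final answer out of a response, which can be handy for spot-checking model outputs. It assumes the marker sits on the last line, as in the printed example; `extract_final_answer` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Matches the marker only when it is on the last line of the text.
ANSWER_RE = re.compile(r"The answer is:\s*(.+?)\s*$")


def extract_final_answer(response: str):
    """Return the text after the final 'The answer is:' marker, or None."""
    match = ANSWER_RE.search(response.strip())
    return match.group(1) if match else None


# Tail of the response printed by dataset[0] above.
sample_tail = "x = $2\nThe value of x is $2.\n#### 2\nThe answer is: 2"
print(extract_final_answer(sample_tail))
```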


In [None]:
math_prompt = """Act like an expert mathematician. Your task is to solve the following math problem.
Provide a step-by-step reasoning process before arriving at the final answer.

### Problem:
{}

### Answer:
{}"""

EOS_TOKEN = tokenizer.eos_token

def format_prompts(x):
    instructions = x["query"]
    responses = x["response"]
    texts = []
    for instruction, response in zip(instructions, responses):
        text = math_prompt.format(instruction, response) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

dataset = dataset.map(format_prompts, batched = True)

print(len(dataset))
print(dataset[0]['text'])

Map:   0%|          | 0/40000 [00:00<?, ? examples/s]

40000
Act like an expert mathematician. Your task is to solve the following math problem.
Provide a step-by-step reasoning process before arriving at the final answer.

### Problem:
Reggie's father gave him $48. Reggie bought 5 books, each of which cost x. Reggie has 38 money left. What is the value of unknown variable x?

### Answer:
To solve this problem, we need to determine the value of x, which represents the cost of each book that Reggie bought.
Let's break down the information given:
Amount of money Reggie's father gave him: $48
Number of books Reggie bought: 5
Amount of money Reggie has left: $38
We can set up the equation as follows:
Amount of money Reggie's father gave him - (Number of books Reggie bought * Cost per book) = Amount of money Reggie has left
$48 - (5 * x) = $38
Let's simplify and solve for x:
$48 - 5x = $38
To isolate x, we subtract $48 from both sides of the equation:
$48 - $48 - 5x = $38 - $48
-5x = -$10
To solve for x, we divide both sides of the equation by 
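
At inference time the same template can be reused with the answer slot left empty, so that the model continues the text after "### Answer:". A minimal sketch, assuming the template is copied unchanged from the cell above; `build_inference_prompt` is a hypothetical helper name.

```python
# Same template as in the formatting cell above.
math_prompt = """Act like an expert mathematician. Your task is to solve the following math problem.
Provide a step-by-step reasoning process before arriving at the final answer.

### Problem:
{}

### Answer:
{}"""


def build_inference_prompt(problem: str) -> str:
    """Fill in the problem and leave the answer empty (no EOS token appended)."""
    return math_prompt.format(problem, "")


prompt = build_inference_prompt("What is 12 * 7?")
print(prompt)
```

Unlike the training text, no EOS token is appended here, because the model is expected to generate the answer and then emit EOS itself.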

Preparing the model

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # LoRA scaling factor is lora_alpha / r, which is 1.0 here
    # lora_dropout = 0 and bias = "none" are the settings Unsloth optimizes for
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.11.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
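
The number of trainable parameters follows directly from the LoRA configuration: each adapted linear layer of shape (d_in, d_out) gains r * (d_in + d_out) parameters. The sketch below reproduces the 41,943,040 figure that the training banner reports later. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), not values read from the loaded model.

```python
R = 16            # LoRA rank used above
NUM_LAYERS = 32   # decoder layers patched by Unsloth

HIDDEN = 4096         # assumed hidden size of Llama 3 8B
INTERMEDIATE = 14336  # assumed MLP intermediate size
KV_DIM = 8 * 128      # assumed 8 key/value heads x head dimension 128

# (d_in, d_out) for every target module in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each module adds A (r x d_in) and B (d_out x r).
per_layer = sum(R * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Total trainable parameters: {trainable:,}")
```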


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        warmup_steps = 100,
        learning_rate = 2e-4,
        lr_scheduler_type = "cosine",

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        optim = "adamw_8bit",

        weight_decay = 0.01,
        max_grad_norm = 1.0,
        logging_steps = 50,
        seed = 3407,
        output_dir = "math_llama3_8b_final",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/40000 [00:00<?, ? examples/s]
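
The trainer combines `warmup_steps = 100` with `lr_scheduler_type = "cosine"`. A minimal sketch of the resulting learning-rate curve, assuming the usual shape of linear warmup followed by a half-cosine decay to zero, and the 5,000 total steps reported by the training banner. `lr_at` is a hypothetical helper written for illustration, not a library function.

```python
import math


def lr_at(step: int, base_lr: float = 2e-4,
          warmup_steps: int = 100, total_steps: int = 5000) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 50, 100, 2550, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```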

In [None]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 40,000 | Num Epochs = 1 | Total steps = 5,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
50,0.9455
100,0.5447
150,0.5143
200,0.5029
250,0.5004
300,0.4868
350,0.4815
400,0.4784
450,0.4775
500,0.4812



TrainOutput(global_step=5000, training_loss=0.4114391178131104, metrics={'train_runtime': 19240.4881, 'train_samples_per_second': 2.079, 'train_steps_per_second': 0.26, 'total_flos': 5.683639051317412e+17, 'train_loss': 0.4114391178131104, 'epoch': 1.0})
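
The figures in the training summary can be cross-checked with a little arithmetic. The sketch below uses only values printed above (40,000 examples, batch size 2, 4 gradient accumulation steps, one GPU, and the reported `train_runtime`); the variable names are illustrative.

```python
num_examples = 40_000
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
train_runtime_s = 19240.4881  # from TrainOutput metrics

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
total_steps = num_examples // effective_batch

samples_per_second = num_examples / train_runtime_s
steps_per_second = total_steps / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"effective batch size: {effective_batch}")
print(f"total steps:          {total_steps}")
print(f"samples/s:            {samples_per_second:.3f}")
print(f"steps/s:              {steps_per_second:.2f}")
print(f"runtime:              {runtime_hours:.2f} h")
```

One epoch therefore took roughly 5.3 hours on the L4 GPU.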


Save the LoRA adapters and tokenizer, then download them locally

In [None]:
model.save_pretrained("math_llama3_8b_adapters")
tokenizer.save_pretrained("math_llama3_8b_adapters")

('math_llama3_8b_adapters/tokenizer_config.json',
 'math_llama3_8b_adapters/special_tokens_map.json',
 'math_llama3_8b_adapters/chat_template.jinja',
 'math_llama3_8b_adapters/tokenizer.json')

In [None]:
!zip -r math_llama3_8b_adapters.zip math_llama3_8b_adapters

  adding: math_llama3_8b_adapters/ (stored 0%)
  adding: math_llama3_8b_adapters/README.md (deflated 65%)
  adding: math_llama3_8b_adapters/tokenizer.json (deflated 85%)
  adding: math_llama3_8b_adapters/adapter_model.safetensors (deflated 7%)
  adding: math_llama3_8b_adapters/tokenizer_config.json (deflated 96%)
  adding: math_llama3_8b_adapters/special_tokens_map.json (deflated 70%)
  adding: math_llama3_8b_adapters/adapter_config.json (deflated 57%)
  adding: math_llama3_8b_adapters/chat_template.jinja (deflated 52%)


In [None]:
from google.colab import files
files.download('math_llama3_8b_adapters.zip')

To reload the adapters in a new session, first upload the zip file to Colab, then unzip it

In [None]:
!unzip -o math_llama3_8b_adapters.zip -d math_llama3_8b_adapters_unzipped

Archive:  math_llama3_8b_adapters.zip
   creating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/README.md  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/tokenizer.json  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/adapter_model.safetensors  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/tokenizer_config.json  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/special_tokens_map.json  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/adapter_config.json  
  inflating: math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters/chat_template.jinja  


In [None]:
!mv math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters ./math_llama3_8b_adapters_ready

Now, let us try running benchmarks


In [None]:
!pip install lm-eval==0.4.2
!pip install antlr4-python3-runtime==4.11

# List the benchmarks available in lm-eval
!lm-eval --tasks list

Collecting lm-eval==0.4.2
  Downloading lm_eval-0.4.2-py3-none-any.whl.metadata (30 kB)
Collecting evaluate (from lm-eval==0.4.2)
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting jsonlines (from lm-eval==0.4.2)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pybind11>=2.6.2 (from lm-eval==0.4.2)
  Downloading pybind11-3.0.1-py3-none-any.whl.metadata (10.0 kB)
Collecting pytablewriter (from lm-eval==0.4.2)
  Downloading pytablewriter-1.2.1-py3-none-any.whl.metadata (38 kB)
Collecting rouge-score>=0.0.4 (from lm-eval==0.4.2)
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu>=1.5.0 (from lm-eval==0.4.2)
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting sqlitedict (from lm-eval==0.4.2)
  Downloading sqlitedict-2.1.0.tar.gz (

First, we run the benchmarks on the base Llama model

In [None]:
# To run the benchmarks you need a read-only access token from Hugging Face: create the token first, then log in again with it
!lm_eval --model hf \
    --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./base_model_results.json

print("benchmarks completed")

2025-11-16 12:55:37.937763: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-16 12:55:37.959214: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763297737.986568   25021 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763297737.997160   25021 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1763297738.023194   25021 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [None]:
'''
!lm_eval --model hf \
    --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct",peft="/content/math_llama3_8b_adapters_ready",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./tuned_model_results.json

print("benchmarks completed")
'''

# Running the benchmarks on the fine-tuned model this way raises PEFT-related errors, most probably because of an incompatibility between Unsloth and lm-eval

2025-11-16 18:02:54.279683: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-16 18:02:54.298443: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763316174.321418   12261 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763316174.328916   12261 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1763316174.348220   12261 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Instead, we merge the adapters into the base model, which adds the LoRA weights directly to the base model's weights. The merged checkpoint is much larger than the adapter files alone, but it loads as an ordinary model, which bypasses the PEFT error.

In [None]:
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = PeftModel.from_pretrained(base_model, "/content/math_llama3_8b_adapters_ready")

merged_model = model.merge_and_unload()

merged_model.save_pretrained("finetuned_math_model")
tokenizer.save_pretrained("finetuned_math_model")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]



('finetuned_math_model/tokenizer_config.json',
 'finetuned_math_model/special_tokens_map.json',
 'finetuned_math_model/chat_template.jinja',
 'finetuned_math_model/tokenizer.json')
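
What merging does to a single weight matrix can be shown on a toy example. For a layer with weight W and LoRA matrices A and B, the merged weight is W + (lora_alpha / r) * B A, and it produces the same output as running the base layer and the adapter side by side. The sketch below uses plain Python lists with made-up numbers and a toy rank of 2; it illustrates the formula only and is not how PEFT implements `merge_and_unload`. With a 4-bit base model, the real merge also involves quantization, so merged outputs may differ slightly from the adapter outputs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: d_out = 2, d_in = 3, rank r = 2 (illustrative values only).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
lora_a = [[0.5, -0.5, 1.0], [1.0, 0.0, 0.0]]  # r x d_in
lora_b = [[2.0, 1.0], [-1.0, 0.5]]            # d_out x r
x = [[1.0], [2.0], [3.0]]                     # input column vector
alpha, r = 2, 2                               # same alpha / r ratio as the notebook

# Adapter path: base output plus scaled low-rank update.
base_out = matmul(w, x)
lora_out = matmul(lora_b, matmul(lora_a, x))
adapter_path = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiplication.
merged_path = matmul(merge_lora(w, lora_a, lora_b, alpha, r), x)

print(adapter_path, merged_path)
```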

In [None]:
!lm_eval --model hf \
    --model_args pretrained="/content/finetuned_math_model",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./tuned_model_results.json

print("benchmarks completed")

2025-11-16 18:36:50.140602: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763318210.163532   21432 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763318210.171094   21432 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1763318210.190468   21432 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1763318210.190500   21432 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1763318210.190504   21432 computation_placer.cc:177] computation placer alr

Here are the benchmark results before and after fine-tuning (each task limited to 100 examples with --limit 100)

| Benchmark               | Base Model | Tuned Model |
|-------------------------|------------|-------------|
| minerva_math_prealgebra | 0.39       | 0.49        |
| minerva_math_geometry   | 0.11       | 0.16        |
| minerva_math_algebra    | 0.32       | 0.35        |
| gsm8k                   | 0.73       | 0.71        |
| asdiv                   | 0.06       | 0.01        |


The fine-tuned model scores higher than the base model on minerva_math_prealgebra (+0.10), minerva_math_geometry (+0.05), and minerva_math_algebra (+0.03), which suggests that fine-tuning improved performance on algebra and geometry problems.

However, performance dropped on the gsm8k and asdiv benchmarks. On gsm8k the tuned model is slightly worse (0.71 vs. 0.73, a difference of two problems out of 100). One possible explanation is that gsm8k consists mostly of simpler maths problems: fine-tuning may have made the model more of a specialist in higher-level maths at a small cost to general maths.

The asdiv benchmark was run in 0-shot mode, so the drop there is very likely due to the model not giving its answer in the format the benchmark expects.
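
Because each task was limited to 100 examples, it helps to compare each difference with its sampling noise. The sketch below is a rough check under stated assumptions: it treats each score as a proportion of correct answers out of 100, uses the binomial standard error, and treats the two runs as independent samples even though they share the same problems. The scores are copied from the table above; the helper names are hypothetical.

```python
import math

LIMIT = 100  # examples per task (--limit 100)

# (base score, tuned score) from the results table.
scores = {
    "minerva_math_prealgebra": (0.39, 0.49),
    "minerva_math_geometry": (0.11, 0.16),
    "minerva_math_algebra": (0.32, 0.35),
    "gsm8k": (0.73, 0.71),
    "asdiv": (0.06, 0.01),
}


def std_err(p: float, n: int) -> float:
    """Binomial standard error of a proportion p measured on n examples."""
    return math.sqrt(p * (1.0 - p) / n)


def z_score(base: float, tuned: float, n: int) -> float:
    """Difference between the two scores in units of its standard error."""
    se_diff = math.sqrt(std_err(base, n) ** 2 + std_err(tuned, n) ** 2)
    return (tuned - base) / se_diff


z_scores = {task: z_score(b, t, LIMIT) for task, (b, t) in scores.items()}

for task, (base, tuned) in scores.items():
    print(f"{task:<26} delta = {tuned - base:+.2f}   z = {z_scores[task]:+.2f}")
```

At this sample size a single score near 0.5 has a standard error of about 0.05, so differences of a few points fall within sampling noise. Rerunning the benchmarks without the limit would give a firmer comparison.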