In this project, we fine-tune the Llama 3 8B model (the Instruct variant) to improve its reasoning on complex math problems. We use the MetaMathQA-40K dataset for fine-tuning.

In [None]:
# We use Unsloth because of its training optimizations.
# The dataset has 40k problems, which is very heavy for the free Colab tier,
# so it is better to buy 100 Colab compute units before running this notebook.
!pip install unsloth

In [None]:
# You need a Hugging Face account and approved access to the Llama models before proceeding
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

In [None]:
max_seq_length = 2048
load_in_4bit = True # 4-bit QLoRA quantization
dtype = None # None lets Unsloth auto-detect the best dtype for the GPU (bfloat16 where supported, otherwise float16)

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("The model is loaded successfully")

Now, we will load the dataset and format it

In [None]:
# Load the MetaMathQA-40K dataset

from datasets import load_dataset
dataset = load_dataset("meta-math/MetaMathQA-40K", split = "train")
print(len(dataset))
print(dataset[0])
print(dataset.column_names)

In [None]:
math_prompt = """Act like an expert mathematician. Your task is to solve the following math problem.
Provide a step-by-step reasoning process before arriving at the final answer.

### Problem:
{}

### Answer:
{}"""

EOS_TOKEN = tokenizer.eos_token

def format_prompts(x):
    instructions = x["query"]
    responses = x["response"]
    texts = []
    for instruction, response in zip(instructions, responses):
        text = math_prompt.format(instruction, response) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

dataset = dataset.map(format_prompts, batched = True)

print(len(dataset))
print(dataset[0]['text'])
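
The formatting step above can be checked on a toy batch without loading the model or the dataset. The sketch below repeats the notebook's template and mapping function; the `"<eos>"` string and the toy question are placeholders (the real value comes from `tokenizer.eos_token`), and `build_inference_prompt` is a hypothetical helper, not part of the notebook, that shows one way to reuse the template at inference time by leaving the answer slot empty.

```python
MATH_PROMPT = """Act like an expert mathematician. Your task is to solve the following math problem.
Provide a step-by-step reasoning process before arriving at the final answer.

### Problem:
{}

### Answer:
{}"""

EOS_TOKEN = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_prompts(batch):
    """Turn a batch of (query, response) columns into training texts."""
    texts = [
        MATH_PROMPT.format(query, response) + EOS_TOKEN
        for query, response in zip(batch["query"], batch["response"])
    ]
    return {"text": texts}


def build_inference_prompt(question):
    """Hypothetical helper: same template, empty answer slot, no EOS."""
    return MATH_PROMPT.format(question, "")


toy_batch = {
    "query": ["What is 2 + 3?"],
    "response": ["2 + 3 = 5. The answer is: 5"],
}
formatted = format_prompts(toy_batch)
print(formatted["text"][0])
print("-----")
print(build_inference_prompt("What is 7 * 6?"))
```

Appending the end-of-sequence token only to training texts is what lets the model learn where an answer stops; the inference prompt deliberately omits it so generation can continue after "### Answer:".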

Preparing the model

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # LoRA scaling factor: the adapter update is scaled by lora_alpha / r (here 16 / 16 = 1)
    # lora_dropout = 0 and bias = "none" are the settings Unsloth is optimized for
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
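
To get a feel for how small the trainable part is with this configuration, the arithmetic can be sketched by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` LoRA parameters. The layer sizes below are the published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers); they are written in as constants here and are not read from the loaded model.

```python
# Llama 3 8B dimensions, hard-coded as assumptions rather than read from the model
HIDDEN = 4096         # hidden size
INTERMEDIATE = 14336  # MLP intermediate size
KV_DIM = 1024         # 8 key/value heads * 128 head dim
NUM_LAYERS = 32

# (input size, output size) of every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r):
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted module."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


r, lora_alpha = 16, 16
trainable = lora_trainable_params(r)
scale = lora_alpha / r  # factor applied to the adapter update (use_rslora = False)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"adapter scale: {scale}")
```

Under these assumptions the adapters hold about 42 million parameters, roughly half a percent of the 8B base model, which is why saving the adapters alone produces a small archive.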

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        warmup_steps = 100,
        learning_rate = 2e-4,
        lr_scheduler_type = "cosine",

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        optim = "adamw_8bit",

        weight_decay = 0.01,
        max_grad_norm = 1.0,
        logging_steps = 50,
        seed = 3407,
        output_dir = "math_llama3_8b_final",
        report_to = "none",
    ),
)
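
The training arguments above imply an effective batch size of 2 * 4 = 8 sequences per optimizer step. The sketch below works out the resulting number of optimizer steps and the learning rate over time, assuming a single GPU and exactly 40,000 training examples (the notebook prints the real count with `len(dataset)`). The `lr_at` function is a hand-written approximation of linear warmup followed by cosine decay, not the scheduler object that `transformers` builds internally.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Optimizer steps: batches per epoch divided by the accumulation factor."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


def lr_at(step, total_steps, warmup_steps=100, peak_lr=2e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = training_steps(num_examples=40_000, per_device_batch=2, grad_accum=4)
print("optimizer steps:", total)
for step in (0, 50, 100, total // 2, total):
    print(f"step {step:5d}: lr = {lr_at(step, total):.2e}")
```

With these numbers the 100 warmup steps cover the first 2% of training, and `logging_steps = 50` yields one log line every 400 training examples.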

In [None]:
trainer.train()

To download the fine-tuned LoRA adapters locally

In [None]:
model.save_pretrained("math_llama3_8b_adapters")
tokenizer.save_pretrained("math_llama3_8b_adapters")

In [None]:
!zip -r math_llama3_8b_adapters.zip math_llama3_8b_adapters

In [None]:
from google.colab import files
files.download('math_llama3_8b_adapters.zip')

To reuse the adapters in a new session, first upload the zip file to Colab, then unzip it

In [None]:
!unzip -o math_llama3_8b_adapters.zip -d math_llama3_8b_adapters_unzipped

In [None]:
!mv math_llama3_8b_adapters_unzipped/math_llama3_8b_adapters ./math_llama3_8b_adapters_ready
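
The `mv` step is needed because the archive stores the files under their original folder name, so unzipping into a new directory nests that folder one level down. A minimal sketch of the same round trip with Python's standard library is shown here on a temporary directory; the `adapter_config.json` file is an empty placeholder, and `round_trip` is a hypothetical helper used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def round_trip(adapter_dir_name="math_llama3_8b_adapters"):
    """Zip a folder, unzip it elsewhere, then move the nested folder out."""
    work = Path(tempfile.mkdtemp())
    adapters = work / adapter_dir_name
    adapters.mkdir()
    (adapters / "adapter_config.json").write_text("{}")  # placeholder file

    # Equivalent of: zip -r math_llama3_8b_adapters.zip math_llama3_8b_adapters
    archive = shutil.make_archive(
        str(work / adapter_dir_name), "zip",
        root_dir=str(work), base_dir=adapter_dir_name,
    )

    # Equivalent of: unzip -o ... -d math_llama3_8b_adapters_unzipped
    unzipped = work / (adapter_dir_name + "_unzipped")
    shutil.unpack_archive(archive, str(unzipped))
    nested = unzipped / adapter_dir_name  # the folder name is preserved inside

    # Equivalent of: mv <unzipped>/<name> ./<name>_ready
    ready = work / (adapter_dir_name + "_ready")
    shutil.move(str(nested), str(ready))
    return ready


ready_dir = round_trip()
print(ready_dir.name, sorted(p.name for p in ready_dir.iterdir()))
```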

Now, let us try running benchmarks


In [None]:
!pip install lm-eval==0.4.2
!pip install antlr4-python3-runtime==4.11

# List the benchmarks (tasks) available in lm-eval
!lm-eval --tasks list

First, we run the benchmarks on the base Llama model

In [None]:
# To run the benchmarks you need a read-only access token from Hugging Face; create one and log in again with it
!lm_eval --model hf \
    --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./base_model_results.json

print("benchmarks completed")

In [None]:
'''
!lm_eval --model hf \
    --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct",peft="/content/math_llama3_8b_adapters_ready",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./tuned_model_results.json

print("benchmarks completed")
'''

# Running the benchmarks on the fine-tuned model this way raises PEFT-related errors, most probably because of an incompatibility between Unsloth and lm-eval

Instead, we merge the fine-tuned adapters into the base model. Merging adds the LoRA weights to the base weights. The saved model is much larger than the adapters alone, but this lets us bypass the PEFT error.
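
Merging works because a LoRA adapter is an additive low-rank update: for a base weight `W` and adapter matrices `A` (r x d_in) and `B` (d_out x r), the merged weight is `W + (lora_alpha / r) * B @ A`. The toy example below uses plain Python lists and a rank of 2 to show that the merged weight gives the same output as running the base layer and the adapter side by side. The matrices are made-up values, and the helper names are illustrative, not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # base weight, d_out x d_in
A = [[0.5, -1.0], [0.0, 1.0]]  # LoRA A, r x d_in
B = [[2.0, 0.0], [1.0, 1.0]]   # LoRA B, d_out x r
x = [[1.0], [2.0]]             # input as a column vector

merged = merge_lora(W, A, B, lora_alpha=2, r=2)  # scale 1, like 16 / 16 in the notebook

y_merged = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + a[0]] for b, a in zip(base_out, adapter_out)]

print("merged weight:", merged)
print("same output:", y_merged == y_adapter)
```

In the notebook the base weights are loaded in 4-bit, so the real merge is subject to quantization rounding that this floating-point example does not model.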

In [None]:
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

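# Attach the saved LoRA adapters to the base model
model = PeftModel.from_pretrained(base_model, "/content/math_llama3_8b_adapters_ready")

# Fold the adapter weights into the base weights and drop the PEFT wrappers
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer so lm-eval can load them as a plain HF model
merged_model.save_pretrained("finetuned_math_model")
tokenizer.save_pretrained("finetuned_math_model")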

In [None]:
!lm_eval --model hf \
    --model_args pretrained="/content/finetuned_math_model",load_in_4bit=True,trust_remote_code=True \
    --tasks gsm8k,minerva_math_algebra,minerva_math_geometry,minerva_math_prealgebra,asdiv \
    --batch_size 1 \
    --limit 100 \
    --output_path ./tuned_model_results.json

print("benchmarks completed")

Here are the results of fine-tuning (scores on 100 examples per benchmark, because of --limit 100)

| Benchmark               | Base Model | Tuned Model |
|-------------------------|------------|-------------|
| minerva_math_prealgebra | 0.39       | 0.49        |
| minerva_math_geometry   | 0.11       | 0.16        |
| minerva_math_algebra    | 0.32       | 0.35        |
| gsm8k                   | 0.73       | 0.71        |
| asdiv                   | 0.06       | 0.01        |


The fine-tuned model scores higher than the base model on minerva_math_prealgebra (0.39 to 0.49), minerva_math_geometry (0.11 to 0.16) and minerva_math_algebra (0.32 to 0.35), which suggests that fine-tuning improved performance on algebra and geometry tasks. Because each score comes from only 100 problems, these gains are indicative rather than conclusive.

However, the scores dropped on the gsm8k and asdiv benchmarks. On gsm8k the drop is small (0.73 to 0.71). One possible explanation is that gsm8k consists mostly of simple math word problems: fine-tuning may have made the model more of a specialist in higher-level math at a slight cost to general math. A 2-point difference on 100 problems is also within sampling noise.

The asdiv benchmark was run in 0-shot mode, so the drop there (0.06 to 0.01) is very likely due to the model not giving its answer in the required format.
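
As a rough check on how much weight these differences can carry, the sketch below computes a two-proportion z-score for each benchmark from the table values, with n = 100 taken from the `--limit 100` flag. This is an unpaired approximation written for illustration: both models were scored on the same problems, so a paired test on the per-example outputs would be more appropriate if those are available.

```python
import math

N = 100  # examples per benchmark, from --limit 100

# (base score, tuned score) copied from the results table
RESULTS = {
    "minerva_math_prealgebra": (0.39, 0.49),
    "minerva_math_geometry": (0.11, 0.16),
    "minerva_math_algebra": (0.32, 0.35),
    "gsm8k": (0.73, 0.71),
    "asdiv": (0.06, 0.01),
}


def z_score(base, tuned, n=N):
    """Difference in proportions divided by its (unpaired) standard error."""
    se = math.sqrt(base * (1 - base) / n + tuned * (1 - tuned) / n)
    return (tuned - base) / se


z_scores = {name: z_score(base, tuned) for name, (base, tuned) in RESULTS.items()}
for name, z in z_scores.items():
    base, tuned = RESULTS[name]
    print(f"{name:26s} diff = {tuned - base:+.2f}   z = {z:+.2f}")
```

Under this approximation none of the five differences reaches the usual 1.96 threshold, so rerunning the benchmarks without `--limit`, or with a larger limit, would be the natural next step before drawing firm conclusions.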