# Experiment: Fine-Tuning of Language Models - Math Equations Simplified

Fine-tuning specializes a pre-trained large language model (LLM) for a specific task or type of data by continuing training on a smaller, task-specific dataset. The goal is to adjust the model's parameters so that its outputs better match the requirements and nuances of the target task.

## Introduction to LLaMA Model

LLaMA (Large Language Model Meta AI) is a family of transformer-based language models designed for strong performance across a wide range of language understanding and generation tasks. Developed by Meta AI, LLaMA models are released in several sizes, making them suitable for both research and practical applications. Fine-tuning a LLaMA model lets users build on its pre-trained capabilities and tailor them to specialized tasks or datasets.


In [None]:
# install dependencies

# we use the latest version of transformers, peft, and accelerate
!pip install -q accelerate peft transformers

# install bitsandbytes for quantization
!pip install -q bitsandbytes

# install trl for the SFT library
!pip install -q trl

# we need sentencepiece for the llama2 slow tokenizer
!pip install sentencepiece

# we need einops, used by falcon-7b, llama-2 etc
# einops (einsteinops) is used to simplify tensorops by making them readable
!pip install -q -U einops

# we need to install datasets for our training dataset
!pip install -q datasets

!pip install tensorboardx

!pip install pydetex



## Settings
The following cell configures the settings used to fine-tune the model.

In [1]:
# The model that you want to train from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-chat-hf"
#model_name = "RohitSahoo/llama-2-7b-chat-hf-math-ft-V2"

# Fine-tuned model name
new_model = "llama-2-7b-chat-hf-math-ft-V2"

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 10

In [2]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
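import time
# Note: do not `import logging` here. The standard-library module would shadow
# transformers.logging, which provides the set_verbosity() call used in later cells.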



# Understanding Quantization Configuration

Quantization is a technique used to reduce the precision of the numbers that represent a model's parameters, from floating-point representations to lower-bit representations. This is often done to reduce the model's memory footprint and speed up computation, making deployment more efficient on various platforms, especially those with limited resources like mobile devices or edge computing platforms.

## Key Components of Quantization

- **Bit Precision**: This refers to the number of bits used to represent each weight in the model. For example, 4-bit quantization reduces the model weights to only 4 bits per weight.
- **Quantization Type**: Different types of quantization strategies can be applied, such as uniform quantization where the range between the minimum and maximum value is split evenly, or non-uniform quantization which may focus on preserving more detail in more important parts of the range.
- **Compute Dtype**: The data type used for computations during model inference. Using `torch.float16` can be a balance between computational speed and maintaining sufficient precision.
- **Double Quantization**: The quantization constants themselves (the per-block scaling factors) are quantized a second time, which saves a further fraction of a bit per parameter without a significant loss in accuracy.

In our configuration, we use 4-bit quantization with the 4-bit NormalFloat type (`nf4`), a data type introduced with QLoRA whose quantization levels are chosen for normally distributed weights, which limits the information lost during quantization. Computations are performed in 16-bit floating point (`torch.float16`) to balance speed and accuracy. Double quantization is disabled to keep the setup simple, at the cost of a slightly larger memory footprint.

This approach helps in deploying models more efficiently, reducing the computational cost, and allowing the model to be used in resource-constrained environments without a large trade-off in performance.
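To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It uses a simple uniform absmax scheme, which is an assumption made for readability and is not the NF4 scheme that bitsandbytes implements. The helper names and the round figure of 7 billion parameters are also illustrative.

```python
from typing import List, Tuple


def quantize_block_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one block of weights to signed integer codes in [-7, 7]."""
    # The largest magnitude in the block defines the scale (absmax quantization).
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    scale = absmax / 7.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


block = [0.7, -0.3, 0.12, 0.0]  # made-up weights for illustration
codes, scale = quantize_block_4bit(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Assumes roughly 7 billion parameters; ignores scaling factors and activations.
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)

print(codes, round(max_error, 4))
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The rounding error per weight is bounded by half the block's scale, which is why outliers in a block hurt precision for every other weight in that block. The memory estimate covers the weights only; the per-block scales add a small overhead on top.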


In [3]:
# Import bitsandbytes (the 4-bit quantization backend) and the Transformers classes used below
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Define the quantization configuration for loading the model in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable loading of the model in 4-bit precision
    bnb_4bit_quant_type="nf4",             # Set the quantization type to 'nf4' (4-bit NormalFloat)
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 as the data type for computations
    bnb_4bit_use_double_quant=False,       # Disable double quantization
)

# Load the pre-trained language model with specified quantization settings and map it to GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"":0}                      # Place the entire model on GPU 0
)

# Disable caching to manage GPU memory usage more effectively
model.config.use_cache = False

# Load the tokenizer associated with the model and configure it for processing
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    trust_remote_code=True,                # Trust the remote code in the tokenizer's configuration
    use_fast=False                         # Do not use the fast tokenizer implementation
)
tokenizer.pad_token = tokenizer.eos_token  # Set the padding token to be the end-of-sequence token
tokenizer.padding_side = "right"           # Pad sequences on the right side (default for most models)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
prompt = '''After the causal masking token rearrangement, each timestep of the rearranged matrix Y is vector of K tokens. Copet et al. (2023) observed 
that when performing autoregressive generation over stacked RVQ tokens, it is advantageous to apply a delay pattern so that the prediction of
codebook k at time t can be conditioned on the prediction of codebook k − 1 from the same timestep. We take a similar approach which we
describe here. Assume a span Ys is of shape Ls × K. Applying the delay pattern rearranges
it into Zs = (Zs,0, Zs,1, · · · , Zs,Ls+K−1), where Zs,t, t ∈ [Ls + K − 1] is defined as2 : Zs,t = (Ys,t,1, Ys,t+1,2, · · · , Ys,t−K+1,K) (1) 
where Ys,t−k+1,k denotes the token located at coordinate (t − k + 1, k) in matrix Ys, i.e. the kth codebook entry at the (t − k + 1)th timestep. explain the equation and indiviual terms?'''

In [5]:
# Start timing the execution
start = time.time()

# Set the logging level to CRITICAL to reduce console clutter
logging.set_verbosity(logging.CRITICAL)

# Initialize the text-generation pipeline with the specified model and tokenizer
pipe = pipeline(
    task="text-generation",  # Specify the task as text-generation
    model=model,             # Provide the pre-loaded model (configured with quantization)
    tokenizer=tokenizer,     # Provide the tokenizer associated with the model
    max_length=1000          # Set the maximum length of the generated text to 1000 tokens
)

# Execute the pipeline with the prompt wrapped in Llama-2 instruction tags
result = pipe(f"[INST]{prompt}[/INST]")

# Print the generated text from the result
print(result[0]['generated_text'])

# End timing and calculate the duration of the execution
end = time.time()
print(end - start)  # Print the total time taken to execute the text generation

[INST]After the causal masking token rearrangement,
each timestep of the rearranged matrix Y is vector of K tokens. Copet et al. (2023) observed
that when performing autoregressive generation
over stacked RVQ tokens, it is advantageous to
apply a delay pattern so that the prediction of
codebook k at time t can be conditioned on the
prediction of codebook k − 1 from the same
timestep. We take a similar approach which we
describe here. Assume a span Ys is of shape
Ls × K. Applying the delay pattern rearranges
it into Zs = (Zs,0, Zs,1, · · · , Zs,Ls+K−1), where
Zs,t, t ∈ [Ls + K − 1] is defined as2
:
Zs,t = (Ys,t,1, Ys,t+1,2, · · · , Ys,t−K+1,K) (1)
where Ys,t−k+1,k denotes the token located at coordinate (t − k + 1, k) in matrix Ys, i.e. the kth codebook entry at the (t − k + 1)th timestep. explain the equation and indiviual terms?[/INST] Explanation: The equation is rearranging the span Ys of shape Ls × K into a matrix Zs of shape Ls × K+K. The term Zs,t, t ∈ [Ls + K − 1] represents the

In [None]:
# Load the base model with the specified name, explicitly mapping it to use the CPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu"  # Map the model to run on the CPU instead of GPU
)

# Disable caching to prevent storing the outputs from intermediate layers,
# which can save memory during inference or training
model.config.use_cache = False

# Load the tokenizer associated with the model, with specific configurations
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,  # Allows execution of custom code found in the tokenizer files
    use_fast=False           # Disables the use of the fast tokenizer
)
tokenizer.pad_token = tokenizer.eos_token  # Set the padding token to be the end-of-sequence token
tokenizer.padding_side = "right"           # Ensure padding is added to the right of the sequences

# Run the Model
The following cells test the capabilities of the base model before fine-tuning.

In [5]:
import time
start = time.time()

text = '''Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to
produce pairs of answers (y1, y2 ) ∼ π SFT (y | x). These are then presented to human labelers
who express preferences for one answer, denoted as yw ≻ yl | x where yw and yl denotes the
preferred and dispreferred completion amongst (y1 , y2 ) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. There are a
number of approaches used to model preferences, the Bradley-Terry (BT) [5] model being a popular
choice (although more general Plackett-Luce ranking models [30, 21] are also compatible with the
framework if we have access to several ranked answers). The BT model stipulates that the human
preference distribution p∗ can be written as:
p∗ (y1 ≻ y2 | x) =
exp (r∗ (x, y1 ))
.
exp (r∗
(x, y1 )) + exp (r ∗ (x, y2))
(1)
Assuming access to a static dataset of comparisons D = 
x(i), yw (i) , yl (i) N
i=1 sampled from p∗ , we
can parametrize a reward model rφ (x, y) and estimate the parameters via maximum likelihood.
Framing the problem as a binary classification we have the negative log-likelihood loss:
LR(rφ , D) = −E(x,yw ,yl )∼D 
log σ(rφ (x, yw ) − rφ (x, yl )).

Please explain the math behind this paper by explaining all the variables'''

end = time.time()
print(end - start)

4.5299530029296875e-05


In [11]:
text = '''Baseline + self-train. Similar to DECOLA Phase 2, we selftrain baseline on weakly-labeled data. For the self-training
algorithm, we use online self-training with max-size loss
from Detic [74] as baseline comparison (baseline + selftrain) to DECOLA Phase 2. We tested max-size and maxscore losses from Detic [74]
What is the Detic model, can you explain the architecture, implementation details and the dataset used for the model? How does it relate to the DECOLA model?'''

In [47]:
import time
start = time.time()

logging.set_verbosity(logging.CRITICAL)

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"[INST]{text}[/INST]")
print(result[0]['generated_text'])
end = time.time()
print(end - start)

In [13]:
import time
start = time.time()

logging.set_verbosity(logging.CRITICAL)

prompt = "Assuming access to a static dataset of comparisons D =  x(i), yw (i) , yl (i) N i=1 sampled from p∗ , we can parametrize a reward model rφ (x, y) and estimate the parameters via maximum likelihood. Framing the problem as a binary classification we have the negative log-likelihood loss: \n Please Explain math LR(rφ , D) = −E(x,yw ,yl )∼D  log σ(rφ (x, yw ) − rφ (x, yl ))"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1500)
result = pipe(f"<s>[INST] {text} [/INST]")
print(result[0]['generated_text'])

end = time.time()
print(end - start)

<s>[INST] Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to
produce pairs of answers (y1, y2 ) ∼ π SFT (y | x). These are then presented to human labelers
who express preferences for one answer, denoted as yw ≻ yl | x where yw and yl denotes the
preferred and dispreferred completion amongst (y1 , y2 ) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. There are a
number of approaches used to model preferences, the Bradley-Terry (BT) [5] model being a popular
choice (although more general Plackett-Luce ranking models [30, 21] are also compatible with the
framework if we have access to several ranked answers). The BT model stipulates that the human
preference distribution p∗ can be written as:
p∗ (y1 ≻ y2 | x) =
exp (r∗ (x, y1 ))
.
exp (r∗
(x, y1 )) + exp (r ∗ (x, y2))
(1)
Assuming access to a static dataset of comparisons D =
x(i), yw (i) , yl (i) N
i=1 sampled from p∗ , w

In [58]:
result[0]['generated_text']

' Explanation: The authors assume that the human preferences are generated by a latent reward model $r\\ast(x,y)$ that is not known. They propose a Bradley-Terry (BT) model to model the human preferences. The BT model stipulates that the human preference distribution $p\\ast(y_1 \\geq y_2 | x)$ can be written as:\n\n$$p\\ast(y_1 \\geq y_2 | x) = \\frac{e^{r\\ast(x,y_1)}}{e^{r\\ast(x,y_1)} + e^{r\\ast(x,y_2)}}$$\n\nThe authors assume that the reward model $r\\ast(x,y)$ is not known, but they can use a static dataset of comparisons $D = (x(i), yw(i), yl(i))$ sampled from $p\\ast$ to estimate the parameters of the reward model via maximum likelihood. The authors define the negative log-likelihood loss as:\n\n$$LR(r\\ast, D) = -\\sum_{i=1}^N E(x(i), yw(i), yl(i)) \\log \\sigma(r\\ast(x(i), yw(i)) - r\\ast(x(i), yl(i)))$$\n\nwhere $\\sigma(r\\ast(x,y)) = \\frac{e^{r\\ast(x,y)}}{e^{r\\ast(x,y)} + e^{r\\ast(x,y)}}$. Answer: $\\boxed{-\\sum_{i=1}^N E(x(i), yw(i), yl(i)) \\log \\sigma(r\\ast(x(

In [None]:
import pydetex.pipelines as pip
text = result[0]['generated_text']
out = pip.simple(text)
print(out)

<s>[INST] Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to
produce pairs of answers (y1, y2 ) ∼ π SFT (y | x). These are then presented to human labelers
who express preferences for one answer, denoted as yw ≻ yl | x where yw and yl denotes the
preferred and dispreferred completion amongst (y1, y2 ) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. There are a
number of approaches used to model preferences, the Bradley-Terry (BT) [5] model being a popular
choice (although more general Plackett-Luce ranking models [30, 21] are also compatible with the
framework if we have access to several ranked answers). The BT model stipulates that the human
preference distribution p∗ can be written as:
p∗ (y1 ≻ y2 | x)=
exp (r∗ (x, y1 ))
.
exp (r∗
(x, y1 )) + exp (r ∗ (x, y2))
(1)
Assuming access to a static dataset of comparisons D=
x(i), yw (i), yl (i) N
i=1 sampled from p∗, we
can

In [28]:
import time
start = time.time()

logging.set_verbosity(logging.CRITICAL)

prompt = "Assuming access to a static dataset of comparisons D =  x(i), yw (i) , yl (i) N i=1 sampled from p∗ , we can parametrize a reward model rφ (x, y) and estimate the parameters via maximum likelihood. Framing the problem as a binary classification we have the negative log-likelihood loss: \n Please Explain math LR(rφ , D) = −E(x,yw ,yl )∼D  log σ(rφ (x, yw ) − rφ (x, yl ))"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1500)
result = pipe(f"<s>[INST] {text} [/INST]")
print(result[0]['generated_text'])
end = time.time()
print(end - start)

<s>[INST] Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to
produce pairs of answers (y1, y2 ) ∼ π SFT (y | x). These are then presented to human labelers
who express preferences for one answer, denoted as yw ≻ yl | x where yw and yl denotes the
preferred and dispreferred completion amongst (y1 , y2 ) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. There are a
number of approaches used to model preferences, the Bradley-Terry (BT) [5] model being a popular
choice (although more general Plackett-Luce ranking models [30, 21] are also compatible with the
framework if we have access to several ranked answers). The BT model stipulates that the human
preference distribution p∗ can be written as:
p∗ (y1 ≻ y2 | x) =
exp (r∗ (x, y1 ))
.
exp (r∗
(x, y1 )) + exp (r ∗ (x, y2))
(1)
Assuming access to a static dataset of comparisons D =
x(i), yw (i) , yl (i) N
i=1 sampled from p∗ , w

# Train the Model
The following section prepares the dataset and fine-tunes the model on it.

## Dataset Overview

This dataset consists of 10,000 math questions, each paired with detailed explanations as answers. These questions cover a variety of math topics and difficulty levels, providing a rich source of data for training language models to understand and generate mathematical content effectively.

## Objective of Fine-Tuning

The main goal of this fine-tuning exercise is to adapt a pre-trained language model to better handle math-related queries and generate accurate and informative explanations. This specialized training helps the model improve its ability to:

- **Understand mathematical notation** and terminology, which are often structured differently from standard text.
- **Generate coherent and contextually relevant explanations** for math problems, an essential skill for educational tools and applications.

## Approach

We will fine-tune the model using a concatenated format of instruction and output. Each input example is prepared by appending the question ("instruction") with its corresponding explanation ("output"), separated by specific tags that help the model distinguish between the problem statement and the explanation. This format aims to teach the model the relationship between a mathematical problem and its step-by-step solution or explanation.

By training on this dataset, the model learns not just to solve mathematical problems, but to articulate the reasoning behind the solutions, making it valuable for educational purposes and more interactive user experiences.
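As a rough illustration of this format, the sketch below builds a training string from an instruction and an output, and splits a generated string back into the answer part. The helper names and the sample record are hypothetical; only the `[INST]` and `[/INST]` tags come from the notebook.

```python
INST_OPEN = "[INST]"
INST_CLOSE = "[/INST]"


def build_training_text(instruction: str, output: str) -> str:
    """Join a question and its explanation into one training string."""
    return f"{INST_OPEN} {instruction} {INST_CLOSE} {output}"


def build_prompt(instruction: str) -> str:
    """Build the inference-time prompt: same tags, no answer appended."""
    return f"{INST_OPEN} {instruction} {INST_CLOSE}"


def extract_response(generated_text: str) -> str:
    """Return the text after the last [/INST] tag, or everything if the tag is absent."""
    _, sep, tail = generated_text.rpartition(INST_CLOSE)
    return tail.strip() if sep else generated_text.strip()


# Made-up record for illustration; the real dataset rows are not shown here.
record = {"instruction": "What is 2 + 3?", "output": "2 + 3 = 5."}
text = build_training_text(record["instruction"], record["output"])

print(text)
print(extract_response(text))
```

Using the same tags at training and inference time matters: the model learns that whatever follows `[/INST]` is the explanation, so a prompt built with different delimiters may not trigger the fine-tuned behavior.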

In [17]:
from datasets import load_dataset, concatenate_datasets

# Load your dataset from the Hugging Face Hub under the specified repository
dataset = load_dataset("Wanfq/Explore_Instruct_Math_10k")

# Define a function to preprocess examples in the dataset
def preprocess_examples(example):
    # Concatenate 'instruction' and 'output' fields to form a new 'text' field formatted for the model
    example["text"] = "[INST] " + example["instruction"] + " [/INST] " + example["output"]
    return example  # Return the modified example with the new 'text' field

# Apply the preprocessing function to each split of the dataset and remove columns that are not needed anymore
processed_datasets = {
    split: ds.map(preprocess_examples, remove_columns=['instruction', 'output', 'input'])
    for split, ds in dataset.items()
}

# Combine all the processed datasets (splits) into a single dataset for easier handling
combined_dataset = concatenate_datasets(list(processed_datasets.values()))




In [18]:
combined_dataset

Dataset({
    features: ['text'],
    num_rows: 10000
})

## Fine Tune the Model

This section details the configurations and parameters used for fine-tuning the LLaMA model with LoRA enhancements on a dataset of 10,000 math problems. The setup is designed to optimize the model's ability to generate informative answers and explanations for complex mathematical queries.

## LoRA Configuration

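- **LoRA Alpha**: Set to 16. The LoRA update is scaled by alpha divided by the rank, so this controls how strongly the adapter affects the base weights.
- **Dropout Rate**: A dropout of 0.1 on the LoRA layers to reduce overfitting.
- **Rank (r)**: The rank of the LoRA matrices, set to 64. A higher rank adds trainable parameters and therefore adaptation capacity.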
- **Bias**: Configured to "none" to not use any additional bias in the LoRA layers.
- **Task Type**: Specified as "CAUSAL_LM" to indicate that the model operates in a causal language modeling context.
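The effect of these values can be checked with some quick arithmetic. The sketch below assumes the adapter is attached to the `q_proj` and `v_proj` projections in each of the 32 decoder layers, which is what PEFT typically selects for Llama models when no `target_modules` are given; that choice may differ across PEFT versions. The 4096 x 4096 layer shape is taken from the model printout in this notebook.

```python
def lora_params_per_matrix(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


rank, alpha = 64, 16
scaling = alpha / rank  # factor applied to the low-rank update B @ A

hidden = 4096                   # q_proj / v_proj are 4096 x 4096 in Llama-2-7B
num_layers = 32
adapted_modules_per_layer = 2   # assumption: q_proj and v_proj only

per_matrix = lora_params_per_matrix(hidden, hidden, rank)
total_trainable = per_matrix * num_layers * adapted_modules_per_layer
fraction_of_full_matrix = per_matrix / (hidden * hidden)

print(f"scaling = {scaling}")
print(f"trainable parameters per adapted matrix = {per_matrix:,}")
print(f"total trainable parameters = {total_trainable:,}")
print(f"share of one full matrix = {fraction_of_full_matrix:.2%}")
```

Under these assumptions each adapted matrix trains about 3% as many parameters as the full matrix would, and the whole adapter is on the order of tens of millions of parameters rather than billions.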

## Training Parameters

- **Optimizer**: Paged AdamW with 32-bit optimizer states. Paging lets optimizer state spill to CPU memory during GPU memory spikes, which reduces out-of-memory errors.
- **Batch Size**: 4 per device, balancing memory usage against throughput.
- **Gradient Accumulation Steps**: Set to 2, giving an effective batch size of 8 per device without the memory cost of a larger batch.
- **Learning Rate**: A peak learning rate of 2e-4, a common choice for LoRA fine-tuning.
- **Weight Decay**: Set to 0.001 to regularize the trainable parameters.
- **FP16 and BF16**: Both mixed-precision settings are left disabled. At most one should be enabled at a time: FP16 on GPUs that support it, or BF16 on hardware with BF16 support such as NVIDIA A100 GPUs.
- **Max Gradient Norm**: Capped at 0.3 to prevent the exploding gradient problem in training.
- **Learning Rate Scheduler**: A cosine learning rate scheduler is used to adjust the learning rate based on the training progress.
- **Reporting**: Set to report metrics to TensorBoard for visualization and monitoring training progress.
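The following sketch works out what these settings imply for the length of training and the shape of the learning-rate curve. It assumes a single GPU, the 10,000-example dataset, 10 epochs, and the `warmup_ratio` of 0.03 from the training cell. The cosine function is a simplified stand-in for the scheduler in Transformers, not a copy of it.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch: int,
                    grad_accum: int, num_devices: int = 1) -> int:
    """Optimizer steps per epoch given the effective batch size."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch)


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_epoch = steps_per_epoch(10_000, per_device_batch=4, grad_accum=2)
total_steps = per_epoch * 10                   # 10 epochs
warmup_steps = math.ceil(0.03 * total_steps)   # warmup_ratio = 0.03

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 2e-4):.2e}")
```

With these numbers the learning rate only reaches its peak after a few hundred steps, so the first logged losses are produced at a much smaller learning rate than 2e-4.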

## Supervised Fine-Tuning

- **Trainer Configuration**: Utilizes the SFTTrainer class, integrating the LoRA and PEFT configurations to adapt the model more effectively to the task.
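- **Sequence Length**: `max_seq_length` is left as `None`, so no limit is chosen here and the trainer falls back to its own default (in TRL releases that accept this argument, the documented default is the smaller of the tokenizer's `model_max_length` and 1024 tokens). Longer explanations may therefore be truncated.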
- **Training Dataset**: The combined dataset of instructions and outputs is used, formatted to train the model in generating accurate and contextual answers.

This configuration ensures the model is fine-tuned to not only solve mathematical problems but also to provide detailed explanations, enhancing its utility in educational and problem-solving applications.


In [13]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,      # number of epochs defined in the settings cell
    per_device_train_batch_size=4,          # 4 examples per device per forward pass
    gradient_accumulation_steps=2,          # effective batch size of 8 per device
    optim="paged_adamw_32bit",              # paged AdamW from bitsandbytes (not the Trainer default)
    save_steps=0,                           # we're not going to save intermediate checkpoints
    logging_steps=10,                       # log the training loss every 10 steps
    learning_rate=2e-4,                     # peak learning rate
    weight_decay=0.001,                     # small weight decay for regularization
    fp16=False,                             # set to True for FP16 mixed precision
    bf16=False,                             # set to True on GPUs with BF16 support (e.g. A100); never enable both
    max_grad_norm=0.3,                      # gradient clipping threshold
    max_steps=-1,                           # needs to be -1, otherwise overrides epochs
    warmup_ratio=0.03,                      # warm up over the first 3% of steps
    group_by_length=True,                   # batch similar-length sequences to reduce padding
    lr_scheduler_type="cosine",             # cosine decay after warmup
    report_to="tensorboard"                 # log metrics to TensorBoard
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=combined_dataset,
    peft_config=peft_config,                # use our lora peft config
    dataset_text_field="text",
    max_seq_length=None,                    # no explicit limit; the trainer's default applies
    tokenizer=tokenizer,                    # use the llama tokenizer
    args=training_arguments,                # use the training arguments
    packing=False,                          # don't need packing
)

# Train model
trainer.train()

# Fine-tuned model name
new_model = "llama-2-7b-chat-hf-math-ft-V2"

# Save trained model
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)



| Step | Training Loss |
|------|---------------|
| 10   | 1.4417        |
| 20   | 1.23          |
| 30   | 1.0561        |
| 40   | 1.0473        |
| 50   | 1.0169        |
| 60   | 0.7949        |
| 70   | 0.8023        |
| 80   | 0.8348        |
| 90   | 0.8608        |
| 100  | 0.844         |


('llama-2-7b-chat-hf-math-ft-V2/tokenizer_config.json',
 'llama-2-7b-chat-hf-math-ft-V2/special_tokens_map.json',
 'llama-2-7b-chat-hf-math-ft-V2/tokenizer.model',
 'llama-2-7b-chat-hf-math-ft-V2/added_tokens.json')

# Run the Model
The following cells run the model after fine-tuning.

In [19]:
from peft import AutoPeftModelForCausalLM, PeftModel, LoraConfig
from transformers import AutoTokenizer

peft_model_dir = "llama-2-7b-chat-hf-math-ft-V2"

# Load the base model together with the fine-tuned LoRA adapter, plus the tokenizer
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [25]:
base_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNo

After fine-tuning, the next step is to merge the LoRA adapter weights learned during fine-tuning into the original pre-trained base model. This produces a single standalone model that combines the specialized behavior from fine-tuning with the foundational knowledge of the base model.

## Process of Merging Weights

- **FP16 Precision**: The base model is reloaded with FP16 precision. Using FP16 (16-bit floating point) precision helps to reduce the memory footprint and improve the computational efficiency of the model, which is particularly beneficial for deployment on platforms with limited resources or for applications requiring high throughput.

- **Low CPU Memory Usage**: The model is loaded with settings optimized for lower CPU memory usage, facilitating better performance and resource management, especially in environments with constrained CPU resources.

- **Model Merger**: The PEFT (Parameter-Efficient Fine-Tuning) model, which holds the LoRA (Low-Rank Adaptation) weights, is merged with the base model. LoRA adapts selected layers by adding trainable low-rank matrices alongside the frozen pre-trained weights, so fine-tuning never modifies the original weights directly.

- **Final Model State**: After merging, the model combines the original capabilities of the base model with the adaptations learned during fine-tuning. `merge_and_unload()` folds the LoRA updates into the base weights and removes the adapter wrappers, returning a standard model that can be saved, deployed, or used without the PEFT library.
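The merge itself is a small piece of linear algebra: the low-rank update, scaled by alpha over the rank, is added to the frozen weight matrix. The sketch below shows this on a made-up 2 x 2 example in plain Python and checks that the merged matrix gives the same output as the base layer plus the adapter. The matrices and helper names are illustrative and do not reproduce PEFT's implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, rank: int) -> Matrix:
    """Fold the LoRA update into the base weights: W + (alpha / rank) * B @ A."""
    scaling = alpha / rank
    delta = matmul(B, A)
    return [[W[i][j] + scaling * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


# Made-up values: base weights W (2 x 2), LoRA factors A (1 x 2) and B (2 x 1).
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -1.0]]
B = [[1.0], [2.0]]
alpha, rank = 2.0, 1
x = [1.0, 1.0]

# Unmerged forward pass: base output plus the scaled adapter output.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [b + (alpha / rank) * a for b, a in zip(base_out, adapter_out)]

# Merged forward pass: a single matrix, no adapter needed at inference time.
merged_W = merge_lora(W, A, B, alpha, rank)
merged = matvec(merged_W, x)

print(merged_W)
print(unmerged, merged)
```

Because the two forward passes agree, the merged model behaves like the adapted one while keeping the layer structure and inference cost of the original base model.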

## Significance

This merging step is significant because it ensures that the fine-tuned enhancements are seamlessly integrated into the base model's architecture. It allows the model to maintain its general capabilities while also excelling in the specific tasks it was fine-tuned for, such as explaining mathematical concepts and solutions. The use of FP16 and memory-efficient loading options further ensures that the model is ready for efficient real-world application.

In [24]:
# Reload the base model with FP16 precision to optimize for GPU memory usage and computation speed
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,    # Minimize CPU memory usage during loading
    return_dict=True,          # Ensure that the outputs of the model are returned as a dictionary
    torch_dtype=torch.float16, # Set model weights to float16 for memory efficiency
    device_map={"": 0},        # Map the model to GPU with ID 0
)

# Initialize the PEFT model from the pre-trained base and the fine-tuned new model
model = PeftModel.from_pretrained(base_model, new_model)
# Fold the LoRA weights into the base model and drop the adapter wrappers
model = model.merge_and_unload()

# Reload the tokenizer associated with the model to ensure compatibility and save it later
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,    # Allow the loading of remote code associated with the tokenizer
)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to end-of-sequence token for consistency
tokenizer.padding_side = "right"           # Ensure that padding is applied to the right side


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Pushing the Trained Model to HuggingFace

1. **Authentication**: The first step involves authenticating with the Hugging Face Hub using `notebook_login()`. This step ensures that you have the necessary permissions to upload models under your account.

2. **Pushing the Model**: 
   - The model, after being fine-tuned and merged with LoRA weights, is pushed to the Hub.
   - The command `model.push_to_hub()` is used, specifying the name of the model and opting not to use a temporary directory during the upload process.
   - This makes the model publicly available and downloadable from the Hugging Face Hub.

3. **Pushing the Tokenizer**:
   - Similarly, the tokenizer is pushed using `tokenizer.push_to_hub()`.
   - This ensures that anyone who uses the model can also use the exact tokenizer settings used during fine-tuning, maintaining consistency in how text is processed.


In [9]:
from huggingface_hub import notebook_login
notebook_login()


In [23]:
new_model

'llama-2-7b-chat-hf-math-ft-V2'

In [26]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/RohitSahoo/llama-2-7b-chat-hf-math-ft-V2/commit/888f0107b5e4eacd0e7954010817420adffaa64c', commit_message='Upload tokenizer', commit_description='', oid='888f0107b5e4eacd0e7954010817420adffaa64c', pr_url=None, pr_revision=None, pr_num=None)

## Model Available:
https://huggingface.co/RohitSahoo/llama-2-7b-chat-hf-math-ft-V2

## Fine Tuning Evaluation

## Purpose of Evaluation

After fine-tuning a language model on a specific dataset, such as our dataset of 10,000 math equations and explanations, it is crucial to evaluate the model's performance to understand how well it has adapted to the target task. One effective method for evaluating text generation models, particularly in the contexts of summarization and response generation, is using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score.

## ROUGE Score Explained

The ROUGE score is a set of metrics designed to evaluate the quality of text that has been machine-generated by comparing it to reference texts, which are typically human-generated answers in our context. ROUGE measures the overlap of n-grams, word sequences, and longest common subsequences between the generated text and the references. It focuses on both precision and recall, providing scores for ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence).
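For intuition, here is a minimal from-scratch sketch of ROUGE-N and ROUGE-L on whitespace tokens. It is a simplified illustration and not the implementation inside the `rouge` package, so its numbers may differ slightly from that library's. The two example sentences are made up.

```python
from collections import Counter
from typing import Dict, List


def _prf(overlap: int, hyp_len: int, ref_len: int) -> Dict[str, float]:
    """Turn an overlap count into precision, recall, and F1."""
    p = overlap / hyp_len if hyp_len else 0.0
    r = overlap / ref_len if ref_len else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis: str, reference: str, n: int = 1) -> Dict[str, float]:
    """ROUGE-N: clipped n-gram overlap between hypothesis and reference."""
    hyp, ref = _ngrams(hypothesis.split(), n), _ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())
    return _prf(overlap, sum(hyp.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(hypothesis: str, reference: str) -> Dict[str, float]:
    """ROUGE-L: overlap measured by the longest common subsequence."""
    hyp, ref = hypothesis.split(), reference.split()
    return _prf(lcs_length(hyp, ref), len(hyp), len(ref))


hyp = "the derivative of x squared is 2 x"
ref = "the derivative of x squared equals 2 x"
r1, r2, rl = rouge_n(hyp, ref, 1), rouge_n(hyp, ref, 2), rouge_l(hyp, ref)
print(r1, r2, rl)
```

In this example a single substituted word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 when wording diverges.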

## Tokenization

For ROUGE score calculation, both the reference and generated texts must be properly tokenized. Tokenization involves breaking the text down into individual words or meaningful units, which facilitates a detailed and accurate comparison between the texts. The effectiveness of ROUGE scoring depends significantly on the alignment of this tokenization with the natural linguistic structures of the language.

## Multiple References

If multiple correct answers exist for a question, the ROUGE score calculation can take into account all possible correct answers. This flexibility enhances the robustness of the evaluation by recognizing any valid answer the model generates, thus not penalizing the model for providing an alternative correct answer that still accurately addresses the query.
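One simple way to handle several references is to score the generated answer against each one and keep the best score. The sketch below does this with a unigram F1 as a stand-in for ROUGE-1. Taking the maximum is an assumption about how to aggregate, and the helper names and example strings are made up.

```python
from collections import Counter
from typing import Iterable


def unigram_f1(hypothesis: str, reference: str) -> float:
    """F1 over clipped unigram overlap (a stand-in for ROUGE-1 F)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_over_references(hypothesis: str, references: Iterable[str]) -> float:
    """Score against every reference and keep the highest value."""
    return max((unigram_f1(hypothesis, r) for r in references), default=0.0)


references = ["the answer is 4", "x equals 4"]  # two acceptable phrasings
single = unigram_f1("x equals 4", references[0])
best = best_over_references("x equals 4", references)
print(round(single, 4), best)
```

Scored against the first reference alone, a correct answer phrased differently looks poor; with both references it receives full credit.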

## Scoring Detail

ROUGE scores range from 0 to 1 and are sometimes reported as percentages; higher scores indicate greater overlap between the generated text and the human-written reference texts. Because the metric counts shared words rather than checking correctness, these scores are a quantitative proxy for how closely the model's responses follow the references, not a direct measure of mathematical accuracy.

## Conclusion

Using the ROUGE score to evaluate our fine-tuned model offers a comprehensive measure that enhances qualitative assessments. It helps in gauging the model's capability to generate coherent and contextually correct responses, which is particularly important in the domain of mathematical problem-solving and explanation.


In [None]:
# Initialize the pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

# Function to process questions in batches and generate answers
def generate_answers(questions, batch_size=10):
    answers = []
    for i in range(0, len(questions), batch_size):
        batch = questions[i:i+batch_size]
        batch_answers = []
        for text in batch:
            result = pipe(f"<s>[INST] {text} [/INST]")
            batch_answers.append(result[0]['generated_text'])
        print(i)  # progress: index of the first question in the batch just finished
        answers.extend(batch_answers)
    return answers

# Generate answers (`questions` is assumed to be a list of question strings prepared beforehand)
generated_answers = generate_answers(questions, batch_size=10)

In [1]:
from rouge import Rouge

# `reference_answers` is assumed to be a list of reference answer strings (one per question)
# `generated_answers` is the list of answers generated by the model, in the same order

# The `rouge` package expects plain strings and splits them on whitespace itself,
# so no separate tokenization step is needed here

# Initialize the Rouge scoring object
rouge = Rouge()

# Calculate precision, recall, and F1 for ROUGE-1, ROUGE-2, and ROUGE-L, averaged over all pairs
scores = rouge.get_scores(generated_answers, reference_answers, avg=True)


ROUGE Scores: 0.53
