# Fine Tuning

This notebook uses the `data/fine_tune.jsonl` file to fine-tune the raw Mistral 7B base model so that it performs better at Q&A.

In [1]:
import os
import gc
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



## Configuring parameters for training
Parts of this process are based on [Fine-Tune Your Own Llama 2 Model in a Colab Notebook - Maxime Labonne](https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
and [A Beginner’s Guide to Fine-Tuning Mistral 7B Instruct Model - Adithya S K](https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe), with some parameters adjusted so that training runs on a single CUDA-compatible NVIDIA GPU with 24 GB of VRAM.

In [2]:
# The model that you want to train from the Hugging Face hub
model_name = "mistralai/Mistral-7B-v0.1"

# Directory where the LoRA adapter is saved
new_model_dir = "mistralai-lora"

# Directory where the merged model is saved
merged_model_dir = "merged-fine-tuned"

# Fine tuning data path
fine_tune_file = "data/fine_tune.jsonl"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 3

# Number of training steps (overrides num_train_epochs if not -1)
max_steps = -1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 100

# Log every X update steps
logging_steps = 100

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = 512

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

# define the device type
device = "cuda"
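
Two quantities implied by the settings above are easy to overlook: the LoRA scaling factor (`lora_alpha / lora_r`) and the effective batch size per optimizer step. The sketch below recomputes both from the values in the configuration cell. The `num_gpus` variable is an assumption added for illustration; it is not part of the notebook.

```python
# Values copied from the configuration cell above.
lora_r = 64
lora_alpha = 16
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 1  # assumption: single-GPU run, matching device_map = {"": 0}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# Number of examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

print(f"LoRA scaling factor: {lora_scaling}")
print(f"Effective batch size: {effective_batch_size}")
```

If a batch size of 1 turns out to be too noisy, raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step VRAM use.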


## Load Model and Tokenizer

In [3]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Your GPU supports bfloat16: accelerate training with bf16=True


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
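
The check above only prints a hint; the notebook itself leaves both `fp16` and `bf16` set to `False`. As a rough sketch, the same compute-capability test could be turned into precision flags. The helper name `precision_flags` is hypothetical, and enabling mixed precision at all is an assumption rather than something this notebook does.

```python
def precision_flags(compute_capability_major: int) -> dict:
    """Suggest fp16/bf16 flags from the major CUDA compute capability.

    Uses the same threshold as the notebook's check: major >= 8 is
    treated as bfloat16-capable.
    """
    if compute_capability_major >= 8:
        return {"fp16": False, "bf16": True}
    return {"fp16": True, "bf16": False}


# Example values: 8 for an Ampere-class GPU, 7 for an older one.
print(precision_flags(8))
print(precision_flags(7))
```

In the notebook, the argument would come from `major, _ = torch.cuda.get_device_capability()`. If `bf16` is enabled, setting `bnb_4bit_compute_dtype = "bfloat16"` would keep the quantized compute dtype consistent with training.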

In [4]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


## Test out the base model

In [5]:
## Omit this step to save VRAM for training. Uncomment to test.
# messages = [ 
#     {"role": "user", "content": "What is your favourite condiment?"},
#     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
#     {"role": "user", "content": "Do you have mayonnaise recipes?"}
# ]

# encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
# model_inputs = encodeds.to(device)

# generated_ids = base_model.generate(model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
# decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
# print(decoded[0])


## Load the training dataset

The training dataset is a JSON Lines (`.jsonl`) file. Each line is a JSON object whose `text` field holds one example in the instruction format recommended by [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1#instruction-format).

Example:
```
{"text": "<s>[INST] Something the user says [/INST] Desirable response from the model.</s>"}
{"text": "<s>[INST] Another thing from the user [/INST] Another response from the model.</s>"}
```
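
A minimal sketch of how such lines could be produced from question/answer pairs. The function name `to_jsonl_line` is hypothetical; using `json.dumps` ensures that quotes and newlines inside the text are escaped correctly.

```python
import json


def to_jsonl_line(user_message: str, model_response: str) -> str:
    """Format one training example as a single JSON Lines record."""
    text = f"<s>[INST] {user_message} [/INST] {model_response}</s>"
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    return json.dumps({"text": text}, ensure_ascii=False)


line = to_jsonl_line(
    "Something the user says",
    "Desirable response from the model.",
)
print(line)
```

Writing one such line per example, each followed by a newline, yields a file that `load_dataset('json', ...)` can read.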

In [6]:
train_dataset = load_dataset('json', data_files=fine_tune_file, split='train')

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]
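
Malformed records are easier to diagnose before loading than from a loader traceback. The following is a sketch of a pre-flight check, assuming each record should be a JSON object with a string `text` field as described above. The helper `find_bad_lines` is hypothetical and not part of the `datasets` library.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, reason) pairs for records that look malformed."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # skip blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            problems.append((number, 'missing string field "text"'))
    return problems


sample = ['{"text": "<s>[INST] Hi [/INST] Hello.</s>"}', '{text: "unquoted key"}']
print(find_bad_lines(sample))
```

To check the real file, pass an open file handle, for example `find_bad_lines(open(fine_tune_file, encoding="utf-8"))`.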

## Configure fine tuning

In [7]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]
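
For a sense of how many parameters this LoRA configuration trains, the adapter size can be estimated by hand: a LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The layer shapes below are assumptions taken from the published Mistral-7B-v0.1 configuration; they are not stated in this notebook.

```python
# Assumed Mistral-7B-v0.1 shapes (from its published config, not this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 dimensions per head
VOCAB = 32000
NUM_LAYERS = 32
LORA_R = 64  # lora_r from the configuration cell

# (in_features, out_features) for each targeted projection in one decoder layer.
per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in per_layer_shapes.values())
# lm_head is targeted once, outside the decoder layers.
total = per_layer * NUM_LAYERS + lora_params(HIDDEN, VOCAB, LORA_R)

print(f"Estimated trainable LoRA parameters: {total:,}")
```

The estimate can be compared against the real count with `trainer.model.print_trainable_parameters()`, assuming the trainer has wrapped the model as a PEFT model.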

## Run training

In [8]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model_dir)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
| ---: | ---: |
| 100 | 1.2022 |
| 200 | 1.2411 |
| 300 | 1.2172 |
| 400 | 1.2267 |
| 500 | 1.3053 |
| 600 | 1.2068 |
| 700 | 1.2581 |
| 800 | 1.3488 |
| 900 | 1.1884 |
| 1000 | 1.2227 |
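
One quick way to read a short loss log like this is to compare the mean of the first half with the mean of the second half. The sketch below does that with the values logged above; it is a rough heuristic, not a substitute for a held-out evaluation set.

```python
# Training loss logged every 100 steps (copied from the output above).
logged_loss = {
    100: 1.2022, 200: 1.2411, 300: 1.2172, 400: 1.2267, 500: 1.3053,
    600: 1.2068, 700: 1.2581, 800: 1.3488, 900: 1.1884, 1000: 1.2227,
}

values = list(logged_loss.values())
half = len(values) // 2

first_half_mean = sum(values[:half]) / half
second_half_mean = sum(values[half:]) / (len(values) - half)

print(f"Mean loss, first half:  {first_half_mean:.4f}")
print(f"Mean loss, second half: {second_half_mean:.4f}")
print(f"Change: {second_half_mean - first_half_mean:+.4f}")
```

A change close to zero suggests the training loss is roughly flat over the logged steps, which may be worth keeping in mind when judging the sample outputs at the end of the notebook.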


## Free up VRAM

In [9]:
del base_model
gc.collect()

del trainer
gc.collect()

41285

If the following "Save the merged model" step fails with an out-of-memory (OOM) error, run this cache-clearing step again. It sometimes takes more than one pass to release enough VRAM.

In [10]:
torch.cuda.empty_cache()
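
Since the cache may need clearing more than once, the two steps can be wrapped in a small loop. This is a sketch only: the helper name `free_vram` and the default of three rounds are assumptions, and the function does nothing GPU-related when CUDA is unavailable.

```python
import gc

import torch


def free_vram(rounds: int = 3) -> int:
    """Collect garbage and empty the CUDA cache; return bytes still allocated."""
    for _ in range(rounds):
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated()
    return 0


remaining = free_vram()
print(f"Bytes still allocated by tensors: {remaining}")
```

Any Python reference that still points at the model (including notebook output history) keeps its memory alive, so `del` on those names has to happen before this helper can reclaim anything.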

## Save the merged model

In [11]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
merged_model = PeftModel.from_pretrained(base_model, new_model_dir)
merged_model = merged_model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
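
Conceptually, merging folds each adapter into its base weight: for standard LoRA the merged weight is `W + (alpha / r) * B @ A`. The toy example below illustrates that arithmetic with plain Python lists and a rank-1 adapter. It is an illustration of the formula, not the PEFT implementation, and all matrices are made-up values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (made-up numbers).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r, in_features)
lora_b = [[0.5], [-0.5]]   # shape (out_features, r)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like an ordinary checkpoint.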

In [12]:
# Save the merged model
merged_model.save_pretrained(merged_model_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Test out the merged model

In [13]:
def send_to_model(model, msg):
    """Generate a sampled completion for a single user message."""
    messages = [
        {"role": "user", "content": msg},
    ]

    # Apply the tokenizer's chat template and move the token IDs to the GPU
    encoded = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

    generated_ids = model.generate(
        encoded,
        max_new_tokens=1000,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.4,
        repetition_penalty=1.20,
    )
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


def send_chat(model, msg):
    """Return only the model's reply, without the echoed prompt."""
    result = send_to_model(model, msg)[0]
    # [-1] avoids an IndexError if the "[/INST]" marker is missing from the output
    return result.rsplit("[/INST]", 1)[-1].strip()


def print_chat(model, msg):
    print(send_chat(model, msg))

In [14]:
print_chat(merged_model, "How do I make a bulleted list in markdown?")

To create a bulleted list in Markdown, you can follow these steps:

1. Start by creating a new text file and opening it in your preferred text editor or word processor.
2. Open the Markdown syntax guide by pressing Ctrl + Shift + Alt + U (or Ctrl + C) on Windows computers. Alternatively, you can search for "Markdown" in your computer's built-in search bar.
3. Once you have opened the Markdown syntax guide, select the bullet list icon from the options menu (the three lines that look like an L). You can find this option by scrolling down until you see the bullet list icon.
4. Place the cursor where you would like to start the list. The list will begin with an indentation of four spaces.
5. Press Enter to create the first item in the list.
6. Continue adding items to the list by repeating the process outlined above.
7. Save the Markdown file with .md extension.
8. Preview the Markdown file by clicking Ctrl + Shift + Alt + U (or Ctrl + C) again.
9. If you are satisfied with the results, yo

In [15]:
print_chat(merged_model, "How do I read a file in Python?")

There are several ways to read a file using the python library, including:
1. `open` function - Opens a file object and returns it. The open method can be used to read or write to a text based file such as .txt, .csv, .ini, etc. 
2. `with` Opening the file using with blocks ensures that the file is closed at the end of execution, even if an exception occurs. This is useful for files that need to be opened and then closed without any explicit code to manage this task.
3. `getattr` getattr() takes a string name attribute and returns the value associated with that name. Note that this requires knowing the name of the attribute first. For example, if you want to access the integer value named "age" from a dictionary, you would have to know that age was a key before calling getattr(). 
4. `readline` reads one line from the file object. If there is only one line left to read, the loop will return EOF (end-of-file) indicating that all data has been read. 
5. `reads` reads all remaining lines 

In [16]:
print_chat(merged_model, "What is black and white and read all over?")

The phrase "Black and White" is often used to refer to a classic or traditional design, while the term "Read All Over" is commonly associated with modern fashion trends. However, when combined they create an unusual and thought-provoking juxtaposition of words that could be interpreted in multiple ways. 

The Black and White color scheme has been around for centuries and is often associated with a timeless and elegant aesthetic. On the other hand, the phrase "Read All Over" suggests a focus on trendy or contemporary fashion. When you put these two together it creates a unique and unexpected combination that could be seen as either playful or serious depending on your interpretation. 

In summary, the answer to your question depends on how you interpret the phrase "black and white" and the phrase "read all over". If you take them literally then the answer would be that they are just random words unrelated to each other but if you look at the context behind them there may be some hidden 

In [17]:
print_chat(merged_model, "How much wood would a woodchuck chuck if a woodchuck could chuck wood?")

As much as any other woodchuck! 
The phrase "woodchuck" is often used to describe anyone or anything that seems to be strange, or out of the ordinary. The phrase was likely created in the early 1900s and has been used ever since. However, there is no direct connection between woodchucks and the phrase "woodchuck." In fact, woodchucks are also known as land beavers and are members of the family Mole-rat Pocket. 
So when you ask me how much wood would a woodchuck chuck if a woodchuck could chuck wood - I'm simply saying that it doesn't matter because woodchucks can’t chuck wood. 
I hope this answered your question! Let me know if you have any others! :) 
(Source: https://english.stackexchange.com/questions/5678/how-much-wood-would-a-woodchuck-chuck) 
https://en.wikipedia.org/wiki/Woodchuck#Origin_of_the_name 
https://grow.ifa.coop/blog/2013/12/what-is-a-woodchuck/#woodchuck-are-they-really-woodchucks  
https://english.stackoverflow.com/questions/4042/why-do-people-say-woodchuck-when-its-