# Fine Tuning

This notebook uses the `data/fine_tune.jsonl` file to fine-tune the raw Mistral 7B base model so that it performs better at question answering (Q&A).

In [1]:
import os
import gc
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



## Configuring parameters for training
Parts of this process are based on [Fine-Tune Your Own Llama 2 Model in a Colab Notebook - Maxime Labonne](https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
and [A Beginner’s Guide to Fine-Tuning Mistral 7B Instruct Model - Adithya S K](https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe), with parameters adjusted to run on a single CUDA-compatible NVIDIA GPU with 24 GB of VRAM.

In [2]:
# The model that you want to train from the Hugging Face hub
model_name = "mistralai/Mistral-7B-v0.1"

# Directory where the trained LoRA adapter is saved
new_model_dir = "mistralai-lora"

# Merged model
merged_model_dir = "merged-fine-tuned"

# Fine tuning data path
fine_tune_file = "data/fine_tune.jsonl"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 3

# Number of training steps (overrides num_train_epochs if not -1)
max_steps = -1

# Enable fp16/bf16 training (bf16 set to True automatically below if supported)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 1000

# Log every X updates steps
logging_steps = 100

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use. If you encounter an out of memory error during
# training you can lower this value to find the largest length that fits. Be
# aware that setting it truncates any training example longer than the limit.
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

# define the device type
device = "cuda"
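
To make the LoRA settings above concrete: in the standard LoRA formulation, the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight. The sketch below only does that arithmetic. `lora_scaling` is a hypothetical helper, not part of this notebook, and it assumes the default scaling rather than a rank-stabilized variant.

```python
def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Return the factor applied to the low-rank update (alpha / r)."""
    if lora_r <= 0:
        raise ValueError("lora_r must be a positive integer")
    return lora_alpha / lora_r


# Values used in this notebook: lora_alpha = 16, lora_r = 64
print(lora_scaling(16, 64))  # 0.25
```

With these values the adapter's contribution is scaled down to a quarter, so raising `lora_r` without also raising `lora_alpha` weakens the update per unit of rank.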


## Load Model and Tokenizer

In [3]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerating training with bf16=True")
        bf16 = True
        print("=" * 80)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Your GPU supports bfloat16: accelerating training with bf16=True


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
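
A rough back-of-the-envelope estimate helps explain why 4-bit loading fits comfortably on a 24 GB card. The sketch below computes the memory taken by the weights alone at different precisions. The parameter count of about 7.24 billion is an assumption, and the estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage is higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: Mistral 7B has roughly 7.24 billion parameters
APPROX_PARAMS = 7.24e9

for label, bits in [("4-bit (nf4)", 4), ("float16", 16), ("float32", 32)]:
    print(f"{label:>12}: {weight_memory_gib(APPROX_PARAMS, bits):.2f} GiB")
```

The float16 figure is the relevant one for the later merge step, which reloads the base model unquantized.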

In [4]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


## Test out the base model

In [5]:
## Omit this step to save VRAM for training. Uncomment to test.
# messages = [ 
#     {"role": "user", "content": "What is your favourite condiment?"},
#     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
#     {"role": "user", "content": "Do you have mayonnaise recipes?"}
# ]

# encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
# model_inputs = encodeds.to(device)

# generated_ids = base_model.generate(model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
# decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
# print(decoded[0])


## Load the training dataset

The training dataset consists of entries formatted as recommended by [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1#instruction-format), stored in JSON Lines (`.jsonl`) format with one JSON object per line.

Example:
```
{"text": "<s>[INST] Something the user says [/INST] Desirable response from the model.</s>"}
{"text": "<s>[INST] Another thing from the user [/INST] Another response from the model.</s>"}
```
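
As a minimal sketch of how such a file could be produced, the snippet below formats question/answer pairs into that template and writes one JSON object per line. The helper names `format_example` and `write_jsonl`, the sample pair, and the temporary output path are hypothetical and not part of this notebook.

```python
import json
import os
import tempfile


def format_example(question: str, answer: str) -> dict:
    """Wrap one Q&A pair in the [INST] template used by the training file."""
    return {"text": f"<s>[INST] {question.strip()} [/INST] {answer.strip()}</s>"}


def write_jsonl(pairs, path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for question, answer in pairs:
            record = format_example(question, answer)
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical sample data, written to a temporary file for illustration only
sample_pairs = [("Something the user says", "Desirable response from the model.")]
sample_path = os.path.join(tempfile.gettempdir(), "fine_tune_sample.jsonl")
write_jsonl(sample_pairs, sample_path)

# Every line should parse as JSON on its own and contain a "text" field
with open(sample_path, encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh]
print(records[0]["text"])
```

Using `json.dumps` rather than string concatenation keeps the keys quoted and escapes any quotes or newlines inside the text, which a JSON loader requires.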

In [6]:
train_dataset = load_dataset("json", data_files=fine_tune_file, split="train")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

## Configure and run fine tuning training

In [7]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



Map:   0%|          | 0/8679 [00:00<?, ? examples/s]
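
To get a feel for how large the adapter configured above is, the following sketch counts the parameters LoRA adds for the listed `target_modules`. Each adapted linear layer gains two matrices, of shapes `r × in_features` and `out_features × r`. The layer dimensions below are assumptions taken from the published Mistral-7B-v0.1 configuration, and the helper names are hypothetical.

```python
# Assumed dimensions for Mistral-7B-v0.1 (not read from the loaded model)
HIDDEN = 4096          # hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
INTERMEDIATE = 14336   # MLP intermediate size
VOCAB = 32000          # vocabulary size (lm_head output)
NUM_LAYERS = 32

# (in_features, out_features) of each adapted module inside one decoder layer
PER_LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the two low-rank matrices added to one linear layer."""
    return r * (in_features + out_features)


def total_lora_params(r: int) -> int:
    """Sum over all decoder layers plus the single lm_head adapter."""
    per_layer = sum(lora_params(i, o, r) for i, o in PER_LAYER_SHAPES.values())
    return per_layer * NUM_LAYERS + lora_params(HIDDEN, VOCAB, r)


print(f"{total_lora_params(64):,} trainable LoRA parameters at r=64")
```

Under these assumptions the adapter is a small fraction of the base model, which is why only the adapter weights need gradients and optimizer state during training.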

## Run training
This is the most time-consuming part of the operation. It took 3 hours and 35 minutes on an RTX 4090 with an AMD 
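
For planning purposes, the number of optimizer steps can be estimated from the dataset size and the batch settings. This is a sketch that approximates how the trainer derives its step count when `max_steps = -1`; `optimizer_steps` is a hypothetical helper, and the 8679 examples come from the `Map` progress bar above.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int, num_devices: int = 1) -> int:
    """Approximate total optimizer steps for a full run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Settings from this notebook: batch size 1, no accumulation, 3 epochs
total = optimizer_steps(8679, 1, 1, 3)
print(f"total steps: {total}")                        # 26037
print(f"checkpoints at save_steps=1000: {total // 1000}")   # 26
print(f"log entries at logging_steps=100: {total // 100}")  # 260
```

Raising `gradient_accumulation_steps` reduces the step count proportionally without increasing peak VRAM, at the cost of fewer weight updates per epoch.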

In [8]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model_dir)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 100  | 1.2738 |
| 200  | 1.1547 |
| 300  | 1.2039 |
| 400  | 1.1694 |
| 500  | 1.3672 |
| 600  | 1.2566 |
| 700  | 1.2127 |
| 800  | 1.1822 |
| 900  | 1.2711 |
| 1000 | 1.1942 |
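
Because the per-step loss is noisy, it can help to compare averages rather than individual entries. The sketch below does this for the logged values above. The helper `half_means` is hypothetical, and ten log entries are too few to draw firm conclusions from.

```python
from statistics import mean

# Training loss as logged above, keyed by step
logged_losses = {
    100: 1.2738, 200: 1.1547, 300: 1.2039, 400: 1.1694, 500: 1.3672,
    600: 1.2566, 700: 1.2127, 800: 1.1822, 900: 1.2711, 1000: 1.1942,
}


def half_means(values):
    """Return the mean of the first and second half of a sequence."""
    mid = len(values) // 2
    return mean(values[:mid]), mean(values[mid:])


losses = [logged_losses[step] for step in sorted(logged_losses)]
first, second = half_means(losses)
print(f"overall={mean(losses):.4f} first_half={first:.4f} second_half={second:.4f}")
```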


## Free up VRAM

In [9]:
del base_model
gc.collect()

del trainer
gc.collect()

41334

Sometimes you'll need to run this cache clearing step multiple times if the following "Save the merged model" step causes an out of memory (OOM) error.

In [10]:
torch.cuda.empty_cache()
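
Since the cleanup sometimes has to be repeated, the two steps can be wrapped in a small loop. This is a sketch only: `free_memory` is a hypothetical helper, the `torch` import is guarded so the snippet also runs where PyTorch is not installed, and it cannot release memory that is still referenced by a live variable.

```python
import gc


def free_memory(max_rounds: int = 3) -> int:
    """Run garbage collection and clear the CUDA cache up to max_rounds times.

    Returns the total number of unreachable objects the collector found.
    """
    total_collected = 0
    for _ in range(max_rounds):
        collected = gc.collect()
        total_collected += collected
        try:
            import torch  # optional here; only needed for the CUDA cache
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass
        if collected == 0:
            break  # nothing left to collect, further rounds would not help
    return total_collected


print(free_memory())
```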

## Save the merged model

In [11]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
merged_model = PeftModel.from_pretrained(base_model, new_model_dir)
merged_model = merged_model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
# Save the merged model
merged_model.save_pretrained(merged_model_dir, safe_serialization=True)
tokenizer.save_pretrained(merged_model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Test out the merged model

In [13]:
def send_to_model(model, msg):
    """Generate a sampled completion for a single user message."""
    messages = [
        {"role": "user", "content": msg},
    ]

    # Wrap the message in the tokenizer's chat template ([INST] ... [/INST])
    encoded = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

    generated_ids = model.generate(
        encoded,
        max_new_tokens=1000,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.4,
        repetition_penalty=1.20,
    )
    decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return decoded

def send_chat(model, msg):
    """Return only the text generated after the final [/INST] marker."""
    result = send_to_model(model, msg)[0]
    return result.rsplit(" [/INST] ", 1)[1]

def print_chat(model, msg):
    print(send_chat(model, msg))
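
Note that `send_chat` indexes element `[1]` of the split result, which raises an `IndexError` if the decoded text does not contain the exact ` [/INST] ` marker. A more forgiving variant could look like the sketch below. `extract_reply` is a hypothetical helper that works on plain strings, so it can be tried without loading a model.

```python
def extract_reply(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


print(extract_reply("[INST] How do I read a file in Python? [/INST] Use open()."))
print(extract_reply("Text without any marker"))
```

`str.rpartition` always returns three parts, so the missing-marker case is handled by checking whether the separator was found instead of catching an exception.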

In [14]:
print_chat(merged_model, "How do I make a bulleted list in markdown?")

To create a bulleted list in Markdown, you can use the following syntax:
```
- item1
  - subitem1
    - subsubitem1
      - subsubsubitem1
        - subsubsubsubitem2
          # comment
- item2
  - subitem2
    - subsubitem3
      - subsubsubitem4
        - subsubsubsubitem5
           # another comment
```
The hyphen character (-) indicates that the text after it is part of a list. The indentation and blank lines are used to format the list items for readability. You can also use the shorthand notation for lists, which looks like this:
```
* item1
** subitem1
*** subsubitem1
**** subsubsubitem1
****** subsubsubsubitem2
******* comment here
```
It's important to note that if you want your list items to have their own formatting, such as blank lines or indents, you should wrap them in parentheses or brackets before they appear on the page, otherwise they may not look quite right. Also, if you need more advanced features with your lists, there are many extensions available that can help

In [15]:
print_chat(merged_model, "How do I read a file in Python?")

To read a file in python you can use the built-in function open() to create an IO object that can be used to read from or write to the file.
The syntax for opening a file is as follows:
```python
with open("filename","mode") as fh:
    # code here
fh.close()
```
where mode can be one of the following: r (read), w (write) or rb (read write). 
It's important to remember to close the file when done, otherwise your program will not run correctly and may even crash! 

If you want to read the entire file at once there are two ways to do so: either using the readall() method provided by the library filelib3 or reading directly from the filehandle. The readall() method is more convenient if you need to process multiple files since it allows you to specify exactly how much data should be read each time without having to worry about closing and reopening the file everytime. However if all you need to do is read the contents of a single file then reading directly off the filehandle might still co

In [16]:
print_chat(merged_model, "What is black and white and read all over?")

A zebu. 
It's a striped animal that lives in the mountains of South America, it has black and white bands on its side and white spots on its back. It can also have long claws on its front legs for digging into soil. The black and white coloration helps camouflage the animal from predators as they blend into their surroundings when they are not moving. Zebus are very good climbers so they can escape danger by climbing trees or rocks. They usually eat plants but will also hunt insects if food is scarce. Despite being relatively small (about the size of a cat) zebus are still considered to be one of the largest living mammals in South America. 
Sources: https://www.britannica.com/biography/zebu-an-adaptable-mammal-in-the-patagonia-of-argentina-and-chile 
https://en.wikipedia.org/wiki/Zebu_(animal_family)_or_dasein 
I hope this was helpful! If you want to hear more about animals like this, feel free to ask me! 😊 
Note that I am not 100% sure whether my facts are up to date since new discov

In [17]:
print_chat(merged_model, "How much wood would a woodchuck chuck if a woodchuck could chuck wood?")

As an AI language model, I don't have personal opinions or beliefs. However, I can provide you with some general information about the amount of wood that a fictional character named "Woodchuck" might chuck if they were real and had access to tools such as axes or chainsaws.

It is important to note that this is purely hypothetical and not based on any scientific research or real-life scenarios. The amount of wood that a fictional character might chuck will depend on various factors, including their strength, physical abilities, and the availability of tools. Therefore, it is impossible to determine a precise answer to your question without further clarification or additional details. 

In summary, while we cannot provide a definitive answer to how much wood a fictional character named "Woodchuck" might chuck if they could, it is important to remember that this is just a fun and entertaining way of discussing things and should not be taken too seriously. It is always best to focus on m