# Final Project
### Fine-Tuning a Mistral 7B Model
---
Members:
- Bastian Castillo (C0872284)
- Fadernel Bedoya (C0872455)
- Marcelo Munoz (C0873813)
- Suyog Adhikari (C0880973)

**Goal**:

The goal of this project is to fine-tune a pretrained model so that it can solve basic tasks given as instructions through a chatbot interface. The interface is to be built with the Gradio library and deployed on Hugging Face Spaces.

(*) Because of limited compute power and storage, the dataset used to fine-tune the model has only 1K instances. We tried other datasets intended for the same purpose, but we ran into resource limits during both training and deployment.


First of all, the required dependencies are installed in this notebook environment:

In [1]:
%%capture
%pip install -U bitsandbytes
%pip install -U datasets transformers
%pip install -U peft
%pip install -U accelerate
%pip install -U trl

Once the dependencies are installed, they are imported into the notebook:

In [2]:
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    HfArgumentParser, TrainingArguments, pipeline, logging,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os
import torch
import wandb
from datasets import load_dataset
from trl import SFTTrainer



### Hugging Face and W&B Secrets

The Hugging Face platform lets us store the generated model and provides a Space where it can be deployed. W&B (Weights & Biases) is a platform for plotting training metrics graphically. In the next cell, we load the tokens for both services as Kaggle secrets to keep them secure.

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_hf = user_secrets.get_secret("HUGGINGFACE_TOKEN")
secret_wandb = user_secrets.get_secret("WANDB_TOKEN")

With the token loaded, we log in to Hugging Face so that we can later push our model to the HF repository and Space.

In [4]:
!huggingface-cli login --token $secret_hf

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We also log in to W&B from this notebook and set up the project that will hold our training graphs.

In [5]:
wandb.login(key = secret_wandb)
run = wandb.init(
    project='Fine tuning mistral 7B', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mbascr[0m ([33mlambton-college[0m). Use [1m`wandb login --relogin`[0m to force relogin


### Model

The model selected by the team was Mistral 7B, a transformer model with 7.3 billion parameters. Despite its relatively small size, Mistral AI reports that it outperforms Llama 2 13B and Llama 1 34B on many benchmarks. Two mechanisms in its architecture are particularly important:

- Grouped-query attention (GQA): it allows faster inference than standard full multi-head attention.
- Sliding window attention (SWA): it handles longer text sequences at a lower cost.
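
To make the two mechanisms concrete, here is a minimal pure-Python sketch. It is an illustration with toy sizes, not Mistral's implementation: the function names are hypothetical, and the convention that a window of size `w` covers the current token plus the `w - 1` tokens before it is an assumption made for this example.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j must not be in the future (j <= i).
    Sliding window: j must be among the last `window` positions (i - j < window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Under grouped-query attention, consecutive query heads share one key/value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Toy sizes chosen for readability (assumptions, not the real model dimensions).
for row in sliding_window_causal_mask(seq_len=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))

print([kv_head_for_query_head(h, n_query_heads=8, n_kv_heads=2) for h in range(8)])
```

Each row of the mask has at most `window` allowed positions, so the attention cost per token stops growing with the sequence length. With grouped-query attention, fewer key/value heads have to be cached, which is where the inference speed-up comes from.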

We can use this model freely because it was released under the permissive Apache 2.0 license.

In the following cell, we declare the path of the base model, the name of our fine-tuned model, and the name of the dataset we will use:

In [8]:
base_model = "/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"
dataset_name = "mlabonne/guanaco-llama2-1k"
new_model = "chatbot_2"

### Dataset

The selected dataset contains 1K instances in multiple languages. Each instance holds an instruction and its response, already formatted to be used as training input for the model.

In [7]:
import pandas as pd

# Load the dataset
dataset = load_dataset(dataset_name,split="train")
dataset_df = pd.DataFrame(dataset)

Some examples from the dataset are shown below. The token **"\<s>"** marks the beginning of a sequence and **"\</s>"** the end of it. The text between **"\[INST\]"** and **"\[/INST\]"** is the instruction, and the text after it is the answer. This kind of data is used for instruction-following (question and answer) models, as in our case.
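
The format described above can be checked with a small parser. This is a sketch only: `split_turns` is a hypothetical helper, and the example string is made up for illustration rather than taken from the dataset.

```python
import re

# One turn = "[INST] instruction [/INST] answer", where the answer runs until
# the next sequence marker, the next instruction, or the end of the text.
TURN_PATTERN = re.compile(
    r"\[INST\](.*?)\[/INST\](.*?)(?=<s>|\[INST\]|</s>|$)",
    re.DOTALL,
)


def split_turns(text):
    """Return the (instruction, answer) pairs found in one training example."""
    return [
        (instruction.strip(), answer.strip())
        for instruction, answer in TURN_PATTERN.findall(text)
    ]


# Made-up example in the same layout as the dataset rows.
example = "<s>[INST] What is 2 + 2? [/INST] 2 + 2 equals 4. </s>"
print(split_turns(example))
```

A check like this can help spot rows where the instruction markers are missing or unbalanced before training starts.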

In [8]:
dataset_df

| | text |
|---|---|
| 0 | `<s>[INST] Me gradué hace poco de la carrera de...` |
| 1 | `<s>[INST] Самый великий человек из всех живших...` |
| 2 | `<s>[INST] Compose a professional email with th...` |
| 3 | `<s>[INST] ¿Qué juegos me recomendarías si me h...` |
| 4 | `<s>[INST] Cual es el desarrollo motor de niño/...` |
| ... | ... |
| 995 | `<s>[INST] I want you to act as a Linux termina...` |
| 996 | `<s>[INST] quiero un tutorial de como acceder a...` |
| 997 | `<s>[INST] Auf Mani reimt sich Banani, daraus w...` |
| 998 | `<s>[INST] Buenos días! [/INST] ¡Hola! ¿Cómo es...` |


To reduce memory usage, the model is loaded with 4-bit (NF4) quantization (the base_model variable contains the path of the Mistral model):

In [9]:
bnb_config = BitsAndBytesConfig(  
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,  # 4-bit loading is configured here, so no separate load_in_4bit argument
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)

model.config.use_cache = False  # the KV cache is not needed for training and conflicts with gradient checkpointing
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
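
A rough back-of-the-envelope calculation shows why 4-bit loading matters on GPUs of this size. The sketch below only counts the storage of the weights themselves; it ignores quantization constants, activations, gradients, optimizer state and the LoRA adapter, so the real footprint is higher. The helper name is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7.3e9  # parameter count quoted in the Model section

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the memory of the bfloat16 weights, which leaves room for training on the GPUs used here.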

The tokenizer is also loaded and configured: padding is applied on the right, the EOS (end-of-sequence) token is reused as the padding token, and an EOS token is appended to every example.

In [10]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)
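
The effect of these tokenizer settings can be illustrated without the tokenizer itself. The sketch below is an assumption-laden toy: `pad_right` is a hypothetical helper and the token ids are invented, with 1 standing in for the BOS token and 2 for the EOS token.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


BOS_ID, EOS_ID = 1, 2  # illustrative ids, not read from the real tokenizer
batch = [
    [BOS_ID, 50, 51, 52, EOS_ID],  # longer example
    [BOS_ID, 60, EOS_ID],          # shorter example, needs two pad tokens
]

input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```

Because the padding token and the EOS token share the same id, the attention mask is what distinguishes the real end-of-sequence token from the padding that follows it.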

Text-generation models of this size need a lot of compute and memory, so it is necessary to reduce the number of trainable parameters by adding adapter layers, which use memory more efficiently. This technique is named LoRA (Low-Rank Adaptation of Large Language Models). Its approach is to represent the weight updates with two smaller matrices obtained through low-rank decomposition. As the Hugging Face documentation says:

_These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn’t receive any further adjustments. To produce the final results, both the original and the adapted weights are combined._

_This approach has a number of advantages:_

- _LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters._
- _The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them._
- _LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them._
- _Performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models._
- _LoRA does not add any inference latency because adapter weights can be merged with the base model._
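
The low-rank update described in the quote can be sketched with tiny matrices in plain Python. This is an illustration of the idea only, not the PEFT implementation: the helper names and the 3x3 numbers are made up, and the `alpha / r` scaling follows the usual LoRA formulation W + (alpha / r) * B A.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merged_weight(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight used after merging the adapter."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
B = [[1.0], [0.0], [2.0]]   # 3 x r, trainable (r = 1 here)
A = [[0.5, 0.0, -0.5]]      # r x 3, trainable

merged = lora_merged_weight(W, B, A, alpha=2, r=1)
for row in merged:
    print(row)
```

Only `A` and `B` are trained (6 numbers in this toy case instead of 9), while `W` stays frozen. Merging them back into `W` is what lets the adapted model run without extra inference latency.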


In [11]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
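
As a sanity check on the configuration above, the number of trainable LoRA parameters can be estimated by hand. The layer shapes and layer count below are assumptions about the Mistral 7B architecture that do not appear in this notebook; calling `model.print_trainable_parameters()` on the PEFT model reports the actual figure.

```python
# (in_features, out_features) of each targeted projection.
# Assumed Mistral 7B shapes, not read from the loaded model.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}
ASSUMED_NUM_LAYERS = 32
RANK = 64  # r in the LoraConfig above


def lora_params(in_features, out_features, r):
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")
print(f"share of a 7.3B model: {total / 7.3e9:.2%}")
```

Under these assumptions the adapter trains a little over 1% of the parameters of the base model, which is what makes fine-tuning feasible on this hardware.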

### Hyperparameters

The arguments for the trainer class are declared below: the local output folder for the generated model files, the number of epochs (because of the resource limits we trained the model for 1 epoch), the optimizer, and the batch size, among other hyperparameters:

In [12]:
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb"
)
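
These settings determine how many optimizer steps the run takes. The sketch below is a simplified estimate with a hypothetical helper. It assumes that `device_map="auto"` splits one copy of the model across the GPUs instead of running data-parallel replicas, so the effective batch size is 4.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, data_parallel_devices=1):
    """Estimate the total number of optimizer steps for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * data_parallel_devices
    return math.ceil(num_examples / effective_batch) * num_epochs


steps = optimizer_steps(
    num_examples=1000,        # size of the dataset
    per_device_batch_size=4,  # per_device_train_batch_size
    grad_accum_steps=1,       # gradient_accumulation_steps
    num_epochs=1,             # num_train_epochs
)
logged_points = steps // 25   # logging_steps=25
print(steps, logged_points)
```

With these values the estimate is 250 steps and 10 logged loss values, which can be compared against the training output once the run finishes.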

The trainer is instantiated and training begins with the trainer.train() method. Training took around 1 hour and 30 minutes on an instance with 2 GPUs of 15 GB each and 16 GB of RAM:

In [13]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)



In [14]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 25 | 1.2476 |
| 50 | 1.6229 |
| 75 | 1.2026 |
| 100 | 1.3971 |
| 125 | 1.1572 |
| 150 | 1.3262 |
| 175 | 1.1714 |
| 200 | 1.4292 |
| 225 | 1.1359 |
| 250 | 1.481 |


TrainOutput(global_step=250, training_loss=1.317102081298828, metrics={'train_runtime': 5524.6504, 'train_samples_per_second': 0.181, 'train_steps_per_second': 0.045, 'total_flos': 1.874641569231667e+16, 'train_loss': 1.317102081298828, 'epoch': 1.0})
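
The reported metrics can be cross-checked with a few lines of arithmetic. The numbers below are copied from the training output above; the variable names are only illustrative.

```python
# Values copied from the training log and TrainOutput above.
logged_losses = [1.2476, 1.6229, 1.2026, 1.3971, 1.1572,
                 1.3262, 1.1714, 1.4292, 1.1359, 1.481]
train_runtime_s = 5524.6504
num_examples = 1000
global_step = 250

mean_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = num_examples / train_runtime_s
steps_per_second = global_step / train_runtime_s
hours, remainder = divmod(train_runtime_s, 3600)
minutes = remainder // 60

print(f"mean logged loss:   {mean_loss:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"runtime:            {int(hours)} h {int(minutes)} min")
```

The mean of the ten logged values agrees with the reported `training_loss`. The loss moves up and down between logging points rather than falling steadily, so a single epoch on 1K examples gives only a limited picture of convergence.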

### Save model and publish to Hugging Face Hub repository

The model is saved locally and pushed to the Hugging Face Hub so it can be used during deployment to the Space.

In [15]:
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True

Run history:

| Metric | History |
|---|---|
| train/epoch | ▁▂▃▃▄▅▆▆▇██ |
| train/global_step | ▁▂▃▃▄▅▆▆▇██ |
| train/learning_rate | ▁▁▁▁▁▁▁▁▁▁ |
| train/loss | ▃█▂▅▁▄▂▅▁▆ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |
| train/train_runtime | ▁ |
| train/train_samples_per_second | ▁ |
| train/train_steps_per_second | ▁ |

Run summary:

| Metric | Value |
|---|---|
| train/epoch | 1.0 |
| train/global_step | 250.0 |
| train/learning_rate | 0.0002 |
| train/loss | 1.481 |
| train/total_flos | 1.874641569231667e+16 |
| train/train_loss | 1.3171 |
| train/train_runtime | 5524.6504 |
| train/train_samples_per_second | 0.181 |
| train/train_steps_per_second | 0.045 |


In [None]:
trainer.model.push_to_hub(new_model, use_temp_dir=False)

### Saving model files

In [40]:
!zip -r model.zip /kaggle/working/results/adapter_config.json /kaggle/working/results/adapter_model.safetensors /kaggle/working/results/tokenizer.json /kaggle/working/results/tokenizer.model /kaggle/working/results/tokenizer_config.json /kaggle/working/results/training_args.bin /kaggle/working/results/special_tokens_map.json /kaggle/working/results/checkpoint-250

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  adding: kaggle/working/results/adapter_config.json (deflated 50%)
  adding: kaggle/working/results/adapter_model.safetensors (deflated 8%)
  adding: kaggle/working/results/tokenizer.json (deflated 74%)
  adding: kaggle/working/results/tokenizer.model (deflated 55%)
  adding: kaggle/working/results/tokenizer_config.json (deflated 64%)
  adding: kaggle/working/results/training_args.bin (deflated 48%)
  adding: kaggle/working/results/special_tokens_map.json (deflated 73%)
  adding: kaggle/working/results/checkpoint-250/ (stored 0%)
  adding: kaggle/working/results/checkpoint-250/adapter_config.json (deflated 50%)
  adding: kaggle/working/results/checkpoint-250/tokenizer.model (deflated 55%)
  adding: kaggle/working/results/checkpoint-250/rng_state.pth (deflated 28%)
  adding: kaggle/working/results/checkpoint-250/tokenizer.json (deflated 74%)
  adding: kaggle/working/results/checkpoint-250/README.md (deflated 65%)
  adding: kaggle/working/results/checkpoint-250/scheduler.pt (deflated 51

In [42]:
!zip -r new_model.zip /kaggle/working/mistral_7b_guanaco


  adding: kaggle/working/mistral_7b_guanaco/ (stored 0%)
  adding: kaggle/working/mistral_7b_guanaco/adapter_config.json (deflated 50%)
  adding: kaggle/working/mistral_7b_guanaco/README.md (deflated 65%)
  adding: kaggle/working/mistral_7b_guanaco/adapter_model.safetensors (deflated 8%)


### Inference

Once the model is saved, it can be used through the pipeline function. The input must also be formatted with the instruction tokens before it is given to the model for inference.
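
The input formatting can be wrapped in two small helpers, one to build the prompt and one to strip the echoed prompt from the generated text. Both function names are hypothetical and the sample strings are made up. The text-generation pipeline also accepts `return_full_text=False`, which may make the second helper unnecessary.

```python
def build_prompt(instruction):
    """Wrap a user instruction in the format used by the training data."""
    return f"<s>[INST] {instruction} [/INST]"


def extract_answer(generated_text):
    """Return only the text after the last [/INST] marker."""
    _, found, answer = generated_text.rpartition("[/INST]")
    # If the marker is missing, fall back to the whole text.
    return answer.strip() if found else generated_text.strip()


prompt = build_prompt("How do I find true love?")
print(prompt)

# Made-up model output, shortened for illustration.
fake_output = prompt + " Finding true love is a personal journey."
print(extract_answer(fake_output))
```

Keeping the prompt format in one place helps the inference code stay consistent with the format the model saw during fine-tuning.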

In [38]:
logging.set_verbosity(logging.CRITICAL)

prompt = "How do I find true love?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] How do I find true love? [/INST] Finding true love is a complex and personal journey, and there is no one-size-fits-all answer. However, here are some tips that may help you on your journey:

1. Be yourself: The most important thing is to be true to yourself and your values. Don't try to be someone else or change who you are to please others.

2. Be open-minded: Don't limit yourself to a certain type of person or relationship. Be open to new experiences and possibilities.

3. Be patient: Finding true love takes time and effort. Don't rush the process or settle for less than what you want.

4. Be active: Get out there and meet new people. Join clubs, go to events, and try new things.

5. Be honest: Be honest with yourself and others about your feelings and intentions.


In [41]:
prompt = "What is Datacamp Career track?"
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] What is Datacamp Career track? [/INST] Datacamp Career Track is a program that provides a comprehensive learning path for individuals who want to become data scientists. It covers a wide range of topics, including data analysis, machine learning, and data visualization. The program is designed to help individuals develop the skills and knowledge needed to succeed in the field of data science. 

The program consists of a series of courses, each of which is designed to build upon the knowledge and skills learned in the previous course. The courses are delivered through a combination of video lectures, interactive exercises, and quizzes. The program also includes a final project, where students apply the skills they have learned to a real-world problem. 

The program is designed to be completed in approximately 12 weeks, but students can take longer if needed. Upon completion of the program, students receive a certificate of completion from Datacamp. 

Over
