# Fine-Tuning of LLMs with Hugging Face

The [Hugging Face Hub](https://huggingface.co/docs/hub/index) is an online platform hosting over `350k models`, `75k datasets`, and `150k demo apps (Spaces)`, all open source and publicly available. It serves as a central place where anyone can explore, experiment, collaborate, and build technology with machine learning.

The idea of this project is to take a pre-trained (base) model from Hugging Face and fine-tune it on an augmented source dataset containing medical terms, so that the final chat model can answer specific medical questions, as shown in the [figure](https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32):

![Alt Text](backgroundFineTuningLLMs.png)


**Pre-trained model:**
[aboonaji/llama2finetune-v2](https://huggingface.co/aboonaji/llama2finetune-v2)


**Source (instruction) dataset:**
[gamino/wiki_medical_terms](https://huggingface.co/datasets/gamino/wiki_medical_terms)
This dataset contains over 6,000 medical terms and their Wikipedia text. It is intended for downstream tasks that require medical terms and their Wikipedia explanations.


**Formatted datasets:**

[aboonaji/wiki_medical_terms_llam2_format](https://huggingface.co/datasets/aboonaji/wiki_medical_terms_llam2_format)

[mlabonne/guanaco-llama2](https://huggingface.co/datasets/mlabonne/guanaco-llama2)

[emre/llama-2-instruct-121k-code](https://huggingface.co/datasets/emre/llama-2-instruct-121k-code)

## Step 1: Installing and importing the libraries for Hugging Face

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


In [None]:
!pip install huggingface_hub



In [None]:
import os
import torch
from trl import SFTTrainer
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging)

## Step 2: Setting up links to Hugging Face datasets and models

In [None]:
model_identifier = "aboonaji/llama2finetune-v2"
source_dataset = "gamino/wiki_medical_terms"
formatted_dataset = "aboonaji/wiki_medical_terms_llam2_format"

## Step 3: Setting up all the QLoRA hyperparameters for fine-tuning

In [None]:
lora_hyper_r = 64
lora_hyper_alpha = 16
lora_hyper_dropout = 0.1
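
To make the effect of `r` and `alpha` more concrete, here is a minimal sketch of two quantities they control in standard LoRA: the scaling factor `alpha / r` applied to the low-rank update, and the number of trainable parameters added per adapted weight matrix. The helper names and the 4096 x 4096 matrix size are assumptions chosen for illustration; they are not read from the model used in this notebook.

```python
# Values copied from the Step 3 cell above.
lora_hyper_r = 64
lora_hyper_alpha = 16


def lora_scaling(alpha: int, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


scaling = lora_scaling(lora_hyper_alpha, lora_hyper_r)

# 4096 is an assumed layer width, used only for illustration.
adapter_params = lora_param_count(4096, 4096, lora_hyper_r)
full_params = 4096 * 4096
adapter_fraction = adapter_params / full_params

print(f"scaling={scaling}, adapter params={adapter_params}, "
      f"fraction of full matrix={adapter_fraction:.4f}")
```

Under these assumptions the adapter trains only a few percent of the weights of each adapted matrix, which is why LoRA fits on a single Colab GPU.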

## Step 4: Setting up all the bitsandbytes hyperparameters for fine-tuning

In [None]:
enable_4bit = True
compute_dtype_bnb = "float16"
quant_type_bnb = "nf4"
double_quant_flag = False

## Step 5: Setting up all the training arguments hyperparameters for fine-tuning

In [None]:
results_dir = "./results"
epochs_count = 10  # has no effect while steps_limit > 0: a positive max_steps overrides num_train_epochs
enable_fp16 = False
enable_bf16 = False
train_batch_size = 4
eval_batch_size = 4
accumulation_steps = 1
checkpointing_flag = True
grad_norm_limit = 0.3
train_learning_rate = 2e-4
decay_rate = 0.001
optimizer_type = "paged_adamw_32bit"
scheduler_type = "cosine"
steps_limit = 100
warmup_percentage = 0.03
length_grouping = True
checkpoint_interval = 0
log_interval = 25
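
The combination of `scheduler_type = "cosine"`, `warmup_percentage` and `steps_limit` determines how the learning rate evolves during the run. The sketch below approximates that shape in plain Python: a linear warmup followed by a half-cosine decay to zero. The functions `warmup_steps` and `lr_at` are hypothetical helpers written for this illustration, and the exact values used by the trainer may differ slightly.

```python
import math

# Values copied from the Step 5 cell above.
train_learning_rate = 2e-4
steps_limit = 100
warmup_percentage = 0.03


def warmup_steps(total_steps: int, ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio (rounded up)."""
    return math.ceil(total_steps * ratio)


def lr_at(step: int, base_lr: float, total_steps: int, n_warmup: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < n_warmup:
        return base_lr * step / max(1, n_warmup)
    progress = (step - n_warmup) / max(1, total_steps - n_warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


n_warmup = warmup_steps(steps_limit, warmup_percentage)
for step in (0, n_warmup, 50, steps_limit):
    lr = lr_at(step, train_learning_rate, steps_limit, n_warmup)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With only 100 steps, the warmup phase lasts just a few steps and most of the run is spent on the decaying part of the curve.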

## Step 6: Setting up all the supervised fine-tuning arguments hyperparameters for fine-tuning

In [None]:
enable_packing = False
sequence_length_max = None
device_assignment = {"": 0}

## Step 7: Loading the dataset

In [None]:
training_data = load_dataset(formatted_dataset, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/54.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6861 [00:00<?, ? examples/s]

In [None]:
training_data

Dataset({
    features: ['text'],
    num_rows: 6861
})
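
The formatted dataset exposes a single `text` column, so each row is expected to hold a complete prompt and answer in one string. The sketch below shows one way such a string could be built, reusing the `<s>[INST] ... [/INST]` pattern that this notebook uses for its prompt in Step 15. The function `to_llama2_text`, the closing `</s>` tag and the sample record are assumptions for illustration; the exact template used to build the formatted dataset was not inspected here.

```python
def to_llama2_text(instruction: str, response: str) -> str:
    """Wrap an instruction/response pair in a Llama 2 style chat template."""
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"


# Made-up record, for illustration only.
example = to_llama2_text(
    "What is bursitis?",
    "Bursitis is inflammation of a bursa.",
)
print(example)
```

Keeping the training template and the inference prompt consistent matters, because the fine-tuned model learns to start answering right after the `[/INST]` tag.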

## Step 8: Defining the QLoRA configuration

In [None]:
dtype_computation = getattr(torch, compute_dtype_bnb)
print(dtype_computation)

torch.float16


In [None]:
bnb_setup = BitsAndBytesConfig(load_in_4bit=enable_4bit,
                               bnb_4bit_quant_type=quant_type_bnb,
                               bnb_4bit_use_double_quant=double_quant_flag,
                               bnb_4bit_compute_dtype=dtype_computation)

## Step 9: Loading the pre-trained LLaMA 2 model

In [None]:
llama_model = AutoModelForCausalLM.from_pretrained(model_identifier, quantization_config=bnb_setup, device_map=device_assignment)
llama_model.config.use_cache = False  # disable the KV cache, which is incompatible with gradient checkpointing
llama_model.config.pretraining_tp = 1



config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]
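
The shard sizes in the download log above allow a rough estimate of what 4-bit loading saves. The sketch below assumes that the checkpoint stores weights in 16-bit precision (2 bytes per weight) and that the progress bars report decimal gigabytes; both are assumptions, not facts confirmed by the notebook. It also ignores quantization constants, layers that stay unquantized, and activation memory, so the real footprint will be higher.

```python
# Shard sizes as reported by the download progress bars above.
shard_sizes_gb = [9.98, 3.50]
BYTES_PER_GB = 1e9  # assumption: decimal gigabytes

checkpoint_bytes = sum(shard_sizes_gb) * BYTES_PER_GB

# Assumption: the checkpoint stores each weight in 16 bits (2 bytes).
approx_params = checkpoint_bytes / 2

# In 4-bit storage each weight takes half a byte.
four_bit_gb = approx_params * 0.5 / BYTES_PER_GB

print(f"approx. parameters: {approx_params / 1e9:.2f} billion")
print(f"approx. 4-bit weight size: {four_bit_gb:.2f} GB")
```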

## Step 10: Loading the pre-trained tokenizer for the LLaMA 2 model

In [None]:
llama_tokenizer = AutoTokenizer.from_pretrained(model_identifier, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]
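
Setting `pad_token` to the end-of-sequence token and `padding_side` to `"right"` means that shorter sequences in a batch are extended at the end with the EOS id. The sketch below reproduces that idea on plain lists. The function `pad_right` and all token ids are hypothetical and only illustrate the layout; the real tokenizer does this internally.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, mask


# Hypothetical token ids; 2 stands in for the EOS id reused as padding.
padded, mask = pad_right([[1, 15, 27, 9], [1, 42]], pad_id=2)
print(padded)
print(mask)
```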

## Step 11: Setting up the configuration for the LoRA fine-tuning method

In [None]:
peft_setup = LoraConfig(lora_alpha=lora_hyper_alpha,
                        lora_dropout=lora_hyper_dropout,
                        r=lora_hyper_r,
                        bias="none",
                        task_type="CAUSAL_LM")

## Step 12: Creating a training configuration by setting the training parameters

In [None]:
train_args = TrainingArguments(output_dir=results_dir,
                               num_train_epochs=epochs_count,
                               per_device_train_batch_size=train_batch_size,
                               per_device_eval_batch_size=eval_batch_size,
                               gradient_accumulation_steps=accumulation_steps,
                               learning_rate=train_learning_rate,
                               weight_decay=decay_rate,
                               optim=optimizer_type,
                               save_steps=checkpoint_interval,
                               logging_steps=log_interval,
                               fp16=enable_fp16,
                               bf16=enable_bf16,
                               max_grad_norm=grad_norm_limit,
                               max_steps=steps_limit,
                               warmup_ratio=warmup_percentage,
                               group_by_length=length_grouping,
                               lr_scheduler_type=scheduler_type,
                               gradient_checkpointing=checkpointing_flag)

## Step 13: Creating the Supervised Fine-Tuning Trainer

In [None]:
llama_sftt_trainer = SFTTrainer(model=llama_model,
                                args=train_args,
                                train_dataset=training_data,
                                tokenizer=llama_tokenizer,
                                peft_config=peft_setup,
                                dataset_text_field="text",
                                max_seq_length=sequence_length_max,
                                packing=enable_packing)



Map:   0%|          | 0/6861 [00:00<?, ? examples/s]

## Step 14: Training the model

In [None]:
llama_sftt_trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
|------|---------------|
| 25   | 1.7076        |
| 50   | 0.8512        |
| 75   | 1.393         |
| 100  | 0.7614        |


TrainOutput(global_step=100, training_loss=1.1783139610290527, metrics={'train_runtime': 1150.5209, 'train_samples_per_second': 0.348, 'train_steps_per_second': 0.087, 'total_flos': 5978369907425280.0, 'train_loss': 1.1783139610290527, 'epoch': 0.06})
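
The reported `'epoch': 0.06` follows from the settings above: with `max_steps=100` the run stops long before a single pass over the 6,861 training rows, regardless of `epochs_count`. The sketch below recomputes the epoch fraction and throughput from the notebook's own numbers, assuming a single GPU so that one step consumes `train_batch_size * accumulation_steps` samples.

```python
# Values copied from the cells and logs above.
train_batch_size = 4
accumulation_steps = 1
steps_limit = 100
num_rows = 6861
train_runtime = 1150.5209  # seconds, from TrainOutput

# Assumption: one device, so each step sees batch_size * accumulation samples.
samples_seen = steps_limit * train_batch_size * accumulation_steps
epoch_fraction = samples_seen / num_rows
samples_per_second = samples_seen / train_runtime
steps_per_second = steps_limit / train_runtime

print(f"samples seen: {samples_seen}")
print(f"epoch fraction: {epoch_fraction:.2f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

These values agree with the metrics in `TrainOutput`, which suggests the model saw only about 6% of the dataset in this run.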

## Step 15: Chatting with the model

In [None]:
user_prompt = "Please tell me about Bursitis"
text_generation_pipe = pipeline(task="text-generation",
                                model=llama_model,
                                tokenizer=llama_tokenizer,
                                max_length=300)
generation_result = text_generation_pipe(f"<s>[INST] {user_prompt} [/INST]")
print(generation_result[0]['generated_text'])

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


<s>[INST] Please tell me about Bursitis [/INST]  Bursitis is a condition where the bursae, small fluid-filled sacs that cushion and reduce friction between tendons, muscles, and bones, become inflamed. everybody has bursae, but they are more common in the joints. Bursitis is a common condition that can affect any joint in the body. It is usually caused by repetitive motion, injury, or infection.

Bursitis can be acute or chronic. Acute bursitis is usually caused by a sudden injury, such as a fall or a blow to the joint. Chronic bursitis is usually caused by repetitive motion, such as running or cycling.

Symptoms of bursitis may include:

* Pain or tenderness in the affected joint
* Swelling or redness in the affected joint
* Limited mobility or range of motion in the affected joint
* Warmth or heat in the affected joint
* Inflammation or redness in the affected joint
* Difficulty moving the affected joint
* Pain when moving the affected joint
* Pain when resting the affected joint
* P
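
The generated text echoes the prompt, and it stops mid-list because `max_length=300` counts the prompt tokens together with the generated ones. A small post-processing step can return only the answer part. The helper `extract_answer` below is a hypothetical sketch, and the sample string is a shortened stand-in for the real output above.

```python
def extract_answer(generated_text: str, end_tag: str = "[/INST]") -> str:
    """Return the text after the instruction tag, or the whole text if absent."""
    _, sep, tail = generated_text.partition(end_tag)
    return tail.strip() if sep else generated_text.strip()


# Shortened stand-in for the pipeline output shown above.
sample = ("<s>[INST] Please tell me about Bursitis [/INST]  "
          "Bursitis is a condition where the bursae become inflamed.")
answer = extract_answer(sample)
print(answer)
```

Raising `max_length` in the pipeline call would be one way to get a longer answer, at the cost of slower generation.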