# Fine-Tuning LLMs with Hugging Face

## Step 1: Installing and importing the libraries

In [2]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 #Installing pinned versions of the packages used in this notebook, including transformers.


In [3]:
!pip install huggingface_hub #Installing the Hugging Face Hub client library.



In [4]:
#Importing the classes and functions used in this notebook, including several from transformers.
import torch
from trl import SFTTrainer #SFTTrainer class from the 'trl' (Transformer Reinforcement Learning) library.
from peft import LoraConfig #Used for parameter-efficient fine-tuning (PEFT). It keeps the number of trainable parameters small instead of training every weight of the model.
from datasets import load_dataset #Loads the dataset that we pass to the fine-tuning trainer.
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)

## Step 2: Loading the model

In [5]:
#We store the model in 'llama_model'. It is created by calling the 'from_pretrained' class method of the 'AutoModelForCausalLM' class from the transformers library, since we start from a pretrained model.
#The model name is given as the first parameter (pretrained_model_name_or_path = "aboonaji/llama2finetune-v2") and the next parameter is 'quantization_config'.
#'quantization_config' takes a 'BitsAndBytesConfig' object from the transformers library. 'load_in_4bit = True' loads the weights quantized to 4 bits, which reduces memory usage.
#'bnb_4bit_compute_dtype' specifies the datatype for computation: the 4-bit weights are dequantized to this type when they are used. 'getattr(torch, "float16")' fetches the 'float16' datatype from the torch module and is equivalent to 'torch.float16'.
#'bnb_4bit_quant_type' is a parameter of 'BitsAndBytesConfig' that sets the 4-bit quantization format to 'nf4' (4-bit NormalFloat), which is applied to the linear layers of the LLM.
llama_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2",
                                                   quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                                                                            bnb_4bit_compute_dtype = getattr(torch, "float16"),
                                                                                            bnb_4bit_quant_type = "nf4"))
llama_model.config.use_cache = False #Disables the key/value cache. The cache only speeds up token-by-token generation and is not needed during training, so we turn it off.
llama_model.config.pretraining_tp = 1 #'pretraining_tp' is the tensor parallelism degree used during pretraining. A value of 1 uses the standard linear-layer computation; larger values enable a slower, sliced computation meant to reproduce the pretraining results more exactly.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]
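
To see why 4-bit loading matters, the memory arithmetic can be sketched as below. The sketch assumes that the two checkpoint shards listed in the download log (9.98 GB and 3.50 GB) store 16-bit weights; the log does not state this, so treat it as an assumption. The helper name `weight_memory_gb` is hypothetical. The estimate covers the weights only and ignores quantization constants, activations and any layers that stay unquantized, so it is a rough guide and not a measurement.

```python
# Rough weight-memory estimate. The helper name is hypothetical.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Shard sizes taken from the download log, in GB.
checkpoint_gb = 9.98 + 3.50

# ASSUMPTION: the checkpoint stores 16-bit weights (2 bytes per parameter).
approx_params = checkpoint_gb * 1e9 / 2

fp16_gb = weight_memory_gb(approx_params, 16)
nf4_gb = weight_memory_gb(approx_params, 4)

print(f"approx. parameters: {approx_params / 1e9:.2f} billion")
print(f"16-bit weights:     {fp16_gb:.2f} GB")
print(f"4-bit weights:      {nf4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the memory of the 16-bit weights, which is what makes the model practical to load on a single Colab GPU.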

## Step 3: Loading the tokenizer

In [6]:
#Nearly every NLP task begins with a tokenizer. A tokenizer converts input text into token ids, the format that the model can process.
llama_tokenizer =  AutoTokenizer.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2", trust_remote_code = True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token #We set the pad token (padding token) to be the same as the eos token (end of sequence token). Padding brings all sequences in a batch to the same length, which is needed to process them together.
llama_tokenizer.padding_side = "right" #We set the padding side to right, so padding tokens are appended after the real tokens of each sequence.

tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]
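
What right-padding does to a batch can be illustrated without the tokenizer itself. The sketch below is plain Python, not the library's implementation: the helper `pad_right`, the token ids and the value of `EOS_ID` are all made up for illustration.

```python
from typing import List, Tuple

def pad_right(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

# ASSUMPTION: illustrative ids only; the real ids come from the tokenizer.
EOS_ID = 2
batch = [[1, 15, 27, 9], [1, 42]]

input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions, even though the pad token and the eos token share the same id.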

## Step 4: Setting the training arguments

In [7]:
training_arguments = TrainingArguments(output_dir = './results', per_device_train_batch_size = 4, max_steps = 100) #Per device training batch size is set to 4 for compatibility with Google Colab.
#We set only 3 training arguments here: the output directory (the directory where the training results are saved), the training batch size per device (a device in DL/ML is a GPU or CPU), and max steps (the total number of training steps to run).
#We use the 'TrainingArguments' class of the transformers library to set the training arguments. 'TrainingArguments' accepts many more arguments, but here we only set 'output_dir', 'per_device_train_batch_size' and 'max_steps'. The result is stored in a variable named 'training_arguments'.
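
A quick calculation shows how much of the dataset these settings cover. The sketch assumes a single device and no gradient accumulation (neither is set in the notebook), and it uses the dataset size of 6861 examples that is reported when the training split is generated later in the notebook.

```python
# How much data do 100 steps at batch size 4 actually cover?
per_device_train_batch_size = 4
max_steps = 100

num_devices = 1                  # ASSUMPTION: one GPU, as in a standard Colab session
gradient_accumulation_steps = 1  # ASSUMPTION: not set in the notebook, so the default applies

dataset_size = 6861              # from the "Generating train split" log

samples_seen = (per_device_train_batch_size * num_devices
                * gradient_accumulation_steps * max_steps)
epochs = samples_seen / dataset_size

print(f"samples seen: {samples_seen}")
print(f"fraction of one epoch: {epochs:.2f}")
```

Under these assumptions the run sees 400 examples, roughly 6% of the dataset, which agrees with the `'epoch': 0.06` value in the training output.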

## Step 5: Creating the Supervised Fine-Tuning trainer

In [8]:
#We create a supervised fine-tuning trainer using the 'SFTTrainer' class of the trl (Transformer Reinforcement Learning) library.
llama_sft_trainer = SFTTrainer(
    model = llama_model, args = training_arguments, #'SFTTrainer' takes the model and the training arguments as parameters.
    train_dataset = load_dataset(path = "aboonaji/wiki_medical_terms_llam2_format", split = "train"), #We also pass the training dataset to the 'SFTTrainer' by specifying its path (on Hugging Face) and selecting its "train" split.
    tokenizer = llama_tokenizer, #The tokenizer is also passed as an argument to the 'SFTTrainer'.
    #For parameter-efficient fine-tuning we configure the model. This is done using the 'LoraConfig' class of the peft (Parameter-Efficient Fine-Tuning) library.
    #LoRA stands for Low-Rank Adaptation. The idea behind LoRA is to keep the pretrained weights frozen and add small low-rank matrices to selected layers; only these added matrices are trained.
    peft_config = LoraConfig(task_type="CAUSAL_LM", r = 64, lora_alpha = 16, lora_dropout = 0.1), #The 'LoraConfig' class takes parameters 'task_type', 'r' (the rank of the low-rank matrices; a larger rank means more trainable parameters), 'lora_alpha' (the alpha parameter for LoRA scaling) and 'lora_dropout' (the dropout probability of the LoRA layers, which helps reduce overfitting).
    #The effect of the above step is to minimize the number of trainable parameters. 'task_type' is specified as 'CAUSAL_LM' because that is the type of language modelling used here.
    dataset_text_field = "text" #'dataset_text_field' is the last argument we pass to the 'SFTTrainer'. It is set to 'text' so that the trainer reads the training data from the "text" column of the dataset.
)
#The 'SFTTrainer' can take more arguments for fine-tuning the model, but here we only pass the ones required.

Downloading data:   0%|          | 0.00/54.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6861 [00:00<?, ? examples/s]



Map:   0%|          | 0/6861 [00:00<?, ? examples/s]
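
The saving that LoRA brings can be worked out by hand. In the usual formulation, a frozen weight matrix W of shape d_out x d_in is adapted as W + (lora_alpha / r) * B * A, where B has shape d_out x r and A has shape r x d_in, so only r * (d_in + d_out) values are trained per adapted layer. The sketch below applies this to the `r = 64` and `lora_alpha = 16` used above; the layer size of 4096 x 4096 is an assumption chosen for illustration and is not read from the model, and the helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A is r x d_in and B is d_out x r.
    return r * d_in + d_out * r

r = 64
lora_alpha = 16

# ASSUMPTION: illustrative layer size, not read from the model config.
d_in = d_out = 4096

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"full layer:      {full_params:,} parameters")
print(f"LoRA matrices:   {lora_params:,} parameters")
print(f"trainable share: {lora_params / full_params:.2%}")
print(f"update scaling:  {scaling}")
```

For a layer of this size, the LoRA matrices hold about 3% as many values as the frozen weight they adapt, and the update is scaled by 16 / 64 = 0.25.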

## Step 6: Training the model

In [9]:
llama_sft_trainer.train() #Training the model. '.train()' is a method of the 'SFTTrainer' class.

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


TrainOutput(global_step=100, training_loss=1.655076446533203, metrics={'train_runtime': 1484.0856, 'train_samples_per_second': 0.27, 'train_steps_per_second': 0.067, 'total_flos': 8228119310991360.0, 'train_loss': 1.655076446533203, 'epoch': 0.06})
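
The throughput figures in `TrainOutput` can be cross-checked from the runtime and the step count. This is only a sanity check on the reported numbers; it assumes a single device and no gradient accumulation, so that each step processes 4 samples.

```python
# Values copied from the TrainOutput above.
train_runtime = 1484.0856  # seconds
global_step = 100

# ASSUMPTION: one device and no gradient accumulation, so one step = one batch of 4.
samples_per_step = 4

seconds_per_step = train_runtime / global_step
steps_per_second = global_step / train_runtime
samples_per_second = global_step * samples_per_step / train_runtime

print(f"seconds per step:   {seconds_per_step:.2f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"samples per second: {samples_per_second:.2f}")
```

The results match the reported `train_steps_per_second` of 0.067 and `train_samples_per_second` of 0.27, which corresponds to roughly 15 seconds per training step.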

## Step 7: Chatting with the model

In [11]:
user_prompt = "Please tell me about loose motion" #What the user inputs.
#In the text generation pipeline we specify the task (text-generation), the model, the tokenizer and 'max_length', the maximum length of the sequence in tokens, prompt included (we set it to 300).
text_generation_pipeline = pipeline(task = "text-generation", model = llama_model, tokenizer = llama_tokenizer, max_length = 300)
model_answer = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]") #We store the result of the 'text_generation_pipeline' in 'model_answer'. The prompt is formatted as the model expects: "<s>" is the start token, "[INST]" and "[/INST]" mark the instruction, and f"" is a Python f-string that inserts 'user_prompt'.
print(model_answer[0]['generated_text']) #The result is printed as output.

<s>[INST] Please tell me about loose motion [/INST]  Loose motion, also known as diarrhea, is a common gastrointestinal issue that affects millions of people worldwide. Unterscheidung between diarrhea and loose motion is important because they have different causes, symptoms, and treatments. Here are some key differences between the two conditions:

Causes:

* Diarrhea is caused by an infection or inflammation of the intestines, which leads to an increase in the number of bowel movements and a decrease in the consistency of stool.
* Loose motion, on the other hand, is caused by a variety of factors, including viral infections, bacterial infections, food poisoning, and medication side effects.

Symptoms:

* Diarrhea is characterized by frequent, loose, and watery stools.
* Loose motion is characterized by frequent, loose, and watery stools, as well as abdominal cramping, bloating, and nausea.

Treatment:

* Diarrhea is usually treated with antibiotics or antiviral medications, depending
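
As the output above shows, the generated text repeats the prompt before the answer. A small pair of helpers can build the prompt and strip it from the result. This is a minimal sketch in plain Python; the function names `build_prompt` and `extract_answer` are hypothetical and not part of any library, and the sample text is a shortened copy of the output above.

```python
def build_prompt(user_prompt: str) -> str:
    """Wrap the user's text in the instruction format used in this notebook."""
    return f"<s>[INST] {user_prompt} [/INST]"

def extract_answer(generated_text: str) -> str:
    """Return only the text after the closing [/INST] tag."""
    _, separator, answer = generated_text.partition("[/INST]")
    if not separator:
        # No instruction tag found: return the text unchanged.
        return generated_text.strip()
    return answer.strip()

prompt = build_prompt("Please tell me about loose motion")

# Shortened copy of the generated text printed above.
generated = prompt + "  Loose motion, also known as diarrhea, is a common gastrointestinal issue"

print(prompt)
print(extract_answer(generated))
```

In the notebook, `extract_answer` could be applied to `model_answer[0]['generated_text']` to display the answer without the prompt.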