<a href="https://colab.research.google.com/github/akimi-yano/data-science/blob/main/LLM_Fine_Tuning_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning of LLMs with Hugging Face

## Step 1: Installing and importing the libraries for Hugging Face

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [None]:
!pip install huggingface_hub



In [None]:
import os
import torch
from trl import SFTTrainer
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging)

In [None]:
torch.cuda.is_available()  # Should return True if GPU is available

True

## Step 2: Setting up links to Hugging Face datasets and models

In [None]:
# # Version 1: Medical Knowledge

# model_identifier = "aboonaji/llama2finetune-v2" # base model to fine-tune
# source_dataset = "gamino/wiki_medical_terms" # original dataset
# formatted_dataset = "aboonaji/wiki_medical_terms_llam2_format" # the same dataset reformatted, because Llama 2 expects a specific prompt format

In [None]:
# Version 2: Languages

model_identifier = "aboonaji/llama2finetune-v2" # base model to fine-tune
formatted_dataset = "mlabonne/guanaco-llama2" # multilingual chat dataset, already in the Llama 2 prompt format
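
The comments above note that Llama 2 expects a specific prompt format. A minimal sketch of the single-turn template, matching the `<s>[INST] ... [/INST]` wrapper used in Step 15, could look like this. The helper name `format_llama2_example` is hypothetical, and the closing `</s>` after the answer is an assumption based on the common Llama 2 chat convention rather than something this notebook shows.

```python
def format_llama2_example(instruction: str, answer: str = "") -> str:
    """Wrap an instruction (and optionally its answer) in the Llama 2 single-turn template."""
    prompt = f"<s>[INST] {instruction.strip()} [/INST]"
    if answer:
        # Assumption: training examples end with the end-of-sequence token.
        return f"{prompt} {answer.strip()} </s>"
    return prompt


print(format_llama2_example("Please tell me about Bursitis"))
print(format_llama2_example("What is 2 + 2?", "4"))
```

With an empty `answer`, the helper produces the inference-time prompt; with an answer, it produces one training-style example.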

## Step 3: Setting up all the QLoRA hyperparameters for fine-tuning

In [None]:
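lora_hyper_r = 64 # rank of the low-rank update matrices
lora_hyper_alpha = 16 # scaling factor applied to the LoRA update
lora_hyper_dropout = 0.1 # 10% dropout on the LoRA layers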
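
To see why a rank of 64 keeps training cheap, the following sketch counts the parameters LoRA adds to one weight matrix compared with the matrix itself. The hidden size of 4096 is an assumption (it is the hidden size of 7B Llama 2 models; the notebook does not state the size of `aboonaji/llama2finetune-v2`), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)


r = 64          # lora_hyper_r from this step
hidden = 4096   # assumption: hidden size of a 7B Llama 2 model

full = hidden * hidden                       # one square projection matrix
added = lora_param_count(hidden, hidden, r)  # its LoRA adapter

print(f"full matrix : {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.3%} of the full matrix)")
```

Under these assumptions the adapter is a small fraction of each adapted matrix, which is what makes fine-tuning on a single Colab GPU feasible.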

## Step 4: Setting up all the bitsandbytes hyperparameters for fine-tuning

In [None]:
# quantize the base model weights from 16 bits to 4 bits
enable_4bit = True # load the model in 4-bit precision
compute_dtype_bnb = "float16" # dtype used for computation (bnb = bitsandbytes)
quant_type_bnb = "nf4" # 4-bit NormalFloat quantization type
double_quant_flag = False # do not apply a second round of quantization (no nested quantization)
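
As a rough illustration of what 4-bit loading buys, the sketch below estimates the memory needed for the weights alone at different precisions. The 7-billion parameter count is an assumption, and the estimate ignores quantization constants, activations, optimizer state, and the LoRA adapters, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7e9  # assumption: a 7B-parameter model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(num_params, bits):.1f} GiB")
```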

## Step 5: Setting up all the training arguments hyperparameters for fine-tuning

In [None]:
results_dir = "./results"
epochs_count = 10 # requested number of epochs; overridden by steps_limit below, because a positive max_steps takes precedence

# setting both of these to False keeps training in the default 32-bit precision
enable_fp16 = False # don't use 16-bit floating point precision
enable_bf16 = False # don't use bfloat16 (brain floating point) either

train_batch_size = 4
eval_batch_size = 4
gradient_accumulation_steps = 1 # accumulate gradients over this many steps to increase the effective batch size without increasing memory use
checkpointing_flag = True # gradient checkpointing: saves memory at the cost of extra computation - useful for training large models
grad_norm_limit = 0.3 # gradient clipping: maximum gradient norm
train_learning_rate = 2e-4
decay_rate = 0.001 # weight decay, a small regularization term to reduce overfitting
optimizer_type = "paged_adamw_32bit" # paged AdamW optimizer, 32-bit precision version
lr_scheduler_type = "cosine" # learning rate follows a cosine curve to stabilize training
steps_limit = 100 # stop after 100 optimizer steps, regardless of epochs_count
warmup_percentage = 0.03 # 3% of the training steps are used for the warm-up phase
length_grouping = True # group training samples of similar length together, which improves training efficiency
checkpoint_interval = 0 # we don't save any intermediate checkpoints
log_interval = 25 # log the training loss every 25 steps
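
The interaction between `warmup_percentage`, `steps_limit`, and the cosine scheduler can be sketched in plain Python. This is a simplified stand-in that follows the usual shape of linear warm-up followed by cosine decay; the schedule produced by `transformers` may differ in detail, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step: int, peak_lr: float, total_steps: int, warmup_steps: int) -> float:
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 100                             # steps_limit
warmup_steps = math.ceil(total_steps * 0.03)  # warmup_percentage of the run

for step in (0, 1, warmup_steps, 50, total_steps):
    print(f"step {step:>3}: lr = {cosine_lr(step, 2e-4, total_steps, warmup_steps):.2e}")
```

With only 100 steps, the warm-up phase lasts just a few steps before the learning rate starts to decay.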

## Step 6: Setting up all the supervised fine-tuning arguments hyperparameters for fine-tuning

In [None]:
enable_packing = False # packing combines multiple shorter sequences into a single training example
# to improve computational efficiency; it is disabled here
sequence_length_max = None # max sequence length for training; None falls back to the SFTTrainer default
device_assignment = {"": 0} # place the entire model on GPU 0
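
Packing is disabled in this notebook, but the idea is easy to show. The sketch below concatenates tokenized examples, separated by an end-of-sequence token, and cuts the stream into fixed-length chunks. It is a simplified illustration, not the TRL implementation; `pack_sequences` and the token ids are hypothetical.

```python
from typing import Iterable, List


def pack_sequences(sequences: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate sequences (each followed by eos_id) and split into chunks of seq_len."""
    buffer: List[int] = []
    for tokens in sequences:
        buffer.extend(tokens)
        buffer.append(eos_id)
    # Keep only full chunks; a trailing remainder shorter than seq_len is dropped.
    return [buffer[i:i + seq_len] for i in range(0, len(buffer) - seq_len + 1, seq_len)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, seq_len=4, eos_id=2)
print(packed)
```

Every packed chunk is full, so no compute is spent on padding tokens; the trade-off is that one chunk can contain pieces of several unrelated examples.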

## Step 7: Loading the dataset

In [None]:
training_data = load_dataset(formatted_dataset, split = "train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
training_data

Dataset({
    features: ['text'],
    num_rows: 9846
})

## Step 8: Defining the QLoRA configuration

In [None]:
dtype_computation = getattr(torch, compute_dtype_bnb)

bnb_setup = BitsAndBytesConfig(load_in_4bit = enable_4bit,
                               bnb_4bit_quant_type = quant_type_bnb,
                               bnb_4bit_use_double_quant = double_quant_flag,
                               bnb_4bit_compute_dtype = dtype_computation
                               )

## Step 9: Loading the pre-trained LLaMA 2 model

In [None]:
llama_model = AutoModelForCausalLM.from_pretrained(model_identifier, quantization_config = bnb_setup, device_map = device_assignment)
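llama_model.config.use_cache = False # disable the key/value cache, which is incompatible with gradient checkpointing
llama_model.config.pretraining_tp = 1 # tensor-parallel degree from pretraining; 1 uses the standard linear-layer computation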





## Step 10: Loading the pre-trained tokenizer for the LLaMA 2 model

In [None]:
llama_tokenizer = AutoTokenizer.from_pretrained(model_identifier, trust_remote_code = True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token # reuse the end-of-sequence token for padding
llama_tokenizer.padding_side = "right" # append padding after the real tokens
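
To make the effect of `padding_side = "right"` concrete, here is a small sketch that pads a batch of token-id lists by hand and builds the matching attention mask. It does not call the tokenizer; `pad_right` and the token ids are hypothetical.

```python
from typing import Dict, List


def pad_right(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_right([[1, 15, 16], [1, 17]], pad_id=2)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the pad token is the same id as the end-of-sequence token here, the attention mask is what distinguishes padding from real tokens.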

## Step 11: Setting up the configuration for the LoRA fine-tuning method

In [None]:
peft_setup = LoraConfig(lora_alpha = lora_hyper_alpha,
                        lora_dropout = lora_hyper_dropout,
                        r = lora_hyper_r,
                        bias = "none",
                        task_type = "CAUSAL_LM")
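
The following sketch shows what a LoRA layer computes: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added on top. It uses tiny hand-written matrices in plain Python and is an illustration of the technique, not the PEFT implementation; all names in it are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], scaling: float) -> List[float]:
    """Return W x + scaling * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


scaling = 16 / 64              # lora_hyper_alpha / lora_hyper_r
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
A = [[0.5, -0.5]]              # rank-1 down-projection (1 x 2)
B_init = [[0.0], [0.0]]        # up-projection starts at zero (2 x 1)
B_trained = [[1.0], [2.0]]     # made-up values after some training
x = [2.0, 4.0]

print(lora_forward(W, A, B_init, x, scaling))     # identical to W x
print(lora_forward(W, A, B_trained, x, scaling))  # W x plus the scaled low-rank update
```

Starting `B` at zero means the adapted model initially behaves exactly like the base model, and training only gradually moves it away.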

## Step 12: Creating a training configuration by setting the training parameters

In [None]:
train_args = TrainingArguments(output_dir = results_dir,
                               num_train_epochs = epochs_count,
                               per_device_train_batch_size = train_batch_size,
                               per_device_eval_batch_size = eval_batch_size,
                               gradient_accumulation_steps = gradient_accumulation_steps,
                               learning_rate = train_learning_rate,
                               weight_decay = decay_rate,
                               optim = optimizer_type,
                               save_steps = checkpoint_interval,
                               logging_steps = log_interval,
                               fp16 = enable_fp16,
                               bf16 = enable_bf16,
                               max_grad_norm = grad_norm_limit,
                               max_steps = steps_limit,
                               warmup_ratio = warmup_percentage,
                               group_by_length = length_grouping,
                               lr_scheduler_type = lr_scheduler_type,
                               gradient_checkpointing = checkpointing_flag
                               )

## Step 13: Creating the Supervised Fine-Tuning Trainer

In [None]:
llama_sftt_trainer = SFTTrainer(model = llama_model,
                                args = train_args,
                                train_dataset = training_data,
                                tokenizer = llama_tokenizer,
                                peft_config = peft_setup,
                                dataset_text_field = "text",
                                max_seq_length = sequence_length_max,
                                packing = enable_packing,
                                )



In [None]:
# # This is for debugging purposes

# for name, param in llama_model.named_parameters():
#     print(f"{name}: requires_grad={param.requires_grad}")
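
A more compact alternative to printing every parameter is to summarize how many parameters are trainable. The sketch below works on any iterable of `(name, parameter)` pairs whose parameters expose `numel()` and `requires_grad`, as PyTorch parameters do. `summarize_trainable` and the `FakeParam` stand-in are hypothetical; with 4-bit quantized weights the total may be understated, because quantized tensors can store packed values.

```python
from typing import Iterable, Tuple


def summarize_trainable(named_parameters: Iterable[tuple]) -> Tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable = total = 0
    for _name, param in named_parameters:
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total


class FakeParam:
    """Stand-in for a torch parameter so the sketch runs without a model."""

    def __init__(self, count: int, requires_grad: bool) -> None:
        self._count = count
        self.requires_grad = requires_grad

    def numel(self) -> int:
        return self._count


demo = [
    ("layer.weight", FakeParam(1000, False)),       # frozen base weight
    ("layer.lora_A.weight", FakeParam(64, True)),   # trainable adapter
    ("layer.lora_B.weight", FakeParam(64, True)),   # trainable adapter
]
trainable, total = summarize_trainable(demo)
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")
```

In the notebook, the same helper could be called as `summarize_trainable(llama_sftt_trainer.model.named_parameters())`.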

## Step 14: Training the model

In [None]:
llama_sftt_trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
|------|---------------|
| 25   | 1.4478        |
| 50   | 1.5807        |
| 75   | 1.1504        |
| 100  | 1.4692        |


TrainOutput(global_step=100, training_loss=1.4120229530334472, metrics={'train_runtime': 95.0242, 'train_samples_per_second': 4.209, 'train_steps_per_second': 1.052, 'total_flos': 3352777517137920.0, 'train_loss': 1.4120229530334472, 'epoch': 0.04})
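
The `'epoch': 0.04` in the output shows that the run covered only a small part of the dataset, even though `epochs_count` was set to 10: training stopped at `max_steps = 100`. The reported metrics can be reproduced from the numbers in this notebook, as the sketch below shows. It assumes a single GPU, so the per-device batch size equals the global batch size.

```python
steps = 100          # global_step in TrainOutput
batch_size = 4       # train_batch_size
grad_accum = 1       # gradient_accumulation_steps
num_rows = 9846      # rows in the training split
runtime_s = 95.0242  # train_runtime in TrainOutput

samples_seen = steps * batch_size * grad_accum
epoch_fraction = samples_seen / num_rows

print(f"samples seen      : {samples_seen}")
print(f"fraction of epoch : {epoch_fraction:.2f}")
print(f"samples per second: {samples_seen / runtime_s:.3f}")
print(f"steps per second  : {steps / runtime_s:.3f}")
```

Covering the full dataset once at this batch size would take roughly 2,460 steps, so the loss values logged above reflect a very short run.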

## Step 15: Chatting with the model

In [None]:
# # Version 1: Medical Knowledge

# user_prompt = "Please tell me about Bursitis"
# text_generation_pipeline = pipeline(task = "text-generation", model = llama_model, tokenizer = llama_tokenizer, max_length = 300)
# generation_result = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]")
# print(generation_result[0]['generated_text'])

In [None]:
# Version 2: Languages

# user_prompt = "How can we solve malapportionment in Malaysia?"
user_prompt = "Is the FPTP electoral system a reason for malapportionment in Malaysia?"
text_generation_pipeline = pipeline(task = "text-generation", model = llama_model, tokenizer = llama_tokenizer, max_length = 300)
generation_result = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]")
print(generation_result[0]['generated_text'])

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


<s>[INST] Is the FPTP electoral system a reason for malapportionment in Malaysia? [/INST] The FPTP electoral system is one of the reasons for malapportionment in Malaysia.

Malapportionment is a phenomenon where the electoral system distributes seats in a legislative body unevenly, resulting in some constituencies having more representatives than others. In Malaysia, the FPTP system has led to malapportionment due to the unequal distribution of voters across constituencies.

The FPTP system is based on a first-past-the-post system, where the candidate with the most votes wins the seat. However, this system can lead to distortions in the distribution of seats, particularly in cases where there are large disparities in the number of voters across constituencies. In Malaysia, the FPTP system has resulted in a situation where some constituencies have a much larger number of voters than others, leading to malapportionment.

Malapportionment can have significant consequences for democracy, a
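
The printed text repeats the prompt and stops mid-sentence. The cut-off is most likely caused by `max_length = 300`, which in `transformers` limits the total number of tokens, prompt included. To show only the model's reply, the prompt can be stripped off after generation. A minimal sketch, with a hypothetical helper `extract_answer`:

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    return generated_text.rpartition(marker)[2].strip()


sample = "<s>[INST] Is the sky blue? [/INST] Yes, on a clear day."
print(extract_answer(sample))
```

In the notebook, this could be applied as `extract_answer(generation_result[0]['generated_text'])`.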

In [None]:
torch.cuda.empty_cache()

## Notes: Efficient GPU usage

Manage GPU usage to avoid consuming unnecessary resources and to limit session time.

1. Free GPU memory: after using the GPU, release cached memory that is no longer in use to reduce the chance of out-of-memory errors:

   `torch.cuda.empty_cache()`

2. Stop the runtime when done: if you are finished with Colab for now, stop the runtime so it does not keep using resources. Go to Runtime > Manage sessions and terminate any active sessions.
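
`torch.cuda.empty_cache()` can only return memory that is no longer referenced, so it helps to drop references to large objects and run the garbage collector first. A minimal sketch, with a hypothetical helper `release_gpu_memory` that also works when PyTorch or a GPU is not available:

```python
import gc


def release_gpu_memory() -> bool:
    """Collect garbage, then empty the CUDA cache. Returns True if the cache was emptied."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return False  # no GPU to clean up
    torch.cuda.empty_cache()
    return True


# In the notebook, delete large objects first, for example:
# del llama_sftt_trainer, llama_model
print("CUDA cache emptied:", release_gpu_memory())
```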