<a href="https://colab.research.google.com/github/arafatDU/simple-llm/blob/main/fine_tune_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Is Fine-Tuning an LLM?

Fine-tuning an LLM means further training a pre-trained model, which has already learned patterns and features from a large dataset, on a smaller, domain-specific dataset.

*The key steps in LLM fine-tuning are:*

1. Select a pre-trained model

2. Gather a relevant dataset

3. Preprocess the dataset

4. Fine-tune the model

5. Adapt the model to the specific task

### Fine-tuning methods
- Full fine-tuning (for example, instruction fine-tuning), which updates all of the model's weights
- Parameter-Efficient Fine-Tuning (PEFT), with methods such as LoRA and QLoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating every weight in a weight matrix of the pre-trained large language model, it keeps that matrix frozen and trains two much smaller matrices whose product approximates the update to the larger matrix. These two matrices constitute the LoRA adapter. The fine-tuned adapter is then loaded into the pre-trained model and used for inference.
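The low-rank idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes `d`, `k`, and `r` are made up, the weights are random stand-ins, the helpers `matmul` and `add` are hypothetical, and the scaling factor that LoRA implementations usually apply to the adapter product is left out.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

random.seed(0)
d, k, r = 8, 8, 2  # illustrative sizes: W is d x k, the adapter rank is r

# Frozen pre-trained weight matrix (random stand-in values).
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]

# LoRA adapter: B (d x r) starts at zero and A (r x k) starts random,
# so the adapted weight equals W before any training step.
B = [[0.0] * r for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]

# Effective weight used at inference: W + B @ A.
W_adapted = add(W, matmul(B, A))

full_params = d * k        # values trained by full fine-tuning
lora_params = r * (d + k)  # values trained by the adapter
print(full_params, lora_params)
```

With these toy sizes the adapter trains half as many values as the full matrix. The saving grows quickly for realistic layer sizes, because `r` stays small while `d` and `k` are in the thousands.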

In [1]:
!pip install --upgrade datasets transformers huggingface_hub



In [2]:
%pip install -U datasets




# Load the Dataset

In [3]:
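from datasets import load_dataset
from google.colab import userdata  # Colab-only helper for reading notebook secrets

# Read the Hugging Face token stored in the Colab secrets panel.
HF_TOKEN = userdata.get("HF_TOKEN")
# Load only the first 1% of the IMDB training split (250 of 25,000 reviews).
dataset = load_dataset("imdb", split="train[:1%]", token=HF_TOKEN)
print(dataset[0])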


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Dataset preprocessing

In [4]:
def preprocessing(batch):
    # Replace newlines with a space so adjacent words are not fused together.
    batch['text'] = [text.replace('\n', ' ') for text in batch['text']]
    return batch

dataset = dataset.map(preprocessing, batched=True)

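The sample review printed earlier contains HTML line breaks (`<br />`), which a newline-only cleanup does not touch. A minimal sketch of a slightly broader cleanup follows. The helper name `clean_review` is hypothetical and not part of the notebook, and the regular expression only targets `<br>` variants, not arbitrary HTML.

```python
import re

def clean_review(text):
    """Replace HTML line breaks and newlines with spaces, then collapse whitespace."""
    text = re.sub(r"<br\s*/?>", " ", text)
    text = text.replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "I really had to see this for myself.<br /><br />"
    "The plot is centered around a young Swedish drama student"
)
print(clean_review(sample))
```

A function like this could replace the body of the list comprehension inside a batched `map` call, since it works on one string at a time.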

# Initialize the model and Tokenizer

In [5]:
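from transformers import AutoTokenizer, AutoModelForCausalLM

# distilgpt2 is a small GPT-2 variant, quick to fine-tune on a Colab GPU.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 tokenizers ship without a padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token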


In [6]:
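def tokenizing_function(examples):
    # Pad or truncate every review to exactly 128 tokens.
    tokenized = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)
    # For causal language modeling the labels are a copy of the input ids
    # (the model shifts them internally to predict the next token).
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized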

tokenized_data = dataset.map(tokenizing_function, batched=True)

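Because every review is padded to a fixed length and the labels are a plain copy of the input ids, padded positions also contribute to the loss. One common refinement, not used in this notebook, is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. A minimal sketch on a toy batch, where `mask_padding` is a hypothetical helper and the non-pad token ids are made up:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    """Copy input ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: 50256 is the GPT-2 EOS id, reused here as the pad token.
input_ids = [[464, 8674, 50256, 50256], [40, 26399, 314, 3001]]
attention_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]

labels = mask_padding(input_ids, attention_mask)
print(labels)
```

The attention mask is used rather than a comparison against the pad id, so that a genuine end-of-sequence token inside the text would still be learned.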

# Fine Tuning the Model

In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1
)

# Splitting the Dataset - Train & Eval

In [8]:
# Shuffle once (fixed seed) so the train and eval subsets cannot overlap.
shuffled = tokenized_data.shuffle(seed=42)
train_data = shuffled.select(range(int(0.8 * len(shuffled))))
eval_data = shuffled.select(range(int(0.8 * len(shuffled)), len(shuffled)))
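When a split is made by position, the train and eval slices are only guaranteed to be disjoint if both are cut from the same shuffled order. The sketch below uses plain Python indices instead of `datasets` calls to compare shuffling separately for each subset with shuffling once. The seed and variable names are illustrative assumptions.

```python
import random

n = 250               # number of examples in the 1% IMDB subset
split = int(0.8 * n)  # 200 train, 50 eval

rng = random.Random(0)

# Two independent shuffles: the eval slice can contain training examples.
first = rng.sample(range(n), n)
second = rng.sample(range(n), n)
leaky_overlap = set(first[:split]) & set(second[split:])

# One shuffle, sliced twice: the two subsets are disjoint by construction.
order = rng.sample(range(n), n)
clean_overlap = set(order[:split]) & set(order[split:])

print(len(leaky_overlap), len(clean_overlap))
```

With independent shuffles, each eval example has roughly an 80% chance of also being in the training slice, which makes the validation loss look better than it really is.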

# Setup Trainer

In [9]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

# Run the Fine-Tuning

In [10]:
trainer.train()





`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 4.0896        | 3.706478        |


TrainOutput(global_step=50, training_loss=4.048168258666992, metrics={'train_runtime': 48.9467, 'train_samples_per_second': 4.086, 'train_steps_per_second': 1.022, 'total_flos': 6532418764800.0, 'train_loss': 4.048168258666992, 'epoch': 1.0})
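Two of the reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes the loss is the mean per-token cross-entropy in nats, so its exponential is the perplexity, and that training ran on a single device without gradient accumulation.

```python
import math

eval_loss = 3.706478  # validation loss reported for epoch 1
perplexity = math.exp(eval_loss)

train_examples = int(0.8 * 250)  # size of the training subset
batch_size = 4                   # per_device_train_batch_size
steps_per_epoch = math.ceil(train_examples / batch_size)

print(f"perplexity ~ {perplexity:.1f}, steps per epoch = {steps_per_epoch}")
```

The step count matches `global_step=50` in the training output. The perplexity is only a rough indicator here, since padded positions are included in the loss.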

In [11]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

# Testing the fine-tuned model

In [14]:
prompt = "The actor "

# Move the inputs to the same device as the model (the Trainer may have placed it on the GPU).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Pass the attention mask and pad token id explicitly to avoid generation warnings.
output = model.generate(**inputs, max_length=15, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))


The actor iced the film with a lot of fun, but it was
