Fine tuning is taking a pre trained model and training at least one model parameter. A smaller fine tuned model can outperform a larger base model.

3 ways to fine tune a model:
1. Self supervised learning where you curate the training corpus to align with the application
2. Supervised Learning : Where we manually supervise the model manually.
3. Reinforcement Learning : It consists of 3 steps : 
- Supervised Finetuning where we supervise the training dataset and then train the model 
- Train reward model where we insert a prompt and then use human interaction to rank the results.
- Reinforcement Learnign with the favourite algorithm. Where we pass a prompt into supervised finetuned model and then pass the copmpletion to the reward model and the reward model provies feedback to the supervised fine tuned model

Supervised Fine-tuning in 5 steps:
1. Choose fie-tuning task : summarization,text completion,binary classification
2. Prepare training dataset : Have input output pairs of respective tasks
3. Choose base model 
4. Fine tune model via supervised learning
5. Evaluate model performance 

3 options for parameter training:
1. Retrain all parameters : which has a downside when there are billion parameters can take up a lot of computation
2. Transfer learning : Where we freeze most os the parameters and finetune the last few layers.
3. Parameter Efficienct Fine-tuning : We freeze all the weights and augment the model with additional parameters which have been trained 

One of the ways to performe PEFT is Low Rank Adaptation(LoRA):
Fine tunes models by adding new trained parameters. This is better than the first 2 methods because the new parameters which are added are far less in number than consiering all the parameters combined.

### Fine-tuning an LLM using LLoRa

In [3]:
from datasets import load_dataset, DatasetDict, Dataset

In [2]:
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

In [5]:
import evaluate
import torch
import numpy as np

In [6]:
#Base Model
model_checkpoint = 'distilbert-base-uncased'

#Define label maps
id2label = {0 : "Negative",1:"Positive"}
label2id = {"Negative" : 0,"Positive" : 1}

#generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(id2label),
    id2label = id2label,
    label2id = label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
dataset = load_dataset("shawhin/imdb-truncated")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

In [8]:
#Create Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,add_prefix_space = True)

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 24.9kB/s]


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.74MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 2.34MB/s]


In [11]:
#Create tokenizer function
def tokenize_function(examples):
    #extract text
    text = examples["text"]

    #Tokenize and truncate
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors = "np",
        truncation= True,
        max_length = 512
    )
    return tokenized_inputs

#Add pad if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token':['PAD']})
    model.resize_token_embeddings(len(tokenizer))

#tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map: 100%|██████████| 1000/1000 [00:01<00:00, 955.94 examples/s]
Map: 100%|██████████| 1000/1000 [00:01<00:00, 995.39 examples/s] 


DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [12]:
#Data Collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

In [14]:
#Evaluation Metric
accuracy = evaluate.load("accuracy")

#Define an evaluation function to pass in the trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions,axis=1)

    return {"accuracy":accuracy.compute(predictions=predictions,
                                        references=labels)}

In [17]:
#Define List of example
text_list = ["It was good","not a fan,don't recommend",
             "better than the first one","this is not worth watching even once",
             "This one is a pass"]

print("untrained model predictions")
print("<-------------------------------------------->")

for text in text_list:
    #tokenize inputs 
    inputs = tokenizer.encode(text,return_tensors = "pt")
    #compute logits
    logits = model(inputs).logits
    #Convert logits to labels
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])


untrained model predictions
<-------------------------------------------->


It was good - Negative
not a fan,don't recommend - Negative
better than the first one - Negative
this is not worth watching even once - Negative
This one is a pass - Negative


### Finetuning with LloRa

In [18]:
peft_config = LoraConfig(task_type = "SQL_CLS", #sequencing classification
                         r = 4, #Intrinsic value of trainable weight matrix
                         lora_alpha=32,#Learning rate
                         lora_dropout=0.01,#Probability of a dropout
                         target_modules=['q_lin']) ##We apply llora to query layer

model = get_peft_model(model,peft_config)
model.print_trainable_parameters()

trainable params: 36,864 || all params: 66,991,874 || trainable%: 0.05502756946312623


In [19]:
#Hyperparameters
lr = 1e-3 #Number of optimization steps
batch_size = 4 #Number of examples processed per optimization step
num_epochs = 10 #Number of times model runs through the parameters

In [20]:
#Defining training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

In [22]:
#Create trainer object
trainer = Trainer(
    model = model,#our peft model
    args=training_args,#hyperparameters
    train_dataset=tokenized_dataset["train"],#training dataset
    eval_dataset=tokenized_dataset["validation"],#Validation data
    tokenizer = tokenizer,
    data_collator = data_collator,#this is will dynamically pad examples
    compute_metrics=compute_metrics
)

#train model
trainer.train()

  0%|          | 4/2500 [01:36<14:53:57, 21.49s/it]

KeyboardInterrupt: 