# Lightweight Fine-Tuning Project


* PEFT technique: Parameter-Efficient Fine-Tuning (PEFT) adapts large pretrained models to downstream applications without fine-tuning all of a model's parameters, which is prohibitively costly. PEFT is also the name of the library used in this project. The technique implemented here is LoRA (Low-Rank Adaptation of Large Language Models), a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those.


* Model: The chosen model is DistilBERT (`distilbert-base-uncased`), a successful language model that uses the attention mechanism to improve its performance.


* Evaluation approach: Since this is a classification task, I monitor accuracy: the number of correctly classified samples divided by the total number of samples. I also monitor the categorical cross-entropy loss, a function that captures the discrepancy between the true labels and the predictions.


* Fine-tuning dataset: I chose a sentiment analysis dataset from the financial industry called Auditor Sentiment. The data can be found [here](https://huggingface.co/datasets/FinanceInc/auditor_sentiment) and has the following description: ***Auditor sentiment dataset of sentences from financial news. The dataset consists of several thousand sentences from English language financial news categorized by sentiment.***
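
The LoRA idea described above can be illustrated with a small NumPy sketch. It is not the PEFT library's implementation: the weight values are random stand-ins, `lora_forward` is a hypothetical helper, and the sizes (hidden size 768, rank 16, scaling 32) are simply the ones that appear later in this notebook. The frozen weight `W` is left untouched and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes assumed for illustration: hidden size, LoRA rank, LoRA scaling.
d, r, alpha = 768, 16, 32

W = rng.normal(size=(d, d))          # frozen pretrained weight (stand-in values)
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random initialization
B = np.zeros((d, r))                 # trainable, zero initialization


def lora_forward(x):
    """Frozen projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))

# Because B starts at zero, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 589824 24576
```

For a single 768 x 768 projection, the low-rank pair holds about 4% of the parameters of the full matrix, which is where the savings come from.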

## Loading and Evaluating a Foundation Model


In [1]:
from datasets import load_dataset

# load sentiment analysis financial data
dataset = load_dataset(
    "FinanceInc/auditor_sentiment", 
    split="train").train_test_split(
        test_size=0.2, 
        shuffle=True, 
        seed=23
)

splits = ["train", "test"]

# View the dataset characteristics
dataset["train"]


Dataset({
    features: ['sentence', 'label'],
    num_rows: 3101
})

There are two different sets of data:

In [2]:
dataset.keys()

dict_keys(['train', 'test'])

Data sizes for training and testing:

In [3]:
print(f"Training set dimensions: {dataset['train'].shape[0]} rows and {dataset['train'].shape[1]} columns.")
print(f"Testing set dimensions: {dataset['test'].shape[0]} rows and {dataset['test'].shape[1]} columns.")

Training set dimensions: 3101 rows and 2 columns.
Testing set dimensions: 776 rows and 2 columns.


Data type:

In [4]:
type(dataset["train"])

datasets.arrow_dataset.Dataset

Checking some samples:

In [5]:
for i in range(10):
    print(f"Sentence: {dataset['train'][i]['sentence']}")
    print(f"Label: {dataset['train'][i]['label']}")
    print("_"*90)

Sentence: ---------------------------------------------------------------------- -------------- Munich , 14 January 2008 : BAVARIA Industriekapital AG closed the acquisition of Elcoteq Communications Technology GmbH in Offenburg , Germany , with the approval of the
Label: 1
__________________________________________________________________________________________
Sentence: However , sales volumes in the food industry are expected to remain at relatively good levels in Finland and in Scandinavia , Atria said .
Label: 2
__________________________________________________________________________________________
Sentence: The optimization of the steel components heating process will reduce the energy consumption .
Label: 2
__________________________________________________________________________________________
Sentence: Each share is entitled to one vote .
Label: 1
__________________________________________________________________________________________
Sentence: - Net sales for the peri

### Data tokenization

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Let's use a lambda function to tokenize all the examples
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sentence"], truncation=True), batched=True
    )

# Inspect the available columns in the dataset
tokenized_dataset["train"]

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 54.1kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 4.46MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 23.0MB/s]
Map: 100%|██████████| 3101/3101 [00:00<00:00, 6326.05 examples/s]
Map: 100%|██████████| 776/776 [00:00<00:00, 6084.22 examples/s]


Dataset({
    features: ['sentence', 'label', 'input_ids', 'attention_mask'],
    num_rows: 3101
})
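
The tokenizer call above truncates but does not pad, so the `input_ids` lists have different lengths. Padding is deferred to the `DataCollatorWithPadding` passed to the trainer later, which pads each batch to its longest sequence. The plain-Python sketch below illustrates that idea only; `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is taken from `padding_idx=0` in the printed model.

```python
def pad_batch(batch, pad_id=0):
    """Pad variable-length token id lists to the longest list in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, not real tokenizer output.
batch = [[101, 2169, 102], [101, 2174, 1010, 4341, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])       # [[101, 2169, 102, 0, 0], [101, 2174, 1010, 4341, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.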

In [7]:
from transformers import AutoModelForSequenceClassification

# label: a label corresponding to the class as a string: 'positive' - (2), 'neutral' - (1), or 'negative' - (0)

id2label_dict = {
    2:'positive', 
    1: 'neutral', 
    0: 'negative'
} 

label2id_dict = {
    'positive':2, 
    'neutral':1, 
    'negative':0
} 


model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    id2label = id2label_dict,
    label2id = label2id_dict,
)

    
# Freeze the base model so that only the classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [9]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# set the number of epochs in the experiment
NUM_EPOCHS = 20



# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16, 
        per_device_eval_batch_size=16, 
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.864301 | 0.592784 |
| 2 | No log | 0.81469 | 0.595361 |
| 3 | 0.878400 | 0.770446 | 0.639175 |
| 4 | 0.878400 | 0.742673 | 0.650773 |
| 5 | 0.878400 | 0.71631 | 0.661082 |
| 6 | 0.767200 | 0.696606 | 0.677835 |
| 7 | 0.767200 | 0.68471 | 0.685567 |
| 8 | 0.710500 | 0.671494 | 0.693299 |
| 9 | 0.710500 | 0.665635 | 0.698454 |
| 10 | 0.710500 | 0.652384 | 0.706186 |


TrainOutput(global_step=3880, training_loss=0.7056359753166277, metrics={'train_runtime': 174.8357, 'train_samples_per_second': 354.733, 'train_steps_per_second': 22.192, 'total_flos': 952228605710790.0, 'train_loss': 0.7056359753166277, 'epoch': 20.0})

In [13]:
trainer.evaluate()

{'eval_loss': 0.6169623732566833,
 'eval_accuracy': 0.7152061855670103,
 'eval_runtime': 1.4505,
 'eval_samples_per_second': 534.97,
 'eval_steps_per_second': 33.78,
 'epoch': 20.0}
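
The two monitored metrics can be computed directly from a model's logits. The sketch below is a minimal NumPy version under stated assumptions: the logits and labels are made up, and `softmax`, `cross_entropy` and `accuracy` are illustrative helpers rather than the functions the `Trainer` calls internally. As a reference point, a three-class classifier that always predicts uniform probabilities has a cross-entropy of ln 3, roughly 1.0986.

```python
import numpy as np


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())


def accuracy(logits, labels):
    """Fraction of rows whose highest logit matches the true class."""
    return float((logits.argmax(axis=1) == labels).mean())


# Made-up logits for four sentences and three classes
# (0 = negative, 1 = neutral, 2 = positive).
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 2.2],
                   [1.0, 1.2, 0.9]])
labels = np.array([0, 1, 2, 2])

print(accuracy(logits, labels))  # 0.75
print(cross_entropy(logits, labels))
```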

## Performing Parameter-Efficient Fine-Tuning


In [14]:
from peft import LoraConfig, get_peft_model 

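config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling factor applied to the LoRA update
    target_modules=["q_lin", "k_lin", "v_lin"],  # attention query, key and value projections
    lora_dropout=0.05,
    bias="none",
    # NOTE: "SEQ_CLS" is the PEFT task type for sequence classification;
    # the outputs recorded below were produced with "CAUSAL_LM".
    task_type="CAUSAL_LM",
)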

            
lora_model = get_peft_model(model, config)

In [15]:
lora_model.print_trainable_parameters()

trainable params: 442,368 || all params: 67,398,147 || trainable%: 0.6563503889802786
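
The reported count can be checked by hand. Each adapted linear layer gets two matrices, one of shape `r x in_features` and one of shape `out_features x r`. The arithmetic below assumes the sizes shown in the printed model (768 features, 6 transformer blocks) and the three target modules from the `LoraConfig`.

```python
hidden = 768            # in/out features of q_lin, k_lin and v_lin
rank = 16               # r in LoraConfig
targets_per_layer = 3   # q_lin, k_lin, v_lin
layers = 6              # number of TransformerBlock modules

# A has shape (rank, hidden) and B has shape (hidden, rank).
per_module = rank * hidden + hidden * rank
trainable = per_module * targets_per_layer * layers

all_params = 67_398_147  # total reported by print_trainable_parameters()

print(trainable)                     # 442368
print(100 * trainable / all_params)  # about 0.656 percent
```

The result matches the 442,368 trainable parameters reported above, which suggests that only the LoRA matrices are being trained in this run.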


In [16]:
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16, 
        per_device_eval_batch_size=16, 
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

lora_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.555916 | 0.755155 |
| 2 | No log | 0.49789 | 0.795103 |
| 3 | 0.563200 | 0.461546 | 0.819588 |
| 4 | 0.563200 | 0.433667 | 0.829897 |
| 5 | 0.563200 | 0.418582 | 0.832474 |
| 6 | 0.477600 | 0.403687 | 0.837629 |
| 7 | 0.477600 | 0.400576 | 0.837629 |
| 8 | 0.424500 | 0.395267 | 0.83634 |
| 9 | 0.424500 | 0.385414 | 0.846649 |
| 10 | 0.424500 | 0.376046 | 0.847938 |


TrainOutput(global_step=3880, training_loss=0.41597716734581386, metrics={'train_runtime': 285.2961, 'train_samples_per_second': 217.388, 'train_steps_per_second': 13.6, 'total_flos': 961997139562950.0, 'train_loss': 0.41597716734581386, 'epoch': 20.0})

In [17]:
lora_trainer.evaluate()

{'eval_loss': 0.35795003175735474,
 'eval_accuracy': 0.8530927835051546,
 'eval_runtime': 1.5489,
 'eval_samples_per_second': 500.995,
 'eval_steps_per_second': 31.635,
 'epoch': 20.0}
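
The two evaluation results can be compared with a few lines of arithmetic. The numbers are copied from the `evaluate()` outputs above. One caveat when reading the comparison: as the code is written, the LoRA adapters are attached to the same `model` object whose classification head had already been trained in the first run, so the second run does not start from an untrained head.

```python
baseline = {"eval_loss": 0.6169623732566833, "eval_accuracy": 0.7152061855670103}
lora = {"eval_loss": 0.35795003175735474, "eval_accuracy": 0.8530927835051546}

test_size = 776  # rows in the test split

# Absolute accuracy gain in percentage points.
acc_gain_points = 100 * (lora["eval_accuracy"] - baseline["eval_accuracy"])

# Relative reduction in the error rate (1 - accuracy).
error_reduction = 1 - (1 - lora["eval_accuracy"]) / (1 - baseline["eval_accuracy"])

# Correctly classified test sentences implied by each accuracy.
correct_baseline = round(baseline["eval_accuracy"] * test_size)
correct_lora = round(lora["eval_accuracy"] * test_size)

print(f"{acc_gain_points:.2f}")        # 13.79
print(f"{error_reduction:.1%}")        # 48.4%
print(correct_baseline, correct_lora)  # 555 662
```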