<a href="https://colab.research.google.com/github/damayantinaik/Fine-tune-model/blob/main/Fine_Tuning_BERT_LLM_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Full Fine-tuning BERT LLM Model

In this project I'll build a **BERT-based text classifier** (specifically **DistilBERT**) using the [Hugging Face Transformers](https://huggingface.co/transformers/) library. I'll perform **full fine-tuning** on the **SMS Spam dataset**, loaded with the [datasets](https://huggingface.co/docs/datasets/) package, and evaluate the model's performance.

The **SMS Spam dataset (sms_spam)** contains text messages that are **labeled as either 'ham' (not spam) or 'spam'**, making it a **binary classification problem**. My goal is to fine-tune a DistilBERT model to accurately classify these messages.

# Load sms_spam Dataset
The dataset is available at:
https://huggingface.co/datasets/sms_spam

In [None]:
from datasets import load_dataset
data = load_dataset("sms_spam")
data

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})

In [None]:
# The sms_spam dataset only has a train split, so I'll use the train_test_split method to split it into train and test
dataset = data['train'].train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

splits = ['train', 'test']

In [None]:
# View the dataset characteristics
dataset['train']

Dataset({
    features: ['sms', 'label'],
    num_rows: 4459
})

In [None]:
dataset['test']

Dataset({
    features: ['sms', 'label'],
    num_rows: 1115
})
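The split sizes above follow directly from `test_size=0.2`. Here is a minimal pure-Python sketch of a seeded shuffle-and-split, assuming the test count is rounded up (which is consistent with the 4459/1115 sizes shown). `shuffle_split` is a hypothetical helper, not the `datasets` implementation, and its shuffle order will not match the library's.

```python
import math
import random

def shuffle_split(n_rows, test_size, seed):
    """Return (train_indices, test_indices) for a seeded random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle for a given seed
    n_test = math.ceil(n_rows * test_size)    # assumption: the test count rounds up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(5574, test_size=0.2, seed=23)
print(len(train_idx), len(test_idx))  # 4459 1115
```

Fixing the seed means the same rows land in the test set on every run, so accuracy numbers stay comparable between experiments.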

Let's look at the first example!

In [None]:
# Inspect the first example. Do you think this is spam or not?
dataset["train"][0]

{'sms': 'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n',
 'label': 1}

In [None]:
import pandas as pd
df = pd.DataFrame(dataset['train'][:])
df.head()

| | sms | label |
|---|---|---|
| 0 | Had your mobile 10 mths? Update to the latest ... | 1 |
| 1 | Like &lt;#&gt; , same question\n | 0 |
| 2 | Should I have picked up a receipt or something... | 0 |
| 3 | Lovely smell on this bus and it ain't tobacco.... | 0 |
| 4 | Ok...\n | 0 |


In [None]:
pd.set_option("display.max_colwidth", None)
df.head()

| | sms | label |
|---|---|---|
| 0 | Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n | 1 |
| 1 | Like &lt;#&gt; , same question\n | 0 |
| 2 | Should I have picked up a receipt or something earlier\n | 0 |
| 3 | Lovely smell on this bus and it ain't tobacco... \n | 0 |
| 4 | Ok...\n | 0 |


## Pre-process datasets

Now I'll tokenize the datasets, converting each message into the token IDs that the model takes as input.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Let's use a lambda function to tokenize all the examples
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sms"], truncation=True), batched=True
    )

# Inspect the available columns in the dataset
tokenized_dataset["train"]

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 4459
})

In [None]:
tokenized_dataset["train"]['input_ids']

Column([[101, 2018, 2115, 4684, 2184, 11047, 7898, 1029, 10651, 2000, 1996, 6745, 4950, 1013, 2678, 11640, 2005, 2489, 1012, 2562, 24471, 2168, 2193, 1010, 2131, 4469, 2489, 8117, 2015, 1013, 6981, 1012, 3793, 2748, 2005, 1037, 2655, 102], [101, 2066, 1004, 8318, 1025, 1001, 1004, 14181, 1025, 1010, 2168, 3160, 102], [101, 2323, 1045, 2031, 3856, 2039, 1037, 24306, 2030, 2242, 3041, 102], [101, 8403, 5437, 2006, 2023, 3902, 1998, 2009, 7110, 1005, 1056, 9098, 1012, 1012, 1012, 102], [101, 7929, 1012, 1012, 1012, 102]])

In [None]:
tokenized_dataset["train"]['attention_mask']

Column([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]])
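The tokenized messages have different lengths, so they cannot be stacked into one tensor as they are. During training, `DataCollatorWithPadding` pads each batch to its longest sequence. The sketch below shows the idea in plain Python, assuming right-padding with token ID 0 (DistilBERT's `[PAD]` ID); `pad_batch` is a hypothetical helper for illustration, not the collator's actual code.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two of the tokenized messages shown above (12 and 6 tokens long)
batch = pad_batch([
    [101, 2323, 1045, 2031, 3856, 2039, 1037, 24306, 2030, 2242, 3041, 102],
    [101, 7929, 1012, 1012, 1012, 102],
])
print(batch["input_ids"][1])       # [101, 7929, 1012, 1012, 1012, 102, 0, 0, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

Padding per batch rather than to a fixed global length keeps batches of short messages small, which matters for SMS text where most inputs are only a few tokens long.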

## Load and set up the model

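In this case I am doing full fine-tuning, so every parameter must be trainable. Parameters loaded with `from_pretrained` are trainable by default; the loop below sets `requires_grad = True` explicitly to make that visible.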

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "not spam", 1: "spam"},
    label2id={"not spam": 0, "spam": 1},
)

# Unfreezing all the model parameters.
for param in model.parameters():
    param.requires_grad = True

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
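The warning above lists the four tensors of the classification head that start from random values. As a rough sketch of how small that head is, the arithmetic below counts its parameters, assuming `pre_classifier` maps 768 to 768 features and `classifier` maps 768 to `num_labels` (my understanding of the DistilBERT head; the layer shapes are not shown in this notebook). `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

hidden_size = 768   # DistilBERT hidden size
num_labels = 2      # not spam / spam

pre_classifier = linear_params(hidden_size, hidden_size)  # assumed shape: 768 -> 768
classifier = linear_params(hidden_size, num_labels)       # assumed shape: 768 -> 2
new_head_params = pre_classifier + classifier
print(pre_classifier, classifier, new_head_params)  # 590592 1538 592130
```

Under these assumptions only about 0.6 million parameters are new; everything else starts from the pretrained checkpoint and is updated during full fine-tuning.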


In [None]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## Training the model

Now it's time to train the model. For this, I'll use the `Trainer` class.

First I'll define a function to compute the accuracy metric. Then I'll create the `Trainer` and fill in some of the training arguments.


In [None]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# The HuggingFace Trainer class handles the training and eval loop for PyTorch.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/spam_not_spam",
        # Set the learning rate
        learning_rate=2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        # Evaluate and save the model after each epoch
        eval_strategy="epoch",
        save_strategy="epoch",
        # Set the number of training epochs and the weight decay
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.050017 | 0.987444 |
| 2 | 0.054700 | 0.055146 | 0.986547 |


TrainOutput(global_step=558, training_loss=0.052490652889333744, metrics={'train_runtime': 3738.5866, 'train_samples_per_second': 2.385, 'train_steps_per_second': 0.149, 'total_flos': 144666559425588.0, 'train_loss': 0.052490652889333744, 'epoch': 2.0})
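The `global_step=558` value and the "No log" entry for epoch 1 can both be explained with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (`logging_steps` is not set in this notebook).

```python
import math

train_rows = 4459
batch_size = 16        # per_device_train_batch_size
epochs = 2             # num_train_epochs
logging_steps = 500    # assumption: Trainer default, not set in this notebook

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 279 558

# Epoch (1-based) in which the first training-loss log is written
first_log_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_log_epoch)  # 2
```

Since the first log only arrives at step 500, no training loss is available when epoch 1 ends at step 279. Setting a smaller `logging_steps` in `TrainingArguments` would give a value for every epoch.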

## Evaluate model

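To evaluate the model, I call the `evaluate` method on the trainer object. This runs the model on the test set and computes the metric defined in `compute_metrics`. Because `load_best_model_at_end=True`, the trainer reloads the best checkpoint after training (by default the one with the lowest validation loss), which is why the numbers below match epoch 1 rather than epoch 2.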

In [None]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.05001736804842949,
 'eval_accuracy': 0.9874439461883409,
 'eval_runtime': 140.576,
 'eval_samples_per_second': 7.932,
 'eval_steps_per_second': 0.498,
 'epoch': 2.0}
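To make the reported accuracy concrete, the sketch below runs the same `compute_metrics` function on a handful of made-up logits and then checks which count of correct predictions reproduces `eval_accuracy`. The logits and labels are invented for illustration; only the test-set size (1115) and the accuracy value come from the outputs above.

```python
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # pick the higher-scoring class
    return {"accuracy": (predictions == labels).mean()}

# Made-up logits for four messages; columns are [not spam, spam]
logits = np.array([[2.1, -1.3], [-0.4, 1.7], [0.9, 0.2], [-1.0, 0.5]])
labels = np.array([0, 1, 1, 1])
result = compute_metrics((logits, labels))
print(float(result["accuracy"]))  # 0.75 (the third message is misclassified)

# 1101 correct out of 1115 test messages reproduces the reported eval_accuracy
correct, total = 1101, 1115
print(correct / total, total - correct)  # about 0.98744, i.e. 14 errors
```

In other words, the evaluated checkpoint misclassifies 14 of the 1115 test messages.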

### View results

Let's look at a few examples.

In [None]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 292, 448, 487]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "sms": [item["sms"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show the full text in each cell
pd.set_option("display.max_colwidth", None)
df

| | sms | predictions | labels |
|---|---|---|---|
| 0 | Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n | 0 | 0 |
| 1 | Happy new years melody!\n | 0 | 0 |
| 2 | PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n | 1 | 1 |
| 3 | URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n | 1 | 1 |
| 4 | I had askd u a question some hours before. Its answer\n | 0 | 0 |
| 5 | SMS. ac JSco: Energy is high, but u may not know where 2channel it. 2day ur leadership skills r strong. Psychic? Reply ANS w/question. End? Reply END JSCO\n | 0 | 1 |
| 6 | Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,61,66,382. Ubi cres,ubi tech park.6ph for 1st 5wkg days.èn\n | 1 | 0 |
| 7 | Burger King - Wanna play footy at a top stadium? Get 2 Burger King before 1st Sept and go Large or Super with Coca-Cola and walk out a winner\n | 0 | 1 |
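Accuracy alone does not say which kind of mistake the model makes. As a small sketch, the code below counts the confusion-matrix cells for the eight reviewed messages, using the predictions and labels from the table above, and derives precision and recall for the spam class. These eight rows are a hand-picked sample, so the numbers are far lower than the test-set accuracy and only illustrate the calculation.

```python
# Predictions and labels copied from the review table (1 = spam, 0 = not spam)
predictions = [0, 0, 1, 1, 0, 0, 1, 0]
labels      = [0, 0, 1, 1, 0, 1, 0, 1]

pairs = list(zip(predictions, labels))
tp = sum(p == 1 and y == 1 for p, y in pairs)  # spam correctly flagged
fp = sum(p == 1 and y == 0 for p, y in pairs)  # legitimate message flagged as spam
fn = sum(p == 0 and y == 1 for p, y in pairs)  # spam that slipped through
tn = sum(p == 0 and y == 0 for p, y in pairs)  # legitimate message left alone

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)                         # 2 1 2 3
print(accuracy, round(precision, 3), recall)  # 0.625 0.667 0.5
```

The same counts could be computed over the full test set inside `compute_metrics` to see whether the remaining errors are mostly missed spam or mostly false alarms.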


### Conclusion

After full fine-tuning, the model reached about 98.7% accuracy on the held-out test set. In the manual review of eight selected test messages it classified five correctly and missed three: two spam messages were predicted as not spam and one legitimate message was flagged as spam. Further tuning may improve the accuracy.