# Sentiment Analysis using DistilBERT (Indonesia)
This notebook finetunes an Indonesian DistilBERT model (`cahya/distilbert-base-indonesian`) on a three-class sentiment analysis dataset. It runs on Google Colab with a T4 GPU.

## Install dependencies and import libraries

In [1]:
# !pip install transformers datasets

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
import random
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm

from transformers import DistilBertForSequenceClassification, DistilBertConfig, DistilBertTokenizer

from utils.metrics import document_sentiment_metrics_fn
from utils.data_utils import DocumentSentimentDataset, DocumentSentimentDataLoader

In [4]:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def metrics_to_string(metric_dict):
    string_list = []
    for key, value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key, value))
    return ' '.join(string_list)

In [5]:
# Set the random seed
# so that the finetuning run is reproducible
set_seed(2023)

In [6]:
# Define device: use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Import model and tokenizer

In [8]:
# Load Tokenizer and Config
tokenizer = DistilBertTokenizer.from_pretrained('cahya/distilbert-base-indonesian')
config = DistilBertConfig.from_pretrained('cahya/distilbert-base-indonesian')

# Instantiate model
model = DistilBertForSequenceClassification.from_pretrained(
    'cahya/distilbert-base-indonesian',
    num_labels = 3,
    output_attentions = False,
    output_hidden_states = False
).to(device)

Some weights of the model checkpoint at cahya/distilbert-base-indonesian were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at cahya/distilbert-base-indonesian and are newly initialized: ['pre_classifier.bias', 'classifier.we

## Import dataset

In [9]:
train_dataset_path = "/content/train_preprocess.tsv"
valid_dataset_path = "/content/valid_preprocess.tsv"
test_dataset_path = "/content/test_preprocess_masked_label.tsv"

In [10]:
train_dataset = DocumentSentimentDataset(train_dataset_path, tokenizer, lowercase=True)
valid_dataset = DocumentSentimentDataset(valid_dataset_path, tokenizer, lowercase=True)
test_dataset = DocumentSentimentDataset(test_dataset_path, tokenizer, lowercase=True)

train_loader = DocumentSentimentDataLoader(dataset=train_dataset, max_seq_len=512, batch_size=8, num_workers=2, shuffle=True)
valid_loader = DocumentSentimentDataLoader(dataset=valid_dataset, max_seq_len=512, batch_size=8, num_workers=2, shuffle=False)
test_loader = DocumentSentimentDataLoader(dataset=test_dataset, max_seq_len=512, batch_size=8, num_workers=2, shuffle=False)

In [11]:
w2i, i2w = DocumentSentimentDataset.LABEL2INDEX, DocumentSentimentDataset.INDEX2LABEL
print(w2i)
print(i2w)

{'positive': 0, 'neutral': 1, 'negative': 2}
{0: 'positive', 1: 'neutral', 2: 'negative'}


## Inference before finetuning

In [12]:
# English: "My heart is happy to see the wedding of my beautiful eldest daughter"
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (38.870%)


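The predicted label is correct (positive), but the probability is only about 39%, barely above the 33% chance level for three classes. This is expected because the classification head is newly initialized (see the warning printed when the model was loaded), so the model has to be finetuned before its predictions are meaningful.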
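
To see why a probability this close to one third carries little information, consider what softmax returns when the three logits are nearly equal, as one might expect from a newly initialized head. The sketch below uses plain Python and made-up logits; they are hypothetical values for illustration, not numbers taken from the model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: three equal scores, i.e. no preference for any class
uniform = softmax([0.0, 0.0, 0.0])

# Hypothetical logits: only a slight preference for class 0
slight = softmax([0.2, 0.0, -0.1])

print([round(p, 3) for p in uniform])
print([round(p, 3) for p in slight])
```

With equal logits every class gets one third of the probability mass, and a small difference between logits moves the top class only a few points above that.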

## Training
We finetune the model with the Adam optimizer at a learning rate of 3e-6 (0.000003) for 10 epochs.

In [16]:
# DistilBERT does not take token_type_ids as input,
# so the forward function needs a small modification

def forward_sequence_classification(model, batch_data, i2w, is_test=False, device='cpu', **kwargs):
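    # Unpack batch data (subwords, attention mask, labels)
    if len(batch_data) != 3:
        raise ValueError(f"Expected 3 items in batch_data, got {len(batch_data)}")
    subword_batch, mask_batch, label_batch = batch_data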

    # Prepare input & label
    subword_batch = torch.LongTensor(subword_batch)
    mask_batch = torch.FloatTensor(mask_batch)
    label_batch = torch.LongTensor(label_batch)

    if device == "cuda":
        subword_batch = subword_batch.cuda()
        mask_batch = mask_batch.cuda()
        label_batch = label_batch.cuda()

    # Forward model
    outputs = model(subword_batch, attention_mask=mask_batch, labels=label_batch)
    loss, logits = outputs[:2]

    # generate prediction & label list
    list_hyp = []
    list_label = []
    hyp = torch.topk(logits, 1)[1]
    for j in range(len(hyp)):
        list_hyp.append(i2w[hyp[j].item()])
        list_label.append(i2w[label_batch[j][0].item()])

    return loss, list_hyp, list_label

In [17]:
optimizer = optim.Adam(model.parameters(), lr=3e-6)
model = model.cuda()

In [18]:
import time

# Train
n_epochs = 10
start = time.time()
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Calculate metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))

    # Calculate train metric
    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))

    # Evaluate on validation
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Calculate total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Calculate evaluation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = document_sentiment_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))

stop = time.time()
print(f"\n\nTraining time: {stop - start}s")

(Epoch 1) TRAIN LOSS:0.3655 LR:0.00000300: 100%|██████████| 1375/1375 [01:48<00:00, 12.71it/s]


(Epoch 1) TRAIN LOSS:0.3655 ACC:0.86 F1:0.81 REC:0.79 PRE:0.85 LR:0.00000300


VALID LOSS:0.2135 ACC:0.92 F1:0.89 REC:0.88 PRE:0.89: 100%|██████████| 158/158 [00:06<00:00, 25.54it/s]


(Epoch 1) VALID LOSS:0.2135 ACC:0.92 F1:0.89 REC:0.88 PRE:0.89


(Epoch 2) TRAIN LOSS:0.1889 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.29it/s]


(Epoch 2) TRAIN LOSS:0.1889 ACC:0.94 F1:0.92 REC:0.91 PRE:0.92 LR:0.00000300


VALID LOSS:0.2020 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90: 100%|██████████| 158/158 [00:06<00:00, 22.91it/s]


(Epoch 2) VALID LOSS:0.2020 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90


(Epoch 3) TRAIN LOSS:0.1429 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.30it/s]


(Epoch 3) TRAIN LOSS:0.1429 ACC:0.95 F1:0.94 REC:0.93 PRE:0.95 LR:0.00000300


VALID LOSS:0.2032 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91: 100%|██████████| 158/158 [00:06<00:00, 22.94it/s]


(Epoch 3) VALID LOSS:0.2032 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91


(Epoch 4) TRAIN LOSS:0.1112 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.33it/s]


(Epoch 4) TRAIN LOSS:0.1112 ACC:0.97 F1:0.96 REC:0.95 PRE:0.96 LR:0.00000300


VALID LOSS:0.2134 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91: 100%|██████████| 158/158 [00:07<00:00, 22.21it/s]


(Epoch 4) VALID LOSS:0.2134 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91


(Epoch 5) TRAIN LOSS:0.0840 LR:0.00000300: 100%|██████████| 1375/1375 [01:42<00:00, 13.39it/s]


(Epoch 5) TRAIN LOSS:0.0840 ACC:0.97 F1:0.97 REC:0.96 PRE:0.97 LR:0.00000300


VALID LOSS:0.2314 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90: 100%|██████████| 158/158 [00:07<00:00, 21.29it/s]


(Epoch 5) VALID LOSS:0.2314 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90


(Epoch 6) TRAIN LOSS:0.0639 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.34it/s]


(Epoch 6) TRAIN LOSS:0.0639 ACC:0.98 F1:0.98 REC:0.97 PRE:0.98 LR:0.00000300


VALID LOSS:0.2512 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90: 100%|██████████| 158/158 [00:07<00:00, 21.92it/s]


(Epoch 6) VALID LOSS:0.2512 ACC:0.93 F1:0.90 REC:0.90 PRE:0.90


(Epoch 7) TRAIN LOSS:0.0422 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.32it/s]


(Epoch 7) TRAIN LOSS:0.0422 ACC:0.99 F1:0.98 REC:0.98 PRE:0.99 LR:0.00000300


VALID LOSS:0.2642 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92: 100%|██████████| 158/158 [00:07<00:00, 21.76it/s]


(Epoch 7) VALID LOSS:0.2642 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92


(Epoch 8) TRAIN LOSS:0.0333 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.34it/s]


(Epoch 8) TRAIN LOSS:0.0333 ACC:0.99 F1:0.99 REC:0.99 PRE:0.99 LR:0.00000300


VALID LOSS:0.2841 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91: 100%|██████████| 158/158 [00:07<00:00, 22.30it/s]


(Epoch 8) VALID LOSS:0.2841 ACC:0.93 F1:0.90 REC:0.89 PRE:0.91


(Epoch 9) TRAIN LOSS:0.0229 LR:0.00000300: 100%|██████████| 1375/1375 [01:43<00:00, 13.32it/s]


(Epoch 9) TRAIN LOSS:0.0229 ACC:0.99 F1:0.99 REC:0.99 PRE:0.99 LR:0.00000300


VALID LOSS:0.3439 ACC:0.92 F1:0.89 REC:0.89 PRE:0.88: 100%|██████████| 158/158 [00:06<00:00, 23.37it/s]


(Epoch 9) VALID LOSS:0.3439 ACC:0.92 F1:0.89 REC:0.89 PRE:0.88


(Epoch 10) TRAIN LOSS:0.0209 LR:0.00000300: 100%|██████████| 1375/1375 [01:44<00:00, 13.16it/s]


(Epoch 10) TRAIN LOSS:0.0209 ACC:0.99 F1:0.99 REC:0.99 PRE:0.99 LR:0.00000300


VALID LOSS:0.3380 ACC:0.92 F1:0.88 REC:0.89 PRE:0.88: 100%|██████████| 158/158 [00:06<00:00, 22.63it/s]


(Epoch 10) VALID LOSS:0.3380 ACC:0.92 F1:0.88 REC:0.89 PRE:0.88


Training time: 1110.1545906066895s
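
In the log above, the training loss keeps falling while the validation loss reaches its minimum early and then rises, which usually indicates overfitting. As a small illustration (not part of the original run), the snippet below picks the epoch with the lowest validation loss from the logged values; one option would be to save a checkpoint at that epoch rather than only after the final one.

```python
# Validation losses per epoch, copied from the training log above
valid_losses = [0.2135, 0.2020, 0.2032, 0.2134, 0.2314,
                0.2512, 0.2642, 0.2841, 0.3439, 0.3380]

# Epochs are 1-indexed in the log, so shift the list index by one
best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__) + 1
best_loss = valid_losses[best_epoch - 1]

print(f"Lowest validation loss: {best_loss:.4f} at epoch {best_epoch}")
```

This selects by validation loss only; choosing by validation F1 instead could point to a different epoch.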


In [19]:
model.eval()
torch.set_grad_enabled(False)

total_loss, total_correct, total_labels = 0, 0, 0
list_hyp, list_label = [], []

pbar = tqdm(test_loader, leave=True, total=len(test_loader))
for i, batch_data in enumerate(pbar):
    _, batch_hyp, _ = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
    list_hyp += batch_hyp

# Save prediction
df = pd.DataFrame({'label':list_hyp}).reset_index()
df.to_csv('pred.txt', index=False)

df.head()

100%|██████████| 63/63 [00:01<00:00, 59.68it/s]


| | index | label |
|---|---|---|
| 0 | 0 | negative |
| 1 | 1 | negative |
| 2 | 2 | negative |
| 3 | 3 | negative |
| 4 | 4 | negative |


## Inference after finetuning

In [20]:
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (97.018%)


After finetuning, the model classifies the sentiment correctly with a high probability (about 97%).

## Save the model
Save the model and tokenizer, then reload them to confirm that the saved copy works.

In [21]:
model.save_pretrained("/content/drive/MyDrive/Models/distilbert")
tokenizer.save_pretrained("/content/drive/MyDrive/Models/distilbert")

('/content/drive/MyDrive/Models/distilbert/tokenizer_config.json',
 '/content/drive/MyDrive/Models/distilbert/special_tokens_map.json',
 '/content/drive/MyDrive/Models/distilbert/vocab.txt',
 '/content/drive/MyDrive/Models/distilbert/added_tokens.json')

In [22]:
saved_model = DistilBertForSequenceClassification.from_pretrained("/content/drive/MyDrive/Models/distilbert")
saved_tokenizer = DistilBertTokenizer.from_pretrained("/content/drive/MyDrive/Models/distilbert")

text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
subwords = saved_tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(saved_model.device)

logits = saved_model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (97.019%)


## Performance

In [31]:
test_dataset_path = "/content/test_preprocess.tsv"
df_test = pd.read_table(test_dataset_path, header=None)
df_test.rename(columns={0: "text", 1: "label"}, inplace=True)
df_test.head()

| | text | label |
|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative |


In [32]:
# Reload the original (not finetuned) checkpoint as the baseline
# Load Tokenizer and Config
tokenizer = DistilBertTokenizer.from_pretrained('cahya/distilbert-base-indonesian')
config = DistilBertConfig.from_pretrained('cahya/distilbert-base-indonesian')

# Instantiate model
model = DistilBertForSequenceClassification.from_pretrained(
    'cahya/distilbert-base-indonesian',
    num_labels = 3,
    output_attentions = False,
    output_hidden_states = False
).to(device)

Some weights of the model checkpoint at cahya/distilbert-base-indonesian were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at cahya/distilbert-base-indonesian and are newly initialized: ['pre_classifier.bias', 'classifier.we

In [33]:
def infer(text):
  print(text)
  inputs = tokenizer.encode(text)
  inputs = torch.LongTensor(inputs).view(1, -1).to(model.device)

  logits = model(inputs)[0]
  label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
  return i2w[label]

In [34]:
df_test['ori_pred'] = df_test['text'].apply(infer)
df_test.head()

kemarin gue datang ke tempat makan baru yang ada di dago atas . gue kira makanan nya enak karena harga nya mahal . ternyata , boro-boro . tidak mau lagi deh ke tempat itu . sudah mana tempat nya juga tidak nyaman banget , terlalu sempit .
kayak nya sih gue tidak akan mau balik lagi ke tempat itu . gila , ya , gue enggak ngerti kenapa tempat nya dibiarkan panas . sudah begitu kotor pula . kalau panas kepanasan , kalau hujan kehujanan . harus nya sih tidak ada restoran yang kayak gitu . tidak tahu deh apa yang mereka jual .
kalau dipikir-pikir , sebenarnya tidak ada yang bisa dibanggakan dari jokowi . pertama , dia tidak bisa nepatin janji . kedua , kerjaan nya selalu pencitraan . ketiga , dia tidak pro rakyat . sudahlah . ku sudah terlanjur kecewa .
ini pertama kalinya gua ke bank buat ngurusin pembuatan rekening baru . nama nya juga orang pertama kali ya baru ke bank , gua kena semprot . kelihatan banget pelayanan pelanggan - nya tidak suka gua banyak bertanya . amit-amit . padahal itu

| | text | label | ori_pred |
|---|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative | neutral |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative | neutral |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative | neutral |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative | neutral |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative | neutral |


In [35]:
ft_model = DistilBertForSequenceClassification.from_pretrained("/content/drive/MyDrive/Models/distilbert/")
ft_tokenizer = DistilBertTokenizer.from_pretrained("/content/drive/MyDrive/Models/distilbert/")

In [36]:
def infer_ft(text):
  print(text)
  inputs = ft_tokenizer.encode(text)
  inputs = torch.LongTensor(inputs).view(1, -1).to(ft_model.device)

  logits = ft_model(inputs)[0]
  label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
  return i2w[label]

In [37]:
df_test['ft_pred'] = df_test['text'].apply(infer_ft)
df_test.head()


| | text | label | ori_pred | ft_pred |
|---|---|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative | neutral | negative |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative | neutral | negative |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative | neutral | negative |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative | neutral | negative |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative | neutral | negative |


In [40]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

d = {
    "Accuracy": [accuracy_score(df_test['label'], df_test['ori_pred']),
                 accuracy_score(df_test['label'], df_test['ft_pred']),],
    "Precision":[precision_score(df_test['label'], df_test['ori_pred'], average="macro"),
                 precision_score(df_test['label'], df_test['ft_pred'], average="macro")],
    "Recall":   [recall_score(df_test['label'], df_test['ori_pred'], average="macro"),
                 recall_score(df_test['label'], df_test['ft_pred'], average="macro")],
    "F1":       [f1_score(df_test['label'], df_test['ori_pred'], average="macro"),
                 f1_score(df_test['label'], df_test['ft_pred'], average="macro")]
}

df_comp = pd.DataFrame.from_dict(d)
df_comp = df_comp.rename(index={0: 'Original BERT', 1: 'Finetuned BERT'})
# Keep the index so the row labels (model names) are written to the CSV
df_comp.to_csv('distilbert_performance.csv')
df_comp

| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Original BERT | 0.192 | 0.434157 | 0.330982 | 0.141172 |
| Finetuned BERT | 0.89 | 0.886538 | 0.854852 | 0.866431 |
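
The precision, recall, and F1 values in the table are computed with `average="macro"`, meaning each class contributes equally regardless of how many examples it has. The following is a minimal pure-Python sketch of macro F1 on toy labels (hypothetical data, not taken from the test set), written to mirror the idea rather than to replace the scikit-learn functions used above. It shows how a model that almost always predicts one class ends up with a very low macro F1.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Treat undefined ratios as 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Toy labels (hypothetical): the model predicts "neutral" for everything
y_true = ["negative", "negative", "negative", "positive", "neutral"]
y_pred = ["neutral"] * 5

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case only the "neutral" class gets a non-zero F1, so the macro average is pulled down by the two classes that are never predicted.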


The results above show that finetuning is necessary before applying the model to a specific task. The original checkpoint, whose classification head is newly initialized, reaches only 0.19 accuracy and 0.14 macro F1 on the test set, while the finetuned model reaches 0.89 and 0.87. Finetuning therefore improves both accuracy and macro F1 by roughly 0.7.
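
The size of the improvement can be checked directly from the comparison table. The short snippet below copies the accuracy and F1 values from that table and computes the absolute gain for each metric.

```python
# Scores copied from the comparison table above
original = {"Accuracy": 0.192, "F1": 0.141172}
finetuned = {"Accuracy": 0.89, "F1": 0.866431}

# Absolute improvement per metric
gains = {name: finetuned[name] - original[name] for name in original}

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
```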