<a href="https://colab.research.google.com/github/billycemerson/ai-engineering-project/blob/main/01_core_llm/01_fine_tunning_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook documents a simple use of a Hugging Face model, fine-tuned for a specific task.

#### Library & Package

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Data

A simple dataset for Indonesian sentiment analysis.

Link: https://huggingface.co/datasets/sepidmnorozy/Indonesian_sentiment

In [4]:
splits = {'train': 'train.csv', 'validation': 'dev.csv', 'test': 'test.csv'}
train_data = pd.read_csv("hf://datasets/sepidmnorozy/Indonesian_sentiment/" + splits["train"])
validation_data = pd.read_csv("hf://datasets/sepidmnorozy/Indonesian_sentiment/" + splits["validation"])
test_data = pd.read_csv("hf://datasets/sepidmnorozy/Indonesian_sentiment/" + splits["test"])

#### EDA

##### Data Summary

In [5]:
train_data.head()

Unnamed: 0,label,text
0,1,bubur ayam yang lumayan rekomendasi di sekitar...
1,1,menu bebek relatif jarang di bandung dan bebek...
2,1,hampir lebih 5 kali saya ke sini . dim sum nya...
3,1,tempat nya dekat dengan factory outlet jadi ha...
4,1,saya tidak sengaja menemuka warung bakso dan b...


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7926 entries, 0 to 7925
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   7926 non-null   int64 
 1   text    7926 non-null   object
dtypes: int64(1), object(1)
memory usage: 124.0+ KB


In [7]:
train_data.shape

(7926, 2)

The training split has 7,926 rows (about 7.9k).

##### Label Information

In [8]:
train_data['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,5129
0,2797


Positive examples outnumber negative ones by roughly 1.8x (5,129 vs. 2,797).

We need to balance the label distribution for better model training.

The simplest approach is undersampling: cut the majority label down so that both labels have the same count.
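As a rough, library-free illustration of the idea, the sketch below undersamples a toy list of `(label, text)` pairs so that every label keeps the same number of rows. The helper name `undersample` and the toy data are hypothetical and not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, n_per_label=None, seed=27):
    """Keep the same number of rows for every label (hypothetical helper)."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))
    # Default target: the size of the rarest class.
    if n_per_label is None:
        n_per_label = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_per_label))
    return balanced

# Toy data: 6 positive rows and 3 negative rows.
toy_rows = [(1, f"pos {i}") for i in range(6)] + [(0, f"neg {i}") for i in range(3)]
balanced_rows = undersample(toy_rows)
label_counts = Counter(label for label, _ in balanced_rows)
print(label_counts)
```

The notebook itself keeps a fixed 2,500 rows per label, slightly fewer than the 2,797 negative rows available; passing `n_per_label=2500` to this sketch would mirror that choice.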

In [9]:
# Undersample each label to 2,500 rows (seeded so the sample is repeatable)
train_data_undersampled = train_data.groupby('label').sample(n=2500, random_state=27).reset_index(drop=True)



In [10]:
train_data_undersampled['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,2500
1,2500


The data is now balanced across the target classes.

We can continue the analysis.

##### Text Overview

In [11]:
# Maximum text length
train_data_undersampled['text'].str.len().max()

567

In [12]:
# Text with max length
long_text = max(train_data_undersampled['text'], key=len)
long_text

'terlalu pintar berkilah , merangkai kata , dan mencari alasan untuk membela diri . jika sudah ketahuan , si penipu biasanya sangat pintar berkilah , merangkai kata , mencari alasan demi alasan untuk membela diri , meyakinkan korban nya bahwa dia jujur , dia tidak bersalah , dan seterusnya . bagi yang mudah tertipu , dia mungkin masih percaya pada alasan-alasan berkilah tersebut . namun bagi orang yang bersikap kritis , mereka pasti melihat banyak hal aneh , bahkan kejanggalan dan saling berkontradiksi pada ucapan-ucapan dan alasan-alasan si pembohong tersebut .'

In [13]:
# Minimum text length
train_data_undersampled['text'].str.len().min()

3

In [14]:
# Text with min length
short_text = min(train_data_undersampled['text'], key=len)
short_text

'sok'

In [15]:
# Average text length
train_data_undersampled['text'].str.len().mean()

np.float64(185.4234)

Overall, the texts average about 185 characters, ranging from 3 to 567.

#### Data Preprocessing

Preprocessing is a crucial step in NLP tasks.

In general, NLP preprocessing includes:
- Tokenization
- Cleaning (lowercasing and noise removal)
- Normalization
- Stemming/lemmatization

These steps can strongly affect model results.

For BERT-style models, however, research has shown that preprocessing does not affect performance much.

So in this case, we only do some basic preprocessing (cleaning).

In [16]:
train_data_undersampled.head()

Unnamed: 0,label,text
0,0,demo partisan . menyesal gabung grup ini . bek...
1,0,saya mengetahui restoran ini dari ulasan-ulasa...
2,0,"sebarkan , zaadit tidak ikut ke asmat . janji ..."
3,0,"pasti kartu paling jelek indosat , bagaimana s..."
4,0,saya kemarin batalkan tiket pesawat karena sal...


In [17]:
import re

In [18]:
# Lowercase function
def lowercase_text(text):
    return text.lower()

# Simple cleaning (remove punctuation)
def cleaning_text(text):
    return re.sub(r'[^\w\s]', '', text)

In [19]:
# Apply the preprocessing function
train_data_undersampled['text'] = train_data_undersampled['text'].apply(lowercase_text)
train_data_undersampled['text'] = train_data_undersampled['text'].apply(cleaning_text)

In [20]:
train_data_undersampled['text'].head()

Unnamed: 0,text
0,demo partisan menyesal gabung grup ini bek b...
1,saya mengetahui restoran ini dari ulasanulasan...
2,sebarkan zaadit tidak ikut ke asmat janji pa...
3,pasti kartu paling jelek indosat bagaimana si...
4,saya kemarin batalkan tiket pesawat karena sal...


In [21]:
train_data_undersampled['text'].iloc[0]

'demo partisan  menyesal gabung grup ini  bek bandung raya gerakan nya partisan banget  tidak ilmiah balas '

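The text is cleaner than before, although removing punctuation leaves some double spaces and merges hyphenated words (e.g. `ulasanulasan`).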
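The cleaned sample above still contains double spaces, and `ulasan-ulasan` has become `ulasanulasan`, because `cleaning_text` deletes punctuation outright. A possible variant, sketched below under the assumption that punctuation should act as a word separator, replaces it with a space and then collapses whitespace. The function name `clean_text` is hypothetical, and this is not the preprocessing used for the results in this notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a separator
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

sample = "Ulasan-ulasan  bagus , sekali !"
cleaned = clean_text(sample)
print(cleaned)
```

Whichever cleaning function is used, applying the same one to the training, validation, and test text (and to text passed in at inference time) keeps the inputs consistent with what the model saw during training.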

#### Modelling

We will use IndoBERT as the base model for fine-tuning.

In [22]:
import torch
import random

In [23]:
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))

CUDA available: True
Device count: 1
GPU name: Tesla T4


##### Setup Config

In [24]:
# common functions
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def metrics_to_string(metric_dict):
    string_list = []
    for key, value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key, value))
    return ' '.join(string_list)

# Set random seed
set_seed(27)

##### Load Model

In [25]:
from transformers import BertTokenizer, BertConfig, BertForSequenceClassification

In [26]:
# Load Tokenizer and Config
tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
config = BertConfig.from_pretrained('indobenchmark/indobert-base-p1')
config.num_labels = 2

# Instantiate model
model = BertForSequenceClassification.from_pretrained('indobenchmark/indobert-base-p1', config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indobenchmark/indobert-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
# Model structure
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [28]:
# Model parameter
print(count_param(model))

124442882
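The parameter count can be cross-checked by hand from the layer shapes. The sketch below uses the sizes visible in the printed model (vocabulary 50000, hidden size 768, 512 positions, 12 layers, 2 labels). Because the printout is cut off, the feed-forward size of 3072 and the presence of a pooler layer are assumptions taken from the standard BERT-base layout, not read from this notebook.

```python
hidden = 768          # from the printed model
vocab = 50_000        # word_embeddings
max_pos = 512         # position_embeddings
type_vocab = 2        # token_type_embeddings
layers = 12           # "(0-11): 12 x BertLayer"
intermediate = 3072   # ASSUMPTION: standard BERT-base feed-forward size
num_labels = 2        # config.num_labels

def linear(n_in, n_out):
    """Parameters of a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # weight + bias

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection (assumed)
    + linear(intermediate, hidden)  # feed-forward down-projection (assumed)
    + layer_norm
)
pooler = linear(hidden, hidden)          # ASSUMPTION: standard BERT pooler
classifier = linear(hidden, num_labels)  # newly initialized head

total = embeddings + layers * per_layer + pooler + classifier
print(total)
```

Under these assumptions the sum comes to 124,442,882, the same number reported by `count_param(model)`. Only 1,538 of those parameters belong to the newly initialized classifier head.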


##### Data Loader

In [29]:
# Aliases for the split DataFrames (these hold DataFrames, not file paths)
train_dataset_path = train_data_undersampled
validation_dataset_path = validation_data
test_dataset_path = test_data

In [30]:
# Tokenize function
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128  # 256 is the more usual setting
    )
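To make `padding="max_length"` and `truncation=True` concrete, here is a simplified sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` is hypothetical. It ignores the special tokens a real BERT tokenizer adds, and uses pad id 0 to match `padding_idx=0` in the printed embedding layer.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past the limit
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding needed to reach max_length
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate(range(1, 9), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example therefore ends up with the same length, which lets the rows be stacked into fixed-size tensors. The trade-off with a smaller `max_length` is that longer reviews lose their tail.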

In [31]:
from datasets import Dataset

In [32]:
train_dataset = Dataset.from_pandas(train_data_undersampled)
val_dataset = Dataset.from_pandas(validation_data)
test_dataset = Dataset.from_pandas(test_data)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove unused columns
train_dataset = train_dataset.remove_columns(["text"])
val_dataset = val_dataset.remove_columns(["text"])
test_dataset = test_dataset.remove_columns(["text"])

train_dataset.set_format("torch")
val_dataset.set_format("torch")
test_dataset.set_format("torch")

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1132 [00:00<?, ? examples/s]

Map:   0%|          | 0/2266 [00:00<?, ? examples/s]

##### Evaluation Metrics

In [33]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }
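For reference, the two metrics can be computed by hand. The sketch below is a plain-Python version for the binary case, treating label 1 as the positive class, which is also what `f1_score` does with its default `average="binary"` and `pos_label=1`. The function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy and F1 for the positive class, computed from raw counts."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

toy_labels = [1, 1, 1, 0, 0, 1]
toy_preds = [1, 0, 1, 0, 1, 1]
toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_metrics)
```

Note that this F1 only describes the positive class. If both classes should count equally, `f1_score(labels, preds, average="macro")` is one alternative.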

##### Training Arguments

In [34]:
from transformers import TrainingArguments, Trainer

In [35]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="f1"
)

##### Train & Val Loop

In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # newer name for the deprecated `tokenizer` argument
    compute_metrics=compute_metrics
)




##### Training

In [37]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.151141,0.951413,0.960854
2,No log,0.117101,0.958481,0.967338
3,No log,0.127174,0.952297,0.962396


TrainOutput(global_step=237, training_loss=0.10428901060724058, metrics={'train_runtime': 629.2049, 'train_samples_per_second': 23.84, 'train_steps_per_second': 0.377, 'total_flos': 986666457600000.0, 'train_loss': 0.10428901060724058, 'epoch': 3.0})
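The numbers in `TrainOutput` can be reproduced with a little arithmetic. The sketch below assumes a single device (as reported by the CUDA check), no gradient accumulation, and that the last partial batch of each epoch is kept. It also suggests why the Training Loss column shows `No log`: most likely the logging interval, 500 steps by default in `TrainingArguments`, is never reached.

```python
import math

train_rows = 5000             # balanced training set
batch_size = 64               # per_device_train_batch_size
epochs = 3                    # num_train_epochs
devices = 1                   # "Device count: 1"
default_logging_steps = 500   # TrainingArguments default, assuming it was not changed
train_runtime = 629.2049      # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / (batch_size * devices))
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

print(steps_per_epoch, total_steps)             # matches global_step=237
print(round(samples_per_second, 2))             # matches train_samples_per_second
print(total_steps < default_logging_steps)      # True: no logging step is reached
```

If a training-loss column is wanted, setting `logging_steps` to a value below 237 in `TrainingArguments` should produce one. Separately, `report_to="none"` disables experiment-tracker integrations such as W&B.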

##### Validation

In [38]:
trainer.evaluate()

{'eval_loss': 0.11710146069526672,
 'eval_accuracy': 0.9584805653710248,
 'eval_f1': 0.9673384294649062,
 'eval_runtime': 8.2373,
 'eval_samples_per_second': 137.423,
 'eval_steps_per_second': 2.185,
 'epoch': 3.0}

##### Testing

In [39]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.13814957439899445,
 'eval_accuracy': 0.9527802294792586,
 'eval_f1': 0.9635434412265758,
 'eval_runtime': 17.3511,
 'eval_samples_per_second': 130.597,
 'eval_steps_per_second': 2.075,
 'epoch': 3.0}

As we can see, the model performs well, and the close validation and test scores show no sign of underfitting or overfitting:

- **'eval_f1': 0.9673** in validation data
- **'eval_f1': 0.9635** in testing data

#### Inference

We can run inference with the model on some input text.

In [41]:
def predict_sentiment(text, model, tokenizer):
    device = model.device
    model.eval()  # make sure dropout is disabled for prediction
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    return torch.argmax(outputs.logits, dim=1).item()

In [44]:
# Apply the predict sentiment function using input text
predict_sentiment("Keren pelayanananya", model, tokenizer)

1

In [45]:
predict_sentiment("Nunggu paket telat 1 minggu", model, tokenizer)

0

In [46]:
predict_sentiment("Keren pelayanananya, saya jadi nunggu paket telat 1 minggu. kocak", model, tokenizer)

1

The model gives good results on real input text for both clearly positive and clearly negative cases.

But for the sarcastic text, the model does not have enough information to decide. Hahaha.

That is one reason people also build sarcasm-detection models, to understand text sentiment better.
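A bare `0` or `1` also hides how confident the model was, which matters for mixed inputs like the sarcastic one. A minimal sketch of turning two logits into a label and a probability is shown below. The logits are hypothetical values chosen for illustration, not outputs of the fine-tuned model, and the mapping 0 = negative, 1 = positive follows how the labels are read in this notebook.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # as the labels are read in this notebook

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits):
    """Return (label name, probability of that label)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# HYPOTHETICAL logits for illustration only.
confident = describe([-2.0, 3.0])
borderline = describe([0.1, 0.3])
print(confident)
print(borderline)
```

With the real model, the equivalent step would be applying `torch.softmax(outputs.logits, dim=1)` inside `predict_sentiment` and returning the probability alongside the class id. A probability close to 0.5 would flag inputs that deserve a second look.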