<a href="https://colab.research.google.com/github/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-finetuning-models/02_finetune_v1_malaysian_debertav2_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we finetune [mesolitica/malaysian-debertav2-base](https://huggingface.co/mesolitica/malaysian-debertav2-base), a DeBERTa (Decoding-enhanced BERT with disentangled attention) model, for a natural language inference (NLI) task. In our case, NLI is the task of determining whether a "hypothesis" is true (*entailment*) or false (*contradiction*) given a question-passage pair. DeBERTa is selected due to its [SOTA performance in comparison to other models like BERT and RoBERTa](https://wandb.ai/akshayuppal12/DeBERTa/reports/The-Next-Generation-of-Transformers-Leaving-BERT-Behind-With-DeBERTa--VmlldzoyNDM2NTk2#:~:text=What%20we%20do%20see%3A%20for,accuracy%20for%20the%20validation%20set.).

Overall, using only the [Boolq-Malay](https://huggingface.co/datasets/wanadzhar913/boolq-malay) dataset (comprising both Malay and English versions of the original [Boolq](https://huggingface.co/datasets/google/boolq) dataset), we obtain the following results:

- **No. of Epochs:** 10
- **Accuracy:** 66%  
- **F1-Score:** 65%  
- **Recall:** 65%  
- **Precision:** 66%  

The model can be found here: https://huggingface.co/wanadzhar913/malaysian-debertav2-finetune-on-boolq

In the future, we can do the following to garner better results:
- Use gradient accumulation (`gradient_accumulation_steps`) to get a larger effective batch size within the memory limits of a small GPU, or increase the `batch_size` directly if we have access to a larger GPU. Gradient accumulation is preferred on a small GPU mainly because it avoids [Out of Memory Errors (OOM)](https://discuss.huggingface.co/t/batch-size-vs-gradient-accumulation/5260).
- Given more compute resources, we can also increase the early-stopping `patient` variable and train for more than 10 epochs.
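
To make the first suggestion concrete, here is a minimal sketch of gradient accumulation in plain Python (no PyTorch, so it runs anywhere). It assumes a toy one-parameter least-squares model and made-up data; `gradient`, `accumulated_gradient` and `micro_batch_size` are hypothetical names, not part of this notebook. It shows that summing the gradients of several small batches, each weighted by its share of the full batch, reproduces the gradient of one large batch.

```python
# Toy model: prediction = w * x, loss = mean((w * x - y) ** 2) over a batch.
def gradient(w, xs, ys):
    """Gradient of the mean squared error with respect to w for one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Accumulate gradients over micro-batches instead of one large batch."""
    total = len(xs)
    grad = 0.0
    for start in range(0, total, micro_batch_size):
        bx = xs[start:start + micro_batch_size]
        by = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch so the sum
        # equals the full-batch mean gradient (also for a short last batch).
        grad += gradient(w, bx, by) * len(bx) / total
    return grad


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full = gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

In a PyTorch loop, the usual equivalent is to call `loss.backward()` on each small batch (with the loss divided by the number of accumulation steps) and to call the optimizer's `step()` and `zero_grad()` only once every few batches.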



In [1]:
!pip install datasets huggingface_hub -q

In [6]:
import transformers
import datasets
import huggingface_hub
import sklearn
import torch
import numpy as np

# Print the library versions this notebook was run with
print(f'transformers version: {transformers.__version__}')
print(f'datasets version: {datasets.__version__}')
print(f'huggingface_hub version: {huggingface_hub.__version__}')
print(f'scikit-learn version: {sklearn.__version__}')
print(f'torch version: {torch.__version__}')
print(f'numpy version: {np.__version__}')

transformers version: 4.46.2
datasets version: 3.1.0
huggingface_hub version: 0.26.2
scikit-learn version: 1.5.2
torch version: 2.5.1+cu121
numpy version: 1.26.4


In [None]:
import os
import json
import random

from tqdm import tqdm
from datasets import load_dataset
from huggingface_hub import create_repo, notebook_login
from sklearn.metrics import classification_report

import torch
import numpy as np
from transformers import (
    AutoConfig,
    AutoTokenizer,
    DebertaV2ForSequenceClassification,
    pipeline,
)

In [None]:
notebook_login()


### 1.0 Load dataset

In [None]:
ds_train = load_dataset("wanadzhar913/boolq-malay", split="train").shuffle(seed=42)
ds_test = load_dataset("wanadzhar913/boolq-malay", split="validation").shuffle(seed=42)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


boolq-english-train.jsonl:   0%|          | 0.00/6.50M [00:00<?, ?B/s]

boolq-malay-train-fixed.jsonl:   0%|          | 0.00/7.15M [00:00<?, ?B/s]

boolq-english-val.jsonl:   0%|          | 0.00/2.25M [00:00<?, ?B/s]

boolq-malay-val-fixed.jsonl:   0%|          | 0.00/2.47M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18851 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/6540 [00:00<?, ? examples/s]

In [None]:
train_X = ds_train.map(lambda x: {'input': (x['question'], x['passage'])}) \
                  .remove_columns(['question', 'answer', 'passage', 'language', 'split'])
test_X = ds_test.map(lambda x: {'input': (x['question'], x['passage'])}) \
                .remove_columns(['question', 'answer', 'passage', 'language', 'split'])

train_Y = ds_train.select_columns(["answer"])
test_Y = ds_test.select_columns(["answer"])

Map:   0%|          | 0/18851 [00:00<?, ? examples/s]

Map:   0%|          | 0/6540 [00:00<?, ? examples/s]

In [None]:
train_X[53:55]['input']

[['is colby jack the same as monterey jack',
  'Colby-Jack, or Cojack, is a cheese produced from a mixture of Colby and Monterey Jack cheeses. It is generally sold in a full-moon or a half-moon shape when it is still young and mild in flavor. The cheese has a semi-hard texture. The flavor of Colby-Jack is mild to mellow.'],
 ['adakah tidak dapat mengeja merupakan satu bentuk disleksia',
  'Disleksia, juga dikenali sebagai gangguan membaca, dicirikan oleh kesukaran dalam membaca walaupun mempunyai kecerdasan yang normal. Orang yang berbeza terjejas pada tahap yang berbeza-beza. Masalah mungkin termasuk kesukaran dalam mengeja perkataan, membaca dengan cepat, menulis perkataan, "mengeluarkan bunyi" perkataan dalam kepala, menyebut perkataan semasa membaca dengan kuat dan memahami apa yang dibaca. Selalunya kesukaran ini pertama kali diperhatikan di sekolah. Apabila seseorang yang sebelum ini boleh membaca kehilangan kebolehan mereka, ia dikenali sebagai aleksia. Kesukaran ini adalah tida

In [None]:
train_Y[53:55]['answer']

[0, 1]

In [None]:
test_X['input'][53:60]

[['adakah balkans sebahagian daripada empayar uthmaniyyah',
  'Sebahagian besar Balkan berada di bawah pemerintahan Uthmaniyyah sepanjang tempoh moden awal. Pemerintahan Uthmaniyyah berlangsung lama, dari abad ke-14 sehingga awal abad ke-20 di beberapa wilayah. Empayar Uthmaniyyah adalah pelbagai dari segi agama, bahasa dan etnik, dan, pada masa-masa tertentu, merupakan tempat yang lebih toleran untuk amalan agama berbanding dengan bahagian lain di dunia. Kumpulan-kumpulan yang berbeza dalam empayar diatur mengikut garis pengakuan, dalam sistem yang dipanggil sistem Millet. Di kalangan penganut Kristian Ortodoks dalam empayar (Rum Millet), identiti bersama dibentuk berdasarkan rasa masa yang dikongsi yang ditentukan oleh kalendar gerejawi, hari-hari santo dan perayaan.'],
 ['has every mountain in the world been climbed',
  "An unclimbed mountain is a mountain peak that has yet to be climbed to the top. Determining which unclimbed peak is highest is often a matter of controversy. In som

In [None]:
test_Y[53:60]['answer']

[1, 0, 1, 1, 1, 0, 1]

### 2.0 Load and set model configs

In [None]:
config = AutoConfig.from_pretrained('mesolitica/malaysian-debertav2-base')

config.json:   0%|          | 0.00/892 [00:00<?, ?B/s]

In [None]:
model = DebertaV2ForSequenceClassification.from_pretrained('mesolitica/malaysian-debertav2-base', config = config)
_ = model.cuda()

model.safetensors:   0%|          | 0.00/228M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at mesolitica/malaysian-debertav2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokenizer = AutoTokenizer.from_pretrained('mesolitica/malaysian-debertav2-base')

tokenizer_config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.34M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/778 [00:00<?, ?B/s]

In [None]:
!nvidia-smi

Sat Oct 19 04:57:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              53W / 400W |    899MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### 3.0 Train/finetune model

In [None]:
trainable_parameters = [param for param in model.parameters() if param.requires_grad]
trainer = torch.optim.AdamW(trainable_parameters, lr = 1e-5, eps=1e-08, betas=(0.9,0.999))

# Add a ReduceLROnPlateau scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(trainer, mode='max', patience=2, factor=0.5, verbose=True)
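
To see what this scheduler does with `mode='max'`, `patience=2` and `factor=0.5`, the sketch below re-implements its core rule in plain Python. It is a simplification and not PyTorch's code: it ignores the `threshold` and `cooldown` options, and `simulate_plateau_lr` is a hypothetical helper. The metric values are the dev accuracies from the training log in this notebook, rounded to four decimals.

```python
def simulate_plateau_lr(metrics, lr=1e-5, patience=2, factor=0.5):
    """Return the learning rate in effect after each scheduler step (mode='max').

    Simplified rule: count consecutive steps without a new best metric and
    multiply the learning rate by `factor` once that count exceeds `patience`.
    """
    best = float('-inf')
    bad_steps = 0
    history = []
    for value in metrics:
        if value > best:
            best = value
            bad_steps = 0
        else:
            bad_steps += 1
        if bad_steps > patience:
            lr *= factor
            bad_steps = 0
        history.append(lr)
    return history


# Dev accuracies per epoch, rounded from the training log in this notebook.
dev_acc = [0.6612, 0.6524, 0.6392, 0.6667, 0.6731,
           0.6753, 0.6763, 0.6649, 0.6484, 0.6580]

lr_after_step = simulate_plateau_lr(dev_acc)
for epoch_idx, lr in enumerate(lr_after_step):
    print(epoch_idx, lr)
```

Under these simplifying assumptions, the learning rate stays at `1e-5` for every training epoch of this run and is only halved by the scheduler step taken after the last epoch.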

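We see the model **overfitting towards the end** (see the log below): the training loss keeps falling, to about 0.04 by epoch 9, while the dev accuracy peaks at about 0.676 at epoch 6 and then declines. We're out of compute resources, so this will have to do. In the future, I'd reduce the `patient` variable so that training stops sooner once the dev accuracy stops improving.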
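
The early-stopping rule driven by `patient` can be sketched on its own. The helper below (`early_stopping_epoch`, a hypothetical name) replays the rule used in this notebook's training loop, where a dev accuracy that ties or beats the best so far resets the counter. It runs on the dev accuracies from the training log, rounded to four decimals, and is an illustration of the rule rather than a replacement for the loop.

```python
def early_stopping_epoch(dev_accuracies, patient):
    """Return (stop_epoch, best_epoch, best_accuracy) for a patience-based rule."""
    best_acc = float('-inf')
    best_epoch = None
    current_patient = 0
    stop_epoch = len(dev_accuracies) - 1
    for epoch_idx, acc in enumerate(dev_accuracies):
        if acc >= best_acc:
            # New best (or tie): remember it and reset the counter
            best_acc = acc
            best_epoch = epoch_idx
            current_patient = 0
        else:
            current_patient += 1
        if current_patient >= patient:
            stop_epoch = epoch_idx
            break
    return stop_epoch, best_epoch, best_acc


# Dev accuracies per epoch, rounded from the training log in this notebook.
dev_acc = [0.6612, 0.6524, 0.6392, 0.6667, 0.6731,
           0.6753, 0.6763, 0.6649, 0.6484, 0.6580]

for patient in (1, 2, 3):
    print(patient, early_stopping_epoch(dev_acc, patient))
```

On these values, `patient = 3` only triggers at the final epoch with the best accuracy at epoch 6, while `patient = 1` or `patient = 2` would stop at epoch 1 or 2, before the later improvement. Patience therefore trades compute against the risk of stopping during a temporary dip.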

In [None]:
batch_size = 8 # small batch size to avoid OutOfMemory errors :(
epoch = 10

best_dev_acc = -np.inf
patient = 3
current_patient = 0

for e in range(epoch):
    pbar = tqdm(range(0, len(train_X), batch_size))
    losses = []
    for i in pbar:
        trainer.zero_grad()
        x = train_X[i: i + batch_size]
        y = np.array(train_Y['answer'][i: i + batch_size])

        padded = tokenizer(x['input'], padding = 'longest', return_tensors = 'pt')

        padded['labels'] = torch.from_numpy(y)
        for k in padded.keys():
            padded[k] = padded[k].cuda()

        padded.pop('token_type_ids', None)

        loss, pred = model(**padded, return_dict = False)
        loss.backward()

        grad_norm = torch.nn.utils.clip_grad_norm_(trainable_parameters, 5.0)
        trainer.step()
        losses.append(float(loss))

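    # Evaluate on the dev set; no gradients are needed, so skip building the graph
    dev_predicted = []
    with torch.no_grad():
        for i in range(0, len(test_X), batch_size):
            x = test_X[i: i + batch_size]
            y = np.array(test_Y['answer'][i: i + batch_size])

            padded = tokenizer(x['input'], padding = 'longest', return_tensors = 'pt')
            padded['labels'] = torch.from_numpy(y)
            for k in padded.keys():
                padded[k] = padded[k].cuda()

            # Drop token_type_ids so the inputs match those used in the training step
            padded.pop('token_type_ids', None)

            loss, pred = model(**padded, return_dict = False)
            # Per-batch accuracy; averaged over batches after the loop
            dev_predicted.append((pred.argmax(axis = 1).cpu().numpy() == y).mean())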

    dev_predicted = np.mean(dev_predicted)

    # Call scheduler.step() with the validation accuracy
    scheduler.step(dev_predicted)

    print(f'epoch: {e}, loss: {np.mean(losses)}, dev_predicted: {dev_predicted}')

    if dev_predicted >= best_dev_acc:
        best_dev_acc = dev_predicted
        current_patient = 0
        model.save_pretrained('malaysian-debertav2-finetune-on-boolq')
    else:
        current_patient += 1

    if current_patient >= patient:
        break

    if e == epoch - 1:
        # The best checkpoint was already saved above; saving again here would
        # overwrite it with the last-epoch weights.
        print(f'Best dev accuracy: {best_dev_acc}')

100%|██████████| 2357/2357 [09:24<00:00,  4.18it/s]


epoch: 0, loss: 0.6468807298296531, dev_predicted: 0.6612163814180929


100%|██████████| 2357/2357 [09:24<00:00,  4.18it/s]


epoch: 1, loss: 0.49742775551791435, dev_predicted: 0.6523533007334963


100%|██████████| 2357/2357 [09:25<00:00,  4.17it/s]


epoch: 2, loss: 0.25414175960548635, dev_predicted: 0.6392114914425427


100%|██████████| 2357/2357 [09:25<00:00,  4.17it/s]


epoch: 3, loss: 0.1292089269987623, dev_predicted: 0.6667176039119804


100%|██████████| 2357/2357 [09:25<00:00,  4.17it/s]


epoch: 4, loss: 0.09572274779887335, dev_predicted: 0.6731356968215159


100%|██████████| 2357/2357 [09:26<00:00,  4.16it/s]


epoch: 5, loss: 0.07463395613032514, dev_predicted: 0.6752750611246944


100%|██████████| 2357/2357 [09:26<00:00,  4.16it/s]


epoch: 6, loss: 0.07290740855837491, dev_predicted: 0.6763447432762836


100%|██████████| 2357/2357 [09:26<00:00,  4.16it/s]


epoch: 7, loss: 0.05595923417469376, dev_predicted: 0.6648838630806846


100%|██████████| 2357/2357 [09:26<00:00,  4.16it/s]


epoch: 8, loss: 0.053639279538313, dev_predicted: 0.648380195599022


100%|██████████| 2357/2357 [09:26<00:00,  4.16it/s]


epoch: 9, loss: 0.04436893333102321, dev_predicted: 0.6580073349633252
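
One detail of the `dev_predicted` values logged above: each is the mean of per-batch accuracies, which weights a short final batch as heavily as a full one, so it can differ slightly from the accuracy over all examples. The sketch below uses made-up correctness flags (not results from this notebook) to show the difference; `mean_of_batch_means` and `overall_accuracy` are hypothetical helpers.

```python
def mean_of_batch_means(correct, batch_size):
    """Average the per-batch accuracies, as the evaluation loop does."""
    batch_means = []
    for start in range(0, len(correct), batch_size):
        batch = correct[start:start + batch_size]
        batch_means.append(sum(batch) / len(batch))
    return sum(batch_means) / len(batch_means)


def overall_accuracy(correct):
    """Fraction of correct predictions over all examples."""
    return sum(correct) / len(correct)


# Made-up flags: 1 = correct prediction, 0 = wrong. Ten examples with a
# batch size of 4, so the last batch holds only two examples.
correct = [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]

print(mean_of_batch_means(correct, batch_size=4))
print(overall_accuracy(correct))
```

With 6,540 validation examples and a batch size of 8, the last batch in this notebook holds 4 examples, so the effect here is small.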


### 4.0 Run benchmarks

In [None]:
label2id = {'contradiction' : 0, 'entailment' : 1}
id2label = {0 : 'contradiction', 1 : 'entailment'}

In [None]:
config.num_labels = 2
config.vocab = ['contradiction', 'entailment']

In [None]:
model.config.label2id = label2id
model.config.id2label = id2label

In [None]:
batch_size = 10

# Despite its name, real_Y collects the model's predictions on the validation split
real_Y = []
for i in tqdm(range(0, len(test_X), batch_size)):
    x = test_X[i: i + batch_size]
    y = np.array(test_Y['answer'][i: i + batch_size])

    padded = tokenizer(x['input'], padding = 'longest', return_tensors = 'pt')
    padded['labels'] = torch.from_numpy(y)
    for k in padded.keys():
        padded[k] = padded[k].cuda()
    padded.pop('token_type_ids', None)

    loss, pred = model(**padded, return_dict=False)
    real_Y.extend(pred.argmax(axis = 1).detach().cpu().numpy().tolist())

100%|██████████| 654/654 [01:04<00:00, 10.19it/s]


In [None]:
padded

{'input_ids': tensor([[  974, 20557,  3407,  ...,     0,     0,     0],
        [ 9259, 12227, 22431,  ...,     0,     0,     0],
        [  498,  2395,  1624,  ...,     0,     0,     0],
        ...,
        [  807,  4044,   856,  ...,     0,     0,     0],
        [ 9259,  5937,  2892,  ...,    16,   612,    17],
        [ 1221,   384,   418,  ...,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'), 'labels': tensor([1, 0, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0')}

In [None]:
pred.argmax(axis = 1).detach().cpu().numpy()

array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])

In [None]:
pred

tensor([[-4.6697,  4.4405],
        [ 1.1710, -1.7389],
        [-3.0831,  2.9493],
        [-4.5986,  4.2763],
        [-4.1859,  3.9571],
        [ 0.3446, -0.9267],
        [-3.0447,  2.7374],
        [ 3.3037, -4.0914],
        [-3.3666,  3.1480],
        [-4.4459,  4.2598]], device='cuda:0', grad_fn=<AddmmBackward0>)
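
The values in `pred` are raw logits, one column per class (0 = contradiction, 1 = entailment, following the `id2label` mapping above). A softmax turns each row into probabilities, and the argmax gives the predicted label. The sketch below does this in plain Python for the first three rows printed above; `softmax` is written out by hand here rather than taken from PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First three rows of `pred` as printed above.
rows = [
    [-4.6697, 4.4405],
    [1.1710, -1.7389],
    [-3.0831, 2.9493],
]
id2label = {0: 'contradiction', 1: 'entailment'}

predicted = []
for logits in rows:
    probs = softmax(logits)
    # Index of the largest probability is the predicted class id
    label_id = max(range(len(probs)), key=probs.__getitem__)
    predicted.append(id2label[label_id])
    print(id2label[label_id], round(probs[label_id], 4))
```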

In [None]:
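# NOTE: classification_report expects (y_true, y_pred). Here the predictions
# (real_Y) are passed first, so per-class precision/recall and the support
# column are swapped relative to the usual reading; accuracy and per-class
# F1 are unaffected.
print(
    classification_report(
        real_Y, test_Y['answer'], # target_names = config.vocab,
        digits = 5
    )
)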

              precision    recall  f1-score   support

           0    0.66047   0.53910   0.59364      3031
           1    0.65642   0.76062   0.70469      3509

    accuracy                        0.65795      6540
   macro avg    0.65844   0.64986   0.64916      6540
weighted avg    0.65830   0.65795   0.65322      6540
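
scikit-learn's `classification_report` takes the ground-truth labels first and the predictions second, and the order matters for everything except accuracy and F1. The sketch below computes precision and recall for one class by hand on made-up labels (not data from this notebook) to show that swapping the two arguments swaps precision and recall; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from scratch."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]

print(precision_recall(y_true, y_pred))  # intended order: (y_true, y_pred)
print(precision_recall(y_pred, y_true))  # swapped order
```

When reading a report, the `support` column is a quick check: it should equal the number of ground-truth examples per class.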



In [None]:
pipe = pipeline(
    "text-classification",
    tokenizer = tokenizer,
    model=model,
    padding=True
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
article = """
KUALA LUMPUR: Kerajaan bersetuju untuk menaikkan kadar gaji minimum daripada RM1,500 sebulan kepada RM1,700, berkuat kuasa 1 Februari 2025.
Perdana Menteri Datuk Seri Anwar Ibrahim sewaktu membentangkan Belanjawan 2025 Malaysia MADANI di Dewan Rakyat pada Jumaat berkata, penstrukturan ekonomi hanya dianggap berjaya apabila rakyat meraih gaji dan upah yang bermakna untuk menjalani hidup dengan lebih selesa.
"""

In [None]:
pipe([('Betul ke kerajaan naikkan gaji minimum?', article)])

[{'label': 'entailment', 'score': 0.8098661303520203}]

In [None]:
pipe([('Did the government top up minimum wage?', article)])

[{'label': 'entailment', 'score': 0.9928961396217346}]

In [None]:
pipe([('Government naikkan gaji minimum', article)])

[{'label': 'entailment', 'score': 0.7880232334136963}]

### 5.0 Push model to Hugging Face 🤗

In [None]:
create_repo("wanadzhar913/malaysian-debertav2-finetune-on-boolq", repo_type="model")

RepoUrl('https://huggingface.co/wanadzhar913/malaysian-debertav2-finetune-on-boolq', endpoint='https://huggingface.co', repo_type='model', repo_id='wanadzhar913/malaysian-debertav2-finetune-on-boolq')

In [None]:
model.push_to_hub('wanadzhar913/malaysian-debertav2-finetune-on-boolq', safe_serialization = True)

model.safetensors:   0%|          | 0.00/443M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wanadzhar913/malaysian-debertav2-finetune-on-boolq/commit/ae9b4f9f287d390a474284d42707f6e57a389b3a', commit_message='Upload DebertaV2ForSequenceClassification', commit_description='', oid='ae9b4f9f287d390a474284d42707f6e57a389b3a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub('wanadzhar913/malaysian-debertav2-finetune-on-boolq', safe_serialization = True)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wanadzhar913/malaysian-debertav2-finetune-on-boolq/commit/4e528614d1f0b324945ed977162f85656b67d792', commit_message='Upload tokenizer', commit_description='', oid='4e528614d1f0b324945ed977162f85656b67d792', pr_url=None, pr_revision=None, pr_num=None)