# Sequential Labelling

An attempt to fine-tune the IndoBERT model for sequential labelling (named-entity recognition) on the NERGrit corpus. The fine-tuned model is available on [Hugging Face](https://huggingface.co/apwic/indobert-base-uncased-finetuned-nergrit).

## Import Modules

In [1]:
import evaluate
import numpy as np
import transformers
import tensorflow as tf
import pandas as pd

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForTokenClassification, create_optimizer, TFAutoModelForTokenClassification, pipeline
from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback
from IPython.display import display, HTML

## Import Dataset

In [2]:
nergrit = load_dataset('id_nergrit_corpus', 'ner')

In [3]:
nergrit

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 12532
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2399
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2521
    })
})

In [4]:
label_list = nergrit["train"].features["ner_tags"].feature.names
label_list

['B-CRD',
 'B-DAT',
 'B-EVT',
 'B-FAC',
 'B-GPE',
 'B-LAN',
 'B-LAW',
 'B-LOC',
 'B-MON',
 'B-NOR',
 'B-ORD',
 'B-ORG',
 'B-PER',
 'B-PRC',
 'B-PRD',
 'B-QTY',
 'B-REG',
 'B-TIM',
 'B-WOA',
 'I-CRD',
 'I-DAT',
 'I-EVT',
 'I-FAC',
 'I-GPE',
 'I-LAN',
 'I-LAW',
 'I-LOC',
 'I-MON',
 'I-NOR',
 'I-ORD',
 'I-ORG',
 'I-PER',
 'I-PRC',
 'I-PRD',
 'I-QTY',
 'I-REG',
 'I-TIM',
 'I-WOA',
 'O']

### Description for this Label

The prefix of each ner_tag indicates the token's position within an entity:
- B- indicates the beginning of an entity.
- I- indicates a token inside the same entity (for example, the State token is part of an entity like Empire State Building).
- O indicates the token doesn't correspond to any entity.

The suffix of each tag gives the entity type:
- 'CRD': Cardinal
- 'DAT': Date
- 'EVT': Event
- 'FAC': Facility
- 'GPE': Geopolitical Entity
- 'LAW': Law Entity (such as Undang-Undang)
- 'LOC': Location
- 'MON': Money
- 'NOR': Political Organization
- 'ORD': Ordinal
- 'ORG': Organization
- 'PER': Person
- 'PRC': Percent
- 'PRD': Product
- 'QTY': Quantity
- 'REG': Religion
- 'TIM': Time
- 'WOA': Work of Art
- 'LAN': Language
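
To make the tagging scheme concrete, here is a minimal sketch of how B-/I-/O tags can be grouped back into entity spans. The function name bio_to_spans and the example sentence are illustrative assumptions, not part of the dataset or this notebook. An I- tag that does not continue a matching entity is simply dropped here, which is only one of several possible conventions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, tagged by hand.
tokens = ["Biru", "Laut", "tinggal", "di", "Jakarta"]
tags = ["B-PER", "I-PER", "O", "O", "B-GPE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here the two PER tokens collapse into a single person entity and the lone B-GPE token becomes its own span.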


## Preprocessing

In [3]:
# use IndoBERT
tokenizer = AutoTokenizer.from_pretrained('indolem/indobert-base-uncased')

In [6]:
example = nergrit["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens[:10]

['[CLS]',
 'indonesia',
 'mengekspor',
 'produk',
 'industri',
 'skala',
 'besar',
 'ke',
 'amerika',
 'serikat']

Based on the documentation:

> This adds some special tokens [CLS] and [SEP] and the subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may now be split into two subwords. You'll need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the word_ids method.
2. Assigning the label -100 to the special tokens [CLS] and [SEP] so they're ignored by the PyTorch loss function.
3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

So the tokens and labels need to be realigned, and truncation is requested for sequences longer than the model's maximum length.
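
The realignment rule can be sketched on a toy example before it is applied to the whole dataset. The word_ids list is written by hand, assuming the words "Daniel Tumbuan" are tokenized as [CLS] daniel tum ##buan [SEP]; align_labels is a hypothetical helper, and the ids 12 and 31 are the positions of B-PER and I-PER in label_list.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subtoken; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] have no word.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subtoken of a word keeps the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining subtokens of the same word are masked.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hand-written word ids for: [CLS] daniel tum ##buan [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [12, 31]  # B-PER, I-PER
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Only "daniel" and "tum" keep a label; the continuation piece "##buan" and both special tokens receive -100.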

In [4]:
def tokenize_and_align_labels(dataset):
    tokenized_inputs = tokenizer(dataset["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(dataset["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [5]:
tokenized_nergrit = nergrit.map(tokenize_and_align_labels, batched=True)  # process examples in batches

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Based on the reference:

> It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

So the next step is to create a data collator that pads each batch dynamically.

In [12]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
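
As a rough illustration of what dynamic padding does, the sketch below pads a toy batch to the length of its longest sequence. It is not the collator's actual implementation: pad_batch is a hypothetical helper, the token ids are made up, and a padding token id of 0 is an assumption. Labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        n_pad = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * length + [0] * n_pad)
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded


# Made-up token ids for two sequences of different length.
batch = [
    {"input_ids": [2, 10, 11, 3], "labels": [-100, 12, 31, -100]},
    {"input_ids": [2, 10, 3], "labels": [-100, 4, -100]},
]
padded = pad_batch(batch)
print(padded["input_ids"])
```

Because the target length is computed per batch, a batch of short sentences stays short instead of being padded to the longest sentence in the dataset.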

## Metrics Evaluation

Evaluation for this model uses only accuracy (the overall_accuracy value reported by seqeval).

In [15]:
seqeval = evaluate.load("seqeval")

In [16]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "accuracy": results["overall_accuracy"],
    }
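
As a rough illustration of what this accuracy measures, the sketch below computes token-level accuracy over positions whose label is not -100. It assumes that seqeval's overall_accuracy is plain token-level accuracy; masked_token_accuracy and the toy label ids are hypothetical.

```python
def masked_token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of correctly predicted tokens among non-masked positions."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # special tokens and continuation subtokens
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy predicted and reference label ids (already arg-maxed).
preds = [[38, 12, 31, 31, 38], [38, 7, 38]]
refs = [[-100, 12, 31, -100, -100], [-100, 4, -100]]
acc = masked_token_accuracy(preds, refs)
print(acc)
```

Three positions carry a real label in this example, and two of them are predicted correctly.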

## Fine-Tuning the Model

In [17]:
labels = [
    "B-CRD", "B-DAT", "B-EVT", "B-FAC", "B-GPE", "B-LAN", "B-LAW", "B-LOC", "B-MON", "B-NOR",
    "B-ORD", "B-ORG", "B-PER", "B-PRC", "B-PRD", "B-QTY", "B-REG", "B-TIM", "B-WOA",
    "I-CRD", "I-DAT", "I-EVT", "I-FAC", "I-GPE", "I-LAN", "I-LAW", "I-LOC", "I-MON", "I-NOR",
    "I-ORD", "I-ORG", "I-PER", "I-PRC", "I-PRD", "I-QTY", "I-REG", "I-TIM", "I-WOA", "O",
]

id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in id2label.items()}


In [18]:
len(labels)

39

In [26]:
batch_size = 8
num_train_epochs = 3
num_train_steps = (len(tokenized_nergrit["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0,
)
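
For the 12532 training rows shown earlier, the step count works out as below. The second half of the sketch assumes that, with no warmup, the schedule decays linearly from init_lr to zero over num_train_steps; linear_decay_lr is a hypothetical helper that only illustrates this assumed shape.

```python
train_rows = 12532
batch_size = 8
num_train_epochs = 3

# Floor division drops the final partial batch of each epoch.
steps_per_epoch = train_rows // batch_size
num_train_steps = steps_per_epoch * num_train_epochs


def linear_decay_lr(step, init_lr=2e-5, total_steps=num_train_steps):
    """Learning rate under an assumed linear decay to zero."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


print(steps_per_epoch, num_train_steps)
print(linear_decay_lr(0), linear_decay_lr(num_train_steps // 2))
```

This gives 1566 steps per epoch and 4698 steps in total, with the learning rate halved at the midpoint under the linear-decay assumption.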

In [27]:
model = TFAutoModelForTokenClassification.from_pretrained(
    "indolem/indobert-base-uncased", 
    num_labels=39,          # set the num labels to 39
    id2label=id2label, 
    label2id=label2id, 
    from_pt=True
)

All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_nergrit["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_nergrit["validation"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [29]:
model.compile(optimizer=optimizer)

In [30]:
# create the callback for the model
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

push_to_hub_callback = PushToHubCallback(
    output_dir="indobert-base-uncased-finetuned-nergrit",  # change this based on the output repo desired
    tokenizer=tokenizer,
)

callbacks = [metric_callback, push_to_hub_callback]

/content/indobert-base-uncased-finetuned-nergrit is already a clone of https://huggingface.co/apwic/indobert-base-uncased-finetuned-nergrit. Make sure you pull the latest changes with `repo.git_pull()`.


In [31]:
model.fit(x=tf_train_set, 
          validation_data=tf_validation_set, 
          epochs=num_train_epochs,  # few epochs to keep training time short
          callbacks=callbacks)

Epoch 1/3
Epoch 2/3
Epoch 3/3

<keras.src.callbacks.History at 0x7a7feabb2ad0>

## Inferencing the Model

In [7]:
text = """Jakarta, Maret 1998
Di sebuah senja, di sebuah rumah susun di Jakarta, mahasiswa bernama Biru Laut disergap empat lelaki tak dikenal. Bersama kawan-kawannya, Daniel Tumbuan, Sunu Dyantoro, Alex Perazon, dia dibawa ke sebuah tempat yang tak dikenal. Berbulan-bulan mereka disekap, diinterogasi, dipukul, ditendang, digantung, dan disetrum agar bersedia menjawab satu pertanyaan penting: siapakah yang berdiri di balik gerakan aktivis dan mahasiswa saat itu.
Jakarta, Juni 1998
Keluarga Arya Wibisono, seperti biasa, pada hari Minggu sore memasak bersama, menyediakan makanan kesukaan Biru Laut. Sang ayah akan meletakkan satu piring untuk dirinya, satu piring untuk sang ibu, satu piring untuk Biru Laut, dan satu piring untuk si bungsu Asmara Jati. Mereka duduk menanti dan menanti. Tapi Biru Laut tak kunjung muncul.
Jakarta, 2000
Asmara Jati, adik Biru Laut, beserta Tim Komisi Orang Hilang yang dipimpin Aswin Pradana mencoba mencari jejak mereka yang hilang serta merekam dan mempelajari testimoni mereka yang kembali. Anjani, kekasih Laut, para orangtua dan istri aktivis yang hilang menuntut kejelasan tentang anggota keluarga mereka. Sementara Biru Laut, dari dasar laut yang sunyi bercerita kepada kita, kepada dunia tentang apa yang terjadi pada dirinya dan kawan-kawannya.
Laut Bercerita, novel terbaru Leila S. Chudori, bertutur tentang kisah keluarga yang kehilangan, sekumpulan sahabat yang merasakan kekosongan di dada, sekelompok orang yang gemar menyiksa dan lancar berkhianat, sejumlah keluarga yang mencari kejelasan akan anaknya, dan tentang cinta yang tak akan luntur."""

In [6]:
# use a pipeline to simplify inference with the model
classifier = pipeline("ner", model="apwic/indobert-base-uncased-finetuned-nergrit")

Some layers from the model checkpoint at apwic/indobert-base-uncased-finetuned-nergrit were not used when initializing TFBertForTokenClassification: ['dropout_75']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at apwic/indobert-base-uncased-finetuned-nergrit.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [31]:
classified_text = classifier(text)

In [30]:
pd.DataFrame(classified_text[:10])

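|   | entity | score | index | word | start | end |
|---|--------|-------|-------|------|-------|-----|
| 0 | B-GPE | 0.996624 | 1 | jakarta | 0 | 7 |
| 1 | B-DAT | 0.996588 | 3 | maret | 9 | 14 |
| 2 | I-DAT | 0.993151 | 4 | 1998 | 15 | 19 |
| 3 | B-GPE | 0.995673 | 14 | jakarta | 62 | 69 |
| 4 | B-PER | 0.903852 | 18 | biru | 89 | 93 |
| 5 | I-PER | 0.714538 | 19 | laut | 94 | 98 |
| 6 | B-CRD | 0.923469 | 22 | empat | 108 | 113 |
| 7 | B-PER | 0.995122 | 32 | daniel | 158 | 164 |
| 8 | I-PER | 0.996632 | 33 | tum | 165 | 168 |
| 9 | I-PER | 0.9962 | 34 | ##buan | 168 | 172 |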


Only the first 10 entries of classified_text are shown.
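
The pipeline returns one row per token, so a name like "Tumbuan" appears as the pieces tum and ##buan. A minimal sketch of merging such rows into entity-level spans using the start and end offsets follows; merge_entities is a hypothetical helper, and treating an I- row as a continuation when it starts at most one character after the previous span is an assumption made for this example.

```python
def merge_entities(text, entities):
    """Merge token-level predictions into entity-level spans via offsets."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prefix, _, ent_type = ent["entity"].partition("-")
        continues = (
            bool(merged)
            and prefix == "I"
            and merged[-1]["type"] == ent_type
            and ent["start"] - merged[-1]["end"] <= 1  # adjacent or one space apart
        )
        if continues:
            merged[-1]["end"] = ent["end"]
        else:
            merged.append({"type": ent_type, "start": ent["start"], "end": ent["end"]})
    for span in merged:
        # Read the surface form from the original text, not from the pieces.
        span["text"] = text[span["start"]:span["end"]]
    return merged


# First three rows of the table above, with the first line of the input text.
sample_text = "Jakarta, Maret 1998"
sample_rows = [
    {"entity": "B-GPE", "word": "jakarta", "start": 0, "end": 7},
    {"entity": "B-DAT", "word": "maret", "start": 9, "end": 14},
    {"entity": "I-DAT", "word": "1998", "start": 15, "end": 19},
]
merged = merge_entities(sample_text, sample_rows)
print([(m["type"], m["text"]) for m in merged])
```

The pipeline also accepts an aggregation_strategy argument that performs a similar grouping; it is not used in this notebook.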

In [32]:
def visualize_entities_html(text, classified_text):
    # Create a color mapping for each entity type
    color_map = {
        'GPE': 'yellow',
        'DAT': 'blue',
        'PER': 'green',
        'CRD': 'red',
        'EVT': '#FFA500',  # orange
        'FAC': '#FFC0CB',  # pink
        'LAW': '#FFD700',  # gold
        'LOC': '#ADFF2F',  # greenyellow
        'MON': '#FA8072',  # salmon
        'NOR': '#9370DB',  # mediumpurple
        'ORD': '#7B68EE',  # mediumslateblue
        'ORG': '#6A5ACD',  # slateblue
        'PRC': '#FF69B4',  # hotpink
        'PRD': '#D2B48C',  # tan
        'QTY': '#FF6347',  # tomato
        'REG': '#DB7093',  # palevioletred
        'TIM': '#EEE8AA',  # palegoldenrod
        'WOA': '#F08080',  # lightcoral
        'LAN': '#BDB76B'   # darkkhaki
    }

    
    # Sort by start index without mutating the caller's list
    entities = sorted(classified_text, key=lambda x: x['start'])

    html_output = text
    shift = 0

    for entity in entities:
        # Use the original surface text: entity['word'] is lowercased and
        # may carry a '##' subword prefix (e.g. '##buan')
        word = text[entity['start']:entity['end']]
        start = entity['start'] + shift
        end = entity['end'] + shift
        entity_type = entity['entity'].split('-')[-1]  # Extract the entity type, e.g. 'B-GPE' -> 'GPE'
        color = color_map.get(entity_type, 'grey')  # Default to grey if the entity type is not in the map

        # Wrap the word in a span with a background color
        span = f"<span style='background-color: {color}'>{word}</span>"

        # Replace the word in the text with the highlighted word
        html_output = html_output[:start] + span + html_output[end:]

        # Adjust the shift by the number of characters the HTML tags added
        shift += len(span) - (end - start)

    display(HTML(html_output))


The function above renders the sequential-labelling output as highlighted text. No separate evaluation is run here, since the model was already evaluated during training.

In [33]:
visualize_entities_html(text, classified_text)
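
An alternative way to build the highlighted HTML is to assemble it from slices of the original text, which avoids tracking a running offset and allows the text to be HTML-escaped. This is only a sketch: highlight is a hypothetical helper, and it uses a single fixed color to stay short.

```python
import html


def highlight(text, entities, color="yellow"):
    """Return text as HTML with each entity span wrapped in a colored <span>."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Plain text between the previous entity and this one.
        pieces.append(html.escape(text[cursor:ent["start"]]))
        surface = html.escape(text[ent["start"]:ent["end"]])
        pieces.append(f"<span style='background-color: {color}'>{surface}</span>")
        cursor = ent["end"]
    # Whatever remains after the last entity.
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


rendered = highlight("Jakarta, Maret 1998", [{"start": 0, "end": 7}])
print(rendered)
```

The returned string can be passed to display(HTML(...)) in the same way as in the function above.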

Let's try another text.

In [34]:
other_text = """Toru Watanabe terlempar jauh ke waktu hamper 20 tahun silam saai ia masih menjadi mahasiswa yang terjerat dalam hubungan pertemanan yang rumit serta pelik, masa-masa seks bebas, serba-serbi nafsu, serta rasa hampa yang menyelimuti seorang gadis badung, Midori, yang memasuki hidupnya, yang membuat Toru Watanabe harus menentukan untuk memprioritaskan antara masa depan atau masa lalu"""

In [37]:
other_classified_text = classifier(other_text)

In [38]:
pd.DataFrame(other_classified_text)

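|   | entity | score | index | word | start | end |
|---|--------|-------|-------|------|-------|-----|
| 0 | B-PER | 0.996455 | 1 | tor | 0 | 3 |
| 1 | I-PER | 0.988008 | 2 | ##u | 3 | 4 |
| 2 | I-PER | 0.996024 | 3 | watan | 5 | 10 |
| 3 | I-PER | 0.995842 | 4 | ##abe | 10 | 13 |
| 4 | B-QTY | 0.986421 | 11 | 20 | 45 | 47 |
| 5 | I-QTY | 0.97684 | 12 | tahun | 48 | 53 |
| 6 | B-PER | 0.992664 | 51 | mid | 253 | 256 |
| 7 | I-PER | 0.958504 | 52 | ##ori | 256 | 259 |
| 8 | B-PER | 0.996367 | 60 | tor | 298 | 301 |
| 9 | I-PER | 0.989379 | 61 | ##u | 301 | 302 |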


In [39]:
visualize_entities_html(other_text, other_classified_text)