<b> Fine-tune a pretrained BERT model on the polyglot-ner dataset to perform named entity recognition. The dataset covers multiple languages; one language will be selected to fine-tune the model. Fine-tuning will be performed three times: 1- with 1000 sentences, 2- with 3000 sentences, 3- with 3000 sentences and frozen embeddings.</b>

In [2]:
import warnings
from datasets import load_dataset

warnings.filterwarnings('ignore') 

# Load the Polyglot-NER dataset from Hugging Face's datasets library
polyglot_ner_dataset = load_dataset("polyglot_ner")

# Display basic information about the dataset
print(polyglot_ner_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 21070925
    })
})


<b> We want to fine-tune the BERT model on a language from the dataset that: 1- is not English, 2- already has a pretrained BERT-base model, 3- contains at least 7k sentences. The following code block counts the sentences per language so that we can pick a language that meets these conditions. </b>

In [5]:
from collections import Counter

# Count the number of sentences for each language
language_counts = Counter(polyglot_ner_dataset['train']['lang'])

# Display the counts for each language
for language, count in language_counts.items():
    print(f"Language: {language}, Sentence Count: {count}")

Language: et, Sentence Count: 87023
Language: nl, Sentence Count: 520664
Language: es, Sentence Count: 386699
Language: ko, Sentence Count: 560105
Language: el, Sentence Count: 446052
Language: hr, Sentence Count: 629667
Language: id, Sentence Count: 463862
Language: uk, Sentence Count: 561373
Language: hu, Sentence Count: 590218
Language: ca, Sentence Count: 372665
Language: fr, Sentence Count: 418411
Language: tl, Sentence Count: 160750
Language: th, Sentence Count: 217631
Language: bg, Sentence Count: 559694
Language: pt, Sentence Count: 396773
Language: sk, Sentence Count: 500135
Language: vi, Sentence Count: 351643
Language: ru, Sentence Count: 551770
Language: de, Sentence Count: 547578
Language: fi, Sentence Count: 387465
Language: cs, Sentence Count: 564462
Language: he, Sentence Count: 459933
Language: da, Sentence Count: 546440
Language: sv, Sentence Count: 634881
Language: fa, Sentence Count: 492903
Language: ar, Sentence Count: 339109
Language: lv, Sentence Count: 331568
La
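
<b> The first and third conditions can be checked mechanically from the counts. The sketch below is illustrative only: it uses a handful of counts copied from the output above, and the helper name `candidate_languages` is hypothetical. The second condition (an existing pretrained BERT-base model) still has to be checked by hand. </b>

```python
# A few sentence counts copied from the output above (a sample, not the full list)
sample_counts = {"et": 87023, "nl": 520664, "es": 386699, "ar": 339109, "lv": 331568}

MIN_SENTENCES = 7_000  # threshold stated in the selection criteria


def candidate_languages(counts, min_sentences=MIN_SENTENCES, exclude=("en",)):
    """Return language codes with enough sentences, largest first (hypothetical helper)."""
    kept = {
        lang: n
        for lang, n in counts.items()
        if lang not in exclude and n >= min_sentences
    }
    return sorted(kept, key=kept.get, reverse=True)


candidates = candidate_languages(sample_counts)
print(candidates)
```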

<b> Arabic ("ar") will be used, together with the pretrained BERT model https://huggingface.co/aubmindlab/bert-base-arabertv02. Next, the data will be prepared.</b>

In [9]:
# Filter the dataset to include only Arabic (ar) entries
arabic_ner_dataset = polyglot_ner_dataset['train'].filter(lambda example: example['lang'] == 'ar')

# Print basic information about the Arabic subset
print(arabic_ner_dataset)

Filter: 100%|███████████████████████████████████████████████████████████████████████████████████| 21070925/21070925 [09:50<00:00, 35681.13 examples/s]

Dataset({
    features: ['id', 'lang', 'words', 'ner'],
    num_rows: 339109
})





<b> Now the variable holding the full dataset will be deleted to free some memory.</b>

In [10]:
import gc 

del polyglot_ner_dataset
del language_counts

gc.collect()

4

<b> Next, we prepare the Arabic data. We start by inspecting a few examples and then tokenize them with the tokenizer that belongs to the model.</b>

In [13]:
print(arabic_ner_dataset[0:2])

{'id': ['11408797', '11408798'], 'lang': ['ar', 'ar'], 'words': [['جيمس', 'ويليام', 'دنفر', '(', 'James', 'William', 'Denver', ')', '(', 'ولد', 'في', '23', 'أكتوبر', '،', '1817', 'وتوفي', 'في', '9', 'أغسطس', '1892', ')،', 'كان', 'سياسي', 'ا', 'أمريكيا', 'و', 'جندي', 'ا', 'و', 'محاميا', 'و', 'ممثلا', 'قديرا', '.'], ['أندري', 'أيوو', '،', 'من', 'مواليد', '17', 'ديسمبر', '1989', 'في', 'سيكلين', 'في', 'فرنسا', '،', 'لاعب', 'كرة', 'قدم', 'غاني', '.']], 'ner': [['PER', 'PER', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O']]}


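<b> Next, we will encode the data using the tokenizer that corresponds to the BERT model. We will also align the NER tags with the tokenized input. This is important because BERT's tokenizer may split a single word into multiple subwords: only the first subword of each word keeps the word's label, while the remaining subwords, special tokens, and padding get the label -100 so that the loss ignores them.</b>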

In [15]:
from transformers import AutoTokenizer
import torch

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Define a function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['words'], truncation=True, padding='max_length', is_split_into_words=True, max_length=128)
    
    labels = []
    for i, label in enumerate(examples['ner']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Use label dictionary to convert string labels to int
                label_ids.append(label_to_id[label[word_idx]])
            else:
                # For subwords/wordpieces, set label to -100 (ignored in loss)
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Create a label_to_id dictionary (sorted so that the mapping is identical on every run)
label_to_id = {label: i for i, label in enumerate(sorted({lbl for sublist in arabic_ner_dataset['ner'] for lbl in sublist}))}

# Apply the function to tokenize and align labels
tokenized_arabic_ner_dataset = arabic_ner_dataset.map(tokenize_and_align_labels, batched=True)


# Now `tokenized_arabic_ner_dataset` is ready for training

Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 339109/339109 [00:53<00:00, 6320.29 examples/s]
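
<b> To make the alignment rule concrete, here is a minimal sketch that applies the same logic to a hand-written `word_ids` list, so it runs without downloading the tokenizer. The `word_ids` values, the example labels, and the label mapping are made up for illustration; the real values come from the tokenizer and from `label_to_id`. </b>

```python
def align_labels(word_ids, word_labels, label_to_id, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens ([CLS], [SEP]) and padding have no source word
            aligned.append(ignore_index)
        elif word_idx != previous:
            # First subword of a new word carries the word-level label
            aligned.append(label_to_id[word_labels[word_idx]])
        else:
            # Later subwords of the same word are ignored by the loss
            aligned.append(ignore_index)
        previous = word_idx
    return aligned


# Hypothetical mapping and inputs, for illustration only
toy_label_to_id = {"LOC": 0, "O": 1, "ORG": 2, "PER": 3}
# [CLS], word 0 (2 pieces), word 1 (1 piece), word 2 (3 pieces), [SEP], padding
toy_word_ids = [None, 0, 0, 1, 2, 2, 2, None, None]
toy_word_labels = ["PER", "O", "LOC"]

aligned = align_labels(toy_word_ids, toy_word_labels, toy_label_to_id)
print(aligned)  # [-100, 3, -100, 1, 0, -100, -100, -100, -100]
```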


In [17]:
print(tokenized_arabic_ner_dataset[0])

{'id': '11408797', 'lang': 'ar', 'words': ['جيمس', 'ويليام', 'دنفر', '(', 'James', 'William', 'Denver', ')', '(', 'ولد', 'في', '23', 'أكتوبر', '،', '1817', 'وتوفي', 'في', '9', 'أغسطس', '1892', ')،', 'كان', 'سياسي', 'ا', 'أمريكيا', 'و', 'جندي', 'ا', 'و', 'محاميا', 'و', 'ممثلا', 'قديرا', '.'], 'ner': ['PER', 'PER', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'input_ids': [2, 11846, 19502, 39789, 14, 47, 14609, 6055, 60, 31733, 19674, 250, 32590, 179, 37170, 15, 14, 4254, 305, 2474, 3326, 103, 43611, 187, 21480, 305, 30, 4914, 13873, 234, 15, 103, 418, 4191, 112, 31399, 139, 8082, 112, 139, 36205, 139, 8365, 48794, 181, 20, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'to

<b> Now that the data is tokenized, we will load the pretrained model and define the training arguments.</b>

In [19]:
from transformers import BertForTokenClassification, AutoTokenizer, TrainingArguments, Trainer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Load the pretrained model with a token classification head
num_labels = len(label_to_id)
model = BertForTokenClassification.from_pretrained("aubmindlab/bert-base-arabertv02", num_labels=num_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
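
<b> As a rough sanity check on these settings, the number of optimizer steps can be estimated from the subset size, the batch size, and the number of epochs. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the helper name `optimizer_steps` is hypothetical. With `logging_steps=10`, the training loss would be logged every 10 of these steps. </b>

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a simple single-device setup."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Subset sizes used in this notebook, with batch size 16 and 3 epochs
steps_1k = optimizer_steps(1000, 16, 3)
steps_3k = optimizer_steps(3000, 16, 3)
print(steps_1k)  # (63, 189)
print(steps_3k)  # (188, 564)
```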


<b> First, we will fine-tune the model using only 1000 sentences of the data.</b>

In [23]:
from sklearn.metrics import accuracy_score, f1_score

# Create a subset of the dataset for training (first 1000 sentences)
train_subset_1k = tokenized_arabic_ner_dataset.select(range(1000))

# Create a subset for testing (next 200 sentences)
test_subset = tokenized_arabic_ner_dataset.select(range(1000, 1200))

# Define a function to compute metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Flatten the lists and exclude positions labelled -100 (special tokens, padding, non-first subwords)
    flat_labels = [label for sublist in labels for label in sublist if label != -100]
    flat_preds = [pred for sublist, label_sublist in zip(preds, labels) for pred, label in zip(sublist, label_sublist) if label != -100]

    accuracy = accuracy_score(flat_labels, flat_preds)
    f1_micro = f1_score(flat_labels, flat_preds, average='micro')
    f1_macro = f1_score(flat_labels, flat_preds, average='macro')

    return {
        'accuracy': accuracy,
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
    }


# Initialize the Trainer with the training subset, test subset, and compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_1k,
    eval_dataset=test_subset,  
    compute_metrics=compute_metrics  
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

# Save the model
model.save_pretrained("./my_fine_tuned_arabert_1k")


Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.0233,0.094223,0.966894,0.966894,0.67751
2,0.0175,0.089688,0.971859,0.971859,0.661789
3,0.0178,0.092199,0.972043,0.972043,0.68974


{'eval_loss': 0.09219853579998016, 'eval_accuracy': 0.9720434062902336, 'eval_f1_micro': 0.9720434062902336, 'eval_f1_macro': 0.6897397529127746, 'eval_runtime': 1.5635, 'eval_samples_per_second': 127.914, 'eval_steps_per_second': 2.558, 'epoch': 3.0}
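
<b> In the results above, accuracy and micro-F1 are identical while macro-F1 is much lower. For single-label classification scored over all classes, micro-F1 always equals accuracy; macro-F1 averages the per-class F1 scores with equal weight, so rare classes that are predicted poorly pull it down. A plausible reading is that the frequent "O" tag dominates the token counts here. The pure-Python sketch below reproduces the effect on made-up labels; it is an illustration, not the scikit-learn implementation. </b>

```python
def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tp_total = fp_total = fn_total = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_total += tp
        fp_total += fp
        fn_total += fn
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    macro = sum(per_class.values()) / len(per_class)
    return micro, macro, per_class


# Made-up example: 8 "O" tokens, one "PER" (found) and one "LOC" (missed)
toy_true = ["O"] * 8 + ["PER", "LOC"]
toy_pred = ["O"] * 8 + ["PER", "O"]

micro, macro, per_class = f1_scores(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"accuracy={toy_accuracy:.3f} micro={micro:.3f} macro={macro:.3f}")
```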


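<b> Next, we will fine-tune the model with 3000 examples instead of 1000. Note that the model is not reloaded in this cell, so training continues from the weights already fine-tuned on the 1000-sentence subset. </b>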

In [24]:
# Create a subset of the dataset for training (first 3000 sentences)
train_subset_3k = tokenized_arabic_ner_dataset.select(range(3000))

# Create a subset for testing (next 300 sentences)
test_subset_3k = tokenized_arabic_ner_dataset.select(range(3000, 3300))

# Initialize the Trainer with the training subset, test subset, and compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_3k,
    eval_dataset=test_subset_3k,  
    compute_metrics=compute_metrics  
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

# Save the model
model.save_pretrained("./my_fine_tuned_arabert_3k")

Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.064,0.082661,0.968054,0.968054,0.770795
2,0.0576,0.082629,0.97145,0.97145,0.793495
3,0.0289,0.085617,0.972205,0.972205,0.796702


{'eval_loss': 0.08561719954013824, 'eval_accuracy': 0.9722047541189788, 'eval_f1_micro': 0.9722047541189788, 'eval_f1_macro': 0.7967018511261825, 'eval_runtime': 2.3795, 'eval_samples_per_second': 126.075, 'eval_steps_per_second': 2.101, 'epoch': 3.0}


<b> Lastly, we will fine-tune the model with 3000 sentences again, but with frozen embeddings. This means that the embedding weights of the pretrained model are kept as they are ("frozen"), and only the other weights of the model are updated. This approach can help reduce overfitting when the dataset is small. Unlike the previous run, the model is reloaded from the pretrained checkpoint, and a different slice of the data (sentences 3000 to 5999) is used for training.</b>

In [25]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Load the pretrained model 
num_labels = len(label_to_id)
model = BertForTokenClassification.from_pretrained("aubmindlab/bert-base-arabertv02", num_labels=num_labels)

# Freeze the embeddings. This is where we found how to do it: https://discuss.huggingface.co/t/how-to-freeze-some-layers-of-bertmodel/917
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Define training arguments (can be the same or adjusted)
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Create a subset of the dataset for training (sentences 3000 to 5999)
train_subset_3k = tokenized_arabic_ner_dataset.select(range(3000, 6000))

# Create a subset for testing (the following 300 sentences)
test_subset = tokenized_arabic_ner_dataset.select(range(6000, 6300))

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_3k,
    eval_dataset=test_subset,
    compute_metrics=compute_metrics  
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

# Save the model
model.save_pretrained("./my_fine_tuned_arabert_3k_frozen_embedding")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.1168,0.078664,0.968983,0.968983,0.676443
2,0.0771,0.071621,0.970937,0.970937,0.730858
3,0.0491,0.07163,0.97289,0.97289,0.757397


{'eval_loss': 0.0716298520565033, 'eval_accuracy': 0.9728904628159727, 'eval_f1_micro': 0.9728904628159727, 'eval_f1_macro': 0.7573972146390583, 'eval_runtime': 2.3612, 'eval_samples_per_second': 127.053, 'eval_steps_per_second': 2.118, 'epoch': 3.0}
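
<b> One way to confirm that the freeze took effect is to count trainable and frozen parameters. The sketch below is duck-typed: it only relies on each parameter exposing `requires_grad` and `numel()`, which PyTorch parameters do, so it should also accept `model.named_parameters()`. To keep it runnable without loading the model, it is demonstrated on a stand-in class; the parameter names and sizes in the toy list are made up and are not the real AraBERT shapes. </b>

```python
from dataclasses import dataclass


def count_parameters(named_params):
    """Sum element counts of trainable and frozen parameters.

    Expects (name, param) pairs where param has .requires_grad and .numel().
    """
    trainable = frozen = 0
    for _name, param in named_params:
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return trainable, frozen


@dataclass
class StandInParam:
    """Minimal stand-in for a tensor parameter (illustration only)."""
    size: int
    requires_grad: bool = True

    def numel(self):
        return self.size


# Hypothetical names and sizes, chosen only to show the bookkeeping
toy_params = [
    ("embeddings.weight", StandInParam(1000, requires_grad=False)),
    ("encoder.weight", StandInParam(300)),
    ("classifier.weight", StandInParam(20)),
]

trainable, frozen = count_parameters(toy_params)
print(f"trainable={trainable} frozen={frozen}")  # trainable=320 frozen=1000

# With the model from the cell above, the call would be:
# trainable, frozen = count_parameters(model.named_parameters())
```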
