# Machine Learning for Natural Language Processing: Named Entity Recognition using BERT
This notebook shows how to fine-tune a pretrained BERT model for named entity recognition (NER) in Turkish. We train three models: two fully fine-tuned on training sets of different sizes, and one trained with the BERT encoder frozen.

The notebook follows the structure below:

- dataset import,
- dataset preparation,
- model setup,
- fine-tuning,
- evaluation.

In [1]:
!pip install huggingface
!pip install datasets "transformers[torch]"
!pip install -U accelerate



In [2]:
import torch
from datasets import load_dataset, concatenate_datasets
from sklearn.metrics import f1_score
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

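After importing the libraries, we load the dataset. We use the Turkish (`tr`) subset of polyglot-ner. The first training set contains the first 1000 rows, the second contains the first 3000 rows (so it includes the first set), and the evaluation set contains 2000 rows (rows 7000 to 8999 of the same `train` split).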

In [4]:
train_set1000 = load_dataset('polyglot_ner', 'tr', split='train[:1000]')
train_set3000 = load_dataset('polyglot_ner', 'tr', split='train[:3000]')
eval_set = load_dataset('polyglot_ner', 'tr', split='train[7000:9000]')

Note: `datasets` warns that passing `trust_remote_code=True` will be mandatory to load this dataset from its next major release.


Next, we map the NER tags to integer labels and load the tokenizer for the Turkish BERT model. Both are used by the encoding function.

In [5]:
# map NER tags to integer labels
# (train_set1000 is a subset of train_set3000, so the larger set covers both)
tags = set()
for row in train_set3000['ner']:
    tags.update(row)
# sort the tags so that the tag-to-label mapping is the same on every run
tags_to_labels = {tag: i for i, tag in enumerate(sorted(tags))}
labels_to_tags = {v: k for k, v in tags_to_labels.items()}

In [6]:
model_name = 'dbmdz/bert-base-turkish-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)



NER encoding is done in the function below. It returns the encoded rows together with the number of words in each input row.

We also define a `decode_predictions()` function that maps the predicted label indices back to NER tags.

In [7]:
def encode_dataset(dataset):
    max_len = 512
    enc_set, lengths = [], []

    for words, word_tags in zip(dataset['words'], dataset['ner']):
        # tokenize the pre-split words; pad or truncate to max_len tokens
        enc = tokenizer(words, return_tensors="pt", padding='max_length', max_length=max_len, is_split_into_words=True, truncation=True)
        # initialise every label position to 0
        enc['labels'] = torch.zeros(1, max_len, dtype=torch.long)
        # write the label index of word i to position i
        # (note: positions are word indices, so they do not account for
        # [CLS] or for words that are split into several subword tokens)
        for i, tag in enumerate(word_tags[:max_len]):
            enc['labels'][0][i] = tags_to_labels[tag]

        # drop the batch dimension: (1, max_len) -> (max_len,)
        for key in enc:
            enc[key] = torch.squeeze(enc[key])

        enc_set.append(enc)
        # number of words in the row (not capped at max_len)
        lengths.append(len(word_tags))
    return enc_set, lengths
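
One caveat with `encode_dataset`: it writes the tag of word i to token position i, but BERT places `[CLS]` at position 0 and may split one word into several WordPiece tokens, so labels and tokens can drift apart. A common alternative, which is not what this notebook does, is to align labels through each token's word index (for example the list returned by `word_ids()` on a fast tokenizer's output) and to mask special tokens and padding with -100 so that the loss ignores them. The sketch below illustrates the idea; the helper `align_labels` and the hand-written `word_ids` list are hypothetical and only serve as an illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label ids onto token positions.

    word_ids:    one entry per token; None for special or padding tokens,
                 otherwise the index of the word the token came from.
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # special token or padding: excluded from the loss
            aligned.append(IGNORE_INDEX)
        elif word_id != previous or label_all_subwords:
            # first subword of a word (or every subword, if requested)
            aligned.append(word_labels[word_id])
        else:
            # later subword of the same word: excluded from the loss
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical token layout: [CLS], word 0 split in two, word 1, [SEP], [PAD]
word_ids = [None, 0, 0, 1, None, None]
word_labels = [2, 0]  # made-up label ids for the two words
print(align_labels(word_ids, word_labels))  # [-100, 2, -100, 0, -100, -100]
```

With labels aligned this way, evaluation would also need to skip positions labelled -100 instead of slicing by word count.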

In [8]:
def decode_predictions(predictions):
    decoded_predictions = []
    logits = predictions.predictions
    for logit in logits:
        # Find the predicted label for each token
        predicted_labels = torch.argmax(torch.tensor(logit), dim=1)

        # Convert label indices back to tags
        predicted_tags = [labels_to_tags[label.item()] for label in predicted_labels]

        # Append decoded tags to the decoded predictions
        decoded_predictions.append(predicted_tags)

    return decoded_predictions

In [9]:
# encode the two training sets and the evaluation set
enc_trainset1, train_lengths1 = encode_dataset(train_set1000)
enc_trainset2, train_lengths2 = encode_dataset(train_set3000)
enc_evalset, eval_lengths = encode_dataset(eval_set)

Once all three datasets are encoded, we move on to training.

Because we use a pretrained BERT token classifier, most of the model does not need to be implemented manually.

We use the same training arguments for both training sets. The third model is trained on the 3000-row set with the BERT encoder frozen, so only the classification head is updated. All three models are then evaluated on `eval_set`.
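
The effect of freezing can be sketched in isolation on a toy PyTorch module. `TinyTagger` below is a hypothetical stand-in for BERT (one linear "encoder" layer plus a classification head), not the model used in this notebook; it only shows that setting `requires_grad = False` removes the encoder's parameters from the trainable set.

```python
import torch
from torch import nn


class TinyTagger(nn.Module):
    """Hypothetical stand-in for BERT: an encoder plus a classification head."""

    def __init__(self, hidden=8, num_labels=4):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))


def count_parameters(model):
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


model = TinyTagger()
# freeze the encoder: its weights receive no gradient updates
for param in model.encoder.parameters():
    param.requires_grad = False

total, trainable = count_parameters(model)
print(total, trainable)  # 108 36
```

Only the 36 parameters of the head remain trainable, which is the same situation as the third model here, where the classifier is the only part that learns.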

In [10]:
def train_predict(dataset, dataset_name):
    # load the pretrained encoder with a newly initialised classification head
    model = BertForTokenClassification.from_pretrained(model_name, num_labels=len(tags))
    if dataset_name == "dataset3":
        # freeze the BERT encoder; only the classification head is trained
        for param in model.base_model.parameters():
            param.requires_grad = False

    training_args = TrainingArguments(
        output_dir=f'./results_{dataset_name}',
        num_train_epochs=1,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=1,
        weight_decay=0.01,
        logging_dir=f'logs_{dataset_name}',
        no_cuda=False,
    )

    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
    )

    # train and return the trainer; prediction is done by the caller
    trainer.train()
    return trainer

trainer1 = train_predict(enc_trainset1, "dataset1")
preds1 = trainer1.predict(enc_evalset)
trainer2 = train_predict(enc_trainset2, "dataset2")
preds2 = trainer2.predict(enc_evalset)
trainer3 = train_predict(enc_trainset2, "dataset3")
preds3 = trainer3.predict(enc_evalset)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

(This warning is printed once for each of the three models.)

Training loss logged for each run, in training order (the first run logged no step):

Model,Step,Training Loss
dataset1,-,-
dataset2,500,0.0213
dataset3,500,0.3012
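
The first run shows no logged loss, which is consistent with the step counts: `TrainingArguments` logs every `logging_steps` optimizer steps (500 by default), and 1000 rows at batch size 4 give only 250 steps. The sketch below reproduces that arithmetic; the helper `logged_steps` is hypothetical and assumes a single device, no gradient accumulation, and a final partial batch that is kept.

```python
import math


def logged_steps(num_examples, batch_size, num_epochs=1, logging_steps=500):
    """Return the optimizer steps at which a training loss would be logged."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return list(range(logging_steps, total_steps + 1, logging_steps))


print(logged_steps(1000, 4))  # []     (250 steps in total)
print(logged_steps(3000, 4))  # [500]  (750 steps in total)
```

Passing a smaller `logging_steps` value to `TrainingArguments` would produce loss values for the 1000-row run as well.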


In [11]:
decode1 = decode_predictions(preds1)
decode2 = decode_predictions(preds2)
decode3 = decode_predictions(preds3)


Now that we have the predictions, we compare the three models using micro-averaged and macro-averaged F1 scores.

In [12]:
def compute_metrics(preds, enc_inputs, lengths):
    # enc_inputs is unused; it is kept so that existing calls keep working
    all_preds = []
    all_labels = []
    count = 0
    o_label = tags_to_labels['O']
    for length, labels_wpad, logits_wpad in zip(lengths, preds.label_ids, preds.predictions):
        # keep only the first `length` positions, dropping the padding
        labels = labels_wpad[:length]
        logits = logits_wpad[:length]
        # predicted label = index of the highest logit
        # (named row_preds so that it does not shadow the `preds` argument)
        row_preds = logits.argmax(-1)
        all_preds.extend(row_preds)
        all_labels.extend(labels)
        # count predictions that are not the 'O' (outside) tag
        count += int((row_preds != o_label).sum())

    print(f'Count non-O-preds: {count}')
    return {'f1_micro': f1_score(all_labels, all_preds, average='micro'),
            'f1_macro': f1_score(all_labels, all_preds, average='macro')
            }
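
The two averages can diverge strongly on imbalanced tag sets. Micro F1 pools all tokens, so a frequent tag such as `O` dominates it, while macro F1 averages the per-tag scores with equal weight. The sketch below computes both by hand on invented toy data (not taken from the dataset); the helper `f1_micro_macro` is hypothetical and is meant to mirror what `f1_score` computes for single-label predictions.

```python
from collections import Counter


def f1_micro_macro(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true tag but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data: 18 'O' tokens and 2 'PER' tokens; the tagger predicts 'O' everywhere
y_true = ['O'] * 18 + ['PER'] * 2
y_pred = ['O'] * 20
micro, macro = f1_micro_macro(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.9 0.474
```

A tagger that never finds an entity still scores 0.9 micro F1 in this toy case, which is why the macro score is the more informative of the two for rare tags.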

In [13]:
print("First model: ", compute_metrics(preds1, enc_evalset, eval_lengths))
print("Second model: ", compute_metrics(preds2, enc_evalset, eval_lengths))
print("Third model: ", compute_metrics(preds3, enc_evalset, eval_lengths))

Count non-O-preds: 1184
First model:  {'f1_micro': 0.9220162116835927, 'f1_macro': 0.24094855167565363}
Count non-O-preds: 1229
Second model:  {'f1_micro': 0.9220162116835927, 'f1_macro': 0.24692034206331936}
Count non-O-preds: 22274
Third model:  {'f1_micro': 0.31103450417714834, 'f1_macro': 0.12749538428154028}


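The metrics differ sharply between the frozen and the fully fine-tuned models: the frozen model reaches a micro F1 of about 0.31 and a macro F1 of about 0.13, compared with about 0.92 and 0.24 to 0.25 for the fine-tuned models. It also predicts far more non-`O` tags (22274 versus 1184 and 1229). Training set size makes little difference here: the models trained on 1000 and 3000 rows have the same micro F1 and nearly the same macro F1.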