# Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying the named entities mentioned in a text, such as people, countries, and organizations.

This notebook measures the time taken to fine-tune the [`bert-base-cased`](https://huggingface.co/bert-base-cased) checkpoint on the 14,041 training records of the [CoNLL-2003 dataset](https://huggingface.co/datasets/conll2003) using Google Colab (with a GPU).

This time will then be compared with the time taken by SageMaker instances for the same training run.
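
Since the quantity of interest is wall-clock training time, one simple way to capture it is a small timing helper wrapped around the training call. The sketch below is an assumption about how this could be done, not necessarily how the figures in this benchmark were recorded. `timed` and `fake_training_step` are hypothetical names, and the stand-in workload takes the place of `trainer.train()`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(results: dict, key: str):
    """Store the wall-clock duration of the enclosed block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start

def fake_training_step() -> int:
    # Hypothetical stand-in workload; in the notebook this would be trainer.train().
    return sum(i * i for i in range(100_000))

timings: dict = {}
with timed(timings, "train_seconds"):
    fake_training_step()

print(f"training took {timings['train_seconds']:.4f} s")
```

The metrics returned by `trainer.train()` also include a `train_runtime` entry, which can serve as a cross-check on a manual measurement like this one.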

----

# Installing Dependencies

In [1]:
%pip install transformers datasets evaluate seqeval numpy mlflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


----

# Imports

In [2]:
import numpy as np
import mlflow

----

# Configurations

In [3]:
# setting up mlflow tracking
import mlflow

mlflow.set_tracking_uri("https://0972-85-255-233-139.eu.ngrok.io")
mlflow.set_experiment("sagemaker-instances-benchmark-ner")


<Experiment: artifact_location='mlflow-artifacts:/630633120093223427', creation_time=1680197171812, experiment_id='630633120093223427', last_update_time=1680197171812, lifecycle_stage='active', name='sagemaker-instances-benchmark-ner', tags={}>

----

# Processing Data

## Loading Datasets


In [4]:
from datasets import load_dataset

In [5]:
datasets = load_dataset("conll2003",split={"train":"train","validation":"validation"})
datasets




DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
})

The dataset contains several features, of which this task only requires `tokens` and `ner_tags`.

In [6]:
datasets = datasets.remove_columns(["id","pos_tags","chunk_tags"])
datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3250
    })
})

## Tokenizing the dataset



In [7]:
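# draw 5 random record indices from the training split
samples = np.random.choice(len(datasets["train"]), 5)

for sample in samples:
  record = datasets["train"][int(sample)]  # fetch one row instead of whole columns
  print(record["tokens"])
  print(record["ner_tags"])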

['FORT', 'LAUDERDALE', ',', 'Fla.', '1996-08-26']
[5, 6, 0, 5, 0]
['"', 'On', 'Friday', ',', 'all', 'Moslems', ',', 'including', 'Palestinians', 'in', 'Israel', '...']
[0, 0, 0, 0, 0, 7, 0, 0, 7, 0, 5, 0]
['They', 'were', 'put', 'on', 'microfilm', 'about', '30', 'years', 'ago', 'through', 'a', 'grant', 'from', 'the', 'United', 'Daughters', 'of', 'the', 'Confederacy', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 0]
['President', 'Bill', 'Clinton', 'earlier', 'this', 'month', 'invoked', 'special', 'powers', 'to', 'appoint', 'Fowler', 'during', 'the', 'congressional', 'recess', 'because', 'the', 'Senate', 'delayed', 'confirming', 'his', 'nomination', '.']
[0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0]
['-', 'Mortgage', 'lending', 'rates', 'are', 'on', 'the', 'way', 'up', 'with', 'banks', 'and', 'building', 'societies', 'poised', 'to', 'add', 'around', 'a', 'quarter', 'of', 'a', 'percentage', 'point', 'to', 'their', 'main', 'variable', 'rates', '

The output above shows that the sentences in the dataset have already been split into word-level tokens by some simple tokenizer. Identifying which tokenizer was used is out of scope for this project.

To work with the `bert-base-cased` checkpoint, this data needs to be retokenized with the checkpoint's own tokenizer, which can split a word into several sub-words and so increase the number of tokens per record. This creates a problem: `ner_tags` holds exactly one label per original token, so after retokenization the tokens and their NER tags no longer line up.
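
The mismatch is easy to see on a toy example. The sketch below implements greedy longest-match-first sub-word splitting in the style of WordPiece over a tiny, hypothetical vocabulary (`TOY_VOCAB` is made up for illustration and is not the `bert-base-cased` vocabulary); `split_word` and `retokenize` are likewise illustrative names. It shows how one word can become several sub-words, so that a per-word tag list no longer matches the token list.

```python
# Tiny, hypothetical vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"EU", "rejects", "German", "la", "##mb", ".", "[UNK]"}

def split_word(word: str, vocab: set) -> list:
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def retokenize(words, vocab):
    """Return sub-word tokens plus, for each token, the index of its source word."""
    tokens, word_ids = [], []
    for index, word in enumerate(words):
        for piece in split_word(word, vocab):
            tokens.append(piece)
            word_ids.append(index)
    return tokens, word_ids

words = ["EU", "rejects", "German", "lamb", "."]
tags = [3, 0, 7, 0, 0]  # illustrative: one tag per original word

tokens, word_ids = retokenize(words, TOY_VOCAB)
print(tokens)    # ['EU', 'rejects', 'German', 'la', '##mb', '.']
print(word_ids)  # [0, 1, 2, 3, 3, 4]
print(len(tags), "tags vs", len(tokens), "tokens")
```

The `word_ids` list is what makes realignment possible: it records which original word each sub-word came from.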

There are several ways to resolve this. The approach chosen here is to keep the NER tag on the first sub-word generated for each token and to ignore the remaining sub-words by assigning them the label `-100`.

The output of the previous code cell shows that `ner_tags` are represented as integers. The corresponding label names can be retrieved with the code below:

In [8]:
label_names = datasets["train"].features["ner_tags"].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
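
As a quick illustration of the mapping, the sketch below decodes the tags of one of the samples printed earlier (`FORT LAUDERDALE , Fla. 1996-08-26`). The label list is copied from the output above so that the snippet runs on its own; `decode_tags` is a hypothetical helper name.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def decode_tags(tokens, tag_ids, names):
    """Pair each token with the string label behind its integer tag."""
    return [(token, names[tag_id]) for token, tag_id in zip(tokens, tag_ids)]

sample_tokens = ['FORT', 'LAUDERDALE', ',', 'Fla.', '1996-08-26']
sample_tag_ids = [5, 6, 0, 5, 0]

decoded = decode_tags(sample_tokens, sample_tag_ids, label_names)
for token, label in decoded:
    print(f"{token:<12}{label}")
```

Here `B-LOC` marks the first token of a location and `I-LOC` a continuation, so `FORT LAUDERDALE` is read as a single two-token location entity.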

In [9]:
checkpoint = "bert-base-cased"


In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [11]:
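from typing import Optional

def align_labels(labels:list[int],word_ids:list[Optional[int]]):
  # word_ids has one entry per sub-word token: the index of its source word,
  # or None for special tokens such as [CLS] and [SEP]
  last_word_id = None

  def generate_new_label(word_id):
    nonlocal last_word_id
    label: int = -100  # default: this position is ignored

    # keep the real label only for the first sub-word of each word
    if word_id!=last_word_id and word_id is not None:
      label = labels[word_id]

    last_word_id = word_id
    return label

  return [generate_new_label(word_id) for word_id in word_ids]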

In [12]:
def tokenize_and_align_labels(batch):
  tokenized_inputs = tokenizer(
      batch["tokens"],
      truncation=True,
      is_split_into_words=True,
  )

  all_labels = batch["ner_tags"]

  new_labels = [align_labels(all_labels[i],tokenized_inputs.word_ids(i)) for i in range(len(all_labels))]
  tokenized_inputs["labels"] = new_labels
  return tokenized_inputs

In [13]:
tokenized_datasets = datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns= datasets["train"].column_names,
)

tokenized_datasets




DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
})

In [14]:
tokenized_datasets["train"][0]

{'input_ids': [101,
  7270,
  22961,
  1528,
  1840,
  1106,
  21423,
  1418,
  2495,
  12913,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]}
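
The `labels` entry above can be traced by hand. In the sketch below the word IDs and word-level labels are assumptions inferred from the positions of the `-100` values in that output (special tokens at both ends, and one word split into two sub-words); in the notebook the word IDs come from `tokenized_inputs.word_ids(i)`. `first_subword_labels` is a hypothetical, loop-based equivalent of the first-sub-word rule.

```python
IGNORE_INDEX = -100  # label value used for positions that should be ignored

def first_subword_labels(word_labels, word_ids):
    """Label the first sub-word of each word and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first sub-word of a new word
            aligned.append(word_labels[word_id])
        else:                          # continuation of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned

# Inferred from the output above: 9 words, the 8th split into two sub-words.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
word_labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]

aligned = first_subword_labels(word_labels, word_ids)
print(aligned)
```

The 12 aligned labels line up one-to-one with the 12 `input_ids`, which is what the model needs during training.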

In [15]:
# not needed anymore
del datasets

----

# Fine-Tuning the model with the Trainer API

## Data Collator

In [16]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
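
Records have different lengths, so each batch has to be padded to a common length. `DataCollatorForTokenClassification` pads the inputs and, by default, pads `labels` with `-100` so that padded positions are ignored. The pure-Python sketch below imitates that behaviour for illustration only: `pad_batch` is a hypothetical helper, the two short records are made up (reusing token IDs from the sample above), and the pad token ID of `0` corresponds to BERT's `[PAD]` token.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every field to the length of the longest record in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
        batch["labels"].append(f["labels"] + [label_pad_id] * gap)
    return batch

# Two made-up records of different lengths.
features = [
    {"input_ids": [101, 7270, 102],
     "attention_mask": [1, 1, 1],
     "labels": [-100, 3, -100]},
    {"input_ids": [101, 7270, 22961, 1528, 102],
     "attention_mask": [1, 1, 1, 1, 1],
     "labels": [-100, 3, 0, 7, -100]},
]

batch = pad_batch(features)
for name, rows in batch.items():
    print(name, rows)
```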

## Metrics

In [17]:
import evaluate

metric = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
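
The filtering step inside `compute_metrics` can be illustrated on a toy batch. The sketch below uses plain Python lists in place of NumPy arrays and made-up scores; `decode_predictions`, `argmax`, and `one_hot` are hypothetical helpers that mirror the argmax and the `-100` filter, leaving the scoring itself to `seqeval`.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

def decode_predictions(logits, labels, names, ignore_index=-100):
    """Return (true_labels, true_predictions) with ignored positions dropped."""
    true_labels, true_predictions = [], []
    for token_scores, token_labels in zip(logits, labels):
        kept_labels, kept_predictions = [], []
        for scores, label in zip(token_scores, token_labels):
            if label == ignore_index:
                continue  # special token, continuation sub-word, or padding
            kept_labels.append(names[label])
            kept_predictions.append(names[argmax(scores)])
        true_labels.append(kept_labels)
        true_predictions.append(kept_predictions)
    return true_labels, true_predictions

def one_hot(index, size=9):
    # Made-up scores: 1.0 for the chosen class, 0.0 elsewhere.
    return [1.0 if i == index else 0.0 for i in range(size)]

# One sequence: [CLS], "Bill", "Clinton", [SEP]; the toy model gets "Clinton" wrong.
logits = [[one_hot(0), one_hot(1), one_hot(0), one_hot(0)]]
labels = [[-100, 1, 2, -100]]

true_labels, true_predictions = decode_predictions(logits, labels, label_names)
print(true_labels)       # [['B-PER', 'I-PER']]
print(true_predictions)  # [['B-PER', 'O']]
```

Only the two positions carrying a real label survive the filter, so the metric is computed on word-level predictions rather than on special tokens or continuation sub-words.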

## Initializing Model

In [18]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [19]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
)

model.config.num_labels

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

9

## Training the model

In [20]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [21]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


In [22]:
with mlflow.start_run(run_name="colab"):
  trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0549 | 0.04638 | 0.937752 | 0.938068 | 0.93791 | 0.989389 |
| 2 | 0.021 | 0.042347 | 0.938769 | 0.949512 | 0.94411 | 0.990849 |
| 3 | 0.0116 | 0.041664 | 0.94595 | 0.951363 | 0.948649 | 0.99118 |
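
As a sanity check on the training log above, the reported F1 should equal the harmonic mean of the reported precision and recall. The snippet below recomputes it from those values; `f1_score` is a hypothetical helper, and small differences in the last digit are expected because the printed figures are rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (epoch, precision, recall, reported F1) copied from the training log.
results = [
    (1, 0.937752, 0.938068, 0.937910),
    (2, 0.938769, 0.949512, 0.944110),
    (3, 0.945950, 0.951363, 0.948649),
]

for epoch, precision, recall, reported in results:
    recomputed = f1_score(precision, recall)
    print(f"epoch {epoch}: recomputed F1 = {recomputed:.6f}, reported = {reported:.6f}")
```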
