# Developing and assessing named entity recognition models using the MultiNERD dataset

**Author**: [Mahbub Ul Alam](https://www.linkedin.com/in/anondo),
 Email: mahbub.ul.alam.anondo@gmail.com

## Introduction

This notebook trains and evaluates a named entity recognition (NER) model on the English subset of the [**MultiNERD**](https://huggingface.co/datasets/Babelscape/multinerd) dataset.

The [**MultiNERD**](https://huggingface.co/datasets/Babelscape/multinerd) dataset covers 10 languages (Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian and Spanish), 15 NER categories (Person (PER), Location (LOC), Organization (ORG), Animal (ANIM), Biological entity (BIO), Celestial Body (CEL), Disease (DIS), Event (EVE), Food (FOOD), Instrument (INST), Media (MEDIA), Plant (PLANT), Mythological entity (MYTH), Time (TIME) and Vehicle (VEHI)), and 2 textual genres (Wikipedia and WikiNews).

A BERT (Bidirectional Encoder Representations from Transformers)-based Language Model will be fine-tuned with the English subset to create the model.

In this work, I'll walk through the process of training and evaluating named entity recognition (NER) models using the English section of the [**MultiNERD**](https://huggingface.co/datasets/Babelscape/multinerd) dataset. This dataset is quite extensive, encompassing 10 languages and 15 NER categories, such as Person, Location, Organization, among others, and includes texts from two genres: Wikipedia and WikiNews. The method will involve fine-tuning a BERT (Bidirectional Encoder Representations from Transformers)-based language model with the English portion of the [**MultiNERD**](https://huggingface.co/datasets/Babelscape/multinerd) dataset.

### requirements.txt file

Initially, a requirements.txt file is created to establish the environment setup.

In [None]:
%%writefile requirements.txt

datasets==2.16.0
git+https://github.com/MartinKondor/jsonl.git
nervaluate==0.1.8
plotly==5.9.0
protobuf==3.20.1
span_marker==1.5.0
transformers==4.26.1
pandas==1.4.2

### Installing Required Packages

Next, the relevant packages can be installed.

In [None]:
%pip install -r requirements.txt

### Loading Required Libraries

In [None]:
from datasets import load_dataset
from jsonl import jsonl
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer
from nervaluate import Evaluator
import plotly.graph_objects as go
import pandas as pd
import os

### Downloading the MultiNERD Dataset

This notebook operates under the assumption that the dataset has already been downloaded. To download the data, use the following commands in a terminal, ensuring you have git (installation guide: https://github.com/git-guides/install-git) and git-lfs (https://git-lfs.com) installed.

    $ git lfs install
    $ git clone https://huggingface.co/datasets/Babelscape/multinerd

### Connecting to Google Drive for Google Colab Usage

This assignment was completed on Google Colab using an A100 GPU. Working on Colab requires granting the notebook access to Google Drive; skip this step when running the notebook locally.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Experiment Directory

In [None]:
# The directory where all the work related items will be stored.
#exp_dir='/content/drive/MyDrive/EXP_DIRECTORY_NAME/'
#if not os.path.isdir(exp_dir):
#    os.mkdir(exp_dir)
exp_dir=''

### Dataset Directory

In [None]:
# The directory where the dataset is downloaded
#data_dir='/content/drive/MyDrive/DATA_DIRECTORY_NAME/'
data_dir=exp_dir+'multinerd/'

### The Base BERT (Bidirectional Encoder Representations from Transformers)-based Language Model


For the base model, I used "***bert-base-multilingual-cased***" (https://huggingface.co/bert-base-multilingual-cased), which will be fine-tuned in the upcoming steps. Alternatively, the following BERT-style encoders (BERT, RoBERTa and XLM-RoBERTa variants) are also suitable options:

    prajjwal1/bert-tiny
    prajjwal1/bert-mini
    prajjwal1/bert-small
    prajjwal1/bert-medium
    bert-base-cased
    bert-large-cased
    bert-base-multilingual-cased
    bert-base-multilingual-uncased
    roberta-base
    roberta-large
    xlm-roberta-base
    xlm-roberta-large

In [None]:
bert_base_model_name = "bert-base-multilingual-cased"

### Model Directory

In [None]:
# The directory where all the models will be stored.
model_dir=exp_dir+'results/'
if not os.path.isdir(model_dir):
    os.mkdir(model_dir)

## Implementing System A

In System A, the dataset will be constructed using the English subset of the [**MultiNERD**](https://huggingface.co/datasets/Babelscape/multinerd) dataset.

### System A Dataset

In [None]:
#Creating the dataset based on only English subset.
system_A_dataset=load_dataset('json', data_files={'train': data_dir+'train/train_en.jsonl',
                                          'test': data_dir+'test/test_en.jsonl',
                                         'val': data_dir+'val/val_en.jsonl'})
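
To make the expected file layout concrete, here is a minimal sketch of a JSON Lines file with the two fields this notebook relies on, `tokens` and `ner_tags`. The record below is a hand-written toy example, not taken from MultiNERD, and the file name `toy_en.jsonl` is hypothetical; the sketch uses only the standard library rather than the `jsonl` package installed above.

```python
import json
import os
import tempfile

# Toy record (not real MultiNERD data): one word list plus one tag ID per word.
toy_records = [
    {"tokens": ["Marie", "Curie", "visited", "Paris", "."], "ner_tags": [1, 2, 0, 5, 0]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_en.jsonl")  # hypothetical file name

    # JSON Lines: one JSON object per line.
    with open(path, "w", encoding="utf-8") as handle:
        for record in toy_records:
            handle.write(json.dumps(record) + "\n")

    # Read the file back, skipping blank lines.
    with open(path, encoding="utf-8") as handle:
        loaded = [json.loads(line) for line in handle if line.strip()]

print(loaded[0]["tokens"])
print(loaded[0]["ner_tags"])
```

Each integer in `ner_tags` is an index into the IOB2 label list defined in the next cell, so `1`, `2` and `5` stand for `B-PER`, `I-PER` and `B-LOC`.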

In [None]:
#Creating a list of all 15 NER categories in IOB2 format keeping the same order as used in the dataset.
system_A_all_labels = [
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-ANIM",
        "I-ANIM",
        "B-BIO",
        "I-BIO",
        "B-CEL",
        "I-CEL",
        "B-DIS",
        "I-DIS",
        "B-EVE",
        "I-EVE",
        "B-FOOD",
        "I-FOOD",
        "B-INST",
        "I-INST",
        "B-MEDIA",
        "I-MEDIA",
        "B-MYTH",
        "I-MYTH",
        "B-PLANT",
        "I-PLANT",
        "B-TIME",
        "I-TIME",
        "B-VEHI",
        "I-VEHI",
    ]
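
The position of each label in this list is its integer ID in the dataset. As a small sketch (the names `tags`, `labels`, `label2id` and `id2label` are illustrative and not used elsewhere in this notebook), the same list can be derived from the tag order, which makes the ID of any label easy to look up:

```python
# Tag order as used in the label list above.
tags = ["PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS",
        "EVE", "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI"]

# "O" comes first, followed by a B-/I- pair for every tag, in the same order.
labels = ["O"] + [f"{prefix}-{tag}" for tag in tags for prefix in ("B", "I")]

# Lookup tables between label strings and integer IDs.
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}

print(len(labels))                            # 31
print(label2id["B-DIS"], label2id["I-DIS"])   # 13 14
```

These IDs matter later for System B, where labels are dropped and the remaining ones are renumbered.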

In [None]:
#Creating a list with all the 15 NER tags
system_A_all_tags_list=["PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS",
          "EVE", "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI"]

### System A Output Directory

In [None]:
# The output directory for system A
model_A_output_dir=model_dir+bert_base_model_name+'-system-A/'
if not os.path.isdir(model_A_output_dir):
    os.mkdir(model_A_output_dir)

### Training Model A

For training NER models, I used the SpanMarker (https://github.com/tomaarsen/SpanMarkerNER/) Python framework. SpanMarker processes individual sentences with a pretrained encoder, such as BERT or RoBERTa, and fine-tunes that encoder with a simple cross-entropy loss between the predicted span labels and the gold labels. It relies on "markers": special tokens that indicate the beginning and end of a candidate span within the input sentence.

In [None]:
# Initialize a SpanMarker model using a pretrained BERT-style encoder
model_A = SpanMarkerModel.from_pretrained(
        bert_base_model_name,
        labels=system_A_all_labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=12,
    )
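
To give an intuition for what `entity_max_length` bounds, the sketch below enumerates every candidate word span of a sentence up to a maximum length. It is an illustration only, not SpanMarker's actual implementation; the helper name `enumerate_candidate_spans` and the example sentence are my own.

```python
def enumerate_candidate_spans(words, entity_max_length):
    """Return (start, end) word-index pairs, end exclusive, for all spans up to the given length."""
    spans = []
    for start in range(len(words)):
        # The span may not run past the sentence or exceed entity_max_length words.
        last_end = min(start + entity_max_length, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


words = ["Paris", "is", "in", "France"]
spans = enumerate_candidate_spans(words, entity_max_length=2)
print(spans)
# [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
```

With `entity_max_length=12`, as configured above, an entity longer than 12 words could never be proposed as a candidate, so the value trades coverage of long entities against the number of spans to classify.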


In [None]:
# Prepare the 🤗 transformers training arguments
model_A_args = TrainingArguments(
        output_dir=model_A_output_dir,
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        # gradient_accumulation_steps=2,
        num_train_epochs=1,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        #fp16=True,
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=1000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )
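
The interaction between batch size, `warmup_ratio` and `eval_steps` can be estimated with some quick arithmetic. The sketch below assumes a single device, no gradient accumulation and one training sample per sentence (SpanMarker may split a sentence into several samples, so real counts can be higher). The value of `num_train_examples` is a hypothetical placeholder, not the real size of the English training split.

```python
import math

num_train_examples = 100_000      # hypothetical placeholder, NOT the real dataset size
per_device_train_batch_size = 32
num_train_epochs = 1
warmup_ratio = 0.1
eval_steps = 1000

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Number of steps during which the learning rate ramps up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Number of intermediate evaluations triggered by evaluation_strategy="steps".
num_evaluations = total_steps // eval_steps

print(total_steps, warmup_steps, num_evaluations)  # 3125 313 3
```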


In [None]:
# Initialize the trainer using our model, training args & dataset, and train
model_A_trainer = Trainer(
        model=model_A,
        args=model_A_args,
        train_dataset=system_A_dataset['train'],
        eval_dataset=system_A_dataset['val'],
    )

In [None]:
model_A_trainer.train()

### Save Directory of Model A

In [None]:
# Save directory of model A
model_A_save_dir=model_A_output_dir+bert_base_model_name+'-system-A-trained-model/'

if not os.path.isdir(model_A_save_dir):
    os.mkdir(model_A_save_dir)

In [None]:
model_A_trainer.save_model(model_A_save_dir)

### Evaluating and Saving Model A Metrics

Initially, I am calculating and storing the evaluations using the ***seqeval*** (https://github.com/chakki-works/seqeval) evaluation metric, which will be analyzed at a later stage.

In [None]:
# Compute seqeval metrics on the test set
model_A_metrics = model_A_trainer.evaluate(system_A_dataset['test'], metric_key_prefix="test")

In [None]:
#Save seqeval metrics on the test set
jsonl.dump([model_A_metrics], model_A_output_dir+'seqeval_evaluation.jsonl')

The model generates predictions in the Prodigy span format (https://prodi.gy/). I am saving these predictions along with the input labels in the IOB2 format for future reference.

In [None]:
def create_IOB2_ground_truth_and_predicted_seqs(dataset, labels, model, output_dir):

    IOB2_ground_truth_seqs=[]
    IOB2_predicted_seqs=[]

    for dataset_index in range(len(dataset)):
        if dataset_index%1000==0:
            # dataset_index instances have been fully processed at this point
            print(str(dataset_index)+'/'+str(len(dataset))+' instances completed.')


        IOB2_ground_truth_entity_seq=dataset[dataset_index]['ner_tags']

        for ground_truth_seq_index in range (len(IOB2_ground_truth_entity_seq)):
            IOB2_ground_truth_entity_seq[ground_truth_seq_index]=labels[IOB2_ground_truth_entity_seq[ground_truth_seq_index]]

        IOB2_ground_truth_seqs.append(IOB2_ground_truth_entity_seq)

        IOB2_predicted_entity_seq=['O' for i in range(len(IOB2_ground_truth_entity_seq))]

        input_tokens_seq = dataset[dataset_index]['tokens']

        predicted_outcomes = model.predict(input_tokens_seq)

        for result in predicted_outcomes:

            entity_start_index=result['word_start_index']
            entity_end_index=result['word_end_index']-1
            predicted_entity_label=result['label']

            for entity_index in range(entity_start_index, entity_end_index+1):
                if entity_index==entity_start_index:
                    IOB2_predicted_entity_seq[entity_index]='B-'+predicted_entity_label
                else:
                    IOB2_predicted_entity_seq[entity_index]='I-'+predicted_entity_label

        IOB2_predicted_seqs.append(IOB2_predicted_entity_seq)

    print(len(IOB2_ground_truth_seqs))
    print(len(IOB2_predicted_seqs))

    jsonl.dump(IOB2_ground_truth_seqs, output_dir+'IOB2_ground_truth_seqs.jsonl')
    jsonl.dump(IOB2_predicted_seqs, output_dir+'IOB2_predicted_seqs.jsonl')

    return IOB2_ground_truth_seqs, IOB2_predicted_seqs
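
The span-to-IOB2 conversion at the core of the function above can be tried in isolation, without a trained model. In this sketch the predictions are hand-written dictionaries that mimic the keys read above (`label`, `word_start_index`, `word_end_index`, with an exclusive end index); the helper name `spans_to_iob2` and the toy sentence are illustrative.

```python
def spans_to_iob2(num_words, predicted_spans):
    """Convert word-level spans (end index exclusive) into an IOB2 tag sequence."""
    tags = ["O"] * num_words
    for span in predicted_spans:
        start, end = span["word_start_index"], span["word_end_index"]
        # The first word of an entity gets B-, every following word gets I-.
        tags[start] = "B-" + span["label"]
        for index in range(start + 1, end):
            tags[index] = "I-" + span["label"]
    return tags


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
toy_predictions = [
    {"label": "PER", "word_start_index": 0, "word_end_index": 2},
    {"label": "LOC", "word_start_index": 5, "word_end_index": 6},
]

iob2 = spans_to_iob2(len(tokens), toy_predictions)
print(iob2)
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O']
```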

#### Loading Saved Model A

In [None]:
# Uncomment to load Saved model A
#model_A = SpanMarkerModel.from_pretrained(model_A_output_dir+bert_base_model_name+'-system-A-trained-model').cuda()

In [None]:
# Generating ground truths and predictions in IOB2 format
model_A_IOB2_ground_truth_seqs, model_A_IOB2_predicted_seqs = create_IOB2_ground_truth_and_predicted_seqs(system_A_dataset['test'], system_A_all_labels, model_A, model_A_output_dir)

I am now calculating and storing evaluations using the '***nervaluate***' (https://github.com/MantisAI/nervaluate) evaluation metric, which will be examined in a later analysis.

In [None]:
# Compute nervaluate metrics on the test set
model_A_nervaluate_evaluator = Evaluator(model_A_IOB2_ground_truth_seqs, model_A_IOB2_predicted_seqs, tags=system_A_all_tags_list, loader="list")

model_A_nervaluate_results, model_A_nervaluate_results_by_tag = model_A_nervaluate_evaluator.evaluate()

In [None]:
# Save nervaluate metrics on the test set
jsonl.dump([model_A_nervaluate_results], model_A_output_dir+'nervaluate_overall_evaluation.jsonl')
jsonl.dump([model_A_nervaluate_results_by_tag], model_A_output_dir+'nervaluate_tag_wise_evaluation.jsonl')

## Implementing System B

In System B, the dataset will be identical to the System A dataset, but it will only include the following five entity types:

    PERSON (PER), ORGANIZATION (ORG), LOCATION (LOC), DISEASES (DIS), ANIMAL (ANIM)

For this system, I am excluding all other entity types present in the System A dataset. Since label IDs need to maintain sequential order, the IDs for "***B-DIS***" and "***I-DIS***" have been altered to ***9*** and ***10***, respectively, from their original ***13*** and ***14*** in the System A dataset. For more clarity, refer to the '***system_A_all_labels***' list.

In [None]:
#I am excluding all other entity types present in the System A dataset. Since label IDs need
#to maintain sequential order, the IDs for "B-DIS" and "I-DIS" have been altered to
#9 and 10, respectively, from their original 13 and 14 in the System A dataset.
#For more clarity, refer to the 'system_A_all_labels' list.
def create_system_B_dataset(jsonl_data_file):

    #PERSON(PER), ORGANIZATION(ORG), LOCATION(LOC), ANIMAL(ANIM), DISEASES(DIS),
    #selected_labels = ["O","B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC",
    #                   "I-LOC", "B-ANIM", "I-ANIM", "B-DIS", "I-DIS"]

    system_B_label_ids=[0, 1, 2, 3, 4, 5 , 6, 7, 8, 13, 14]

    data = jsonl.load(jsonl_data_file)

    system_B_change_found=False

    for instance_index in range(len(data)):

        ner_tag_ids_list=data[instance_index]['ner_tags']

        system_B_change_found=False

        for ner_tag_id_index in range(len(ner_tag_ids_list)):

            if ner_tag_ids_list[ner_tag_id_index] not in system_B_label_ids:

                ner_tag_ids_list[ner_tag_id_index]=0

                system_B_change_found=True

            elif ner_tag_ids_list[ner_tag_id_index]==13:

                ner_tag_ids_list[ner_tag_id_index]=9

                system_B_change_found=True

            elif ner_tag_ids_list[ner_tag_id_index]==14:

                ner_tag_ids_list[ner_tag_id_index]=10

                system_B_change_found=True

        if system_B_change_found:
            data[instance_index]['ner_tags']=ner_tag_ids_list

    return data
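
The remapping rule can also be written as a single lookup table, which makes the old-to-new ID correspondence explicit. This is an alternative sketch of the same rule, not the code used to produce the results; `SYSTEM_B_ID_MAP` and `remap_tag_ids` are illustrative names.

```python
# System A IDs kept in System B, mapped to their new consecutive IDs.
# 13 (B-DIS) and 14 (I-DIS) move to 9 and 10; every other ID becomes 0 ("O").
SYSTEM_B_ID_MAP = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 13: 9, 14: 10}


def remap_tag_ids(tag_ids, id_map=SYSTEM_B_ID_MAP, outside_id=0):
    """Return a new list with kept IDs renumbered and dropped IDs set to 'O'."""
    return [id_map.get(tag_id, outside_id) for tag_id in tag_ids]


toy_tag_ids = [0, 1, 2, 9, 10, 13, 14, 7, 29]
print(remap_tag_ids(toy_tag_ids))
# [0, 1, 2, 0, 0, 9, 10, 7, 0]
```

Because the lookup always reads the original ID, a value that has already been rewritten can never be remapped a second time.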

In [None]:
system_B_val_data = create_system_B_dataset(data_dir+'val/val_en.jsonl')
system_B_test_data = create_system_B_dataset(data_dir+'test/test_en.jsonl')
system_B_train_data = create_system_B_dataset(data_dir+'train/train_en.jsonl')

In [None]:
jsonl.dump(system_B_val_data, data_dir+'val/val_system_B_en.jsonl')
jsonl.dump(system_B_test_data, data_dir+'test/test_system_B_en.jsonl')
jsonl.dump(system_B_train_data, data_dir+'train/train_system_B_en.jsonl')

### System B Dataset

In [None]:
#Creating the System B dataset
system_B_dataset=load_dataset('json', data_files={'train': data_dir+'train/train_system_B_en.jsonl',
                                          'test': data_dir+'test/test_system_B_en.jsonl',
                                         'val': data_dir+'val/val_system_B_en.jsonl'})

In [None]:
#Creating a list of all 5 NER categories in IOB2 format keeping the same order as used in the dataset.
system_B_all_labels= [
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-ANIM",
        "I-ANIM",
        "B-DIS",
        "I-DIS",
    ]

In [None]:
#Creating a list with all the 5 NER tags
system_B_all_tags_list=["PER", "ORG", "LOC", "ANIM", "DIS"]

### System B Output Directory

In [None]:
model_B_output_dir=model_dir+bert_base_model_name+'-system-B/'
if not os.path.isdir(model_B_output_dir):
    os.mkdir(model_B_output_dir)

### Training Model B

In [None]:
# Initialize a SpanMarker model using a pretrained BERT-style encoder
model_B = SpanMarkerModel.from_pretrained(
        bert_base_model_name,
        labels=system_B_all_labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=12,
    )

In [None]:
# Prepare the 🤗 transformers training arguments
model_B_args = TrainingArguments(
        output_dir=model_B_output_dir,
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        # gradient_accumulation_steps=2,
        num_train_epochs=1,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        #fp16=True,
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=1000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )

In [None]:
# Initialize the trainer using our model, training args & dataset, and train
model_B_trainer = Trainer(
        model=model_B,
        args=model_B_args,
        train_dataset=system_B_dataset['train'],
        eval_dataset=system_B_dataset['val'],
    )

In [None]:
model_B_trainer.train()

### Save Directory of Model B

In [None]:
# Save directory of model B
model_B_save_dir=model_B_output_dir+bert_base_model_name+'-system-B-trained-model/'

if not os.path.isdir(model_B_save_dir):
    os.mkdir(model_B_save_dir)

In [None]:
model_B_trainer.save_model(model_B_save_dir)

### Evaluating and Saving Model B Metrics

Initially, I am calculating and storing the evaluations using the ***seqeval*** (https://github.com/chakki-works/seqeval) evaluation metric, which will be analyzed at a later stage.

In [None]:
# Compute seqeval metrics on the test set
model_B_metrics = model_B_trainer.evaluate(system_B_dataset['test'], metric_key_prefix="test")

In [None]:
#Save seqeval metrics on the test set
jsonl.dump([model_B_metrics], model_B_output_dir+'seqeval_evaluation.jsonl')

#### Loading Saved Model B

In [None]:
#Uncomment to load saved model B
#model_B = SpanMarkerModel.from_pretrained(model_B_output_dir+bert_base_model_name+'-system-B-trained-model').cuda()

The model generates predictions in the Prodigy span format (https://prodi.gy/). I am saving these predictions along with the input labels in the IOB2 format for future reference.

In [None]:
# Generating ground truths and predictions in IOB2 format
model_B_IOB2_ground_truth_seqs, model_B_IOB2_predicted_seqs = create_IOB2_ground_truth_and_predicted_seqs(system_B_dataset['test'], system_B_all_labels, model_B, model_B_output_dir)

I am now calculating and storing evaluations using the '***nervaluate***' (https://github.com/MantisAI/nervaluate) evaluation metric, which will be examined in a later analysis.

In [None]:
# Compute nervaluate metrics on the test set
model_B_nervaluate_evaluator = Evaluator(model_B_IOB2_ground_truth_seqs, model_B_IOB2_predicted_seqs, tags=system_B_all_tags_list, loader="list")

model_B_nervaluate_results, model_B_nervaluate_results_by_tag = model_B_nervaluate_evaluator.evaluate()

In [None]:
# Save nervaluate metrics on the test set
jsonl.dump([model_B_nervaluate_results], model_B_output_dir+'nervaluate_overall_evaluation.jsonl')
jsonl.dump([model_B_nervaluate_results_by_tag], model_B_output_dir+'nervaluate_tag_wise_evaluation.jsonl')

## Evaluating Performance Through Tables and Graphs Analysis

The following function prints tables of the evaluation results obtained with '***seqeval***' and '***nervaluate***':

In [None]:
def print_evaluation_score_table(results_dict, title, tags=None):

    if 'seqeval' in title:

        table_dict=dict()
        for tag in tags:
            table_dict[tag]=results_dict['test_'+tag]

        table_dict['Overall']={'precision':results_dict['test_overall_precision'],
                                 'recall':results_dict['test_overall_recall'],
                                 'f1':results_dict['test_overall_f1'],
                                 'number': None}

        print(title)
        print()
        print(pd.DataFrame.from_dict(table_dict).round(3))

    elif 'Overall' in title:
        print(title)
        print()
        print(pd.DataFrame.from_dict(results_dict).round(3))

    else:
        print(title+tags)
        print()
        print(pd.DataFrame.from_dict(results_dict[tags]).round(3))

The following function plots the evaluation results obtained with '***seqeval***' and '***nervaluate***':

In [None]:
def create_plot(result_dict, tags, title):

    precisions=[]
    recalls=[]
    f1s=[]

    if 'seqeval' in title:

        precisions.append(result_dict['test_overall_precision'])
        recalls.append(result_dict['test_overall_recall'])
        f1s.append(result_dict['test_overall_f1'])
        for tag in tags:
            precisions.append(result_dict['test_'+tag]['precision'])
            recalls.append(result_dict['test_'+tag]['recall'])
            f1s.append(result_dict['test_'+tag]['f1'])

        x_values=["Overall"]
        for tag in tags:
            x_values.append(tag)

        xaxes_title='Entity Type'

        ticktext=x_values  # keep tick labels aligned with x_values, including "Overall"

    else:

        if tags!='Overall':
            result_dict=result_dict[tags]

        for eval_category in result_dict.keys():
            precisions.append(result_dict[eval_category]['precision'])
            recalls.append(result_dict[eval_category]['recall'])
            f1s.append(result_dict[eval_category]['f1'])

        x_values=list(result_dict.keys())

        ticktext=x_values

        xaxes_title='nervaluate Evaluation Type'


    fig = go.Figure(data=[
    go.Bar(name='precision', x=x_values, y=precisions),
    go.Bar(name='recall', x=x_values, y=recalls),
    go.Bar(name='f1', x=x_values, y=f1s)])

    # Change the bar mode
    fig.update_layout(barmode='group', title_text=title,
                 xaxis1=dict(#tickvals=special_case,
                            ticktext=ticktext, tickangle=-45,
                             ))
    fig.update_yaxes(title='Evaluation Score')
    fig.update_xaxes(title=xaxes_title)

    fig.show()

### Evaluation with 'seqeval'

'seqeval' is a Python framework designed for evaluating sequence labeling tasks. It's particularly useful for assessing the performance of chunking tasks, like named-entity recognition (NER).

This library is extensively used and provides standard classification metrics, including F1 score, Recall, and Precision, specifically for sequence data tasks.
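
To illustrate what entity-level (as opposed to token-level) scoring means, here is a simplified sketch of strict micro-averaged precision, recall and F1 over IOB2 sequences. It is not seqeval's implementation: for instance, this sketch ignores a stray `I-` tag that has no preceding `B-` tag, which seqeval's default mode handles differently. The function names and the toy sequences are illustrative.

```python
def extract_entities(iob2_tags):
    """Return a set of (label, start, end) triples, end exclusive, from one IOB2 sequence."""
    entities, start, label = set(), None, None
    # The trailing "O" is a sentinel that closes an entity ending at the last word.
    for index, tag in enumerate(list(iob2_tags) + ["O"]):
        continues_entity = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues_entity:
            entities.add((label, start, index))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = index, tag[2:]
    return entities


def micro_prf(gold_seqs, pred_seqs):
    """Strict micro-averaged precision, recall and F1: label and both boundaries must match."""
    true_positives = num_predicted = num_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_entities, pred_entities = extract_entities(gold), extract_entities(pred)
        true_positives += len(gold_entities & pred_entities)
        num_predicted += len(pred_entities)
        num_gold += len(gold_entities)
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]  # the PER boundary is one word too short
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

In the toy example the truncated `PER` prediction counts as both a wrong prediction and a missed gold entity, so a single boundary error lowers precision and recall at the same time.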

In [None]:
model_A_seqeval_evaluation_dict=jsonl.load(model_A_output_dir+'seqeval_evaluation.jsonl')[0]

In [None]:
print_evaluation_score_table(model_A_seqeval_evaluation_dict, "seqeval, Model A: Evaluation", tags=system_A_all_tags_list)

Analyzing the seqeval results from Model A, it's evident that the overall performance is satisfactory, and there appears to be a correlation between the scores and the total number of instances for each entity. Interestingly, some entities with fewer instances, such as INST (with only 12 instances), achieved surprisingly high scores (f1 score = 0.783). This may be attributed to fine-tuning the model with a relatively small dataset. Another factor to consider is the use of a cased BERT model as the base. A comparison with an uncased BERT model could offer further insights into these outcomes.

The results are also presented in the following interactive Plotly graph:

In [None]:
create_plot(model_A_seqeval_evaluation_dict, system_A_all_tags_list, "seqeval Evaluation for System A")

In [None]:
model_B_seqeval_evaluation_dict=jsonl.load(model_B_output_dir+'seqeval_evaluation.jsonl')[0]

In [None]:
print_evaluation_score_table(model_B_seqeval_evaluation_dict, "seqeval, Model B: Evaluation", tags=system_B_all_tags_list)

Analyzing the seqeval results from Model B, it's noticeable that while the overall score has significantly improved compared to Model A, there is a slight decrease in the individual scores for each entity. This suggests that in Model A, the five entities were positively influenced by the presence of the other ten entities.

The results are also presented in the following interactive Plotly graph:

In [None]:
create_plot(model_B_seqeval_evaluation_dict, system_B_all_tags_list, "seqeval Evaluation for System B")

### Evaluation with 'nervaluate'

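Partial matches raise an interesting point in NER evaluation. Under exact-match scoring, a prediction with a slightly wrong boundary is penalized twice, once as a false positive and once as a false negative, whereas a completely missed entity counts only as a false negative. Partial matches therefore hurt precision and recall more than outright misses, even though they might be more useful in practical applications where linking an entity is vital.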

In-depth evaluation of NER systems goes beyond treating it as a binary classification problem. There are nuances in NER, where nearly correct answers are also valuable. Here, each system output can be assigned to one of five categories:

1. **Correct (COR)**: The system's output matches the golden annotation exactly.
2. **Incorrect (INC)**: The system's output and the golden annotation do not match.
3. **Partial (PAR)**: The system's output is somewhat similar to the golden annotation but not identical.
4. **Missing (MIS)**: A golden annotation is not identified by the system.
5. **Spurious (SPU)**: The system produces a response that isn't present in the golden annotation.

These metrics can be evaluated in four different ways:

1. **Strict**: Requires an exact match in both the boundary surface string and entity type.
2. **Exact**: Demands an exact boundary match over the surface string, irrespective of the type.
3. **Partial**: Allows for a partial boundary match over the surface string, regardless of the type.
4. **Type**: Requires the correct entity type and some overlap between the system-tagged entity and the gold annotation.

The 'Strict' evaluation aligns with what seqeval uses, corresponding to the micro-average scores.

For a more detailed understanding, you can refer to [MantisAI's nervaluate](https://github.com/MantisAI/nervaluate).
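
As a rough sketch of how the five categories turn into scores, the function below follows the SemEval-2013 style formulas as I understand them from the nervaluate documentation: exact-match schemas count only correct outputs, while partial-match schemas give half credit to partial outputs. The counts in `toy_counts` are made up for illustration and are not results from this notebook; the function name is my own.

```python
def precision_recall_f1(cor, inc, par, mis, spu, partial_credit=False):
    """Score one evaluation schema from its five error-category counts."""
    possible = cor + inc + par + mis  # entities in the gold annotation
    actual = cor + inc + par + spu    # entities produced by the system
    # Partial-match schemas award half a point for each partial output.
    matched = cor + 0.5 * par if partial_credit else cor
    precision = matched / actual if actual else 0.0
    recall = matched / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
toy_counts = {"cor": 6, "inc": 0, "par": 2, "mis": 1, "spu": 1}

exact_scores = precision_recall_f1(**toy_counts)
partial_scores = precision_recall_f1(**toy_counts, partial_credit=True)

print([round(score, 3) for score in exact_scores])    # [0.667, 0.667, 0.667]
print([round(score, 3) for score in partial_scores])  # [0.778, 0.778, 0.778]
```

The gap between the two rows shows how much of the error mass comes from near misses rather than from entities that were missed or invented entirely.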

In [None]:
model_A_nervaluate_overall_evaluation_dict=jsonl.load(model_A_output_dir+'nervaluate_overall_evaluation.jsonl')[0]

In [None]:
print_evaluation_score_table(model_A_nervaluate_overall_evaluation_dict, "nervaluate, Model A: Overall")

Reviewing the comprehensive nervaluate results from Model A, it becomes apparent that the outcomes are moderately satisfactory when 'partial' evaluation is taken into account.

The results are also presented in the following interactive Plotly graph:

In [None]:
create_plot(model_A_nervaluate_overall_evaluation_dict, 'Overall', 'Overall: nervaluate Evaluation for System A')

In [None]:
model_A_nervaluate_tag_wise_evaluation_dict=jsonl.load(model_A_output_dir+'nervaluate_tag_wise_evaluation.jsonl')[0]

In [None]:
for tag in system_A_all_tags_list:
    print_evaluation_score_table(model_A_nervaluate_tag_wise_evaluation_dict, "nervaluate, Model A: ", tags=tag)
    print()

Examining the entity-specific nervaluate results from Model A, a pattern similar to the seqeval findings emerges: entity types with more instances tend to score higher. A notable observation is in the 'BIO' category, where the metrics other than 'strict' are significantly higher.

The results are also presented in the following interactive Plotly graphs:

In [None]:
for tag in system_A_all_tags_list:
    create_plot(model_A_nervaluate_tag_wise_evaluation_dict, tag, tag+': nervaluate Evaluation for System A')

In [None]:
model_B_nervaluate_overall_evaluation_dict=jsonl.load(model_B_output_dir+'nervaluate_overall_evaluation.jsonl')[0]

In [None]:
print_evaluation_score_table(model_B_nervaluate_overall_evaluation_dict, "nervaluate, Model B: Overall")

Looking at the overall nervaluate results from Model B, it's evident that the scores for the 'strict' metric align closely with the other three metrics. This consistency suggests a similar degree of performance across the five different entities. This finding corresponds with the uniformity in the total number of instances for these entities.

The results are also presented in the following interactive Plotly graph:

In [None]:
create_plot(model_B_nervaluate_overall_evaluation_dict, 'Overall', 'Overall: nervaluate Evaluation for System B')

In [None]:
model_B_nervaluate_tag_wise_evaluation_dict=jsonl.load(model_B_output_dir+'nervaluate_tag_wise_evaluation.jsonl')[0]

In [None]:
for tag in system_B_all_tags_list:
    print_evaluation_score_table(model_B_nervaluate_tag_wise_evaluation_dict, "nervaluate, Model B: ", tags=tag)
    print()

Reviewing the entity-specific nervaluate results from Model B, it's observed that, akin to Model A, there is a consistent scoring pattern across all metrics for all entities.

The results are also presented in the following interactive Plotly graphs:

In [None]:
for tag in system_B_all_tags_list:
    create_plot(model_B_nervaluate_tag_wise_evaluation_dict, tag, tag+': nervaluate Evaluation for System B')

## Future Directions and Limitations

The constraints in computational resources limited my ability to conduct extensive hyper-parameter tuning for optimal results. Additionally, this restriction prevented me from exploring and comparing various transformer-based models for fine-tuning with the English subset of the dataset. Addressing these limitations will be a focus in future work.

Another critical area for future investigation is the evaluation of data distribution within the dataset and its real-world relevance, as well as assessing the performance of the different models trained. An interesting approach to this is outlined in the paper "Interpretable Multi-dataset Evaluation for Named Entity Recognition" (https://aclanthology.org/2020.emnlp-main.489.pdf) and its corresponding GitHub repository (https://github.com/neulab/InterpretEval). This methodology could provide deeper insights into the dataset's practical application and the effectiveness of various models.

## Additional Tasks Completed

The use of a cased model, which differentiates between 'Jon' and 'jon' as distinct tokens, could potentially impact the results negatively. To determine if an uncased BERT base model might yield better outcomes than a cased one, I trained two additional models for systems A and B using the same hyperparameters previously mentioned. For these models, I chose the 'bert-base-multilingual-uncased' as the base model. Interestingly, the performance of these uncased models was not as good as their cased counterparts. This unexpected result might be attributed to the limited selection of hyperparameters, warranting further investigation. The detailed results for all these models are available in the 'results' directory.

## Accessing Results and Trained Models

You can find all the saved results and trained models at the following link:

https://drive.google.com/drive/folders/1T0LyFGaHQctgBLVnZUHNpvLzhcvOWjMt