# Named Entity Recognition (NER) Assignment

In this assignment, you will fine-tune a pre-trained transformer model for a Named Entity Recognition (NER) task using the CoNLL-2003 dataset.

## Step 1: Setup
Let's start by installing the necessary libraries.

In [1]:
# Install transformers (with the PyTorch extra), datasets, and seqeval
!pip install "transformers[torch]" datasets seqeval

In [3]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import load_dataset, load_metric, ClassLabel
import numpy as np

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Step 2: Load Dataset
We will use the CoNLL-2003 dataset, which is available through the `datasets` library.

In [5]:
# Load the CoNLL-2003 dataset
datasets = load_dataset("conll2003", trust_remote_code=True)
label_list = datasets["train"].features["ner_tags"].feature.names

# Display dataset structure
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
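The integer values in `ner_tags` index into `label_list`. As a minimal sketch of that mapping, the snippet below builds the two lookup tables by hand. The label order shown is an assumption about what the `conll2003` dataset exposes, and `assumed_label_list` and `example_tag_ids` are illustrative names; check the order against `label_list` from the cell above before relying on it.

```python
# Assumed label order for the Hugging Face `conll2003` dataset; verify against
# datasets["train"].features["ner_tags"].feature.names before relying on it.
assumed_label_list = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]

# Lookup tables between integer tag ids and label names
id2label = dict(enumerate(assumed_label_list))
label2id = {label: idx for idx, label in id2label.items()}

# Hypothetical tag ids for a four-word sentence
example_tag_ids = [3, 0, 7, 0]
print([id2label[i] for i in example_tag_ids])
```

The same two dictionaries can optionally be passed to `from_pretrained` as `id2label` and `label2id` so that the saved model config carries readable label names.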

## Step 3: Tokenization
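We need to tokenize the dataset using a pre-trained tokenizer. Because the tokenizer can split one word into several sub-word tokens, the word-level NER tags must be realigned: the first sub-token of each word keeps the word's tag, and special tokens, padding, and later sub-tokens are set to `-100` so the loss ignores them.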

In [7]:
# Load a pre-trained tokenizer
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenize the dataset
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding="max_length")
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # special or padding token: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # first sub-token of a word takes the word's label
            else:
                label_ids.append(-100)  # later sub-tokens of the same word are ignored
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
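To see the alignment rule in isolation, here is a small sketch that applies the same loop to a hand-written `word_ids` list, so it runs without downloading a tokenizer. The helper name `align_labels` and the sub-word split in the example are hypothetical and chosen only for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token and mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding have no source word
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First sub-token of a new word
            aligned.append(word_labels[word_idx])
        else:
            # Continuation sub-token of the same word
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical split: [CLS] wash ##ington is big [SEP] [PAD]
word_ids = [None, 0, 0, 1, 2, None, None]
word_labels = [5, 0, 0]  # one tag id per original word

print(align_labels(word_ids, word_labels))
# [-100, 5, -100, 0, 0, -100, -100]
```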

In [8]:
# Apply the tokenization and alignment function to the dataset
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True, remove_columns=datasets["train"].column_names)
tokenized_datasets.set_format("torch")

In [9]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

## Step 4: Load Pre-trained Model
We will use a pre-trained `DistilBERT` model for token classification.

In [10]:
# Load a pre-trained model for token classification
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 5: Training
We will train the model using the `Trainer` API.

In [11]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Define the metric
metric = load_metric("seqeval", trust_remote_code=True)
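For a sense of scale, the training arguments above imply a fixed number of optimizer steps. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation; with several devices or accumulation steps the count would be smaller.

```python
import math

train_examples = 14041  # size of the train split shown in the dataset summary
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
# 878 2634
```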

In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)  # most likely label id for each token

    # Predicted label names at positions that carry a real label (not -100)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Reference label names at the same positions
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
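The precision, recall, and F1 values reported by `compute_metrics` are entity-level scores: a predicted entity counts only if its type and its full span both match the reference. The sketch below illustrates that idea on a single sentence in pure Python. It is a simplified stand-in and not seqeval's implementation; the helper names, the tag ids, and the label order are assumptions made for the example.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes any open span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if not continues:
            if ent_type is not None:
                entities.add((ent_type, start, i))
            if tag == "O":
                start, ent_type = None, None
            else:
                # A "B-" tag (or a stray "I-" tag) opens a new span
                start, ent_type = i, tag[2:]
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall, and F1 for a single sentence."""
    true_ents = extract_entities(true_tags)
    pred_ents = extract_entities(pred_tags)
    correct = len(true_ents & pred_ents)
    precision = correct / len(pred_ents) if pred_ents else 0.0
    recall = correct / len(true_ents) if true_ents else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumed label order; hypothetical label and prediction ids for one sentence
names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
label_ids = [-100, 1, 2, 0, 5, -100]
pred_ids = [0, 1, 0, 0, 5, 0]

# Drop positions masked with -100, as compute_metrics does
true_tags = [names[l] for p, l in zip(pred_ids, label_ids) if l != -100]
pred_tags = [names[p] for p, l in zip(pred_ids, label_ids) if l != -100]

print(entity_prf(true_tags, pred_tags))
# (0.5, 0.5, 0.5)
```

In this example three of four tokens are tagged correctly, yet only one of the two entities matches exactly, which is why entity-level scores can sit well below token accuracy.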


## Step 6: Evaluation
We will evaluate the model on the test set.

In [None]:
### Ex-5-Task-1
f1_score, precision, recall = None, None, None

### BEGIN SOLUTION
# Replace these values with the metrics reported for the final training epoch
f1_score = 0.936501
precision = 0.931126
recall = 0.941939
# raise NotImplementedError()
### END SOLUTION

In [None]:
# Evaluate the model
results = trainer.evaluate(tokenized_datasets["test"])


# Print the evaluation results
print(results)

In [None]:
# Function for NER inference
def ner_inference(texts):
    model.eval()  # disable dropout so predictions are deterministic
    # Tokenize the texts and move the tensors to the selected device
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)  # forward pass without gradient tracking
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)
    results = []
    for i, text in enumerate(texts):
        # Convert input ids back to sub-word tokens
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
        pred = predictions[i].tolist()
        # Pair each token with its predicted label, dropping special tokens
        result = [(token, label_list[p]) for token, p in zip(tokens, pred) if token not in ["[CLS]", "[SEP]", "[PAD]"]]
        results.append(result)
    return results

In [None]:
# Example sentences for NER inference
texts = ["Hugging Face Inc. is a company based in New York City.", "The quick brown fox jumps over the lazy dog."]
ner_results = ner_inference(texts)

# Print inference results
for i, result in enumerate(ner_results):
    print(f"Sentence {i+1}:")
    for token, label in result:
        print(f"{token}: {label}")
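The loop above prints one line per sub-word token, so a word that the tokenizer split shows up as several pieces, with continuation pieces prefixed by `##`. One possible post-processing step is to glue the pieces back together and keep the label predicted for the first piece. This is a minimal sketch; `merge_wordpieces` and the example pairs are hypothetical and not part of the assignment.

```python
def merge_wordpieces(token_label_pairs):
    """Join '##' continuation pieces onto the previous word, keeping its first label."""
    merged = []
    for token, label in token_label_pairs:
        if token.startswith("##") and merged:
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged


# Hypothetical output of ner_inference for one sentence
pairs = [("wash", "B-LOC"), ("##ington", "B-LOC"), ("is", "O"), ("big", "O")]

print(merge_wordpieces(pairs))
# [('washington', 'B-LOC'), ('is', 'O'), ('big', 'O')]
```

Keeping the first piece's label mirrors the training setup, where only the first sub-token of each word carried a label.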