# Introduction (Score: +100)

Named Entity Recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying named entities within text, such as names of people, organizations, locations, and other specific categories. In this homework, you will work with DistilBERT, a highly effective, distilled version of the BERT transformer-based language model, to fine-tune it for the NER task. DistilBERT’s deep contextualized embeddings allow it to understand nuanced information in text, capturing the unique attributes of each named entity within various contexts. By leveraging DistilBERT’s pre-trained knowledge, you will explore how fine-tuning on NER data improves the model's ability to recognize and accurately classify entities, enhancing its performance in practical applications like information retrieval, content analysis, and conversational AI. This task will deepen your understanding of both NER and the process of adapting pre-trained language models for specific natural language processing applications.

In this homework, we fine-tune the DistilBERT model for the NER task on the [CoNLL-2003](https://huggingface.co/datasets/eriktks/conll2003) dataset.

---

FULLNAME: Zahra Fallah Mirmousavi

STUDENT NUMBER: 401207192

In [1]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate

import re
import torch
import pandas as pd

import numpy as np
import datasets
import transformers
import evaluate
import matplotlib.pyplot as plt



In [2]:
label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id2label = {value: key for key, value in label2id.items()}

num_labels = len(label2id)
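
To make the tagging scheme behind `label2id` concrete, the following minimal sketch decodes a sequence of tag ids into entity spans under the BIO convention (`B-` opens an entity, `I-` continues it, `O` is outside any entity). The helper `bio_to_spans` and the example sentence are illustrative assumptions and are not part of the assignment code.

```python
label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
            'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id2label = {value: key for key, value in label2id.items()}

def bio_to_spans(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) spans (hypothetical helper)."""
    spans, current_type, current_tokens = [], None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[tag_id]
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)   # continue the open entity
            continue
        if current_type is not None:       # close the previously open entity
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):           # a stray I- tag also opens an entity
            current_type, current_tokens = entity_type, [token]
        else:                              # 'O' tag: nothing is open
            current_type, current_tokens = None, []
    if current_type is not None:           # flush an entity that ends the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Made-up example sentence, not taken from CoNLL-2003
tokens = ["Angela", "Merkel", "visited", "the", "United", "Nations", "in", "Geneva"]
tag_ids = [1, 2, 0, 0, 3, 4, 0, 5]
spans = bio_to_spans(tokens, tag_ids)
print(spans)
```

The model itself predicts one of the nine ids per token; grouping them into spans like this is only needed when entities, rather than individual tokens, are the unit of interest.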

In [3]:
######################   TODO 1.1   ########################
# Load pretrained DistilBertModel model (distilbert-base-uncased)
# Load tokenizer
# Freeze base model for fine-tuning
###################### (5 points) ##########################

from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Load pretrained DistilBERT model for token classification
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels)

# Load the fast tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Freeze base model parameters
for param in model.base_model.parameters():
    param.requires_grad = False



Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
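
Freezing the base model means only the newly initialized classification head receives gradient updates. The sketch below illustrates the effect on a small stand-in module rather than the real DistilBERT, so nothing is downloaded. The class `TinyTokenClassifier` is hypothetical, and the hidden size of 768 is assumed to match `distilbert-base-uncased`.

```python
import torch.nn as nn

class TinyTokenClassifier(nn.Module):
    """Stand-in model: one 'encoder' layer plus a token-classification head (assumption)."""
    def __init__(self, hidden_size=768, num_labels=9):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, hidden_size)    # placeholder for the DistilBERT body
        self.classifier = nn.Linear(hidden_size, num_labels)  # head trained from scratch

    def forward(self, x):
        return self.classifier(self.encoder(x))

toy_model = TinyTokenClassifier()

# Freeze the "base" the same way as in the cell above
for param in toy_model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)
print(f"trainable={trainable}, frozen={frozen}")
```

Under these assumptions the head has 768 × 9 weights plus 9 biases, so 6,921 trainable parameters. Running the same two `sum(...)` lines on the real `model` is a quick way to confirm that the freeze took effect.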


In [4]:
######################   TODO 1.2  ########################
# Load datasets
###################### (5 points) ##########################
from datasets import load_dataset

# Load the CoNLL-2003 dataset
dataset = load_dataset("conll2003", trust_remote_code=True)

# Take a random 10% sample from each split
train_dataset = dataset["train"].shuffle(seed=42).select(range(int(len(dataset["train"]) * 0.1)))
val_dataset = dataset["validation"].shuffle(seed=42).select(range(int(len(dataset["validation"]) * 0.1)))
test_dataset = dataset["test"].shuffle(seed=42).select(range(int(len(dataset["test"]) * 0.1)))





# Check the structure of the dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
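
The 10% subsampling determines how many examples and batches each later step sees. As a quick worked check, this sketch recomputes the subset sizes from the `num_rows` values printed above and the resulting batch counts for a batch size of 32. It uses only the arithmetic from the sampling code; the variable names are illustrative.

```python
import math

split_sizes = {"train": 14041, "validation": 3250, "test": 3453}  # num_rows printed above
fraction = 0.1
batch_size = 32

# Same expression as in the sampling code: int(len(split) * 0.1)
sampled = {name: int(n * fraction) for name, n in split_sizes.items()}

# A DataLoader without drop_last yields ceil(n / batch_size) batches
batches = {name: math.ceil(n / batch_size) for name, n in sampled.items()}

print(sampled)
print(batches)
```

The 44 training batches and 11 test batches agree with the progress bars in the training and test outputs further down.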


In [5]:
######################   TODO 1.3   ########################
# Complete custom data set
# Use Torch dataloader for datasets
# Apply the tokenizer to the data
# Align labels correctly with the tokenized data
###################### (15 points) ##########################
from torch.utils.data import Dataset, DataLoader

class NERDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

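    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Index the row first; column-first access (self.data["tokens"][idx])
        # can load the entire column on every call
        example = self.data[idx]
        tokens = example["tokens"]
        labels = example["ner_tags"]

        # Tokenize the pre-split words, padding/truncating to a fixed length
        encoded = self.tokenizer(tokens, is_split_into_words=True, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")

        # Align labels with the subword tokens: special and padding tokens
        # (word id None) get -100 so the loss ignores them; every subword
        # of a word inherits that word's label
        labels_aligned = [-100 if k is None else labels[k] for k in encoded.word_ids()]
        encoded["labels"] = torch.tensor(labels_aligned)

        # Drop the batch dimension added by return_tensors="pt"
        return {key: val.squeeze(0) for key, val in encoded.items()}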
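
The label alignment in `__getitem__` hinges on `word_ids()`, which maps every subword position back to the index of the word it came from (`None` for special and padding tokens). The sketch below reproduces that logic on a hand-written `word_ids` list, so it runs without a tokenizer. The subword pieces shown are hypothetical, and the "first subword only" variant is a common alternative, not what the class above does.

```python
# Hypothetical tokenizer output for the words ["Merkel", "visited", "Geneva"]
subwords = ["[CLS]", "mer", "##kel", "visited", "geneva", "[SEP]", "[PAD]", "[PAD]"]
word_ids = [None, 0, 0, 1, 2, None, None, None]
word_labels = [1, 0, 5]  # B-PER, O, B-LOC

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

# Strategy used in NERDataset: every subword inherits its word's label
all_subwords = [IGNORE_INDEX if w is None else word_labels[w] for w in word_ids]

# Common alternative: label only the first subword of each word
first_only, previous = [], None
for w in word_ids:
    if w is None or w == previous:
        first_only.append(IGNORE_INDEX)
    else:
        first_only.append(word_labels[w])
    previous = w

print(all_subwords)
print(first_only)
```

With the first strategy, words that split into several pieces contribute several times to both the loss and the token accuracy; with the second, each word counts once.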


In [6]:
######################   TODO 2.1   ########################
# Write code for training the model on the training dataset
# At each epoch, report accuracy on the validation dataset
# Save the best model by accuracy
# Use only pytorch in this section
###################### (30 points) ##########################
from tqdm import tqdm
from torch.optim import AdamW
from torch.amp import GradScaler, autocast

def train(model, train_dataloader, val_dataloader, epochs, optimizer, device, accumulation_steps=4):
    model.to(device)
    scaler = GradScaler("cuda")  # Scales the loss so that fp16 gradients do not underflow
    best_accuracy = 0
    best_model_state = None

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        optimizer.zero_grad()

        for step, batch in enumerate(tqdm(train_dataloader, desc=f"Training Epoch {epoch+1}/{epochs}")):
            inputs = {key: val.to(device) for key, val in batch.items()}

            # Mixed precision training
            with autocast(device_type="cuda"):
                outputs = model(**inputs)
                loss = outputs.loss / accumulation_steps  # Gradient accumulation

            scaler.scale(loss).backward()

            # Perform optimizer step every accumulation_steps
            if (step + 1) % accumulation_steps == 0 or step == len(train_dataloader) - 1:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

            total_loss += loss.item() * accumulation_steps  # Accumulate total loss

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for batch in val_dataloader:
                inputs = {key: val.to(device) for key, val in batch.items()}
                outputs = model(**inputs)
                preds = torch.argmax(outputs.logits, dim=-1)
                labels = inputs["labels"]
                mask = labels != -100
                correct += torch.sum((preds == labels) & mask).item()
                total += torch.sum(mask).item()

        accuracy = correct / total
        print(f"Epoch {epoch+1} Validation Accuracy: {accuracy:.4f}")

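        # Keep a copy of the best weights: state_dict() returns references to the
        # live tensors, which later epochs would otherwise overwrite
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}

    # Restore the best weights and save them to disk
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
        torch.save(model.state_dict(), "best_model.pth")
        print(f"Best Validation Accuracy: {best_accuracy:.4f}")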



# Prepare the data loaders
train_dataset = NERDataset(train_dataset, tokenizer, max_length=64)
val_dataset = NERDataset(val_dataset, tokenizer, max_length=64)

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, num_workers=2, pin_memory=True)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train(model, train_dataloader, val_dataloader, epochs=5, optimizer=optimizer, device=device)


Training Epoch 1/5: 100%|██████████| 44/44 [01:20<00:00,  1.83s/it]


Epoch 1 Validation Accuracy: 0.2720


Training Epoch 2/5: 100%|██████████| 44/44 [01:06<00:00,  1.52s/it]


Epoch 2 Validation Accuracy: 0.4490


Training Epoch 3/5: 100%|██████████| 44/44 [01:05<00:00,  1.49s/it]


Epoch 3 Validation Accuracy: 0.5938


Training Epoch 4/5: 100%|██████████| 44/44 [01:17<00:00,  1.77s/it]


Epoch 4 Validation Accuracy: 0.6878


Training Epoch 5/5: 100%|██████████| 44/44 [01:10<00:00,  1.60s/it]


Epoch 5 Validation Accuracy: 0.7472
Best Validation Accuracy: 0.7472
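
The training loop steps the optimizer only every `accumulation_steps` batches, plus once more on the final batch of the epoch. As a rough worked example, this sketch replays that condition without any model to count the optimizer steps per epoch and the approximate effective batch size. The helper name `optimizer_step_indices` is an assumption introduced for illustration.

```python
def optimizer_step_indices(num_batches, accumulation_steps):
    """Batch indices after which the loop's step condition fires (hypothetical helper)."""
    return [step for step in range(num_batches)
            if (step + 1) % accumulation_steps == 0 or step == num_batches - 1]

num_batches = 44          # batches per epoch, as shown in the progress bars
batch_size = 32
accumulation_steps = 4

steps = optimizer_step_indices(num_batches, accumulation_steps)
effective_batch_size = batch_size * accumulation_steps

print(len(steps), effective_batch_size)

# When the batch count is not a multiple of accumulation_steps, the final
# partial group still triggers a step
partial = optimizer_step_indices(10, 4)
print(partial)
```

So each epoch performs 11 weight updates, each based on roughly 128 examples. In the partial-group case the loss is still divided by `accumulation_steps`, so that last update is scaled down relative to a full group.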


In [7]:
######################   TODO 2.2   ########################
# Report best model accuracy on test dataset
###################### (5 points) ##########################

def evaluate_test(model, test_dataloader, device):
    model.to(device)
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Testing"):
            inputs = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**inputs)
            preds = torch.argmax(outputs.logits, dim=-1)
            labels = inputs["labels"]
            mask = labels != -100
            correct += torch.sum((preds == labels) & mask).item()
            total += torch.sum(mask).item()

    accuracy = correct / total
    print(f"Test Accuracy: {accuracy:.4f}")


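# Load the best model (weights_only=True restricts unpickling to tensors and plain containers)
model.load_state_dict(torch.load("best_model.pth", map_location=device, weights_only=True))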

# Prepare the test data loader
test_dataset = NERDataset(test_dataset, tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=32, num_workers=2, pin_memory=True)

# Evaluate the model on the test dataset
evaluate_test(model, test_dataloader, device)

Testing: 100%|██████████| 11/11 [00:07<00:00,  1.40it/s]

Test Accuracy: 0.7446
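
Keeping the best checkpoint in memory relies on a detail of `state_dict()`: the returned tensors share storage with the live parameters, so a snapshot meant to preserve one epoch's weights needs to be a copy. The sketch below demonstrates this on a tiny `nn.Linear` layer. It is a standalone illustration with made-up variable names, not part of the training code.

```python
import copy
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)

alias = layer.state_dict()                    # references the live parameter tensors
snapshot = copy.deepcopy(layer.state_dict())  # independent copy of the current weights

# Simulate further training by changing the weights in place
with torch.no_grad():
    layer.weight.add_(1.0)

alias_tracks_model = torch.equal(alias["weight"], layer.weight)
snapshot_is_stale = not torch.equal(snapshot["weight"], layer.weight)
print(alias_tracks_model, snapshot_is_stale)
```

Only the copied snapshot still holds the earlier weights, which is what a "best model so far" variable needs when validation accuracy later drops.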





In [8]:
######################   TODO 2.1   ########################
# Now implement it with huggingface trainer
###################### (30 points) ##########################
from transformers import Trainer, TrainingArguments
import evaluate

# Define the metric for evaluation
metric = evaluate.load("accuracy")

# Metric computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)

    # Drop positions labeled -100 (special and padding tokens)
    true_labels = labels[labels != -100]
    true_predictions = predictions[labels != -100]

    # Compute accuracy
    return metric.compute(predictions=true_predictions, references=true_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",             # Directory to save model checkpoints
    evaluation_strategy="epoch",       # Evaluate at the end of each epoch
    save_strategy="epoch",             # Save model checkpoints at the end of each epoch
    learning_rate=5e-5,                # Learning rate for fine-tuning
    per_device_train_batch_size=16,    # Batch size for training
    per_device_eval_batch_size=16,     # Batch size for evaluation
    num_train_epochs=5,                # Number of training epochs
    weight_decay=0.01,                 # Weight decay for regularization
    logging_dir="./logs",              # Directory for logging
    logging_strategy="steps",          # Log every few steps
    logging_steps=100,                 # Number of steps between logs
    save_total_limit=2,                # Save only the last 2 checkpoints
    load_best_model_at_end=True,       # Load the best model at the end of training
    metric_for_best_model="eval_accuracy", # Use accuracy to select the best model
    report_to="none",                  # Disable reporting to external loggers
)

# Create Trainer instance
trainer = Trainer(
    model=model,                       # Fine-tuning model
    args=training_args,                # Training arguments
    train_dataset=train_dataset,       # Training dataset
    eval_dataset=val_dataset,          # Validation dataset
    tokenizer=tokenizer,               # Tokenizer
    compute_metrics=compute_metrics    # Metric computation function
)

# Train the model
trainer.train()




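| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.281045        | 0.797239 |
| 2     | 1.481900      | 1.045474        | 0.797239 |
| 3     | 1.129400      | 0.936791        | 0.797239 |
| 4     | 0.984700      | 0.889586        | 0.797239 |
| 5     | 0.907500      | 0.876063        | 0.797239 |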


TrainOutput(global_step=440, training_loss=1.1070097142999822, metrics={'train_runtime': 416.3046, 'train_samples_per_second': 16.863, 'train_steps_per_second': 1.057, 'total_flos': 114662606123520.0, 'train_loss': 1.1070097142999822, 'epoch': 5.0})

In [10]:
# Evaluate the fine-tuned model on the test dataset
test_results = trainer.evaluate(test_dataset)

# Print the evaluation results
print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")


Test Accuracy: 0.7813
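
Token-level accuracy, as computed in `compute_metrics`, weights every non-ignored token equally, and in NER data most tokens carry the `O` tag. The sketch below uses toy labels (not CoNLL-2003 data) to show the `-100` masking and how a predictor that always outputs `O` can still reach a high score. The helper `masked_accuracy` is an assumption written in plain Python.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Accuracy over positions whose label is not ignore_index (hypothetical helper)."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != ignore_index]
    correct = sum(p == l for p, l in kept)
    return correct / len(kept)

# Toy sequence: 8 real tokens (6 of them 'O' = 0) plus 2 ignored positions
labels      = [-100, 0, 0, 1, 2, 0, 0, 0, 0, -100]
always_o    = [0] * len(labels)
model_preds = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

baseline_acc = masked_accuracy(always_o, labels)
model_acc = masked_accuracy(model_preds, labels)
print(baseline_acc, model_acc)
```

Because the all-`O` baseline already scores 0.75 here while finding no entities, per-class or entity-level precision, recall, and F1 are worth inspecting alongside accuracy.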


In [12]:
######################   TODO 2.2   ########################
# Push your fine-tuned model to the Huggingface Hub
# Provide the link to your fine-tuned model
###################### (10 points) ##########################
!huggingface-cli login
# Push the fine-tuned model to the Hugging Face Hub
model.push_to_hub("Zahra-FMMA/distilbert-ner-conll2003")
tokenizer.push_to_hub("Zahra-FMMA/distilbert-ner-conll2003")

# Provide the Hugging Face Hub link
print("Fine-tuned model available at: https://huggingface.co/Zahra-FMMA/distilbert-ner-conll2003")




Fine-tuned model available at: https://huggingface.co/Zahra-FMMA/distilbert-ner-conll2003
