### Project Overview

This project demonstrates Named Entity Recognition (NER) on the English split of the WikiAnn dataset. It uses the Hugging Face datasets library for data loading and preprocessing, and fine-tunes a pre-trained DistilBERT model (distilbert-base-cased) from the Transformers library for token classification.

### Workflow Summary

    Dataset: WikiAnn, English split (from Hugging Face datasets)

    Model: Pre-trained DistilBERT (distilbert-base-cased) fine-tuned for token classification

    Preprocessing: Tokenization using DistilBertTokenizerFast, with word-level entity labels aligned to subword tokens

    Fine-tuning: Hugging Face Trainer API with custom training arguments

    Evaluation: Model performance evaluated on the validation split; the notebook reports token-level accuracy (seqeval also provides precision, recall, and F1 score)

    Classification Report: seqeval's classification_report is imported for per-entity metrics but is not called in the cells below

### Inference

After fine-tuning, the model can identify and classify named entities in new text. Inference tokenizes the input sentence, passes it through the fine-tuned model, and maps the predicted labels back to the input words, detecting persons (PER), organizations (ORG), and locations (LOC).

In [1]:
!pip install transformers
!pip install 'accelerate>=0.26.0'
!pip install -U datasets huggingface_hub
!pip install fsspec==2023.9.2

!pip install seqeval


In [3]:
from datasets import load_dataset
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification, TrainingArguments, Trainer
# Metrics come from seqeval, imported in the compute_metrics cell
import numpy as np
import torch

In [9]:
# Load dataset
dataset = load_dataset("wikiann", "en")


In [10]:
print(dataset)

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})


In [21]:
label_list = dataset["train"].features["ner_tags"].feature.names
num_labels = len(label_list)
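The label list determines both the size of the classification head and how predictions are decoded. As a minimal sketch, the mappings below assume the seven WikiAnn tags in the order shown (an assumption for illustration; the notebook reads the real list from the dataset features). Mappings like these are commonly passed to `from_pretrained` as `id2label` and `label2id` so that a saved checkpoint carries readable tag names.

```python
# Assumed tag order for WikiAnn; in the notebook this list comes from
# dataset["train"].features["ner_tags"].feature.names
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Integer id -> tag string, and the inverse lookup
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}
num_labels = len(label_list)

print(num_labels)
print(id2label[1], label2id["I-LOC"])
```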

In [28]:
from transformers import DistilBertForTokenClassification
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased")
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-cased", num_labels=num_labels)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
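# Tokenize pre-split words and align word-level NER tags to subword tokens.
# Special tokens, padding, and non-first subwords get -100 so the loss ignores them.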
def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"],
        truncation=True,
        is_split_into_words=True,
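        padding="max_length",  # pad every example to max_length; per-batch dynamic padding would instead need a data collator such as DataCollatorForTokenClassification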
        max_length=128,
    )

    labels = []
    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            labels.append(-100)
        elif word_idx != previous_word_idx:
            labels.append(example["ner_tags"][word_idx])
        else:
            labels.append(-100)  # Ignore subwords
        previous_word_idx = word_idx

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=False)
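The alignment rule can be checked in isolation, without downloading a tokenizer. The sketch below reimplements the same loop as a standalone helper (`align_labels` is a hypothetical name, and the `word_ids` list is a hand-written stand-in for what a fast tokenizer's `word_ids()` might return for a four-word sentence).

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special/padding token, or a non-first subword of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned

# Hypothetical tokenization of ["Barack", "Obama", "visited", "Hawaii"]:
# [CLS] Bar ##ack Obama visited Hawaii [SEP]
word_ids = [None, 0, 0, 1, 2, 3, None]
word_labels = [1, 2, 0, 5]  # word-level tag ids, one per word
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 2, 0, 5, -100]
```

Only the first subword of each word contributes to the loss, which keeps one prediction per word and matches how the inference function later reads predictions back.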


In [35]:
from seqeval.metrics import classification_report, precision_score, recall_score, accuracy_score

# Compute metrics
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = []
    true_predictions = []

    for pred, label in zip(predictions, labels):
        current_labels = []
        current_preds = []

        for p_id, l_id in zip(pred, label):
            if l_id != -100:  # skip special tokens, padding, and non-first subwords
                current_labels.append(label_list[l_id])
                current_preds.append(label_list[p_id])

        true_labels.append(current_labels)
        true_predictions.append(current_preds)

    return {
        "accuracy": accuracy_score(true_labels, true_predictions),
    }
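Token-level accuracy is dominated by the frequent `O` tag and gives partial credit for half-recognized entities. Entity-level precision, recall, and F1 count a prediction as correct only when the whole span and its type match. seqeval provides these through `precision_score`, `recall_score`, and `f1_score`; the pure-Python sketch below illustrates the idea only and is not seqeval's exact implementation (the helper names `extract_spans` and `entity_prf` are hypothetical).

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any span still open at the end
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # Close the open span unless this tag continues it
        if start is not None and not (prefix == "I" and label == ent_type):
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        # "B-" opens a span; a stray "I-" is treated leniently as an opener
        if prefix in ("B", "I") and start is None:
            start, ent_type = i, label
    return spans

def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

true_seqs = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_seqs = [["B-PER", "O", "O", "B-LOC"]]

n_tokens = sum(len(seq) for seq in true_seqs)
token_acc = sum(t == p for ts, ps in zip(true_seqs, pred_seqs)
                for t, p in zip(ts, ps)) / n_tokens
print(token_acc)                         # 0.75
print(entity_prf(true_seqs, pred_seqs))  # (0.5, 0.5, 0.5)
```

In this toy case three of four tokens are right, but the truncated person span counts as a miss, so the entity-level scores are noticeably lower than the token accuracy.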

In [36]:
# Training
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    learning_rate=3e-5,
    weight_decay=0.01,
    report_to="none"
)
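The number of optimizer steps follows from the training-set size and the effective batch size. The sketch below is a simplified calculation that ignores gradient accumulation (`optimizer_steps` is a hypothetical helper, not a Trainer API). With 20,000 training examples and a per-device batch of 16, one device would give 1,250 steps per epoch; the `global_step=625` logged in the training output would be consistent with two devices, though the device count is an inference rather than something the notebook states.

```python
import math

def optimizer_steps(n_samples, per_device_batch, n_devices=1, epochs=1):
    """Steps per run, assuming no gradient accumulation."""
    effective_batch = per_device_batch * n_devices
    return math.ceil(n_samples / effective_batch) * epochs

single = optimizer_steps(20_000, 16)
dual = optimizer_steps(20_000, 16, n_devices=2)
print(single)  # 1250
print(dual)    # 625
```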

In [37]:
# Build the Trainer with the model, data splits, and metric function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

In [38]:
# Fine-tune the model
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2216        | 0.258655        | 0.920244 |




TrainOutput(global_step=625, training_loss=0.2285810302734375, metrics={'train_runtime': 175.2626, 'train_samples_per_second': 114.114, 'train_steps_per_second': 3.566, 'total_flos': 653324559360000.0, 'train_loss': 0.2285810302734375, 'epoch': 1.0})

In [39]:
# Evaluate the fine-tuned model on the validation split
trainer.evaluate()

{'eval_loss': 0.25865504145622253,
 'eval_accuracy': 0.9202435690319374,
 'eval_runtime': 25.1315,
 'eval_samples_per_second': 397.907,
 'eval_steps_per_second': 12.454,
 'epoch': 1.0}

### Inference

In [45]:
def predict_ner(text):
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # inference mode: disables dropout

    tokens = text.split()

    # Tokenize with word alignment info
    encoded = tokenizer(tokens, return_tensors="pt", is_split_into_words=True, truncation=True)
    word_ids = encoded.word_ids(0)  # word index of each token in the first (only) sequence

    # Move inputs to device
    encoded = {k: v.to(device) for k, v in encoded.items()}

    with torch.no_grad():
        outputs = model(**encoded)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()

    final_predictions = []
    previous_word_idx = None

    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is None or word_idx == previous_word_idx:
            continue  # Skip special tokens and subwords
        label_id = predictions[token_idx]
        label = label_list[label_id]
        final_predictions.append((tokens[word_idx], label))
        previous_word_idx = word_idx

    return final_predictions


In [46]:
# Test inference
print(predict_ner("Barack Obama was born in Hawaii and worked in Chicago."))

[('Barack', 'B-PER'), ('Obama', 'I-PER'), ('was', 'O'), ('born', 'O'), ('in', 'O'), ('Hawaii', 'B-LOC'), ('and', 'O'), ('worked', 'O'), ('in', 'O'), ('Chicago.', 'B-LOC')]


In [47]:
text = "Apple is looking at buying U.K. startup for $1 billion"
print(predict_ner(text))

[('Apple', 'B-ORG'), ('is', 'O'), ('looking', 'O'), ('at', 'O'), ('buying', 'O'), ('U.K.', 'B-ORG'), ('startup', 'I-ORG'), ('for', 'O'), ('$1', 'O'), ('billion', 'O')]
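The per-word tags above can be merged into whole entities for downstream use. The following is a minimal sketch, assuming BIO-style tags as in the outputs shown; `group_entities` is a hypothetical helper that is not part of the notebook.

```python
def group_entities(pairs):
    """Merge (word, BIO-tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, tag in pairs:
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == current_type
        # Flush the open entity when the current tag does not continue it
        if current_words and not continues:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix in ("B", "I"):
            if not current_words:
                current_type = ent_type
            current_words.append(word)
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

# Pairs copied from the first inference output above
pairs = [("Barack", "B-PER"), ("Obama", "I-PER"), ("was", "O"), ("born", "O"),
         ("in", "O"), ("Hawaii", "B-LOC"), ("and", "O"), ("worked", "O"),
         ("in", "O"), ("Chicago.", "B-LOC")]
entities = group_entities(pairs)
print(entities)
# [('Barack Obama', 'PER'), ('Hawaii', 'LOC'), ('Chicago.', 'LOC')]
```

The trailing period on "Chicago." survives because `predict_ner` splits the input on whitespace only; separating punctuation before tagging would avoid that.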


### Project Overview

This project demonstrates Named Entity Recognition (NER) using a token classification approach. It uses the Hugging Face datasets library to load and preprocess the WikiAnn data, and fine-tunes a pre-trained distilbert-base-cased model from the Hugging Face transformers library.

### Workflow Summary

    Dataset: WikiAnn (English), from the Hugging Face datasets hub

    Model: Pre-trained DistilBERT (distilbert-base-cased) for token classification

    Preprocessing: Tokenization with DistilBertTokenizerFast and is_split_into_words=True, with entity tags aligned to the first subword of each word

    Fine-tuning: Hugging Face Trainer API with custom training arguments and a token classification head

    Evaluation: seqeval token-level accuracy on the validation split; precision, recall, and F1-score can optionally be added

    Confusion Matrix: Can be built from predicted vs. true entity tags to visualize class-wise performance; the cells above do not produce one

### Inference

Once trained, the model can be used to extract named entities from raw text. The inference pipeline:

    Tokenizes the input sentence using the same tokenizer as training

    Runs the model and maps the predicted label indices to human-readable NER tags (e.g., B-PER, I-LOC)

    Aligns token predictions back to the original words, keeping the first subword's prediction for each word

    Outputs a list of (word, entity) pairs