# Named Entity Recognition with DistilBERT


This notebook fine-tunes **DistilBERT** (`distilbert-base-uncased`) using Hugging Face Transformers to perform **Named Entity Recognition (NER)**.
The goal is to identify entities such as **persons, organizations, locations, and dates** in text.


**Key steps in this notebook:**
1. Load and preprocess dataset.
2. Tokenize text using DistilBERT tokenizer.
3. Fine-tune DistilBERT for token classification.
4. Evaluate model qualitatively.
5. Visualize sample predictions.

**Setup:** install `pytorch_lightning` if it is not already available, for example by running `!pip install pytorch_lightning` in a notebook cell.

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from transformers import AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification

In [3]:
class NERDataset(Dataset):
    def __init__(self, sentences, labels, label2id, tokenizer, max_len=128):
        self.sentences = sentences
        self.labels = labels
        self.label2id = label2id
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        words, tags = self.sentences[idx], self.labels[idx]
        enc = self.tokenizer(
            words,
            is_split_into_words=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
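        # word_ids maps each token to its source word (None for special and padding tokens).
        word_ids = enc.word_ids(batch_index=0)
        labels = []
        for word_id in word_ids:
            if word_id is None:
                # -100 is ignored by the cross-entropy loss.
                labels.append(-100)
            else:
                # Every sub-token inherits the label of its word.
                labels.append(self.label2id[tags[word_id]])
        # Drop the batch dimension added by return_tensors='pt'.
        enc = {k: v.squeeze(0) for k, v in enc.items()}
        enc['labels'] = torch.tensor(labels)
        return enc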

## Model: DistilBERT for Token Classification


We use Hugging Face Transformers to load `distilbert-base-uncased` with a classification head for NER.


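- **Base Model**: DistilBERT encoder
- **Classification head**: a token-level classifier (`classifier.weight`, `classifier.bias`) that is newly initialized and trained during fine-tuning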


In [4]:
class LitTokenClassifier(pl.LightningModule):
    def __init__(self, model_name, id2label):
        super().__init__()
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=len(id2label),
            id2label=id2label,
            label2id={v:k for k,v in id2label.items()}
        )

    def training_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def validation_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        self.log("val_loss", outputs.loss, prog_bar=True)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=5e-5)

In [5]:
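def predict(model, tokenizer, words):
    """Return one predicted tag per sub-word token (special tokens are skipped).

    The result can be longer than `words` when the tokenizer splits a word.
    """
    enc = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
        padding=True
    )
    # Keep the inputs on the same device as the model (it may be on the GPU after training).
    enc = enc.to(model.device)
    model.eval()
    with torch.no_grad():
        outputs = model.model(**enc)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1)
    pred_tags = []
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is None:
            continue
        label_id = preds[0, idx].item()
        pred_tags.append(model.model.config.id2label[label_id])
    return pred_tags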
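
Because `predict` emits one tag per sub-word token, its output has to be collapsed to one tag per word before it can be compared with the word-level labels. A minimal sketch of one common choice, keeping the tag of each word's first sub-token, is shown below. The helper name `first_subtoken_tags` and the sample tokenization are hypothetical; `word_ids` stands in for the list returned by `enc.word_ids()`, and `token_tags` is assumed to hold one tag per token, special tokens included.

```python
from typing import List, Optional


def first_subtoken_tags(word_ids: List[Optional[int]], token_tags: List[str]) -> List[str]:
    """Collapse token-level tags to word-level tags.

    Keeps the tag of the first sub-token of each word and skips
    special tokens, whose word id is None.
    """
    word_tags = []
    previous = None
    for word_id, tag in zip(word_ids, token_tags):
        if word_id is not None and word_id != previous:
            word_tags.append(tag)
        previous = word_id
    return word_tags


# Hypothetical tokenization: "Mr." -> 2 sub-tokens, "Putin" and "said" -> 1 each,
# with one special token at each end.
word_ids = [None, 0, 0, 1, 2, None]
token_tags = ["O", "B-per", "B-per", "I-per", "O", "O"]
print(first_subtoken_tags(word_ids, token_tags))  # ['B-per', 'I-per', 'O']
```

With this kind of alignment the returned list has the same length as the input word list, so it can be zipped safely with the true labels.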

## Dataset & Preprocessing

- Dataset from: DeepLearning.AI
- Each sample contains a sequence of tokens with corresponding NER tags.

A few tags you might expect to see are:
* `geo`: geographical entity
* `org`: organization
* `per`: person
* `gpe`: geopolitical entity
* `tim`: time indicator
* `art`: artifact
* `eve`: event
* `nat`: natural phenomenon
* `O`: outside any entity (filler word)


**Preprocessing steps:**
- Load sentences and labels from whitespace-tokenized text files (one sentence per line).
- Tokenize sentences with DistilBERT tokenizer.
- Align tokenized inputs with entity labels.
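
The alignment step can be sketched without a tokenizer, assuming the per-token word indices are already known. This mirrors the scheme used in `NERDataset`: every sub-token inherits its word's label, and special tokens get `-100` so the loss ignores them. The helper name `align_labels`, the reduced `label2id`, and the sample `word_ids` are hypothetical.

```python
from typing import Dict, List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_tags: List[str],
    label2id: Dict[str, int],
    ignore_index: int = -100,
) -> List[int]:
    """Expand word-level tags to token-level label ids."""
    return [
        ignore_index if word_id is None else label2id[word_tags[word_id]]
        for word_id in word_ids
    ]


# Hypothetical three-tag label map and tokenization (word 0 split into two sub-tokens).
label2id = {"B-per": 0, "I-per": 1, "O": 2}
word_ids = [None, 0, 0, 1, 2, None]
word_tags = ["B-per", "I-per", "O"]
print(align_labels(word_ids, word_tags, label2id))  # [-100, 0, 0, 1, 2, -100]
```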

In [6]:
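def load_dataset(sent_file, label_file):
    """Read whitespace-tokenized sentences and their tags, one sentence per line."""
    with open(sent_file, "r", encoding="utf-8") as f:
        sentences = [line.strip().split() for line in f]
    with open(label_file, "r", encoding="utf-8") as f:
        labels = [line.strip().split() for line in f]
    return sentences, labels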

# Load train/val/test
train_sentences, train_labels = load_dataset("data/train_sentences.txt", "data/train_labels.txt")
val_sentences, val_labels = load_dataset("data/val_sentences.txt", "data/val_labels.txt")
test_sentences, test_labels = load_dataset("data/test_sentences.txt", "data/test_labels.txt")

tags = ['B-art','B-eve','B-geo','B-gpe','B-nat','B-org','B-per','B-tim',
        'I-art','I-eve','I-geo','I-gpe','I-nat','I-org','I-per','I-tim','O']

label2id = {tag: i for i, tag in enumerate(tags)}
id2label = {i: tag for tag, i in label2id.items()}
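
The `B-` and `I-` prefixes in the tag list appear to follow the usual BIO convention, where `B-` marks the first word of an entity and `I-` marks a continuation. Under that assumption, a tag sequence can be grouped into entity spans as in the sketch below. The helper name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is one possible design choice, not something the dataset prescribes.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            # Close the open entity, if any.
            if current is not None:
                spans.append((current, start, i))
            if prefix in ("B", "I"):
                # A stray I- tag is treated as the start of a new entity.
                start, current = i, etype
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tags_example = ["O", "O", "O", "O", "B-geo", "B-tim", "O", "B-per", "I-per", "O"]
print(bio_to_spans(tags_example))  # [('geo', 4, 5), ('tim', 5, 6), ('per', 7, 9)]
```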

#### Training the model

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_dataset = NERDataset(train_sentences, train_labels, label2id, tokenizer)
val_dataset   = NERDataset(val_sentences, val_labels, label2id, tokenizer)
test_dataset  = NERDataset(test_sentences, test_labels, label2id, tokenizer)

collator = DataCollatorForTokenClassification(tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collator)
val_loader   = DataLoader(val_dataset, batch_size=16, collate_fn=collator)
test_loader  = DataLoader(test_dataset, batch_size=16, collate_fn=collator)

if torch.cuda.is_available():
    accelerator = 'gpu'
    devices = 1
else:
    accelerator = 'cpu'
    devices = 'auto'

lit_model = LitTokenClassifier("distilbert-base-uncased", id2label)

trainer = pl.Trainer(max_epochs=3, accelerator=accelerator, devices=devices)
trainer.fit(lit_model, train_loader, val_loader)


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


#### Making Predictions

In [8]:
test_words = test_sentences[48]
print("\nWords:", test_words)
print("\nPredicted:", predict(lit_model, tokenizer, test_words))
print("\nTrue:", test_labels[48])


Words: ['During', 'a', 'visit', 'to', 'Hungary', 'Tuesday', ',', 'Mr.', 'Putin', 'said', 'it', 'is', 'quite', 'possible', 'to', 'reach', 'agreement', 'on', 'Moscow', "'s", 'proposal', 'to', 'enrich', 'uranium', 'on', 'Russian', 'soil', 'for', 'Iran', "'s", 'nuclear', 'energy', 'needs', '.']

Predicted: ['O', 'O', 'O', 'O', 'B-geo', 'B-tim', 'O', 'B-per', 'B-per', 'I-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O']

True: ['O', 'O', 'O', 'O', 'B-geo', 'B-tim', 'O', 'B-per', 'I-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O']


In [9]:
def word_to_label(sentence_idx, lit_model=lit_model, tokenizer=tokenizer):
    words = test_sentences[sentence_idx]
    labels = test_labels[sentence_idx]
    preds = predict(lit_model, tokenizer, words)

    for word, label, pred in zip(words, labels, preds):
        print(f"{word:15} {label:7} → {pred}")


In [10]:
word_to_label(48)

During          O       → O
a               O       → O
visit           O       → O
to              O       → O
Hungary         B-geo   → B-geo
Tuesday         B-tim   → B-tim
,               O       → O
Mr.             B-per   → B-per
Putin           I-per   → B-per
said            O       → I-per
it              O       → O
is              O       → O
quite           O       → O
possible        O       → O
to              O       → O
reach           O       → O
agreement       O       → O
on              O       → O
Moscow          B-geo   → O
's              O       → B-geo
proposal        O       → O
to              O       → O
enrich          O       → O
uranium         O       → O
on              O       → O
Russian         B-gpe   → O
soil            O       → O
for             O       → O
Iran            B-geo   → B-gpe
's              O       → O
nuclear         O       → O
energy          O       → B-geo
needs           O       → O
.               O       → O


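- The model correctly tags `Hungary` (`B-geo`) and `Tuesday` (`B-tim`).

- The predicted list contains 38 tags for 34 words, because `predict` returns one tag per sub-word token and the tokenizer splits some words into several sub-tokens. The `zip` in `word_to_label` therefore pairs words with shifted predictions from `Putin` onward.

- Read in order, the predicted entity tags (`B-geo`, `B-tim`, `B-per`, `B-per`, `I-per`, `B-geo`, `B-gpe`, `B-geo`) match the true ones apart from one repeated `B-per`, which is consistent with `Mr.` being split into two sub-tokens. The apparent errors on `Moscow`, `said`, and `energy` are therefore more likely alignment artifacts than model mistakes.

- I used `DistilBERT (uncased)` for **speed** and **simplicity**, which makes the model easy to demo and deploy. In practice, a cased model like `bert-base-cased` often performs better on NER tasks, since capitalization matters (`‘Apple’` the company vs `‘apple’` the fruit).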
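
Since the check above is qualitative, a small numeric summary can make mismatches easier to spot. The following is a minimal sketch, assuming both tag lists are word-aligned and of equal length. The helper name `word_level_report` and the sample tags are hypothetical, and the two scores are simple word-level figures rather than a standard entity-level F1.

```python
from typing import Dict, List


def word_level_report(true_tags: List[str], pred_tags: List[str]) -> Dict[str, float]:
    """Compute word-level accuracy and the share of gold entity words tagged correctly."""
    if len(true_tags) != len(pred_tags):
        raise ValueError(
            f"Length mismatch: {len(true_tags)} true tags vs {len(pred_tags)} predicted tags"
        )
    total = len(true_tags)
    correct = sum(t == p for t, p in zip(true_tags, pred_tags))
    # Positions where the gold tag marks an entity (anything other than 'O').
    entity_positions = [i for i, t in enumerate(true_tags) if t != "O"]
    entity_correct = sum(true_tags[i] == pred_tags[i] for i in entity_positions)
    return {
        "accuracy": correct / total if total else 0.0,
        "entity_tag_recall": entity_correct / len(entity_positions) if entity_positions else 0.0,
    }


# Hypothetical word-aligned tags for a four-word sentence.
true_example = ["O", "B-geo", "B-per", "I-per"]
pred_example = ["O", "B-geo", "B-per", "O"]
print(word_level_report(true_example, pred_example))
```

The explicit length check fails fast when predictions are still at sub-token level, instead of letting `zip` silently truncate and shift the comparison.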

## Conclusion


- Successfully fine-tuned DistilBERT for Named Entity Recognition.
- The model identifies entities such as persons, locations, organizations, and dates.

---

**Usefulness of NER**:
- Extracting structured data from unstructured text.
- Enhancing search and recommendation systems.
- Supporting legal, biomedical, and financial document analysis.

This project demonstrates how a lightweight transformer model like DistilBERT can be fine-tuned for a core NLP task such as NER.
