<span style="font-size:25px;">J040 Nathan Dsouza</span> 

# Named Entity Recognition (NER) with DeBERTa on PII Dataset

## What is NER?
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying key information (entities) in text, such as names of people, organizations, locations, dates, and more. NER is widely used in information extraction, data anonymization, search engines, and question answering systems.

## Why is NER used?
NER helps structure unstructured text data by extracting meaningful entities, enabling downstream applications to understand, organize, and protect sensitive information. In the context of PII (Personally Identifiable Information) detection, NER is crucial for identifying and removing sensitive data from educational or public datasets.

## Assignment Overview
This notebook demonstrates how to train a Named Entity Recognition (NER) model using the DeBERTa transformer architecture on the Kaggle PII dataset. The approach includes:
- Preprocessing data with BIO tagging (Begin, Inside, Outside)
- Masking non-initial subtokens with -100 so that they are ignored by the loss
- Training a DeBERTa model for token classification
- Evaluating using the `seqeval` metric
- Performing inference with HuggingFace's pipeline and aggregation strategy
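As a minimal illustration of the BIO scheme named in the list above, the sketch below tags a toy sentence. The sentence, the entity types `NAME` and `ORG`, and the helper `to_bio` are made up for this example and are not taken from the dataset.

```python
# Toy illustration of BIO tagging; the sentence and entity types are invented.
words = ["John", "Doe", "teaches", "at", "Stanford", "University", "."]

# Hypothetical word-level spans: (first_word, last_word_exclusive, entity_type)
spans = [(0, 2, "NAME"), (4, 6, "ORG")]


def to_bio(num_words, spans):
    """Return one BIO tag per word for the given word-level spans."""
    tags = ["O"] * num_words                 # everything starts as Outside
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"        # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"        # remaining words of the entity
    return tags


tags = to_bio(len(words), spans)
for word, tag in zip(words, tags):
    print(f"{word:12s}{tag}")
```

Each word gets exactly one tag, so a multi-word entity is recovered by reading a `B-` tag followed by its `I-` tags of the same type.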

## Approach
1. Install and import required packages
2. Load and preprocess the dataset (BIO tagging)
3. Tokenize and align labels for DeBERTa
4. Train the NER model
5. Evaluate using seqeval
6. Run inference with aggregation strategy

---


## 1. Install and Import Required Packages

In this step, we install and import all necessary libraries for data processing, modeling, and evaluation. This includes HuggingFace Transformers, Datasets, and seqeval for NER metrics.


In [5]:
# Step 2: Import libraries
import json
import numpy as np
from datasets import Dataset
import evaluate
from transformers import DebertaV2TokenizerFast, DebertaV2ForTokenClassification, TrainingArguments, Trainer, pipeline
from seqeval.metrics import classification_report

## 2. Load and Preprocess the Dataset

Here, we load the training and test splits of the Kaggle PII dataset from JSON. BIO tagging and the -100 masking of subtokens are applied in the following steps.

In [6]:
# Step 3: Load train and test data
def load_json(path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

train_data = load_json('/kaggle/input/j040-snlp-asgmt-6-2-data/train.json')
test_data = load_json('/kaggle/input/j040-snlp-asgmt-6-2-data/test.json')

## 3. Prepare BIO Labels and Label Mappings

We define functions to convert entity annotations to BIO format and create mappings between labels and their IDs. This ensures the model can learn to identify entities correctly.

In [7]:
# Step 4: Prepare BIO labels and label mappings
# 'O' (outside) always has id 0; B-/I- tags are added as they are encountered
bio_labels = ['O']
label2id = {'O': 0}
id2label = {0: 'O'}

def get_bio_labels(text, entities=None):
    words = text.split()
    labels = ['O'] * len(words)
    if entities:
        for ent in entities:
            # Find start and end word indices for the entity
            char_start, char_end, ent_type = ent.get('start'), ent.get('end'), ent.get('type')
            # Map character indices to word indices
            word_start = len(text[:char_start].split())
            word_end = len(text[:char_end].split())
            if word_start < len(words):
                labels[word_start] = f'B-{ent_type}'
                if f'B-{ent_type}' not in label2id:
                    label2id[f'B-{ent_type}'] = len(label2id)
                    id2label[label2id[f'B-{ent_type}']] = f'B-{ent_type}'
                    bio_labels.append(f'B-{ent_type}')
                for i in range(word_start + 1, min(word_end, len(words))):
                    labels[i] = f'I-{ent_type}'
                    if f'I-{ent_type}' not in label2id:
                        label2id[f'I-{ent_type}'] = len(label2id)
                        id2label[label2id[f'I-{ent_type}']] = f'I-{ent_type}'
                        bio_labels.append(f'I-{ent_type}')
    return labels

def prepare_dataset(data):
    filtered = [item for item in data if 'full_text' in item]
    texts = [item['full_text'].split() for item in filtered]
    labels = []
    for item in filtered:
        entities = item.get('entities', None)
        labels.append(get_bio_labels(item['full_text'], entities))
    return Dataset.from_dict({'text': texts, 'labels': labels})

train_dataset = prepare_dataset(train_data)
test_dataset = prepare_dataset(test_data)
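Before tokenizing, it can help to check that the BIO conversion actually produced entity tags. If every tag is `O`, for example because the records do not carry an `entities` key, the model has only one class to learn and the loss can collapse to zero. The sketch below is a sanity check that is not part of the original notebook; `tag_distribution` and `toy_labels` are hypothetical names, and in the notebook one would pass `train_dataset['labels']` instead of the toy data.

```python
from collections import Counter


def tag_distribution(label_sequences):
    """Count how often each BIO tag occurs across all sequences."""
    return Counter(tag for seq in label_sequences for tag in seq)


# Toy stand-in for train_dataset['labels']
toy_labels = [["O", "B-NAME", "I-NAME"], ["O", "O"]]

counts = tag_distribution(toy_labels)
non_o = sum(n for tag, n in counts.items() if tag != "O")

print(counts)
print("non-O tags:", non_o)
```

A count of zero non-`O` tags on the real data would suggest revisiting how the annotations are read before spending time on training.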


## 4. Tokenization and Label Alignment

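We tokenize the pre-split words with DeBERTa's tokenizer and align the word-level BIO labels to the resulting tokens. Only the first subtoken of each word receives the word's label; special tokens, padding, and the remaining subtokens are set to -100 so that the loss ignores them. This step prepares the data for model training.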

In [8]:
# Step 5: Tokenization and label alignment
tokenizer = DebertaV2TokenizerFast.from_pretrained('microsoft/deberta-v3-base')

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128,  # You can adjust this value as needed
        is_split_into_words=True
    )
    labels = []
    for i, label in enumerate(examples['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label2id[label[word_idx]])  # First subtoken carries the word's BIO label
            else:
                label_ids.append(-100)  # Don't label subtokens
            previous_word_idx = word_idx
        # Safeguard: pad label_ids to the length of input_ids if they differ
        pad_length = len(tokenized_inputs['input_ids'][i]) - len(label_ids)
        if pad_length > 0:
            label_ids += [-100] * pad_length
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
test_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)
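The alignment rule can be checked in isolation, without downloading a tokenizer. The sketch below is a minimal version under stated assumptions: `align_labels`, the hand-written `word_ids` list, and `toy_label2id` are hypothetical and only mimic what a fast tokenizer's `word_ids()` would return for three words split into several subtokens.

```python
def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Map word-level tags onto tokens, labeling only the first subtoken of each word."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding have no source word
            label_ids.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subtoken of a word: use that word's tag
            label_ids.append(label2id[word_labels[word_idx]])
        else:
            # Later subtokens of the same word are masked
            label_ids.append(ignore_index)
        previous_word_idx = word_idx
    return label_ids


# Hand-written example: [CLS], word 0 (2 subtokens), word 1, word 2 (3 subtokens), [SEP]
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = ["B-NAME", "I-NAME", "O"]
toy_label2id = {"O": 0, "B-NAME": 1, "I-NAME": 2}

aligned = align_labels(word_ids, word_labels, toy_label2id)
print(aligned)  # [-100, 1, -100, 2, 0, -100, -100, -100]
```

The key point is that the label id is looked up from the word's own tag, so entity tags survive tokenization instead of being flattened to `O`.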





Map: 6807 examples

Map: 10 examples

## 5. Model Setup

We initialize the DeBERTa model for token classification, specifying the number of labels and mappings. This sets up the model for training on the NER task.

In [9]:
# Step 6: Model setup
model = DebertaV2ForTokenClassification.from_pretrained(
    'microsoft/deberta-v3-base',
    num_labels=len(bio_labels),
    id2label=id2label,
    label2id=label2id
    )


Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 6. Training Arguments

We define the training parameters such as learning rate, batch size, number of epochs, and logging settings for the Trainer.

In [10]:
pip install --upgrade transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting transformers
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Downloading transformers-4.56.1-py3-none-any.whl (11.6 MB)
Downloading huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64

In [11]:
# Step 7: Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none"
    )

## 7. Custom Metric Using seqeval

We use the seqeval metric to compute entity-level precision, recall, and F1, along with token-level accuracy, for the NER predictions. Positions labeled -100 are dropped before scoring.

In [12]:
# Step 8: Custom metric using seqeval
# `evaluate` is already imported above; load the seqeval metric through it
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[pred] for (pred, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results.get("overall_precision", 0.0),
        "recall": results.get("overall_recall", 0.0),
        "f1": results.get("overall_f1", 0.0),
        "accuracy": results.get("overall_accuracy", 0.0),
    }
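To make the entity-level scoring concrete, the sketch below computes precision, recall, and F1 over exact-match spans in plain Python. It is a simplified illustration, not seqeval's implementation: `extract_spans` and `entity_prf` are hypothetical helpers, and unlike seqeval's default mode this version ignores an `I-` tag that does not follow a matching `B-` or `I-` tag.

```python
def extract_spans(tags):
    """Return a set of (type, start, end_exclusive) spans from a BIO tag sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            spans.add((ent_type, start, i))        # current span ends before position i
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]           # a B- tag always opens a new span
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exact-match entity spans."""
    tp = n_true = n_pred = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = extract_spans(t), extract_spans(p)
        tp += len(ts & ps)
        n_true += len(ts)
        n_pred += len(ps)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-NAME", "I-NAME", "O", "B-ORG"]]
y_pred = [["B-NAME", "I-NAME", "O", "O"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, round(f1, 4))
```

Here the predicted `NAME` span matches exactly while the `ORG` span is missed, so precision is 1.0 and recall is 0.5. When the references contain no entities at all, both counts are zero and the scores fall back to 0.0.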



## 8. Train the Model

We use HuggingFace's Trainer to train the DeBERTa model on the processed dataset, using the defined training arguments and custom metric.

In [13]:
# Step 9: Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
    )

trainer.train()



| Step | Training Loss |
|------|---------------|
| 10   | 0.0           |
| 20   | 0.0           |
| 30   | 0.0           |
| 40   | 0.0           |
| 50   | 0.0           |
| 60   | 0.0           |
| 70   | 0.0           |
| 80   | 0.0           |
| 90   | 0.0           |
| 100  | 0.0           |




TrainOutput(global_step=2553, training_loss=0.0, metrics={'train_runtime': 893.5599, 'train_samples_per_second': 22.854, 'train_steps_per_second': 2.857, 'total_flos': 1333997296439040.0, 'train_loss': 0.0, 'epoch': 3.0})
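The reported `global_step=2553` can be reproduced from the training arguments if we assume the run used two devices. The device count is an assumption inferred from the step count; the notebook does not state it.

```python
import math

num_examples = 6807          # size of the mapped training set in this run
per_device_batch_size = 4    # per_device_train_batch_size
num_epochs = 3               # num_train_epochs
num_devices = 2              # ASSUMPTION: inferred from the step count, not stated

effective_batch = per_device_batch_size * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

# Samples processed over all epochs; roughly train_samples_per_second * train_runtime
samples_seen = num_examples * num_epochs

print(steps_per_epoch, total_steps, samples_seen)  # 851 2553 20421
```

With a single device the same settings would give 1702 steps per epoch, so the step count is a quick way to confirm the effective batch size.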

## 9. Inference with Aggregation Strategy

We use the HuggingFace NER pipeline with an aggregation strategy that merges consecutive B- and I-labeled tokens into complete entities, making the output easier to interpret.

In [14]:
# Step 10: Inference with aggregation strategy
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # or "first", "max", "average"
    )

sample_text = "John Doe is a teacher at Stanford University."
results = ner_pipeline(sample_text)
print(results)

Device set to use cuda:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[]
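The idea behind aggregation can be sketched in plain Python. This is a simplified illustration and not the pipeline's actual implementation: `group_entities` is a hypothetical helper, the token predictions are invented, and the output keeps only the `entity_group`, `word`, and `score` fields. Tokens tagged `O` produce no entity, so a model that predicts `O` everywhere yields an empty list.

```python
def group_entities(tokens):
    """Merge consecutive B-/I- token predictions of one type into entity groups.

    tokens: list of dicts with 'word', 'entity' (a BIO tag), and 'score'.
    """
    groups, current = [], None
    for tok in tokens:
        tag = tok["entity"]
        ent_type = tag[2:] if tag != "O" else None
        continues = (
            current is not None
            and tag.startswith("I-")
            and current["entity_group"] == ent_type
        )
        if continues:
            # Same entity continues: collect the word and its score
            current["words"].append(tok["word"])
            current["scores"].append(tok["score"])
            continue
        if current is not None:
            groups.append(current)   # close the entity that just ended
            current = None
        if ent_type is not None:
            # A B- tag (or a stray I- tag) opens a new entity
            current = {"entity_group": ent_type, "words": [tok["word"]], "scores": [tok["score"]]}
    if current is not None:
        groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),  # mean token score
        }
        for g in groups
    ]


# Invented token-level predictions for illustration
toy_tokens = [
    {"word": "John", "entity": "B-NAME", "score": 0.9},
    {"word": "Doe", "entity": "I-NAME", "score": 0.7},
    {"word": "is", "entity": "O", "score": 0.99},
    {"word": "Stanford", "entity": "B-ORG", "score": 0.8},
    {"word": "University", "entity": "I-ORG", "score": 0.6},
]

entities = group_entities(toy_tokens)
print([(e["entity_group"], e["word"]) for e in entities])
# [('NAME', 'John Doe'), ('ORG', 'Stanford University')]
```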


## 10. Evaluation Report

We generate a detailed classification report using seqeval to assess the model's performance on the test set, including precision, recall, F1-score, and support for each entity type. The report is only printed when the true labels contain at least one entity.

In [15]:
# Step 11: Evaluation report
predictions, labels, _ = trainer.predict(test_dataset)
predictions = np.argmax(predictions, axis=2)
true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
true_predictions = [
    [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
    ]
print('True Labels:', true_labels)
print('True Predictions:', true_predictions)
if any(tag != 'O' for seq in true_labels for tag in seq):
    print(classification_report(true_labels, true_predictions))
else:
    print("No entities found in true labels. Classification report not generated.")





True Labels: [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'