# Fine-tune DistilBERT on the WNUT 17 dataset to detect emerging entities (person, location, corporation, group, product, or creative work)

## Introduction

**DistilBERT** is a smaller, faster, and lighter version of the BERT (Bidirectional Encoder Representations from Transformers) model.

It is designed to retain 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster.

DistilBERT achieves this through a process called knowledge distillation, where a smaller "student" model learns to mimic a larger "teacher" model. This makes DistilBERT an efficient alternative for various natural language processing tasks like text classification, sentiment analysis, question answering, and named entity recognition (the task in this notebook), especially in environments with limited computational resources.
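To make the "student mimics teacher" idea concrete, here is a minimal sketch of a soft-target loss in plain Python. It is not DistilBERT's actual training code: the logits are made up, the temperature of 2.0 is an arbitrary choice, and the real training objective combines a term like this with other losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Made-up scores for three classes (illustration only)
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different class

print(soft_target_loss(close_student, teacher))
print(soft_target_loss(far_student, teacher))
```

The student whose scores track the teacher's gets the lower loss, which is the signal that pushes the smaller model toward the larger model's behavior.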


![](https://www.scaler.com/topics/images/tokenization-text.webp)

## Setup

In [None]:
!pip install transformers datasets evaluate accelerate gradio

In [2]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


## Load WNUT 17 dataset

In [3]:
from datasets import load_dataset

wnut = load_dataset("wnut_17")
wnut

The repository for wnut_17 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wnut_17.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

In [6]:
# Take a look at an example
wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

In [8]:
# Convert 'ner-tags' numbers to their label names
label_list = wnut["train"].features["ner_tags"].feature.names
label_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

The prefix of each NER tag indicates the token's position within an entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (for example, the State token is a part of an entity like Empire State Building).
- `O` indicates the token doesn’t correspond to any entity.
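As a rough illustration of how these prefixes are read, the sketch below groups tagged tokens back into entity spans. The helper `decode_entities` is hypothetical (it is not part of `datasets` or `transformers`), and it simply drops a stray `I-` tag that does not continue an open entity.

```python
# Label names in the same order as the dataset's ner_tags feature
label_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

def decode_entities(tokens, tag_ids):
    """Group B-/I- tagged tokens into (text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity_type = label_list[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if prefix == "B":
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# A slice of the training example shown above
tokens = ["Empire", "State", "Building", "=", "ESB", "."]
tag_ids = [7, 8, 8, 0, 7, 0]
entities = decode_entities(tokens, tag_ids)
print(entities)
# [('Empire State Building', 'location'), ('ESB', 'location')]
```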

## Preprocess the data

In [10]:
# Load the DistilBERT tokenizer to preprocess the tokens
from transformers import AutoTokenizer

model_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [12]:
# tokenize the words into subwords
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 '@',
 'paul',
 '##walk',
 'it',
 "'",
 's',
 'the',
 'view',
 'from',
 'where',
 'i',
 "'",
 'm',
 'living',
 'for',
 'two',
 'weeks',
 '.',
 'empire',
 'state',
 'building',
 '=',
 'es',
 '##b',
 '.',
 'pretty',
 'bad',
 'storm',
 'here',
 'last',
 'evening',
 '.',
 '[SEP]']

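Tokenization adds the special tokens `[CLS]` and `[SEP]` and splits some words into several subword tokens, so the 27 word-level labels no longer line up with the 34 tokens. We’ll need to realign the tokens and labels by: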

- Mapping all tokens to their corresponding word with the word_ids method.

- Assigning the label -100 to the special tokens [CLS] and [SEP] so they’re ignored by the PyTorch loss function (CrossEntropyLoss).

- Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.
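The three steps can be tried on a tiny hand-written case before applying them to the dataset. In this sketch the `word_ids` list is typed in by hand to mimic what a fast tokenizer's `word_ids` method returns (`None` for special tokens, otherwise the index of the source word), and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Spread word-level labels over subword tokens, masking everything else."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are masked out
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned

# Words:  Empire State Building = ESB
# Tokens: [CLS] empire state building = es ##b [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 4, None]
word_labels = [7, 8, 8, 0, 7]

aligned = align_labels(word_ids, word_labels)
print(aligned)
# [-100, 7, 8, 8, 0, 7, -100, -100]
```

The result has one entry per token, and only the first subword of `ESB` carries a real label.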

In [13]:
# Realign the word-level labels with the subword tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # Map tokens to their respective words
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            # Special tokens have no word id; set their label to -100
            if word_idx is None:
                label_ids.append(-100)
            # Only label the first token of a given word
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [14]:
# Apply the preprocessing function over the entire dataset
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

In [16]:
tokenized_wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'input_ids': [101,
  1030,
  2703,
  17122,
  2009,
  1005,
  1055,
  1996,
  3193,
  2013,
  2073,
  1045,
  1005,
  1049,
  2542,
  2005,
  2048,
  3134,
  1012,
  3400,
  2110,
  2311,
  1027,
  9686,
  2497,
  1012,
  3492,
  2919,
  4040,
  2182,
  2197,
  3944,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100,
  0,
  -100,
  -100,
  0,
  0,
  -100,
  0,
  0,
  0,
  0,
  0,
  0,
  

In [17]:
# Create a batch of examples using DataCollatorForTokenClassification
# It dynamically pads the inputs and the labels to the longest sequence in each batch during collation

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
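Conceptually, dynamic padding looks like the sketch below. This is a plain-Python approximation, not the library's implementation: the real collator returns tensors and takes its padding settings from the tokenizer. The function name `pad_batch` and the pad token id of 0 are assumptions for illustration.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        # Padding positions get attention 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
        # Padding positions get -100 so the loss ignores them
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch

# Two toy examples of different lengths (token ids are arbitrary)
features = [
    {"input_ids": [101, 3400, 2110, 102], "attention_mask": [1, 1, 1, 1], "labels": [-100, 7, 8, -100]},
    {"input_ids": [101, 102], "attention_mask": [1, 1], "labels": [-100, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])       # [101, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 0, 0]
print(batch["labels"][1])          # [-100, -100, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.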

## Evaluate

In [None]:
!pip install seqeval

In [22]:
# Load the seqeval metric (precision, recall, F1, accuracy)
import evaluate

sequval = evaluate.load("seqeval")

In [24]:
# Get the NER labels
import numpy as np

labels = [label_list[i] for i in example["ner_tags"]]
labels

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'I-location',
 'I-location',
 'O',
 'B-location',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [25]:
# Define a function that passes true predictions and true labels to calculate evaluation scores
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = sequval.compute(
        predictions=true_predictions,
        references=true_labels,
    )

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
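The filtering step inside `compute_metrics` is easier to follow on a toy input. The sketch below uses made-up scores and a shortened three-label set, and replaces `np.argmax` with a small pure-Python helper so it runs without NumPy. `strip_ignored` is a hypothetical name.

```python
# Shortened label set, for illustration only
label_list = ["O", "B-location", "I-location"]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def strip_ignored(logits, labels, ignore_index=-100):
    """Keep only positions with a real label and map ids to tag names."""
    pred_tags, gold_tags = [], []
    for token_scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special token or non-first subword
        pred_tags.append(label_list[argmax(token_scores)])
        gold_tags.append(label_list[label])
    return pred_tags, gold_tags

# One sequence: [CLS] empire state es ##b [SEP] (scores are made up)
logits = [
    [0.1, 0.2, 0.0],  # [CLS], ignored
    [0.2, 2.5, 0.1],  # empire
    [0.3, 0.4, 1.9],  # state
    [2.0, 0.1, 0.3],  # es (the model misses this entity)
    [0.1, 0.1, 0.1],  # ##b, ignored
    [0.0, 0.3, 0.1],  # [SEP], ignored
]
labels = [-100, 1, 2, 1, -100, -100]

pred_tags, gold_tags = strip_ignored(logits, labels)
print(pred_tags)  # ['B-location', 'I-location', 'O']
print(gold_tags)  # ['B-location', 'I-location', 'B-location']
```

These two lists of tag strings, one pair per sequence, are the shape of input that the metric receives as `predictions` and `references`.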

## Train the Model

In [26]:
# Create maps between the integer ids and their label names
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

In [29]:
id2label, label2id

({0: 'O',
  1: 'B-corporation',
  2: 'I-corporation',
  3: 'B-creative-work',
  4: 'I-creative-work',
  5: 'B-group',
  6: 'I-group',
  7: 'B-location',
  8: 'I-location',
  9: 'B-person',
  10: 'I-person',
  11: 'B-product',
  12: 'I-product'},
 {'O': 0,
  'B-corporation': 1,
  'I-corporation': 2,
  'B-creative-work': 3,
  'I-creative-work': 4,
  'B-group': 5,
  'I-group': 6,
  'B-location': 7,
  'I-location': 8,
  'B-person': 9,
  'I-person': 10,
  'B-product': 11,
  'I-product': 12})

In [30]:
# Load DistilBERT Model
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
# Define training hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="wnut-distilbert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

In [32]:
# Pass the model, dataset, tokenizer, data collator, compute_metrics function and training arguments to Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [33]:
# Call train() to finetune our model
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.275082 | 0.511387 | 0.228916 | 0.316261 | 0.938481 |
| 2 | No log | 0.262717 | 0.53985 | 0.332715 | 0.411697 | 0.943397 |
| 3 | 0.183200 | 0.27037 | 0.533626 | 0.338276 | 0.414067 | 0.94438 |


TrainOutput(global_step=639, training_loss=0.16013931817665158, metrics={'train_runtime': 102.309, 'train_samples_per_second': 99.522, 'train_steps_per_second': 6.246, 'total_flos': 137784250920420.0, 'train_loss': 0.16013931817665158, 'epoch': 3.0})
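The `global_step=639` value and the "No log" entries for the first two epochs can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and a logging interval of 500 steps (the value the `Trainer` uses when `logging_steps` is not set, as it is not set above).

```python
import math

train_examples = 3394        # size of the train split
per_device_batch_size = 16   # from TrainingArguments above
num_epochs = 3
logging_steps = 500          # assumed default logging interval

steps_per_epoch = math.ceil(train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
# Epoch during which the first training-loss value is logged
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch)     # 213
print(total_steps)         # 639
print(first_logged_epoch)  # 3
```

With 213 steps per epoch, step 500 falls in the third epoch, which is why a training loss only appears in the last row.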

In [34]:
# Push the model to hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Ashaduzzaman/wnut-distilbert-finetuned/commit/6c0748259869e6425248953e2729baec39f172b7', commit_message='End of training', commit_description='', oid='6c0748259869e6425248953e2729baec39f172b7', pr_url=None, pr_revision=None, pr_num=None)

## Inference

### Perform inference using pipeline

In [39]:
# text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."

In [40]:
from transformers import pipeline

classifier = pipeline("ner", model="Ashaduzzaman/wnut-distilbert-finetuned")
classifier(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity': 'B-person',
  'score': 0.8594397,
  'index': 1,
  'word': 'barack',
  'start': 0,
  'end': 6},
 {'entity': 'B-person',
  'score': 0.6641145,
  'index': 2,
  'word': 'obama',
  'start': 7,
  'end': 12},
 {'entity': 'B-location',
  'score': 0.7212716,
  'index': 6,
  'word': 'hawaii',
  'start': 25,
  'end': 31},
 {'entity': 'B-person',
  'score': 0.38340038,
  'index': 12,
  'word': 'president',
  'start': 55,
  'end': 64}]
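Because the checkpoint is uncased, the `word` field is lowercased. The `start` and `end` character offsets can be used to recover the original spelling from the input text. The sketch below copies the relevant fields from the output above, and `surface_forms` is a hypothetical helper.

```python
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."

# Fields copied from the pipeline output shown above
predictions = [
    {"entity": "B-person", "score": 0.8594397, "start": 0, "end": 6},
    {"entity": "B-person", "score": 0.6641145, "start": 7, "end": 12},
    {"entity": "B-location", "score": 0.7212716, "start": 25, "end": 31},
    {"entity": "B-person", "score": 0.38340038, "start": 55, "end": 64},
]

def surface_forms(text, predictions):
    """Pair each prediction with the exact characters it covers in the text."""
    return [(text[p["start"]:p["end"]], p["entity"]) for p in predictions]

spans = surface_forms(text, predictions)
print(spans)
# [('Barack', 'B-person'), ('Obama', 'B-person'), ('Hawaii', 'B-location'), ('President', 'B-person')]
```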

### Perform inference using Gradio

In [37]:
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification
from transformers import pipeline

# Load the tokenizer and model
model_name = "Ashaduzzaman/wnut-distilbert-finetuned"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForTokenClassification.from_pretrained(model_name)

# Create a pipeline for token classification
nlp_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)


In [38]:
import gradio as gr

def predict_entities(text):
    # Use the pipeline to get NER predictions
    results = nlp_pipeline(text)

    # Format the results
    entities = [{"word": res["word"], "entity": res["entity"], "score": res["score"]} for res in results]
    return entities

# Create a Gradio interface
interface = gr.Interface(
    fn=predict_entities,
    inputs=gr.Textbox(label="Input Text", lines=5),
    outputs=gr.JSON(label="Named Entities"),
    title="NER with Fine-tuned DistilBERT",
    description="Enter text to extract named entities using a fine-tuned DistilBERT model."
)

# Launch the interface
interface.launch()
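One possible refinement is to keep the formatting logic in its own function so it can be checked without loading the model or launching the interface. In this sketch, `format_entities` and the `min_score` threshold are hypothetical additions. The scores are cast to plain Python floats in case the pipeline returns NumPy values.

```python
def format_entities(results, min_score=0.5):
    """Keep confident predictions and return JSON-friendly dictionaries."""
    return [
        {
            "word": res["word"],
            "entity": res["entity"],
            "score": round(float(res["score"]), 4),
        }
        for res in results
        if float(res["score"]) >= min_score
    ]

# Stand-in for nlp_pipeline(text), using values from the earlier output
fake_results = [
    {"word": "barack", "entity": "B-person", "score": 0.8594397},
    {"word": "hawaii", "entity": "B-location", "score": 0.7212716},
    {"word": "president", "entity": "B-person", "score": 0.38340038},
]

formatted = format_entities(fake_results)
print(formatted)
# [{'word': 'barack', 'entity': 'B-person', 'score': 0.8594},
#  {'word': 'hawaii', 'entity': 'B-location', 'score': 0.7213}]
```

Inside `predict_entities`, the list comprehension could then be replaced by `format_entities(nlp_pipeline(text))`. The 0.5 cutoff is arbitrary and would need tuning.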




