### Installing Required Libraries
This cell installs the Python libraries used for Named Entity Recognition (NER): `datasets` (pinned to version 3.6.0), `transformers`, and `seqeval`.

In [1]:
!pip install -q -U datasets==3.6.0 transformers seqeval


### Importing Required Modules
This cell imports the classes and functions used in the rest of the notebook:
- `load_dataset` for fetching the CoNLL-2003 NER dataset.
- `AutoTokenizer` and `AutoModelForTokenClassification` for loading the pre-trained BERT tokenizer and model.
- `Trainer` and `TrainingArguments` for fine-tuning the model.
- `numpy` for taking the argmax over the model's output scores.
- `accuracy_score`, `f1_score`, and `classification_report` from `seqeval` for evaluation (`classification_report` is imported but not called in the cells below).
- The `pipeline` function to create an easy-to-use NER inference pipeline.


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, classification_report
from transformers import pipeline

### Loading the CoNLL-2003 Dataset
Loads the **CoNLL-2003** dataset, a widely used benchmark for Named Entity Recognition. It annotates four entity types: persons (`PER`), locations (`LOC`), organizations (`ORG`), and miscellaneous names (`MISC`).


In [3]:
dataset = load_dataset("conll2003", trust_remote_code=True)


In [4]:
print(dataset["train"][0])
print(f"Train samples: {len(dataset['train'])}")

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
Train samples: 14041


### Initializing Tokenizer and Extracting Labels
Loads the `bert-base-cased` tokenizer to match the pre-trained BERT model.  
Also retrieves the list of possible NER label names from the dataset (e.g., "B-PER", "I-LOC").


In [5]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = dataset["train"].features["ner_tags"].feature.names


In [6]:
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
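
As a quick check on how the integer tags map to these names, the sketch below decodes the `ner_tags` of the first training example printed earlier. The label list and the example are copied from the outputs above; `decode_tags` is a hypothetical helper name used only for illustration and is not part of `datasets`.

```python
# Label names as printed by `label_list` above.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training example, copied from the dataset printout.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]


def decode_tags(tag_ids, names):
    """Map integer tag IDs to their string label names (hypothetical helper)."""
    return [names[t] for t in tag_ids]


decoded = decode_tags(ner_tags, label_list)

# Keep only the words that carry an entity tag.
entities = [(tok, tag) for tok, tag in zip(tokens, decoded) if tag != "O"]
print(entities)
# [('EU', 'B-ORG'), ('German', 'B-MISC'), ('British', 'B-MISC')]
```

The `B-` prefix marks the first word of an entity and `I-` marks a continuation, so a multi-word entity would appear as one `B-` tag followed by `I-` tags of the same type.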

### Tokenizing and Aligning Word Labels
Defines a function `tokenize_and_align_labels()` that:
- Tokenizes each pre-split word in a sentence while keeping track of which word every token came from.
- Assigns the word's label to its first word piece, and `-100` to special tokens (`[CLS]`, `[SEP]`) and to any remaining word pieces of the same word, so that those positions are ignored during loss computation.

This ensures each labeled token aligns correctly with its corresponding NER label.


In [7]:
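def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; each word may become several word pieces.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        # word_ids maps each token to its source word index (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        labels = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] are ignored by the loss.
                labels.append(-100)
            elif word_idx != previous_word_idx:
                # The first word piece of a word carries the word's label.
                labels.append(word_labels[word_idx])
            else:
                # Remaining word pieces of the same word are ignored.
                labels.append(-100)
            previous_word_idx = word_idx
        all_labels.append(labels)

    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs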

In [8]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True, remove_columns=dataset["train"].column_names)


In [9]:
print(tokenized_dataset["train"][0].keys())
print("Input IDs:", tokenized_dataset["train"][0]["input_ids"])
print("Labels:", tokenized_dataset["train"][0]["labels"])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
Input IDs: [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102]
Labels: [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
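
The label sequence printed above can be reproduced by hand. The sketch below applies the same alignment rule to the first example without calling the tokenizer. The `word_ids` list is an assumption inferred from the output (12 tokens, with the word "lamb" split into two pieces), and `align_labels` is a hypothetical standalone version of the inner loop.

```python
# Word-level tags of the first training example (from the dataset printout).
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Assumed token-to-word mapping: None for [CLS]/[SEP], and word 7 ("lamb")
# appearing twice because it is split into two word pieces.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first piece of each word; mask everything else (hypothetical helper)."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special token, or a continuation piece of the previous word.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


labels = align_labels(word_ids, ner_tags)
print(labels)
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
```

Only nine of the twelve positions keep a real label, one per original word, which is why the loss and the metrics later count each word exactly once.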


### Preparing Data Collator
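Initializes a `DataCollatorForTokenClassification`, which dynamically pads the input sequences in each batch to the length of the longest one. This is required because the tokenized sentences have different lengths. The collator also pads the `labels` with `-100`, so padded positions are ignored by the loss.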


In [10]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
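
To make the padding behavior concrete, here is a minimal pure-Python sketch of what the collator does conceptually. It is not the library's implementation: `pad_batch` is a hypothetical function, it returns plain lists rather than tensors, and it assumes right-side padding with pad token ID `0` (the `pad_token_id` of `bert-base-cased`). The second feature is a made-up short example.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the longest one (illustrative only)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    # First training example, copied from the tokenized output.
    {"input_ids": [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102],
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]},
    # Made-up short example: [CLS] EU [SEP].
    {"input_ids": [101, 7270, 102], "labels": [-100, 3, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
print(batch["labels"][1])
```

Padding per batch instead of to a global maximum length keeps batches of short sentences small, which is the main reason to use a collator rather than padding during tokenization.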


### Loading Pretrained BERT Model for Token Classification
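Loads the pre-trained `bert-base-cased` model and adds a token-level classification head whose output size matches the number of labels in the dataset. The head is newly initialized, which is why the loader warns that the model still needs to be trained.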


In [11]:
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list)
)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Updating Model Label Mappings
Defines mappings between label IDs and label names (`id2label` and `label2id`) to ensure correct encoding and decoding of predictions.


In [12]:
model.config.id2label = {i: label for i, label in enumerate(label_list)}
model.config.label2id = {label: i for i, label in enumerate(label_list)}

In [13]:
model.config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

### Defining Evaluation Metrics
Defines a custom function `compute_metrics()` to calculate model performance:
- Converts raw predictions (logits) into class IDs by taking the argmax over the label dimension.
- Filters out ignored positions (`-100`).
- Computes token-level accuracy and entity-level F1 using `seqeval` metrics.


In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Pick the highest-scoring class for every token position.
    pred_ids = np.argmax(logits, axis=2)

    # Drop positions labeled -100 (special tokens, continuation word pieces, padding)
    # and convert the remaining IDs to label names, as seqeval expects.
    true_predictions = [
        [label_list[p] for (p, l) in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(pred_ids, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(pred_ids, labels)
    ]

    return {
        "accuracy": accuracy_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
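
The difference between the two metrics is easier to see on a toy case. The following sketch is a simplified pure-Python re-implementation, not `seqeval`'s code: `strip_ignored`, `extract_spans`, and `span_f1` are hypothetical helpers, the span extraction is a basic BIO reader, and the prediction IDs are made up to stand in for the argmax output.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def strip_ignored(pred_ids, label_ids, names, ignore_index=-100):
    """Drop masked positions and convert IDs to label names."""
    preds, golds = [], []
    for p_row, l_row in zip(pred_ids, label_ids):
        kept = [(names[p], names[l]) for p, l in zip(p_row, l_row) if l != ignore_index]
        preds.append([p for p, _ in kept])
        golds.append([g for _, g in kept])
    return preds, golds


def extract_spans(tags):
    """Return a set of (type, start, end) entity spans from a BIO sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if not continues:
            if kind is not None:
                spans.add((kind, start, i))
            start, kind = (None, None) if tag == "O" else (i, tag[2:])
    return spans


def span_f1(golds, preds):
    """Entity-level F1: a predicted span counts only if type and boundaries match."""
    tp = n_pred = n_gold = 0
    for g, p in zip(golds, preds):
        gs, ps = extract_spans(g), extract_spans(p)
        tp, n_pred, n_gold = tp + len(gs & ps), n_pred + len(ps), n_gold + len(gs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Made-up batch of one sentence: gold has an ORG and a MISC, the model finds only the ORG.
label_ids = [[-100, 3, 0, 7, 0, -100]]
pred_ids = [[0, 3, 0, 0, 0, 0]]

preds, golds = strip_ignored(pred_ids, label_ids, label_list)
n_tokens = sum(len(g) for g in golds)
accuracy = sum(p == g for ps, gs in zip(preds, golds) for p, g in zip(ps, gs)) / n_tokens
f1 = span_f1(golds, preds)
print(accuracy, round(f1, 4))
```

Here three of four tokens are correct, but only one of two gold entities is recovered, so the entity-level F1 is noticeably lower than the token accuracy. That gap is why F1 is usually the headline number for NER.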

### Setting Training Parameters
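Creates a `TrainingArguments` object to configure the fine-tuning process, including:
- The output directory (`bert-ner`) and evaluation at the end of every epoch.
- The learning rate (`2e-5`) and the per-device batch size (16 for both training and evaluation).
- The number of epochs (3) and the weight decay (0.01).
- Logging every 10 steps, with external reporting integrations disabled (`report_to="none"`).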


In [15]:
args = TrainingArguments(
    "bert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",
)

### Initializing Trainer
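Creates a `Trainer` instance that brings together:
- The model, the training and validation splits, the tokenizer, and the data collator.
- The training arguments.

This simplifies fine-tuning and evaluation of the BERT model on the NER dataset. Note that `compute_metrics` is not passed to the `Trainer` in this cell, so the per-epoch evaluation in the run below reports only the validation loss; passing `compute_metrics=compute_metrics` would add accuracy and F1 to each evaluation.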


In [16]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    # `processing_class` replaces the deprecated `tokenizer` argument (transformers >= 4.46).
    processing_class=tokenizer,
    data_collator=data_collator,
)



In [17]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0473 | 0.040521 |
| 2 | 0.0069 | 0.040649 |
| 3 | 0.0116 | 0.037496 |


TrainOutput(global_step=2634, training_loss=0.05069550948504616, metrics={'train_runtime': 608.463, 'train_samples_per_second': 69.229, 'train_steps_per_second': 4.329, 'total_flos': 1050534559887048.0, 'train_loss': 0.05069550948504616, 'epoch': 3.0})
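
The `global_step` value reported above follows directly from the dataset size and the training arguments. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept, which matches the defaults used in this notebook.

```python
import math

train_samples = 14041   # size of the training split
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs

# The last batch of each epoch is smaller than 16, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
# 878 2634
```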

### Creating Inference Pipeline
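Initializes a Hugging Face `pipeline` for Named Entity Recognition using the fine-tuned model.  
This allows for easy NER inference on raw text inputs. With `aggregation_strategy=None`, the pipeline returns one prediction per word piece rather than grouped entities.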


In [18]:
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy=None
)

Device set to use cuda:0


In [19]:
text = "Angela Merkel was born in Hamburg and was the chancellor of Germany."
ner_results = ner_pipeline(text)

### Post-Processing Function to Merge Word Pieces
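Defines a helper function `merge_tokens_with_O()` that merges subword tokens (e.g., "Ger" + "##many") into full words.  
It averages the confidence scores of the pieces and keeps the label of the first piece for the merged word. Because the pipeline drops tokens predicted as `O` by default, only entity tokens reach this function.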


In [20]:
def merge_tokens_with_O(ner_results):
    merged = []
    current_word = ""
    current_label = ""
    scores = []

    for token in ner_results:
        word = token["word"]
        label = token["entity"]
        score = token["score"]

        if word.startswith("##"):
            current_word += word[2:]
            scores.append(score)
        else:
            if current_word:
                avg_score = sum(scores) / len(scores)
                merged.append((current_word, current_label, avg_score))

            current_word = word
            current_label = label
            scores = [score]

    if current_word:
        avg_score = sum(scores) / len(scores)
        merged.append((current_word, current_label, avg_score))

    return merged
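
To see the merging rule in isolation, the sketch below runs a compact equivalent of the function on hand-written input shaped like the pipeline's per-token output. The word split and the scores are made up for illustration and are not real model output; `merge_word_pieces` is a hypothetical name for this standalone copy.

```python
def merge_word_pieces(ner_results):
    """Join '##' continuation pieces onto the preceding word (illustrative copy)."""
    merged = []  # each entry is [word, label, list_of_scores]
    for token in ner_results:
        if token["word"].startswith("##") and merged:
            # Continuation piece: extend the current word and collect its score.
            merged[-1][0] += token["word"][2:]
            merged[-1][2].append(token["score"])
        else:
            # New word: its label is taken from this first piece.
            merged.append([token["word"], token["entity"], [token["score"]]])
    return [(word, label, sum(scores) / len(scores)) for word, label, scores in merged]


# Made-up per-token predictions in the pipeline's dictionary format.
toy_results = [
    {"word": "Ham", "entity": "B-LOC", "score": 1.0},
    {"word": "##burg", "entity": "I-LOC", "score": 0.5},
    {"word": "Germany", "entity": "B-LOC", "score": 1.0},
]

toy_merged = merge_word_pieces(toy_results)
print(toy_merged)
# [('Hamburg', 'B-LOC', 0.75), ('Germany', 'B-LOC', 1.0)]
```

This only rejoins word pieces; multi-word entities such as a first and last name still appear as separate `B-` and `I-` entries. Setting the pipeline's `aggregation_strategy` to a value such as `"simple"` is an alternative when grouped entities are wanted directly.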


In [21]:
merged_results = merge_tokens_with_O(ner_results)

for word, label, score in merged_results:
    print(f"{word} ({label}) - Score: {score:.2f}")


Angela (B-PER) - Score: 1.00
Merkel (I-PER) - Score: 0.99
Hamburg (B-LOC) - Score: 1.00
Germany (B-LOC) - Score: 1.00
