## Named Entity Recognition (NER) with Transformers 

### Introduction

In this notebook, we use a BERT Transformer model for a Natural Language Processing task: Named Entity Recognition (NER). NER is a token classification task that identifies and extracts entities from text documents.

### Objectives

- Understand Tokenizing Process 
- Go through NER pipeline


### Dataset

https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus/data - Kaggle Dataset for NER with Corresponding Entity Tags for each Sentence

#### Attributes
- Sentence # - Index (String)
- Sentence - Text Data (String)
- POS - Part of Speech (String)
- Tag - Entity Tag (String)


In [2]:
# Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
import ast
import torch



In [3]:
# Import Data
df = pd.read_csv('./Data/ner.csv')

print(df.columns, df.dtypes)

Index(['Sentence #', 'Sentence', 'POS', 'Tag'], dtype='object') Sentence #    object
Sentence      object
POS           object
Tag           object
dtype: object


In [4]:
df.isna().sum()

Sentence #    0
Sentence      0
POS           0
Tag           0
dtype: int64

In [5]:
df.head()

| | Sentence # | Sentence | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands of demonstrators have marched throug... | ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'... | ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '... |
| 1 | Sentence: 2 | Families of soldiers killed in the conflict jo... | ['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 2 | Sentence: 3 | They marched from the Houses of Parliament to ... | ['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 3 | Sentence: 4 | Police put the number of marchers at 10,000 wh... | ['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| 4 | Sentence: 5 | The protest comes on the eve of the annual con... | ['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D... | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |


In [6]:
df.describe()

| | Sentence # | Sentence | POS | Tag |
|---|---|---|---|---|
| count | 47959 | 47959 | 47959 | 47959 |
| unique | 47959 | 47575 | 47214 | 33318 |
| top | Sentence: 47959 | VOA 's Mil Arcega reports . | ['NNP', 'POS', 'NNP', 'NNP', 'VBZ', '.'] | ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ... |
| freq | 1 | 17 | 39 | 450 |


In [7]:
# Parse each tag string (e.g. "['O', 'B-geo']") into a Python list
df['Tag'] = df['Tag'].apply(ast.literal_eval)

In [8]:
# Flatten the tag lists with explode and collect the unique labels
labels = set(df['Tag'].explode().unique())

labels

{'B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O'}

## Entity Tags

### Prefixes (Chunks)

`B` - prefix indicates the beginning of a named entity. <br>
`I` - prefix indicates that the token is inside a named entity. <br>
`O` - indicates that the token is not a named entity. <br>
<br>

### Suffixes
`art` Artifacts, e.g., books, songs, etc.<br>
`eve` Events, e.g., battles, elections, holidays, etc.<br>
`geo` Geographical entities, e.g., cities, rivers, countries, etc.<br>
`gpe` Geopolitical entities, e.g., cities, states, countries.<br>
`nat` Natural phenomena, e.g., hurricanes, earthquakes.<br>
`org` Organizations, e.g., companies, government organizations, etc.<br>
`per` Persons.<br>
`tim` Time indicators, e.g., dates, days, months, etc.
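
To make the prefix scheme concrete, here is a minimal sketch that groups a BIO tag sequence into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the dataset or of any library), the example sentence is illustrative rather than taken from the corpus, and an `I-` tag that does not continue a matching entity is treated as the start of a new one, which is only one of several possible conventions.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans (hypothetical helper)."""
    spans, current = [], None  # current = [entity_type, [words so far]]
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == ent_type:
            current[1].append(word)  # continue the open entity
            continue
        if current is not None:  # any other tag closes the open entity
            spans.append((current[0], " ".join(current[1])))
            current = None
        if prefix in ("B", "I"):  # open a new entity
            current = [ent_type, [word]]
    if current is not None:
        spans.append((current[0], " ".join(current[1])))
    return spans


# Illustrative sentence, not taken from the dataset
words = ["Protesters", "marched", "through", "Hyde", "Park", "in", "London", "."]
tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-geo", "O"]
print(bio_to_spans(words, tags))
```

Here `Hyde Park` comes out as a single `geo` span because the `I-geo` tag continues the entity opened by `B-geo`, while `London` forms its own span.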

In [9]:
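# Map each label to an integer id and back.
# Note: `labels` is a set, so the ids assigned here can change between sessions.
label_to_id = {l: i for i, l in enumerate(labels)}
id_to_label = {i: l for l, i in label_to_id.items()}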

label_to_id

{'I-gpe': 0,
 'I-per': 1,
 'I-tim': 2,
 'B-gpe': 3,
 'B-org': 4,
 'B-per': 5,
 'B-nat': 6,
 'I-art': 7,
 'B-eve': 8,
 'I-nat': 9,
 'I-geo': 10,
 'B-geo': 11,
 'I-org': 12,
 'B-art': 13,
 'I-eve': 14,
 'B-tim': 15,
 'O': 16}
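
Because `labels` is a `set`, the ids above depend on the set's iteration order, which for strings can differ from one Python session to the next. One way to get a reproducible mapping is to sort the labels first. The sketch below uses a shortened, illustrative label set and the hypothetical names `stable_label_to_id` and `stable_id_to_label`; the ids it produces are not the ones printed above.

```python
# Illustrative subset of the label set (assumption: the real set has 17 labels)
example_labels = {"O", "B-geo", "I-geo", "B-per", "I-per"}

# Sorting fixes the order, so the same label always gets the same id
stable_label_to_id = {label: i for i, label in enumerate(sorted(example_labels))}
stable_id_to_label = {i: label for label, i in stable_label_to_id.items()}

print(stable_label_to_id)
```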

In [10]:
# Isolate the sentence and tag columns
df = df[['Sentence', 'Tag']]
df.head()

| | Sentence | Tag |
|---|---|---|
| 0 | Thousands of demonstrators have marched throug... | [O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo... |
| 1 | Families of soldiers killed in the conflict jo... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 2 | They marched from the Houses of Parliament to ... | [O, O, O, O, O, O, O, O, O, O, O, B-geo, I-geo... |
| 3 | Police put the number of marchers at 10,000 wh... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] |
| 4 | The protest comes on the eve of the annual con... | [O, O, O, O, O, O, O, O, O, O, O, B-geo, O, O,... |


In [11]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

sentence = df['Sentence'].iloc[0]

sentence

'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

## Tokenizer 

### Input

The tokenizer takes a string (or a list of words) as input. BERT accepts at most 512 tokens per sequence, so longer inputs need to be truncated (`truncation=True`).

### Tokenizer Parameters

`add_special_tokens` : adds the **[CLS]** and **[SEP]** tokens automatically

`padding` : pads the sequence with **[PAD]** tokens; `'max_length'` pads up to `max_length`

`max_length` : maximum sequence length in tokens

`truncation` : truncates the sequence if it exceeds `max_length`

`return_tensors` : type of tensor to return (`'pt'` for PyTorch)


### Special Tokens

**[CLS]** - Classifier token; marks the start of the sequence

**[SEP]** - Separator token; marks the end of a sequence and separates sentence pairs in tasks such as QA

**[PAD]** - Padding token; fills sequences shorter than the maximum length so that all sequences have the same length


### Outputs

`input_ids` : numeric representation of the tokens, where {101: **[CLS]**, 102: **[SEP]**, 0: **[PAD]** }

`token_type_ids` : segment id of each token (0 for the first sequence, 1 for the second); used for sentence-pair tasks such as question answering

`attention_mask` : 1 for real tokens and 0 for **[PAD]** tokens, so that the model ignores padding
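
The interaction between special tokens, truncation, padding and the attention mask can be sketched in plain Python. This is a simplified model of what the tokenizer does for a single sequence, not the library's implementation; the function name `encode_ids` is hypothetical, and the three ids in the example are those of "thousands of demonstrators" in `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids listed above


def encode_ids(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad with [PAD], build the mask."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


input_ids, attention_mask = encode_ids([5190, 1997, 28337], max_length=8)
print(input_ids)
print(attention_mask)
```

The mask has the same length as `input_ids` and is 0 exactly where a `[PAD]` id was appended.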

In [12]:
tokenized_input = tokenizer(sentence, add_special_tokens=True, padding='max_length', truncation=True, max_length=32, return_tensors='pt')

# Map the ids back to token strings
tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'][0])

# Index of the source word for each token (None for special tokens)
word_ids = tokenized_input.word_ids()

word_ids, tokens, tokenized_input

([None,
  0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  None,
  None,
  None,
  None,
  None,
  None,
  None],
 ['[CLS]',
  'thousands',
  'of',
  'demonstrators',
  'have',
  'marched',
  'through',
  'london',
  'to',
  'protest',
  'the',
  'war',
  'in',
  'iraq',
  'and',
  'demand',
  'the',
  'withdrawal',
  'of',
  'british',
  'troops',
  'from',
  'that',
  'country',
  '.',
  '[SEP]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]'],
 {'input_ids': tensor([[  101,  5190,  1997, 28337,  2031,  9847,  2083,  2414,  2000,  6186,
           1996,  2162,  1999,  5712,  1998,  5157,  1996, 10534,  1997,  2329,
           3629,  2013,  2008,  2406,  1012,   102,     0,     0,     0,     0,
              0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1,

In [13]:
len(tokens), len(word_ids), len(df['Tag'].iloc[0])

(32, 32, 24)
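
The counts differ because the 32 token positions include special and padding tokens (and, in general, several subword tokens per word), while the tag list has one entry per word. A minimal sketch of the alignment idea, assuming a `word_ids` list like the one printed above: label the first token of each word and mark every other position with -100 so the loss skips it. The helper name `align_labels` and the toy `word_ids` are hypothetical; the label ids 11 (`B-geo`) and 16 (`O`) follow the mapping shown earlier.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def align_labels(word_ids, word_labels):
    """Give each word's label to its first token; ignore all other positions."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:  # special token, padding, or later subword
            aligned.append(IGNORE_INDEX)
        else:  # first token of a new word
            aligned.append(word_labels[wid])
        prev = wid
    return aligned


# Toy example: [CLS], three subwords of word 0, word 1, word 2, [SEP], [PAD]
word_ids = [None, 0, 0, 0, 1, 2, None, None]
word_labels = [11, 16, 16]
print(align_labels(word_ids, word_labels))
```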

In [14]:
def tokenize_and_align_labels(text, label_list, label_to_id, tokenizer, max_len=32):
    # The tags are assumed to be aligned to whitespace-separated words, so split the
    # text the same way and tell the tokenizer that the input is already split
    tokenized_input = tokenizer(text.split(), is_split_into_words=True, add_special_tokens=True, truncation=True, max_length=max_len, padding='max_length', return_tensors='pt')
    word_ids = tokenized_input.word_ids(batch_index=0)  # Batch of one sentence

    aligned_labels = []
    prev_word_id = None
    for word_id in word_ids:
        if word_id is None:  # Special tokens ([CLS], [SEP], [PAD])
            aligned_labels.append(-100)
        elif word_id != prev_word_id:  # First subword of a new word
            if word_id < len(label_list):
                aligned_labels.append(label_to_id[label_list[word_id]])
            else:
                aligned_labels.append(-100)
        else:  # Remaining subwords of the same word are ignored by the loss
            aligned_labels.append(-100)
        prev_word_id = word_id

    print(aligned_labels)  # Debug output
    tokenized_input["labels"] = torch.tensor([aligned_labels])

    return tokenized_input

In [15]:
class dataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, idx):
        sentence = self.data['Sentence'].iloc[idx]
        labels = self.data['Tag'].iloc[idx]
        # Each item holds tensors of shape [1, max_len]
        encoding = tokenize_and_align_labels(sentence, labels, label_to_id, self.tokenizer, max_len=self.max_len)
        return encoding

    def __len__(self):
        return len(self.data)


In [16]:
# Split the data into train (80%), test (12%), and validation (8%) sets
train, test = train_test_split(df, test_size=0.2, random_state=2002)
test, val = train_test_split(test, test_size=0.4, random_state=2002)

train.shape, test.shape, val.shape

((38367, 2), (5755, 2), (3837, 2))
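
The shapes above follow from applying the two fractions in sequence: 20% of the rows are held out, and 40% of that hold-out becomes the validation set. A minimal arithmetic sketch, assuming the held-out count is rounded up (which reproduces the numbers printed above); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (remaining, held_out) row counts for a fractional test_size."""
    n_held_out = math.ceil(n_rows * test_size)  # assumption: round the hold-out up
    return n_rows - n_held_out, n_held_out


n_train, n_holdout = split_sizes(47959, 0.2)
n_test, n_val = split_sizes(n_holdout, 0.4)

print(n_train, n_test, n_val)
print([round(n / 47959, 2) for n in (n_train, n_test, n_val)])
```

The overall proportions therefore come out at roughly 80% train, 12% test and 8% validation.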

In [17]:
training = dataset(train, tokenizer, 32)
testing = dataset(test, tokenizer, 32)

training[0]

[-100, 11, -100, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 4, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, -100]


{'input_ids': tensor([[  101, 28352, 15217, 12693,  2874,  6322,  3180,  8647,  1998,  5008,
          4491,  1010,  2788, 11248,  2006, 28352, 10875, 17934, 17773,  1996,
          2430,  2231,  2005,  2062, 12645,  1998,  1037,  3469,  3745,  1997,
          1996,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100,   11, -100, -100,   16,   16,   16,   16,   16,   16,   16,   16,
           16,   16,   16,    4, -100,   16,   16,   16,   16,   16,   16,   16,
           16,   16,   16,   16,   16,   16,   16, -100]])}

In [18]:
# Fetch the example once (each access re-tokenizes it and prints the aligned labels)
sample = training[0]
for token, label in zip(tokenizer.convert_ids_to_tokens(sample['input_ids'][0]), sample['labels'][0]):
    if label != -100:
        print(token, id_to_label[label.item()])

[-100, 11, -100, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 4, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, -100]
bal B-geo
province O
experiences O
regular O
bombing O
and O
shooting O
attacks O
, O
usually O
blamed O
on O
bal B-org
nationalists O
battling O
the O
central O
government O
for O
more O
autonomy O
and O
a O
larger O
share O
of O
the O


In [19]:
params = {
    'TRAIN_BATCH_SIZE': 5,
    'VALID_BATCH_SIZE': 2,
    'EPOCHS': 1,
    'LEARNING_RATE': 1e-5,
    'MAX_LEN': 128,  # Not used below; the datasets above were built with max_len=32
    'MAX_GRAD_NORM': 10
}

train_params = {
    'batch_size': params['TRAIN_BATCH_SIZE'],
    'shuffle': True,
    'num_workers': 0
}

test_params = {
    'batch_size': params['VALID_BATCH_SIZE'],
    'shuffle': False,
    'num_workers': 0
}

train_loader = torch.utils.data.DataLoader(training, **train_params)
test_loader = torch.utils.data.DataLoader(testing, **test_params)
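
Each dataset item holds tensors of shape `[1, 32]`, so the default collation stacks a batch into `[batch, 1, 32]`, and the extra middle axis has to be removed before the batch is passed to the model. The plain-Python sketch below models only the shape rule of squeezing (it is not PyTorch code, and `squeezed_shape` is a hypothetical helper) to show why squeezing every size-1 axis is risky when a batch contains a single example.

```python
def squeezed_shape(shape, dim=None):
    """Shape after removing size-1 axes: all of them, or only axis `dim`."""
    if dim is None:
        return tuple(s for s in shape if s != 1)
    return tuple(s for i, s in enumerate(shape) if not (i == dim and s == 1))


print(squeezed_shape((5, 1, 32)))         # full batch: middle axis removed
print(squeezed_shape((1, 1, 32)))         # batch of one: batch axis is lost too
print(squeezed_shape((1, 1, 32), dim=1))  # squeezing only axis 1 keeps the batch axis
```

With 5755 test rows and a batch size of 2, the final test batch contains a single example, which is the case where the difference matters.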

In [20]:
from transformers import BertForTokenClassification
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(id_to_label), id2label=id_to_label, label2id=label_to_id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: 

In [21]:
# Fetch the example once and reuse it (each access re-tokenizes the sentence)
sample = training[0]
input_ids = sample['input_ids']
attention_mask = sample['attention_mask']
labels = sample['labels']
print("input_ids shape:", input_ids.shape)
print("attention_mask shape:", attention_mask.shape)
print("labels shape:", labels.shape)

# Passing labels makes the model return the token classification loss
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
loss

[-100, 11, -100, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 4, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, -100]
input_ids shape: torch.Size([1, 32])
attention_mask shape: torch.Size([1, 32])
labels shape: torch.Size([1, 32])


tensor(3.0552, grad_fn=<NllLossBackward0>)
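
As a sanity check on this number, a classifier that spreads its probability evenly over the 17 labels would score a cross-entropy of ln(17) ≈ 2.83, so a loss near 3 is about what a freshly initialized classification head should give. The sketch below is a plain-Python illustration of cross-entropy averaged over the positions whose label is not -100; `masked_cross_entropy` is a hypothetical helper, not the loss implementation the model uses.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens, padding and later subwords do not count
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count  # assumes at least one labelled position


num_labels = 17
uniform_logits = [[0.0] * num_labels for _ in range(4)]
example_labels = [-100, 11, 16, -100]
print(round(masked_cross_entropy(uniform_logits, example_labels), 4))
```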

In [22]:
training_logits = outputs.logits
print("training_logits shape:", training_logits.shape)

training_logits shape: torch.Size([1, 32, 17])
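
The logits hold one score per label for each of the 32 token positions. To turn them into predictions, take the highest-scoring label at each position and compare it with the target only where the target is not -100. A minimal plain-Python sketch with three labels instead of 17; `masked_accuracy` and the toy values are hypothetical.

```python
def masked_accuracy(logits, labels, ignore_index=-100):
    """Fraction of labelled positions whose highest-scoring label is correct."""
    correct, active = 0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # position carries no label
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over labels
        correct += int(pred == label)
        active += 1
    return correct / active if active else 0.0


toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]
toy_labels = [-100, 0, 1, 0]
print(masked_accuracy(toy_logits, toy_labels))
```

Dividing by the number of labelled positions rather than by the number of sentences keeps the result between 0 and 1.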


In [23]:
optimizer = torch.optim.AdamW(model.parameters(), lr=params['LEARNING_RATE'])

In [27]:
# Function to train the model for one epoch
def train(epoch):

    # Track the running loss and the number of correct predictions
    total_loss, correct = 0.0, 0

    # Track steps and labelled tokens
    num_steps, num_tokens = 0, 0

    # Store predictions and labels
    pred, label = [], []

    model.train() # Flag the model to train

    # Iterate over the training data from our DataLoader
    for idx, batch in enumerate(train_loader):

        # Each item has shape [1, max_len], so a batch is [batch, 1, max_len];
        # drop only that middle axis so a batch of size 1 keeps its batch dimension
        input_ids = batch['input_ids'].squeeze(1)
        attention_mask = batch['attention_mask'].squeeze(1)
        labels = batch['labels'].squeeze(1)

        # Feed the input_ids, attention_mask, and labels to the model, then get loss and logits
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs.loss, outputs.logits

        # Accumulate the loss in a separate variable so the loss tensor is not overwritten
        total_loss += loss.item()
        num_steps += 1

        # Every 100 steps, print the running mean loss
        if idx % 100 == 0:
            loss_step = total_loss / num_steps
            print(f"Epoch {epoch}, Step {idx}, Loss {loss_step:.4f}")

        # Flatten the labels and logits, then take the most likely label per token
        flat_labels = labels.view(-1)
        flat_logits = logits.view(-1, logits.size(-1))
        flat_preds = torch.argmax(flat_logits, dim=-1)

        # Count correct predictions over labelled tokens only (label != -100)
        active_acc = (flat_labels != -100)
        correct += ((flat_preds == flat_labels) & active_acc).sum().item()
        num_tokens += active_acc.sum().item()

        # Store the predictions and labels
        pred.extend(flat_preds[active_acc].cpu().numpy())
        label.extend(flat_labels[active_acc].cpu().numpy())

        # Zero the gradients, perform a backward pass, clip the gradients, and update the weights
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=params['MAX_GRAD_NORM'])
        optimizer.step()

    epoch_loss = total_loss / num_steps
    epoch_accuracy = correct / num_tokens
    print(f"Epoch {epoch}, Loss {epoch_loss:.4f}, Accuracy {epoch_accuracy:.4f}")




In [28]:
for epoch in range(params['EPOCHS']):
    train(epoch)
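
The training step clips gradients to `MAX_GRAD_NORM`. Norm clipping rescales all gradients by a common factor when their combined L2 norm exceeds the threshold, so it only has an effect if it runs after the backward pass has produced gradients and before the optimizer step. A minimal sketch of the arithmetic, using a flat list of floats in place of the model's parameter gradients; `clip_by_global_norm` is a hypothetical helper, not the PyTorch function.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)

# Gradients already under the threshold are left unchanged
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=10.0)
print(unchanged)
```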



In [None]:
def validate(model, test_loader):
    model.eval() # Flag the model to evaluate

    # Same as before, track loss and accuracy, and store predictions and labels
    total_loss, correct = 0.0, 0
    num_steps, num_tokens = 0, 0
    pred, label = [], []

    with torch.no_grad():
        for idx, batch in enumerate(test_loader):
            # Drop only the extra middle axis so a final batch of size 1 still works
            input_ids = batch['input_ids'].squeeze(1)
            attention_mask = batch['attention_mask'].squeeze(1)
            labels = batch['labels'].squeeze(1)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss, logits = outputs.loss, outputs.logits

            # Accumulate the loss in a separate variable
            total_loss += loss.item()
            num_steps += 1

            if idx % 100 == 0:
                loss_step = total_loss / num_steps
                print(f"Step {idx}, Loss {loss_step:.4f}")

            flat_labels = labels.view(-1)
            flat_logits = logits.view(-1, logits.size(-1))
            flat_preds = torch.argmax(flat_logits, dim=-1)

            # Count correct predictions over labelled tokens only (label != -100)
            active_acc = (flat_labels != -100)
            correct += ((flat_preds == flat_labels) & active_acc).sum().item()
            num_tokens += active_acc.sum().item()

            # Store plain ints so they can be used as keys of id_to_label
            pred.extend(flat_preds[active_acc].tolist())
            label.extend(flat_labels[active_acc].tolist())

    labels = [id_to_label[i] for i in label]
    predictions = [id_to_label[i] for i in pred]

    print(labels)
    print(predictions)

    epoch_loss = total_loss / num_steps
    epoch_accuracy = correct / num_tokens
    print(f"Loss {epoch_loss:.4f}, Accuracy {epoch_accuracy:.4f}")

    return labels, predictions





In [None]:
labels, predictions = validate(model, test_loader)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels, predictions))
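
The report is computed on flat lists of token-level labels, so each labelled token counts as one sample. For reference, the per-label precision, recall and F1 it lists can be sketched in plain Python as follows. `per_label_prf` is a hypothetical helper and the toy label lists are illustrative; they are not results from this model.

```python
from collections import Counter


def per_label_prf(y_true, y_pred):
    """Token-level precision, recall and F1 for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[t] += 1  # true label was missed
    report = {}
    for lab in sorted(set(y_true) | set(y_pred)):
        precision = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        recall = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[lab] = (precision, recall, f1)
    return report


# Illustrative labels only
y_true = ["O", "B-geo", "O", "B-per", "O"]
y_pred = ["O", "B-geo", "B-geo", "O", "O"]
for lab, scores in per_label_prf(y_true, y_pred).items():
    print(lab, [round(s, 2) for s in scores])
```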