## Named Entity Recognition (NER) with Transformers 

### Introduction

In this notebook, we fine-tune a BERT Transformer model for Named Entity Recognition (NER). NER is a token classification task that identifies and extracts entities, such as people, places, and organizations, from text documents.

### Objectives

- Understand the tokenization process
- Walk through an NER pipeline


### Dataset

Kaggle dataset for NER, with the corresponding entity tags for each sentence: https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus/data

#### Attributes
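- Sentence # - Sentence index, e.g. `Sentence: 1` (String)
- Sentence - Text data, with words separated by spaces (String)
- POS - Part-of-speech tags, one per word (String representation of a list)
- Tag - Entity tags, one per word (String representation of a list)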


In [168]:
# Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast, AutoModelForTokenClassification, DataCollatorForTokenClassification
import ast
import torch

In [169]:
# Import Data
df = pd.read_csv('./Data/ner.csv')

print(df.columns, df.dtypes)

Index(['Sentence #', 'Sentence', 'POS', 'Tag'], dtype='object') Sentence #    object
Sentence      object
POS           object
Tag           object
dtype: object


In [170]:
df.isna().sum()

Sentence #    0
Sentence      0
POS           0
Tag           0
dtype: int64

In [171]:
df.head()

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


In [172]:
df.describe()

Unnamed: 0,Sentence #,Sentence,POS,Tag
count,47959,47959,47959,47959
unique,47959,47575,47214,33318
top,Sentence: 47959,VOA 's Mil Arcega reports .,"['NNP', 'POS', 'NNP', 'NNP', 'VBZ', '.']","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
freq,1,17,39,450


In [173]:
# Convert the string representation of each tag list into a Python list
df['Tag'] = df['Tag'].apply(ast.literal_eval)

In [174]:
# Flatten the tag lists with explode (one tag per row) and collect the unique labels
labels = set(df['Tag'].explode().unique())

labels

{'B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O'}

## Entity Tags

### Prefixes (Chunks)

`B` - prefix indicating that the token begins a named entity. <br>
`I` - prefix indicating that the token is inside a named entity, continuing the one opened by the preceding `B` tag. <br>
`O` - indicates that the token is not part of any named entity (this tag has no suffix). <br>
<br>

### Suffixes
`art` Artifacts, e.g., books, songs, etc.<br>
`eve` Events, e.g., battles, elections, holidays, etc.<br>
`geo` Geographical entities, e.g., cities, rivers, countries, etc.<br>
`gpe` Geopolitical entities, e.g., cities, states, countries.<br>
`nat` Natural phenomena, e.g., hurricanes, earthquakes.<br>
`org` Organizations, e.g., companies, government organizations, etc.<br>
`per` Persons.<br>
`tim` Time indicators, e.g., dates, days, months, etc.
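
The tagging scheme above can be illustrated with a small decoder that groups `B`/`I` tags back into entity spans. This is a minimal sketch: the function name `extract_entities` and the example sentence are made up for illustration and are not part of the dataset or of any library.

```python
def extract_entities(words, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, suffix = tag.partition("-")
        if prefix == "B":
            # A new entity starts; close the one that was open, if any
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = suffix, [word]
        elif prefix == "I" and current_type == suffix:
            # Continue the entity that is currently open
            current_words.append(word)
        else:
            # "O", or an "I" tag that does not continue the open entity
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hypothetical example sentence, already split into words
words = ["They", "marched", "to", "Hyde", "Park", "in", "London", "."]
tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-geo", "O"]
print(extract_entities(words, tags))
# [('geo', 'Hyde Park'), ('geo', 'London')]
```

In this sketch an `I` tag that does not follow a matching `B` tag is simply dropped; other decoders choose to treat it as the start of a new entity.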

In [175]:
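# Map each label to an integer id and back.
# Note: `labels` is a set, so the id assigned to each label can differ between Python sessions.
label_to_id = {label: i for i, label in enumerate(labels)}
id_to_label = {i: label for label, i in label_to_id.items()}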

label_to_id

{'I-tim': 0,
 'O': 1,
 'B-art': 2,
 'B-eve': 3,
 'I-gpe': 4,
 'B-org': 5,
 'I-per': 6,
 'B-tim': 7,
 'I-geo': 8,
 'I-nat': 9,
 'B-gpe': 10,
 'I-eve': 11,
 'B-nat': 12,
 'B-geo': 13,
 'I-org': 14,
 'I-art': 15,
 'B-per': 16}
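
Because the mapping above is built by enumerating a set, the ids may change from one Python session to the next, which matters when a saved model is reloaded later. A possible alternative, sketched here on a small illustrative subset of the labels with hypothetical variable names, is to sort the labels first:

```python
# Small illustrative subset of the label set (not the full 17 labels)
demo_labels = {"O", "B-geo", "I-geo", "B-per", "I-per"}

# Sorting fixes the iteration order, so the ids are identical on every run
stable_label_to_id = {label: i for i, label in enumerate(sorted(demo_labels))}
stable_id_to_label = {i: label for label, i in stable_label_to_id.items()}

print(stable_label_to_id)
# {'B-geo': 0, 'B-per': 1, 'I-geo': 2, 'I-per': 3, 'O': 4}
```

The rest of this notebook keeps the unsorted mapping shown above, so the ids in the outputs below (for example `1` for `O` and `13` for `B-geo`) refer to that mapping.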

In [176]:
# Isolate the sentence and tag columns
df = df[['Sentence', 'Tag']]
df.head()

Unnamed: 0,Sentence,Tag
0,Thousands of demonstrators have marched throug...,"[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo..."
1,Families of soldiers killed in the conflict jo...,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,They marched from the Houses of Parliament to ...,"[O, O, O, O, O, O, O, O, O, O, O, B-geo, I-geo..."
3,"Police put the number of marchers at 10,000 wh...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,The protest comes on the eve of the annual con...,"[O, O, O, O, O, O, O, O, O, O, O, B-geo, O, O,..."


In [177]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

sentence = df['Sentence'].iloc[0]

sentence

'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

## Tokenizer 

### Input

The tokenizer takes a string (or a list of strings) and splits it into WordPiece tokens. BERT accepts at most 512 tokens per sequence, so longer sequences have to be truncated, which the tokenizer does when `truncation=True` is set.

### Tokenizer Parameters

`add_special_tokens` : automatically adds the **[CLS]** and **[SEP]** tokens

`padding` : with `'max_length'`, appends **[PAD]** tokens until the sequence reaches `max_length`

`max_length` : maximum sequence length in tokens, special tokens included

`truncation` : truncates the sequence if it exceeds `max_length`

`return_tensors` : type of the returned tensors, e.g. `'pt'` for PyTorch


### Special Tokens

**[CLS]** - Classification token, placed at the start of every sequence

**[SEP]** - Separator token, marks the end of a sequence and separates the two segments in sentence-pair tasks such as QA

**[PAD]** - Padding token, added after **[SEP]** so that all sequences have the same length


### Outputs

`input_ids` : numeric representation of the tokens, where {101: **[CLS]**, 102: **[SEP]**, 0: **[PAD]** }

`token_type_ids` : segment id of each token (0 for the first sequence, 1 for the second), used in sentence-pair tasks such as question answering; all 0 for a single sentence

`attention_mask` : 1 for real tokens, 0 for **[PAD]** tokens, so that the model ignores the padding
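
To make the interplay of `max_length`, `padding`, and `truncation` concrete, here is a hand-rolled sketch of how `input_ids` and `attention_mask` relate to each other. It is an illustration only, not how the Hugging Face tokenizer is implemented; the helper `encode_ids` is hypothetical, and it starts from token ids rather than text.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids listed above for bert-base-uncased


def encode_ids(token_ids, max_length):
    """Add special tokens, truncate, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # real tokens
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad  # padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Shorter than max_length: padded with zeros
short = encode_ids([5190, 1997, 28337], max_length=8)
print(short)
# {'input_ids': [101, 5190, 1997, 28337, 102, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}

# Longer than max_length: truncated, no padding needed
long = encode_ids(list(range(1000, 1010)), max_length=8)
print(long)
# {'input_ids': [101, 1000, 1001, 1002, 1003, 1004, 1005, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```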

In [178]:
# Tokenize one sentence, padding or truncating it to 32 tokens
tokenized_input = tokenizer(sentence, add_special_tokens=True, padding='max_length', truncation=True, max_length=32, return_tensors='pt')

# Recover the token strings and, for each token, the index of the word it came from
tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'][0])
word_ids = tokenized_input.word_ids()

word_ids, tokens, tokenized_input

([None,
  0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  None,
  None,
  None,
  None,
  None,
  None,
  None],
 ['[CLS]',
  'thousands',
  'of',
  'demonstrators',
  'have',
  'marched',
  'through',
  'london',
  'to',
  'protest',
  'the',
  'war',
  'in',
  'iraq',
  'and',
  'demand',
  'the',
  'withdrawal',
  'of',
  'british',
  'troops',
  'from',
  'that',
  'country',
  '.',
  '[SEP]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]',
  '[PAD]'],
 {'input_ids': tensor([[  101,  5190,  1997, 28337,  2031,  9847,  2083,  2414,  2000,  6186,
           1996,  2162,  1999,  5712,  1998,  5157,  1996, 10534,  1997,  2329,
           3629,  2013,  2008,  2406,  1012,   102,     0,     0,     0,     0,
              0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1,

In [179]:
len(tokens), len(word_ids), len(df['Tag'].iloc[0])

(32, 32, 24)
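
The counts above differ because the 32 token positions include **[CLS]**, **[SEP]**, and six **[PAD]** tokens, while the dataset provides only one tag per word (24 here). Words that are split into several subword tokens widen the gap further. The sketch below shows the alignment idea on a hand-written `word_ids` list; the list, the label ids, and the helper name `align_labels` are hypothetical and chosen only for illustration.

```python
IGNORE = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Give each word's label to its first token and IGNORE to everything else."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            # Special/padding token, or a later subword of the same word
            aligned.append(IGNORE)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned


# Hypothetical case: word 0 is split into three subword tokens
word_ids = [None, 0, 0, 0, 1, 2, None, None]
word_labels = [10, 1, 1]  # one label id per word
print(align_labels(word_ids, word_labels))
# [-100, 10, -100, -100, 1, 1, -100, -100]
```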

In [180]:
# Split the data into train (80%), test (12%), and validation (8%) sets:
# 20% is held out first, then 40% of that holdout becomes the validation set
train, test = train_test_split(df, test_size=0.2, random_state=2002)
test, val = train_test_split(test, test_size=0.4, random_state=2002)

train.shape, test.shape, val.shape

((38367, 2), (5755, 2), (3837, 2))
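
The shapes above follow from applying the two split ratios in sequence. The arithmetic can be checked without scikit-learn, assuming the held-out portion is rounded up, which is consistent with the shapes reported above:

```python
import math

n_total = 47959                       # row count reported by df.describe()
n_holdout = math.ceil(n_total * 0.2)  # first split: test_size=0.2
n_train = n_total - n_holdout
n_val = math.ceil(n_holdout * 0.4)    # second split: test_size=0.4 becomes `val`
n_test = n_holdout - n_val

for name, n in [("train", n_train), ("test", n_test), ("val", n_val)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
# train: 38367 rows (80.0%)
# test: 5755 rows (12.0%)
# val: 3837 rows (8.0%)
```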

In [187]:
def tokenize_and_align_labels(text, label_list, label_to_id):
    # The corpus is already split into words by whitespace, with one tag per word.
    # Passing the words with is_split_into_words=True keeps word_ids aligned with label_list;
    # tokenizing the raw string lets BERT re-split words such as "10,000", which shifts the tags.
    tokenized_input = tokenizer(text.split(), is_split_into_words=True, add_special_tokens=True, truncation=True, max_length=32, padding='max_length', return_tensors='pt')
    word_ids = tokenized_input.word_ids(batch_index=0)  # One sentence per call

    aligned_labels = []
    prev_word_id = None
    for word_id in word_ids:
        if word_id is None:  # [CLS], [SEP], and [PAD]: ignored by the loss
            aligned_labels.append(-100)
        elif word_id != prev_word_id and word_id < len(label_list):  # First subword of a word
            aligned_labels.append(label_to_id[label_list[word_id]])
        else:  # Later subwords of the same word (or a word without a tag): ignored
            aligned_labels.append(-100)
        prev_word_id = word_id

    print(aligned_labels)  # Debug output
    tokenized_input["labels"] = torch.tensor([aligned_labels])
    return tokenized_input

In [185]:
tokenized_ids = tokenize_and_align_labels(train['Sentence'].iloc[0], train['Tag'].iloc[0], label_to_id=label_to_id)

tokenized_ids

[-100, 13, -100, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -100]


{'input_ids': tensor([[  101, 28352, 15217, 12693,  2874,  6322,  3180,  8647,  1998,  5008,
          4491,  1010,  2788, 11248,  2006, 28352, 10875, 17934, 17773,  1996,
          2430,  2231,  2005,  2062, 12645,  1998,  1037,  3469,  3745,  1997,
          1996,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100,   13, -100, -100,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    5, -100,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1, -100]])}
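
The `-100` entries in `labels` matter because PyTorch's cross-entropy loss skips positions with that value by default (`ignore_index=-100`), so special tokens, padding, and non-initial subwords do not contribute to training. The following pure-Python sketch reproduces that behaviour on toy numbers; `masked_cross_entropy` is a hypothetical helper written for illustration, not a library function.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position does not contribute to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[label])  # -log softmax(row)[label]
    return sum(losses) / len(losses)


# Three token positions, two classes; the first position is ignored
toy_logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
toy_labels = [-100, 0, 1]
print(masked_cross_entropy(toy_logits, toy_labels))

# Changing the logits at the ignored position leaves the loss unchanged
other_logits = [[-5.0, 9.0], [0.0, 0.0], [0.0, 3.0]]
print(masked_cross_entropy(other_logits, toy_labels))
```

The sketch assumes at least one position is not ignored; otherwise the mean is undefined.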

In [188]:
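model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(labels), id2label=id_to_label, label2id=label_to_id)

# Not used below: the model computes the cross-entropy loss itself when "labels" is passed,
# and the collator is only needed when batching with a DataLoader.
data_collator = DataCollatorForTokenClassification(tokenizer)
loss_fn = torch.nn.CrossEntropyLoss()

# A small learning rate is the usual choice for fine-tuning BERT;
# lr=0.001 tends to be large enough to wreck the pretrained weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Train the model one sentence at a time (batch size 1)
model.train()  # from_pretrained returns the model in eval mode, with dropout disabled
for epoch in range(3):
    for i in range(len(train)):
        optimizer.zero_grad()
        tokenized_input = tokenize_and_align_labels(train['Sentence'].iloc[i], train['Tag'].iloc[i], label_to_id)
        outputs = model(**tokenized_input)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        if i % 100 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

# Evaluate on the validation set without tracking gradients
model.eval()
with torch.no_grad():
    for i in range(len(val)):
        tokenized_input = tokenize_and_align_labels(val['Sentence'].iloc[i], val['Tag'].iloc[i], label_to_id)
        outputs = model(**tokenized_input)
        loss = outputs.loss
        if i % 100 == 0:
            print(f'Loss:  {loss.item()}')

# Save the fine-tuned model and load it back
model.save_pretrained('./models/ner_model')
model = AutoModelForTokenClassification.from_pretrained('./models/ner_model')

# Predict a label id for every token position of one validation sentence
tokenized_input = tokenize_and_align_labels(val['Sentence'].iloc[0], val['Tag'].iloc[0], label_to_id)
with torch.no_grad():
    outputs = model(**tokenized_input)
predicted_labels = outputs.logits.argmax(-1)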



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: 

[-100, 13, -100, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -100]
Epoch: 0, Loss:  2.954951763153076
[-100, 13, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, -100, -100]
[-100, 1, 1, 1, 1, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 13, 8, 1, 1, 1, 1, 1, 1, 1, 10, 1, 16, 6, 1, 1, 1, -100]
[-100, 5, 14, 14, 16, -100, -100, 6, 1, 1, 1, 1, 1, 1, 5, 14, 14, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 13, -100, -100]
[-100, 16, 13, 1, -100, -100, -100, -100, 1, 1, 1, 1, 1, 1, 1, 16, 6, 6, -100, 1, -100, -100, 1, 1, 7, 1, -100, -100, -100, -100, -100, -100]
[-100, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 16, 6, 1, 1, 1, 1, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -100, -100, -100]
[-100, 16, 6, -100, 6, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -100, 1, 1, 1, 1, 1, 1, -100, -100, -100, -100, -100, -100, -100, -100]
[-100, 1, 1, 13, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 7, 1, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100

KeyboardInterrupt: 
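
Once a training run completes, the predicted ids still have to be mapped back to tag names and compared with the gold labels, skipping the positions marked `-100`. The sketch below shows one way to do this on plain Python lists; the helper names and the toy label mapping are hypothetical and stand in for the notebook's `id_to_label`.

```python
def decode_predictions(pred_ids, label_ids, id_to_label, ignore_index=-100):
    """Return (predicted_tag, gold_tag) pairs for positions that carry a real label."""
    return [
        (id_to_label[p], id_to_label[g])
        for p, g in zip(pred_ids, label_ids)
        if g != ignore_index
    ]


def token_accuracy(pairs):
    """Fraction of labelled positions where the predicted tag equals the gold tag."""
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


# Toy mapping and toy ids, for illustration only
toy_id_to_label = {0: "O", 1: "B-geo", 2: "I-geo"}
toy_pred = [0, 1, 0, 0, 0]        # predictions exist for every position
toy_gold = [-100, 1, 2, 0, -100]  # first and last positions are ignored

pairs = decode_predictions(toy_pred, toy_gold, toy_id_to_label)
print(pairs)
# [('B-geo', 'B-geo'), ('O', 'I-geo'), ('O', 'O')]
print(token_accuracy(pairs))
```

With the notebook's tensors, the inputs would be `predicted_labels[0].tolist()` and `tokenized_input["labels"][0].tolist()`. Token-level accuracy is only a rough indicator for NER because most tokens are tagged `O`; entity-level precision and recall are more informative.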