<a href="https://colab.research.google.com/github/astromad/MyDeepLearningRepo/blob/master/BuildingCustomNER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Custom Entity Extraction using Transformers**

Here we discuss building a **custom entity extraction** model using the Hugging Face Transformers library.



What is entity extraction? Entity extraction is an area of NLP (Natural Language Processing), and recent advances in deep learning have made it possible to do it far more efficiently. Let's say you have multiple documents and you want to find out how many times your name or birth date is mentioned. Practical use cases include scanning all the documents at your disposal to figure out which ones contain personal information or, if you work in biotech, extracting mentions of drugs or chemical compounds.

In this tutorial, I will show how to extract a name, birth date and phone number from a given input.


*   Let's say the given input is 'my name is madhava avvari born on 1970-01-05 and my phone number is 408-306-1500'
*   Our model should extract:
  *   madhava avvari as **name** entity
  *   1970-01-05 as **birth date** entity
  *   408-306-1500 as **phone number** entity



The goal of this tutorial is to show how you can use the same dataset to build and train a model end to end in both PyTorch and Keras.
Here we will be using BERT (Bidirectional Encoder Representations from Transformers). To learn more about BERT, please refer to [BERT pre-training](https://arxiv.org/abs/1810.04805). Also read the famous Transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)

This article is divided into 3 parts:


1.   **Preparing the dataset**
2.   **Training and validating the model using PyTorch**
3.   **Training and validating the model using Keras/Tensorflow**






## **Preparing the Dataset**
We use an IOB-tagged dataset, and the data should look like this:


```
Sentence_ID word label
0 Alex I-PER
0 is O
0 going O
0 to O
0 Los I-LOC
0 Angeles I-LOC
0 in O
0 California I-LOC
```
We need to tag the entities in the above format and prepare the dataset. Our goal is to train the model in such a way that, when asked to classify arbitrary text, it extracts the entities and assigns them to the correct categories.
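As a minimal sketch of the tagging format described above, the snippet below pairs each word of the example sentence with its label and prints rows in the same `sentence id, word, label` layout. The helper `tag_iob` and its index-to-label input are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def tag_iob(words: List[str], entities: Dict[int, str]) -> List[Tuple[str, str]]:
    """Pair each word with its entity label, or 'O' when it has none.

    `entities` maps a word index to a label such as 'I-PER'. This input
    format is an assumption made for this illustration only.
    """
    return [(word, entities.get(i, "O")) for i, word in enumerate(words)]


words = "Alex is going to Los Angeles in California".split()
rows = tag_iob(words, {0: "I-PER", 4: "I-LOC", 5: "I-LOC", 7: "I-LOC"})

for word, label in rows:
    print(0, word, label)  # sentence id, word, label
```

Every word gets exactly one row, so a multi-word entity such as "Los Angeles" simply appears as two consecutive rows with the same label.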


Preparing the dataset is the hardest job in any machine learning project. When developing a custom NER model, you need to train it on the entities you want it to extract. For this we need a representative dataset, often gigabytes in size, to get accurate predictions. People use crowdsourcing to create datasets, annotate by hand, or generate the data by machine, perhaps even with another machine learning model such as a GAN :)
If you are lucky, there could be a Kaggle dataset waiting for you to play with, but eventually you will need to generate this annotated data yourself.
In this tutorial we will use an NER dataset created by a small Python program using the Faker library.
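The generator script itself is not shown here. The sketch below is a standard-library stand-in (it uses `random` rather than Faker, and the name lists and sentence template are made up) that produces rows with the same `sentence_id`, `words`, `labels` columns.

```python
import csv
import io
import random

# Made-up sample values; a real generator would draw these from Faker.
FIRST_NAMES = ["Benjamin", "Katherine", "Maria", "David"]
LAST_NAMES = ["Green", "Parker", "Lopez", "Nguyen"]


def make_sentence(rng):
    """Return one synthetic sentence as a list of (word, label) pairs."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    date = f"{rng.randint(1950, 2010):04d}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    phone = f"{rng.randint(200, 999)}-{rng.randint(200, 999)}-{rng.randint(0, 9999):04d}"
    words = ["his", "name", "is", first, last, "Born", "on", date,
             "and", "his", "phone", "number", "is", phone]
    labels = ["O", "O", "O", "I-NAME", "I-NAME", "O", "O", "I-DATE",
              "O", "O", "O", "O", "O", "I-PHONE"]
    return list(zip(words, labels))


def make_rows(n_sentences, seed=0):
    """Build one dict per word, tagged with the id of its sentence."""
    rng = random.Random(seed)
    rows = []
    for sentence_id in range(n_sentences):
        for word, label in make_sentence(rng):
            rows.append({"sentence_id": sentence_id, "words": word, "labels": label})
    return rows


rows = make_rows(2)

# Write the rows as CSV text (to an in-memory buffer for this demo).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sentence_id", "words", "labels"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue()[:200])
```

A single fixed template keeps the sketch short; varying the template and the value formats would give the model more diverse contexts to learn from.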


Now it's time to read the dataset and load it as a pandas DataFrame.


In [121]:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/ColabData/data_torch_small.csv",
                encoding="ISO-8859-1",
                on_bad_lines="skip")  # pandas >= 1.3; older versions used error_bad_lines=False

data = df[['sentence_id', 'words', 'labels']]
print(data.head(30))


    sentence_id         words   labels
0           0.0           his        O
1           0.0          name        O
2           0.0            is        O
3           0.0      Benjamin   I-NAME
4           0.0         Green   I-NAME
5           0.0          Born        O
6           0.0            on        O
7           0.0    1984-02-29   I-DATE
8           0.0           and        O
9           0.0           his        O
10          0.0         phone        O
11          0.0        number        O
12          0.0            is        O
13          0.0  907.066.2767  I-PHONE
14          1.0           his        O
15          1.0          name        O
16          1.0            is        O
17          1.0     Katherine   I-NAME
18          1.0        Parker   I-NAME
19          1.0          Born        O
20          1.0            on        O
21          1.0    2005-06-29   I-DATE
22          1.0           and        O
23          1.0           his        O
24          1.0         p

Split the dataset into train and test datasets using the scikit-learn utility.

In [122]:
!pip install future



In [123]:
from sklearn.model_selection import train_test_split
from future.utils import iteritems

train_df, test_df = train_test_split(df, test_size=0.2,shuffle=False)
print ('Train Dataset shape',train_df.shape)
print ('Test Dataset shape',test_df.shape)
labels =tag_list= train_df['labels'].unique()
label_map =  {i: label for i, label in enumerate(labels)}
label2idx = {t: i for i, t in enumerate(labels)}
idx2label = {v: k for k, v in iteritems(label2idx)}
num_labels = len(labels)
print('Labels are:',labels) 

Train Dataset shape (11228, 3)
Test Dataset shape (2808, 3)
Labels are: ['O' 'I-NAME' 'I-DATE' 'I-PHONE']
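Because the split above works on rows without shuffling, the boundary could in principle fall in the middle of a sentence. As a hedged alternative, the sketch below splits on sentence ids instead, so that every sentence lands wholly in one set. The helper `split_by_sentence` and the tuple layout are assumptions for illustration, not part of the notebook.

```python
def split_by_sentence(rows, test_size=0.2):
    """Split rows so that each sentence is entirely in train or in test.

    `rows` is a list of (sentence_id, word, label) tuples in file order.
    The last `test_size` fraction of sentence ids goes to the test set.
    """
    sentence_ids = sorted({sentence_id for sentence_id, _, _ in rows})
    n_test = int(round(len(sentence_ids) * test_size))
    test_ids = set(sentence_ids[len(sentence_ids) - n_test:])
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test


# Five toy sentences of three words each.
rows = [(sid, f"w{i}", "O") for sid in range(5) for i in range(3)]
train_rows, test_rows = split_by_sentence(rows)
print(len(train_rows), len(test_rows))
```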


As you can see, our dataset has DATE, NAME and PHONE entities. We now train our model to recognize these entities.
Let's start by installing the transformers library from Hugging Face.

In [124]:
!pip install transformers



Now process the datasets and create sentences and corresponding labels

Let's review one sentence from our Train dataset


Let's define some variables that we use. Here we set the maximum sequence length to 128 tokens and truncate anything beyond that. We also set the padding token to 0, which means that if your sentence is shorter than 128 tokens, the remainder is padded with 0s.
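The truncate-then-pad behaviour described above can be sketched in plain Python. The helper name `pad_or_truncate` is hypothetical, and a small `max_len` is used so the result is easy to read.

```python
def pad_or_truncate(ids, max_len=128, pad_value=0):
    """Cut `ids` to at most `max_len` items, then right-pad with `pad_value`."""
    ids = list(ids)[:max_len]
    return ids + [pad_value] * (max_len - len(ids))


short = pad_or_truncate([101, 3042, 102], max_len=8)
long = pad_or_truncate(range(200), max_len=8)
print(short)  # [101, 3042, 102, 0, 0, 0, 0, 0]
print(long)   # [0, 1, 2, 3, 4, 5, 6, 7]
```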

In [125]:
from torch.nn import CrossEntropyLoss
max_seq_length =128
BATCH_SIZE=32
pad_token=0
pad_token_segment_id=0
sequence_a_segment_id=0

Now let's define some model parameters: the tokenizer and the model configuration. Here we use the pre-trained BERT uncased model and, through transfer learning, train it further on our own training data.

In [126]:
!rm -rf CustomNER_cache
!rm -rf results_PT
!rm -rf logs_PT
!rm -rf results_TF
!rm -rf logs_TF

In [127]:
from transformers import (
    AutoConfig,
    AutoTokenizer,
)
model_args = dict()
model_args['model_name'] = 'bert-base-uncased' 
model_args['cache_dir'] = "CustomNER_cache/"
model_args['do_basic_tokenize'] = False

config = AutoConfig.from_pretrained(
    model_args['model_name'],
    num_labels=num_labels,
    id2label=label_map,
    label2id={label: i for i, label in enumerate(labels)},
    cache_dir=model_args['cache_dir']
)

tokenizer = AutoTokenizer.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    is_pretokenized=model_args['do_basic_tokenize'],
    do_basic_tokenize = model_args['do_basic_tokenize']
)







Let's define a function to create the input dataset. This function reads each sentence and arranges it into 4 lists:
*   input_ids
*   token_ids (the token type ids)
*   attention_masks
*   label_ids

We use tokenizer.tokenize to split each word into subword tokens and give every subword token the label of its word. tokenizer.encode_plus then converts the tokens to ids and adds the special tokens.


In [128]:
from tqdm import tqdm,trange

def convert_to_input(sentences,tags,pad_token_label_id=-100):
  input_id_list,attention_mask_list,token_type_id_list=[],[],[]
  label_id_list=[]
  label2id={label: i for i, label in enumerate(labels)}
  for x,y in tqdm(zip(sentences,tags),total=len(tags)):
    tokens = []
    label_ids = []
    for word, label in zip(x, y):
      word_tokens = tokenizer.tokenize(word)
      tokens.extend(word_tokens)
      label_ids.extend([label2id[label]] + [label2id[label]] * (len(word_tokens) - 1))
    special_tokens_count =  2
    if len(tokens) > max_seq_length - special_tokens_count:
      tokens = tokens[: (max_seq_length - special_tokens_count)]
      label_ids = label_ids[: (max_seq_length - special_tokens_count)]
    label_ids = [pad_token_label_id]+label_ids+[pad_token_label_id]
    inputs = tokenizer.encode_plus(tokens,add_special_tokens=True, max_length=max_seq_length, padding=True,truncation=True)
    input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
    attention_masks = [1] * len(input_ids)
    attention_mask_list.append(attention_masks)
    input_id_list.append(input_ids)
    token_type_id_list.append(token_type_ids)
    label_id_list.append(label_ids)
  return input_id_list,token_type_id_list,attention_mask_list,label_id_list

Let's validate on some test data if we were able to arrange it in proper format for the model training.


In [129]:
sen=[['phone','408-306-1500','Madhava']]
tok=[['O','I-PHONE','I-NAME']]
input_ids,token_ids,attention_masks,label_ids=convert_to_input(sen,tok)
print('')
print('Converted sentence',tokenizer.convert_ids_to_tokens(input_ids[0]))
print('Input_ids:',input_ids)
print('token_ids',token_ids)
print('attention_masks',attention_masks)
print('label_ids',label_ids)

100%|██████████| 1/1 [00:00<00:00, 400.99it/s]


Converted sentence ['[CLS]', 'phone', '40', '##8', '##-', '##30', '##6', '##-', '##15', '##00', 'mad', '##ha', '##va', '[SEP]']
Input_ids: [[101, 3042, 2871, 2620, 29624, 14142, 2575, 29624, 16068, 8889, 5506, 3270, 3567, 102]]
token_ids [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_masks [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
label_ids [[-100, 0, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, -100]]
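The pattern in the `label_ids` output above (every subword piece inherits its word's label, and the special tokens get the ignore value) can be reproduced without downloading a model. In this sketch `toy_tokenize` is a hypothetical stand-in for WordPiece that just cuts words into 3-character pieces, so the pieces differ from BERT's real output.

```python
def toy_tokenize(word):
    """Hypothetical stand-in for a WordPiece tokenizer: 3-character pieces.

    Assumes a non-empty word. Continuation pieces get a '##' prefix.
    """
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, word_labels, label2id, ignore_id=-100):
    """Repeat each word's label id for all of its subword pieces."""
    tokens, label_ids = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        label_ids.extend([label2id[label]] * len(pieces))
    # [CLS] and [SEP] carry no entity label, so they get the ignore value.
    return ["[CLS]"] + tokens + ["[SEP]"], [ignore_id] + label_ids + [ignore_id]


label2id = {"O": 0, "I-NAME": 1, "I-DATE": 2, "I-PHONE": 3}
tokens, label_ids = align_labels(["phone", "Madhava"], ["O", "I-NAME"], label2id)
print(tokens)     # ['[CLS]', 'pho', '##ne', 'Mad', '##hav', '##a', '[SEP]']
print(label_ids)  # [-100, 0, 0, 1, 1, 1, -100]
```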





In [130]:
# This function is used to decorate dataset to the format Tensorflow needs
def example_to_features(input_ids,attention_masks,token_type_ids,y):
  return {"input_ids": input_ids,
          "attention_mask": attention_masks,
          "token_type_ids": token_type_ids},y

Now let's prepare training and test data in the format the token classification model accepts. Our createDataset function returns training and test datasets for model consumption.

In [131]:
import logging
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import tensorflow as tf


def createDataset(framework='pt'):
  logging.basicConfig(level=logging.ERROR)
  agg_func = lambda s: [ [w,t] for w,t in zip(s["words"].values.tolist(),s["labels"].values.tolist())]
  x_train_grouped = train_df.groupby("sentence_id").apply(agg_func)
  x_test_grouped = test_df.groupby("sentence_id").apply(agg_func)
  x_train_sentences = [[s[0] for s in sent] for sent in x_train_grouped.values]
  x_test_sentences = [[s[0] for s in sent] for sent in x_test_grouped.values]
  x_train_tags = [[t[1] for t in tag] for tag in x_train_grouped.values]
  x_test_tags = [[t[1] for t in tag] for tag in x_test_grouped.values]
  if framework=='pt':
    input_ids_train,token_ids_train,attention_masks_train,label_ids_train=convert_to_input(x_train_sentences,x_train_tags,pad_token_label_id=-100)
    input_ids_test,token_ids_test,attention_masks_test,label_ids_test=convert_to_input(x_test_sentences,x_test_tags,pad_token_label_id=-100)
  else:
    input_ids_train,token_ids_train,attention_masks_train,label_ids_train=convert_to_input(x_train_sentences,x_train_tags,pad_token_label_id=-1)
    input_ids_test,token_ids_test,attention_masks_test,label_ids_test=convert_to_input(x_test_sentences,x_test_tags,pad_token_label_id=-1)
  # pad Train and Test sequence to max_seq_length
  input_ids_train = pad_sequences(input_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  token_ids_train = pad_sequences(token_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  attention_masks_train = pad_sequences(attention_masks_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  label_ids_train = pad_sequences(label_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  print('')
  print('Dimentions of Training data')
  print(np.shape(input_ids_train),np.shape(token_ids_train),np.shape(attention_masks_train),np.shape(label_ids_train))
  input_ids_test = pad_sequences(input_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  token_ids_test = pad_sequences(token_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  attention_masks_test = pad_sequences(attention_masks_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  label_ids_test = pad_sequences(label_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
  print('Dimentions of Test data')
  print(np.shape(input_ids_test),np.shape(token_ids_test),np.shape(attention_masks_test),np.shape(label_ids_test))
  if framework=='pt':
    train_ds = TorchNERDataset(input_ids_train,attention_masks_train,token_ids_train,label_ids_train)
    test_ds= TorchNERDataset(input_ids_test,attention_masks_test,token_ids_test,label_ids_test)
  else:
    train_ds = tf.data.Dataset.from_tensor_slices((input_ids_train,attention_masks_train,token_ids_train,label_ids_train)).map(example_to_features)
    test_ds=tf.data.Dataset.from_tensor_slices((input_ids_test,attention_masks_test,token_ids_test,label_ids_test)).map(example_to_features)
  return train_ds,test_ds
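The first step of `createDataset`, turning one row per word into one list per sentence, can be sketched with the standard library alone. The toy rows below are made up, and `itertools.groupby` stands in for the pandas `groupby` used above. It relies on rows of the same sentence being adjacent.

```python
from itertools import groupby
from operator import itemgetter

# (sentence_id, word, label) rows in file order; toy data for illustration.
rows = [
    (0, "his", "O"), (0, "name", "O"), (0, "is", "O"),
    (0, "Benjamin", "I-NAME"), (0, "Green", "I-NAME"),
    (1, "his", "O"), (1, "name", "O"), (1, "is", "O"),
    (1, "Katherine", "I-NAME"), (1, "Parker", "I-NAME"),
]

sentences, tags = [], []
for _, group in groupby(rows, key=itemgetter(0)):
    group = list(group)
    sentences.append([word for _, word, _ in group])
    tags.append([label for _, _, label in group])

print(sentences[1])
print(tags[1])
```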

## **Training and validating the model using PyTorch**

Now that the data is in the format the token classification model expects, let's prepare to train the model. The data is fed to the model in batches, so it needs to be converted to tensors and wrapped in a dataset that a PyTorch data loader can read. The following class returns each example as a dictionary for the model to read.

In [132]:
import torch
class TorchNERDataset(torch.utils.data.Dataset):
    def __init__(self,ids,mask,tokid, labels):
        self.ids = ids
        self.mask = mask
        self.tokid = tokid
        self.labels = labels

    def __getitem__(self, idx):
        #item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item ={}
        item['input_ids']=torch.tensor(self.ids[idx])
        item['token_type_ids']=torch.tensor(self.tokid[idx])
        item['attention_mask']=torch.tensor(self.mask[idx])
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
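To show what the `__getitem__` and `__len__` methods are for, here is a torch-free sketch of how a loader could draw fixed-size batches from such a dataset. `ToyNERDataset` and `iter_batches` are hypothetical names, and the real PyTorch `DataLoader` does considerably more (shuffling, tensor collation, workers).

```python
class ToyNERDataset:
    """Map-style dataset: indexable, has a length, returns one dict per example."""

    def __init__(self, ids, mask, labels):
        self.ids, self.mask, self.labels = ids, mask, labels

    def __getitem__(self, idx):
        return {
            "input_ids": self.ids[idx],
            "attention_mask": self.mask[idx],
            "labels": self.labels[idx],
        }

    def __len__(self):
        return len(self.labels)


def iter_batches(dataset, batch_size):
    """Yield dicts of lists, one list entry per example in the batch."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        yield {key: [item[key] for item in items] for key in items[0]}


ds = ToyNERDataset(ids=[[101, 102]] * 5, mask=[[1, 1]] * 5, labels=[[-100, -100]] * 5)
batch_sizes = [len(batch["input_ids"]) for batch in iter_batches(ds, batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```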

Now let's convert the data into required format and validate the dataset

In [133]:
train_ds,test_ds = createDataset(framework='pt')
print('One record of Training dataset')
print(train_ds[0])

100%|██████████| 800/800 [00:00<00:00, 1496.07it/s]
100%|██████████| 201/201 [00:00<00:00, 1475.91it/s]



Dimentions of Training data
(800, 128) (800, 128) (800, 128) (800, 128)
Dimentions of Test data
(201, 128) (201, 128) (201, 128) (201, 128)
One record of Training dataset
{'input_ids': tensor([  101,  2010,  2171,  2003,  6425,  2665,  2141,  2006,  3118, 29624,
         2692,  2475, 29624, 24594,  1998,  2010,  3042,  2193,  2003,  3938,
         2581, 29625,  2692, 28756, 29625, 22907,  2575,  2581,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0

As you have seen, the majority of a machine learning task is getting the data ready for the model to train on. Now let's use Hugging Face's **Trainer** module to train the model.


In [134]:
def get_special_tokens(tokenizer, label2idx):

    pad_tok = tokenizer.vocab["[PAD]"]
    sep_tok = tokenizer.vocab["[SEP]"]
    cls_tok = tokenizer.vocab["[CLS]"]
    o_lab = label2idx["O"]

    return pad_tok, sep_tok, cls_tok, o_lab

In [135]:
from typing import Dict, List, Optional, Tuple
from torch import nn
def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
        pad_tok, sep_tok, cls_tok, o_lab = get_special_tokens(tokenizer, label2idx)
        print('pad_tok, sep_tok, cls_tok, o_lab',pad_tok, sep_tok, cls_tok, o_lab)
        preds = np.argmax(predictions, axis=2)
        batch_size, seq_len = preds.shape

        out_label_list = [[] for _ in range(batch_size)]
        preds_list = [[] for _ in range(batch_size)]

        for i in range(batch_size):
            for j in range(seq_len):
                #if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
                if label_ids[i, j] not in [pad_tok, sep_tok, cls_tok,nn.CrossEntropyLoss().ignore_index]:
                    out_label_list[i].append(label_map[label_ids[i][j]])
                    preds_list[i].append(label_map[preds[i][j]])

        return preds_list, out_label_list
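The core of the alignment step, taking the highest-scoring label at each position and dropping positions marked with the ignore index, can be sketched with plain lists. This simplified version skips only the ignore value (`-100`), whereas the function above also skips label ids equal to the pad, `[SEP]` and `[CLS]` token ids. The name `align_predictions_simple` is hypothetical.

```python
def align_predictions_simple(scores, label_ids, id2label, ignore_id=-100):
    """Convert per-token scores and gold ids into lists of label strings."""
    preds_list, labels_list = [], []
    for score_rows, gold_row in zip(scores, label_ids):
        preds, golds = [], []
        for token_scores, gold in zip(score_rows, gold_row):
            if gold == ignore_id:
                continue  # special or padding position: not evaluated
            # Index of the highest score = predicted label id (argmax).
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            preds.append(id2label[best])
            golds.append(id2label[gold])
        preds_list.append(preds)
        labels_list.append(golds)
    return preds_list, labels_list


id2label = {0: "O", 1: "I-NAME"}
scores = [[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]]  # 1 sentence, 4 tokens
gold = [[-100, 0, 1, -100]]
preds_out, golds_out = align_predictions_simple(scores, gold, id2label)
print(preds_out)  # [['O', 'I-NAME']]
print(golds_out)  # [['O', 'I-NAME']]
```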


In [136]:
!pip install seqeval



In [137]:
from transformers import EvalPrediction
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(p: EvalPrediction) -> Dict:
        preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
        return {
            "accuracy_score": accuracy_score(out_label_list, preds_list),
            "precision": precision_score(out_label_list, preds_list),
            "recall": recall_score(out_label_list, preds_list),
            "f1": f1_score(out_label_list, preds_list),
        }
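To give a feel for what entity-level precision, recall and F1 measure, here is a simplified hand-rolled version for a single sentence. It is not seqeval's implementation: it treats any run of identical non-`O` tags as one entity, and an entity counts as correct only if its type and both boundaries match.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) runs of identical non-'O' tags."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last run
        if start is not None and tag != tags[start]:
            spans.add((tags[start], start, i))
            start = None
        if start is None and tag != "O":
            start = i
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores based on exact span matches."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_tags = ["O", "I-NAME", "I-NAME", "O", "I-DATE"]
pred_tags = ["O", "I-NAME", "O", "O", "I-DATE"]  # name span cut short
p, r, f = precision_recall_f1(gold_tags, pred_tags)
print(p, r, f)  # 0.5 0.5 0.5
```

In the example the date is found exactly, but the name is only half covered, so one of two predicted entities is correct and one of two gold entities is found.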


In [138]:
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments
)
model = AutoModelForTokenClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir']
)
training_args = TrainingArguments(
    output_dir='./results_PT',          
    num_train_epochs=3,              
    per_device_train_batch_size=16,  
    per_device_eval_batch_size=64,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs_PT',            
    logging_steps=3,
)

trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_ds,        
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,  
)





In [139]:
# Let's train the model now
trainer.train()






TrainOutput(global_step=150, training_loss=0.37813485144947967)
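The `global_step=150` in the output above is consistent with 800 training sentences, a batch size of 16 and 3 epochs. The sketch below redoes that arithmetic and, assuming a linear warmup and a base learning rate of 5e-5 (both are assumptions about the Trainer defaults, not values stated in this notebook), shows how far the warmup would have progressed with `warmup_steps=500`. It also assumes a single device and no gradient accumulation.

```python
import math


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a full run: batches per epoch times epochs."""
    return math.ceil(n_examples / batch_size) * epochs


def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during a linear warmup phase (ignores any later decay)."""
    return base_lr * min(1.0, step / warmup_steps)


steps = total_steps(n_examples=800, batch_size=16, epochs=3)
print(steps)  # 150

# Assumed base learning rate; see the lead-in.
final_lr = warmup_lr(steps, base_lr=5e-5, warmup_steps=500)
print(final_lr)  # about 1.5e-05
```

Under these assumptions the run ends while still in warmup, at roughly 30% of the base learning rate, so a smaller `warmup_steps` may be worth trying on a dataset of this size.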

In [140]:
# Let's evaluate model now
trainer.evaluate()



pad_tok, sep_tok, cls_tok, o_lab 0 102 101 0


{'epoch': 3.0,
 'eval_accuracy_score': 1.0,
 'eval_f1': 1.0,
 'eval_loss': 0.000774016254581511,
 'eval_precision': 1.0,
 'eval_recall': 1.0}

In [141]:
predictions, label_ids, metrics = trainer.predict(test_ds)
for key, value in metrics.items():
    print("  %s = %s" % (key, value))


pad_tok, sep_tok, cls_tok, o_lab 0 102 101 0
  eval_loss = 0.000774016254581511
  eval_accuracy_score = 1.0
  eval_precision = 1.0
  eval_recall = 1.0
  eval_f1 = 1.0


Now that the model is trained, let's run inference with it and check that it works.

In [142]:
from transformers import pipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to('cpu')
ner = pipeline('ner', model=model, tokenizer=tokenizer,grouped_entities=True)
#ner = pipeline('ner', model=model, tokenizer=tokenizer)

print(ner('my name is madhava avvari born on 1970-01-05 and my phone number is 408-306-1500'))

[{'entity_group': 'I-NAME', 'score': 0.9626313845316569, 'word': 'madhava avvari'}, {'entity_group': 'I-DATE', 'score': 0.998680313428243, 'word': '1970-01-05'}, {'entity_group': 'I-PHONE', 'score': 0.9993234500288963, 'word': '408-306-1500'}, {'entity_group': 'I-PHONE', 'score': 0.9993234500288963, 'word': '408-306-1500'}]
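The output above lists the `I-PHONE` entity twice. A small post-processing sketch can drop such repeats while keeping the original order. The helper `dedupe_entities` is hypothetical, and because it keys on label and text only, it would also merge two genuine mentions of the same phone number.

```python
def dedupe_entities(entities):
    """Keep the first occurrence of each (entity_group, word) pair."""
    seen, unique = set(), []
    for entity in entities:
        key = (entity["entity_group"], entity["word"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


# Entities copied from the pipeline output above (scores rounded).
raw = [
    {"entity_group": "I-NAME", "score": 0.9626, "word": "madhava avvari"},
    {"entity_group": "I-DATE", "score": 0.9987, "word": "1970-01-05"},
    {"entity_group": "I-PHONE", "score": 0.9993, "word": "408-306-1500"},
    {"entity_group": "I-PHONE", "score": 0.9993, "word": "408-306-1500"},
]
cleaned = dedupe_entities(raw)
print([entity["entity_group"] for entity in cleaned])
```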


## **Training and validating the model using Keras/Tensorflow**

Now let's train the model using Keras/TensorFlow. TF needs the training data in a slightly different format, so let's prepare the data for model training.

In [143]:
train_ds,test_ds = createDataset(framework='tf')


100%|██████████| 800/800 [00:00<00:00, 1456.08it/s]
100%|██████████| 201/201 [00:00<00:00, 1453.73it/s]



Dimentions of Training data
(800, 128) (800, 128) (800, 128) (800, 128)
Dimentions of Test data
(201, 128) (201, 128) (201, 128) (201, 128)


Let's validate one record of Training data

In [144]:
for x,y in train_ds.take(1):
  print(x)
  print(y)

{'input_ids': <tf.Tensor: shape=(128,), dtype=int64, numpy=
array([  101,  2010,  2171,  2003,  6425,  2665,  2141,  2006,  3118,
       29624,  2692,  2475, 29624, 24594,  1998,  2010,  3042,  2193,
        2003,  3938,  2581, 29625,  2692, 28756, 29625, 22907,  2575,
        2581,   102,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,   

In [145]:
!rm -rf CustomNER_cache
!rm -rf results_PT
!rm -rf logs_PT
!rm -rf results_TF
!rm -rf logs_TF

In [146]:
from transformers import (
    TFAutoModelForTokenClassification,
    TFTrainer,
    TFTrainingArguments
)

training_args = TFTrainingArguments(
    output_dir='./results_TF',          
    num_train_epochs=3,              
    per_device_train_batch_size=1,  
    per_device_eval_batch_size=1,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs_TF',            
    logging_steps=3,
)
with training_args.strategy.scope():
  model = TFAutoModelForTokenClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir']
  )
trainer = TFTrainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_ds,        
    eval_dataset=test_ds,  
)





In [147]:
# Let's train the model now
trainer.train()

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [148]:
# Let's evaluate model now
trainer.evaluate()

{'eval_loss': 4.9191905e-05}

In [149]:
from transformers import pipeline
ner = pipeline('ner', model=model, tokenizer=tokenizer,grouped_entities=True)

print(ner('my name is madhava avvari born on 1970-01-05 and my phone number is 408-306-1500'))

[{'entity_group': 'I-NAME', 'score': 0.996915360291799, 'word': 'madhava avvari'}, {'entity_group': 'I-DATE', 'score': 0.9997925659020742, 'word': '1970-01-05'}, {'entity_group': 'I-PHONE', 'score': 0.999878041446209, 'word': '408-306-1500'}, {'entity_group': 'I-PHONE', 'score': 0.999878041446209, 'word': '408-306-1500'}]
