<a href="https://colab.research.google.com/github/astromad/MyDeepLearningRepo/blob/master/BuildingCustomNER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Custom Entity Extraction using Transformers**

This notebook shows how to build **custom entity extraction** with Hugging Face transformers. The goal is to show how you can take the same dataset and build and train a model end to end in both PyTorch and Keras.
We will use BERT (Bidirectional Encoder Representations from Transformers). To learn more about BERT, see https://arxiv.org/abs/1810.04805


This article is divided into 3 parts:


1.   **Preparing the dataset**
2.   **Training and validating the model using PyTorch**
3.   **Training and validating the model using Keras/Tensorflow**






## **Preparing the Dataset**
We use an IOB-tagged dataset, and the data should look like this:


```
Sentence_ID word label
0 Alex I-PER
0 is O
0 going O
0 to O
0 Los I-LOC
0 Angeles I-LOC
0 in O
0 California I-LOC
```
We need to tag the entities in the above format to prepare the dataset. The goal is to train the model so that, given arbitrary text, it extracts the entities and assigns them to the correct categories.
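
As a rough illustration of the format above, the following sketch turns a tokenized sentence plus a mapping of word positions to entity types into `(sentence_id, word, label)` rows. The helper name `to_iob_rows` and its arguments are hypothetical, and it follows the same simplified `I-`-only tagging as the sample.

```python
def to_iob_rows(sentence_id, words, entity_by_index):
    """Return (sentence_id, word, label) rows; untagged words get 'O'.

    entity_by_index maps a word position to an entity type such as 'PER'.
    """
    rows = []
    for i, word in enumerate(words):
        entity = entity_by_index.get(i)
        label = f"I-{entity}" if entity else "O"
        rows.append((sentence_id, word, label))
    return rows


words = "Alex is going to Los Angeles in California".split()
rows = to_iob_rows(0, words, {0: "PER", 4: "LOC", 5: "LOC", 7: "LOC"})
for row in rows:
    print(*row)
```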


Preparing the dataset is often the hardest part of a machine learning project. When developing custom NER, you need to train your model on the entities you want it to extract. For this we need a representative dataset, ideally gigabytes in size, to get accurate predictions. People crowdsource the annotations, annotate by hand, or generate the data by machine, perhaps even with another machine learning model such as a GAN :)
If you are lucky, there could be a Kaggle dataset waiting for you to play with, but eventually you will need to generate this annotated data yourself.
In this tutorial we will use an NER dataset created by a small Python program that uses the Faker library.
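
The generator script itself is not shown here. As a hedged sketch of the idea, the snippet below builds labeled rows from a sentence template, using only the standard library `random` module as a stand-in for Faker. The name lists, the template, and the value formats are invented for illustration.

```python
import random

FIRST_NAMES = ["Keith", "Lauren", "Cassandra"]   # illustrative values only
LAST_NAMES = ["Melton", "Alexander", "Howard"]


def fake_ssn(rng):
    """Random digits in an SSN-like NNN-NN-NNNN layout (not a real SSN)."""
    return f"{rng.randint(100, 899)}-{rng.randint(10, 99)}-{rng.randint(1000, 9999)}"


def make_sentence(sentence_id, rng):
    """Return (sentence_id, word, label) rows for one templated sentence."""
    tagged = [
        ("his", "O"), ("name", "O"), ("is", "O"),
        (rng.choice(FIRST_NAMES), "I-NAME"),
        (rng.choice(LAST_NAMES), "I-NAME"),
        ("His", "O"), ("SSN", "O"), ("is", "O"),
        (fake_ssn(rng), "I-SSN"),
    ]
    return [(sentence_id, word, label) for word, label in tagged]


rng = random.Random(0)  # seeded so the sketch is repeatable
rows = [row for sid in range(3) for row in make_sentence(sid, rng)]
for row in rows[:9]:
    print(*row)
```

Varying the templates (for example "SSN" versus "S S N") is what gives a model trained on such data some robustness to phrasing.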


Now it's time to read the dataset and load it as a Pandas DataFrame.


In [1]:
import pandas as pd
# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) is its replacement
df = pd.read_csv("/content/drive/My Drive/ColabData/data_torch.csv",
                encoding="ISO-8859-1", on_bad_lines="skip")

data = df[['sentence_id', 'words', 'labels']]
print(data.head(30))


    sentence_id             words   labels
0             0               his        O
1             0              name        O
2             0                is        O
3             0             Keith   I-NAME
4             0            Melton   I-NAME
5             0              Born        O
6             0                on        O
7             0        1981-12-17   I-DATE
8             0                 ,        O
9             0               His        O
10            0               SSN        O
11            0                is        O
12            0       653-31-7274    I-SSN
13            0               His        O
14            0              card        O
15            0            number        O
16            0                is        O
17            0  4586172786598806  I-CCARD
18            0                 ,        O
19            0               and        O
20            0               his        O
21            0             phone        O
22         

Split the dataset into train and test sets using the scikit-learn utility.

In [2]:
!pip install future



In [3]:
from sklearn.model_selection import train_test_split
from future.utils import iteritems

train_df, test_df = train_test_split(df, test_size=0.2,shuffle=False)
print ('Train Dataset shape',train_df.shape)
print ('Test Dataset shape',test_df.shape)
labels =tag_list= train_df['labels'].unique()
label_map =  {i: label for i, label in enumerate(labels)}
label2idx = {t: i for i, t in enumerate(labels)}
idx2label = {v: k for k, v in iteritems(label2idx)}
num_labels = len(labels)
print('Labels are:',labels) 

Train Dataset shape (715211, 3)
Test Dataset shape (178803, 3)
Labels are: ['O' 'I-NAME' 'I-DATE' 'I-SSN' 'I-CCARD' 'I-PHONE']
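
Because `train_test_split` is applied to token rows with `shuffle=False`, the sentence that straddles the 80% mark can end up partly in each split. One possible alternative, sketched here in plain Python on `(sentence_id, word, label)` tuples, is to split on sentence IDs instead. The function name `split_by_sentence` is hypothetical and not part of this notebook.

```python
def split_by_sentence(rows, test_fraction=0.2):
    """Split rows so that every sentence lands entirely in one split.

    rows: list of (sentence_id, word, label) tuples in their original order.
    The last `test_fraction` of the sentence IDs goes to the test split.
    """
    ordered_ids = list(dict.fromkeys(sid for sid, _, _ in rows))
    n_test = int(round(len(ordered_ids) * test_fraction))
    test_ids = set(ordered_ids[len(ordered_ids) - n_test:])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test


# Five toy sentences of three words each
rows = [(sid, f"w{i}", "O") for sid in range(5) for i in range(3)]
train_rows, test_rows = split_by_sentence(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```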


As you can see, our dataset has DATE, NAME, SSN, PHONE, and credit card (CCARD) entities. We will now train our model to recognize these entities. Let's start by installing the transformers library from Hugging Face.

In [4]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB

Now process the datasets to create sentences and their corresponding labels.

In [5]:
agg_func = lambda s: [ [w,t] for w,t in zip(s["words"].values.tolist(),s["labels"].values.tolist())]
x_train_grouped = train_df.groupby("sentence_id").apply(agg_func)
x_test_grouped = test_df.groupby("sentence_id").apply(agg_func)
x_train_sentences = [[s[0] for s in sent] for sent in x_train_grouped.values]
x_test_sentences = [[s[0] for s in sent] for sent in x_test_grouped.values]
x_train_tags = [[t[1] for t in tag] for tag in x_train_grouped.values]
x_test_tags = [[t[1] for t in tag] for tag in x_test_grouped.values]

Let's review one sentence from our training dataset.


In [6]:
print('one sentense from training set',x_train_sentences[0])
print('corresponding Labels',x_train_tags[0])

one sentense from training set ['his', 'name', 'is', 'Keith', 'Melton', 'Born', 'on', '1981-12-17', ',', 'His', 'SSN', 'is', '653-31-7274', 'His', 'card', 'number', 'is', '4586172786598806', ',', 'and', 'his', 'phone', 'number', 'is', '(071)650-8889', 'his', 'name', 'is', 'Lauren', 'Alexander', 'Born', 'on', '1982-07-23', ',', 'His', 'SSN', 'is', '532-05-9538', 'His', 'credit', 'card', 'is', '213111415699782', ',', 'and', 'his', 'phone', 'number', 'is', '+1-947-972-2133x6430', 'his', 'name', 'is', 'Cassandra', 'Howard', 'Born', 'on', '1997-12-04', ',', 'His', 'S', 'S', 'N', 'is', '493 22 1163', 'His', 'CCARD', 'is', '4973719373945823612', ',', 'and', 'his', 'phone', 'number', 'is', '0632655726']
corresponding Labels ['O', 'O', 'O', 'I-NAME', 'I-NAME', 'O', 'O', 'I-DATE', 'O', 'O', 'O', 'O', 'I-SSN', 'O', 'O', 'O', 'O', 'I-CCARD', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PHONE', 'O', 'O', 'O', 'I-NAME', 'I-NAME', 'O', 'O', 'I-DATE', 'O', 'O', 'O', 'O', 'I-SSN', 'O', 'O', 'O', 'O', 'I-CCARD', 'O

Let's define some variables that we will use. Here we set the maximum sequence length to 128 tokens and truncate anything beyond that. We also set the padding label ID to a value that the loss function ignores, which is -100.

In [7]:
from torch.nn import CrossEntropyLoss
max_seq_length =128
pad_token_label_id = CrossEntropyLoss().ignore_index # value -100
BATCH_SIZE=32
pad_token=0
pad_token_segment_id=0
sequence_a_segment_id=0
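
To see why `-100` works as a label padding value, here is a small pure-Python sketch of a cross-entropy average that skips positions whose label equals the ignore index. It mirrors the idea behind `CrossEntropyLoss(ignore_index=-100)` but is not PyTorch's implementation, and `masked_cross_entropy` is a hypothetical name.

```python
import math

IGNORE_INDEX = -100  # same value as CrossEntropyLoss().ignore_index


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    Returns 0.0 when every position is ignored (a choice made for this sketch).
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding or special token: contributes nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses) if losses else 0.0


logits = [[2.0, 0.5], [0.1, 3.0], [5.0, -5.0]]
with_padding = masked_cross_entropy(logits, [0, 1, IGNORE_INDEX])
without_padding = masked_cross_entropy(logits[:2], [0, 1])
print(with_padding, without_padding)
```

Both calls give the same value, since the third position is labeled with the ignore index and is left out of the average.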

Now let's define some model parameters: the tokenizer and the model details. Here we use the pre-trained BERT uncased model and, through transfer learning, fine-tune it on our own training data.

In [8]:
!rm -rf CustomNER_cache
!rm -rf results_PT
!rm -rf logs_PT

In [9]:
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
model_args = dict()
model_args['model_name'] = 'bert-base-uncased' 
model_args['cache_dir'] = "CustomNER_cache/"
model_args['do_basic_tokenize'] = False

config = AutoConfig.from_pretrained(
    model_args['model_name'],
    num_labels=num_labels,
    id2label=label_map,
    label2id={label: i for i, label in enumerate(labels)},
    cache_dir=model_args['cache_dir']
)

tokenizer = AutoTokenizer.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    is_pretokenized=model_args['do_basic_tokenize'],
    do_basic_tokenize = model_args['do_basic_tokenize']
)

model = AutoModelForTokenClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir']
)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Let's define a function to create the input dataset. This function reads each sentence and arranges it into 4 parts:
*   input_ids
*   token_ids (the token type IDs)
*   attention_masks
*   label_ids

We use tokenizer.tokenize to split each word into sub-word tokens, repeating the word's label for every sub-word, and tokenizer.encode_plus to convert the tokens into IDs with the special tokens added.


In [10]:
from tqdm import tqdm,trange

def convert_to_input(sentences,tags):
  input_id_list,attention_mask_list,token_type_id_list=[],[],[]
  label_id_list=[]
  label2id={label: i for i, label in enumerate(labels)}
  for x,y in tqdm(zip(sentences,tags),total=len(tags)):
    tokens = []
    label_ids = []
    for word, label in zip(x, y):
      word_tokens = tokenizer.tokenize(word)
      tokens.extend(word_tokens)
      # One label per sub-word token, so tokens and labels stay aligned even if a word yields no tokens
      label_ids.extend([label2id[label]] * len(word_tokens))
    special_tokens_count =  2
    if len(tokens) > max_seq_length - special_tokens_count:
      tokens = tokens[: (max_seq_length - special_tokens_count)]
      label_ids = label_ids[: (max_seq_length - special_tokens_count)]
    label_ids = [pad_token_label_id]+label_ids+[pad_token_label_id]
    inputs = tokenizer.encode_plus(tokens,add_special_tokens=True, max_length=max_seq_length, padding=True,truncation=True)
    input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
    attention_masks = [1] * len(input_ids)
    attention_mask_list.append(attention_masks)
    input_id_list.append(input_ids)
    token_type_id_list.append(token_type_ids)
    label_id_list.append(label_ids)
  return input_id_list,token_type_id_list,attention_mask_list,label_id_list

Let's validate on some test data that the function arranges it in the proper format for model training.


In [11]:
sen=[['phone','408-306-1500','Madhava'],['I','am','working']]
tok=[['O','I-PHONE','I-NAME'],['O','O','O']]
input_ids,token_ids,attention_masks,label_ids=convert_to_input(sen,tok)
print('')
print('Input_ids:',input_ids)
print('token_ids',token_ids)
print('attention_masks',attention_masks)
print('label_ids',label_ids)

100%|██████████| 2/2 [00:00<00:00, 1016.31it/s]


Input_ids: [[101, 3042, 2871, 2620, 29624, 14142, 2575, 29624, 16068, 8889, 5506, 3270, 3567, 102], [101, 1045, 2572, 2551, 102]]
token_ids [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
attention_masks [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
label_ids [[-100, 0, 5, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, -100], [-100, 0, 0, 0, -100]]
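
The `label_ids` output above shows each word's label repeated for every sub-word piece, with `-100` on the `[CLS]` and `[SEP]` positions. The sketch below reproduces that alignment with a toy splitter in place of the BERT tokenizer. `toy_tokenize` and `align_labels` are hypothetical helpers, and the real WordPiece output will differ.

```python
def toy_tokenize(word, piece_len=3):
    """Stand-in for a sub-word tokenizer: chop a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_labels(words, word_labels, label2id, ignore_id=-100):
    """Repeat each word label for all of its pieces and mask special tokens."""
    tokens, label_ids = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        label_ids.extend([label2id[label]] * len(pieces))
    return ["[CLS]"] + tokens + ["[SEP]"], [ignore_id] + label_ids + [ignore_id]


label2id = {"O": 0, "I-NAME": 1, "I-PHONE": 5}
tokens, label_ids = align_labels(
    ["phone", "408-306-1500", "Madhava"], ["O", "I-PHONE", "I-NAME"], label2id
)
print(tokens)
print(label_ids)
```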





Now let's prepare Training and Test data in the format the Token Classification model accepts

In [12]:
import logging
from keras.preprocessing.sequence import pad_sequences
import numpy as np

logging.basicConfig(level=logging.ERROR)

input_ids_train,token_ids_train,attention_masks_train,label_ids_train=convert_to_input(x_train_sentences,x_train_tags)
input_ids_test,token_ids_test,attention_masks_test,label_ids_test=convert_to_input(x_test_sentences,x_test_tags)
input_ids_train = pad_sequences(input_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
token_ids_train = pad_sequences(token_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
attention_masks_train = pad_sequences(attention_masks_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
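# Pad labels with the ignore index (-100) rather than 0, which is the ID of the 'O' label
label_ids_train = pad_sequences(label_ids_train,maxlen=max_seq_length,dtype="long",truncating="post",padding="post",value=pad_token_label_id)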
print('')
print('Dimentions of Training data')
print(np.shape(input_ids_train),np.shape(token_ids_train),np.shape(attention_masks_train),np.shape(label_ids_train))

input_ids_test = pad_sequences(input_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
token_ids_test = pad_sequences(token_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
attention_masks_test = pad_sequences(attention_masks_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post")
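# Pad labels with the ignore index (-100) rather than 0, which is the ID of the 'O' label
label_ids_test = pad_sequences(label_ids_test,maxlen=max_seq_length,dtype="long",truncating="post",padding="post",value=pad_token_label_id)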
print('Dimentions of Test data')
print(np.shape(input_ids_test),np.shape(token_ids_test),np.shape(attention_masks_test),np.shape(label_ids_test))

100%|██████████| 20000/20000 [00:34<00:00, 586.58it/s]
100%|██████████| 5000/5000 [00:08<00:00, 594.86it/s]



Dimentions of Training data
(20000, 128) (20000, 128) (20000, 128) (20000, 128)
Dimentions of Test data
(5000, 128) (5000, 128) (5000, 128) (5000, 128)
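
For readers who want to see what `pad_sequences(..., padding="post", truncating="post")` does to each list, here is a minimal plain-Python equivalent for a single sequence. `pad_post` is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate at the end to maxlen, then right-pad with value."""
    clipped = list(sequence[:maxlen])
    return clipped + [value] * (maxlen - len(clipped))


input_ids = [101, 1045, 2572, 2551, 102]
label_ids = [-100, 0, 0, 0, -100]
print(pad_post(input_ids, 8))               # token IDs padded with 0
print(pad_post(label_ids, 8, value=-100))   # labels padded with the ignore index
print(pad_post(input_ids, 3))               # longer inputs are cut at the end
```

Padding labels with `-100` rather than `0` is one way to keep padded positions from being counted as the `O` class in the loss.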


## **Training and validating the model using PyTorch**

Now that the data is in the format the token classification model expects, let's prepare to train the model. The data needs to be fed in batches, so it has to be converted to tensors and wrapped in a dataset that a PyTorch data loader can read. The following class prepares each example as a dictionary for the model to read.

In [13]:
import torch
class TorchNERDataset(torch.utils.data.Dataset):
    def __init__(self,ids,mask,tokid, labels):
        self.ids = ids
        self.mask = mask
        self.tokid = tokid
        self.labels = labels

    def __getitem__(self, idx):
        #item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item ={}
        item['input_ids']=torch.tensor(self.ids[idx])
        item['token_type_ids']=torch.tensor(self.tokid[idx])
        item['attention_mask']=torch.tensor(self.mask[idx])
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

Now let's convert the data into the required format and validate the dataset.

In [14]:
train_ds = TorchNERDataset(input_ids_train,attention_masks_train,token_ids_train,label_ids_train)
test_ds= TorchNERDataset(input_ids_test,attention_masks_test,token_ids_test,label_ids_test)
print(train_ds[0])


{'input_ids': tensor([  101,  2010,  2171,  2003,  6766, 14899,  2239,  2141,  2006,  3261,
        29624, 12521, 29624, 16576,  1010,  2010,  7020,  2078,  2003,  3515,
         2509, 29624, 21486, 29624,  2581, 22907,  2549,  2010,  4003,  2193,
         2003,  3429, 20842, 16576, 22907, 20842, 28154,  2620, 17914,  2575,
         1010,  1998,  2010,  3042,  2193,  2003,  1006,  2692,  2581,  2487,
        29620, 26187,  2692, 29624,  2620,  2620,  2620,  2683,  2010,  2171,
         2003, 10294,  3656,  2141,  2006,  3196, 29624,  2692,  2581, 29624,
        21926,  1010,  2010,  7020,  2078,  2003,  5187,  2475, 29624,  2692,
         2629, 29624,  2683, 22275,  2620,  2010,  4923,  4003,  2003, 19883,
        14526, 16932, 16068,  2575,  2683,  2683,  2581,  2620,  2475,  1010,
         1998,  2010,  3042,  2193,  2003,  1009,  2487, 29624,  2683, 22610,
        29624,  2683,  2581,  2475, 29624, 17465, 22394,  2595, 21084, 14142,
         2010,  2171,  2003, 15609,  4922,  2141, 

As you have seen, the majority of the work in a machine learning task is getting the data ready for the model. Now let's use Hugging Face's new **Trainer** module to train the model.


In [15]:
from transformers import (
    Trainer,
    TrainingArguments
)

training_args = TrainingArguments(
    output_dir='./results_PT',          
    num_train_epochs=3,              
    per_device_train_batch_size=16,  
    per_device_eval_batch_size=64,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs_PT',            
    logging_steps=3,
)

trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_ds,        
    eval_dataset=test_ds,  
)

In [None]:
# Let's train the model now
trainer.train()


In [None]:
# Let's evaluate model now
trainer.evaluate()
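
Without a metrics function, `trainer.evaluate()` as configured above mainly reports the evaluation loss. To get a quick token-level accuracy as well, something like the following plain-Python sketch could be applied to predicted and gold label IDs, skipping `-100` positions. The function name and the sample values are illustrative assumptions.

```python
def token_accuracy(pred_ids, gold_ids, ignore_id=-100):
    """Fraction of non-ignored positions where prediction equals gold label."""
    correct = total = 0
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        for pred, gold in zip(pred_row, gold_row):
            if gold == ignore_id:
                continue  # special tokens and padding are not scored
            total += 1
            correct += int(pred == gold)
    return correct / total if total else 0.0


gold = [[-100, 0, 5, 5, 1, -100], [-100, 0, 0, 0, -100]]
pred = [[0, 0, 5, 1, 1, 0], [0, 0, 0, 0, 0]]
accuracy = token_accuracy(pred, gold)
print(f"token accuracy: {accuracy:.3f}")
```

Since most tokens are `O`, plain accuracy can look flattering; per-entity precision and recall are usually a more telling check.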

Now that the model is trained, let's run inference and check that it works.

In [None]:
from transformers import pipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to('cpu')
ner = pipeline('ner', model=model, tokenizer=tokenizer,grouped_entities=True)
#ner = pipeline('ner', model=model, tokenizer=tokenizer)

print(ner('my name is madhava avvari born on 1970-01-05 and my phone number is 408-306-1500 and my ssn is 626-89-2356'))
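
With `grouped_entities=True` the pipeline merges adjacent tokens that share an entity label. The idea can be sketched in plain Python as below. This is a simplified illustration with a hypothetical `group_entities` helper, not the pipeline's actual implementation.

```python
def group_entities(tokens, labels, outside="O"):
    """Merge runs of consecutive tokens that carry the same non-'O' label."""
    groups = []
    previous = outside
    for token, label in zip(tokens, labels):
        if label != outside:
            if label == previous:
                groups[-1]["word"] += " " + token   # extend the current run
            else:
                groups.append({"entity_group": label, "word": token})
        previous = label
    return groups


tokens = ["my", "name", "is", "madhava", "avvari", "born", "on", "1970-01-05"]
labels = ["O", "O", "O", "I-NAME", "I-NAME", "O", "O", "I-DATE"]
entities = group_entities(tokens, labels)
for entity in entities:
    print(entity)
```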

## **Training and validating the model using Keras/Tensorflow**

Now let's train the model using Keras/TensorFlow. TF needs the training data in a slightly different format, so let's prepare the data for model training.

In [None]:
import tensorflow as tf
def example_to_features(input_ids,attention_masks,token_type_ids,y):
  return {"input_ids": input_ids,
          "attention_mask": attention_masks,
          "token_type_ids": token_type_ids},y
train_ds = tf.data.Dataset.from_tensor_slices((input_ids_train,attention_masks_train,token_ids_train,label_ids_train)).map(example_to_features)
test_ds=tf.data.Dataset.from_tensor_slices((input_ids_test,attention_masks_test,token_ids_test,label_ids_test)).map(example_to_features)

Let's validate one record of the training data.

In [None]:
for x,y in train_ds.take(1):
  print(x)
  print(y)

In [None]:
!rm -rf CustomNER_cache
!rm -rf results_TF
!rm -rf logs_TF

In [None]:
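from transformers import (
    TFAutoModelForTokenClassification,
    TFTrainer,
    TFTrainingArguments
)

# Use the TensorFlow counterparts here: `strategy` is only defined on
# TFTrainingArguments, and the tf.data datasets need a TF model.
# Note: TFTrainer ships with the transformers 3.x release installed above
# and may not be available in newer releases.
training_args = TFTrainingArguments(
    output_dir='./results_TF',          
    num_train_epochs=3,              
    per_device_train_batch_size=16,  
    per_device_eval_batch_size=64,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs_TF',            
    logging_steps=3,
)
with training_args.strategy.scope():
  model = TFAutoModelForTokenClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir']
  )
trainer = TFTrainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_ds,        
    eval_dataset=test_ds,  
)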

In [None]:
# Let's train the model now
trainer.train()

In [None]:
# Let's evaluate model now
trainer.evaluate()

In [None]:
from transformers import pipeline
ner = pipeline('ner', model=model, tokenizer=tokenizer,grouped_entities=True)
#nlp_bert_lg = pipeline('ner', model=model, tokenizer=tokenizer)

print(ner('my name is madhava avvari born on 1970-01-05 and my phone number is 408-306-1500 and my ssn is 626-89-2356'))