<a href="https://colab.research.google.com/github/crazycloud/dl-blog/blob/master/_notebooks/2020_09_20_Entity_Extraction_Transformers_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# "Entity Extraction (NER) - Training and Inference using Transformers - Part 2"
> Learn to train a NER model using Transformers Trainer Class, and to run Inference using Pipeline function

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [nlp, token classification, ]
- image: images/transformers-ner-part2.png
- hide: false

In Part 1, we covered how to use a pretrained language model, the Tokenizer, and a TokenClassification model.

## Fine Tuning Token Classification Model

Steps for fine-tuning a token classification (TC) model:

**Step 1.** Dataset - Get a labelled dataset for training and testing. For token classification, we need word-level labels like the following:

- `The` O  
- `battery` Aspect
- `of` O 
- `the` O  
- `speaker` O 
- `is` O 
- `very` Sentiment
- `poor` Sentiment

  
  We need train.txt, test.txt and labels.txt files to fine-tune the model and to verify its performance.

  train.txt/test.txt - one word and its label per line, with a blank line between examples.

  labels.txt - the list of unique tags used in train.txt and test.txt.
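
To make the file layout concrete, here is a minimal sketch of a reader for this format. It is not the code used by the transformers example scripts; the function name `read_conll` and the second sample sentence are made up for illustration, and it assumes every non-blank line holds exactly one word and one label separated by a space.

```python
from typing import List, Tuple


def read_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Parse 'word label' lines; a blank line ends the current example."""
    examples = []
    words, labels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: close the example collected so far, if any.
            if words:
                examples.append((words, labels))
                words, labels = [], []
            continue
        # Split on the last space so the label is always the final field.
        word, label = line.rsplit(" ", 1)
        words.append(word)
        labels.append(label)
    if words:  # the text may not end with a blank line
        examples.append((words, labels))
    return examples


# Illustrative content for train.txt; the second sentence is hypothetical.
sample = """The O
battery Aspect
of O
the O
speaker O
is O
very Sentiment
poor Sentiment

It O
works O
"""

parsed = read_conll(sample)
for words, labels in parsed:
    print(list(zip(words, labels)))
```

A line without a label would raise a `ValueError` in this sketch, which is a quick way to spot malformed rows before training.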

**Step 2.** If the data is in any other format, we have to either convert it into this format, or write a new Task class and implement the methods of [TokenClassificationTask](https://github.com/huggingface/transformers/blob/4f6e52574248636352a746cfe6cc0b13cf3eb7f9/examples/token-classification/tasks.py#L13)

For this example, we will use the Task class NER in the [tasks.py](https://github.com/huggingface/transformers/blob/4f6e52574248636352a746cfe6cc0b13cf3eb7f9/examples/token-classification/tasks.py#L13) file. It prepares a list of `InputExample` objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputExample:
    """
    A single training/test example for token classification.
    Args:
        guid: Unique id for the example.
        words: list. The words of the sequence.
        labels: (Optional) list. The labels for each word of the sequence. This should be
            specified for train and dev examples, but not for test examples.
    """

    guid: str
    words: List[str]
    labels: Optional[List[str]]
```

**Step 3.** The NER class extends the [TokenClassificationTask](https://github.com/huggingface/transformers/blob/4f6e52574248636352a746cfe6cc0b13cf3eb7f9/examples/token-classification/utils_ner.py#L68) class, whose `convert_examples_to_features` method converts a list of examples into input features.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputFeatures:
    """
    A single set of features of data.
    Property names are the same names as the corresponding inputs to a model.
    """

    input_ids: List[int]
    attention_mask: List[int]
    token_type_ids: Optional[List[int]] = None
    label_ids: Optional[List[int]] = None
```

The `convert_examples_to_features` method uses the Tokenizer class to convert each InputExample into InputFeatures. In Part 1 we discussed the Tokenizer and how to prepare input for the model.
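
One detail of this conversion is worth seeing in isolation: a tokenizer can split a word into several sub-word pieces, while the dataset has only one label per word. A common approach, sketched below, is to give the label to the first piece and mark the remaining pieces with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The `toy_tokenize` function is a hypothetical stand-in for a real tokenizer, and `align_labels` is an illustrative name, not the library's implementation.

```python
from typing import Callable, Dict, List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    words: List[str],
    labels: List[str],
    label_map: Dict[str, int],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Label the first sub-word piece of each word; mask the other pieces."""
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # skip words the tokenizer maps to nothing
        tokens.extend(pieces)
        label_ids.extend([label_map[label]] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: cut each word into chunks of up to 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


label_map = {"O": 0, "B-geo": 1, "I-geo": 2}  # illustrative label ids
tokens, label_ids = align_labels(
    ["South", "Korea", "'s"], ["B-geo", "I-geo", "O"], label_map, toy_tokenize
)
print(tokens)     # ['Sout', 'h', 'Kore', 'a', "'s"]
print(label_ids)  # [1, -100, 2, -100, 0]
```

Special tokens and padding positions are typically masked with the same index, so only one position per word contributes to the loss.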

**Step 4.** Convert InputFeatures into a PyTorch Dataset. utils_ner.py contains the code that converts InputFeatures into the Dataset required for training.

**Step 5.** Call the Trainer class in run_ner.py, which trains and evaluates the model.


## Installation
Install the latest transformers library, along with seqeval and conllu.

In [None]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install seqeval
!pip install conllu

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Prepare the Training Data

Download the dataset from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

Download the file `ner_dataset.csv`, which is the file loaded in the code below.

Let's load the dataset into a pandas DataFrame and look at the data.

In [4]:
import pandas as pd
data_path = '/content/drive/My Drive/transformers-ner/ner_dataset.csv'
df = pd.read_csv(data_path, encoding="latin-1")

# Fill the empty "Sentence #" cells with the previous available value
df["Sentence #"] = df["Sentence #"].ffill()
df.head()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |


In [6]:
sentences = df.groupby("Sentence #")["Word"].apply(list).values
tags = df.groupby("Sentence #")["Tag"].apply(list).values

In [7]:
print(sentences[0])
print(tags[0])

['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through', 'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and', 'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from', 'that', 'country', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
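
The tags appear to follow the usual B-/I-/O prefix convention: `B-` marks the first word of an entity, `I-` a continuation of the same entity, and `O` a word outside any entity. As a rough sketch under that assumption, the helper below (the name `bio_to_spans` is made up for this post) collects the tagged words into entity spans, using the first sentence printed above.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged words into (entity text, entity type) pairs."""
    spans = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one, if any.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O" tag, or a stray I- tag with no matching open entity.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


words = ['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through',
         'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and',
         'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from',
         'that', 'country', '.']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O',
        'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

print(bio_to_spans(words, tags))
# [('London', 'geo'), ('Iraq', 'geo'), ('British', 'gpe')]
```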


### Split the dataset into Train and Test using sklearn 

In [9]:
from sklearn.model_selection import train_test_split
#split into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(sentences, tags, test_size=0.2, random_state=42)

In [11]:
TRAIN_FILE_PATH = '/content/drive/My Drive/transformers-ner/data/train.txt'
TEST_FILE_PATH = '/content/drive/My Drive/transformers-ner/data/test.txt'
LABELS_FILE_PATH = '/content/drive/My Drive/transformers-ner/data/labels.txt'

def write_examples(path, sentences, tags):
  # One "word label" pair per line, with a blank line after each sentence
  with open(path, 'w', encoding='utf-8') as f:
    for words, labels in zip(sentences, tags):
      for word, label in zip(words, labels):
        f.write(f'{word} {label}\n')
      f.write('\n')

write_examples(TRAIN_FILE_PATH, X_train, y_train)
write_examples(TEST_FILE_PATH, X_test, y_test)

Prepare the labels file with the list of unique labels.

In [12]:
with open(LABELS_FILE_PATH,'w') as f:
  for tag in df['Tag'].unique():
    print(tag)
    f.write(str(tag)+'\n')

O
B-geo
B-gpe
B-per
I-geo
B-org
I-org
B-tim
B-art
I-art
I-per
I-gpe
I-tim
B-nat
B-eve
I-eve
I-nat


### Download Finetuning Code from Transformers Package

Download the following files from the transformers GitHub repo.

In [None]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/token-classification/utils_ner.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/token-classification/run_ner.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/token-classification/tasks.py

## Training Code


The data files are converted into InputExample and InputFeatures objects using the NER class in tasks.py. run_ner.py takes care of the conversion


>Datafile -> InputExample -> InputFeature -> Dataset -> DataLoader

which is required for the training loop. We can simply run run_ner.py with all the required parameters.

We will train using run_ner.py, but first let us look at what InputExample and InputFeatures look like.

In [18]:
from tasks import NER
task = NER()

examples = task.read_examples_from_file('/content/drive/My Drive/transformers-ner/data',mode= 'train')

In [19]:
examples[:2]

[InputExample(guid='train-1', words=['South', 'Korea', "'s", 'government', 'Tuesday', 'also', 'unveiled', 'a', 'so-called', 'Green', 'New', 'Job', 'Creation', 'Plan', ',', 'expected', 'to', 'create', '9,60,000', 'new', 'jobs', '.'], labels=['B-geo', 'I-geo', 'O', 'O', 'B-tim', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']),
 InputExample(guid='train-2', words=['When', 'the', 'Lion', 'found', 'that', 'he', 'could', 'not', 'escape', ',', 'he', 'flew', 'upon', 'the', 'sheep', 'and', 'killed', 'them', ',', 'and', 'then', 'attacked', 'the', 'oxen', '.'], labels=['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])]

In [31]:
# Read one label per line, skipping blank lines
with open(LABELS_FILE_PATH) as f:
  labels = [line.strip() for line in f if line.strip()]
labels

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']

In [32]:
from transformers import AutoTokenizer
tokenizer= AutoTokenizer.from_pretrained('roberta-base')

In [33]:
features = task.convert_examples_to_features(examples,label_list=labels, max_seq_length=128, tokenizer=tokenizer)

In [35]:
features[:2]

[InputFeatures(input_ids=[3, 10050, 530, 33594, 18, 11455, 25464, 19726, 879, 548, 6691, 102, 2527, 12, 4155, 19247, 4030, 43128, 40008, 1258, 35351, 6, 10162, 560, 32845, 466, 6, 2466, 6, 151, 4651, 41207, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=None, label_ids=[-100, 1, 4, -100, 0, 0, 7, 0, 0, -100, -100, 0, 0, -100, -100, 0, 0, 0,

In [None]:
!python run_ner.py \
--model_name_or_path 'roberta-base' \
--labels '/content/drive/My Drive/transformers-ner/data/labels.txt' \
--data_dir '/content/drive/My Drive/transformers-ner/data' \
--output_dir 'model' \
--max_seq_length 128 \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--save_steps 1000000 \
--seed 16 \
--do_train \
--do_predict \
--overwrite_output_dir \
--fp16

Reduce `per_device_train_batch_size` if you run into out-of-memory errors.
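
If you lower the batch size, one option is to raise `--gradient_accumulation_steps` (a standard `TrainingArguments` option) by the same factor, so that the number of examples per optimizer update stays the same. The arithmetic is sketched below; the example count of 38,000 is a made-up round number, and the step count is only a rough estimate of what the Trainer will report.

```python
import math


def training_budget(num_examples, per_device_batch_size,
                    grad_accum_steps=1, num_devices=1, epochs=3):
    """Return (examples per optimizer update, approximate total updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    updates_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, updates_per_epoch * epochs


# Settings from the command above: batch size 8, no accumulation.
baseline = training_budget(38_000, per_device_batch_size=8)

# Half the per-device batch size, compensated by two accumulation steps.
low_memory = training_budget(38_000, per_device_batch_size=4, grad_accum_steps=2)

print(baseline, low_memory)
```

Both configurations use the same effective batch size, so the second one trades some speed for a smaller memory footprint per step.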

## Model Prediction 

Once the model is trained, the final model weights, configuration and tokenizer will be available in the output_dir. We will use the pipeline function to run model predictions.


In [36]:
from transformers import pipeline
model_name = '/content/drive/My Drive/transformers-ner/model'
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name, framework="pt",grouped_entities=False)

In [43]:
sequence = """
Remind me to do those 11 things at 10 pm.
"""
result = nlp(sequence)
result

[{'entity_group': 'B-tim', 'score': 0.9684648811817169, 'word': ' 10 pm'}]

In [39]:
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name, framework="pt",grouped_entities=True)
result = nlp(sequence)
result

[{'entity_group': 'B-tim', 'score': 0.9991254210472107, 'word': 'Today'},
 {'entity_group': 'B-geo', 'score': 0.7912563681602478, 'word': ' India'},
 {'entity_group': 'B-geo', 'score': 0.6456469893455505, 'word': ' Pakistan'},
 {'entity_group': 'B-tim', 'score': 0.981935441493988, 'word': ' 2pm'}]
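
The grouped results keep the `B-` prefix in `entity_group` and a leading space in `word`. A small post-processing step can tidy this up before the entities are used downstream. The sketch below works on plain dictionaries shaped like the output above; `clean_entities` and its `min_score` threshold are illustrative names, not part of the transformers API.

```python
from typing import Dict, List


def clean_entities(results: List[Dict], min_score: float = 0.0) -> List[Dict]:
    """Strip B-/I- prefixes and whitespace, and drop low-confidence entities."""
    cleaned = []
    for item in results:
        if item["score"] < min_score:
            continue
        label = item["entity_group"]
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        cleaned.append({
            "text": item["word"].strip(),
            "type": label,
            "score": float(item["score"]),
        })
    return cleaned


# Dictionaries copied from the pipeline outputs shown above.
results = [
    {'entity_group': 'B-tim', 'score': 0.9991254210472107, 'word': 'Today'},
    {'entity_group': 'B-geo', 'score': 0.7912563681602478, 'word': ' India'},
    {'entity_group': 'B-geo', 'score': 0.6456469893455505, 'word': ' Pakistan'},
    {'entity_group': 'B-tim', 'score': 0.981935441493988, 'word': ' 2pm'},
]

for entity in clean_entities(results, min_score=0.7):
    print(entity["type"], entity["text"])
```

With a threshold of 0.7, the lower-confidence ` Pakistan` prediction is filtered out; the right threshold depends on the application.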

We will train a token classification model on one more dataset in Part 3 to see how easy it is to train and run predictions for any dataset.