<a href="https://colab.research.google.com/github/XuRui314/HITSZ_2022_NLP_Project/blob/main/Ner_with_Bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> This notebook is made from Ruben Winastwan's article [Named Entity Recognition with BERT in PyTorch](https://towardsdatascience.com/named-entity-recognition-with-bert-in-pytorch-a454405e0b6a).

## Preparation

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!nvidia-smi

Sat Jun 11 16:19:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.4-py3-none-any.whl (4.2 MB)
     |████████████████████████████████| 4.2 MB 4.1 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 42.5 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 7.0 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 65.9 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Unins

In [5]:
# This check is also necessary: without querying CUDA here, the GPU may not be detected
import torch
print(torch.cuda.device_count())
print(torch.cuda.is_available())

1
True


In [6]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Jun 11 16:19:55 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Introduction

When it comes to NLP problems, BERT often comes up as a machine learning model that we can count on for strong performance. It has been pre-trained on more than 2,500M words, and it reads a sequence bidirectionally, using the context on both sides of each word. Together, these properties make it a powerful model to use.

I wrote about how we can leverage BERT for text classification before, and in this article, we’re going to focus more on how to use BERT for named entity recognition (NER) tasks.

### **What is NER?**

NER is a task in NLP to identify and extract meaningful information (or we can call it entities) in a sentence or text. An entity can be a single word or even a group of words that refer to the same category.

![](https://miro.medium.com/max/875/1*T1R3_XyaUMQW2ChmIDb2vw.png)

As an example, let’s say we have the sentence in the image above and we want to extract information about a person’s name from it.


The first step of a NER task is to detect an entity. This can be a word or a group of words that refer to the same category. As an example:

- ‘Bond’ ➡️ an entity that consists of a single word
- ‘James Bond’ ➡️ an entity that consists of two words that refer to the same category.

To make sure that our BERT model knows that an entity can be a single word or a group of words, we need to mark where each entity begins and how far it extends in our training data, using the so-called Inside-Outside-Beginning (IOB) tagging. We will see more about this when we look at our dataset later in this article.

After detecting an entity, the next step in a NER task is to categorize the detected entity. The categories of an entity can be anything depending on our use case. Below is an example of categories of entities:

- **Person**: Bond, James Bond, Sam, Anna, Frank, Leonardo DiCaprio
- **Location**: New York, Vienna, Munich, London
- **Organization**: Google, Apple, Stanford University, Deutsche Bank
- **Location**: Central Park, Brandenburger Tor, Times Square

These entity categories serve as the labels of our data when we train our BERT model, which we will look at in detail in the following sections.





### **BERT for NER**

As previously mentioned, BERT is a Transformer-based machine learning model that comes in handy when we want to solve NLP-related tasks.

If you’re not yet familiar with BERT, I recommend reading my previous article about text classification with BERT first: [Bert text classification](https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f). There you’ll find information about what BERT actually is, what kind of input data the model expects, and the output that you’ll get from the model.



What differentiates BERT for text classification from BERT for NER is how we use the output of the model. For a text classification problem, we only use the embedding vector output from the special [CLS] token, as you can see in the visualization below:

![](https://miro.medium.com/max/875/1*ARJzucvPAmzdaB_y3J3ngg.png)

Meanwhile, if we want to use BERT for NER tasks, we need to use the embedding vector output from all of the tokens, as you can see in the visualization below:

![](https://miro.medium.com/max/875/1*98osfdBNVYyl-M_NtvuNyA.png)

Image by author

By using the embedding vector output from all of the tokens, we can classify text at the token level. This is exactly what we want, since we want our BERT model to predict the entity label of each token. Now without further ado, let’s go to the implementation.

### About the Dataset
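The dataset that we’re going to use in this article is an NER dataset that you can download from Kaggle via the link below. The original article calls it CoNLL-2003, but its tag set (`geo`, `gpe`, `tim`, and so on, listed in the next section) does not match CoNLL-2003’s four entity types (person, location, organization, miscellaneous); it appears to follow the Groningen Meaning Bank (GMB) annotation scheme instead. [NER DATA](https://www.kaggle.com/datasets/rajnathpatel/ner-data)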

## Code

### **Dataset**

This dataset is distributed under the Open Database v1.0 license, so we are free to share and use it for our own purposes. Now let’s take a look at what the dataset looks like.

In [7]:
import pandas as pd

In [11]:
df = pd.read_csv('drive/MyDrive/dataset/ner.csv')
df.head()

Unnamed: 0,text,labels
0,Thousands of demonstrators have marched throug...,O O O O O O B-geo O O O O O B-geo O O O O O B-...
1,Iranian officials say they expect to get acces...,B-gpe O O O O O O O O O O O O O O B-tim O O O ...
2,Helicopter gunships Saturday pounded militant ...,O O B-tim O O O O O B-geo O O O O O B-org O O ...
3,They left after a tense hour-long standoff wit...,O O O O O O O O O O O
4,U.N. relief coordinator Jan Egeland said Sunda...,B-geo O O B-per I-per O B-tim O B-geo O B-gpe ...


As we can see above, we have a dataframe that consists of the texts and their labels. Each label corresponds to the entity category of one word in the text.

In total, there are 9 entity categories, which are:

- `geo` for geographical entity
- `org` for organization entity
- `per` for person entity
- `gpe` for geopolitical entity
- `tim` for time indicator entity
- `art` for artifact entity
- `eve` for event entity
- `nat` for natural phenomenon entity
- `O` is assigned if a word doesn’t belong to any entity.

Let’s take a look at the unique labels available in our dataset:

In [13]:
# Split labels based on whitespace and turn them into a list
labels = [i.split() for i in df['labels'].values.tolist()]

# Collect the set of distinct labels in the dataset
unique_labels = set()

for lb in labels:
    unique_labels.update(lb)

print(unique_labels)


{'I-org', 'B-org', 'I-tim', 'I-nat', 'B-tim', 'I-geo', 'B-per', 'I-eve', 'O', 'I-per', 'I-art', 'B-eve', 'B-nat', 'B-art', 'B-geo', 'I-gpe', 'B-gpe'}


In [14]:
# Map each label into its id representation and vice versa
labels_to_ids = {k: v for v, k in enumerate(sorted(unique_labels))}
ids_to_labels = {v: k for v, k in enumerate(sorted(unique_labels))}
print(labels_to_ids)

{'B-art': 0, 'B-eve': 1, 'B-geo': 2, 'B-gpe': 3, 'B-nat': 4, 'B-org': 5, 'B-per': 6, 'B-tim': 7, 'I-art': 8, 'I-eve': 9, 'I-geo': 10, 'I-gpe': 11, 'I-nat': 12, 'I-org': 13, 'I-per': 14, 'I-tim': 15, 'O': 16}


As you might notice, each entity category is preceded by the letter `I` or `B`. This corresponds to the IOB tagging mentioned earlier: `B` means Beginning and `I` means Inside, that is, a continuation of the entity. Let’s take a look at the following sentence to understand the concept of IOB tagging a little bit more.

![](https://miro.medium.com/max/875/1*9AujtffwNyoychjkHOiYHQ.png)

- ‘Kevin’ has the `B-pers` label since it’s the beginning of a person entity
- ‘Durant’ has the `I-pers` label because it’s the continuation of a person entity
- ‘Brooklyn’ has the `B-org` label since it’s the beginning of an organization entity
- ‘Nets’ has the `I-org` label since it’s the continuation of an organization entity
- Other words are assigned the `O` label as they don’t belong to any entity
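
To make the tagging scheme concrete, here is a minimal sketch that groups IOB-tagged words back into entities. The helper name `iob_to_spans` and the example sentence are hypothetical, and the tags use the dataset’s `per` and `org` category names; this is only an illustration of how `B-` and `I-` prefixes delimit an entity.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged words into (entity_text, category) pairs."""
    spans = []
    current_words, current_type = [], None

    for word, tag in zip(words, tags):
        prefix, _, category = tag.partition("-")
        if prefix == "B" or (prefix == "I" and category != current_type):
            # A new entity starts here; close any entity that was open.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], category
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O" tag: the word is outside any entity.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None

    # Close an entity that runs until the end of the sentence.
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hypothetical example sentence, tagged by hand
words = "Kevin Durant plays for the Brooklyn Nets".split()
tags = ["B-per", "I-per", "O", "O", "O", "B-org", "I-org"]
print(iob_to_spans(words, tags))
```

Without the `B-` prefix, two adjacent entities of the same category could not be told apart from one longer entity, which is why the beginning of each entity is marked explicitly.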

### **Data Preprocessing**

Before we can use a BERT model to classify the entity of a token, we need to preprocess the data. This includes two parts: tokenization, and adjusting the labels to match the tokenization. Let’s start with tokenization.



#### **Tokenization**

Tokenization can be easily implemented with BERT, as we can use the `BertTokenizerFast` class from HuggingFace, loaded from a pretrained BERT base model.

To give you an example of how the BERT tokenizer works, let’s take a look at one of the texts from our dataset:

In [15]:
# Let's take a look at how we can preprocess the text, using the example at index 36
text = df['text'].values.tolist()
example = text[36]

print(example)

Prime Minister Geir Haarde has refused to resign or call for early elections .


Tokenizing the text above with `BertTokenizerFast` is very straightforward:

In [16]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
text_tokenized = tokenizer(example, padding='max_length', max_length=512, truncation=True, return_tensors="pt")


We provide several arguments when calling the `tokenizer` object (an instance of the `BertTokenizerFast` class) above:

- `padding`: pads the sequence with the special **[PAD]** token up to the maximum length that we specify. The maximum length of a sequence for a BERT model is 512.

- `max_length`: the maximum length of a sequence.

- `truncation`: a Boolean value. If we set it to True, tokens beyond the maximum length are dropped.

- `return_tensors`: the tensor type that is returned, depending on the machine learning framework that we use. Since we’re using PyTorch, we use `pt`.

And below is the output of the tokenization process:

In [17]:
print(text_tokenized)

{'input_ids': tensor([[  101,  3460,  2110,   144,  6851,  1197, 11679,  2881,  1162,  1144,
          3347,  1106, 13133,  1137,  1840,  1111,  1346,  3212,   119,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

As you can see, the output that we get from the tokenization process is a dictionary-like object that contains three entries:

- `input_ids`: The id representation of the tokens in a sequence. In BERT, the id 101 is reserved for the special **[CLS]** token, the id 102 is reserved for the special **[SEP]** token, and the id 0 is reserved for the **[PAD]** token.

- `token_type_ids`: Identifies which sequence a token belongs to. Since we only have one sequence per text, all the values of `token_type_ids` will be 0.

- `attention_mask`: Identifies whether a token is a real token or padding. The value is 1 if it’s a real token, and 0 if it’s a **[PAD]** token.
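
The relationship between truncation, padding, and the attention mask can be sketched in plain Python. The helper `pad_and_mask` below is hypothetical and is not how the HuggingFace tokenizer is implemented; it only mimics the shape of the result for a single sequence, using the special ids 101, 102, and 0 described above.

```python
def pad_and_mask(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], truncate, pad, and build the matching attention mask."""
    # Truncation: leave room for the two special tokens.
    content = content_ids[: max_length - 2]
    ids = [cls_id] + content + [sep_id]

    # Padding: fill the remaining positions with the [PAD] id.
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad

    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# A shortened, illustrative sequence of word-piece ids (without special tokens)
input_ids, attention_mask = pad_and_mask([3460, 2110, 119], max_length=8)
print(input_ids)
print(attention_mask)
```

With `max_length=512`, as in the cell above, most positions of a short sentence are padding, which is why the printed `input_ids` consist mainly of zeros.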

From the `input_ids` above, we can decode the ids back into the original sequence with `decode` method as follows:

In [18]:
print(tokenizer.decode(text_tokenized.input_ids[0]))

[CLS] Prime Minister Geir Haarde has refused to resign or call for early elections. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

We get our original sequence back from the `decode` method, together with the special tokens from BERT: the **[CLS]** token at the beginning of the sequence, the **[SEP]** token at the end of the sequence, and a run of **[PAD]** tokens that fills the sequence up to the maximum length of 512.

After this tokenization process, we need to proceed to the next step, which is adjusting the label of each token.

#### **Adjusting Label After Tokenization**

This is a very important step that we need to do after the tokenization process, because the length of the sequence no longer matches the length of the original label after tokenization.

The BERT tokenizer uses the so-called word-piece tokenizer under the hood, which is a sub-word tokenizer. This means that the BERT tokenizer may split one word into several meaningful sub-words.

As an example, let’s say we have the following sequence:

![](https://miro.medium.com/max/875/1*n66Vk78mSHXrxh45b3xlSw.png)

The example sentence has 14 whitespace-separated tokens in total (13 words plus the final period) and thus it also has 14 labels. However, after BERT tokenization, we get the following result:



In [19]:
print(tokenizer.convert_ids_to_tokens(text_tokenized["input_ids"][0]))

['[CLS]', 'Prime', 'Minister', 'G', '##ei', '##r', 'Ha', '##ard', '##e', 'has', 'refused', 'to', 'resign', 'or', 'call', 'for', 'early', 'elections', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',

There are two problems that we need to address after the tokenization process:

- The addition of special tokens from BERT such as **[CLS]**, **[SEP]**, and **[PAD]**
- The fact that some tokens are split into sub-words.

As a sub-word tokenizer, word-piece tokenization splits uncommon words into their sub-words, such as *‘Geir’* and *‘Haarde’* in the example above. This sub-word tokenization helps the BERT model to learn the semantic meaning of related words.
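
The splitting itself follows a greedy longest-match-first rule: at each position, take the longest piece that exists in the vocabulary, and mark pieces that continue a word with `##`. Below is a minimal sketch of that rule. The function name `wordpiece` and the tiny vocabulary `toy_vocab` are hypothetical; the real `bert-base-cased` vocabulary has tens of thousands of entries, and the real tokenizer also normalizes text and splits on punctuation first.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example
toy_vocab = {"Prime", "G", "##ei", "##r", "Ha", "##ard", "##e"}
for w in ["Prime", "Geir", "Haarde"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

A common word such as ‘Prime’ is found in the vocabulary as a whole, while rarer names are broken into several pieces, which matches the tokens printed above.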

The consequence of this word-piece tokenization and the addition of special tokens from BERT is that the sequence length after tokenization no longer matches the length of the initial label.

From the example above, there are now 512 tokens in total in the sequence after tokenization, while the length of the label is still the same as before. Also, the first token in the sequence is no longer the word ‘Prime’ but the newly added **[CLS]** token, so we need to shift our label as well.

To solve this problem, we need to adjust the label so that it has the same length as the sequence after tokenization. To do this, we can use the `word_ids` method of the tokenization result as follows:


In [20]:
word_ids = text_tokenized.word_ids()
print(tokenizer.convert_ids_to_tokens(text_tokenized["input_ids"][0]))
print(word_ids)

['[CLS]', 'Prime', 'Minister', 'G', '##ei', '##r', 'Ha', '##ard', '##e', 'has', 'refused', 'to', 'resign', 'or', 'call', 'for', 'early', 'elections', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',

As you can see from the code snippet above, all sub-words that were split from the same word share the same `word_ids` value, whereas special tokens from BERT such as **[CLS]**, **[SEP]**, and **[PAD]** have no `word_ids` value (it is `None`).
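
Since the printed output above is cut off, here is a small sketch of what the word ids look like for the first tokens of our example. The helper `word_ids_from_tokens` is hypothetical and only approximates the idea by using the `##` prefix; the real tokenizer derives the ids from its own bookkeeping of where each word starts and ends.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def word_ids_from_tokens(tokens):
    """Approximate word ids: special tokens get None, ## pieces reuse the current id."""
    word_ids, current = [], -1
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            word_ids.append(None)
        elif tok.startswith("##"):
            # Continuation piece: belongs to the same word as the previous piece.
            word_ids.append(current)
        else:
            # A new word starts here.
            current += 1
            word_ids.append(current)
    return word_ids


# The first tokens of the example, followed by [SEP] and one [PAD] for brevity
tokens = ['[CLS]', 'Prime', 'Minister', 'G', '##ei', '##r',
          'Ha', '##ard', '##e', 'has', '[SEP]', '[PAD]']
print(word_ids_from_tokens(tokens))
# [None, 0, 1, 2, 2, 2, 3, 3, 3, 4, None, None]
```

The three pieces of ‘Geir’ all map to word 2 and the three pieces of ‘Haarde’ to word 3, so each piece can be traced back to the label of its original word.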

These `word_ids` are very useful for adjusting the length of the label, by applying either of these two methods:

1. We only provide a label to the first sub-word of each split word. The remaining sub-words simply get ‘-100’ as a label. All tokens that don’t have `word_ids` are also labeled with ‘-100’.

2. We provide the same label to all of the sub-words that belong to the same word. All tokens that don’t have `word_ids` are labeled with ‘-100’.

The value ‘-100’ is the default `ignore_index` of PyTorch’s cross-entropy loss, so positions labeled with it do not contribute to the training loss. The function in the code snippet below implements both methods.

In [21]:
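def align_label_example(tokenized_input, labels):
    # Relies on the global `label_all_tokens` flag, which is set before each call below

    word_ids = tokenized_input.word_ids()

    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:

        if word_idx is None:
            # Special tokens ([CLS], [SEP], [PAD]) are ignored by the loss
            label_ids.append(-100)

        elif word_idx != previous_word_idx:
            # First sub-word of a word: use the word's label
            try:
                label_ids.append(labels_to_ids[labels[word_idx]])
            except (IndexError, KeyError):
                label_ids.append(-100)

        else:
            # Remaining sub-words: repeat the label or ignore them
            label_ids.append(labels_to_ids[labels[word_idx]] if label_all_tokens else -100)
        previous_word_idx = word_idx

    return label_ids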

If you want to apply the first method, set `label_all_tokens` to False. If you want to apply the second method, set `label_all_tokens` to True, as you can see in the following code snippet:

In [23]:
label = labels[36]

#If we set label_all_tokens to True.....
label_all_tokens = True

new_label = align_label_example(text_tokenized, label)
print(new_label)
print(tokenizer.convert_ids_to_tokens(text_tokenized["input_ids"][0]))

[-100, 16, 16, 6, 6, 6, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 

In [24]:
#If we set label_all_tokens to False.....
label_all_tokens = False

new_label = align_label_example(text_tokenized, label)
print(new_label)
print(tokenizer.convert_ids_to_tokens(text_tokenized["input_ids"][0]))

[-100, 16, 16, 6, -100, -100, 14, -100, -100, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -1

In the rest of this article, we’re going to implement the first method: we only provide a label to the first sub-word of each word, so we set `label_all_tokens` to False.

### **Dataset Class**

Before we train our BERT model for the NER task, we need to create a dataset class that generates and fetches the data in batches.

In [25]:
import torch

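def align_label(texts, labels):
    # Tokenize the whitespace-split words so that word_ids() indexes the same
    # units as `labels`; a raw string is also split on punctuation, which can
    # shift the word indices relative to the labels.
    tokenized_inputs = tokenizer(str(texts).split(), is_split_into_words=True,
                                 padding='max_length', max_length=512, truncation=True)

    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:

        if word_idx is None:
            # Special tokens are ignored by the loss
            label_ids.append(-100)

        elif word_idx != previous_word_idx:
            # First sub-word of a word: use the word's label
            try:
                label_ids.append(labels_to_ids[labels[word_idx]])
            except (IndexError, KeyError):
                label_ids.append(-100)
        else:
            # Remaining sub-words: repeat the label only if label_all_tokens is set
            try:
                label_ids.append(labels_to_ids[labels[word_idx]] if label_all_tokens else -100)
            except (IndexError, KeyError):
                label_ids.append(-100)
        previous_word_idx = word_idx

    return label_ids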

class DataSequence(torch.utils.data.Dataset):

    def __init__(self, df):

        lb = [i.split() for i in df['labels'].values.tolist()]
        txt = df['text'].values.tolist()
        self.texts = [tokenizer(str(i),
                               padding='max_length', max_length = 512, truncation=True, return_tensors="pt") for i in txt]
        self.labels = [align_label(i,j) for i,j in zip(txt, lb)]

    def __len__(self):

        return len(self.labels)

    def get_batch_data(self, idx):

        return self.texts[idx]

    def get_batch_labels(self, idx):

        return torch.LongTensor(self.labels[idx])

    def __getitem__(self, idx):

        batch_data = self.get_batch_data(idx)
        batch_labels = self.get_batch_labels(idx)

        return batch_data, batch_labels

In the code snippet above, we call the `BertTokenizerFast` tokenizer (the `tokenizer` variable) in the `__init__` function to tokenize our input texts, and the `align_label` function to adjust our labels after the tokenization process.

Next, let’s split our data randomly into training, validation, and test sets. Keep in mind, however, that the dataset contains 47,959 rows in total. Hence, for demonstration purposes and to speed up the training process, I’m going to take only 1,000 of them. You can, of course, use all of the data for model training.

In [26]:
import numpy as np

df = df[0:1000]
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                            [int(.8 * len(df)), int(.9 * len(df))])
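
The cell above shuffles the rows and cuts them at the 80% and 90% marks. The same idea can be sketched with the standard library only; the helper `split_indices` is hypothetical and works on row positions rather than on the dataframe itself, and its shuffle order will differ from the one pandas produces with `random_state=42`.

```python
import random


def split_indices(n, cuts=(0.8, 0.9), seed=42):
    """Shuffle the positions 0..n-1 and cut them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    train_end = int(cuts[0] * n)
    val_end = int(cuts[1] * n)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 1,000 rows this gives 800 training, 100 validation, and 100 test rows. The position lists could then be used to select rows, for example with `df.iloc[train_idx]`.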

### Model Building

In this article, we’re going to use a pretrained BERT base model from HuggingFace. Since we’re going to classify text at the token level, we need to use the `BertForTokenClassification` class.

The `BertForTokenClassification` class is a model that wraps the BERT model and adds a linear layer on top of it that acts as the token-level classifier.

In [28]:
from transformers import BertForTokenClassification

class BertModel(torch.nn.Module):

    def __init__(self):

        super(BertModel, self).__init__()

        self.bert = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

    def forward(self, input_id, mask, label):

        output = self.bert(input_ids=input_id, attention_mask=mask, labels=label, return_dict=False)

        return output

In the code snippet above, we instantiate the model and set the output size of the token classifier equal to the number of unique labels in our dataset, which in our case is 17.

Next, we will define a function for the training loop.

### Training Loop

The training loop for our BERT model is the standard PyTorch training loop with a few additions, as you can see below:

In [34]:
from torch.utils.data import DataLoader
from torch.optim import SGD
from tqdm import tqdm

def train_loop(model, df_train, df_val):

    train_dataset = DataSequence(df_train)
    val_dataset = DataSequence(df_val)

    train_dataloader = DataLoader(train_dataset, num_workers=4, batch_size=1, shuffle=True)
    val_dataloader = DataLoader(val_dataset, num_workers=4, batch_size=1)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    optimizer = SGD(model.parameters(), lr=LEARNING_RATE)

    if use_cuda:
        model = model.cuda()

    best_acc = 0
    best_loss = 1000

    for epoch_num in range(EPOCHS):

        total_acc_train = 0
        total_loss_train = 0

        model.train()

        for train_data, train_label in tqdm(train_dataloader):

            train_label = train_label[0].to(device)
            mask = train_data['attention_mask'][0].to(device)
            input_id = train_data['input_ids'][0].to(device)

            optimizer.zero_grad()
            loss, logits = model(input_id, mask, train_label)

            logits_clean = logits[0][train_label != -100]
            label_clean = train_label[train_label != -100]

            predictions = logits_clean.argmax(dim=1)

            acc = (predictions == label_clean).float().mean()
            total_acc_train += acc
            total_loss_train += loss.item()

            loss.backward()
            optimizer.step()

        model.eval()

        total_acc_val = 0
        total_loss_val = 0

        # Gradients are not needed for validation, so disable autograd tracking
        with torch.no_grad():
            for val_data, val_label in val_dataloader:

                val_label = val_label[0].to(device)
                mask = val_data['attention_mask'][0].to(device)
                input_id = val_data['input_ids'][0].to(device)

                loss, logits = model(input_id, mask, val_label)

                # Keep only positions whose label is not the ignore index (-100)
                logits_clean = logits[0][val_label != -100]
                label_clean = val_label[val_label != -100]

                predictions = logits_clean.argmax(dim=1)

                acc = (predictions == label_clean).float().mean()
                total_acc_val += acc
                total_loss_val += loss.item()

        val_accuracy = total_acc_val / len(df_val)
        val_loss = total_loss_val / len(df_val)

        print(
            f'Epochs: {epoch_num + 1} | Loss: {total_loss_train / len(df_train): .3f} | Accuracy: {total_acc_train / len(df_train): .3f} | Val_Loss: {total_loss_val / len(df_val): .3f} | Accuracy: {total_acc_val / len(df_val): .3f}')

LEARNING_RATE = 1e-2
EPOCHS = 5

model = BertModel()
train_loop(model, df_train, df_val)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

Epochs: 1 | Loss:  0.490 | Accuracy:  0.873 | Val_Loss:  0.349 | Accuracy:  0.904


100%|██████████| 800/800 [00:54<00:00, 14.68it/s]


Epochs: 2 | Loss:  0.337 | Accuracy:  0.906 | Val_Loss:  0.295 | Accuracy:  0.918


100%|██████████| 800/800 [00:54<00:00, 14.67it/s]


Epochs: 3 | Loss:  0.271 | Accuracy:  0.919 | Val_Loss:  0.297 | Accuracy:  0.918


100%|██████████| 800/800 [00:54<00:00, 14.68it/s]


Epochs: 4 | Loss:  0.218 | Accuracy:  0.931 | Val_Loss:  0.273 | Accuracy:  0.927


100%|██████████| 800/800 [00:54<00:00, 14.68it/s]


Epochs: 5 | Loss:  0.173 | Accuracy:  0.944 | Val_Loss:  0.271 | Accuracy:  0.931


In the training loop above, I train the model for only 5 epochs and use SGD as the optimizer. The loss for each batch is computed by the `BertForTokenClassification` class itself.

Each epoch also includes an important step: after the model makes its predictions, we need to ignore every token whose label is `-100`. This is what the `logits_clean` and `label_clean` lines do, in both the training and the validation parts of the loop.
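As a minimal illustration of that masking step, the sketch below uses plain Python lists instead of tensors. The label ids and predictions are made-up values, and `masked_accuracy` is a hypothetical helper that is not part of the notebook.

```python
IGNORE_INDEX = -100  # label given to special tokens, padding and skipped sub-words


def masked_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Accuracy over the positions whose label is not ignore_index."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != ignore_index]
    if not kept:
        return 0.0
    correct = sum(1 for p, l in kept if p == l)
    return correct / len(kept)


# Made-up example: [CLS], three labelled tokens, one skipped sub-word, [SEP]
labels = [-100, 2, 0, 5, -100, -100]
predictions = [7, 2, 0, 1, 3, 7]

print(masked_accuracy(predictions, labels))  # 2 of the 3 kept positions match
```

Whatever the model predicts at the ignored positions has no effect on the score, which is the same reason the training loop filters them out before computing accuracy.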

Of course, the output that you’ll see may vary when you train your own BERT model as there is stochasticity in the training process.

There are many ways to improve the model's performance. Notice that we have a class imbalance problem: a large share of the tokens carry the `O` label. One option is to apply class weights during training.
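One possible way to derive such class weights is inverse label frequency, sketched below in plain Python. The tag sequences are invented for illustration and `inverse_frequency_weights` is a hypothetical helper. The resulting list could then be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss` (together with `ignore_index=-100`), which would mean computing the loss from the logits yourself instead of using the loss returned by `BertForTokenClassification`.

```python
from collections import Counter


def inverse_frequency_weights(tag_sequences, labels):
    """Weight each label by total / (n_labels * count), so rare labels weigh more."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    weights = []
    for label in labels:
        # Labels that never occur get weight 0.0 to avoid dividing by zero.
        count = counts.get(label, 0)
        weights.append(total / (len(labels) * count) if count else 0.0)
    return weights


# Invented training tags, dominated by 'O' as in the real data.
tag_sequences = [
    ['B-per', 'I-per', 'O', 'O', 'O', 'O', 'B-org'],
    ['O', 'O', 'O', 'B-org', 'O'],
]
labels = ['O', 'B-per', 'I-per', 'B-org']

for label, weight in zip(labels, inverse_frequency_weights(tag_sequences, labels)):
    print(f'{label}: {weight:.2f}')
```

The order of `labels` has to match the label ids the model was trained with, otherwise each weight would be applied to the wrong class.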

Also, you can try different optimizers such as the Adam optimizer with weight decay regularization.
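For instance, PyTorch ships `torch.optim.AdamW`, an Adam variant with decoupled weight decay. The sketch below is only an illustration: the tiny `torch.nn.Linear` model stands in for the BERT model, and the learning rate and weight decay values are assumptions, not settings tuned for this notebook.

```python
import torch

# Stand-in model; in the notebook this would be the BertModel instance.
toy_model = torch.nn.Linear(4, 2)

# Assumed hyperparameters, chosen only for illustration.
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-5, weight_decay=0.01)

# One optimization step on random data to show the usual call order.
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(toy_model(inputs), targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Inside `train_loop`, an optimizer built this way would take the place of the SGD optimizer; the rest of the loop would stay the same.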

### Evaluate Model on Test Data

Now that we have trained our model, we can evaluate its performance on unseen test data with the following code snippet.

In [35]:
def evaluate(model, df_test):

    test_dataset = DataSequence(df_test)

    test_dataloader = DataLoader(test_dataset, num_workers=4, batch_size=1)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    # Disable dropout so that the predictions are deterministic.
    model.eval()

    total_acc_test = 0.0

    # Gradients are not needed for evaluation.
    with torch.no_grad():
        for test_data, test_label in test_dataloader:

            test_label = test_label[0].to(device)
            mask = test_data['attention_mask'][0].to(device)
            input_id = test_data['input_ids'][0].to(device)

            loss, logits = model(input_id, mask, test_label.long())

            # Drop the positions labelled -100 before scoring.
            logits_clean = logits[0][test_label != -100]
            label_clean = test_label[test_label != -100]

            predictions = logits_clean.argmax(dim=1)

            acc = (predictions == label_clean).float().mean()
            total_acc_test += acc

    # Average of the per-sentence accuracies.
    test_accuracy = total_acc_test / len(df_test)
    print(f'Test Accuracy: {test_accuracy: .3f}')


evaluate(model, df_test)


Test Accuracy:  0.925


In my case, the trained model reached an average accuracy of about 92.5% on the test set, as the output above shows. You can, of course, swap accuracy for other metrics such as F1 score, precision, or recall.
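As a rough sketch of what such metrics could look like, the function below computes token-level precision, recall, and F1 for a single tag from two lists of tags. It is plain Python with invented predictions, and `token_prf` is a hypothetical helper rather than something defined in this notebook.

```python
def token_prf(true_tags, pred_tags, target):
    """Token-level precision, recall and F1 for a single tag."""
    pairs = list(zip(true_tags, pred_tags))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented gold and predicted tags for one sentence.
true_tags = ['B-per', 'I-per', 'O', 'O', 'B-org', 'O']
pred_tags = ['B-per', 'O', 'O', 'B-org', 'B-org', 'O']

for tag in ['B-per', 'I-per', 'B-org']:
    p, r, f = token_prf(true_tags, pred_tags, tag)
    print(f'{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

Because most tokens are `O`, plain accuracy can look high even when entity tags are often missed; per-tag scores like these make that visible.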

Alternatively, we can use the trained model to predict the entity of each word of a text or a sentence with the following code:

In [36]:
def align_word_ids(texts):

    tokenized_inputs = tokenizer(texts, padding='max_length', max_length=512, truncation=True)

    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:

        if word_idx is None:
            # Special tokens and padding are ignored.
            label_ids.append(-100)

        elif word_idx != previous_word_idx:
            # First sub-word of a word: keep its prediction.
            label_ids.append(1)

        else:
            # Later sub-words of the same word are kept only if label_all_tokens is set.
            label_ids.append(1 if label_all_tokens else -100)

        previous_word_idx = word_idx

    return label_ids


def evaluate_one_text(model, sentence):

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    # Disable dropout for inference.
    model.eval()

    text = tokenizer(sentence, padding='max_length', max_length=512, truncation=True, return_tensors="pt")

    mask = text['attention_mask'][0].unsqueeze(0).to(device)
    input_id = text['input_ids'][0].unsqueeze(0).to(device)

    # 1 marks the positions whose prediction we keep, -100 the ones we skip.
    label_ids = torch.Tensor(align_word_ids(sentence)).unsqueeze(0).to(device)

    # No labels are passed at inference time, and no gradients are needed.
    with torch.no_grad():
        logits = model(input_id, mask, None)
    logits_clean = logits[0][label_ids != -100]

    predictions = logits_clean.argmax(dim=1).tolist()
    prediction_label = [ids_to_labels[i] for i in predictions]
    print(sentence)
    print(prediction_label)
            
evaluate_one_text(model, 'Bill Gates is the founder of Microsoft')

Bill Gates is the founder of Microsoft
['B-per', 'I-per', 'O', 'O', 'O', 'O', 'B-org']


If everything works as expected, the model predicts the entity of each word in an unseen sentence reasonably well, as the output above shows.
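The per-word tags printed above can also be merged into whole entities. The sketch below shows one simple way to do that, assuming there is exactly one tag per whitespace-separated word, as in the output above; `group_entities` is a hypothetical helper and not part of the notebook.

```python
def group_entities(words, tags):
    """Merge B-/I- tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a new entity, closing any open one.
            if current_words:
                entities.append((current_type, ' '.join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith('I-') and current_type == tag[2:]:
            current_words.append(word)
        else:
            # 'O' or an I- tag that does not continue the open entity closes it.
            if current_words:
                entities.append((current_type, ' '.join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, ' '.join(current_words)))
    return entities


sentence = 'Bill Gates is the founder of Microsoft'
tags = ['B-per', 'I-per', 'O', 'O', 'O', 'O', 'B-org']

print(group_entities(sentence.split(), tags))
# [('per', 'Bill Gates'), ('org', 'Microsoft')]
```

An `I-` tag that does not follow a matching `B-` tag is simply dropped here; other conventions are possible, for example treating it as the start of a new entity.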

## Conclusion

In this article, we implemented BERT for a Named Entity Recognition (NER) task. This means that we trained a BERT model to predict the IOB tags of a custom text or sentence at the token level.