# Named Entity Recognition in Mandarin on the MSRA/SIGHAN2006 Dataset

---

[Github](https://github.com/eugenesiow/practical-ml/blob/master/notebooks/Named_Entity_Recognition_Mandarin_MSRA.ipynb) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

Notebook to train/fine-tune a pre-trained Chinese BERT model to perform named entity recognition (NER).

The [dataset](https://github.com/yzwww2019/Sighan-2006-NER-dataset) used is the SIGHAN 2006 dataset, commonly known as the MSRA NER dataset. It contains 46,364 samples in the training set and 4,365 samples in the test set. The original workshop paper for the dataset is by [Levow (2006)](https://faculty.washington.edu/levow/papers/sighan06.pdf).

The current state-of-the-art model on this dataset is the Lattice LSTM from [Zhang et al. (2018)](https://arxiv.org/pdf/1805.02023.pdf) with an F1-score of **93.2%**.

Our BERT model (trained for only 1 epoch) has an F1-score of **93.9%**, which is slightly better than the state-of-the-art!

The notebook is structured as follows:
* Setting up the GPU Environment
* Getting Data
* Training and Testing the Model
* Using the Model (Running Inference)

## Task Description

> Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. O is used for non-entity tokens.
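To make the BIO notation concrete, here is a minimal sketch that turns character spans into per-character tags. The function name `bio_tags` and the span format `(start, end, type)` are assumptions made for illustration; they are not part of the dataset or of any library used in this notebook.

```python
def bio_tags(text, entities):
    """Return one BIO tag per character of `text`.

    `entities` is a list of (start, end, type) character spans, end exclusive.
    This span format is a hypothetical convention chosen for the example.
    """
    tags = ['O'] * len(text)
    for start, end, ent_type in entities:
        tags[start] = 'B-' + ent_type          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + ent_type          # remaining characters of the entity
    return tags

text = '我是新加坡人'
tags = bio_tags(text, [(2, 5, 'LOC')])
for char, tag in zip(text, tags):
    print(char, tag)
```

Every character outside an entity keeps the `O` tag, so the output has exactly one tag per character, which matches the character-level format used later in the notebook.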

# Setting up the GPU Environment

#### Ensure we have a GPU runtime

If you're running this notebook in Google Colab, select `Runtime` > `Change Runtime Type` from the menubar. Ensure that `GPU` is selected as the `Hardware accelerator`. This will allow us to use the GPU to train the model subsequently.
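If you want to confirm from code that a GPU is visible, a rough check is sketched below. It assumes the `nvidia-smi` tool is on the `PATH`, as is typically the case on GPU runtimes; the helper name `gpu_visible` is hypothetical.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU."""
    if shutil.which('nvidia-smi') is None:
        return False                      # tool not installed, so no NVIDIA driver
    try:
        result = subprocess.run(['nvidia-smi', '-L'],
                                capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and 'GPU' in result.stdout

has_gpu = gpu_visible()
print('GPU runtime detected' if has_gpu else 'No GPU detected; check the runtime type')
```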

#### Install Dependencies and Restart Runtime

In [None]:
!pip install -q transformers
!pip install -q simpletransformers


You might see the error `ERROR: google-colab X.X.X has requirement ipykernel~=X.X, but you'll have ipykernel X.X.X which is incompatible` after installing the dependencies. **This is normal** and caused by the `simpletransformers` library.

The **solution** to this will be to **reset the execution environment** now. Go to the menu `Runtime` > `Restart runtime` then continue on from the next section to download and process the data.
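After the restart, it can be useful to confirm that the packages are still installed. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is an assumption for this example.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

versions = installed_versions(['transformers', 'simpletransformers'])
for name, version in versions.items():
    print(name, version or 'NOT INSTALLED')
```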

# Getting Data

#### Pulling the data from Github

The dataset, which includes train and test sets, is pulled from a [Github repository](https://github.com/yzwww2019/Sighan-2006-NER-dataset).

In [None]:
import urllib.request
from pathlib import Path

def download_file(url, output_file):
  # Create the target directory if needed, then fetch the file.
  Path(output_file).parent.mkdir(parents=True, exist_ok=True)
  urllib.request.urlretrieve(url, output_file)

download_file('https://raw.githubusercontent.com/yzwww2019/Sighan-2006-NER-dataset/master/train.txt', '/content/data/train.txt')
download_file('https://raw.githubusercontent.com/yzwww2019/Sighan-2006-NER-dataset/master/test.txt', '/content/data/test.txt')

Since the data is in the CoNLL `BIO` format (you can read more about the tagging format in this [Wikipedia article](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))), we need to load it into a `pandas` dataframe with the following function. The 3 columns in the dataframe are a word token (for Mandarin this is a single character), a `BIO` label and a `sentence_id` to differentiate samples/sentences.

In [None]:
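import pandas as pd

def read_conll(filename):
    # Each non-blank line is "<character>\t<label>"; blank lines separate sentences.
    df = pd.read_csv(filename,
                    sep = '\t', header = None, keep_default_na = False,
                    names = ['words', 'labels'], skip_blank_lines = False)
    # Blank lines load as empty strings; counting them gives a running sentence id.
    df['sentence_id'] = (df.words == '').cumsum()
    # Drop the blank separator rows themselves.
    return df[df.words != '']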
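The blank-line counting in `read_conll` can be hard to picture, so here is a pandas-free sketch of the same idea on a tiny in-memory sample. The function name `read_conll_lines` and the sample rows are made up for illustration; the logic is meant to mirror the cumulative count of blank lines, not to replace the function above.

```python
def read_conll_lines(lines):
    """Return (word, label, sentence_id) rows from tab-separated CoNLL lines."""
    rows = []
    sentence_id = 0
    for line in lines:
        line = line.rstrip('\n')
        if line == '':
            sentence_id += 1          # a blank line closes the current sentence
            continue                  # and is not kept as a row
        word, label = line.split('\t')
        rows.append((word, label, sentence_id))
    return rows

# Two short sentences separated by a blank line (hypothetical sample data).
sample = ['新\tB-LOC', '加\tI-LOC', '坡\tI-LOC', '', '我\tO', '们\tO', '']
rows = read_conll_lines(sample)
for row in rows:
    print(row)
```

Characters before the first blank line get `sentence_id` 0, those after it get 1, and so on, which is what the cumulative sum over empty `words` values produces in the dataframe version.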

Now we run the function on the train and test sets we have downloaded from Github. We also call `.head(100)` on the training dataframe to check that the words, labels and sentence_id have been split properly.

In [None]:
train_df = read_conll('/content/data/train.txt')
test_df = read_conll('/content/data/test.txt')
train_df.head(100)

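| | words | labels | sentence_id |
|---|---|---|---|
| 0 | 当 | O | 0 |
| 1 | 希 | O | 0 |
| 2 | 望 | O | 0 |
| 3 | 工 | O | 0 |
| 4 | 程 | O | 0 |
| ... | ... | ... | ... |
| 97 | 夺 | O | 2 |
| 98 | 文 | O | 2 |
| 99 | 物 | O | 2 |
| 100 | 详 | O | 2 |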


We now print out the statistics of the train and test sets. We can see that we have the expected split of 46,364 samples in the training set and 4,365 samples in the test set.

In [None]:
data = [[train_df['sentence_id'].nunique(), test_df['sentence_id'].nunique()]]

# Prints out the number of sentences (samples) in the train and test sets.
pd.DataFrame(data, columns=["Train", "Test"])

| | Train | Test |
|---|---|---|
| 0 | 46364 | 4365 |
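Beyond sentence counts, it can help to know how many entities of each type the data contains. A minimal sketch, assuming every entity starts with exactly one `B-` tag: count the `B-` labels per type. The helper name `entity_counts` and the sample labels are hypothetical; in the notebook you could pass `train_df['labels']` instead.

```python
from collections import Counter

def entity_counts(labels):
    """Count entities per type by counting their B- tags."""
    return Counter(label[2:] for label in labels if label.startswith('B-'))

# Hypothetical label sequence; replace with train_df['labels'] in the notebook.
labels = ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC']
counts = entity_counts(labels)
print(counts)
```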


# Training and Testing the Model

#### Set up the Training Arguments

We set up the training arguments. Here we train for only 1 epoch to keep the training time as short as possible (we are impatient). We enable a sliding window because NER sequences can be quite long and, with limited GPU memory, we cannot make `max_seq_length` very large.

In [None]:
train_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'sliding_window': True,
    'max_seq_length': 64,
    'num_train_epochs': 1,
    'train_batch_size': 32,
    'fp16': True,
    'output_dir': '/outputs/',
}
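The idea behind a sliding window is to cut a long sequence into overlapping chunks that each fit within the maximum length. The sketch below illustrates the general technique only: the library's own windowing operates on tokenizer output and its exact overlap settings are not shown in this notebook, so the function name, window size and stride here are illustrative assumptions.

```python
def sliding_windows(tokens, window_size, stride):
    """Split `tokens` into windows of at most `window_size`, advancing by `stride`."""
    if window_size <= 0 or stride <= 0:
        raise ValueError('window_size and stride must be positive')
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break                     # this window already reaches the end
        start += stride               # a stride below window_size makes windows overlap
    return windows

tokens = list(range(10))
windows = sliding_windows(tokens, window_size=4, stride=3)
print(windows)
```

With a stride smaller than the window size, neighbouring windows share tokens, so an entity that straddles a window boundary still appears whole in at least one window as long as it is no longer than the overlap plus one.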

#### Train the Model

Once we have set up the `train_args` dictionary, the next step is to train the model. We use the pre-trained Mandarin BERT model, `bert-base-chinese`, from the awesome [Hugging Face Transformers](https://github.com/huggingface/transformers) library as the base, and use the [Simple Transformers library](https://simpletransformers.ai/docs/classification-models/) on top of it so that we can train the NER (sequence tagging) model with just a few lines of code.

In [None]:
from simpletransformers.ner import NERModel
from transformers import AutoTokenizer
import pandas as pd
import logging

logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# We use the bert-base-chinese pre-trained model.
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
model = NERModel('bert', 'bert-base-chinese', args=train_args)

# Train the model, there is no development or validation set for this dataset 
# https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model.train_model(train_df)

# Evaluate the model on the test set (reports loss, precision, recall and F1-score)
result, model_outputs, preds_list = model.eval_model(test_df)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-c

INFO:simpletransformers.ner.ner_model: Training of bert model complete. Saved to /outputs/.
INFO:simpletransformers.ner.ner_model: Converting to features started.


INFO:simpletransformers.ner.ner_model:{'eval_loss': 0.02290273037066645, 'precision': 0.9304742112223692, 'recall': 0.9478444957659738, 'f1_score': 0.939079035179712}


The F1-score for the model is **93.9%** ('f1_score': 0.939079035179712).

That score is better than the previous state-of-the-art model with **93.2%** by about 0.7 percentage points (absolute).

> We have a new SOTA NER model in Mandarin!
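As a sanity check on the reported numbers, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values in the evaluation log above; the function name `f1_score` is our own and is not taken from the library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation log above.
precision = 0.9304742112223692
recall = 0.9478444957659738
f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.9391
```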

# Using the Model (Running Inference)

Running the model to make predictions is as simple as calling `model.predict(samples)`. Note that for Mandarin the input must be tokenized at the character level, with a space between each character (e.g. `一 节 课 的 时 间`), so that the tokenizer splits the text into tokens properly. Keep this in mind if you are preprocessing text for the model when building an app.
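If your input arrives as ordinary unspaced text, a small helper can produce the spaced format. This is a minimal sketch; the name `space_separate` is hypothetical, and it splits every character, so Latin words and numbers would also be broken into single characters.

```python
def space_separate(text):
    """Insert a space between every character, dropping existing whitespace."""
    return ' '.join(ch for ch in text if not ch.isspace())

spaced = space_separate('一节课的时间')
print(spaced)  # 一 节 课 的 时 间
```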

In [None]:
samples = ['我 的 名 字 叫 蕭 文 仁 。 我 是 新 加 坡 人 。']
predictions, _ = model.predict(samples)
for idx, sample in enumerate(samples):
  print('{}: '.format(idx))
  for word in predictions[idx]:
    print('{}'.format(word))

0: 
{'我': 'O'}
{'的': 'O'}
{'名': 'O'}
{'字': 'O'}
{'叫': 'O'}
{'蕭': 'B-PER'}
{'文': 'I-PER'}
{'仁': 'I-PER'}
{'。': 'O'}
{'我': 'O'}
{'是': 'O'}
{'新': 'B-LOC'}
{'加': 'I-LOC'}
{'坡': 'I-LOC'}
{'人': 'O'}
{'。': 'O'}
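For an application you usually want whole entities rather than per-character tags. The sketch below groups a prediction in the format printed above (a list of single-item `{character: label}` dicts) into `(text, type)` pairs. The function name `extract_entities` is an assumption, and an `I-` tag that does not continue a matching entity is simply dropped, which is one of several reasonable choices.

```python
def extract_entities(tagged):
    """Group a list of single-item {token: label} dicts into (text, type) entities."""
    entities = []
    current_text, current_type = '', None
    for item in tagged:
        token, label = next(iter(item.items()))
        if label.startswith('B-'):
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = token, label[2:]   # start a new entity
        elif label.startswith('I-') and current_type == label[2:]:
            current_text += token                           # continue the entity
        else:
            # 'O' or a stray I- tag: close any open entity and skip this token
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = '', None
    if current_type is not None:
        entities.append((current_text, current_type))
    return entities

# The prediction shown above, written out by hand.
tagged = [{'我': 'O'}, {'的': 'O'}, {'名': 'O'}, {'字': 'O'}, {'叫': 'O'},
          {'蕭': 'B-PER'}, {'文': 'I-PER'}, {'仁': 'I-PER'}, {'。': 'O'},
          {'我': 'O'}, {'是': 'O'}, {'新': 'B-LOC'}, {'加': 'I-LOC'},
          {'坡': 'I-LOC'}, {'人': 'O'}, {'。': 'O'}]
entities = extract_entities(tagged)
print(entities)
```

In the notebook, the same function could be applied to each element of `predictions`.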


You can connect to Google Drive with the following code to save any files you want to persist. You can also click the `Files` icon on the left panel and click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the model checkpoint files, which are saved in the `/outputs/` directory, to your Google Drive.

In [None]:
import shutil
shutil.move('/outputs/', "/content/drive/My Drive/outputs/")
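One thing to watch with `shutil.move` is that if the destination directory already exists, the source is moved inside it rather than replacing it. The sketch below is a more defensive variant that refuses to touch an existing destination. The helper name `move_outputs` is hypothetical, and the demonstration uses a temporary directory as a stand-in for `/outputs/` and the mounted Drive.

```python
import shutil
import tempfile
from pathlib import Path

def move_outputs(src, dst):
    """Move directory `src` to `dst`, refusing to overwrite or nest into an existing `dst`."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f'{src} is not a directory')
    if dst.exists():
        raise FileExistsError(f'{dst} already exists; choose a new name')
    dst.parent.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(dst)))

# Demonstration in a temporary directory (stand-in for /outputs/ and Drive).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'outputs'
    src.mkdir()
    (src / 'config.json').write_text('{}')
    moved = move_outputs(src, Path(tmp) / 'drive' / 'outputs')
    moved_ok = (moved / 'config.json').exists() and not src.exists()
print(moved_ok)
```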

More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Do drop us some feedback on how to improve the notebooks on the [Github repo](https://github.com/eugenesiow/practical-ml/).