# Named Entity Recognition in Mandarin on a Weibo Social Media Dataset

---

[Github](https://github.com/eugenesiow/practical-ml/blob/master/notebooks/Named_Entity_Recognition_Mandarin_Weibo.ipynb) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

Notebook to train a [flair](https://github.com/flairNLP/flair) model in Mandarin using stacked embeddings (word, byte-pair and BERT embeddings) to perform named entity recognition (NER).

The [dataset](https://github.com/hltcoe/golden-horse) used contains 1,890 Sina Weibo messages annotated with four entity types (person, organization, location and geo-political entity), covering both named and nominal mentions. It comes from the paper [Peng et al. (2015)](https://www.aclweb.org/anthology/D15-1064/), with revised annotations from [He et al. (2016)](https://arxiv.org/abs/1611.04234).

The current state-of-the-art model on this dataset is from [Peng et al. (2016)](https://www.aclweb.org/anthology/P16-2025/) with an average F1-score of **47.0%** (Table 1) and from [Peng et al. (2015)](https://www.aclweb.org/anthology/D15-1064.pdf) with an F1-score of **44.1%** (Table 2). The authors say that the poor results on the test set show the "difficulty of this task", which is true in a sense: the dataset is quite small for an NER task with 4 entity classes (doubled to 8, since named and nominal mentions are tagged separately), and the test set has only 270 sentences.

Our flair model improves on the state-of-the-art with an F1-score of **67.5%**, more than 20 absolute percentage points above the previous best result.

The notebook is structured as follows:
* Setting up the GPU Environment
* Getting Data
* Training and Testing the Model
* Using the Model (Running Inference)

## Task Description

> Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. O is used for non-entity tokens.
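
To make the BIO notation concrete, here is a minimal sketch that groups BIO-tagged tokens back into entities. The helper name `bio_to_spans` and the hand-written tag sequence are assumptions for illustration only; they are not part of flair or of the dataset loader.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current_tokens:
                spans.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I-" tag) ends the current entity.
            if current_tokens:
                spans.append(("".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append(("".join(current_tokens), current_type))
    return spans


# Hand-written example: one token per character, as in this dataset.
tokens = list("李开复感动")
tags = ["B-PER.NAM", "I-PER.NAM", "I-PER.NAM", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('李开复', 'PER.NAM')]
```

Tokens are joined without spaces because each token here is a single Chinese character; for space-delimited languages you would join with `" "` instead.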

# Setting up the GPU Environment

#### Ensure we have a GPU runtime

If you're running this notebook in Google Colab, select `Runtime` > `Change Runtime Type` from the menubar. Ensure that `GPU` is selected as the `Hardware accelerator`. This will allow us to use the GPU to train the model subsequently.

#### Install Dependencies

In [2]:
pip install -q flair

# Getting Data

The dataset, including the train, dev and test sets, was added in the `0.7 release` of flair, so we simply use the `flair.datasets` loader to load the `WEIBO_NER` dataset into a flair `Corpus`. The [raw datasets](https://github.com/87302380/WEIBO_NER) are also available on GitHub.

In [1]:
import flair.datasets
from flair.data import Corpus
corpus = flair.datasets.WEIBO_NER()
print(corpus)

2020-12-23 06:35:01,004 Reading data from /root/.flair/datasets/weibo_ner
2020-12-23 06:35:01,009 Train: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.train
2020-12-23 06:35:01,009 Dev: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.dev
2020-12-23 06:35:01,010 Test: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.test
Corpus: 1350 train + 270 dev + 270 test sentences


We can see that the total 1,890 sentences have already been split into train (1,350), dev (270) and test (270) sets in a 5:1:1 ratio.
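
As a quick sanity check of the split, the sketch below recomputes the total and the ratio from the sizes printed by the corpus. The dictionary is typed in by hand from the output above rather than read from the `Corpus` object.

```python
from functools import reduce
from math import gcd

# Split sizes copied by hand from the printed corpus summary.
split_sizes = {"train": 1350, "dev": 270, "test": 270}

total = sum(split_sizes.values())
# Divide every split by the greatest common divisor to get the simplest ratio.
divisor = reduce(gcd, split_sizes.values())
ratio = ":".join(str(size // divisor) for size in split_sizes.values())

print(total)  # 1890
print(ratio)  # 5:1:1
for name, size in split_sizes.items():
    print(f"{name}: {size / total:.1%}")
```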

# Training and Testing the Model

#### Train the Model

To train the flair `SequenceTagger`, we pass the corpus and the tagger to a `ModelTrainer` object. We largely rely on flair's sensible default options in the `.train()` method, while specifying the output folder for the `SequenceTagger` model to be `/content/model/`. We also set `embeddings_storage_mode` to `gpu` so that the embeddings are kept in GPU memory for more speed. Note that with a larger dataset you might run out of GPU memory; in that case, set this option to `cpu`. Training will still run on the GPU, but the embeddings will be stored in CPU memory and transferred to the GPU each epoch.

Be prepared to let the training run for about 0.5 to 1 hour. We set `max_epochs` to 50 so that training completes faster; for a higher F1-score you can increase this number to 100 or 150.

In [2]:
import flair
from typing import List
from flair.trainers import ModelTrainer
from flair.models import SequenceTagger
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, BertEmbeddings, BytePairEmbeddings

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# For an even faster training time, you can comment out the BytePairEmbeddings
# Note: there will be a small drop in performance if you do so.
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('zh-crawl'),
    BytePairEmbeddings('zh'),
    BertEmbeddings('bert-base-chinese'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('/content/model/',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=50,
              embeddings_storage_mode='gpu')


2020-12-23 06:35:34,075 ----------------------------------------------------------------------------------------------------
2020-12-23 06:35:34,078 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('zh-crawl')
    (list_embedding_1): BytePairEmbeddings(model=1-bpe-zh-100000-50)
    (list_embedding_2): BertEmbeddings(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(21128, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
  

{'dev_loss_history': [15.872177124023438,
  17.027008056640625,
  9.663846015930176,
  6.951746463775635,
  6.2898736000061035,
  5.883848667144775,
  4.86557149887085,
  4.57301664352417,
  4.379792213439941,
  4.182097911834717,
  3.9913041591644287,
  3.9053568840026855,
  3.975558042526245,
  3.875393867492676,
  4.1406426429748535,
  4.614320755004883,
  3.71307373046875,
  3.8614141941070557,
  3.892859697341919,
  3.6830766201019287,
  4.1208062171936035,
  3.6688966751098633,
  3.964456796646118,
  3.8440518379211426,
  3.710570812225342,
  3.776811361312866,
  3.6381967067718506,
  3.6773457527160645,
  3.7301440238952637,
  3.7477946281433105,
  3.6717283725738525,
  3.706446409225464,
  3.7481448650360107,
  3.5523171424865723,
  3.80267333984375,
  3.768207550048828,
  3.7337565422058105,
  3.701817035675049,
  3.700406789779663,
  3.7018449306488037,
  3.700836658477783,
  3.7391464710235596,
  3.683899402618408,
  3.7267754077911377,
  3.747907876968384,
  3.7125756740570

We see that the F1-score for our new model is **67.5%** (F1-score (micro) 0.6748). We report the micro F1-score (rather than the macro F1-score) because this setup has multiple entity classes with [class imbalance](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin).
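
To illustrate how micro and macro averaging differ under class imbalance, here is a small sketch with two classes. The counts are invented for illustration and are NOT taken from this notebook's training run; the class names are borrowed from the label format seen in the dataset.

```python
from typing import Dict, Tuple

# Hypothetical (true positive, false positive, false negative) counts per class.
# These numbers are made up for illustration, not results from the model above.
counts: Dict[str, Tuple[int, int, int]] = {
    "PER.NAM": (90, 10, 10),  # frequent class, predicted well
    "GPE.NAM": (5, 5, 15),    # rare class, predicted poorly
}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Macro: average the per-class F1 scores, so every class counts equally.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts over all classes first, so frequent classes dominate.
tp, fp, fn = (sum(column) for column in zip(*counts.values()))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.4f}")  # macro F1: 0.6167
print(f"micro F1: {micro_f1:.4f}")  # micro F1: 0.8261
```

With these made-up counts the rare class pulls the macro average down sharply, while the micro average stays close to the score of the frequent class.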

> We have a new SOTA NER model in Mandarin, over 20 percentage points (absolute) better than the previous SOTA for this Weibo dataset!

# Using the Model (Running Inference)

Running the model to do predictions/inference is as simple as calling `tagger.predict(sentence)`. Note that for Mandarin, the input text needs a space between each character (e.g. `一 节 课 的 时 间`) so that the tokenizer splits it into one token per character. Keep this in mind if you are preparing raw text as input to the model when building an app. For more information on this, check out the [flair tutorial on tokenization](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_1_BASICS.md#tokenization).
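
A minimal sketch of that preprocessing step in plain Python is shown below. The helper name `space_out_characters` is an assumption for illustration and is not part of flair. Note that it also splits any Latin words or numbers into single characters, which may not be what you want for mixed-language text.

```python
def space_out_characters(text: str) -> str:
    """Put a single space between characters, dropping existing whitespace."""
    return " ".join(ch for ch in text if not ch.isspace())


print(space_out_characters("一节课的时间"))  # 一 节 课 的 时 间
```

The resulting string can then be wrapped in a flair `Sentence` and passed to `tagger.predict`.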

In [34]:
import flair.datasets
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the model that we trained, you can comment this out if you already have 
# the model loaded (e.g. if you just ran the training)
tagger: SequenceTagger = SequenceTagger.load("/content/model/final-model.pt")

# Load the WEIBO corpus and use the first 5 sentences from the test set
corpus = flair.datasets.WEIBO_NER()
for idx in range(0, 5):
  sentence = corpus.test[idx]
  tagger.predict(sentence)
  print(sentence.to_tagged_string())

2020-12-23 07:41:49,576 Reading data from /root/.flair/datasets/weibo_ner
2020-12-23 07:41:49,577 Train: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.train
2020-12-23 07:41:49,578 Dev: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.dev
2020-12-23 07:41:49,579 Test: /root/.flair/datasets/weibo_ner/weiboNER_2nd_conll_format.test
一 节 课 的 时 间 真 心 感 动 了 李 <B-PER.NAM> 开 <I-PER.NAM> 复 <E-PER.NAM> 感 动
回 复 支 持 ， 赞 成 ， 哈 哈 米 八 吴 够 历 史 要 的 陈 <B-PER.NAM> 小 <I-PER.NAM> 奥 <E-PER.NAM> 丁 丁 我 爱 小 肥 肥 一 族 大 头 仔 大 家 团 结 一 致 ， 誓 要 去 台 <B-GPE.NAM> 湾 <E-GPE.NAM> 饮 喜 酒 ， 由 包 机 ， 团 结 的 力 量 大
剑 网 乱 世 长 安 公 测 盛 典 今 日 开 启 ， 海 量 豪 礼 火 爆 开 送 精 美 挂 件 、 听 雨 · 汉 服 娃 娃 、 诙 谐 双 骑 独 轮 车 等 你 来 拿 ， 更 有 千 台 红 米 手 机 、 等 十 四 重 惊 喜 转 发 即 抽 活 动 地 址 ： 已 有 人 参 与 剑 网 官 方 微 博 剑 网 客 户 服 务
在 街 上 听 见 音 乐 我 舞 动 起 来 很 丢 人 ？ 真 的 很 丢 人 吗 ？
三 <B-PER.NAM> 毛 <E-PER.NAM> 说 我 唯 一 锲 而 不 舍 ， 愿 意 以 自 己 的 生 命 去 努 力 的 ， 只 不 过 是 保 守 我 个 人 的 心 怀 意 念 ， 在 我 有 生 之 日 ， 做 一 个 真 诚 的 人 ， 不 放 弃 对 生 活 的 热 爱 和 执 着 ， 在 有 限 的 时 空 里 ， 过 无
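
If you want the entities themselves rather than the tagged string, the printed format can be parsed with a few lines of plain Python. This is only a sketch: the helper name `entities_from_tagged_string` is hypothetical, it assumes each tag follows the token it applies to (as in the output above), and it assumes an `S-` prefix for single-token entities, which does not appear in the five sentences shown.

```python
import re
from typing import List, Tuple

# Matches tags such as <B-PER.NAM>; the S- prefix is an assumption (see above).
TAG_PATTERN = re.compile(r"^<([BIES])-(.+)>$")


def entities_from_tagged_string(tagged: str) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from a tagged string."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []  # characters of the entity being built
    pieces = tagged.split()
    for i, piece in enumerate(pieces):
        if TAG_PATTERN.match(piece):
            continue  # tags are consumed together with the token before them
        next_tag = TAG_PATTERN.match(pieces[i + 1]) if i + 1 < len(pieces) else None
        if next_tag is None:
            current = []  # untagged token: outside any entity
            continue
        prefix, label = next_tag.groups()
        if prefix in ("B", "S"):
            current = []  # start a fresh entity
        current.append(piece)
        if prefix in ("E", "S"):
            entities.append(("".join(current), label))
            current = []
    return entities


tagged = "一 节 课 的 时 间 真 心 感 动 了 李 <B-PER.NAM> 开 <I-PER.NAM> 复 <E-PER.NAM> 感 动"
print(entities_from_tagged_string(tagged))  # [('李开复', 'PER.NAM')]
```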

You can connect to Google Drive with the following code to save any files you want to persist. You can also click the `Files` icon on the left panel and click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the model files from our local directory to your Google Drive.

In [None]:
import shutil
shutil.move('/content/model/', "/content/drive/My Drive/model/")

More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Do drop us some feedback on how to improve the notebooks on the [GitHub repo](https://github.com/eugenesiow/practical-ml/).