A multi-lingual named entity classifier that performs named entity recognition (NER) on two datasets: CoNLL 2003 (English) and Weibo (Chinese).
- Data: We used the CoNLL 2003 dataset (train, dev) combined with a manually corrected (cleaned) test set from the CrossWeigh paper, called CoNLL++, as the English corpus, and the Weibo dataset as the Chinese corpus. We then removed stopwords and tokenized the text with the BERT tokenizer.
- Results: Testing the current state-of-the-art model on the CoNLL++ dataset, we achieved an F1-score of 94.3% with pooled embeddings. Without pooled embeddings or CrossWeigh, and training for at most 50 epochs instead of 150, we get a micro F1-score of 93.5%, within 0.7 of a percentage point of the SOTA.
The notebook is structured as follows:
- Setting up the GPU Environment
- Getting Data
- Training and Testing the Model
- Using the Model (Running Inference)
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
- Chinese corpus: flair.datasets.NER_CHINESE_WEIBO
- English corpus: CoNLL++ (see the loading sketch after this list)
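The following is a minimal sketch of how the two corpora can be loaded with Flair. The Weibo corpus ships with Flair as `NER_CHINESE_WEIBO`; for CoNLL++ we assume the CrossWeigh test file has already been downloaded into a local folder and read as a `ColumnCorpus` (the folder and file names below are placeholders, not the exact paths from the notebook).

```python
from flair.datasets import NER_CHINESE_WEIBO, ColumnCorpus

# Chinese corpus: downloaded automatically by Flair.
chinese_corpus = NER_CHINESE_WEIBO()

# English corpus: CoNLL 2003 train/dev plus the CoNLL++ test set from CrossWeigh.
# Folder and file names are placeholders for wherever the data was unpacked.
english_corpus = ColumnCorpus(
    data_folder="data/conll_plus_plus",
    column_format={0: "text", 1: "pos", 2: "np", 3: "ner"},  # standard CoNLL 2003 columns
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="conllpp_test.txt",
)

print(english_corpus)  # prints the number of train/dev/test sentences
```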
Since our dataset is relatively clean already, we only removed stopwords and tokenized the text using the BERT tokenizer.
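As an illustration, this is roughly what the preprocessing looks like with the Hugging Face BERT tokenizer; the stopword list and the checkpoint name below are assumptions, not values taken from the notebook.

```python
from transformers import BertTokenizer

# Checkpoint name is an assumption; any BERT model with a matching tokenizer works.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# A tiny illustrative stopword list; the actual list used is not shown in the notebook.
STOPWORDS = {"the", "a", "an", "of", "to"}

def preprocess(sentence: str) -> list:
    """Drop stopwords, then tokenize with the BERT WordPiece tokenizer."""
    kept = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    return tokenizer.tokenize(" ".join(kept))

print(preprocess("George Washington went to Washington"))
```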
The ground truth data is in the format shown above: in English, each word token is assigned an NE tag (e.g., `George B-PER`, `Washington I-PER`), while in Chinese, each character is assigned an NE tag. We combined a bi-directional LSTM with a conditional random field (CRF), which gives better results than using a BiLSTM alone.
For deep learning based NER, there are usually three steps (sketched in code after this list):
- Get a representation of the input. We use pretrained BERT word embeddings together with character-level embeddings to represent each token.
- Encode the context around each token; we use a bidirectional LSTM.
- Decode the encoded sequence into tags; we use a CRF output layer.
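A sketch of the first step in Flair, stacking BERT and character-level embeddings; the BERT checkpoint name is an assumption and may differ from the one used in the notebook.

```python
from flair.embeddings import TransformerWordEmbeddings, CharacterEmbeddings, StackedEmbeddings

# Step 1: input representation.
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("bert-base-cased"),  # contextual BERT word embeddings (assumed checkpoint)
    CharacterEmbeddings(),                         # character-level embeddings learned during training
])
```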
In a sequence tagging task we have access to both past and future input features at a given time step, so we can use a bidirectional LSTM network to make efficient use of past features (via the forward states) and future features (via the backward states). For the output layer, there are two ways to make use of neighboring tags when predicting the current tag. The first is to predict a distribution over tags at each time step and then use beam-like decoding to find the optimal tag sequence. The second is to model the whole sentence jointly rather than individual positions, which leads to Conditional Random Field (CRF) models.
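Below is a minimal sketch of how the BiLSTM-CRF tagger can be assembled and trained in Flair on top of the stacked embeddings above. The hidden size, learning rate, batch size, and output path are placeholders; only `max_epochs=50` reflects the setting reported in the results. Older Flair versions use `make_tag_dictionary`, newer ones `make_label_dictionary`.

```python
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# `english_corpus` and `embeddings` come from the sketches above.
tag_dictionary = english_corpus.make_tag_dictionary(tag_type="ner")

# Steps 2 + 3: BiLSTM encoder with a CRF decoding layer on top.
tagger = SequenceTagger(
    hidden_size=256,               # placeholder hidden size
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,                  # sentence-level CRF instead of per-token softmax
)

trainer = ModelTrainer(tagger, english_corpus)
trainer.train(
    "models/conllpp-ner",          # placeholder output directory
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=50,                 # trained to a maximum of 50 epochs
)
```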
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
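For reference, the GloVe baseline compared against below can be built with the same tagger by swapping the embedding stack; this is a sketch under that assumption, not the exact baseline configuration from the notebook.

```python
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# Baseline: classic (non-contextual) GloVe vectors in place of the BERT + character stack.
glove_embeddings = WordEmbeddings("glove")

baseline_tagger = SequenceTagger(
    hidden_size=256,                 # same placeholder hidden size as above
    embeddings=glove_embeddings,
    tag_dictionary=tag_dictionary,   # reuses the tag dictionary from the training sketch
    tag_type="ner",
    use_crf=True,
)
```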
Our model beats the GloVe baseline in precision, recall, and F1-score for recognizing organizations, locations, persons, and MISC entities (names that do not fall into the other categories), with the single exception of recall on organizations.
Here are some results for recognizing named entities in English sentences.
We chose some confusing sentences, such as "George Washington went to Washington to study in University of Washington".
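Inference on such a sentence looks roughly like this with Flair; the model path is a placeholder for wherever the trained tagger was saved.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Placeholder path: the directory used when training the tagger.
tagger = SequenceTagger.load("models/conllpp-ner/final-model.pt")

sentence = Sentence("George Washington went to Washington to study in University of Washington")
tagger.predict(sentence)

# Print each recognized span with its predicted entity type.
for span in sentence.get_spans("ner"):
    print(span)
```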
And here are some results for Chinese sentences.
Since the CoNLL dataset we used is from 2003, which is fairly dated, we wanted to try running inference with our model on new named entities that may be unseen in the training data. This is our result (explanation: 1. Although Johnny Depp was already famous in 2003, Amber Heard was still unknown at the time, yet our model can recognize her name. 2. Similarly, Trump was not nearly as prominent then, and Twitter did not even exist. 3. The Boys is a fairly new TV series that first aired in 2019. 4. Finally, a model trained on a 2003 dataset can correctly pick up COVID-19.) In general, our model performs quite well at recognizing new named entities.
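The new-entity check can be run as a batch like this; the sentences below are paraphrases of the cases described above, not the exact inputs from the notebook.

```python
from flair.data import Sentence

# Paraphrased test sentences containing entities that post-date (or barely predate) 2003.
new_entity_sentences = [
    Sentence("Johnny Depp and Amber Heard appeared in court."),
    Sentence("Trump posted the announcement on Twitter."),
    Sentence("The Boys first aired in 2019."),
    Sentence("COVID-19 spread around the world in 2020."),
]

# `tagger` is the model loaded in the inference sketch above.
tagger.predict(new_entity_sentences)
for s in new_entity_sentences:
    print(s.to_tagged_string())
```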
For example, when we tested model inference on the Chinese corpus with "Tesla", the model could not detect it. Hence, future work includes: