A multi-lingual named entity classifier that performs named entity recognition (NER) on two datasets: CoNLL 2003 (English) and Weibo (Chinese).
- Data: We used the CoNLL 2003 dataset (train, dev) combined with a manually corrected (cleaned) test set from the CrossWeigh paper, called CoNLL++, as the English corpus, and the Weibo dataset as the Chinese corpus. We then removed stopwords and tokenized the text with the BERT tokenizer.
- Results: Testing the current state-of-the-art model on the CoNLL++ dataset, we achieved an F1-score of 94.3% with pooled embeddings. Without pooled embeddings or CrossWeigh, and training for at most 50 epochs instead of 150, we get a micro F1-score of 93.5%, within 0.7 of a percentage point of the SOTA.
The notebook is structured as follows:
- Setting up the GPU Environment
- Getting Data
- Training and Testing the Model
- Using the Model (Running Inference)
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
- Chinese corpus: flair.datasets.NER_CHINESE_WEIBO
- English corpus: CoNLL++ (see the loading sketch after this list)
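The following is a minimal sketch of how the two corpora can be loaded with Flair. The Weibo corpus ships with Flair as `NER_CHINESE_WEIBO`; for CoNLL++ we assume the CrossWeigh test file has already been downloaded into a local folder and read as a `ColumnCorpus` (the folder and file names below are placeholders, not the exact paths from the notebook).

```python
from flair.datasets import NER_CHINESE_WEIBO, ColumnCorpus

# Chinese corpus: downloaded automatically by Flair.
chinese_corpus = NER_CHINESE_WEIBO()

# English corpus: CoNLL 2003 train/dev plus the CoNLL++ test set from CrossWeigh.
# Folder and file names are placeholders for wherever the data was unpacked.
english_corpus = ColumnCorpus(
    data_folder="data/conll_plus_plus",
    column_format={0: "text", 1: "pos", 2: "np", 3: "ner"},  # standard CoNLL 2003 columns
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="conllpp_test.txt",
)

print(english_corpus)  # prints the number of train/dev/test sentences
```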
Since our dataset is relatively clean already, we only removed stopwords and tokenized the text using the BERT tokenizer.
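As an illustration, this is roughly what the preprocessing looks like with the Hugging Face BERT tokenizer; the stopword list and the checkpoint name below are assumptions, not values taken from the notebook.

```python
from transformers import BertTokenizer

# Checkpoint name is an assumption; any BERT model with a matching tokenizer works.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# A tiny illustrative stopword list; the actual list used is not shown in the notebook.
STOPWORDS = {"the", "a", "an", "of", "to"}

def preprocess(sentence: str) -> list:
    """Drop stopwords, then tokenize with the BERT WordPiece tokenizer."""
    kept = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    return tokenizer.tokenize(" ".join(kept))

print(preprocess("George Washington went to Washington"))
```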
The ground truth data is in the format shown above: in English, each word token is assigned an NE tag (e.g., `George B-PER`, `Washington I-PER`), while in Chinese, each character is assigned an NE tag. We combined a bi-directional LSTM with a conditional random field (CRF), which gives better results than using a BiLSTM alone.
For deep learning based NER, there are usually three steps (sketched in code after this list):
- Get a representation of the input. We use pretrained BERT word embeddings together with character-level embeddings to represent each token.
- Encode the context around each token; we use a bidirectional LSTM.
- Decode the encoded sequence into tags; we use a CRF output layer.
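A sketch of the first step in Flair, stacking BERT and character-level embeddings; the BERT checkpoint name is an assumption and may differ from the one used in the notebook.

```python
from flair.embeddings import TransformerWordEmbeddings, CharacterEmbeddings, StackedEmbeddings

# Step 1: input representation.
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("bert-base-cased"),  # contextual BERT word embeddings (assumed checkpoint)
    CharacterEmbeddings(),                         # character-level embeddings learned during training
])
```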
In a sequence tagging task we have access to both past and future input features at a given time step, so we can use a bidirectional LSTM network to make efficient use of past features (via the forward states) and future features (via the backward states). For the output layer, there are two ways to make use of neighboring tags when predicting the current tag. The first is to predict a distribution over tags at each time step and then use beam-like decoding to find the optimal tag sequence. The second is to model the whole sentence jointly rather than individual positions, which leads to Conditional Random Field (CRF) models.
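Below is a minimal sketch of how the BiLSTM-CRF tagger can be assembled and trained in Flair on top of the stacked embeddings above. The hidden size, learning rate, batch size, and output path are placeholders; only `max_epochs=50` reflects the setting reported in the results. Older Flair versions use `make_tag_dictionary`, newer ones `make_label_dictionary`.

```python
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# `english_corpus` and `embeddings` come from the sketches above.
tag_dictionary = english_corpus.make_tag_dictionary(tag_type="ner")

# Steps 2 + 3: BiLSTM encoder with a CRF decoding layer on top.
tagger = SequenceTagger(
    hidden_size=256,               # placeholder hidden size
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,                  # sentence-level CRF instead of per-token softmax
)

trainer = ModelTrainer(tagger, english_corpus)
trainer.train(
    "models/conllpp-ner",          # placeholder output directory
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=50,                 # trained to a maximum of 50 epochs
)
```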
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
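For reference, the GloVe baseline compared against below can be built with the same tagger by swapping the embedding stack; this is a sketch under that assumption, not the exact baseline configuration from the notebook.

```python
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# Baseline: classic (non-contextual) GloVe vectors in place of the BERT + character stack.
glove_embeddings = WordEmbeddings("glove")

baseline_tagger = SequenceTagger(
    hidden_size=256,                 # same placeholder hidden size as above
    embeddings=glove_embeddings,
    tag_dictionary=tag_dictionary,   # reuses the tag dictionary from the training sketch
    tag_type="ner",
    use_crf=True,
)
```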
Our model beats the GloVe baseline in precision, recall, and F1-score for recognizing organizations, locations, persons, and MISC entities (names that do not fall into the other categories), with the single exception of recall on organizations.
Here are some results for recognizing named entities in English sentences.
We chose some confusing sentences, such as "George Washington went to Washington to study in University of Washington".
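Inference on such a sentence looks roughly like this with Flair; the model path is a placeholder for wherever the trained tagger was saved.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Placeholder path: the directory used when training the tagger.
tagger = SequenceTagger.load("models/conllpp-ner/final-model.pt")

sentence = Sentence("George Washington went to Washington to study in University of Washington")
tagger.predict(sentence)

# Print each recognized span with its predicted entity type.
for span in sentence.get_spans("ner"):
    print(span)
```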
And here are some results for Chinese sentences.
Since the CoNLL dataset we used is from 2003, which is fairly dated, we wanted to try running inference with our model on new named entities that may be unseen in the training data. This is our result (explanation: 1. Although Johnny Depp was already famous in 2003, Amber Heard was still unknown at the time, yet our model can recognize her name. 2. Similarly, Trump was not nearly as prominent then, and Twitter did not even exist. 3. The Boys is a fairly new TV series that first aired in 2019. 4. Finally, a model trained on a 2003 dataset can correctly pick up COVID-19.) In general, our model performs quite well at recognizing new named entities.
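The new-entity check can be run as a batch like this; the sentences below are paraphrases of the cases described above, not the exact inputs from the notebook.

```python
from flair.data import Sentence

# Paraphrased test sentences containing entities that post-date (or barely predate) 2003.
new_entity_sentences = [
    Sentence("Johnny Depp and Amber Heard appeared in court."),
    Sentence("Trump posted the announcement on Twitter."),
    Sentence("The Boys first aired in 2019."),
    Sentence("COVID-19 spread around the world in 2020."),
]

# `tagger` is the model loaded in the inference sketch above.
tagger.predict(new_entity_sentences)
for s in new_entity_sentences:
    print(s.to_tagged_string())
```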
For example, when we tested model inference on the Chinese corpus with "Tesla", the model could not detect it. Hence, future work includes: