
Performance of the CoNLL 2003 NER pretrained model? #390

Closed
nreimers opened this issue Jan 14, 2019 · 6 comments
Labels
enhancement Improving of an existing feature

Comments

@nreimers

Hi,
thank you for open-sourcing this nice library for sequence tagging, great work!

I tried to evaluate the pre-trained CoNLL 2003 English-NER model, but I get slightly different results.

Output from the official Perl evaluation script of the pretrained NER model on the CoNLL 2003 NER test set:

$./conlleval < predictions.txt 
processed 46666 tokens with 5648 phrases; found: 5691 phrases; correct: 5252.
accuracy:  98.36%; precision:  92.29%; recall:  92.99%; FB1:  92.64
              LOC: precision:  94.12%; recall:  94.00%; FB1:  94.06  1666
             MISC: precision:  81.96%; recall:  82.19%; FB1:  82.08  704
              ORG: precision:  89.94%; recall:  92.05%; FB1:  90.98  1700
              PER: precision:  97.35%; recall:  97.59%; FB1:  97.47  1621

According to the conlleval script, the performance on the test set is 92.64 F1. I would have expected this model to achieve 93.09 or something close to that. Am I doing something wrong?
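
As a quick sanity check (not part of the conlleval output itself), the overall FB1 above follows directly from the phrase counts reported by the script:

# Recompute overall precision, recall and F1 from the phrase counts printed by conlleval.
correct, found, gold = 5252, 5691, 5648

precision = correct / found   # ~92.29%
recall = correct / gold       # ~92.99%
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2%}  recall: {recall:.2%}  FB1: {f1:.2%}")  # FB1: 92.64%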

My code:

# Usage: python run_model.py > predictions.txt
# Then remove the logging info before analyzing it with conlleval 
from flair.models import SequenceTagger
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher

# Convert the IOBES tags back to BIO tags, so the output is compatible with the Perl eval script
def convert_iobes_tags(tag):
    return tag.replace("S-", "B-").replace("E-", "I-")

# Load dataset. I used the files from: https://github.com/kyzhouhzau/BERT-NER/tree/master/NERdata
columns = {0: 'text', 3: 'ner_gold'}
data_folder = 'data/conll2003_ner/'

corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns,
                                                              train_file='test.txt',
                                                              test_file='test.txt',
                                                              dev_file='test.txt')


# load the NER tagger
tagger = SequenceTagger.load('ner')

# Run the NER model and print each token with its gold and predicted tag
for sentence in corpus.test:
    tagger.predict(sentence)

    for token in sentence:
        print(token.text, token.get_tag('ner_gold').value, convert_iobes_tags(token.get_tag('ner').value))

    print("")

It would be really great if you could also publish the performance scores of your pretrained models.

@alanakbik
Collaborator

Hello @nreimers thanks very much for reporting this.

When running your script I found that the final evaluation numbers of the serialized NER model differ with every run, i.e. predictions are currently not entirely deterministic (try sentence = Sentence('SOCCER - BLINKER BAN LIFTED .'); tagger.predict(sentence); print(sentence.to_tagged_string()) a few times). I think this is likely due to an old bug in which dropout never got deactivated at prediction time (#184). Since we trained this model many Flair versions ago, the buggy behavior was serialized along with the embedding class, even though the bug has since been fixed.
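
For reference, a minimal sketch of the determinism check described above (not from the original thread). It assumes a working Flair install; the tagger.eval() call is an addition that uses the standard PyTorch mechanism for disabling dropout and may not help if the dropout bug was serialized into the old model:

# Minimal sketch of the determinism check: tag the same sentence several times and
# count distinct outputs. Deterministic predictions should yield exactly one output.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
tagger.eval()  # standard PyTorch call to disable dropout at prediction time (assumption, see note above)

outputs = set()
for _ in range(5):
    sentence = Sentence('SOCCER - BLINKER BAN LIFTED .')
    tagger.predict(sentence)
    outputs.add(sentence.to_tagged_string())

print(len(outputs), outputs)  # more than one entry indicates non-deterministic predictions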

I checked a more recently trained model and predictions are deterministic. With this model, I get the following results with your script:

processed 46666 tokens with 5648 phrases; found: 5683 phrases; correct: 5269.
accuracy:  98.43%; precision:  92.72%; recall:  93.29%; FB1:  93.00
              LOC: precision:  93.79%; recall:  94.24%; FB1:  94.02  1676
             MISC: precision:  82.84%; recall:  83.19%; FB1:  83.01  705
              ORG: precision:  91.32%; recall:  92.47%; FB1:  91.89  1682
              PER: precision:  97.35%; recall:  97.53%; FB1:  97.44  1620

What this means from our side is that we need to do the following:

  • Re-train all models with the current Flair version. This is something we have been meaning to do for a long while anyway, since Flair and PyTorch have undergone a lot of changes over the last few releases.

  • Publish final scores of all models we ship and scripts to reproduce training.

@tabergma added the enhancement label on Jan 15, 2019
@nreimers
Author

Hi @alanakbik
great, thanks for your fast reaction. Looking forward to the new serialized models.

@yahshibu
Contributor

Hello @alanakbik
I'm really grateful for your contributions! Thank you.

I also tried to reproduce your results but failed, and I was about to ask a question about it.
However, I guess it is related to this issue, so I'm also looking forward to the update.

Thanks.

@alanakbik
Collaborator

@yahshibu thanks for your interest - the next release will probably happen at the end of next week!

@yahshibu
Contributor

yahshibu commented Feb 7, 2019

@alanakbik Thank you for letting me know! I'm so pumped!

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
@alanakbik removed the wontfix label on Apr 30, 2020