
Performance of the CoNLL 2003 NER pretrained model? #390

Closed
nreimers opened this issue Jan 14, 2019 · 6 comments
Labels
enhancement Improving of an existing feature

Comments

@nreimers

Hi,
thank you for open-sourcing this nice library for sequence tagging, great work!

I tried to evaluate the pre-trained CoNLL 2003 English-NER model, but I get slightly different results.

Output from the official Perl evaluation script of the pretrained NER model on the CoNLL 2003 NER test set:

$./conlleval < predictions.txt 
processed 46666 tokens with 5648 phrases; found: 5691 phrases; correct: 5252.
accuracy:  98.36%; precision:  92.29%; recall:  92.99%; FB1:  92.64
              LOC: precision:  94.12%; recall:  94.00%; FB1:  94.06  1666
             MISC: precision:  81.96%; recall:  82.19%; FB1:  82.08  704
              ORG: precision:  89.94%; recall:  92.05%; FB1:  90.98  1700
              PER: precision:  97.35%; recall:  97.59%; FB1:  97.47  1621

According to the conlleval script, the performance on the test set is 92.64 F1. I would have expected this model to achieve 93.09 or something close to that. Am I doing something wrong?
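
As a quick sanity check (not part of the conlleval output itself), the overall FB1 above follows directly from the phrase counts reported by the script:

# Recompute overall precision, recall and F1 from the phrase counts printed by conlleval.
correct, found, gold = 5252, 5691, 5648

precision = correct / found   # ~92.29%
recall = correct / gold       # ~92.99%
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2%}  recall: {recall:.2%}  FB1: {f1:.2%}")  # FB1: 92.64%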

My code:

# Usage: python run_model.py > predictions.txt
# Then remove the logging info before analyzing it with conlleval 
from flair.models import SequenceTagger
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher

# Convert the IOBES tags back to BIO tags, so the output is compatible with the Perl eval script
def convert_iobes_tags(tag):
    return tag.replace("S-", "B-").replace("E-", "I-")

# Load dataset. I used the files from: https://github.com/kyzhouhzau/BERT-NER/tree/master/NERdata
columns = {0: 'text', 3: 'ner_gold'}
data_folder = 'data/conll2003_ner/'

corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns,
                                                              train_file='test.txt',
                                                              test_file='test.txt',
                                                              dev_file='test.txt')


# load the NER tagger
tagger = SequenceTagger.load('ner')

# Run the NER model and print each token with its gold and predicted tag
for sentence in corpus.test:
    tagger.predict(sentence)

    for token in sentence:
        print(token.text, token.get_tag('ner_gold').value, convert_iobes_tags(token.get_tag('ner').value))

    print("")

It would be really great if you could also publish the performance scores of your pretrained models.

@alanakbik
Collaborator

Hello @nreimers thanks very much for reporting this.

When running your script I found that the final evaluation numbers of the serialized NER model differ with every run, i.e. predictions are currently not entirely deterministic (try sentence = Sentence('SOCCER - BLINKER BAN LIFTED .'); tagger.predict(sentence); print(sentence.to_tagged_string()) a few times). I think this is likely due to an old bug in which dropout never got deactivated at prediction time (#184). Since we trained this model many Flair versions ago, the buggy behavior was serialized along with the embedding class, even though the bug has since been fixed.
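
For reference, a minimal sketch of the determinism check described above (not from the original thread). It assumes a working Flair install; the tagger.eval() call is an addition that uses the standard PyTorch mechanism for disabling dropout and may not help if the dropout bug was serialized into the old model:

# Minimal sketch of the determinism check: tag the same sentence several times and
# count distinct outputs. Deterministic predictions should yield exactly one output.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
tagger.eval()  # standard PyTorch call to disable dropout at prediction time (assumption, see note above)

outputs = set()
for _ in range(5):
    sentence = Sentence('SOCCER - BLINKER BAN LIFTED .')
    tagger.predict(sentence)
    outputs.add(sentence.to_tagged_string())

print(len(outputs), outputs)  # more than one entry indicates non-deterministic predictions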

I checked a more recently trained model and predictions are deterministic. With this model, I get the following results with your script:

processed 46666 tokens with 5648 phrases; found: 5683 phrases; correct: 5269.
accuracy:  98.43%; precision:  92.72%; recall:  93.29%; FB1:  93.00
              LOC: precision:  93.79%; recall:  94.24%; FB1:  94.02  1676
             MISC: precision:  82.84%; recall:  83.19%; FB1:  83.01  705
              ORG: precision:  91.32%; recall:  92.47%; FB1:  91.89  1682
              PER: precision:  97.35%; recall:  97.53%; FB1:  97.44  1620

What this means from our side is that we need to do the following:

  • Re-train all models with the current Flair version. This is something we have been meaning to do for a long while anyway, since Flair and PyTorch have undergone a lot of changes over the last few releases.

  • Publish final scores of all models we ship and scripts to reproduce training.

@tabergma added the enhancement label on Jan 15, 2019
@nreimers
Author

Hi @alanakbik
great, thanks for your fast reaction. Looking forward to the new serialized models.

@yahshibu
Contributor

Hello @alanakbik
I'm really grateful for your contributions! Thank you.

I also tried to reproduce your results but failed, and I was about to ask a question about it.
However, I guess it is related to this issue, so I'm also looking forward to the update.

Thanks.

@alanakbik
Collaborator

@yahshibu thanks for your interest - the next release will probably happen at the end of next week!

@yahshibu
Contributor

yahshibu commented Feb 7, 2019

@alanakbik Thank you for letting me know! I'm so pumped!

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
@alanakbik removed the wontfix label on Apr 30, 2020