eval4ner: An All-Round Evaluation for Named Entity Recognition

Strict: exact match (both the entity boundary and the entity type are correct)
Exact boundary matching: the predicted entity boundary is correct, regardless of entity type
Partial boundary matching: the entity boundaries overlap, regardless of entity type
Type matching: the entity type is correct, and some overlap between the system-tagged entity and the gold annotation is required

Refer to the blog post Evaluation Metrics of Named Entity Recognition for an explanation of the MUC metrics.

Preliminaries for NER Evaluation

In research and production, the following scenarios occur frequently when evaluating NER systems:

Scenario	Gold Entity Type	Gold Entity Boundary (Surface String)	Predicted Entity Type	Predicted Entity Boundary (Surface String)	Type	Partial	Exact	Strict
III	MUSIC_NAME	告白气球			MIS	MIS	MIS	MIS
II			MUSIC_NAME	年轮	SPU	SPU	SPU	SPU
V	MUSIC_NAME	告白气球	MUSIC_NAME	一首告白气球	COR	PAR	INC	INC
IV	MUSIC_NAME	告白气球	SINGER	告白气球	INC	COR	COR	INC
I	MUSIC_NAME	告白气球	MUSIC_NAME	告白气球	COR	COR	COR	COR
VI	MUSIC_NAME	告白气球	SINGER	一首告白气球	INC	PAR	INC	INC

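MUC-5 takes all of these scenarios into account for a comprehensive evaluation. Under each measure, every entity is labeled correct (COR), incorrect (INC), partial (PAR), missed (MIS), or spurious (SPU). In the table, 一首告白气球 is the gold string 告白气球 with the extra leading characters 一首 ("a [song]"), so its boundary only partially overlaps the gold entity.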
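The labels in the table can be reproduced with a few comparisons. The sketch below is illustrative only and is not eval4ner's implementation: it assumes entities are given as character spans (the hypothetical `Entity` tuple) rather than surface strings, and that each gold entity has already been paired with at most one prediction.

```python
from typing import Dict, NamedTuple, Optional

MODES = ("type", "partial", "exact", "strict")


class Entity(NamedTuple):
    """Hypothetical span representation: label plus [start, end) offsets."""
    label: str
    start: int
    end: int


def classify(gold: Optional[Entity], pred: Optional[Entity]) -> Dict[str, str]:
    """Label one aligned gold/prediction pair under the four measures."""
    if gold is None and pred is None:
        raise ValueError("need at least one entity")
    if pred is None:  # Scenario III: gold entity with no prediction
        return {mode: "MIS" for mode in MODES}
    if gold is None:  # Scenario II: prediction with no gold entity
        return {mode: "SPU" for mode in MODES}
    if not (gold.start < pred.end and pred.start < gold.end):
        raise ValueError("entities do not overlap; score them as MIS and SPU")

    same_span = (gold.start, gold.end) == (pred.start, pred.end)
    same_type = gold.label == pred.label
    return {
        "type": "COR" if same_type else "INC",
        "partial": "COR" if same_span else "PAR",
        "exact": "COR" if same_span else "INC",
        "strict": "COR" if same_span and same_type else "INC",
    }


# Scenario V: same type, but the prediction starts two characters early.
gold = Entity("MUSIC_NAME", 2, 6)
pred = Entity("MUSIC_NAME", 0, 6)
print(classify(gold, pred))
# {'type': 'COR', 'partial': 'PAR', 'exact': 'INC', 'strict': 'INC'}
```

Working on spans instead of strings avoids ambiguity when the same surface string occurs more than once in a sentence, which is presumably why the library's functions also take the source text.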

Then we can compute:

Number of gold-standard entities:

$Possible(POS) = COR + INC + PAR + MIS = TP + FN$

Number of predicted entities:

$Actual(ACT) = COR + INC + PAR + SPU = TP + FP$

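Precision and recall are computed differently for exact-match and partial-match evaluation:

Exact match (i.e. Strict, Exact):

$Precision = COR / ACT = TP / (TP + FP)$

$Recall = COR / POS = TP / (TP + FN)$

Partial match (i.e. Partial, Type):

$Precision = (COR + 0.5 \times PAR) / ACT$

$Recall = (COR + 0.5 \times PAR) / POS$

F-Measure:

$F_1 = 2 \times Precision \times Recall / (Precision + Recall)$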

Applying these definitions to the six scenarios above gives:

Measure	Type	Partial	Exact	Strict
Correct	2	2	2	1
Incorrect	2	0	2	3
Partial	0	2	0	0
Missed	1	1	1	1
Spurious	1	1	1	1
Precision	0.4	0.6	0.4	0.2
Recall	0.4	0.6	0.4	0.2
F1 score	0.4	0.6	0.4	0.2
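The table can be checked by plugging the counts into the POS and ACT definitions. The following is a minimal sketch, not the library's code: `muc_scores` is a hypothetical helper, and it gives half credit to partial matches in every mode, which changes nothing for Exact and Strict because their partial count is always zero.

```python
from typing import Dict


def muc_scores(cor: int, inc: int, par: int, mis: int, spu: int) -> Dict[str, float]:
    """Compute precision, recall and F1 from MUC-style counts."""
    possible = cor + inc + par + mis  # POS: number of gold-standard entities
    actual = cor + inc + par + spu    # ACT: number of predicted entities
    credit = cor + 0.5 * par          # partial matches count as half
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# (COR, INC, PAR, MIS, SPU) per measure, copied from the table above.
counts = {
    "type": (2, 2, 0, 1, 1),
    "partial": (2, 0, 2, 1, 1),
    "exact": (2, 2, 0, 1, 1),
    "strict": (1, 3, 0, 1, 1),
}
for mode, row in counts.items():
    scores = muc_scores(*row)
    print(mode, {name: round(value, 4) for name, value in scores.items()})
```

The rounded values printed for each mode should match the Precision, Recall, and F1 score rows of the table.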

User Guide

Installation

pip install [-U] eval4ner

Usage

1. Evaluate single prediction

import pprint

import eval4ner.muc as muc

# Each entity is a (type, surface string) tuple.
ground_truth = [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')]
prediction = [('PER', 'John Jones and Peter Peters came to York')]
text = 'John Jones and Peter Peters came to York'
# Arguments: predicted entities, gold entities, then the source text.
one_result = muc.evaluate_one(prediction, ground_truth, text)
pprint.pprint(one_result)

Output:

{'exact': {'actual': 1,
           'correct': 0,
           'f1_score': 0,
           'incorrect': 1,
           'missed': 2,
           'partial': 0,
           'possible': 3,
           'precision': 0.0,
           'recall': 0.0,
           'spurius': 0},
 'partial': {'actual': 1,
             'correct': 0,
             'f1_score': 0.25,
             'incorrect': 0,
             'missed': 2,
             'partial': 1,
             'possible': 3,
             'precision': 0.5,
             'recall': 0.16666666666666666,
             'spurius': 0},
 'strict': {'actual': 1,
            'correct': 0,
            'f1_score': 0,
            'incorrect': 1,
            'missed': 2,
            'partial': 0,
            'possible': 3,
            'precision': 0.0,
            'recall': 0.0,
            'spurius': 0},
 'type': {'actual': 1,
          'correct': 1,
          'f1_score': 0.5,
          'incorrect': 0,
          'missed': 2,
          'partial': 0,
          'possible': 3,
          'precision': 1.0,
          'recall': 0.3333333333333333,
          'spurius': 0}}
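The result is a plain nested dictionary keyed by mode, so it is easy to post-process. As a small sketch, the hypothetical helper `summarize` below formats one line per mode; it relies only on the keys visible in the output above, and the dictionary literal is a trimmed copy of that output so the snippet runs without the library installed.

```python
from typing import Dict, List


def summarize(result: Dict[str, Dict[str, float]]) -> List[str]:
    """Return one formatted score line per evaluation mode."""
    lines = []
    for mode in ("strict", "exact", "partial", "type"):
        scores = result[mode]
        lines.append(
            f"{mode:>7}: P={scores['precision']:.4f} "
            f"R={scores['recall']:.4f} F1={scores['f1_score']:.4f}"
        )
    return lines


# Trimmed copy of the evaluate_one output shown above.
one_result = {
    "strict": {"precision": 0.0, "recall": 0.0, "f1_score": 0},
    "exact": {"precision": 0.0, "recall": 0.0, "f1_score": 0},
    "partial": {"precision": 0.5, "recall": 0.16666666666666666, "f1_score": 0.25},
    "type": {"precision": 1.0, "recall": 0.3333333333333333, "f1_score": 0.5},
}
for line in summarize(one_result):
    print(line)
```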

2. Evaluate all predictions

import eval4ner.muc as muc

# Ground truth: one list of (type, surface string) tuples per input text
ground_truths = [
    [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')],
    [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')],
    [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')]
]
# NER model predictions, aligned with ground_truths by index
predictions = [
    [('PER', 'John Jones and Peter Peters came to York')],
    [('LOC', 'John Jones'), ('PER', 'Peters'), ('LOC', 'York')],
    [('PER', 'John Jones'), ('PER', 'Peter Peters'), ('LOC', 'York')]
]
# Input texts
texts = [
    'John Jones and Peter Peters came to York',
    'John Jones and Peter Peters came to York',
    'John Jones and Peter Peters came to York'
]
muc.evaluate_all(predictions, ground_truths, texts, verbose=True)

Output:

 NER evaluation scores:
  strict mode, Precision=0.4444, Recall=0.4444, F1:0.4444
   exact mode, Precision=0.5556, Recall=0.5556, F1:0.5556
 partial mode, Precision=0.7778, Recall=0.6667, F1:0.6944
    type mode, Precision=0.8889, Recall=0.6667, F1:0.7222
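These aggregate numbers are consistent with averaging the per-sentence scores (a macro average) rather than pooling the counts. That is an inference from the printed output, not something confirmed against the library's source. The sketch below recomputes the partial-mode line from per-sentence counts tallied by hand; `prf` is a hypothetical helper.

```python
from statistics import mean

# (COR, PAR, ACT, POS) per sentence in partial mode, counted by hand:
# sentence 1: one prediction that only overlaps a gold entity
# sentence 2: 'John Jones' and 'York' match exactly, 'Peters' overlaps
# sentence 3: all three boundaries match exactly
per_sentence = [
    (0, 1, 1, 3),
    (2, 1, 3, 3),
    (3, 0, 3, 3),
]


def prf(cor, par, act, pos):
    """Precision, recall and F1 with half credit for partial matches."""
    credit = cor + 0.5 * par
    precision = credit / act if act else 0.0
    recall = credit / pos if pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


rows = [prf(*counts) for counts in per_sentence]
macro = [mean(column) for column in zip(*rows)]  # average each metric
report = "Precision={:.4f}, Recall={:.4f}, F1:{:.4f}".format(*macro)
print(report)
# Precision=0.7778, Recall=0.6667, F1:0.6944
```

Pooling the counts instead would give a partial-mode precision of 6/7 (about 0.8571), which does not match the reported 0.7778.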

This repository will be supported long term. Contributions and pull requests are welcome.

Citation

For attribution in academic contexts, please cite this work as:

@misc{eval4ner,
  title={Evaluation Metrics of Named Entity Recognition},
  author={Chai, Yekun},
  year={2018},
  howpublished={\url{https://cyk1337.github.io/notes/2018/11/21/NLP/NER/NER-Evaluation-Metrics/}},
}

@misc{chai2018-ner-eval,
  author = {Chai, Yekun},
  title = {eval4ner: An All-Round Evaluation for Named Entity Recognition},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cyk1337/eval4ner}}
}

