In [1]:
from analysis import Analysis

# Metric-based datasets analysis


## available corpora
We've converted corpora from different disambiguation tasks into the same tab-separated format, with the following columns:
* **identifier**: instance identifier
* **lexical expression**: depending on the task, this can be a lemma, entity mention, event mention, or predicate
* **meaning**: sense for concept-based tasks, instance for instance-based tasks
* **document creation time**: if available, this shows the date on which the document was created
* **resource ambiguity**: the number of meanings that the lexical expression can refer to according to a resource (for example WordNet)
* **resource variance**: the maximum number of lexical expressions that can express the meaning according to a resource (for example WordNet)
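
To make the column layout concrete, here is a minimal sketch of reading one of these files in Python. It assumes the columns appear in the order listed above and that the files have no header row; neither is confirmed here. The field names in `Row` and the sample lines are hypothetical and made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class Row(NamedTuple):
    # Hypothetical field names; the order is assumed to follow the list above.
    identifier: str
    lexical_expression: str
    meaning: str
    document_creation_time: str
    resource_ambiguity: str
    resource_variance: str


def read_rows(handle):
    """Yield one Row per tab-separated line, skipping blank lines."""
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields:
            yield Row(*fields)


# Made-up sample lines, not taken from any of the corpora.
sample = io.StringIO(
    "id_1\tbank\tmeaning_1\t2012-04-30\t10\t3\n"
    "id_2\triver\tmeaning_2\t2012-04-30\t1\t1\n"
)
rows = list(read_rows(sample))
print(rows[0].lexical_expression, rows[0].resource_ambiguity)
```

All values are kept as strings, since the document creation time may be missing and the exact encoding of the numeric columns is not specified here.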

| disambiguation task  | corpus name  | link   | file name in this folder  | 
|---|---|---|---|
| WSD  | senseval 2 all-words  | http://aclweb.org/anthology/S01-1005  |  WSD___SE2-AW |
| WSD  | senseval 3 all-words  | http://aclweb.org/anthology/W04-0811  |  WSD___SE3-AW |
| WSD  | semeval 2007 all-words  | http://aclweb.org/anthology/S07-1016   | WSD___SE7-AW  |
| WSD | semeval 2010 all-words | http://www.aclweb.org/anthology/S10-1013 | WSD___SE10-AW |
| WSD | semeval 2013 all-words | http://aclweb.org/anthology/S13-2040 | WSD___SE13-AW |
| SRL | CoNLL-2004 shared task training | http://www.cs.upc.edu/~srlconll/st04/st04.html | SRL___CONLL2004 |
| EL | WES2015 | http://ceur-ws.org/Vol-1581/paper4.pdf | EL___WES2015 |
| EL | Meantime | http://www.newsreader-project.eu/results/data/wikinews/ | EL___Meantime |
| EnC | QuizBowl | https://www.cs.umd.edu/~aguha/qbcoreference | EnC___QuizBowl |
| EvC | Ecb | http://aclweb.org/anthology/P10-1143 | EvC___Ecb |
| EvC | Ecb+ | http://www.lrec-conf.org/proceedings/lrec2014/pdf/840_Paper.pdf | EvC___Ecb+ |

The following two datasets were analyzed, but were not included due to licensing restrictions:

| disambiguation task  | corpus name  | link   | file name in this folder  | 
|---|---|---|---|
| EL | AIDA | http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/ | NA |
| EvC | TAC KBP 2015 EN Task 3 | https://tac.nist.gov/2015/KBP/Event/index.html | NA |

In [2]:
all_datasets = ['EL___WES2015', 'EL___Meantime', # Entity Linking
                'EnC___quizbowl',  # Entity Coreference
                'EvC___ecb', 'EvC___ecb+', # Event Coreference
                'SRL___CONLL2004', # Semantic Role Labelling
                'WSD___SE10-AW', 'WSD___SE13-AW', 'WSD___SE2-AW', 'WSD___SE3-AW', 'WSD___SE7-AW'] # Word Sense Disambiguation

## available metrics
The file 'metrics.py' in this folder contains the metrics we used to analyze each corpus. We encourage you to inspect those functions! We also provide a class, `Analysis`, that computes these metrics for a given corpus. An example can be found below:

In [6]:
# example analysis
instance = Analysis('EL___WES2015')
for attribute in ('moa', 'moda', 'emnle', 'rora',
                  'mov', 'modv', 'elenm', 'rorv'):
    print(attribute, round(getattr(instance, attribute), 2))

# The dtr value is only available for the Entity Linking datasets.
print('dtr', instance.dtr)

moa 1.03
moda 0.99
emnle 0.99
rora 0.0
mov 1.38
modv 0.91
elenm 0.19
rorv 0.0
dtr [datetime.datetime(2012, 4, 30, 0, 0), datetime.datetime(2015, 3, 30, 0, 0)]
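
The same pattern could be applied to several corpora at once. The sketch below is one possible way to do that, not part of this repository: `summarize` and `StubAnalysis` are hypothetical names. It assumes that every requested attribute exists and is numeric for every corpus, which is only shown above for 'EL___WES2015'. The stub class and its values are made up so that the block runs on its own.

```python
def summarize(analysis_cls, dataset_names, attributes):
    """Return {dataset name: {attribute: value rounded to 2 decimals}}."""
    summary = {}
    for name in dataset_names:
        instance = analysis_cls(name)
        summary[name] = {
            attribute: round(getattr(instance, attribute), 2)
            for attribute in attributes
        }
    return summary


class StubAnalysis:
    """Stand-in for Analysis with made-up values, so the sketch runs alone."""

    def __init__(self, name):
        self.name = name
        self.moa = 1.234
        self.mov = 2.0


result = summarize(StubAnalysis, ['corpus_a', 'corpus_b'], ('moa', 'mov'))
print(result['corpus_a'])
```

With the real class, the call would look like `summarize(Analysis, all_datasets, ('moa', 'moda', 'emnle', 'rora', 'mov', 'modv', 'elenm', 'rorv'))`. The `dtr` attribute is left out because it is only available for the Entity Linking datasets and is not a single number.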


## ATRFU
To compute ATR(FU), see the notebook at 'atrfu/Atrfu demo.ipynb'.