This repository contains the annotated STM corpus with coreferences, the populated knowledge graphs, and the source code for the paper:
Brack A., Müller D., Hoppe A., Ewerth R. (2021) Coreference Resolution in Research Papers from Multiple Domains, ECIR 2021 (accepted for publication).
- Preprint: https://arxiv.org/abs/2101.00884
Python 3.7 is required. Install the requirements with:
pip install pipenv  # only needed in case pipenv isn't installed yet
pipenv install
- data/stm-coref: contains the annotated coreferences, separated per domain, in BRAT/standoff format (see the reading sketch after this list)
- data/stm-entities: contains the annotated concepts separated per domain in BRAT/standoff format (from https://github.com/arthurbra/stm-corpus)
- data/silver_labelled: contains predicted concepts and coreferences of the silver-labelled corpus
- data/STM coreference annotation guidelines.pdf: annotation guidelines for coreference annotations in the STM corpus
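The BRAT/standoff annotations come as paired .txt/.ann files. A small sketch to list them (the per-domain folder layout is an assumption; adjust the path to the actual directory structure):

```python
from pathlib import Path

# Sketch: iterate over the BRAT standoff annotation files in data/stm-coref.
# Assumption: annotations are stored as paired .txt/.ann files, grouped per domain.
for ann_file in sorted(Path("data/stm-coref").rglob("*.ann")):
    txt_file = ann_file.with_suffix(".txt")
    print(ann_file.parent.name, ann_file.name, "text present:", txt_file.exists())
```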
The folder knowledge_graph/ contains various knowledge graphs.
- knowledge_graph/gold_kg_cross_domain.jsonl: Test-STM-KG (cross-domain)
- knowledge_graph/gold_kg_in_domain.jsonl: Test-STM-KG (in-domain)
- knowledge_graph/entity_resolution_annotations.tsv: links mentions to Wiktionary and Wikipedia (from https://gitlab.com/TIBHannover/orkg/orkg-nlp/-/tree/master/STEM-ECR-v1.0)
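The .jsonl knowledge graph files can be inspected line by line; a minimal sketch (the field names inside each record are not documented here, so it just prints the raw JSON objects):

```python
import json

# Sketch: each line of a .jsonl knowledge graph file holds one JSON-encoded record;
# the exact field names are not documented here.
with open("knowledge_graph/gold_kg_cross_domain.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)
```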
To evaluate the effect of coreference resolution in knowledge graph population against the Test-STM-KG, run the following Python script (example invocation below):
- evaluate_kgs_against_test_stm_kg.py: prints the evaluation metrics
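For example (assuming the script requires no additional arguments):

pipenv run python evaluate_kgs_against_test_stm_kg.py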
- knowledge_graph/stm_silver_kg_cross_domain_with_corefs.jsonl: Contains the cross-domain KG populated from 55,485 abstracts
- knowledge_graph/stm_silver_kg_in_domain_with_corefs.jsonl: Contains the in-domain KG populated from 55,485 abstracts
- knowledge_graph/top domain-specific concepts.xlsx: contains the most frequent domain-specific concepts per concept type and domain
To build the in-domain and cross-domain Test-STM-KG and the research knowledge graphs, run the following Python script (example invocation below):
- build_kgs.py: creates the knowledge graphs in knowledge_graph/
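For example (assuming the script is run from the repository root and requires no additional arguments):

pipenv run python build_kgs.py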
The following two models were tested in the paper: BFCR and SCIIE. Experiments with the two models on STM-coref, SciERC or any corpus that inherits from brat_utils.Corpus can be performed in the following ways:
model = BFCRModel()
The fold to train and evaluate on and the exact experiment involving BFCR can be specified with the `fold` and `experiments` parameters. By default, an already pretrained BFCR_Span_Onto_STM model is used. An illustrative call is sketched below.
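A sketch with illustrative values (whether `experiments` expects a single name or a list, and which folds and experiment names are valid, is defined in the repository code, not shown here):

```python
# Illustrative values only: the valid folds and experiment names are defined in the repository code.
model = BFCRModel(fold=0, experiments=["BFCR_Span_Onto_STM"])
```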
model = SCIIEModel()
BFCRModel and SCIIEModel automatically download the checkpoints they need and store them locally for future use. By default, BFCRModel and SCIIEModel use STM-coref, but SciERC can also be used; if you want to use SciERC, download it as explained in brat_utils.SciercCorpus.
model.train()

Runs for a fixed number of epochs.

model.evaluate()
Evaluates on the test set and prints the average MUC, B³, CEAFE, and CoNLL scores. Also stores the average scores for the entire test set and for each domain in the dataset as a .csv file in stm-coref/EvalResults.
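To inspect the stored scores afterwards, a small sketch (the exact .csv file names inside stm-coref/EvalResults are not documented here):

```python
import csv
from pathlib import Path

# Sketch: print the per-domain evaluation scores written by model.evaluate().
for csv_file in sorted(Path("stm-coref/EvalResults").glob("*.csv")):
    print(csv_file.name)
    with open(csv_file, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(row)
```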
Coreference clusters in texts can be predicted in the following two ways:
predicted_clusters = model.predict(texts, domains) # domains optional
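A minimal sketch (the texts and the domain label are illustrative; domain labels should match the STM corpus domains):

```python
# Illustrative input; domain labels should match the STM corpus domains.
texts = [
    "The proposed catalyst increases the reaction rate. It also remains stable at high temperatures.",
]
domains = ["Chemistry"]  # optional

predicted_clusters = model.predict(texts, domains)
print(predicted_clusters)
```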
The script predict.py downloads the best-performing pretrained model (BFCR_Span_Onto_STM) and uses it to predict coreference clusters for the texts stored in the input file. The predictions are then stored in another file. Usage:
pipenv run python predict.py "path/to/texts.jsonlines" "Optional-path/to/domains.jsonlines"
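A sketch for preparing the input files, assuming one JSON-encoded entry per line (check predict.py for the exact format it expects):

```python
import json

# Assumption: one JSON-encoded text per line in texts.jsonlines and,
# analogously, one domain label per line in domains.jsonlines.
texts = ["The proposed catalyst increases the reaction rate. It also remains stable at high temperatures."]
domains = ["Chemistry"]

with open("texts.jsonlines", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(json.dumps(text) + "\n")

with open("domains.jsonlines", "w", encoding="utf-8") as f:
    for domain in domains:
        f.write(json.dumps(domain) + "\n")
```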
The predictions can be visualized using brat. The files brat needs to visualize the predicted clusters can be created in the following two ways:
model.predict(texts, domains, create_standoff_annotations=True)
or
pipenv run python predict.py "path/to/texts.jsonlines" "Optional-path/to/domains.jsonlines" --create_standoff_annotations