ChEMU-Ref: A Corpus for Modeling Anaphora Resolution in the Chemical Domain

Introduction

This repository contains code introduced in the following paper:

ChEMU-Ref: A Corpus for Modeling Anaphora Resolution in the Chemical Domain
Biaoyan Fang, Christian Druckenbrodt, Saber A. Akhondi, Jiayuan He, Timothy Baldwin and Karin Verspoor
In EACL 2021

Dataset

The ChEMU-Ref dataset is available through the ChEMU 2021 shared task. We have also uploaded the train and dev sets in jsonlines format in data.
For the detailed annotation guidelines of the ChEMU-Ref corpus, please refer to the ChEMU-Ref annotation guideline.

Getting Started

Install the Python (preferably Python 3) requirements: pip install -r requirements.txt
Download the GloVe and ChELMo embeddings
Run setup_all.sh
To train your own models, modify the relevant code in setup_training.sh and run it
Install the brat evaluation tool

Training Instructions

Experiment configurations are found in experiments.conf
Choose an experiment that you would like to run, e.g. best
Training: python train.py <experiment>
Results are stored in the logs directory.

Evaluation

Evaluation: python evaluate.py <experiment>
The evaluation tool provides different settings: exact and relaxed mention matching. For this paper, we use exact mention matching.

Input Data format

The input is a .jsonlines file. Each line of the .jsonlines file is a batch of sentences and must be in the following format:

{
"doc_key": "0414", 
"sentences": [["Step", "4", "Synthesis", "of", "Compound", "22", "To", "a", "solution", "of", "Compound", "21", "(", "445", "mg", ",", "0.88", "mmol", ")", "in", "dichloromethane", "(", "2", "mL", ")", "was", "added", "trifluoroacetic", "acid", "(", "2", "mL", ")", ",", "and", "the", "reaction", "mixture", "was", "stirred", "at", "room", "temperature", "for", "3", "hours", "."], ["The", "solvent", "was", "distilled", "off", "under", "reduced", "pressure", ",", "and", "saturated", "sodium", "hydrogen", "carbonate", "aqueous", "solution", "was", "added", "to", "the", "residue", ",", "followed", "by", "extraction", "with", "chloroform", "twice", "."], ["Then", ",", "the", "organic", "layer", "was", "dried", "over", "sodium", "sulfate", "."], ["The", "solvent", "was", "distilled", "off", "under", "reduced", "pressure", "to", "give", "Compound", "22", "(", "269", "mg", ",", "Yield", "100%", ")", "."]], 
"Coreference": [[[[47, 48]], [[20, 24]]], [[[87, 88]], [[73, 73]]], [[[97, 105]], [[4, 5]]]], 
"Transformed": [], 
"Reaction-associated": [[[[35, 37]], [[27, 32]]], [[[35, 37]], [[7, 24]]]], 
"Work-up": [[[[66, 67]], [[35, 37]]], [[[78, 80]], [[57, 62]]], [[[78, 80]], [[66, 67]]], [[[78, 80]], [[73, 73]]], [[[97, 105]], [[84, 85]]], [[[97, 105]], [[78, 80]]]], 
"Contained": []
}

For each referring relation, each pair contains two mentions: the first is the anaphor and the second is the antecedent.
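
As a minimal sketch of how a record in this format might be read, the snippet below resolves each anaphor/antecedent pair back to its token text. It is not part of the repository: the helper names (flatten_tokens, mention_text, iter_relations) are hypothetical, and it assumes, based on the example above, that offsets index tokens across the whole document (not per sentence), that end offsets are inclusive, and that each mention is a list of [start, end] spans. The record is an abbreviated copy of the example, keeping only the first sentence.

```python
import json

# Relation keys as they appear in the example record above.
RELATION_KEYS = ["Coreference", "Transformed", "Reaction-associated", "Work-up", "Contained"]


def flatten_tokens(doc):
    """Concatenate all sentences into one document-level token list."""
    return [tok for sent in doc["sentences"] for tok in sent]


def mention_text(tokens, spans):
    """Join the tokens of every [start, end] span (end assumed inclusive)."""
    return " ".join(" ".join(tokens[start:end + 1]) for start, end in spans)


def iter_relations(doc):
    """Yield (relation, anaphor text, antecedent text) for every pair."""
    tokens = flatten_tokens(doc)
    for key in RELATION_KEYS:
        for anaphor_spans, antecedent_spans in doc.get(key, []):
            yield key, mention_text(tokens, anaphor_spans), mention_text(tokens, antecedent_spans)


# Abbreviated record: first sentence of the example, with only the
# relations whose offsets fall inside that sentence.
first_sentence = (
    "Step 4 Synthesis of Compound 22 To a solution of Compound 21 "
    "( 445 mg , 0.88 mmol ) in dichloromethane ( 2 mL ) was added "
    "trifluoroacetic acid ( 2 mL ) , and the reaction mixture was "
    "stirred at room temperature for 3 hours ."
).split()

line = json.dumps({
    "doc_key": "0414",
    "sentences": [first_sentence],
    "Coreference": [],
    "Transformed": [],
    "Reaction-associated": [[[[35, 37]], [[27, 32]]], [[[35, 37]], [[7, 24]]]],
    "Work-up": [],
    "Contained": [],
})

doc = json.loads(line)  # one line of a .jsonlines file is one JSON object
relations = list(iter_relations(doc))
for relation, anaphor, antecedent in relations:
    print(f"{relation}: {anaphor!r} -> {antecedent!r}")
```

To process a full file, the same functions can be applied to each line in turn, calling json.loads on every non-empty line.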

Experiment result on the ChEMU-Ref Dataset (Full corpus)

The table below shows the primary experimental results on the full ChEMU-Ref dataset for our proposed model. For further experimental results with oracle mentions and ChELMo embeddings, please refer to Full experiment results.

Relation	Method	P_A	R_A	F_A	P_R	R_R	F_R
Coref. (Surface)	coreference	89.4	55.9	68.7	79.2	47.7	59.5
	joint_train	91.4	56.0	69.5	81.3	48.0	60.3
Coref. (Atom)	coreference	89.4	55.9	68.7	81.3	48.3	60.6
	joint_train	91.4	56.0	69.5	83.9	48.8	61.7
Bridging	bridging	89.5	83.9	86.6	81.4	72.8	76.8
	joint_train	91.2	84.1	87.5	83.1	74.1	78.3
TR	bridging	78.6	84.7	81.5	77.4	84.7	80.8
	joint_train	79.7	85.9	82.7	77.6	85.9	81.5
RA	bridging	89.5	84.6	87.0	80.6	68.5	74.0
	joint_train	91.4	85.6	88.4	82.7	69.2	75.3
WU	bridging	91.5	84.0	87.5	81.9	74.3	77.9
	joint_train	93.1	83.7	88.1	83.6	76.0	79.6
CT	bridging	89.8	77.5	83.1	85.1	70.0	76.8
	joint_train	91.3	77.0	83.3	85.9	69.4	76.4
Overall	joint_train	91.2	74.0	81.7	82.8	68.7	75.1

Anaphora resolution results on the test set (%). Models are trained for "coreference", "bridging" or "joint_train" (both tasks jointly). Models were trained over 30,000 epochs, and results are averaged over 3 runs with different random seeds. P, R and F denote precision, recall and F1 score; F_A and F_R denote the F1 score for anaphor and relation prediction, respectively.
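
For readers checking the table, F1 is the harmonic mean of precision and recall. The helper below is an illustrative sketch, not code from this repository (the name f1 is hypothetical). Because the reported scores are averaged over runs, recomputing F1 from the averaged P and R may differ slightly from the tabulated value in some rows.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall joint_train row, anaphor prediction: P_A = 91.2, R_A = 74.0
overall_f_a = f1(91.2, 74.0)
print(round(overall_f_a, 1))
```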
