Code to perform cross-document event coreference (CDEC), the task of clustering event mentions across a collection of documents that refer to the same real-world event. Configured to run CDEC on the ECB+ corpus using only event trigger annotations. Matches or slightly beats the state of the art on trigger-only CDEC with a significantly simpler model.
Suppose some event takes place and journalists inform the public. The reports refer to the same event but inevitably vary in their language, tone and contextualization. What can we learn from the variations? Of course, we must first group together all reports of the same event before we can carry out any analysis. Generating these event groupings automatically is called Cross-Document Event Coreference (CDEC). It is an important component not only for this application, but also for tasks such as information retrieval and question answering. CDEC lets us augment the information around an event mention in a single document with information from all of its mentions across many documents, such as entities and other context, giving a clearer picture of how events and their encompassing stories are told.
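To make the idea of "event groupings" concrete, here is a minimal sketch of turning pairwise coreference decisions into clusters with union-find. It is an illustration only and not the model used in this repository; the mentions, the pairs, and the function name `cluster_mentions` are hypothetical.

```python
from collections import defaultdict


def cluster_mentions(mentions, coreferent_pairs):
    """Group mentions into clusters given pairwise coreference links (union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        # Walk up to the root, shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    return list(clusters.values())


# Hypothetical mentions, each a (document id, trigger word) pair.
mentions = [
    ("doc1", "earthquake"),
    ("doc2", "quake"),
    ("doc2", "rescued"),
    ("doc3", "tremor"),
]
# Hypothetical pairwise decisions: which mentions refer to the same event.
pairs = [
    (("doc1", "earthquake"), ("doc2", "quake")),
    (("doc2", "quake"), ("doc3", "tremor")),
]
clusters = cluster_mentions(mentions, pairs)
```

Here the three earthquake mentions from three different documents end up in one cluster, while "rescued" stays in a cluster of its own.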
Note: This code was developed on Ubuntu 19.10, and the instructions assume access to a Linux shell.
- Pre-trained fastText vectors: download the English .bin file.
- Stanford CoreNLP 3.9.0
- The official CoNLL Scorer
- java 1.8
- python 3.7.5
- pipenv 2018.11.26
- maven 3.6.1
- Dependencies for python files ('python assets') are listed in the corresponding Pipfile and managed through pipenv.
- Dependencies for the main java project ('cdec') are in pom.xml and managed through maven.
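Before starting the setup, it can help to confirm that the command-line tools are reachable. The following is a minimal sketch, assuming the tools are installed under their usual command names (`java`, `python3`, `pipenv`, `mvn`) and that `perl` is needed to run the CoNLL scorer. It only checks presence on PATH, not versions, and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Command names are assumptions based on the tool list above;
# perl is assumed to be needed for the CoNLL scorer.
REQUIRED_TOOLS = ["java", "python3", "pipenv", "mvn", "perl"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```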
- Clone the CoNLL scorer repo and place it into a directory called "perl_assets" beneath the root directory.
mkdir perl_assets
cd perl_assets
git clone https://github.com/conll/reference-coreference-scorers
- Open the root directory in a terminal and type the following commands to generate the pipenv environments for the required python scripts:
cd python_assets
cd ecb_augmenter
pipenv install
cd ../word_vecs
pipenv install
- Prepare the ECB+ corpus:
  - First, open your download of the ECB+ corpus and place the directory extracted from ECB+.zip and the file "ECBplus_coreference_sentences.csv" in the CDEC/data directory.
  - "Augment" the ECB+ corpus (this makes it more convenient to parse):
cd ecb_augmenter
pipenv run python main.py
- Open the file "external_paths.json" in the root directory and add the paths to your installation of Stanford CoreNLP and the fastText vectors.
- Start the word embedding server:
  - When the server is finished loading, your terminal will display a message with the server's port.
cd word_vecs
pipenv run python main.py
- Run the main java project:
  - This will run random 5-fold cross validation and log the results to an automatically created folder at data/results.
  - If you wish to compile the code from source, you may do so using maven and the pom.xml file.
cd cdec
mvn clean compile assembly:single  # this generates a .jar file with all necessary dependencies
java -cp target/CDEC-0.0.1-SNAPSHOT-jar-with-dependencies.jar main.Main
- Remember to shut down the CoreNLP server and the word embedding server once you are done.
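
For readers unfamiliar with the evaluation protocol, the random 5-fold cross validation mentioned in the run step can be pictured roughly as follows. This is a minimal sketch and not the project's Java implementation. What the project actually partitions (topics, documents, or something else) and how it seeds the shuffle are not stated here, so `unit_ids`, `seed`, and `random_k_folds` are hypothetical.

```python
import random


def random_k_folds(items, k=5, seed=0):
    """Shuffle items and yield (train, test) splits for k-fold cross validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical ids of the units being split.
unit_ids = list(range(1, 11))
splits = list(random_k_folds(unit_ids, k=5, seed=42))
```

Each unit appears in exactly one test fold, and each split trains on the remaining four folds.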