Cross-Document Event Coreference (CDEC)

Code for cross-document event coreference (CDEC): the task of clustering event mentions, across a collection of documents, that refer to the same real-world event. The project is configured to run CDEC on the ECB+ corpus using only event trigger annotations. It matches or slightly beats the state of the art on trigger-only CDEC with a significantly simpler model.
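
To make the shape of the task concrete, here is a minimal Python sketch of what "clustering event mentions" means. It is not this project's model: it assumes the pairwise coreference decisions are already given and only groups mentions by following those links (union-find). The mention tuples and the function name `cluster_mentions` are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def cluster_mentions(mentions, coreferent_pairs):
    """Group mentions into clusters by following pairwise coreference links."""
    parent = {m: m for m in mentions}

    def find(m):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in coreferent_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical mentions, written as (document, trigger word).
mentions = [
    ("doc1", "quake"),
    ("doc2", "earthquake"),
    ("doc2", "arrested"),
    ("doc3", "struck"),
]
# Hypothetical pairwise decisions; a real system would predict these.
pairs = [
    (("doc1", "quake"), ("doc2", "earthquake")),
    (("doc2", "earthquake"), ("doc3", "struck")),
]
clusters = cluster_mentions(mentions, pairs)
for cluster in clusters:
    print(cluster)
```

The three earthquake mentions end up in one cluster even though "quake" and "struck" were never linked directly, which is the transitive behaviour a coreference clustering is expected to have.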

What is this for?

Suppose some event takes place and journalists inform the public. The reports refer to the same event but inevitably vary in their language, tone and contextualization. What can we learn from the variations? Before we can perform any analysis, we must first group together all reports of the same event. Generating these event groupings automatically is called Cross-Document Event Coreference (CDEC). It is an important component not only for this application, but also for tasks such as information retrieval and question answering. CDEC lets us augment the information around an event mention in a single document with information from all of its mentions across many documents, such as entities and other context, giving a clearer picture of how events and their encompassing stories are told.

Prerequisites

Note: This code was developed on Ubuntu 19.10, and the instructions assume access to a Linux shell.

Data

ECB+

Dependencies

Download pre-trained fastText vectors here -- select the English bin file.
Stanford CoreNLP 3.9.0
The official CoNLL Scorer
Java 1.8
Python 3.7.5
pipenv 2018.11.26
Maven 3.6.1
Dependencies for the Python files ('python assets') are listed in the corresponding Pipfile and managed through pipenv.
Dependencies for the main Java project ('cdec') are listed in pom.xml and managed through Maven.

Instructions

Clone the CoNLL scorer repo and place it into a directory called "perl_assets" beneath the root directory.

mkdir perl_assets
cd perl_assets
git clone https://github.com/conll/reference-coreference-scorers
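
If you want to call the scorer yourself once it is cloned, the following Python sketch builds the command line. It assumes the clone location from the step above and the scorer's documented argument order (`scorer.pl <metric> <key file> <response file>`). How this project invokes the scorer internally is not shown here, and the file names `key.conll` and `response.conll` are hypothetical.

```python
import subprocess
from pathlib import Path

# Location that results from the clone step above, relative to the root directory.
SCORER = Path("perl_assets") / "reference-coreference-scorers" / "scorer.pl"


def scorer_command(metric, key_file, response_file, scorer=SCORER):
    """Build the argument list for one scorer run (metric: muc, bcub, ceafe, ...)."""
    return ["perl", str(scorer), metric, str(key_file), str(response_file)]


def run_scorer(metric, key_file, response_file):
    """Run the scorer and return its textual report; raises if perl exits non-zero."""
    result = subprocess.run(
        scorer_command(metric, key_file, response_file),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical file names, shown only to illustrate the argument order.
print(" ".join(scorer_command("bcub", "key.conll", "response.conll")))
```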

Open the root directory in a terminal and type the following commands to generate the pipenv environments for the required python scripts:

cd python_assets
cd ecb_augmenter
pipenv install
cd ../word_vecs
pipenv install

Prepare the ECB+ corpus:
- First, open your download of the ECB+ corpus and place the extracted ECB+.zip directory and the file "ECBplus_coreference_sentences.csv" in the CDEC/data directory.
- "Augment" the ECB+ corpus (this makes it more convenient to parse):
```
cd ecb_augmenter
pipenv run python main.py
```
Open the file "external_paths.json" in the root directory and add the paths to your installation of Stanford CoreNLP and the fastText vectors.
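
A wrong path here tends to surface only later, when a server fails to start, so it can help to check the file first. The sketch below assumes that "external_paths.json" is a flat JSON object whose string values are filesystem paths. That layout is an assumption, and no particular key names are relied on. The helper name `missing_paths` is hypothetical.

```python
import json
from pathlib import Path


def missing_paths(config_file="external_paths.json"):
    """Return the {key: path} entries whose path does not exist on disk.

    Assumes a flat JSON object; non-string values are ignored.
    """
    with open(config_file, encoding="utf-8") as f:
        config = json.load(f)
    return {
        key: value
        for key, value in config.items()
        if isinstance(value, str) and not Path(value).expanduser().exists()
    }


if __name__ == "__main__" and Path("external_paths.json").exists():
    for key, value in missing_paths().items():
        print(f"{key}: path not found: {value}")
```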
Start the word embedding server:
- When the server has finished loading, your terminal will display a message with the server's port.

cd word_vecs
pipenv run python main.py
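
Loading the vectors can take a while. If you would rather poll than watch the terminal, a small Python check like the one below reports whether anything is accepting connections on a given port. The port number is whatever the server prints; none is assumed here, and `server_is_up` is a hypothetical helper, not part of this repository.

```python
import socket


def server_is_up(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Usage: replace the argument with the port shown in the server's startup message.
# server_is_up(port_from_terminal)
```

This only confirms that the port is open; it does not verify that the process listening there is the word embedding server.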

Run the main java project:
- This will run random 5-fold cross validation and log the results to an automatically created folder at data/results.
- If you wish to compile the code from source, you may do so using maven and the pom.xml file.

cd cdec
mvn clean compile assembly:single  # this generates a .jar file with all necessary dependencies
java -cp target/CDEC-0.0.1-SNAPSHOT-jar-with-dependencies.jar main.Main
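
For readers unfamiliar with the evaluation protocol, this is a minimal Python sketch of random 5-fold cross validation. It illustrates the general technique only. The actual implementation lives in the Java project and may differ, and the unit being split (documents, topics, or something else) is not stated in this README, so the sketch splits a generic list of items. The function name and the seed are hypothetical.

```python
import random


def random_k_folds(items, k=5, seed=0):
    """Yield (train, test) splits; each item lands in the test set exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical item IDs standing in for whatever unit is being split.
for fold_index, (train, test) in enumerate(random_k_folds(range(10))):
    print(fold_index, sorted(test))
```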

Remember to shut down the CoreNLP server and the word embedding server once you are done.
