Questions about dataset preprocessing #11

Closed
wasiahmad opened this issue Feb 11, 2020 · 4 comments


@wasiahmad

In the documentation, there are two dataset preprocessing steps: one for entities and relations, and a second for events. The first uses Stanford CoreNLP, but the second uses spaCy. Can you please explain the difference? I see that the relation labels differ between the two preprocessing steps (for example, "ORG-AFF.Membership" vs. "GEN-AFF"), and their offset values differ too. There are other differences as well. It would be helpful if you could provide some details.

Since ACE05 is a benchmark dataset, I assumed the token, entity, relation, and event annotations were already there. Why, then, do you need the CoreNLP or spaCy libraries?

@dwadden
Owner

dwadden commented Feb 11, 2020

Good questions. As far as I know, the ACE dataset in its original release does not actually split the data into tokens or sentences. It does provide spans for entities, relations, and events, but there is ambiguity there also as described in DATA.md. In general, people do their own preprocessing. Part of the motivation for this code release was to offer a standardized way of preprocessing the ACE data for others to use.
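To make the point about character-level spans concrete, here is a minimal sketch of the alignment problem that any ACE preprocessing pipeline has to solve: mapping an annotated character span onto token indices. Everything in it is an assumption for illustration. The regex tokenizer is a toy stand-in for CoreNLP or spaCy, the sentence and spans are made up, the helper names are hypothetical, and span ends are treated as exclusive. It is not the code used in this repository.

```python
import re
from typing import List, Optional, Tuple

# (token text, character start, character end, end exclusive)
Token = Tuple[str, int, int]


def tokenize(text: str) -> List[Token]:
    """Toy regex tokenizer: a stand-in for CoreNLP or spaCy, not either tool's rules."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(
    tokens: List[Token], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map a character span [start, end) to inclusive (first, last) token indices.

    Returns None when either boundary does not coincide with a token boundary.
    """
    first = next((i for i, tok in enumerate(tokens) if tok[1] == start), None)
    last = next((i for i, tok in enumerate(tokens) if tok[2] == end), None)
    if first is None or last is None or first > last:
        return None
    return first, last


# Made-up sentence and spans, for illustration only.
text = "John Smith joined Acme Corp. in 2004."
tokens = tokenize(text)

print(char_span_to_token_span(tokens, 0, 10))   # "John Smith" -> (0, 1)
print(char_span_to_token_span(tokens, 18, 28))  # "Acme Corp." -> (3, 5)
print(char_span_to_token_span(tokens, 18, 26))  # "Acme Cor"   -> None (cuts a token)
```

Because the token indices depend entirely on where the tokenizer places its boundaries, two pipelines built on different tokenizers can assign different offsets to the same underlying annotation.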

Historically, the community has used different train / dev / test splits when doing ACE relation extraction vs. ACE event extraction. For the details, see section 3 of the dygiepp paper. For relation extraction we use the split from Miwa and Bansal and for event extraction we use the split from Yang and Mitchell.

For the relation split, I use an adapted version of the preprocessing code from the Miwa and Bansal codebase. This code relies on Stanford CoreNLP.

For the event split, I used code from this paper. The code itself is not publicly released, but the author shared it with me, and I adapted it and made it public. This code relies on spaCy.

@wasiahmad
Author

Thank you for clarifying everything. I assume CoreNLP and spaCy are used for tokenization and sentence splitting.

@wasiahmad
Author

Opening this issue to ask a question regarding event extraction data. As noted in the dataset description here, from ACE05 we have entities, relations, and coreference clusters.

Then where is the ground-truth for event triggers, event arguments, and their role-labels? Can you please explain this?

@wasiahmad wasiahmad reopened this Feb 26, 2020
@dwadden
Owner

dwadden commented Mar 3, 2020

Added to DATA.md. Thanks for the question.

@dwadden dwadden closed this as completed Mar 3, 2020