adding English data from the merge of UD and Propbank #5
Comments
Two issues were already reported in the propbank and UD_English repositories:
Some mismatches between the PoS tag in the Propbank data and the xpostag in the UD data deserve attention. One example is:
The token "used" was analyzed as JJ in EWT and, therefore, was not annotated as a predicate. In UD this token was analyzed as VERB. The tree structure of UD should be very different from the PTB tree.
This is the summary of cases where the original PoS tag is completely different from the current xpostag in the UD data:
We have 177 sentences with these differences between the xpostag in the UD data and the PoS tag in the Propbank/EWT data:
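A minimal sketch of how such a summary of tag differences could be computed for one sentence. This is not the script used for the counts above. The function names `read_xpos` and `tag_mismatches` are hypothetical, and the sketch assumes the Propbank/EWT tokens and the UD tokens are already aligned one-to-one, which may not hold where tokenization was revised.

```python
from collections import Counter


def read_xpos(conllu_sentence):
    """Return (form, xpos) pairs for the syntactic words of one CoNLL-U sentence."""
    tags = []
    for line in conllu_sentence.splitlines():
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "5.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Column 2 is FORM, column 5 is XPOS.
        tags.append((cols[1], cols[4]))
    return tags


def tag_mismatches(conllu_sentence, ptb_tags):
    """Count (old PTB tag, UD xpostag) pairs that disagree.

    ptb_tags is the list of PoS tags from the Propbank/EWT side,
    assumed to be aligned token by token with the UD sentence.
    """
    ud = read_xpos(conllu_sentence)
    if len(ud) != len(ptb_tags):
        raise ValueError("token counts differ; the sentences need aligning first")
    return Counter(
        (old, new) for (_, new), old in zip(ud, ptb_tags) if old != new
    )
```

Summing the returned counters over all sentences gives a table of tag pairs that can be sorted by frequency to decide which cases to review first.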
My suggestions are:
Concerning PTB metadata, my suggestion is that we use the new UD data in practice and ignore the old data, but do not remove the related info from the data itself. While it is good to use only UD data in all future work, there may be occasions when people want to compare evaluations of models based on the new and old data, and having these links to the past in the same file may be useful.
In f631cfa I introduced the first version of the merge. The data is not ready for merging into master.
@arademaker thanks for performing this merge - this will be very useful for anyone that wants to train SRL systems over UD! A quick question on the format: The
Thank you @alanakbik, I agree we have to think a little bit more about the final format. I actually ended up using the same format used for the other languages and improved the README file explaining that the This is a bad situation because the extension may lead people to believe that standard CoNLL-U readers can parse the files, and that is not the case for now. I also don't like having to deal with a variable number of columns per sentence. The format you suggest above seems to be very concise, and it can be encoded in the MISC column. Other options are:
@huaiyu-zhu, @yunyaoli?
Or choose a different extension, e.g.
I would probably vote against encoding this information in the sentence metadata. CoNLL-U Plus or changing the extension are good solutions, but the best option might be to have this in valid CoNLL-U format, since this is what most people and tools use. So I like your way of encoding SRL in the MISC column; perhaps the readability could be improved by encoding the arguments with an ID-pointer system, like in the Finnish Propbank or the enhanced dependency graph (which uses head-deprel pairs)?
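To make the MISC-column idea concrete, here is a rough sketch of one possible ID-pointer encoding. The attribute names `Roleset` and `SrlArgs` and both helper functions are hypothetical and are not taken from this repository or from the Finnish Propbank. Each argument token carries `predicate_id:role` pairs, similar in spirit to the head-deprel pairs of the enhanced dependencies. The pairs are separated by commas because `|` already separates attributes inside MISC.

```python
def encode_srl_misc(misc, roleset=None, arg_links=()):
    """Return a MISC value with SRL information appended.

    misc      -- existing MISC value ("_" when empty)
    roleset   -- predicate sense for this token, e.g. "use.01", or None
    arg_links -- (predicate_token_id, role) pairs for which this token
                 is the argument head
    """
    fields = [] if misc in ("", "_") else misc.split("|")
    if roleset:
        fields.append(f"Roleset={roleset}")  # hypothetical attribute name
    if arg_links:
        links = ",".join(f"{pid}:{role}" for pid, role in sorted(arg_links))
        fields.append(f"SrlArgs={links}")  # hypothetical attribute name
    return "|".join(fields) if fields else "_"


def decode_srl_misc(misc):
    """Read back (roleset, [(predicate_token_id, role), ...]) from a MISC value."""
    roleset, links = None, []
    if misc not in ("", "_"):
        for field in misc.split("|"):
            key, _, value = field.partition("=")
            if key == "Roleset":
                roleset = value
            elif key == "SrlArgs":
                for item in value.split(","):
                    pid, _, role = item.partition(":")
                    links.append((int(pid), role))
    return roleset, links
```

With this layout every sentence keeps exactly ten columns, so a standard CoNLL-U reader can still load the file and simply ignore the extra MISC attributes.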
The idea is to merge the Propbank data in https://github.com/propbank/propbank-release (the subset covering the EWT treebank) with http://github.com/universaldependencies/UD_English-EWT (the same EWT sentences, with UD annotations and revisions).