Bulk-fix bracket tags in dep format. #4
This is not exactly an error, though I understand why this seems unexpected, and the lack of square bracket tags is something to consider changing completely - thanks for pointing both out. The GUM corpus manual tagging uses the same extended PTB tag set produced by the freely available TreeTagger model for English (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). This model actually uses the bracket character itself as a tag, and only the round variant. It normally tags square brackets as SYM, but because in GUM square brackets are used in very similar ways to round brackets, we decided to tag them the same. The second POS column which uses PTB style -RRB- tags etc. is actually derived automatically from the manual TreeTagger-style tags. The conversion turns the tag I think it's worth considering distinguishing the bracket types in the PTB style column, but for the TreeTagger column I consider the literal |
@amir-zeldes I don't entirely understand your response, and I see that I should probably have opened multiple pull requests. So there are multiple issues:
You say:
So my understanding is that valid line for brackets are So why is this PR invalid? |
I see, you're absolutely right, sorry for not reading more carefully before. I just saw the first example and assumed they were all like the first issue. The remaining ones are probably different types of conversion errors, missing steps, or human intervention errors. I'll merge these in a two-step process, since they need to be propagated into the different formats, at least partially. I'm also seeing now that some of the conll10 files don't actually have the vanilla tags in the second column; they just repeat the TT-style tags. I will leave this for now, since they will be populated on the next merge. I'm working on a Pepper-based automatic buildbot with some built-in validation, which should hopefully reduce errors and manual intervention in propagating changes (basically, I want any correction in any source format to be updated in the other formats, which will allow for faster and easier corrections). Thanks again for the corrections!
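The kind of check described above, spotting rows where the PTB-style column merely repeats a literal bracket tag, could look roughly like the following. This is only a sketch and not the Pepper-based buildbot mentioned in the comment: the function name `find_literal_ptb_tags`, the tab-separated layout, and the PTB column index are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Literal round brackets that should only appear as tags in the TT-style column.
LITERAL_BRACKET_TAGS = {"(", ")"}


def find_literal_ptb_tags(
    lines: Iterable[str], ptb_col: int = 4
) -> List[Tuple[int, str]]:
    """Return (1-based line number, tag) for rows whose PTB column is a literal bracket.

    Assumes tab-separated token lines; `ptb_col` (0-based) is a hypothetical
    default and should be adjusted to the actual file layout.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        if len(cols) > ptb_col and cols[ptb_col] in LITERAL_BRACKET_TAGS:
            hits.append((number, cols[ptb_col]))
    return hits
```

Running this over a file before a merge would list the rows that still need the escaped PTB tag, without changing anything.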
* Minor corrections
* Tags for brackets made more consistent: use literal brackets in TT tags, LRB style in PTB tags (see #4 for details)
* Minor token consistency corrections (some formats were using incorrect ASCII equivalents for Unicode quotation marks and other punctuation)
* Adjusted sentence border in syntax files for GUM_interview_gaming
* Missing sentence type added in GUM_news_imprisoned
* Coref conversion errors fixed in GUM_whow_chicken
* Numerous minor dependency corrections
There is a pervasive problem with brackets not having the correct POS tags. This leads to -RRB- and ( tags existing at the same time (same for -LRB-). I assume that the PTB escaped versions are the desired ones.
normalized to
( _ ( -LRB-
normalized to
) _ ) -RRB-
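A minimal sketch of this normalization, under stated assumptions: token lines are tab-separated, and the columns are token id, form, lemma, TreeTagger-style tag, PTB-style tag (0-based indices 0 to 4). The helper name `normalize_bracket_line` and the column indices are hypothetical and not taken from the repository. Only round brackets are handled; square brackets are left untouched.

```python
# Assumed 0-based column positions in a tab-separated token line.
FORM_COL, TT_COL, PTB_COL = 1, 3, 4

# Literal round brackets and their escaped PTB-style tags.
PTB_BRACKET_TAGS = {"(": "-LRB-", ")": "-RRB-"}


def normalize_bracket_line(line: str) -> str:
    """Return the line with consistent bracket tags; other lines pass through."""
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    # Skip comments, blank lines, short rows, and non-bracket tokens.
    if len(cols) <= PTB_COL or cols[FORM_COL] not in PTB_BRACKET_TAGS:
        return stripped
    form = cols[FORM_COL]
    cols[TT_COL] = form                      # literal bracket in the TT column
    cols[PTB_COL] = PTB_BRACKET_TAGS[form]   # escaped tag in the PTB column
    return "\t".join(cols)
```

For example, a row whose two tag columns both read -LRB- would come out with ( in the TT column and -LRB- in the PTB column, matching the normalized form shown above.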
Btw. are you sure that round brackets and square brackets should get the same tags?