Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk-fix bracket tags in dep format. #4

Merged
merged 4 commits into from
Aug 18, 2016

Conversation

reckart
Copy link
Contributor

@reckart reckart commented Aug 12, 2016

There is a pervasive problem with brackets not having the correct POS tags. This leads to -RRB- and ( tags existing at the same time (same for -LRB-).

I assume that the the PTB escaped versions are the desired ones.

-LRB-   _   (   (
-LRB-   _   -LRB-   -LRB-
[   _   (   (
(   _   (   (

normalized to ( _ ( -LRB-

-RRB-   _   )   )
-RRB-   _   -RRB-   -RRB-
]   _   (   (
]   _   )   )
)   _   )   )

normalized to ) _ ) -RRB-

Btw. are you sure that round brackets and square brackets should get the same tags?

@amir-zeldes
Copy link
Owner

This is not exactly an error, though I understand why this seems unexpected, and the lack of square bracket tags is something to consider changing completely - thanks for pointing both out.

The GUM corpus manual tagging uses the same extended PTB tag set produced by the freely available TreeTagger model for English (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). This model actually uses the bracket character itself as a tag, and only the round variant. It normally tags square brackets as SYM, but because in GUM square brackets are used in very similar ways to round brackets, we decided to tag them the same.

The second POS column which uses PTB style -RRB- tags etc. is actually derived automatically from the manual TreeTagger-style tags. The conversion turns the tag ( to -RRB- automatically, and doesn't check whether the token is a square bracket, though this would be easy to do.

I think it's worth considering distinguishing the bracket types in the PTB style column, but for the TreeTagger column I consider the literal ( to be the correct tag, so I'm not merging this change for now. I will definitely revisit the conversion to 'vanilla' PTB though, so thanks again for raising this issue.

@reckart
Copy link
Contributor Author

reckart commented Aug 18, 2016

@amir-zeldes I don't understand you response entirely and I see that I should probably have opened multiple pull requests.

So there are multiple issues:

  • square brackets annotated using round-bracket tags: I understand you say you consider fixing that
    • [ _ ( (
    • ] _ ) )
  • round brackets in text represented as -RRB-/-LRB- in text: these seem to me to be wrong in the text. I looked at the locations in the text and these didn't seem to me to be actual proper instances of these strings, but rather substitutions for round brackets. E.g. 92112f2#diff-3184c3716cf2b0d7ea01a17c7a12b83fL936
    • -LRB- _ ( (
    • -RRB- _ ) )
  • round brackets in text represented as -RRB-/-LRB- in text 2: my understanding of your reply is that in the tags only the plain bracket forms should be used (cf. TreeTagger). But then the PTB forms must be replaced. E.g. aea1d48#diff-e92674eb2759d50fd1d838dda9ae7f37L63
    • -LRB- _ -LRB- -LRB-
    • -RRB- _ -RRB- -RRB-
  • brackets going in the wrong direction: I don' think you addressed these in your reply:
    • ] _ ( (

You say:

The second POS column which uses PTB style -RRB- tags etc. is actually derived automatically from the manual TreeTagger-style tags. The conversion turns the tag ( to -RRB- automatically, and doesn't check whether the token is a square bracket, though this would be easy to do.

So my understanding is that valid line for brackets are ( _ ( -LRB- and ) _ ) -RRB- which is exactly to what I have normalized the invalid annotations.

So why is this PR invalid?

@amir-zeldes
Copy link
Owner

I see, you're absolutely right, sorry for not reading more carefully before. I just saw the first example and assumed they were all like the first issue. The remaining ones are probably different types of conversion errors/missing steps/human intervention errors. I'll merge these in a two step process, since they need to be propagated into the different formats, at least partially.

I'm also seeing now that some of the conll10 files don't actually have the vanilla tags in the second column, they just repeat the TT style tags. I will leave this for now since they will be populated on the next merge. I'm working on a Pepper-based automatic buildbot with some built in validation which should hopefully reduce errors and manual intervention in propagating changes (basically I want any correction in any source format to be updated in the other formats, which will allow for faster and easier corrections).

Thanks again for the corrections!

@amir-zeldes amir-zeldes reopened this Aug 18, 2016
@amir-zeldes amir-zeldes merged commit ae26261 into amir-zeldes:master Aug 18, 2016
amir-zeldes added a commit that referenced this pull request Aug 18, 2016
  * Minor corrections
    * Tags for brackets made more consistent: use literal brackets in TT tags, LRB style in PTB tags (see #4 for details)
    * Minor token consistency corrections (some formats were using incorrect ASCII equivalents for Unicode quotation marks and other punctuation)
    * Adjusted sentence border in syntax files for GUM_interview_gaming
    * Missing sentence type added in GUM_news_imprisoned
    * Coref conversion errors fixed in GUM_whow_chicken
    * Numerous minor dependency corrections
amir-zeldes added a commit that referenced this pull request Jan 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants