Bulk-fix bracket tags in dep format. #4

reckart · 2016-08-12T20:31:35Z

There is a pervasive problem with brackets not having the correct POS tags. This leads to -RRB- and ( tags existing at the same time (same for -LRB-).

I assume that the the PTB escaped versions are the desired ones.

-LRB-   _   (   (
-LRB-   _   -LRB-   -LRB-
[   _   (   (
(   _   (   (

normalized to ( _ ( -LRB-

-RRB-   _   )   )
-RRB-   _   -RRB-   -RRB-
]   _   (   (
]   _   )   )
)   _   )   )

normalized to ) _ ) -RRB-

Btw. are you sure that round brackets and square brackets should get the same tags?

amir-zeldes · 2016-08-17T21:44:11Z

This is not exactly an error, though I understand why this seems unexpected, and the lack of square bracket tags is something to consider changing completely - thanks for pointing both out.

The GUM corpus manual tagging uses the same extended PTB tag set produced by the freely available TreeTagger model for English (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). This model actually uses the bracket character itself as a tag, and only the round variant. It normally tags square brackets as SYM, but because in GUM square brackets are used in very similar ways to round brackets, we decided to tag them the same.

The second POS column which uses PTB style -RRB- tags etc. is actually derived automatically from the manual TreeTagger-style tags. The conversion turns the tag ( to -RRB- automatically, and doesn't check whether the token is a square bracket, though this would be easy to do.

I think it's worth considering distinguishing the bracket types in the PTB style column, but for the TreeTagger column I consider the literal ( to be the correct tag, so I'm not merging this change for now. I will definitely revisit the conversion to 'vanilla' PTB though, so thanks again for raising this issue.

reckart · 2016-08-18T06:07:10Z

@amir-zeldes I don't understand you response entirely and I see that I should probably have opened multiple pull requests.

So there are multiple issues:

square brackets annotated using round-bracket tags: I understand you say you consider fixing that
- [ _ ( (
- ] _ ) )
round brackets in text represented as -RRB-/-LRB- in text: these seem to me to be wrong in the text. I looked at the locations in the text and these didn't seem to me to be actual proper instances of these strings, but rather substitutions for round brackets. E.g. 92112f2#diff-3184c3716cf2b0d7ea01a17c7a12b83fL936
- -LRB- _ ( (
- -RRB- _ ) )
round brackets in text represented as -RRB-/-LRB- in text 2: my understanding of your reply is that in the tags only the plain bracket forms should be used (cf. TreeTagger). But then the PTB forms must be replaced. E.g. aea1d48#diff-e92674eb2759d50fd1d838dda9ae7f37L63
- -LRB- _ -LRB- -LRB-
- -RRB- _ -RRB- -RRB-
brackets going in the wrong direction: I don' think you addressed these in your reply:
- ] _ ( (

You say:

The second POS column which uses PTB style -RRB- tags etc. is actually derived automatically from the manual TreeTagger-style tags. The conversion turns the tag ( to -RRB- automatically, and doesn't check whether the token is a square bracket, though this would be easy to do.

So my understanding is that valid line for brackets are ( _ ( -LRB- and ) _ ) -RRB- which is exactly to what I have normalized the invalid annotations.

So why is this PR invalid?

amir-zeldes · 2016-08-18T08:54:51Z

I see, you're absolutely right, sorry for not reading more carefully before. I just saw the first example and assumed they were all like the first issue. The remaining ones are probably different types of conversion errors/missing steps/human intervention errors. I'll merge these in a two step process, since they need to be propagated into the different formats, at least partially.

I'm also seeing now that some of the conll10 files don't actually have the vanilla tags in the second column, they just repeat the TT style tags. I will leave this for now since they will be populated on the next merge. I'm working on a Pepper-based automatic buildbot with some built in validation which should hopefully reduce errors and manual intervention in propagating changes (basically I want any correction in any source format to be updated in the other formats, which will allow for faster and easier corrections).

Thanks again for the corrections!

* Minor corrections * Tags for brackets made more consistent: use literal brackets in TT tags, LRB style in PTB tags (see #4 for details) * Minor token consistency corrections (some formats were using incorrect ASCII equivalents for Unicode quotation marks and other punctuation) * Adjusted sentence border in syntax files for GUM_interview_gaming * Missing sentence type added in GUM_news_imprisoned * Coref conversion errors fixed in GUM_whow_chicken * Numerous minor dependency corrections

* Fixes #4

reckart added 4 commits August 12, 2016 22:25

Bulk-fix bracket tags in dep format.

03cf5bf

Fixing another bracket.

d3aafd1

More bracket fixes.

aea1d48

More bracket fixes.

92112f2

amir-zeldes closed this Aug 17, 2016

amir-zeldes reopened this Aug 18, 2016

amir-zeldes merged commit ae26261 into amir-zeldes:master Aug 18, 2016

reckart mentioned this pull request Aug 19, 2016

Some brackets got "unfixed" again #5

Closed

amir-zeldes added a commit that referenced this pull request Jan 21, 2020

More flat fixes

e81628b

* Fixes #4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bulk-fix bracket tags in dep format. #4

Bulk-fix bracket tags in dep format. #4

reckart commented Aug 12, 2016 •

edited

Loading

amir-zeldes commented Aug 17, 2016

reckart commented Aug 18, 2016 •

edited

Loading

amir-zeldes commented Aug 18, 2016

Bulk-fix bracket tags in dep format. #4

Bulk-fix bracket tags in dep format. #4

Conversation

reckart commented Aug 12, 2016 • edited Loading

amir-zeldes commented Aug 17, 2016

reckart commented Aug 18, 2016 • edited Loading

amir-zeldes commented Aug 18, 2016

reckart commented Aug 12, 2016 •

edited

Loading

reckart commented Aug 18, 2016 •

edited

Loading