mistake on annotation of ranges #2

Open
vcvpaiva opened this issue Oct 10, 2020 · 8 comments

Comments

@vcvpaiva

in sentences like:

  1. That 3% rate also applies to Nectar cardholders looking to borrow from £15,001-£19,999 over a period of between two and three years.

  2. Meanwhile, Bank of Scotland customers earn 3% on balances of £3,000-£5,000 when they add the free Vantage option to their account.

the hyphen should be considered punctuation, not a symbol, I believe. At least this is what happens in a sentence like

  1. Theoretically, a couple could open four Tesco accounts and earn 3% on £12,000 – £360.

yes, these are different kinds of "hyphen", but do you really want to make this distinction in terms of POS?

@dan-zeman
Member

The last one is also a different type of construction. It is not a range. It is an apposition that explains how much 3% of 12,000 is.

On the other hand, in ranges the hyphen can be read aloud as "to", which is the distinction between symbols and punctuation (although for me, hyphen feels borderline).

@vcvpaiva
Author

yes, I realized it's a different type of construction, which is why I asked "do you really want to make this distinction in terms of POS?" It seems to me very much against the "easy to annotate" principle.

@amir-zeldes

In both English-EWT and English-GUM, these hyphens in number ranges which are read as 'to' are analyzed as case, so that the second number can be e.g. nmod. In EWT they are tagged SYM and in GUM as ADP, though we should probably choose one and unify:

EWT: http://match.grew.fr/?corpus=UD_English-EWT@2.6&custom=5f831a41eabfd
GUM: http://match.grew.fr/?corpus=UD_English-GUM@2.6&custom=5f831b4c18290

I would find it very odd for a word with a proper grammatical function such as case to have upos=PUNCT (and xpos is already not the equivalent tag :, so assuming we don't want to change English xpos guidelines, it would also create a mapping discrepancy if we used upos=PUNCT). Also note that, at least for GUM, the corpus passes through udapi's fix-punct block, meaning that if we tagged it as PUNCT, it could get re-attached in all sorts of bad ways.
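
To make the analysis above concrete, here is a minimal sketch in plain Python. The CoNLL-U rows are hand-written for illustration and are not copied from EWT or GUM; the names `ROWS`, `parse_conllu` and `range_hyphens` are hypothetical, and the tokenization of the range is simplified (no currency symbols). It only shows the shape of the analysis: the hyphen attaches as case to the second number, which in turn is nmod of the first.

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: str
    form: str
    upos: str
    xpos: str
    head: str
    deprel: str

# Hand-written rows for "balances of 3,000-5,000" (illustrative only).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# Written with spaces for readability, then joined with tabs as CoNLL-U expects.
ROWS = [
    "1 balances balance NOUN NNS _ 0 root _ _",
    "2 of of ADP IN _ 3 case _ _",
    "3 3,000 3,000 NUM CD _ 1 nmod _ _",
    "4 - - SYM SYM _ 5 case _ _",
    "5 5,000 5,000 NUM CD _ 3 nmod _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)

# ASCII hyphen, en dash, em dash
DASHES = {"-", "\u2013", "\u2014"}

def parse_conllu(text):
    """Parse token lines of a CoNLL-U sentence, skipping comments and blanks."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tokens.append(Token(cols[0], cols[1], cols[3], cols[4], cols[6], cols[7]))
    return tokens

def range_hyphens(tokens):
    """Return dash tokens attached as case, i.e. the range reading discussed above."""
    return [t for t in tokens if t.form in DASHES and t.deprel == "case"]

for tok in range_hyphens(parse_conllu(SAMPLE)):
    print(tok.id, tok.form, tok.upos, tok.xpos, tok.deprel)
```

Running the same filter over a whole treebank file and counting the UPOS values of the hits would be one way to see how consistently a corpus tags these hyphens.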

@nschneid
Contributor

See discussion at UniversalDependencies/docs/issues/649

@amir-zeldes

Right, no surprise we've already talked about this :)

So what's the verdict - which corpus should we change? Opinions @nschneid @sebschu / others? I don't feel strongly about it, except that it shouldn't be PUNCT.

@vcvpaiva
Author

Cool that you have already discussed the issue (and sorry for not having read it beforehand). However, the issue persists. Do you want the different kinds of hyphens to have different POS? It seems perverse. Do you want all of them to become the preposition "to"? It seems wrong (to me, at any rate). As usual, the question is: which is the lesser evil?

@nschneid
Contributor

  • Ranges: The definition of SYM seems to fit for ranges:

    Many symbols are or contain special non-alphanumeric characters, similarly to punctuation. What makes them different from punctuation is that they can be substituted by normal words. This involves all currency symbols, e.g. $ 75 is identical to seventy-five dollars.

    I feel slightly uncomfortable with ADP because the hyphen/dash, though it can be read as to, does not feel like simple shorthand for to in the way that "2" is in social media. It feels similar to a mathematical operator with a highly conventionalized meaning in notation for ranges.

  • Parenthetical: The dash that acts like a colon or sets off a parenthetical should be PUNCT, I think (in some contexts it could be explicated with "namely" but I don't think it's a conventional way of writing "namely": it mainly clarifies prosody/phrasing).

@amir-zeldes

OK, GUM source repo now has SYM/SYM and I unified the lemma of en-dash and hyphen to be hyphen, matching EWT:

amir-zeldes/gum@d4fd9d2

Parenthetical dashes are (and were already) :/PUNCT, so no problems there.
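
For anyone who wants to apply the same convention to their own data, the rule can be sketched roughly as below. This is not the code from the commit above; `unify_range_dash` is a hypothetical helper, the two input rows are invented, and reading "lemma ... to be hyphen" as the ASCII hyphen character is my assumption. Dashes attached as case become SYM/SYM with a hyphen lemma, and everything else, including parenthetical dashes tagged :/PUNCT, is left alone.

```python
# ASCII hyphen, en dash, em dash
DASHES = {"-", "\u2013", "\u2014"}

def unify_range_dash(line):
    """Retag a range dash in one CoNLL-U line; return other lines unchanged."""
    cols = line.split("\t")
    if len(cols) != 10:
        # Comment lines, blank lines, or anything that is not a token row
        return line
    form, deprel = cols[1], cols[7]
    if form in DASHES and deprel == "case":
        cols[2] = "-"    # lemma (assumed to be the ASCII hyphen)
        cols[3] = "SYM"  # upos
        cols[4] = "SYM"  # xpos
    return "\t".join(cols)

# Invented rows: an en dash in a range, and an en dash used parenthetically.
before_range = "\t".join("4 \u2013 \u2013 ADP IN _ 5 case _ _".split())
before_paren = "\t".join("7 \u2013 \u2013 PUNCT : _ 9 punct _ _".split())

after_range = unify_range_dash(before_range)
after_paren = unify_range_dash(before_paren)

print(after_range.split("\t")[1:5])
print(after_paren.split("\t")[1:5])
```

Keying the rule on the dependency relation rather than on the character itself means the hyphen, en dash and em dash are all treated the same way, which is the point of the unification.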
