Incorrect lemma capitalization #37

Open
rhdunn opened this issue Nov 28, 2023 · 7 comments

Comments

@rhdunn

rhdunn commented Nov 28, 2023

NNP should have capitalized lemma, the others lowercase:

ERROR: Sentence w01135037 token 8 -- CD/NumForm=Word lemma 'Five' does not match lowercase-form applied to form 'Five', expected 'five'
ERROR: Sentence n05001005 token 27 -- IN lemma 'Under' does not match lowercase-form applied to form 'Under', expected 'under'
ERROR: Sentence w01071043 token 3 -- NN lemma 'Post' does not match lowercase-form applied to form 'Post', expected 'post'
ERROR: Sentence w01100046 token 17 -- NN lemma 'Governor' does not match lowercase-form applied to form 'Governor', expected 'governor'
ERROR: Sentence w01100046 token 19 -- NN lemma 'General' does not match lowercase-form applied to form 'General', expected 'general'
ERROR: Sentence w01135037 token 10 -- NN lemma 'Year' does not match lowercase-form applied to form 'Year', expected 'year'
ERROR: Sentence n05001005 token 29 -- NN lemma 'Secretary' does not match lowercase-form applied to form 'Secretary', expected 'secretary'
ERROR: Sentence n02027007 token 7 -- NNP lemma 'service' does not match capitalized-form applied to form 'Service', expected 'Service'
ERROR: Sentence n02033113 token 5 -- NNP lemma 'ZEIT' does not match capitalized-form applied to form 'ZEIT', expected 'Zeit'
ERROR: Sentence n02066010 token 11 -- NNP lemma 'SPIEGEL' does not match capitalized-form applied to form 'SPIEGEL', expected 'Spiegel'
ERROR: Sentence n04006014 token 28 -- NNP lemma 'eurozone' does not match capitalized-form applied to form 'eurozone', expected 'Eurozone'
ERROR: Sentence n04008007 token 3 -- NNP lemma 'LUISS' does not match capitalized-form applied to form 'LUISS', expected 'Luiss'
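
The rule behind these messages can be sketched roughly as follows. This is only an illustration inferred from the log output above, not the actual validator code: the function names `expected_lemma` and `check_lemma` are hypothetical, and "capitalized-form" is assumed to mean first letter uppercase, rest lowercase (which is what 'ZEIT' -> 'Zeit' and 'eurozone' -> 'Eurozone' suggest). It only checks casing against the surface form, so it does not cover inflected forms whose lemma is spelled differently.

```python
from typing import Optional


def expected_lemma(form: str, xpos: str) -> str:
    """Return the lemma casing implied by the rule (assumed behaviour)."""
    if xpos == "NNP":
        # Assumption: capitalized-form = first letter upper, rest lower.
        return form.capitalize()
    # All other tags: lowercase-form.
    return form.lower()


def check_lemma(form: str, lemma: str, xpos: str) -> Optional[str]:
    """Return an error message if the lemma casing is unexpected, else None."""
    expected = expected_lemma(form, xpos)
    if lemma == expected:
        return None
    return (
        f"{xpos} lemma '{lemma}' does not match the expected casing "
        f"for form '{form}', expected '{expected}'"
    )


if __name__ == "__main__":
    # A few (form, lemma, xpos) triples taken from the log above.
    for form, lemma, xpos in [
        ("Five", "Five", "CD"),
        ("Service", "service", "NNP"),
        ("ZEIT", "ZEIT", "NNP"),
    ]:
        print(check_lemma(form, lemma, xpos))
```
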
@AngledLuffa
Contributor

LUISS is an acronym, I believe

https://en.wikipedia.org/wiki/Libera_Universit%C3%A0_Internazionale_degli_Studi_Sociali_Guido_Carli
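
If LUISS is an acronym, one possible way to keep its all-caps lemma is an explicit exception list. The sketch below is purely hypothetical (the name `KNOWN_ACRONYMS` and the function `expected_nnp_lemma` are made up for illustration). An allowlist is used rather than an "all uppercase" test because ZEIT and SPIEGEL in the same log are also all caps but are presumably not acronyms.

```python
# Hypothetical allowlist of NNP forms whose lemma should keep its casing.
KNOWN_ACRONYMS = {"LUISS"}


def expected_nnp_lemma(form: str) -> str:
    """Expected lemma for an NNP token, with an acronym exception (assumed rule)."""
    if form in KNOWN_ACRONYMS:
        # Acronyms keep their original all-caps spelling.
        return form
    # Otherwise: first letter upper, rest lower, as in the log output.
    return form.capitalize()


if __name__ == "__main__":
    for form in ("LUISS", "ZEIT", "SPIEGEL"):
        print(form, "->", expected_nnp_lemma(form))
```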

Should Secret Service both be labeled NNP?

# sent_id = n02027007
# text = According to Parker, Russian Secret Service agents are active in large numbers in Great Britain.
5       Russian Russian ADJ     JJ      Degree=Pos      8       amod    8:amod  _
6       Secret  secret  NOUN    NN      Number=Sing     7       compound        7:compound      _
7       Service Service PROPN   NNP     Number=Sing     8       compound        8:compound      _
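
For anyone who wants to inspect rows like these programmatically, here is a minimal sketch that reads them into dictionaries keyed by the ten standard CoNLL-U column names. It assumes no field contains internal whitespace, so it splits on any whitespace rather than strictly on tabs; `parse_rows` is a hypothetical helper, not part of any existing tool.

```python
# The three token rows quoted above, re-typed with single spaces.
ROWS = """\
5 Russian Russian ADJ JJ Degree=Pos 8 amod 8:amod _
6 Secret secret NOUN NN Number=Sing 7 compound 7:compound _
7 Service Service PROPN NNP Number=Sing 8 compound 8:compound _
"""

# Standard CoNLL-U column order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_rows(text: str) -> list:
    """Parse CoNLL-U token lines into dicts, skipping comments and blanks."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: fields have no internal spaces, so split() is safe here.
        tokens.append(dict(zip(FIELDS, line.split())))
    return tokens


if __name__ == "__main__":
    for tok in parse_rows(ROWS):
        print(tok["form"], tok["lemma"], tok["xpos"])
```

This makes the mixed tagging easy to see: "Secret" is NN with a lowercase lemma, while "Service" is NNP with a capitalized one.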

Not sure eurozone should be capitalized. Perhaps it's not an NNP at this point? Although it is "the" eurozone.

https://en.wikipedia.org/wiki/Eurozone

@AngledLuffa
Contributor

Still need to address the errors listed above, up through Secretary.

@nschneid
Contributor

nschneid commented Dec 7, 2023

I'm not sure "Secret Service" is the real name of any Russian entity. https://en.wikipedia.org/wiki/List_of_intelligence_agencies#Russia So I think there could be justification for NOT using NNP here.

My intuition is that "Eurozone" is specific enough to count as a proper name, even if it is sometimes lowercased.

@AngledLuffa
Contributor

> I'm not sure "Secret Service" is the real name of any Russian entity.

Makes sense, but that can be confusing across countries where Secret Service is a real thing.

> My intuition is that "Eurozone" is specific enough to count as a proper name, even if it is sometimes lowercased.

So leave it as NNP with a lowercase lemma?

@AngledLuffa
Contributor

Regarding Under-Secretary, I suppose the tokenization is considered correct? However, Undersecretary can also turn up as a single word. Are we happy splitting that here?

@rhdunn
Author

rhdunn commented Dec 7, 2023

Note that for a lot of these, this is my initial assessment based on a cursory scan/check. I may have some of them wrong, as with LUISS being an initialism/acronym.

@rhdunn
Author

rhdunn commented Dec 7, 2023

For the hyphenated forms see also UniversalDependencies/docs#1002.

3 participants