Inaccuracy related to capitalization #93

Open
juanjoDiaz opened this issue May 26, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@juanjoDiaz
Collaborator

Hi @adbar,

Using simplemma my team has found multiple odd cases related to capitalization.

When words are fully capitalized, lemmatization doesn't seem correct:

>>> simplemma.text_lemmatizer("TRAINING INSTRUCTIONS", "en")
['train', 'INSTRUCTIONS']

I have no idea why TRAINING is in the dictionary but INSTRUCTIONS is not.
I don't think that TRAINING should be.
And we might want to adjust the dictionary lookup strategy so that it also tries the fully lowercased word.
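
A minimal sketch of the lowercase fallback suggested above. It uses a toy dictionary rather than simplemma's real data or internals; `TOY_DICT` and `lookup_with_lower_fallback` are hypothetical names chosen for illustration only:

```python
from typing import Dict, Optional

# Toy lemma dictionary for illustration; NOT simplemma's actual data.
TOY_DICT: Dict[str, str] = {
    "training": "train",
    "instructions": "instruction",
}


def lookup_with_lower_fallback(token: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Look up the token as written, then fall back to its fully lowercased form."""
    if token in dictionary:
        return dictionary[token]
    # Fallback: an all-caps token such as "INSTRUCTIONS" is retried as "instructions".
    return dictionary.get(token.lower())
```

With this toy data, both `TRAINING` and `INSTRUCTIONS` resolve through the lowercase fallback, while a token that is absent in either form (for example an acronym) returns `None` and could be left untouched.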

I can do a PR since the change is easy.
But I don't know how you test whether changes in the strategies improve or worsen the accuracy of simplemma.
It would be good to get that documented so anyone working on the library can run the tests.

Wdyt?

@adbar
Owner

adbar commented May 26, 2023

Hi @juanjoDiaz, thanks for the feedback, that's odd indeed.

Words written in all caps currently remain untouched in case they are acronyms (e.g. BRICS). That being said, it is safe to say that a token of len > x is most probably not an acronym and can be lower-cased if the language is in BETTER_LOWER. Long acronyms are rare in English, so we need to decide on a length limit. I'd say 6 or 7: https://en.wiktionary.org/wiki/Category:English_acronyms
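
The length heuristic described above could be sketched roughly as follows. This is not simplemma's implementation: `LOWERCASE_FRIENDLY` is a hypothetical stand-in for `BETTER_LOWER` (whose real contents may differ), and the cut-off of 6 is just one of the two values floated in this thread:

```python
from typing import Set

# Hypothetical stand-in for simplemma's BETTER_LOWER; the real set may differ.
LOWERCASE_FRIENDLY: Set[str] = {"en"}

# Assumed length limit; the thread suggests 6 or 7 for English.
MAX_ACRONYM_LEN = 6


def normalize_all_caps(token: str, lang: str, max_len: int = MAX_ACRONYM_LEN) -> str:
    """Lower-case an all-caps token that is too long to plausibly be an acronym."""
    if lang in LOWERCASE_FRIENDLY and token.isupper() and len(token) > max_len:
        return token.lower()
    # Short all-caps tokens (possible acronyms) and other languages stay untouched.
    return token
```

Under these assumptions, `TRAINING` and `INSTRUCTIONS` would be lower-cased before lookup for English, while `BRICS` would be kept as is.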

@adbar adbar added the enhancement New feature or request label Jun 1, 2023
@adbar adbar added this to the v1.0 milestone Jun 21, 2023
@adbar adbar removed this from the v1.0 milestone May 22, 2024