Tokenization and lemmatization behavior of French hyphened compound words #6

Closed
tkurosaka opened this issue Feb 14, 2023 · 3 comments

Comments

@tkurosaka

I found out that "tee-shirts" in French doesn't get lemmatized to "tee-shirt". I thought this might be happening just because "tee-shirt" is a loanword. To test this hypothesis, I tried a few French compounds listed at https://www.colanguage.com/french-compound-nouns

But these native French compound words don't get lemmatized properly either.

| Compound word in plural form | Expected lemma | Actual lemma |
| --- | --- | --- |
| portes-fenêtres | porte-fenêtre | portes-fenêtre |
| grands-mères | grand-mère | grands-mère |
| chefs-d'oeuvre | chef-d'œuvre | chefs-d'œuvre |

As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.

I also noticed that "chefs-d'oeuvre" gets tokenized strangely: it is split into two tokens, "chefs-d'" and "oeuvre".

@abzif
Owner

abzif commented Feb 20, 2023

The lemmatizer's behavior is as expected. The lemmatizer modifies the suffix or prefix of a word. It is not able to modify a word somewhere in the middle. Compound words (common in Germanic languages) are also lemmatized only by suffix.
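
To illustrate the limitation described above, here is a minimal toy sketch. It is not the project's actual lemmatizer: `lemmatize_suffix` is a hypothetical function with a single made-up rule (strip one trailing plural "s"). Because it only looks at the end of the string, a plural marker inside a hyphenated compound is left untouched.

```python
def lemmatize_suffix(word: str) -> str:
    """Toy suffix-only lemmatizer (hypothetical, for illustration).

    Strips a single trailing 's' and never looks inside the word.
    """
    if len(word) > 2 and word.endswith("s"):
        return word[:-1]
    return word


# The whole compound is one token, so only its final 's' is removed.
for word in ["portes-fenêtres", "grands-mères"]:
    print(word, "->", lemmatize_suffix(word))
```

Only the last element loses its plural ending, which matches the "actual lemma" pattern reported in the issue.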

However, the problem may be related to the tokenization of words containing - or '.
If 'portes-fenêtre' were tokenized to 'portes' '-' 'fenêtre' (split at the hyphen), then the individual words would be POS-tagged and lemmatized.
In the issue related to the tokenization of 't-shirt' I made a fix forcing words to be split at hyphens. This is a global change, not only for English but for all languages. Hopefully it will also fix the problems above.
I suspect that words containing ' (the apostrophe character) may also be inconsistently annotated, so they are sometimes split and sometimes not. That may be the next candidate to fix.
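
The idea of splitting at hyphens (and apostrophes) before lemmatizing can be sketched as follows. This is only an illustration under stated assumptions, not the project's tokenizer or model: `TOKEN_RE`, `split_tokens`, and `lemmatize_suffix` are hypothetical names, and the single "strip a trailing s" rule stands in for a trained lemmatizer.

```python
import re

# Hypothetical token pattern: a lone hyphen, or a run of letters that may
# end in a straight apostrophe (so "d'" stays together as one token).
TOKEN_RE = re.compile(r"-|[^\W\d_]+'?")


def split_tokens(text: str) -> list[str]:
    """Split text at hyphens and after apostrophes (toy tokenizer)."""
    return TOKEN_RE.findall(text)


def lemmatize_suffix(word: str) -> str:
    """Toy suffix-only rule (assumption): strip one trailing 's'."""
    if len(word) > 2 and word.endswith("s"):
        return word[:-1]
    return word


def lemmatize_compound(text: str) -> list[str]:
    """Lemmatize every token separately instead of the whole compound."""
    return [lemmatize_suffix(token) for token in split_tokens(text)]


print(lemmatize_compound("portes-fenêtres"))
print(lemmatize_compound("chefs-d'oeuvre"))
```

Once each element is its own token, a suffix-only rule reaches both plural endings. The toy rule leaves "d'" unchanged; mapping an elided form to its full form would need a real lemmatizer.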

@tkurosaka
Author

The new end-of-February model fixes all of these! Especially amazing is that "des chefs-d'oeuvre" is now lemmatized as "un", "chef", "-", "de", "œuvre". Note that this took care of the handling of "d'" automagically!

@abzif
Owner

abzif commented Mar 1, 2023

I'm glad that this small change corrected the problem for many languages. Thank you for checking the models.
