I found that "tee-shirts" in French doesn't get lemmatized to "tee-shirt". I thought this might be happening only because "tee-shirt" is a loanword. To test this hypothesis, I tried a few French compounds listed at https://www.colanguage.com/french-compound-nouns
But none of these native French compound words get lemmatized properly either.
| compound word in plural form | expected lemma | actual lemma |
| --- | --- | --- |
| portes-fenêtres | porte-fenêtre | portes-fenêtre |
| grands-mères | grand-mère | grands-mère |
| chefs-d'oeuvre | chef-d'œuvre | chefs-d'œuvre |
As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.
I also noticed that "chefs-d'oeuvre" gets tokenized strangely. This compound word is tokenized into two tokens "chefs-d'" and "oeuvre".
The lemmatizer is behaving as expected. It modifies only the suffix or prefix of a word and cannot change anything in the middle of it. Compound words (common in Germanic languages) are likewise lemmatized only at the suffix.
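To illustrate why a suffix-only rule produces the lemmas reported above, here is a minimal sketch. The function `strip_plural_suffix` is a hypothetical toy rule (drop one trailing "s"), not the actual lemmatizer, which is a trained model; it only shows that an edit confined to the end of an unsplit token cannot reach the first element of a compound.

```python
def strip_plural_suffix(token: str) -> str:
    """Toy suffix-only rule (an assumption for illustration): drop one trailing 's'."""
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token


# The whole compound is one token, so only its final suffix is touched.
for word in ["portes-fenêtres", "grands-mères"]:
    print(word, "->", strip_plural_suffix(word))
# portes-fenêtres -> portes-fenêtre
# grands-mères -> grands-mère
```

The inner plural ("portes", "grands") survives because it sits in the middle of the token.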
However, the problem may be related to the tokenization of words containing - or '.
If 'portes-fenêtre' were tokenized into 'portes' '-' 'fenêtre' (split at the hyphen), the individual words would each be POS-tagged and lemmatized.
In the issue about the tokenization of 't-shirt' I made a fix that forces words to be split at the hyphen. This is a global change that applies to all languages, not only English. Hopefully it will also fix the problems above.
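As a rough sketch of the idea (not the actual fix), splitting at the hyphen first lets a per-token rule reach every element of the compound. Both `split_at_hyphen` and `strip_plural_suffix` are hypothetical helpers written for this example; the real pipeline uses a trained tokenizer, tagger and lemmatizer rather than a regular expression and a one-letter rule.

```python
import re
from typing import List


def split_at_hyphen(token: str) -> List[str]:
    """Split a token at hyphens, keeping each hyphen as its own token."""
    return [part for part in re.split(r"(-)", token) if part]


def strip_plural_suffix(token: str) -> str:
    """Toy suffix-only rule (an assumption for illustration): drop one trailing 's'."""
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token


def lemmatize_compound(token: str) -> List[str]:
    """Apply the toy rule to each piece after splitting at the hyphen."""
    return [strip_plural_suffix(part) for part in split_at_hyphen(token)]


print(lemmatize_compound("portes-fenêtres"))  # ['porte', '-', 'fenêtre']
print(lemmatize_compound("grands-mères"))     # ['grand', '-', 'mère']
```

Words containing an apostrophe, such as "chefs-d'oeuvre", are deliberately left out of this sketch, since they need separate handling.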
I suspect that words containing ' (the apostrophe character) may also be inconsistently annotated, so that they are sometimes split and sometimes not. That may be the next candidate to fix.
The new end-of-February model fixes all of these! Especially amazing is that "des chefs-d'oeuvre" is now lemmatized as "un", "chef", "-", "de", "œuvre". Note that this also took care of handling "d'" automagically!