Tokenization and lemmatization behavior of French hyphened compound words #6

tkurosaka · 2023-02-14T23:17:04Z

I found that "tee-shirts" in French doesn't get lemmatized to "tee-shirt". I thought this might be happening just because "tee-shirt" is a loanword. To test this hypothesis, I tried a few French compounds listed at https://www.colanguage.com/french-compound-nouns

But these native French compound words don't get lemmatized properly either.

compound word in plural form	expected lemma	actual lemma
portes-fenêtres	porte-fenêtre	portes-fenêtre
grands-mères	grand-mère	grands-mère
chefs-d'oeuvre	chef-d'œuvre	chefs-d'œuvre

As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.

I also noticed that "chefs-d'oeuvre" gets tokenized strangely: it is split into two tokens, "chefs-d'" and "oeuvre".

abzif · 2023-02-20T08:59:53Z

The lemmatizer's behavior is as expected. The lemmatizer modifies the suffix or prefix of a word; it cannot modify anything in the middle of a word. Compound words (common in Germanic languages) are likewise lemmatized only at the suffix.
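To illustrate why a suffix-only lemmatizer leaves the first element in the plural, here is a minimal sketch. The function `suffix_lemmatize` and its single "strip a trailing s" rule are hypothetical and are not the actual lemmatizer discussed in this issue; they only mimic the suffix-only behavior described above.

```python
def suffix_lemmatize(word: str) -> str:
    """Toy suffix-only lemmatizer (hypothetical): strips one trailing plural 's'.

    It never looks inside the word, so a plural marker in the middle
    of a hyphenated compound is left untouched.
    """
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


for plural in ["portes-fenêtres", "grands-mères"]:
    # Only the final 's' is removed; "portes" and "grands" stay plural.
    print(plural, "->", suffix_lemmatize(plural))
```

With the whole compound treated as one token, the inner plural "s" is out of reach of any suffix rule, which matches the "actual lemma" column reported above.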

However, the problem may be related to the tokenization of words containing - or '
If 'portes-fenêtre' were tokenized into 'portes' '-' 'fenêtre' (split at the hyphen), then the individual words would be POS-tagged and lemmatized separately.
In the issue related to the tokenization of 't-shirt' I made a fix that forces words to be split at hyphens. This is a global change, not only for English but for all languages. Hopefully it will also fix the problems above.
I suspect that words containing ' (the apostrophe character) may also be inconsistently annotated, so that they are sometimes split and sometimes not. That may be the next candidate to fix.
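The idea of splitting at the hyphen (and at the apostrophe) before lemmatizing can be sketched as follows. This is an illustration under stated assumptions, not the actual fix: the regular expression, the `TOY_LEMMAS` lookup table, and the plural-stripping rule are all hypothetical stand-ins for the real tokenizer and the trained lemmatizer.

```python
import re

# Hypothetical toy lexicon for tokens a suffix rule cannot handle.
TOY_LEMMAS = {"d'": "de", "oeuvre": "œuvre"}


def tokenize(text: str) -> list[str]:
    """Split at hyphens, keeping '-' as its own token.

    An elided form such as "d'" keeps its apostrophe and becomes a
    separate token from the word that follows it.
    """
    return re.findall(r"\w+'|\w+|-", text)


def lemmatize_token(token: str) -> str:
    """Toy per-token lemmatizer: lexicon lookup, else strip a plural 's'."""
    if token in TOY_LEMMAS:
        return TOY_LEMMAS[token]
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


for compound in ["portes-fenêtres", "grands-mères", "chefs-d'oeuvre"]:
    tokens = tokenize(compound)
    lemmas = [lemmatize_token(t) for t in tokens]
    print(tokens, "->", lemmas)
```

Once each element is its own token, the plural marker sits at the end of a token again, so a suffix-based lemmatizer can reach it.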

tkurosaka · 2023-03-01T01:16:52Z

The new end-of-February model fixes all of these! Especially amazing is that "des chefs-d'oeuvre" is now lemmatized as "un", "chef", "-", "de", "œuvre". Note that this took care of handling "d'" automagically!

abzif · 2023-03-01T13:28:24Z

I'm glad that a small change corrected the problem for many languages. Thank you for checking the models.

tkurosaka closed this as completed Mar 1, 2023
