
Tokenization speed too slow for French #3242

Closed
svlandeg opened this issue Feb 7, 2019 · 9 comments
Labels
enhancement (Feature requests and improvements) · feat / tokenizer (Feature: Tokenizer) · lang / fr (French language data and models) · perf / speed (Performance: speed)

Comments

svlandeg (Member) commented Feb 7, 2019

Feature description

The evaluation on the UD corpora has shown that French tokenization is significantly slower (in words per second) than other languages. I'm opening this issue to experiment with a few ideas and discuss potential solutions.

Historical trail

Benchmarking the tokenizer shows some fluctuations, but across runs this seems to be the general performance for the blank French model:

[benchmark screenshot: words-per-second results for the blank French model]
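
For anyone who wants to reproduce this kind of measurement, here's a rough sketch of a words-per-second benchmark (not the exact evaluation script behind the screenshot; the corpus path is a placeholder and tokens/sec is used as a proxy for WPS):

```python
import time

import spacy

# Placeholder path to raw French UD text; adjust to wherever the corpus lives.
CORPUS_PATH = "fr_gsd-ud-train.txt"

nlp = spacy.blank("fr")

with open(CORPUS_PATH, encoding="utf8") as f:
    lines = [line.strip() for line in f if line.strip()]

start = time.time()
n_tokens = 0
for doc in nlp.tokenizer.pipe(lines):
    n_tokens += len(doc)
elapsed = time.time() - start

# Tokens per second as a rough proxy for the WPS numbers above.
print(f"{n_tokens / elapsed:,.0f} tokens/sec over {n_tokens} tokens")
```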

svlandeg (Member, Author) commented Feb 7, 2019

I tried simplifying the ALPHA character class, which carries a lot of complexity because of Latin letters that are included but irrelevant for French, e.g. https://en.wikipedia.org/wiki/Latin_Extended-D.
My assumption was that the large list of tokenization exceptions built on this complicated ALPHA class would run faster when limited to the much smaller set of Latin characters actually used in French.
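
For reference, a rough sketch of the idea, compiling an exception-style pattern against a hand-rolled French-only set instead of the broad class (`FRENCH_ALPHA` below is illustrative and deliberately incomplete, not the exact set I tested):

```python
import re

from spacy.lang.char_classes import ALPHA  # the broad Unicode Latin class currently used

# Illustrative French-only alternative; not exhaustive.
FRENCH_ALPHA = "a-zA-Zàâæçéèêëîïôöœùûüÿ"

# e.g. an elision-style pattern such as "l'" or "d'" attached to a following word
broad_re = re.compile(r"(?:l|d)['’][{a}]+".format(a=ALPHA), re.IGNORECASE)
narrow_re = re.compile(r"(?:l|d)['’][{a}]+".format(a=FRENCH_ALPHA), re.IGNORECASE)

print(bool(broad_re.match("l'été")), bool(narrow_re.match("l'été")))
```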

Unfortunately, this didn't seem to have much effect on the tokenization speed.

svlandeg (Member, Author) commented Feb 7, 2019

When I remove the 16K-entry exceptions list entirely, some unit tests fail of course, but the UD evaluation accuracy remains almost the same on fr_gsd-ud-train.txt:

  • Precision drops from 99% to 98%
  • F-score stays at 99%

However, WPS rises to 73-74K!

Of course, the UD corpora don't test every case, and some of the exceptions from the original list could be really useful in certain domains. But is it worth slowing down the tokenizer so much?
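
For context, a minimal sketch of what "no exceptions" means through the public API, assuming spaCy v2's Tokenizer constructor (the real change would of course live in the French language data, not in user code):

```python
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("fr")

# Rebuild the tokenizer with an empty rules table but the same prefix/suffix/infix
# handling, i.e. drop only the ~16K French tokenizer exceptions.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules={},
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)

print([t.text for t in nlp("Aujourd'hui, elle est allée à Aix-en-Provence.")])
```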

svlandeg (Member, Author) commented Feb 7, 2019

If we do remove the exceptions, one potential direction is to also remove the hyphen as an infix for French. This makes a lot of the current French exceptions redundant (most of the list plus some additional regular expressions), but will of course result in some undersegmentation instead of oversegmentation.

  • orig state: 0.1% under, 0.4% over, 99% precision, 99% F
  • without exception list: 0.09% under, 0.7% over, 98% precision, 99% F
  • without hyphen as infix: 0.2% under, 0.2% over, 99% precision, 99% F

This would bring WPS to around 90K.

I know the general trend is to move toward oversegmentation (because the models can learn to merge), but I'd be interested in hearing your opinion (@ines @honnibal) on how to move forward for French! I can prepare the PR accordingly.
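
For reference, a sketch of what dropping the hyphen infix looks like via the public API (this assumes the default infix patterns embed the HYPHENS character class verbatim, which is what makes the substring filter below work; otherwise the infix list would need to be rebuilt by hand):

```python
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

nlp = spacy.blank("fr")

# Keep every default infix rule except the one(s) splitting on hyphens between letters.
infixes = [pattern for pattern in nlp.Defaults.infixes if HYPHENS not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# "Sophie-Anne" and "Aix-en-Provence" now stay as single (undersegmented) tokens.
print([t.text for t in nlp("Sophie-Anne habite à Aix-en-Provence.")])
```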

@ines added the enhancement, lang / fr, feat / tokenizer and perf / speed labels on Feb 7, 2019
aborsu commented Feb 8, 2019

We actually use a custom French tokenizer (we just drop the exceptions) for all our in-house models because of this.

svlandeg (Member, Author) commented Feb 9, 2019

Makes sense, @aborsu. So do you then remove the hyphen as an infix too? Or do you have another solution for names like Sophie-Anne? Or is that not an issue in the data you're working with?

aborsu commented Feb 11, 2019

@svlandeg we keep it as an infix, but in our context it doesn't really matter. We mostly train models for entity detection, so the entity recognizer recombines the tokens into a single entity; whether that entity spans one token or several makes no difference to us.
Of course this is specific to our use case, and I wouldn't presume to make it the standard behavior.

honnibal (Member) commented

@svlandeg Very interesting analysis!

I would lean towards removing the exceptions and keeping the hyphen as an infix. Then we can hope that the model learns to rejoin the names etc. in the parser. I'm happy for you to give this a try if you want to make the PR?

svlandeg (Member, Author) commented

OK, will prepare the PR!

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked the thread as resolved and limited conversation to collaborators on Mar 23, 2019