
Tokenization speed too slow for French #3242

Closed
svlandeg opened this issue Feb 7, 2019 · 9 comments
Labels
enhancement (Feature requests and improvements) · feat / tokenizer (Feature: Tokenizer) · lang / fr (French language data and models) · perf / speed (Performance: speed)

Comments

svlandeg (Member) commented Feb 7, 2019

Feature description

The evaluation on the UD corpora has shown that French tokenization is significantly slower (in words per second) than other languages. I'm opening this issue to experiment with a few ideas and discuss potential solutions.

Historical trail

Benchmarking the tokenizer shows some fluctuations, but across runs this seems to be the general performance for the blank French model:

[benchmark screenshot: words-per-second results for the blank French model]
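
For anyone who wants to reproduce this kind of measurement, here's a rough sketch of a words-per-second benchmark (not the exact evaluation script behind the screenshot; the corpus path is a placeholder and tokens/sec is used as a proxy for WPS):

```python
import time

import spacy

# Placeholder path to raw French UD text; adjust to wherever the corpus lives.
CORPUS_PATH = "fr_gsd-ud-train.txt"

nlp = spacy.blank("fr")

with open(CORPUS_PATH, encoding="utf8") as f:
    lines = [line.strip() for line in f if line.strip()]

start = time.time()
n_tokens = 0
for doc in nlp.tokenizer.pipe(lines):
    n_tokens += len(doc)
elapsed = time.time() - start

# Tokens per second as a rough proxy for the WPS numbers above.
print(f"{n_tokens / elapsed:,.0f} tokens/sec over {n_tokens} tokens")
```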

svlandeg (Member, Author) commented Feb 7, 2019

I tried simplifying the ALPHA character class, which carries a lot of complexity because of Latin letters that are included but irrelevant for French, e.g. https://en.wikipedia.org/wiki/Latin_Extended-D.
My assumption was that the large list of tokenization exceptions built on this complicated ALPHA class would run faster when limited to the much smaller set of Latin characters actually used in French.
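
For reference, a rough sketch of the idea, compiling an exception-style pattern against a hand-rolled French-only set instead of the broad class (`FRENCH_ALPHA` below is illustrative and deliberately incomplete, not the exact set I tested):

```python
import re

from spacy.lang.char_classes import ALPHA  # the broad Unicode Latin class currently used

# Illustrative French-only alternative; not exhaustive.
FRENCH_ALPHA = "a-zA-Zàâæçéèêëîïôöœùûüÿ"

# e.g. an elision-style pattern such as "l'" or "d'" attached to a following word
broad_re = re.compile(r"(?:l|d)['’][{a}]+".format(a=ALPHA), re.IGNORECASE)
narrow_re = re.compile(r"(?:l|d)['’][{a}]+".format(a=FRENCH_ALPHA), re.IGNORECASE)

print(bool(broad_re.match("l'été")), bool(narrow_re.match("l'été")))
```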

Unfortunately, this didn't seem to have much effect on the tokenization speed.

svlandeg (Member, Author) commented Feb 7, 2019

When I remove the 16K-entry exceptions list entirely, some unit tests fail of course, but the UD evaluation accuracy remains almost the same on fr_gsd-ud-train.txt:

  • Precision drops from 99% to 98%
  • F-score stays at 99%

However, WPS rises to 73-74K!

Of course, the UD corpora don't test every case, and some of the exceptions from the original list could be really useful in certain domains. But is it worth slowing down the tokenizer so much?
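
For context, a minimal sketch of what "no exceptions" means through the public API, assuming spaCy v2's Tokenizer constructor (the real change would of course live in the French language data, not in user code):

```python
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("fr")

# Rebuild the tokenizer with an empty rules table but the same prefix/suffix/infix
# handling, i.e. drop only the ~16K French tokenizer exceptions.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules={},
    prefix_search=nlp.tokenizer.prefix_search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=nlp.tokenizer.infix_finditer,
    token_match=nlp.tokenizer.token_match,
)

print([t.text for t in nlp("Aujourd'hui, elle est allée à Aix-en-Provence.")])
```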

svlandeg (Member, Author) commented Feb 7, 2019

If we do remove the exceptions, one potential direction is to also remove the hyphen as an infix for French. This makes a lot of the current French exceptions redundant (most of the list plus some additional regular expressions), but will of course result in some undersegmentation instead of oversegmentation.

  • orig state: 0.1% under, 0.4% over, 99% precision, 99% F
  • without exception list: 0.09% under, 0.7% over, 98% precision, 99% F
  • without hyphen as infix: 0.2% under, 0.2% over, 99% precision, 99% F

This would bring WPS to around 90K.

I know the general trend is to move toward oversegmentation (because the models can learn to merge), but I'd be interested in hearing your opinion (@ines @honnibal) on how to move forward for French! I can prepare the PR accordingly.
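
For reference, a sketch of what dropping the hyphen infix looks like via the public API (this assumes the default infix patterns embed the HYPHENS character class verbatim, which is what makes the substring filter below work; otherwise the infix list would need to be rebuilt by hand):

```python
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

nlp = spacy.blank("fr")

# Keep every default infix rule except the one(s) splitting on hyphens between letters.
infixes = [pattern for pattern in nlp.Defaults.infixes if HYPHENS not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# "Sophie-Anne" and "Aix-en-Provence" now stay as single (undersegmented) tokens.
print([t.text for t in nlp("Sophie-Anne habite à Aix-en-Provence.")])
```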

@ines added the enhancement, lang / fr, feat / tokenizer and perf / speed labels on Feb 7, 2019
aborsu commented Feb 8, 2019

We actually use a custom French tokenizer (we just drop the exceptions) for all our in-house models because of this.

svlandeg (Member, Author) commented Feb 9, 2019

Makes sense, @aborsu. So do you then remove the hyphen as an infix too? Or do you have another solution for names like Sophie-Anne? Or is that not an issue in the data you're working with?

aborsu commented Feb 11, 2019

@svlandeg we keep it as an infix, but in our context it doesn't really matter. We mostly train models for entity detection, so the entity recognizer recombines the tokens into a single entity; whether that entity spans one token or several makes no difference to us.
Of course this is specific to our use case, and I wouldn't presume to make it the standard behavior.

honnibal (Member) commented

@svlandeg Very interesting analysis!

I would lean towards removing the exceptions and keeping the hyphen as an infix. Then we can hope that the model learns to rejoin the names etc. in the parser. I'm happy for you to give this a try if you want to make the PR?

svlandeg (Member, Author) commented

OK, will prepare the PR!

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked the thread as resolved and limited conversation to collaborators on Mar 23, 2019