Tokenization speed too slow for French #3242
Comments
I tried simplifying the exceptions list. Unfortunately, this didn't seem to have much effect on the tokenization speed.
When I remove the 16K exceptions list entirely, some unit tests of course fail, but the UD evaluation accuracy remains almost the same.
However, WPS rises to 73-74K! Of course, the UD corpora don't test every case, and some of the exceptions from the original list could be genuinely useful in some domains. But is it worth slowing down the tokenizer this much?
If we do remove the exceptions, one potential direction is to also remove the hyphen as an infix for French. This makes a lot of the current French exceptions redundant (most of the list, plus some additional regular expressions), but will of course result in some undersegmentation instead of oversegmentation.
This would bring WPS to around 90K. I know the general trend is to move toward oversegmentation (because the models can learn to merge), but I'd be interested in hearing your opinion (@ines @honnibal) on how to move forward for French! I can prepare the PR accordingly.
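To illustrate the trade-off being discussed, here is a minimal sketch using plain `re` (a hypothetical stand-in, not spaCy's actual tokenizer) of how treating the hyphen as an infix changes the segmentation of a name like `Jean-Pierre`:

```python
import re

def tokenize(text, infix_pattern=None):
    """Toy whitespace tokenizer with an optional infix pattern.
    When an infix pattern is given, each word is further split on it,
    keeping the matched infix as its own token (as spaCy does)."""
    tokens = []
    for word in text.split():
        if infix_pattern:
            # re.split with a capturing group keeps the separators
            tokens.extend(p for p in re.split(f"({infix_pattern})", word) if p)
        else:
            tokens.append(word)
    return tokens

# With the hyphen as an infix: oversegmentation of compound names
print(tokenize("Jean-Pierre arrive", infix_pattern="-"))
# → ['Jean', '-', 'Pierre', 'arrive']

# Without it: undersegmentation, the name stays one token
print(tokenize("Jean-Pierre arrive"))
# → ['Jean-Pierre', 'arrive']
```

Dropping the hyphen infix avoids needing exceptions to re-join such names, at the cost of leaving genuinely separable hyphenated strings unsplit.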
We actually use a custom French tokenizer (we just drop the exceptions) for all our in-house models because of this.
Makes sense @aborsu. So do you then remove the hyphen as an infix, too? Or do you have another solution to deal with hyphenated names?
@svlandeg we keep it as an infix, but in our context it doesn't really matter. We mostly train models for entity detection, so the entity detector will recombine the tokens into a single entity. Whether it is one token or several has no importance to us.
@svlandeg Very interesting analysis! I would lean towards removing the exceptions and keeping the hyphen as an infix. Then we can hope that the model learns to rejoin the names etc. in the parser. I'm happy to give this a try if you want to make the PR?
OK, will prepare the PR!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Feature description
The evaluation on the UD corpora has shown that French tokenization is significantly slower (in words per second) than that of other languages. I'm opening this issue to experiment with a few ideas and discuss potential solutions.
Historical trail
Benchmarking the tokenizer shows some fluctuation, but across runs this seems to be the general performance for the blank French model.
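One way to smooth out the run-to-run fluctuation is to report the best of several timed runs. A minimal words-per-second benchmark sketch (the whitespace splitter and the sample text are hypothetical stand-ins for spaCy's French tokenizer and a real corpus):

```python
import time

def words_per_second(tokenize, texts, n_runs=3):
    """Time the tokenizer over the corpus n_runs times and
    return the best words-per-second figure observed."""
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_words = sum(len(tokenize(t)) for t in texts)
        elapsed = time.perf_counter() - start
        best = max(best, n_words / elapsed)
    return best

# Trivial whitespace splitter standing in for the real tokenizer
texts = ["Aujourd'hui les portes-fenêtres sont ouvertes"] * 1000
wps = words_per_second(str.split, texts)
print(f"{wps:,.0f} WPS")
```

For the real measurement, `str.split` would be replaced by something like `lambda t: nlp.tokenizer(t)` on a blank French pipeline.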