French regular expressions instead of extensive exceptions list (on develop) #3046
This PR refers to PR #3023 but with the base branch changed to develop.
I tried running some scripts I used before to profile the timing issue and to look into the comments by @mhham, but on develop I ran into an issue with the code
With this error:
I updated thinc to thinc-7.0.0.dev6 according to the requirements, but there seems to be an incompatibility between the old models and the current code on develop. How should I remedy this?
No worries - I'll continue with this PR to see whether further improvements are possible as soon as the code runs again :)
One quick thought: if we assume that there's nothing wrong with the serialization / model loading itself, it might also make sense to just profile the loading of
```python
from spacy.lang.en import English
nlp = English()
```

versus

```python
from spacy.lang.fr import French
nlp = French()
```
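For instance, a rough sketch of that comparison using cProfile (just the standard library; run it in a fresh interpreter so imports aren't already cached):

```python
import cProfile
import pstats

# Profile cold construction of the French pipeline; the cumulative
# column shows where the loading time is actually spent
cProfile.run("from spacy.lang.fr import French; French()", "fr_load.prof")
pstats.Stats("fr_load.prof").sort_stats("cumulative").print_stats(20)
```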
@ines: good point. Benchmarking French(), the loading time goes down on average from 5.0-5.2s (develop branch) to 3.1-3.3s (this PR) - I added that to the description above.
@mhham: I looked in detail at the suggestions you made over at PR #3023 (thanks btw!). The list comprehensions don't really make a difference because the in-code exception lists are very small (a few dozen elements at most), so I kept them as they were. Of course, if the comprehension style is preferred, we can make that edit.
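A quick toy benchmark of that claim (hypothetical data, and simplified exception dicts with plain string keys rather than the actual attribute IDs):

```python
import timeit

verbs = ["a", "est", "semble", "indique"]
pronouns = ["elle", "il", "on"]

def with_loop():
    exc = {}
    for verb in verbs:
        for pronoun in pronouns:
            token = "{}-t-{}".format(verb, pronoun)
            exc[token] = [{"ORTH": token}]
    return exc

def with_comprehension():
    return {
        "{}-t-{}".format(verb, pronoun): [{"ORTH": "{}-t-{}".format(verb, pronoun)}]
        for verb in verbs
        for pronoun in pronouns
    }

# for a few dozen entries, the difference is lost in the noise
print(timeit.timeit(with_loop, number=10000))
print(timeit.timeit(with_comprehension, number=10000))
```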
I also tried your proposed rewrites of the for loops, but those don't seem to improve performance. The key idea behind the rewrite I propose here is that for each word in the list, the hyphenated variants are covered by a regular expression instead of being enumerated as individual exceptions.
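To illustrate the idea with a toy example (hypothetical words and suffixes, not the actual lists from tokenizer_exceptions.py):

```python
import re

words = ["celui", "celle", "ceux"]
suffixes = ["ci", "là"]

# before: one explicit exception entry per hyphenated combination
exceptions = ["{}-{}".format(w, s) for w in words for s in suffixes]

# after: a single compiled pattern covering all combinations
pattern = re.compile(r"^(?:{})-(?:{})$".format("|".join(words), "|".join(suffixes)))

assert all(pattern.match(e) for e in exceptions)
```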
Most of the time in the tokenizer_exceptions.py file is now spent compiling the TOKEN_MATCH regular expressions - about 1.1s. That part is slightly slower than before, because a lot of the exceptions from the list were translated into a handful of regexps. So it's a trade-off, but overall loading speed still goes up.
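A quick way to check that number is something like this (a sketch; run it in a fresh interpreter so the module isn't already cached):

```python
import time

start = time.time()
# importing the module triggers compilation of the TOKEN_MATCH regexps
from spacy.lang.fr.tokenizer_exceptions import TOKEN_MATCH
print("import + regex compile took %.2fs" % (time.time() - start))
```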
So for now I'm out of ideas to make this faster :)
Nice! That's like 30% faster, which is really good.
In general, I do think we have to accept that some data will just take slightly longer to load. Some languages are just easier to tokenize and lemmatize than others... So all we can really do here is make sure we write efficient code, avoid redundancy and use smart rules. (Or we have to make trade-offs and accept a slightly lower tokenization accuracy on certain edge cases in favour of loading speed – but it's rarely that simple.)
I'll merge this PR so we can start testing it. We've already started training the new models for the next nightly, so we'll be able to see this change in action in the next round. Thanks again for the great work on this, btw!