
French regular expressions instead of extensive exceptions list (on develop) #3046

Merged
merged 2 commits into explosion:develop on Dec 16, 2018

@svlandeg
Contributor

commented Dec 12, 2018

Description

This PR mirrors PR #3023, with the base branch changed to develop.

It addresses issue #2679 and also includes the fix from PR #2980.

  • Looked into the French model and analysed where most time was lost
  • Rewrote the compilation of _infixes_exc to avoid compiling redundant patterns (from 270K down to 100K)
  • Added new regular expressions to catch common tokens
  • Removed all redundant cases from the exception list, cutting it down from 34K lines to 16K
  • Added a few additional unit tests
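
The general idea of trading an enumerated exceptions list for a few regular expressions can be sketched like this (illustrative only; the character class and sample words are assumptions, not the actual spaCy patterns):

```python
import re

# Instead of listing every elided form as an explicit exception entry...
explicit_exceptions = ["c'est", "j'ai", "l'arbre", "d'accord"]  # the real list ran to thousands of lines

# ...one pattern can cover the whole family of elision forms.
ELISION_RE = re.compile(r"^[cjldmnst]['’]\w+$", re.IGNORECASE)

matched = [tok for tok in explicit_exceptions if ELISION_RE.match(tok)]
```

The pattern is matched at tokenization time, so the up-front cost of building and storing thousands of exception strings disappears.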

Results

  • Loading the fr_core_news_sm model on my system goes down on average from 12-13s to 8-9s
    • Compared to 1s for en_core_web_sm
  • Loading French() (incl import) on my system goes down on average from 5.0-5.2s to 3.1-3.3s
    • Compared to 0.09-0.1s for English()

Types of change

Speed enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.
@svlandeg

Contributor Author

commented Dec 12, 2018

I tried running some scripts I used before to profile the timing issue and to look into the comments by @mhham, but on develop I ran into an issue with this code:

```python
nlp = spacy.load('fr_core_news_sm')
```

With this error:

```
Traceback (most recent call last):
  File "C:/Users/Sofie/Documents/git/spacy/sofie_test/test_ner.py", line 31, in <module>
    nlp = spacy.load('fr_core_news_sm')
  File "C:\Users\Sofie\Documents\git\spacy\spacy\__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 105, in load_model
    return load_model_from_link(name, **overrides)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 122, in load_model_from_link
    return cls.load(**overrides)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\data\fr_core_news_sm\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 166, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 149, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\language.py", line 693, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 555, in from_disk
    reader(path / key)
  File "C:\Users\Sofie\Documents\git\spacy\spacy\language.py", line 689, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 809, in spacy.pipeline.Tagger.from_disk
  File "C:\Users\Sofie\Documents\git\spacy\spacy\util.py", line 555, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 793, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 794, in spacy.pipeline.Tagger.from_disk.load_model
  File "C:\Users\Sofie\Anaconda3\envs\spacy\lib\site-packages\thinc\neural\_classes\model.py", line 372, in from_bytes
    copy_array(dest, param[b"value"])
  File "C:\Users\Sofie\Anaconda3\envs\spacy\lib\site-packages\thinc\neural\util.py", line 124, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (96)
```

I updated thinc to thinc-7.0.0.dev6 as per the requirements, but there seems to be some incompatibility between the old models and the current code on develop. How should I remedy this?

@svlandeg svlandeg changed the title Feature/faster french develop French regular expressions instead of extensive exceptions list (on develop) Dec 12, 2018

@ines

Member

commented Dec 14, 2018

Thanks so much! 🙏 (Can I close #3023 then?)

ValueError: could not broadcast input array from shape (128) into shape (96)

Sorry about that – we're just training new models for develop so that should hopefully fix it!

@svlandeg

Contributor Author

commented Dec 14, 2018

Can I close #3023 then?

Yep!

Sorry about that – we're just training new models for develop so that should hopefully fix it!

No worries - I'll continue with this PR and see whether further improvements are possible as soon as the code runs again :)

@ines

Member

commented Dec 14, 2018

Okay, cool! 👍

One quick thought: if we assume that there's nothing wrong with the serialization / model loading itself, it might also make sense to just profile the loading of English vs French? In theory, we should see a similar discrepancy in loading times, right?

```python
from spacy.lang.en import English
nlp = English()

from spacy.lang.fr import French
nlp = French()
```
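
A small helper could time both sides of that comparison (a hypothetical script; `time_blank_load` is not part of spaCy):

```python
import importlib
import time


def time_blank_load(module_path: str, class_name: str) -> float:
    """Import a module and instantiate the named class, returning elapsed seconds."""
    start = time.perf_counter()
    cls = getattr(importlib.import_module(module_path), class_name)
    cls()  # for spaCy language classes, this builds the blank pipeline
    return time.perf_counter() - start


# e.g.:
# print(f"English: {time_blank_load('spacy.lang.en', 'English'):.2f}s")
# print(f"French:  {time_blank_load('spacy.lang.fr', 'French'):.2f}s")
```

Including the import inside the timed region matters here, since the French tokenizer exceptions are built at module import time.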
@svlandeg

Contributor Author

commented Dec 14, 2018

@ines: good point. Benchmarking French(), the loading time goes down on average from 5.0-5.2s (develop branch) to 3.1-3.3s (this PR). I added that to the description above.

@mhham: I looked in detail at the suggestions you made over at PR #3023 (thanks btw!). The list comprehensions don't really make a difference because the in-code exception lists are very small (a few dozen elements at most), so I kept them as they were. Of course, if the comprehension style is preferred, we can make that edit.

I also tried your proposed rewrites of the for loops, but they don't seem to improve performance. The key idea behind the rewrite I propose here is that for each word in the list FR_BASE_EXCEPTIONS, I build a tiny set of variations (variants_infix) that is then added to the bigger list. That way we never have to do set operations on the full 16K list, at the end or in between, which is what would otherwise slow it down. The current version of these loops takes only 0.06s and would only get faster, I think, if we cut the 16K list down further (it was 34K before this PR).
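
Roughly, the per-word expansion looks like this (a sketch: only the name variants_infix comes from the PR; the sample words, hyphen variants, and loop shape are assumptions):

```python
FR_BASE_EXCEPTIONS = ["grand-mère", "peut-être"]  # the real list has ~16K entries

HYPHENS = ["-", "‐"]  # ASCII hyphen plus a typographic variant

infixes_exc = []
for word in FR_BASE_EXCEPTIONS:
    # Build a tiny per-word set of variants (a handful of elements)...
    variants_infix = {word}
    for hyphen in HYPHENS:
        variants_infix.update(w.replace("-", hyphen) for w in list(variants_infix))
    # ...and extend the big list; no set operation ever touches all 16K entries at once.
    infixes_exc.extend(variants_infix)
```

The cost per iteration stays constant, so the whole pass is linear in the size of the base list.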

Most of the time in tokenizer_exceptions.py is now spent compiling the TOKEN_MATCH regular expressions: about 1.1s. This is slightly slower than before, because many exceptions from the list were translated into a handful of regexps. So it's a trade-off, but overall speed still goes up.
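
The compile-once trade-off can be illustrated like this (toy patterns for hyphenated French forms; this is not the actual TOKEN_MATCH definition):

```python
import re

# A couple of regexps standing in for thousands of enumerated exceptions.
patterns = [
    r"\w+-t?-?(?:je|tu|il|elle|on|nous|vous|ils|elles)\b",  # inversions: « dit-il », « a-t-il »
    r"\w+-(?:ce|là|ci)\b",                                  # clitics: « est-ce », « celui-là »
]

# One alternation, compiled once at load time, then reused for every token.
TOKEN_MATCH = re.compile("|".join(f"(?:{p})" for p in patterns)).match
```

Compiling the alternation is where the one-off 1.1s goes; matching at runtime is then cheap compared to maintaining and loading a giant exception table.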

So for now I'm out of ideas to make this faster :)

@ines

Member

commented Dec 16, 2018

good point. Benchmarking French(), the loading time goes down on average from 5.0-5.2s (develop branch) to 3.1-3.3s (this PR). I added that to the description above.

Nice! That's like 30% faster, which is really good 👍

In general, I do think we have to accept that some data will just take slightly longer to load. Some languages are simply harder to tokenize and lemmatize than others... So all we can really do here is write efficient code, avoid redundancy and use smart rules. (Or we make trade-offs and accept slightly lower tokenization accuracy on certain edge cases in favour of loading speed, but it's rarely that simple.)

I'll merge this PR so we can start testing it. We've already started training the new models for the next nightly, so we'll be able to see this change in action in the next round. Thanks again for the great work on this btw 🎉

@ines ines merged commit c6ad557 into explosion:develop Dec 16, 2018

2 checks passed

  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed

@svlandeg svlandeg deleted the svlandeg:feature/faster-french-develop branch Dec 16, 2018
