Question/Feature Request: Reducing spaCy package size #2851

Closed
DomHudson opened this issue Oct 15, 2018 · 5 comments

@DomHudson

commented Oct 15, 2018

Summary

Hi, I was recently investigating why our Docker images are so large, and I noticed that the spaCy installation (v2.0.12) takes up approximately 346 MB in the site-packages directory.

The vast majority of that size comes from the language packages. As I only use English, my plan of attack was to fork spaCy and remove all languages other than en. After removing them, the folder size goes down to 37.5 MB. This approach seems to work fine, although I am wary of doing this.
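For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using only the standard library. It assumes the language data sits in a lang subdirectory of the installed package; the helper names (dir_size, subdir_sizes, report) are made up for illustration and are not part of spaCy.

```python
import importlib.util
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files below `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def subdir_sizes(root: Path) -> dict:
    """Map each immediate subdirectory of `root` to its size in bytes."""
    return {d.name: dir_size(d) for d in sorted(root.iterdir()) if d.is_dir()}


def report(package: str = "spacy", subdir: str = "lang", top: int = 10) -> None:
    """Print the largest subdirectories of <package>/<subdir> (assumed layout)."""
    # find_spec locates a top-level package on disk without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        print(f"{package} is not installed")
        return
    root = Path(list(spec.submodule_search_locations)[0]) / subdir
    if not root.is_dir():
        print(f"{root} does not exist")
        return
    sizes = subdir_sizes(root)
    largest = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
    for name, size in largest:
        print(f"{name:<12} {size / 1e6:8.1f} MB")
```

Calling report() in the environment in question should list the heaviest language folders first, which makes it easier to decide what to strip from an image.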

My feature request is

What would be required to implement functionality to split out other languages so they are either optionally included in an installed wheel or optionally installed? Is this something that other people would find beneficial?

My question is

Is this approach safe, and is it the best option available?

Many thanks
Dom


[Image: screenshot from 2018-10-15 18-28-45]

@ines added the enhancement label Oct 15, 2018

@ines

Member

commented Oct 15, 2018

Thanks for bringing this up and yes, I definitely agree!

I think the biggest bloat at the moment comes from the lemmatization lookup tables and other similar resources. The language data itself should be pretty light and only really include smaller dictionaries of rules for tokenization, norms, lemmatization and so on. Going forward, we'd love to transition the lemmatizers to rule-based solutions that rely on the tagger or to entirely statistical components (shipped via the models). These would also perform much better, so it's a win-win overall.

Developing the components isn't trivial, but we do have someone working on lemmatization for spaCy now. You can find more details in #2668. We've also been getting awesome community contributions (most recently for Greek and French).
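To make the size argument above concrete, here is a toy sketch contrasting a lookup table with suffix rules keyed on the part-of-speech tag. Everything in it (the LOOKUP and RULES data, lookup_lemma, rule_lemma) is invented for illustration and is not spaCy's lemmatizer; the point is only that a table needs one entry per word form, while a handful of rules covers forms it has never seen.

```python
# Lookup approach: one entry per inflected form, so the table grows with vocabulary.
LOOKUP = {"running": "run", "cats": "cat", "studies": "study"}


def lookup_lemma(word, table=LOOKUP):
    """Return the stored lemma, or the word unchanged if it is not in the table."""
    return table.get(word, word)


# Rule approach: a few (suffix, replacement) pairs per part-of-speech tag.
RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}


def rule_lemma(word, pos, rules=RULES):
    """Apply the first matching suffix rule for the given tag (toy logic only)."""
    for suffix, replacement in rules.get(pos, []):
        # Require a stem of at least two characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word
```

The rules handle unseen forms such as "walking" with no extra storage, but they need a tag from the tagger and still get irregular or doubled-consonant forms wrong, which is part of why developing real components isn't trivial.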

@DomHudson

Author

commented Oct 15, 2018

@ines Thanks very much for your thoughts and quick response! Good to hear that it's on your radar. Do you think my interim approach of simply removing those folders for now is okay?

Many thanks
Dom

@ines

Member

commented Oct 15, 2018

Do you think my interim approach of simply removing those folders for now is okay?

Yes, from spaCy's perspective, this should be okay. The language data is lazy-loaded, so spacy.lang.tr, for instance, is only required if you load a model that specifies it, if you run util.get_lang_class('tr'), or if you actually import it from spacy.lang. (And if you run the tests, I guess.)

@ines

Member

commented Mar 10, 2019

Merging this with #3258!

@lock


commented Apr 9, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock (bot) locked as resolved and limited conversation to collaborators Apr 9, 2019
