Question/Feature Request: Reducing spaCy package size #2851
Hi, I was recently investigating the causes of our large Docker images and noticed that the spaCy installation (v2.0.12) takes up approximately 346 MB in the site-packages directory.
The vast majority of that size comes from the language packages. As I only use English, my plan of attack was to fork spaCy and remove all languages other than
My feature request is:
What would be required to implement functionality to split out other languages so they are either optionally included in an installed wheel or optionally installed? Is this something that other people would find beneficial?
My question is:
Is this a safe approach, and is it the best one?
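For anyone who wants to check where the space goes in their own install, here is a minimal sketch that sums file sizes per subdirectory using only the standard library. The helper names `dir_size_bytes` and `subdir_sizes` are made up for this example and are not part of spaCy; the script assumes the language data lives under `spacy/lang` in the installed package and simply reports nothing useful if that directory is missing.

```python
import importlib.util
import os
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def subdir_sizes(package_dir):
    """Return (name, size_in_bytes) for each immediate subdirectory, largest first."""
    package_dir = Path(package_dir)
    sizes = [
        (child.name, dir_size_bytes(child))
        for child in package_dir.iterdir()
        if child.is_dir()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("spacy")
    if spec is None or spec.origin is None:
        print("spaCy is not installed in this environment")
    else:
        # Assumption: language data sits in the `lang` subpackage.
        lang_dir = Path(spec.origin).parent / "lang"
        if not lang_dir.is_dir():
            print(f"No lang directory found at {lang_dir}")
        else:
            for name, size in subdir_sizes(lang_dir)[:10]:
                print(f"{name:10s} {size / 1e6:8.1f} MB")
```

Running it inside the Docker image prints the ten largest language directories, which should make it clear whether removing languages would actually recover most of the space.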
Thanks for bringing this up and yes, I definitely agree!
I think the biggest bloat at the moment comes from the lemmatization lookup tables and other similar resources. The language data itself should be pretty light and only really include smaller dictionaries of rules for tokenization, norms, lemmatization and so on. Going forward, we'd love to transition the lemmatizers to rule-based solutions that rely on the tagger or to entirely statistical components (shipped via the models). These would also perform much better, so it's a win-win overall.
Developing the components isn't trivial, but we do have someone working on lemmatization for spaCy now. You can find more details in #2668. We've also been getting awesome community contributions (most recently for Greek and French).
Yes, from spaCy's perspective, this should be okay – the language data is lazy-loaded, so