v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
- NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
- NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
- NEW: `KnowledgeBase` API to train and access entity linking models, plus scripts to train your own Wikidata models (see the sketch after this list).
- NEW: 10× faster `PhraseMatcher` and improved phrase matching algorithm (usage example after this list).
- NEW: `DocBin` class to efficiently serialize collections of `Doc` objects (example after this list).
- NEW: Train text classification models on the command line with `spacy train` and get `textcat` results.
- NEW: `debug-data` command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
- NEW: Efficient `Lookups` class using Bloom filters that allows storing, accessing and serializing large dictionaries via the `Vocab` (see the example after this list).
- Data augmentation in `spacy train` via the `--orth-variant-level` flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
- Add `nlp.pipe_labels` (labels assigned by pipeline components).
- Add `spacy_displacy_colors` entry point to allow packages to add entity colors to displaCy.
- Add `template` config option in `displacy` to customize the entity HTML template (sketch after this list).
- Improve match pattern validation and handling of unsupported attributes.
- Add lookup lemmatization data for Croatian and Serbian.
- Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
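
For reference, a minimal sketch of the new `KnowledgeBase` API; the entity ID, frequency, vector values and alias below are invented for illustration, so check the API docs for the exact signatures before relying on them.

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")

# Toy knowledge base with 3-dimensional entity vectors.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# Register a candidate entity and an alias that may refer to it.
kb.add_entity(entity="Q42", freq=42, entity_vector=[0.1, 0.2, 0.3])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

print(kb.get_size_entities(), kb.get_size_aliases())
```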
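
A minimal usage sketch of the faster `PhraseMatcher`; the terminology list and the `LOWER` attribute are example choices, not part of the release itself.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match case-insensitively

# v2-style add(): the second argument is an optional on_match callback.
patterns = [nlp.make_doc(text) for text in ["machine learning", "deep learning"]]
matcher.add("ML_TERMS", None, *patterns)

doc = nlp("Machine learning and deep learning are related fields.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```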
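
And a sketch of round-tripping a couple of docs through `DocBin`; the attribute list and texts are made up for the example.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
for text in ["First document.", "Second document."]:
    doc_bin.add(nlp(text))

data = doc_bin.to_bytes()  # compact bytes you can write to disk
docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print(len(docs))
```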
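
The `Lookups` container can also be used on its own; the table name and data here are made up.

```python
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("my_lemma_lookup", {"went": "go", "going": "go"})

table = lookups.get_table("my_lemma_lookup")
print(table.get("went"))                     # "go"
print(lookups.has_table("my_lemma_lookup"))  # True

data = lookups.to_bytes()  # serializable as a whole, Bloom filters included
```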
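
Finally, a sketch of customizing entity rendering at runtime with the `colors` and new `template` displaCy options (the `spacy_displacy_colors` entry point is the packaged way to register colors); the placeholder names `{bg}`, `{text}` and `{label}` follow the default entity template, and the sentence, span and color are invented.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Ines works at Explosion")
doc.ents = [Span(doc, 3, 4, label="ORG")]  # manually annotated example entity

options = {
    "colors": {"ORG": "#aa9cfc"},
    # Custom entity markup via the new "template" option.
    "template": "<mark style='background: {bg}'>{text} <sup>{label}</sup></mark>",
}
html = displacy.render(doc, style="ent", options=options)
print(html)
```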
🔴 Bug fixes
- Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
- Fix issue #3540: Update lemma and vector information after splitting a token.
- Fix issue #3687: Automatically skip duplicates in
- Fix issue #3830: Retrain German model and fix
- Fix issue #3850: Allow customizing entity HTML template in displaCy.
- Fix issue #3879, #3951, #4154: Fix bug in `Matcher` retry loop that'd cause problems with
- Fix issue #3917: Raise error for negative token indices in
- Fix issue #3922: Add
- Fix issue #3959, #4133: Make sure both `pos` and `tag` are correctly serialized.
- Fix issue #3972: Ensure `PhraseMatcher` returns multiple matches for identical rules.
- Fix issue #4020: Raise error for overlapping entities in `biluo_tags_from_offsets`.
- Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
- Fix issue #4070: Improve token pattern checking without validation.
- Fix issue #4096: Add checks for cycles in
- Fix issue #4100: Improve docs on phrase pattern attributes.
- Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
- Fix issue #4104: Make visualized NER examples in docs more clear.
- Fix issue #4107: Automatically set span root attributes on merging.
- Fix issue #4111, #4170: Improve NER/IOB converters.
- Fix issue #4120: Correctly handle `?` operator at the end of a pattern.
- Fix issue #4123: Provide more details in cycle error message
- Fix issue #4138: Correctly open `.html` files as UTF-8 in
- Fix issue #4139: Make emoticon data a raw string.
- Fix issue #4148: Add missing API docs for
- Fix issue #4155: Correct language code for Serbian.
- Fix issue #4165: Add more attributes to matcher validation schema.
- Fix issue #4190: Fix caching issue that'd cause the tokenizer to not be deserialized correctly.
- Fix issue #4200: Work around `tqdm` bug that'd remove text color from terminal output.
- Fix issue #4229: Fix handling of pre-set entities.
- Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
- Fix issue #4242: Make the `.pos` / `.tag` distinction more clear in the docs.
- Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
- Fix issue #4262: Fix handling of spaces in Japanese.
- Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
- Fix issue #4270: Fix
- Fix issue #4302: Remove duplicate
- Fix issue #4303: Correctly support
- Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
- Fix issue #4308: Fix bug that could cause `PhraseMatcher` with very large lists to miss matches.
- Fix issue #4348: Ensure training doesn't crash with empty batches.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- The lemmatization tables have been moved to their own package, `spacy-lookups-data`, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. `spacy.blank("en")`), you'll need to explicitly install spaCy plus data via `pip install spacy[lookups]`. The data will be registered automatically via entry points.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of the `Vocab` and serialized with it. This means that serialized objects (`nlp`, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
- The `Lemmatizer` class is now initialized with an instance of `Lookups` containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom `Lemmatizer`, you'll need to update your code (see the sketch after this list).
- If you've been training your own models, you'll need to retrain them with the new version.
- The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
- The `spacy download` command does not set the `--no-deps` pip argument anymore by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, `--no-deps` is added back automatically to prevent spaCy from being downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them; see the example after this list). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new `debug-data` command to find problems in your data.
- Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an `ent_iob` value set, it won't be reset to an "unset" state and will always have at least `O` assigned. `list(doc.ents)` now actually keeps the annotations on the token level consistent, instead of resetting `O` to an empty string.
- The default punctuation in the `Sentencizer` has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]` on initialization (see the snippet after this list).
- The `PhraseMatcher` algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results – however, the results should now be fully correct. See #4309 for details on this change.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is now available via the correct code `sr`.
- The `sources` in the model `meta.json` have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used `nlp.meta["sources"]`, you might have to update it.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Add "label scheme" section to all models in the models directory that lists the labels assigned by the different components.
- Extend the `sources` listed in the `meta.json` of pre-trained models with more details on the training corpora and include more information in the models directory.
- Add more examples of matching regular expressions.
- Add instructions for training an entity linking model.
- Add API docs for new
- Add new projects to the spaCy Universe.
- Add example for interactive model visualizer with Streamlit.
- Fix various typos and inconsistencies.
Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.
Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.