
v2.0.0 alpha: Neural network models, Pickle, better training & lots of API improvements

Pre-release
@honnibal released this on 05 Jun 2017

PyPI last update: 2.0.0rc2 (2017-11-07)

This is an alpha pre-release of spaCy v2.0.0 and is available on pip as spacy-nightly. It's not intended for production use. The alpha documentation is available at alpha.spacy.io. Please note that the docs reflect the library's intended state on release, not the current state of the implementation. For bug reports, feedback and questions, see the spaCy v2.0.0 alpha thread.

Before installing v2.0.0 alpha, we recommend setting up a clean environment.
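For example, using virtualenv (a minimal sketch; any isolated-environment tool works just as well):

virtualenv .env
source .env/bin/activate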

pip install spacy-nightly

The models are still under development and will keep improving. For more details, see the benchmarks below. There will also be additional models for German, French and Spanish.

| Name | Lang | Capabilities | Size | spaCy | Info |
| --- | --- | --- | --- | --- | --- |
| en_core_web_sm-2.0.0a4 | en | Parser, Tagger, NER | 42MB | >=2.0.0a14 | ℹ️ |
| en_vectors_web_lg-2.0.0a0 | en | Vectors (GloVe) | 627MB | >=2.0.0a10 | ℹ️ |
| xx_ent_wiki_sm-2.0.0a0 | multi | NER | 12MB | <=2.0.0a9 | ℹ️ |

You can download a model by using its name or shortcut. To load a model, use spaCy's loader, e.g. nlp = spacy.load('en_core_web_sm'), or import it as a module (import en_core_web_sm) and call its load() method, e.g. nlp = en_core_web_sm.load().

python -m spacy download en_core_web_sm
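For example, both loading styles side by side (assuming en_core_web_sm has been downloaded as above):

import spacy
import en_core_web_sm

nlp = spacy.load('en_core_web_sm')  # via spaCy's loader
nlp = en_core_web_sm.load()         # via the model package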

📈 Benchmarks

The evaluation was conducted on raw text with no gold standard information. Speed and accuracy are currently comparable to the v1.x models: speed on CPU is slightly lower, while accuracy is slightly higher. We expect performance to improve quickly between now and the release date, as we run more experiments and optimise the implementation.

| Model | spaCy | Type | UAS | LAS | NER F | POS | Words/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm-2.0.0a4 | v2.x | neural | 91.9 | 90.0 | 85.0 | 97.1 | 10,000 |
| en_core_web_sm-2.0.0a3 | v2.x | neural | 91.2 | 89.2 | 85.3 | 96.9 | 10,000 |
| en_core_web_sm-2.0.0a2 | v2.x | neural | 91.5 | 89.5 | 84.7 | 96.9 | 10,000 |
| en_core_web_sm-1.1.0 | v1.x | linear | 86.6 | 83.8 | 78.5 | 96.6 | 25,700 |
| en_core_web_md-1.2.1 | v1.x | linear | 90.6 | 88.5 | 81.4 | 96.7 | 18,800 |

✨ Major features and improvements

  • NEW: Neural network model for English (comparable performance to the >1GB v1.x models) and multi-language NER (still experimental).
  • NEW: GPU support via Chainer's CuPy module.
  • NEW: Strings are now resolved to hash values, instead of mapped to integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state.
  • NEW: Trainable document vectors and contextual similarity via convolutional neural networks.
  • NEW: Built-in text classification component.
  • NEW: Built-in displaCy visualizers with Jupyter notebook support.
  • NEW: Alpha tokenization for Danish, Polish and Indonesian.
  • Improved language data, support for lazy loading and simple, lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
  • Improved language processing pipelines and support for custom, model-specific components.
  • Improved and consistent saving, loading and serialization across objects, plus Pickle support (see the first sketch after this list).
  • Revised matcher API to make it easier to add and manage patterns and callbacks in one step (also sketched after this list).
  • Support for multi-language models and new MultiLanguage class (xx).
  • Entry point for spacy command to use instead of python -m spacy.
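For example, the new Pickle support means a whole pipeline can be serialized with the standard library. A minimal sketch:

import pickle
import spacy

nlp = spacy.load('en_core_web_sm')
data = pickle.dumps(nlp)   # serialize the loaded pipeline
nlp2 = pickle.loads(data)  # restore it, e.g. in another process

Similarly, the revised matcher API registers patterns and an optional callback in a single call. A short sketch of the new Matcher.add:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# add(key, on_match callback or None, one or more token patterns)
matcher.add('HELLO_WORLD', None, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
matches = matcher(nlp(u'Hello world!'))  # list of (match_id, start, end)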

🚧 Work in progress (not yet implemented)

  • NEW: Neural network models for German, French and Spanish.
  • NEW: Binder, a container class for serializing collections of Doc objects.

🔴 Bug fixes

  • Fix issue #125, #228, #299, #377, #460, #606, #930: Add full Pickle support.
  • Fix issue #152, #264, #322, #343, #437, #514, #636, #785, #927, #985, #992, #1011: Fix and improve serialization and deserialization of Doc objects.
  • Fix issue #512: Improve parser to prevent it from returning two ROOT objects.
  • Fix issue #524: Improve parser and handling of noun chunks.
  • Fix issue #621: Prevent double spaces from changing the parser result.
  • Fix issue #664, #999, #1026: Fix bugs that would prevent loading trained NER models.
  • Fix issue #671, #809, #856: Fix importing and loading of word vectors.
  • Fix issue #753: Resolve bug that would tag OOV items as personal pronouns.
  • Fix issue #905, #1021, #1042: Improve parsing model and allow faster accuracy updates.
  • Fix issue #995: Improve punctuation rules for Hebrew and other non-Latin languages.
  • Fix issue #1008: train command finally works correctly if used without dev_data.
  • Fix issue #1012: Improve documentation on model saving and loading.
  • Fix issue #1043: Improve NER models and allow faster accuracy updates.
  • Fix issue #1051: Improve error messages if functionality needs a model to be installed.
  • Fix issue #1071: Correct typo of "whereve" in English tokenizer exceptions.
  • Fix issue #1088: Emoji are now split into separate tokens wherever possible.

📖 Documentation and examples

🚧 Work in progress (not yet implemented)

⚠️ Backwards incompatibilities

Note that the old v1.x models are not compatible with spaCy v2.0.0. If you've trained your own models, you'll have to re-train them to be able to use them with the new version. For a full overview of changes in v2.0, see the alpha documentation and guide on migrating from spaCy 1.x.

Loading models

spacy.load() is now only intended for loading models – if you need an empty language class, import it directly instead, e.g. from spacy.lang.en import English. If the model you're loading is a shortcut link or package name, spaCy will expect it to be a model package, import it and call its load() method. If you supply a path, spaCy will expect it to be a model data directory and use the meta.json to initialise a language class and call nlp.from_disk() with the data path.

nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('/model-data')
nlp = English().from_disk('/model-data')
# OLD: nlp = spacy.load('en', path='/model-data')

Hash values instead of integer IDs

The StringStore now resolves all strings to hash values instead of integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state, making a lot of workflows much simpler, especially during training. However, you still need to make sure all objects have access to the same Vocab. Otherwise, spaCy won't be able to resolve hashes back to their string values.

nlp.vocab.strings[u'coffee']       # 3197928453018144401
other_nlp.vocab.strings[u'coffee'] # 3197928453018144401
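
Given the same Vocab, a hash resolves back to the original string:

nlp.vocab.strings[3197928453018144401]  # u'coffee'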

Serialization

spaCy's serialization API is now consistent across objects. All containers and pipeline components have .to_disk(), .from_disk(), .to_bytes() and .from_bytes() methods.

nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
# OLD: nlp.save_to_directory('/model')
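
The bytes methods follow the same pattern for in-memory serialization:

bytes_data = nlp.to_bytes()
nlp.from_bytes(bytes_data)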

Processing pipelines

Models can now define their own processing pipelines as a list of strings mapping to component names. Components receive a Doc, modify it and return it to be processed by the next component in the pipeline. You can add custom components to nlp.pipeline and disable components by listing their names in the disable keyword argument. The tokenizer can simply be overwritten with a custom function.

nlp = spacy.load('en', disable=['tagger', 'ner'])
nlp.tokenizer = my_custom_tokenizer
nlp.pipeline.append(my_custom_component)
doc = nlp(u"I don't want parsed", disable=['parser'])
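
A pipeline component is simply a callable that receives the Doc and returns it. The my_custom_component used above is hypothetical; a minimal version might look like this:

def my_custom_component(doc):
    # inspect or modify the Doc here, e.g. set attributes or merge spans
    print('Pipeline received doc with', len(doc), 'tokens')
    return doc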

Comparison table

For the complete table and more details, see the alpha guide on what's new in v2.0.

| Old | New | Notes |
| --- | --- | --- |
| spacy.en, spacy.de, ... | spacy.lang.en, ... | Language data moved to lang. |
| .save_to_directory, .dump, .dump_vectors | .to_disk, .to_bytes | Consistent serialization. |
| .load, .load_lexemes, .load_vectors, .load_vectors_from_bin_loc | .from_disk, .from_bytes | Consistent serialization. |
| Language.create_make_doc | Language.tokenizer | Tokenizer can now be replaced via nlp.tokenizer. |
| Matcher.add_pattern, Matcher.add_entity | Matcher.add | Simplified API. |
| Matcher.get_entity, Matcher.has_entity | Matcher.get, Matcher.__contains__ | Simplified API. |
| Doc.read_bytes | Binder | Consistent API. |
| Token.is_ancestor_of | Token.is_ancestor | Duplicate method. |

👥 Contributors

This release is brought to you by @honnibal and @ines. Thanks to @Gregory-Howard, @luvogels, @ferdous-al-imran, @uetchy, @akYoung, @kengz, @raphael0202, @ardeego, @yuvalpinter, @dvsrepo, @frascuchon, @oroszgy, @v3t3a, @Tpt, @thinline72, @jarle, @jimregan, @nkruglikov, @delirious-lettuce and @geovedi for the pull requests and contributions. Also thanks to everyone who submitted bug reports and took the spaCy user survey – your feedback made a big difference!