
v2.0.0 alpha: Neural network models, Pickle, better training & lots of API improvements

Pre-release
@honnibal released this on 05 Jun 2017

PyPI last update: 2.0.0rc2 (2017-11-07)

This is an alpha pre-release of spaCy v2.0.0 and is available on pip as spacy-nightly. It's not intended for production use. The alpha documentation is available at alpha.spacy.io. Please note that the docs reflect the library's intended state on release, not the current state of the implementation. For bug reports, feedback and questions, see the spaCy v2.0.0 alpha thread.

Before installing v2.0.0 alpha, we recommend setting up a clean environment.
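For example, using virtualenv (a minimal sketch; any isolated-environment tool works just as well):

virtualenv .env
source .env/bin/activate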

pip install spacy-nightly

The models are still under development and will keep improving. For more details, see the benchmarks below. There will also be additional models for German, French and Spanish.

| Name | Lang | Capabilities | Size | spaCy | Info |
| --- | --- | --- | --- | --- | --- |
| en_core_web_sm-2.0.0a4 | en | Parser, Tagger, NER | 42MB | >=2.0.0a14 | ℹ️ |
| en_vectors_web_lg-2.0.0a0 | en | Vectors (GloVe) | 627MB | >=2.0.0a10 | ℹ️ |
| xx_ent_wiki_sm-2.0.0a0 | multi | NER | 12MB | <=2.0.0a9 | ℹ️ |

You can download a model by using its name or shortcut. To load a model, use spaCy's loader, e.g. nlp = spacy.load('en_core_web_sm'), or import it as a module (import en_core_web_sm) and call its load() method, e.g. nlp = en_core_web_sm.load().

python -m spacy download en_core_web_sm
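For example, both loading styles side by side (assuming en_core_web_sm has been downloaded as above):

import spacy
import en_core_web_sm

nlp = spacy.load('en_core_web_sm')  # via spaCy's loader
nlp = en_core_web_sm.load()         # via the model package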

📈 Benchmarks

The evaluation was conducted on raw text with no gold standard information. Speed and accuracy are currently comparable to the v1.x models: speed on CPU is slightly lower, while accuracy is slightly higher. We expect performance to improve quickly between now and the release date, as we run more experiments and optimise the implementation.

| Model | spaCy | Type | UAS | LAS | NER F | POS | Words/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm-2.0.0a4 | v2.x | neural | 91.9 | 90.0 | 85.0 | 97.1 | 10,000 |
| en_core_web_sm-2.0.0a3 | v2.x | neural | 91.2 | 89.2 | 85.3 | 96.9 | 10,000 |
| en_core_web_sm-2.0.0a2 | v2.x | neural | 91.5 | 89.5 | 84.7 | 96.9 | 10,000 |
| en_core_web_sm-1.1.0 | v1.x | linear | 86.6 | 83.8 | 78.5 | 96.6 | 25,700 |
| en_core_web_md-1.2.1 | v1.x | linear | 90.6 | 88.5 | 81.4 | 96.7 | 18,800 |

✨ Major features and improvements

  • NEW: Neural network model for English (comparable performance to the >1GB v1.x models) and multi-language NER (still experimental).
  • NEW: GPU support via Chainer's CuPy module.
  • NEW: Strings are now resolved to hash values, instead of mapped to integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state.
  • NEW: Trainable document vectors and contextual similarity via convolutional neural networks.
  • NEW: Built-in text classification component.
  • NEW: Built-in displaCy visualizers with Jupyter notebook support.
  • NEW: Alpha tokenization for Danish, Polish and Indonesian.
  • Improved language data, support for lazy loading and simple, lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
  • Improved language processing pipelines and support for custom, model-specific components.
  • Improved and consistent saving, loading and serialization across objects, plus Pickle support (see the first sketch after this list).
  • Revised matcher API to make it easier to add and manage patterns and callbacks in one step (also sketched after this list).
  • Support for multi-language models and new MultiLanguage class (xx).
  • Entry point for spacy command to use instead of python -m spacy.
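For example, the new Pickle support means a whole pipeline can be serialized with the standard library. A minimal sketch:

import pickle
import spacy

nlp = spacy.load('en_core_web_sm')
data = pickle.dumps(nlp)   # serialize the loaded pipeline
nlp2 = pickle.loads(data)  # restore it, e.g. in another process

Similarly, the revised matcher API registers patterns and an optional callback in a single call. A short sketch of the new Matcher.add:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# add(key, on_match callback or None, one or more token patterns)
matcher.add('HELLO_WORLD', None, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
matches = matcher(nlp(u'Hello world!'))  # list of (match_id, start, end)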

🚧 Work in progress (not yet implemented)

  • NEW: Neural network models for German, French and Spanish.
  • NEW: Binder, a container class for serializing collections of Doc objects.

🔴 Bug fixes

  • Fix issue #125, #228, #299, #377, #460, #606, #930: Add full Pickle support.
  • Fix issue #152, #264, #322, #343, #437, #514, #636, #785, #927, #985, #992, #1011: Fix and improve serialization and deserialization of Doc objects.
  • Fix issue #512: Improve parser to prevent it from returning two ROOT objects.
  • Fix issue #524: Improve parser and handling of noun chunks.
  • Fix issue #621: Prevent double spaces from changing the parser result.
  • Fix issue #664, #999, #1026: Fix bugs that would prevent loading trained NER models.
  • Fix issue #671, #809, #856: Fix importing and loading of word vectors.
  • Fix issue #753: Resolve bug that would tag OOV items as personal pronouns.
  • Fix issue #905, #1021, #1042: Improve parsing model and allow faster accuracy updates.
  • Fix issue #995: Improve punctuation rules for Hebrew and other non-Latin languages.
  • Fix issue #1008: train command finally works correctly if used without dev_data.
  • Fix issue #1012: Improve documentation on model saving and loading.
  • Fix issue #1043: Improve NER models and allow faster accuracy updates.
  • Fix issue #1051: Improve error messages if functionality needs a model to be installed.
  • Fix issue #1071: Correct typo of "whereve" in English tokenizer exceptions.
  • Fix issue #1088: Emoji are now split into separate tokens wherever possible.

📖 Documentation and examples

🚧 Work in progress (not yet implemented)

⚠️ Backwards incompatibilities

Note that the old v1.x models are not compatible with spaCy v2.0.0. If you've trained your own models, you'll have to re-train them to be able to use them with the new version. For a full overview of changes in v2.0, see the alpha documentation and guide on migrating from spaCy 1.x.

Loading models

spacy.load() is now only intended for loading models – if you need an empty language class, import it directly instead, e.g. from spacy.lang.en import English. If the model you're loading is a shortcut link or package name, spaCy will expect it to be a model package, import it and call its load() method. If you supply a path, spaCy will expect it to be a model data directory and use the meta.json to initialise a language class and call nlp.from_disk() with the data path.

nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('/model-data')
nlp = English().from_disk('/model-data')
# OLD: nlp = spacy.load('en', path='/model-data')

Hash values instead of integer IDs

The StringStore now resolves all strings to hash values instead of integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state, making a lot of workflows much simpler, especially during training. However, you still need to make sure all objects have access to the same Vocab. Otherwise, spaCy won't be able to resolve hashes back to their string values.

nlp.vocab.strings[u'coffee']       # 3197928453018144401
other_nlp.vocab.strings[u'coffee'] # 3197928453018144401
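
Given the same Vocab, a hash resolves back to the original string:

nlp.vocab.strings[3197928453018144401]  # u'coffee'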

Serialization

spaCy's serialization API is now consistent across objects. All containers and pipeline components have .to_disk(), .from_disk(), .to_bytes() and .from_bytes() methods.

nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
# OLD: nlp.save_to_directory('/model')
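
The bytes methods follow the same pattern for in-memory serialization:

bytes_data = nlp.to_bytes()
nlp.from_bytes(bytes_data)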

Processing pipelines

Models can now define their own processing pipelines as a list of strings mapping to component names. Components receive a Doc, modify it and return it to be processed by the next component in the pipeline. You can add custom components to nlp.pipeline and disable components by listing their names in the disable keyword argument. The tokenizer can simply be overwritten with a custom function.

nlp = spacy.load('en', disable=['tagger', 'ner'])
nlp.tokenizer = my_custom_tokenizer
nlp.pipeline.append(my_custom_component)
doc = nlp(u"I don't want parsed", disable=['parser'])
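
A pipeline component is simply a callable that receives the Doc and returns it. The my_custom_component used above is hypothetical; a minimal version might look like this:

def my_custom_component(doc):
    # inspect or modify the Doc here, e.g. set attributes or merge spans
    print('Pipeline received doc with', len(doc), 'tokens')
    return doc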

Comparison table

For the complete table and more details, see the alpha guide on what's new in v2.0.

| Old | New | Notes |
| --- | --- | --- |
| spacy.en, spacy.de, ... | spacy.lang.en, ... | Language data moved to lang. |
| .save_to_directory, .dump, .dump_vectors | .to_disk, .to_bytes | Consistent serialization. |
| .load, .load_lexemes, .load_vectors, .load_vectors_from_bin_loc | .from_disk, .from_bytes | Consistent serialization. |
| Language.create_make_doc | Language.tokenizer | Tokenizer can now be replaced via nlp.tokenizer. |
| Matcher.add_pattern, Matcher.add_entity | Matcher.add | Simplified API. |
| Matcher.get_entity, Matcher.has_entity | Matcher.get, Matcher.__contains__ | Simplified API. |
| Doc.read_bytes | Binder | Consistent API. |
| Token.is_ancestor_of | Token.is_ancestor | Duplicate method. |

👥 Contributors

This release is brought to you by @honnibal and @ines. Thanks to @Gregory-Howard, @luvogels, @ferdous-al-imran, @uetchy, @akYoung, @kengz, @raphael0202, @ardeego, @yuvalpinter, @dvsrepo, @frascuchon, @oroszgy, @v3t3a, @Tpt, @thinline72, @jarle, @jimregan, @nkruglikov, @delirious-lettuce and @geovedi for the pull requests and contributions. Also thanks to everyone who submitted bug reports and took the spaCy user survey – your feedback made a big difference!