🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Fix bugs in beam-search training objective.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
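
Aligning the tokens in the training data with the tokenizer's output can be viewed as an edit-distance problem. The following is a minimal sketch of token-level Levenshtein distance that keeps only one row of the table in memory. It is an illustration of the general technique, not spaCy's internal implementation, and the function name `levenshtein` is made up for this example.

```python
def levenshtein(source, target):
    """Return the minimum number of insertions, deletions and substitutions
    needed to turn the sequence `source` into the sequence `target`."""
    # previous[j] holds the distance between source[:i - 1] and target[:j].
    previous = list(range(len(target) + 1))
    for i, src_item in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to an empty target
        for j, tgt_item in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_item != tgt_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical example: the tokenizer splits "can't" but the gold data doesn't.
gold_tokens = ["I", "can't", "go"]
predicted_tokens = ["I", "ca", "n't", "go"]
distance = levenshtein(gold_tokens, predicted_tokens)
```

Because only the previous row is kept, memory use grows with the length of one sequence rather than with the product of both lengths.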

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.

CLI

  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
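
As an illustration of the Doc.retokenize context manager, the sketch below merges two tokens into one. It assumes a blank English pipeline created with `spacy.blank`, so that no statistical model has to be downloaded, and the example sentence is made up.

```python
import spacy

# A blank pipeline only contains the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("I live in New York")
tokens_before = [token.text for token in doc]

# Merges registered inside the block are applied together when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])

tokens_after = [token.text for token in doc]
```

Collecting the merges and applying them in one go is what makes this cheaper than merging spans one at a time.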

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm | English | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | English | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | English | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | German | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | German | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | Spanish | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | Spanish | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | Portuguese | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | French | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | 𐄂 | 32 MB |
| fr_core_news_md | French | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| it_core_news_sm | Italian | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | Dutch | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| el_core_news_sm | Greek | 2.1.0a0 | 84.5 | 81.0 | 95.0 | 73.5 | 𐄂 | 27 MB |
| el_core_news_md | Greek | 2.1.0a0 | 87.7 | 84.7 | 96.3 | 80.2 | ✓ | 143 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

  1. We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.


We had to release another update to the v2.0.x branch of spaCy to resolve a dependency issue, so we decided to also include and/or backport a bunch of features and fixes that were originally intended for v2.1.0 (see here for the nightly version).

✨ New features and improvements

  • NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
  • NEW: Mecab-based Japanese tokenization and lemmatization.
  • NEW: Add Norwegian rule-based and lookup lemmatization.
  • NEW: Add Danish lookup lemmatization based on the Den store danske SprogTeknologiske Ordbase (STO) dataset, courtesy of the University of Copenhagen.
  • NEW: Romanian lookup lemmatization.
  • Improve language data for Polish, Turkish, French, Romanian, Swedish and Japanese.
  • Improve case-sensitive lookup lemmatization in German.
  • Add Token.sent property that returns the sentence Span the token is part of.
  • Add remove_extension method on Doc, Token and Span.
  • Add Doc.is_sentenced property that returns True if sentence boundaries have been applied.
  • Allow ignoring warnings by code via the SPACY_WARNING_IGNORE environment variable.
  • Add --silent option to info command.

🔴 Bug fixes

  • Fix issue #1456: Pass additional arguments of download command to pip and check if model is already installed before downloading it.
  • Fix issue #2191: Update README section on tests and dependencies.
  • Fix issue #2194: Ensure that Doc.noun_chunks_iterator isn't None before calling it.
  • Fix issue #2196: Return data in cli.info and add silent option.
  • Fix issue #2200: Correct typo in spacy package command message.
  • Fix issue #2210: Fix bug in Spanish noun chunks.
  • Fix issue #2211, #2320: Resolve problem in download command and use requests library again.
  • Fix issue #2219: Fix token similarity of single-letter tokens.
  • Fix issue #2222, #2223: Fix typos in documentation and docstrings.
  • Fix issue #2226: Use correct, non-deprecated merge syntax in merge_ents.
  • Fix issue #2228: Fix deserialization when using tensor=False or sentiment=False.
  • Fix issue #2238: Correct Swedish lookup lemmatization.
  • Fix issue #2242: Add remove_extension method on Doc, Token and Span.
  • Fix issue #2266: Add collapse_phrases option to displaCy visualizer.
  • Fix issue #2269: Fix KeyError by renaming SP to _SP.
  • Fix issue #2304: Don't require attrs argument in Doc.retokenize and allow ints/unicode.
  • Fix issue #2361: Escape HTML tags in displacy.render.
  • Fix issue #2376: Improve Matcher examples and add section on using pipeline components.
  • Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
  • Fix issue #2452: Fix bug that would cause displacy arrows to only point in one direction.
  • Fix issue #2477: Also allow Span objects in displacy.render.
  • Fix issue #2490: Update Thinc's dependencies for Python 3.7 compatibility.
  • Fix issue #2495: Fix loading tokenizer with custom prefix search.
  • Fix issue #2514: Switch from msgpack-python to msgpack to hopefully prevent conda from downloading a two-year-old spaCy version when installing with the latest Anaconda distribution.
  • Ensure that Doc.is_tagged is set correctly when using Language.pipe.
  • Fix bug in merge_noun_chunks factory that would return None if Doc wasn't parsed.
  • Explicitly require pathlib backport on Python 2 only.
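
To make the IOB to BILUO conversion mentioned above more concrete, here is a minimal sketch of how multi-word entities map between the two schemes. It is not spaCy's own converter: the function name `iob_to_biluo` and the example tags are assumptions for illustration, and the input is assumed to be well-formed IOB where every entity starts with a B- tag.

```python
def iob_to_biluo(tags):
    """Convert a list of well-formed IOB tags to the BILUO scheme."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        # Split only once, so that labels containing dashes stay intact.
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


iob = ["O", "B-GPE", "I-GPE", "I-GPE", "O", "B-PERSON", "B-ORG", "I-ORG"]
converted = iob_to_biluo(iob)
```

Single-token entities become U- (unit) tags, while the final token of a longer entity becomes L- (last), so a three-token entity is tagged B-, I-, L-.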

📖 Documentation and examples

  • NEW: Edit and execute code examples in your browser – all across the documentation!
  • NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
  • NEW: Experimental rule-based Matcher Explorer demo – create token patterns interactively, test them against your text and copy-paste the Python pattern code.
  • NEW: Document Cython API.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @mollerhoj, @howl-anderson, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @bellabie, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @ansgar-t, @mpszumowski, @91ns, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @DuyguA, @stefan-it, @Eleni170, @datascouting, @tjkemp, @x-ji, @giannisdaras, @kororo and @katarkor for the pull requests and contributions.

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Fix bugs in beam-search training objective.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.

CLI

  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | 𐄂 | 32 MB |
| fr_core_news_md | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| it_core_news_sm | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| xx_ent_wiki_sm | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

  1. We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA for the pull requests and contributions.

📊 Help us improve spaCy and take the User Survey 2018!


✨ New features and improvements

  • NEW: Alpha Vietnamese support with tokenization via Pyvi.
  • NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified asserts have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
  • Improve language data for Polish.
  • Tidy up dependencies and drop six, html5lib, ftfy and requests.
  • Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the beam_update_prob entry in nlp.parser.cfg. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
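
The idea of mixing greedy and beam updates can be sketched as a per-sentence coin flip. The snippet below is purely illustrative and does not use spaCy: the function `choose_update_modes` is hypothetical, and it assumes that `beam_update_prob` is the probability of performing a full beam update, with the cheaper greedy update used otherwise.

```python
import random


def choose_update_modes(n_sentences, beam_update_prob=0.5, seed=0):
    """Pick "beam" or "greedy" for each sentence (illustrative only)."""
    rng = random.Random(seed)
    modes = []
    for _ in range(n_sentences):
        # Assumption: beam_update_prob is the chance of a full beam update;
        # otherwise the cheaper greedy update is used for that sentence.
        modes.append("beam" if rng.random() < beam_update_prob else "greedy")
    return modes


modes = choose_update_modes(1000)
beam_share = modes.count("beam") / len(modes)
```

With the default of 0.5, roughly half of the sentences end up with each update type, which matches the behaviour described above.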

🔴 Bug fixes

  • Fix issue #1554, #1752, #2159: Fix Token.ent_iob after Doc.merge(), and ensure consistency in Doc.ents.
  • Fix issue #1660: Fix loading of multiple vector models.
  • Fix issue #1967: Allow entity types with dashes.
  • Fix issue #2032: Fix accidentally quadratic runtime in Vocab.set_vector.
  • Fix issue #2050: Correct mistakes in Italian lemmatizer data.
  • Fix issue #2073: Make Token.set_extension work as expected.
  • Fix issue #2100, #2151, #2181: Drop six and html5lib and prevent dependency conflict with TensorFlow / Keras.
  • Fix issue #2101: Improve error message if token text is empty string.
  • Fix issue #2121: Fix Language.to_bytes and pickling in Thinc.
  • Fix issue #2156: Fix hashtag example in Matcher docs.
  • Fix issue #2177: Don't raise error in set_extension if getter and setter are specified or if default=None, and add error if setter is specified with no getter.

📖 Documentation and examples

👥 Contributors

Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.


📊 Help us improve spaCy and take the User Survey 2018!


✨ New features and improvements

  • Improve language data for Turkish and Croatian.
  • Add built-in factories for merge_entities and merge_noun_chunks to allow models to specify those components as part of their pipeline.
merge_entities = nlp.create_pipe('merge_entities')
nlp.add_pipe(merge_entities, after='ner')

🔴 Bug fixes

  • Fix issue #2012: Fix Spanish noun_chunks failure caused by typo.
  • Fix issue #2040: Make sure Token.lemma always returns a hash value.
  • Fix issue #2063: Correct typo in English lookup lemmatization table.
  • Fix issue #2103: Correct typo in documentation.
  • Fix pickling of Vectors class.

📖 Documentation and examples

  • Add example for visualizing spaCy vectors with the TensorBoard Embedding Projector.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @thomasopsomer, @alldefector, @DuyguA, @dejanmarich, @justindujardin, @calumcalder, @SebastinSanty, @iann0036, @doug-descombaz and @willismonroe for the pull requests and contributions.