Major features and improvements

  • NEW: custom processing pipelines, to support deep learning workflows
  • NEW: Rule matcher now supports entity IDs and attributes
  • NEW: Official/documented training APIs and GoldParse class
  • Download and use GloVe vectors by default
  • Make it easier to load and unload word vectors
  • Improved rule matching functionality
  • Move basic data into the code, rather than JSON files. This makes it simpler to use the tokenizer without the models installed, and makes adding new languages much easier.
  • Replace file-system strings with Path objects. You can now load resources over your network, or do similar trickery, by passing any object that supports the Path protocol.
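To illustrate the Path-object change, here is a minimal sketch of what "an object that supports the Path protocol" could look like. The class InMemoryPath, the helper read_resource and the file name vocab/strings.json are hypothetical and chosen for illustration. Exactly which Path methods the library calls is an assumption; this sketch covers only the `/` operator, exists() and open().

```python
import io
from pathlib import Path


class InMemoryPath:
    """Hypothetical Path-like object backed by a dict instead of the file system."""

    def __init__(self, files, name=""):
        self._files = files  # maps "dir/file" strings to text contents
        self._name = name

    def __truediv__(self, child):
        # Mirror pathlib's `/` operator for joining path segments.
        joined = f"{self._name}/{child}" if self._name else str(child)
        return InMemoryPath(self._files, joined)

    def exists(self):
        if self._name == "" or self._name in self._files:
            return True
        # Treat any key prefix as an existing "directory".
        prefix = self._name + "/"
        return any(key.startswith(prefix) for key in self._files)

    def open(self, mode="r", **kwargs):
        data = self._files[self._name]
        if "b" in mode:
            return io.BytesIO(data.encode("utf8"))
        return io.StringIO(data)


def read_resource(root, *parts):
    """Read a text resource below root; root may be a Path or any Path-like object."""
    location = root
    for part in parts:
        location = location / part
    if not location.exists():
        return None
    with location.open("r") as file_:
        return file_.read()


fake_root = InMemoryPath({"vocab/strings.json": "[]"})
print(read_resource(fake_root, "vocab", "strings.json"))
```

Because read_resource only relies on the shared methods, the same call should also work with a real pathlib.Path as the root.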

⚠️ Backwards incompatibilities

  • The data_dir keyword argument of Language.__init__ (and its subclasses English.__init__ and German.__init__) has been renamed to path.
  • Details of how the Language base class and its subclasses are loaded, and how defaults are accessed, have changed substantially. If you have your own subclasses, you should review the changes.
  • The deprecated token.repvec name has been removed.
  • The .train() method of Tagger and Parser has been renamed to .update().
  • The previously undocumented GoldParse class has a new __init__() method. The old method has been preserved in GoldParse.from_annot_tuples().
  • Previously undocumented details of the Parser class have changed.
  • The previously undocumented get_package and get_package_by_name helper functions have been moved into a new module, spacy.deprecated, in case you still need them while you update.
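While updating calling code, the two renames above (data_dir to path, and .train() to .update()) could be bridged with small helpers like the following. This is only a sketch using the standard library: rename_data_dir and get_update_method are hypothetical names, not part of the library, and they operate on your own keyword dictionaries and objects.

```python
import warnings


def rename_data_dir(kwargs):
    """Return a copy of kwargs with the old `data_dir` key renamed to `path`."""
    updated = dict(kwargs)
    if "data_dir" in updated:
        if "path" in updated:
            raise TypeError("Pass either data_dir or path, not both")
        warnings.warn(
            "data_dir has been renamed to path",
            DeprecationWarning,
            stacklevel=2,
        )
        updated["path"] = updated.pop("data_dir")
    return updated


def get_update_method(model):
    """Return model.update if it exists, otherwise fall back to the old model.train."""
    update = getattr(model, "update", None)
    if update is None:
        update = model.train
    return update


old_kwargs = {"data_dir": "/tmp/model"}  # illustrative value only
print(rename_data_dir(old_kwargs))
```

The converted dictionary can then be unpacked into the constructor call, for example `English(**rename_data_dir(old_kwargs))`, assuming the remaining keys are still accepted.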

🔴 Bug fixes

  • Fix get_lang_class bug when GloVe vectors are used.
  • Fix Issue #411: doc.sents raised IndexError on empty string.
  • Fix Issue #455: Correct lemmatization logic.
  • Fix Issue #371: Make Lexeme objects hashable.
  • Fix Issue #469: Make noun_chunks detect root NPs.

👥 Contributors

Thanks to @daylen, @RahulKulhari, @stared, @adamhadani, @izeye and @crawfordcomeaux for the pull requests!