@honnibal released this Jan 16, 2017

✨ Major features and improvements

  • Update the token exception handling mechanism to allow arbitrary functions to be used as token exception matchers.
  • Improve how tokenizer exceptions for English contractions and punctuation are generated.
  • Update language data for Hungarian and Swedish tokenization.
  • Update to use Thinc v6 to prepare for spaCy v2.0.
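
To illustrate what using an arbitrary function as a token exception matcher means, here is a minimal sketch in plain Python. It is not spaCy's implementation: `tokenize`, its `token_match` parameter and the `abbrev_match` pattern are hypothetical names chosen for this example. The idea is that any callable returning a truthy value for a chunk of text can protect that chunk from being split further.

```python
import re

# Hypothetical matcher: dotted abbreviations such as "U.S." or "e.g.".
# Any callable that returns a truthy value for a string would work here.
abbrev_match = re.compile(r"(?:[A-Za-z]\.){2,}$").match


def tokenize(text, token_match=None, suffixes=".,!?"):
    """Toy whitespace tokenizer (an assumption, not spaCy's algorithm).

    Trailing punctuation is split off each chunk unless the whole chunk
    is accepted by the optional token_match callable.
    """
    tokens = []
    for chunk in text.split():
        # The exception matcher is consulted before any splitting happens.
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)
            continue
        trailing = []
        while len(chunk) > 1 and chunk[-1] in suffixes:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        # Suffixes were collected right to left, so restore their order.
        tokens.extend(reversed(trailing))
    return tokens


with_matcher = tokenize("She moved to the U.S. in May.", token_match=abbrev_match)
without_matcher = tokenize("She moved to the U.S. in May.")
```

With the matcher, "U.S." stays a single token. Without it, the final period is split off like any other sentence punctuation.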

🔴 Bug fixes

  • Fix issue #326: Tokenizer is now more consistent and handles abbreviations correctly.
  • Fix issue #344: Tokenizer now handles URLs correctly.
  • Fix issue #483: Period after two or more uppercase letters is split off in tokenizer exceptions.
  • Fix issue #631: Add rich comparison method (__richcmp__) to Token.
  • Fix issue #718: Contractions with She are now handled correctly.
  • Fix issue #736: Times are now tokenized with correct string values.
  • Fix issue #743: Token is now hashable.
  • Fix issue #744: were and Were are now excluded correctly from contractions.
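
The fixes for #631 and #743 make tokens comparable and hashable. The sketch below shows what those two properties allow, using a hypothetical `ToyToken` class in plain Python. It assumes a token is identified by its document and its index within that document, which is an assumption made for this example and not a description of spaCy's internals.

```python
from functools import total_ordering


@total_ordering
class ToyToken:
    """Hypothetical stand-in for a token, identified by document and index."""

    def __init__(self, doc_id, i, text):
        self.doc_id = doc_id
        self.i = i
        self.text = text

    def _key(self):
        # Assumption: identity and ordering depend only on (document, index).
        return (self.doc_id, self.i)

    def __eq__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Must agree with __eq__ so equal tokens land in the same bucket.
        return hash(self._key())


first = ToyToken(0, 0, "She")
second = ToyToken(0, 1, "'s")
same_as_first = ToyToken(0, 0, "She")

unique_tokens = {first, second, same_as_first}
in_order = sorted([second, first])
```

Because equality and hashing use the same key, such objects can be used as dictionary keys or set members, and rich comparison lets them be sorted by position.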

📋 Tests

  • Modernise and reorganise all tests and remove model dependencies where possible.
  • Improve test speed to ~20s for basic tests (from previously >80s) and ~100s including models (from previously >200s).
  • Add fixtures for spaCy components and test utilities, e.g. to create a Doc object manually.
  • Add documentation for tests to explain conventions and organisation.

👥 Contributors

Thanks to @oroszgy, @magnusburton, @guyrosin and @danielhers for the pull requests!