✨ Major features and improvements
- Update the token exception handling mechanism to allow arbitrary functions to be used as token exception matchers.
- Improve how tokenizer exceptions for English contractions and punctuation are generated.
- Update language data for Hungarian and Swedish tokenization.
- Update to use Thinc v6 to prepare for spaCy v2.0.
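
To illustrate the first item above, here is a minimal sketch of what "arbitrary functions as token exception matchers" can mean in practice. It is a standalone toy tokenizer, not spaCy's implementation or API: `simple_tokenize`, `token_match` and `URL_RE` are hypothetical names chosen for this example. Any callable that returns a truthy value for a chunk causes that chunk to be kept as a single token.

```python
import re
from typing import Callable, List, Optional

# Hypothetical names throughout: this is an illustration, not spaCy's API.
URL_RE = re.compile(r"^https?://\S+$")


def simple_tokenize(
    text: str,
    token_match: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Split on whitespace, then split off trailing punctuation,
    unless token_match accepts the whole chunk as an exception."""
    tokens: List[str] = []
    for chunk in text.split():
        # If the matcher accepts the chunk, keep it intact as one token.
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)
            continue
        # Otherwise peel trailing punctuation off one character at a time.
        trailing: List[str] = []
        while chunk and chunk[-1] in ".,!?;:":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        # Characters were collected right to left, so restore their order.
        tokens.extend(reversed(trailing))
    return tokens


# Any callable works as a matcher: a compiled regex's bound method or a lambda.
with_matcher = simple_tokenize("See http://example.com/a?b=1!", token_match=URL_RE.match)
without_matcher = simple_tokenize("See http://example.com/a?b=1!")
```

Because the matcher is just a callable, a regular expression's `match` method, a set-membership test, or a hand-written function can be swapped in without changing the tokenizer itself.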
🔴 Bug fixes
- Fix issue #326: Tokenizer is now more consistent and handles abbreviations correctly.
- Fix issue #344: Tokenizer now handles URLs correctly.
- Fix issue #483: Period after two or more uppercase letters is split off in tokenizer exceptions.
- Fix issue #631: Add
- Fix issue #718: Contractions with `She` are now handled correctly.
- Fix issue #736: Times are now tokenized with correct string values.
- Fix issue #743: `Token` is now hashable.
- Fix issue #744: `Were` is now excluded correctly from contractions.
- Modernise and reorganise all tests and remove model dependencies where possible.
- Improve test speed to ~20s for basic tests (down from >80s) and ~100s including models (down from >200s).
- Add fixtures for spaCy components and test utilities, e.g. to create
- Add documentation for tests to explain conventions and organisation.