v2.1.0a0: New models, new languages, joint word segmentation and parsing, 4 new base languages, bug fixes & more
Released by ines.

This is an alpha pre-release of spaCy v2.1.0, available on pip as `spacy-nightly`. It's not intended for production use.
`pip install -U spacy-nightly`

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models; see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
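Levenshtein alignment builds on the edit distance between two sequences, for example two tokenizations of the same text. As a rough illustration only, here is a plain dynamic-programming sketch of that distance. It is not spaCy's optimised implementation, and the `levenshtein` function name and the example tokens are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    # prev[j] is the distance between the items of `a` seen so far
    # (excluding the current one) and the first j items of `b`.
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # distance from the first i items of `a` to an empty `b`
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Two hypothetical tokenizations of the same text.
gold = ["I", "live", "in", "New York"]
predicted = ["I", "live", "in", "New", "York"]
distance = levenshtein(gold, predicted)
print(distance)
```

This version only keeps two rows of the table in memory, so memory use grows with the shorter dimension rather than with the product of both lengths.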
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
- NEW: Mecab-based Japanese tokenization.
- NEW: Add Danish lookup lemmatization based on the STO dataset (Den store danske SprogTeknologiske Ordbase), courtesy of the University of Copenhagen.
- NEW: Romanian lookup lemmatization.
- Improve language data for Polish, Turkish, Romanian and Swedish.
- Improve case-sensitive lookup lemmatization in German.
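Lookup lemmatization maps a surface form directly to a lemma through a table. The sketch below illustrates the idea with a tiny hypothetical table and a hypothetical `lookup_lemma` helper; neither is spaCy's actual data or API. It tries the exact, case-sensitive form first and only then falls back to the lowercased form, which is one way to handle languages like German, where capitalisation can distinguish a noun from a verb.

```python
# Hypothetical lookup table: surface form -> lemma.
LOOKUP = {
    "Häuser": "Haus",
    "Essen": "Essen",   # noun
    "essen": "essen",   # verb
    "ging": "gehen",
}


def lookup_lemma(word, table=LOOKUP):
    """Return the lemma for `word`, or the word itself if it is unknown."""
    # Prefer the exact (case-sensitive) entry.
    if word in table:
        return table[word]
    # Fall back to the lowercased form, e.g. for sentence-initial words.
    return table.get(word.lower(), word)


lemmas = [lookup_lemma(w) for w in ["Häuser", "Essen", "essen", "Ging", "xyz"]]
print(lemmas)
```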
CLI
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Add `--silent` option to `download` and `info` commands.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
Other
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Add `Token.sent` property that returns the sentence `Span` the token is part of.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Add `Doc.is_sentenced` property that returns `True` if sentence boundaries have been applied.
- Allow ignoring warning by code via the `SPACY_WARNING_IGNORE` environment variable.
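As a brief illustration of the `Doc.retokenize` context manager, the following sketch merges two tokens into one. It assumes spaCy is installed and uses a blank English pipeline, so no model download is needed. The example sentence is arbitrary.

```python
import spacy

# A blank pipeline only contains the tokenizer, so no model is required.
nlp = spacy.blank("en")
doc = nlp("I live in New York")

# Merges are collected inside the block and applied when it exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])  # merge "New" and "York" into one token

tokens = [token.text for token in doc]
print(tokens)
```

After the block exits, the `Doc` should contain four tokens, with "New York" as the last one.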
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
🔴 Bug fixes
- Fix issue #1456: Pass additional arguments of `download` command to `pip` and check if model is already installed before downloading it.
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1987: Make `span.sent` work with manual or custom SBD.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2191: Update `README` section on tests and dependencies.
- Fix issue #2194: Ensure that `Doc.noun_chunks_iterator` isn't `None` before calling it.
- Fix issue #2196: Return data in `cli.info` and add `silent` option.
- Fix issue #2200: Correct typo in `spacy package` command message.
- Fix issues #2211 and #2320: Resolve problem in `download` command and use `requests` library again.
- Fix issue #2219: Fix token similarity of single-letter tokens.
- Fix issues #2222 and #2223: Fix typos in documentation and docstrings.
- Fix issue #2228: Fix deserialization when using `tensor=False` or `sentiment=False`.
- Fix issue #2242: Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Fix issues #2252 and #2238: Correct Swedish lookup lemmatization.
- Fix issue #2266: Add `collapse_phrases` option to displaCy visualizer.
- Fix issue #2269: Fix `KeyError` by renaming `SP` to `_SP`.
- Fix issue #2304: Don't require `attrs` argument in `Doc.retokenize` and allow ints/unicode.
- Fix issue #2361: Escape HTML tags in `displacy.render`.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2376: Improve `Matcher` examples and add section on using pipeline components.
- Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
- Fix issue #2452: Fix bug that would cause `displacy` arrows to only point in one direction.
- Fix issue #2477: Also allow `Span` objects in `displacy.render`.
- Fix issue #2495: Fix loading tokenizer with custom prefix search.
- Fix serialization of custom tokenizer if not all functions are defined.
- Ensure that `Doc.is_tagged` is set correctly when using `Language.pipe`.
- Fix bug in `merge_noun_chunks` factory that would return `None` if `Doc` wasn't parsed.
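The IOB to BILUO conversion mentioned in #2385 can be illustrated with a small standalone sketch. The `iob_to_biluo` function below is a hypothetical re-implementation for illustration, not spaCy's own code, and it assumes well-formed input where every entity starts with a `B-` tag. It shows how multi-word entities receive `B-`, `I-` and `L-` tags, while single-token entities become `U-`.

```python
def iob_to_biluo(tags):
    """Convert a list of IOB tags (e.g. "B-LOC", "I-LOC", "O") to BILUO."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # First token: B- if more tokens follow, U- for a single-token entity.
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # Inside token: I- if more tokens follow, L- for the last token.
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


iob = ["O", "B-LOC", "I-LOC", "I-LOC", "O", "B-PER"]
converted = iob_to_biluo(iob)
print(converted)
```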
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Version | UAS | LAS | POS | NER F | Vec | Size |
|---|---|---|---|---|---|---|---|
| `en_core_web_sm` | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | ✗ | 28 MB |
| `en_core_web_md` | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| `en_core_web_lg` | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| `de_core_news_sm` | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | ✗ | 26 MB |
| `de_core_news_md` | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| `es_core_news_sm` | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | ✗ | 28 MB |
| `es_core_news_md` | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| `pt_core_news_sm` | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | ✗ | 29 MB |
| `fr_core_news_sm` | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | ✗ | 32 MB |
| `fr_core_news_md` | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| `it_core_news_sm` | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | ✗ | 27 MB |
| `nl_core_news_sm` | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | ✗ | 27 MB |
| `xx_ent_wiki_sm` | 2.1.0a0 | - | - | - | 83.8 | ✗ | 9 MB |

1. We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
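To make the UAS and LAS columns concrete: UAS counts tokens whose predicted head is correct, while LAS additionally requires the dependency label to match. The sketch below computes both on made-up data. The `attachment_scores` helper and the example parse are hypothetical and unrelated to how the benchmark figures above were produced.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    `gold` and `pred` are equal-length lists of (head_index, label) pairs,
    one pair per token.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    # UAS: only the head has to match.
    heads_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both head and label have to match.
    both_correct = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * heads_correct / n, 100.0 * both_correct / n


# Made-up four-token parse: one wrong label, one wrong head.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Because a correct label on a wrong head never counts, LAS can never exceed UAS, which matches the pattern in the table.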
📖 Documentation and examples
- NEW: Edit and execute code examples in your browser, all across the documentation!
- NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
- NEW: Experimental rule-based `Matcher` Explorer demo: create token patterns interactively, test them against your text and copy-paste the Python pattern code.
- NEW: Document Cython API.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @howl-anderson, @mollerhoj, @savkov, @jimregan, @fucking-signup, @thomasopsomer, @DuyguA, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @mpszumowski, @NourShalabi, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @stefan-it, @Eleni170 and @datascouting for the pull requests and contributions.