0fc3dee

New features and improvements

  • NEW: Registered scoring functions for each component in the config.
  • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
  • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
  • overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
  • extend config setting for morphologizer for whether existing feature types are preserved.
  • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
  • New package spacy-loggers for additional loggers.
  • New Irish lemmatizer.
  • New Portuguese noun chunks and updated Spanish noun chunks.
  • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
  • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
  • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
  • LIKE_URL attribute includes the tokenizer URL pattern.
  • --n-save-epoch option for spacy pretrain.
  • Trained pipelines:
    • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
    • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
    • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
    • Universal Dependencies corpora updated to v2.8.
    • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
    • English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
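As a rough illustration of the Doc-input feature listed above, the sketch below builds a Doc with nlp.make_doc, sets a custom extension on it, and then passes the Doc itself to nlp(). It assumes spaCy v3.2 or later is installed. The extension name text_id and the sample text are hypothetical choices for the example.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical custom extension, used only for this illustration.
# force=True avoids an error if the extension is already registered.
Doc.set_extension("text_id", default=None, force=True)

nlp = spacy.blank("en")

# Tokenize first, then attach metadata before running the pipeline.
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500

# From v3.2 on, nlp() accepts a Doc as well as a string.
doc = nlp(doc)

print(doc._.text_id, [token.text for token in doc])
```

With a blank pipeline no components run, so the call mainly shows that the pre-built Doc and its extension value pass through unchanged.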

🔴 Bug fixes

  • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
  • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
  • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
  • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

  • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
  • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
  • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

006df1a

New features and improvements

  • NEW: Binary wheels for Python 3.10.
  • NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
  • GPU profiling with spacy.models_with_nvtx_range.v1.
  • Full mypy integration in the CI and many type fixes across the code base.
  • Added custom Protocol classes in ty.py to define behavior of pipeline components.
  • Support for entity linking visualization in displacy.
  • Allow overriding vars in spacy project assets.
  • Standalone train function to run the training from Python scripts just like the spacy train CLI.
  • Support for spacy-transformers>=1.1.0 with improved IO.
  • Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

  • Fix issue #5507: Improve UX for multiprocessing on GPU.
  • Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
  • Fix issue #9244: Fix vectors for 0-length spans.
  • Fix issue #9247: Improve UX for the DocBin constructor.
  • Fix issue #9254: Allow unicode in a spacy project title.
  • Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
  • Fix issue #9305: Restore tokenization timing during evaluation.
  • Fix issue #9335: Sync vocab in vectors and sourced components.
  • Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
  • Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
  • Fix issue #9437: Improve UX around Doc object creation.
  • Fix issue #9465: Fix minor issues with convert CLI.
  • Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

  • Various updates to the documentation.
  • New additions to the spaCy universe:
    • deplacy: CUI-based dependency visualizer
    • ipymarkup: Visualizations for NER and syntax trees
    • PhruzzMatcher: Find fuzzy matches
    • spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
    • spaCyOpenTapioca: Entity Linking on Wikidata
    • spacy-clausie: Clause-based information extraction system
    • "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
    • "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

8bda39f

New features and improvements

  • WandbLogger v3 now supports optional run_name and entity parameters.
  • Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

  • Fix issue #9001: Pass alignments to Matcher callbacks.
  • Fix issue #9009: Include component factories in third-party dependencies resolver.
  • Fix issue #9012: Correct type of config in create_pipe.
  • Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
  • Fix issue #9033: Fix verbs list for French tokenizer exceptions.
  • Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
  • Fix issue #9074: Improve UX around repo and path arguments in spacy project.
  • Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
  • Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
  • Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

  • Various updates to the documentation.
  • A few additions and updates to the spaCy universe.
  • Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

e1f88de

New features and improvements

  • NEW: Provide scores for the SpanCategorizer predictions.
  • NEW: Broader compatibility with type checkers thanks to .pyi stub files.
  • NEW: Auto-detect package dependencies in spacy package.
  • New INTERSECTS operator for the Matcher.
  • More debugging info for spacy project push and pull commands.
  • Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
  • The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
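To make the set semantics behind the new INTERSECTS operator concrete, here is a minimal pure-Python sketch. It does not use the Matcher. It only mirrors the documented meaning of the operators for list-valued attributes such as MORPH: INTERSECTS matches when the attribute value shares at least one item with the pattern list, while IS_SUBSET and IS_SUPERSET require containment. The function names and feature values are illustrative assumptions.

```python
def intersects(value, pattern_list):
    """True if the attribute value shares at least one item with the pattern list."""
    return bool(set(value) & set(pattern_list))


def is_subset(value, pattern_list):
    """True if every item of the attribute value appears in the pattern list."""
    return set(value) <= set(pattern_list)


def is_superset(value, pattern_list):
    """True if the attribute value contains every item of the pattern list."""
    return set(value) >= set(pattern_list)


# Hypothetical morphological features of a token and a pattern list.
token_feats = ["Number=Sing", "Person=3"]
pattern = ["Number=Sing", "Tense=Past"]

print(intersects(token_feats, pattern))   # shares "Number=Sing"
print(is_subset(token_feats, pattern))    # "Person=3" is not in the pattern
print(is_superset(token_feats, pattern))  # "Tense=Past" is not on the token
```

In an actual Matcher pattern the operator would appear as a dictionary key on an attribute, for example {"MORPH": {"INTERSECTS": [...]}}.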

🔴 Bug fixes

  • Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
  • Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
  • Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
  • Fix issue #8796: Respect the no_skip value for spacy project run.
  • Fix issue #8810: Make ConsoleLogger flush after each logging line.
  • Fix issue #8819: Pass exclude when serializing the vocab.
  • Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
  • Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
  • Fix issue #8982: Add glossary entry for _SP.
  • Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

034ac0a

New features and improvements

  • Alpha tokenization support for Azerbaijani.
  • Updates for French stop words.

🔴 Bug fixes

  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7907: Update load_lookups return type and docstring.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8244: Use context manager when reading model file.
  • Fix issue #8245: Fix other open calls without context managers.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8396: Make JsonlReader path optional.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8584: Raise an error for textcat with <2 labels.
  • Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

ffaead8

New features and improvements

  • Alpha tokenization support for Ancient Greek.
  • Implementation of a noun_chunk iterator for Dutch.
  • Support for black & flake8 as pre-commit hooks.
  • New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.

🔴 Bug fixes

  • Fix issue #8638: Fix Azerbaijani initialization.
  • Fix issue #8639: Use 0-vector for OOV lexemes.
  • Fix issue #8640: Update lexeme ranks for loaded vectors.
  • Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
  • Fix issue #8663: Preserve existing meta information with spacy package.
  • Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

530b5d7

New features and improvements

For more details, see the New in v3.1 usage guide.

📦 New trained pipelines

Package            Language  UPOS  Parser LAS  NER F
ca_core_news_sm    Catalan   98.2  87.4        79.8
ca_core_news_md    Catalan   98.3  88.2        84.0
ca_core_news_lg    Catalan   98.5  88.4        84.2
ca_core_news_trf   Catalan   98.9  93.0        91.2
da_core_news_trf   Danish    98.0  85.0        82.9
⚠️ Upgrading from v3.0

  • Because configs are extensively versioned, v3.0 pipelines should be compatible with v3.1. However, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
  • Use spacy init fill-config to update a v3.0 config for v3.1.
  • When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
  • Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.

For more information, see Notes on upgrading from v3.0.
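Since logger warnings are now ordinary Python warnings, they can be managed with the standard warnings module. The sketch below uses only the standard library and does not import spaCy. The bracketed codes W999 and W000 and the message texts are hypothetical stand-ins for a library warning whose message starts with a code.

```python
import warnings


def emit_demo_warning():
    # Stand-in for a library warning; the code "W999" and the text are hypothetical.
    warnings.warn("[W999] Example pipeline warning", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning by default so the effect of the filter is visible.
    warnings.simplefilter("always")
    # "message" is a regular expression matched against the start of the text;
    # this filter is inserted ahead of the "always" rule and takes precedence.
    warnings.filterwarnings("ignore", message=r"\[W999\]")

    emit_demo_warning()                                   # suppressed
    warnings.warn("[W000] Another warning", UserWarning)  # still recorded

messages = [str(w.message) for w in caught]
print(messages)
```

Filtering on the message prefix keeps the rule narrow, so unrelated warnings from the same package still surface.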

🔴 Bug fixes

  • Fix issue #7036: Use a context manager when reading model.
  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7799: Ensure spacy ray command works.
  • Fix issue #7807: Show warning if entity ruler runs without patterns.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8099: Update Vietnamese tokenizer.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8388: Don't clobber vectors when loading components from source models.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8441: Add correct types for Language.pipe return values.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8559: Fix vectors check for sourced components.
  • Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

cae72e4

🔴 Bug fixes

  • Fix issue #8286: Fix spacy download.

2c1de4b

New features and improvements

  • Add base support for Amharic.
  • Add noun chunk iterator for Danish.
  • Updates to French, Portuguese and Romanian stop words.

🔴 Bug fixes

  • Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
  • Fix issue #6712: Prevent overlapping noun chunks for Spanish.
  • Fix issue #6745: Fix minibatch iterator when size iterator is finished.
  • Fix issue #6759: Skip 0-length matches in the Matcher.
  • Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
  • Fix issue #6772: Fix Span.text for empty spans.
  • Fix issue #6820: Improve Doc.char_span alignment_mode handling.
  • Fix issue #6857: Remove --no-cache-dir when downloading models.
  • Fix issue #8115: Fix offsets in Span.get_lca_matrix.

👥 Contributors

Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

df34444

New features and improvements

  • New assemble CLI command for assembling a pipeline from a config without training.
  • Add support for match alignments in the Matcher to align matched tokens with matcher patterns.
  • Add support for training from streamed corpora.
  • Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
  • Extend Scorer.score_spans to support overlapping and unlabeled spans.
  • Update debug data for new v3 components.
  • Improve language data for Italian.
  • Various improvements to error handling and UX.

🔴 Bug fixes

  • Fix issue #7408: Add vocab kwarg to spacy.load.
  • Fix issue #7419: Exclude user hooks in displacy conversion.
  • Fix issue #7421: Update --code usage in CLI commands.
  • Fix issue #7424: Preserve sent starts on retokenization without parse.
  • Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
  • Fix issue #7471: Improve warnings related to listening components.
  • Fix issue #7488: Fix upstream check in pretraining.
  • Fix issue #7489: Support callbacks entry points.
  • Fix issue #7497: Merge doc.spans in Doc.from_docs().
  • Fix issue #7528: Preserve user data for DependencyMatcher on spans.
  • Fix issue #7557: Fix __add__ method for PRFScore.
  • Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
  • Fix issue #7620: Fix replace_listeners in configs.
  • Fix issue #7626: Fix vectors data on GPU.
  • Fix issue #7630: Update NEL for entities crossing sentence boundaries.
  • Fix issue #7631: Fix parser sourcing in NER converter.
  • Fix issue #7642: Fix handling of hyphen string value in config files.
  • Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
  • Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
  • Fix issue #7690: Fix pickling of Lemmatizer.
  • Fix issue #7749: Update Tokenizer.explain for special cases in v3.
  • Fix issue #7755: Fix config parsing of ints/strings.
  • Fix issue #7836: Fix tokenizer cache flushing.
  • Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

📖 Documentation and examples

  • Add documentation for legacy functions and architectures.
  • Add documentation for pretrained pipeline design.
  • Add more details about pipe and multiprocessing.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!