
πŸŒ™ This is an alpha pre-release of spaCy v2.1.0 and is available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument (a short CLI sketch follows this list).
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
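
A rough sketch of how the new pretraining workflow fits together – the file names and output paths here are hypothetical, and the exact arguments should be checked against spacy pretrain --help:

# pre-train the CNN on raw text (JSONL with one "text" entry per line)
$ python -m spacy pretrain texts.jsonl en_vectors_web_lg /output/pretrain
# initialise the token-to-vector weights in spacy train from one of the model*.bin files written above
$ python -m spacy train en /output/model train_data.json dev_data.json -t2v /output/pretrain/model999.bin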

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word vectors in a variety of formats. This replaces the spacy vocab command, which is now deprecated (see the sketch after this list).
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.
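
For illustration, a hedged sketch of the reworked CLI commands above – the file names are made up, and flag names should be double-checked against spacy init-model --help and spacy download --help:

# build a model directory from lexical attributes (JSONL) and word vectors
$ python -m spacy init-model en /output/en_vocab --jsonl-loc lexical_attrs.jsonl --vectors-loc vectors.txt.gz
# extra arguments after the model name are passed straight through to pip, e.g.
$ python -m spacy download en_core_web_sm --user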

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971). Several of the new APIs in this list are shown in a combined example after the list.
  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise the codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
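
To make the new matching and retokenization APIs above more concrete, here's a minimal sketch of how they fit together in v2.1 – the texts and patterns are made up, and details may still change before the stable release:

import spacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")

# Doc.retokenize: merge a span into a single token in place
doc = nlp("New York City is big")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:3])

# Enhanced Matcher patterns: "TEXT" as an alias for "ORTH", plus REGEX and set membership
matcher = Matcher(nlp.vocab)
matcher.add("GREETING", None, [{"TEXT": {"REGEX": "^[Hh]ello$"}}, {"LOWER": {"IN": ["world", "there"]}}])

# PhraseMatcher on attributes other than ORTH, e.g. LOWER for case-insensitive matching
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("SPACY", None, nlp("spacy nightly"))

# EntityRuler: rule-based NER from match patterns
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]}])
nlp.add_pipe(ruler)

# Doc.to_json: training-format output for a processed Doc
print(nlp("I like New York").to_json())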

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

πŸ”΄ Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable (see the short snippet after this list).
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.
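
Two of the user-facing fixes above, shown in a quick illustrative snippet (the sentence and labels are made up):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Berlin is nice")
doc[0].pos_ = "PROPN"                # Token.pos_ is now writeable (#2014)
span = Span(doc, 0, 1, label="GPE")  # Span accepts a plain string label (#3027)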

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions (example after this list).
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
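
For example, after upgrading you can check which of your installed models are compatible in one step:

$ python -m spacy validate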

πŸ“ˆ Benchmarks

Model Language Version UAS LAS POS NER F Vec Size
en_core_web_sm English 2.1.0a6 91.5 89.6 96.8 85.5 𐄂 10 MB
en_core_web_md English 2.1.0a6 91.9 90.2 97.0 86.4 βœ“ 90 MB
en_core_web_lg English 2.1.0a6 92.0 90.2 97.0 86.6 βœ“ 788 MB
de_core_news_sm German 2.1.0a6 91.6 89.6 97.2 83.3 𐄂 10 MB
de_core_news_md German 2.1.0a6 92.2 90.3 97.5 83.9 βœ“ 210 MB
es_core_news_sm Spanish 2.1.0a6 90.3 87.3 97.0 89.0 𐄂 10 MB
es_core_news_md Spanish 2.1.0a6 90.9 88.1 97.2 89.3 βœ“ 69 MB
pt_core_news_sm Portuguese 2.1.0a6 89.4 86.0 80.4 89.1 𐄂 12 MB
fr_core_news_sm French 2.1.0a6 87.7 84.8 94.5 82.9 𐄂 14 MB
fr_core_news_md French 2.1.0a6 89.1 86.5 95.1 83.4 βœ“ 82 MB
it_core_news_sm Italian 2.1.0a6 90.9 87.2 95.9 86.4 𐄂 10 MB
nl_core_news_sm Dutch 2.1.0a6 83.7 77.6 91.5 87.1 𐄂 10 MB
el_core_news_sm Greek 2.1.0a6 85.0 81.5 94.8 73.1 𐄂 10 MB
el_core_news_md Greek 2.1.0a6 88.4 85.2 96.6 81.0 βœ“ 126 MB
xx_ent_wiki_sm Multi 2.1.0a6 - - - 81.6 𐄂 3 MB

πŸ’¬ UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

πŸ“– Documentation and examples

  • Fix various typos and inconsistencies.

πŸ‘₯ Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin and @moreymat for the pull requests and contributions.

Jan 21, 2019 – Set version to 2.1.0a6.dev1
Jan 5, 2019 – Set version to v2.1.0a6.dev0

πŸŒ™ This is an alpha pre-release of spaCy v2.1.0 and is available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word vectors in a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise the codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

πŸ”΄ Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

πŸ“ˆ Benchmarks

Model Language Version UAS LAS POS NER F Vec Size
en_core_web_sm English 2.1.0a5 91.2 89.3 96.9 85.6 𐄂 10 MB
en_core_web_md English 2.1.0a5 91.4 89.5 96.9 85.9 βœ“ 90 MB
en_core_web_lg English 2.1.0a5 91.5 89.7 97.0 86.3 βœ“ 788 MB
de_core_news_sm German 2.1.0a5 91.3 89.0 97.1 82.2 𐄂 10 MB
de_core_news_md German 2.1.0a5 92.0 90.0 97.4 82.7 βœ“ 210 MB
es_core_news_sm Spanish 2.1.0a5 89.9 86.7 96.6 87.3 𐄂 10 MB
es_core_news_md Spanish 2.1.0a5 90.6 87.7 97.0 88.0 βœ“ 69 MB
pt_core_news_sm Portuguese 2.1.0a5 89.3 86.0 78.5 87.8 𐄂 12 MB
fr_core_news_sm French 2.1.0a5 87.3 84.4 94.4 81.0 𐄂 14 MB
fr_core_news_md French 2.1.0a5 88.8 86.1 94.9 82.2 βœ“ 82 MB
it_core_news_sm Italian 2.1.0a5 90.8 87.0 95.7 84.8 𐄂 10 MB
nl_core_news_sm Dutch 2.1.0a5 83.7 77.4 90.9 85.4 𐄂 10 MB
el_core_news_sm Greek 2.1.0a5 85.5 81.8 94.7 75.9 𐄂 10 MB
el_core_news_md Greek 2.1.0a5 88.5 85.2 96.8 80.01 βœ“ 126 MB
xx_ent_wiki_sm Multi 2.1.0a5 - - - 82.8 𐄂 3 MB

πŸ’¬ UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

πŸ“– Documentation and examples

  • Fix various typos and inconsistencies.

πŸ‘₯ Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.

Dec 20, 2018 – Set version to v2.1.0a5
Dec 20, 2018 – Merge branch 'develop' of https://github.com/explosion/spaCy into develop

πŸŒ™ This is an alpha pre-release of spaCy v2.1.0 and is available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word vectors in a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise the codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

πŸ”΄ Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

πŸ“ˆ Benchmarks

Model Language Version UAS LAS POS NER F Vec Size
en_core_web_sm English 2.1.0a5 91.2 89.3 96.9 85.6 𐄂 10 MB
en_core_web_md English 2.1.0a5 91.4 89.5 96.9 85.9 βœ“ 90 MB
en_core_web_lg English 2.1.0a5 91.5 89.7 97.0 86.3 βœ“ 788 MB
de_core_news_sm German 2.1.0a5 91.3 89.0 97.1 82.2 𐄂 10 MB
de_core_news_md German 2.1.0a5 92.0 90.0 97.4 82.7 βœ“ 210 MB
es_core_news_sm Spanish 2.1.0a5 89.9 86.7 96.6 87.3 𐄂 10 MB
es_core_news_md Spanish 2.1.0a5 90.6 87.7 97.0 88.0 βœ“ 69 MB
pt_core_news_sm Portuguese 2.1.0a5 89.3 86.0 78.5 87.8 𐄂 12 MB
fr_core_news_sm French 2.1.0a5 87.3 84.4 94.4 81.0 𐄂 14 MB
fr_core_news_md French 2.1.0a5 88.8 86.1 94.9 82.2 βœ“ 82 MB
it_core_news_sm Italian 2.1.0a5 90.8 87.0 95.7 84.8 𐄂 10 MB
nl_core_news_sm Dutch 2.1.0a5 83.7 77.4 90.9 85.4 𐄂 10 MB
el_core_news_sm Greek 2.1.0a5 85.5 81.8 94.7 75.9 𐄂 10 MB
el_core_news_md Greek 2.1.0a5 88.5 85.2 96.8 80.01 βœ“ 126 MB
xx_ent_wiki_sm Multi 2.1.0a5 - - - 82.8 𐄂 3 MB

πŸ’¬ UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

πŸ“– Documentation and examples

  • Fix various typos and inconsistencies.

πŸ‘₯ Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.

Dec 1, 2018 – Merge branch 'develop' of https://github.com/explosion/spaCy into develop

✨ New features and improvements

  • NEW: Alpha tokenization support for Catalan (see the sketch after this list).
  • Improve French tokenization.
  • Fix regex pin to harmonise dependencies with conda.
  • Fix msgpack pin.
  • Update tests for pytest 4.0.
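
A minimal sketch of trying out the new alpha Catalan tokenization (the sample sentence is just an illustration):

from spacy.lang.ca import Catalan   # new alpha Catalan support

nlp = Catalan()
doc = nlp("Els gats dormen al sofà.")
print([token.text for token in doc])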

πŸ”΄ Bug fixes

  • Fix issue #2933: Correct mistake in is_ascii documentation.
  • Fix issue #2976: Fix bug where Vocab.prune_vectors did not use batch_size.
  • Fix issue #2986: Correctly document when Span.ents was added.
  • Fix issue #2995, #2996: Fix msgpack pin.

πŸ“– Documentation and examples

  • Fix various typos and inconsistencies.

πŸ‘₯ Contributors

Thanks to @mpuig, @ALSchwalm, @bpben, @svlandeg and @wxv for the pull requests and contributions.

Dec 1, 2018 – Try again to fix OSX build