explosion-bot released this Aug 15, 2018

File checksum: 8128b7ec4b7388397e04df2939ae492aecc1ad9e48be76376f9e0cee5612e090
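The checksum can be compared against a downloaded archive before installing. The sketch below assumes the 64-character hexadecimal value is a SHA-256 digest (the release notes do not name the algorithm), and the helper names `sha256_of_file` and `matches_checksum` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large model archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would look like `matches_checksum(archive_path, "8128b7ec...")`, where `archive_path` points at the downloaded model archive and the second argument is the full checksum string listed for that release.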

Greek pipeline with word vectors, POS tags, dependencies and named entities. Word vectors use Facebook's FastText Common Crawl vectors, pruned to a vocabulary of 20,000 items. Words outside the 20,000 most frequent were mapped to the nearest neighbouring vector among the 20,000 rows retained. Syntax (dependencies and POS tags) was trained on the Universal Dependencies conversion of the Greek Dependency Treebank (v2.2). Named entity annotations were created by Giannis Daras with Prodigy, following the OntoNotes 5 annotation schema.

Feature Description
Name el_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 27 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Greek Dependency Treebank, Daras GSOC 2018
License CC BY-NC 4.0
Author Giannis Daras

Accuracy

Type Score
ENTS_F  73.53
ENTS_P  73.53
ENTS_R  73.53
LAS  81.03
TAGS_ACC  94.97
TOKEN_ACC  100.00
UAS  84.45

Installation

pip install spacy-nightly
spacy download el_core_news_sm

explosion-bot released this Aug 15, 2018

File checksum: 937a52b6eed1847acc11662a33ed0a817a1376bf4c590c261fb1e07edc8cdc79

Greek pipeline with word vectors, POS tags, dependencies and named entities. Word vectors use Facebook's FastText Common Crawl vectors, pruned to a vocabulary of 20,000 items. Words outside the 20,000 most frequent were mapped to the nearest neighbouring vector among the 20,000 rows retained. Syntax (dependencies and POS tags) was trained on the Universal Dependencies conversion of the Greek Dependency Treebank (v2.2). Named entity annotations were created by Giannis Daras with Prodigy, following the OntoNotes 5 annotation schema.
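The vector pruning described in this paragraph can be sketched in a few lines. This is a toy illustration, not spaCy's implementation: the function name `prune_vectors`, the two-dimensional example vectors, and the use of cosine similarity to pick the "nearest" row are all assumptions. The release notes only state that words outside the retained vocabulary were mapped to the nearest retained vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_vectors(vectors, n_keep):
    """Keep the first `n_keep` rows and remap every other key to its nearest kept row.

    `vectors` maps word -> list of floats and is assumed to be ordered
    from most to least frequent.
    """
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    words = list(vectors)
    kept = words[:n_keep]
    rows = [vectors[word] for word in kept]
    key_to_row = {word: index for index, word in enumerate(kept)}
    for word in words[n_keep:]:
        # Dropped words reuse the row of the most similar retained vector.
        key_to_row[word] = max(
            range(len(rows)), key=lambda index: cosine(vectors[word], rows[index])
        )
    return rows, key_to_row


# Hypothetical toy vocabulary, most frequent word first.
toy = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "kitten": [0.1, 0.9], "a": [0.9, 0.2]}
rows, key_to_row = prune_vectors(toy, n_keep=2)
print(len(key_to_row), len(rows))  # 4 2
```

Under this scheme the number of keys stays at the full vocabulary size while the number of stored rows shrinks, which is why a vectors table can report far more keys than unique vectors.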

Feature Description
Name el_core_news_md
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 143 MB
Pipeline  tagger, parser, ner
Vectors 1999938 keys, 20000 unique vectors (300 dimensions)
Sources Common Crawl, Greek Dependency Treebank, Daras GSOC 2018
License CC BY-NC 4.0
Author Giannis Daras

Accuracy

Type Score
ENTS_F  80.17
ENTS_P  78.86
ENTS_R  81.51
LAS  84.70
TAGS_ACC  96.28
TOKEN_ACC  100.00
UAS  87.66

Installation

pip install spacy-nightly
spacy download el_core_news_md

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/xx#xx_ent_wiki_sm

File checksum: 4c6b990912d42e25d06a174ee5426feebe9084fe890128bf1902f20c5797eb95

Multi-lingual CNN trained on the Nothman et al. (2010) Wikipedia corpus. Assigns named entities. Supports identification of PER, LOC, ORG and MISC entities for English, German, Spanish, French, Italian, Portuguese and Russian.

Feature Description
Name xx_ent_wiki_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 8 MB
Pipeline  ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Wikipedia
License MIT
Author Explosion AI

Accuracy

Type Score
ENTS_F  83.79
ENTS_P  84.06
ENTS_R  83.53
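The three entity scores are related. The sketch below assumes ENTS_F is the usual F1 measure, the harmonic mean of precision (ENTS_P) and recall (ENTS_R); the release notes do not define the score, and the function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ENTS_P and ENTS_R as listed for xx_ent_wiki_sm.
print(round(f_score(84.06, 83.53), 2))  # 83.79
```

Because the published precision and recall are themselves rounded to two decimals, a value recomputed this way for other models may differ from the listed ENTS_F in the last digit.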

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text.

Installation

pip install spacy-nightly
spacy download xx_ent_wiki_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/pt#pt_core_news_sm

File checksum: 63f282f6b21e8d3fc0cef02fe4bd4849b9b8d04d5c563dc8e8f834c3af28574b

Portuguese multi-task CNN trained on the Universal Dependencies and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name pt_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 29 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Universal Dependencies, Wikipedia
License CC BY-SA 4.0
Author Explosion AI

Accuracy

Type Score
ENTS_F  82.72
ENTS_P  82.74
ENTS_R  82.69
LAS  86.26
TAGS_ACC  80.06
TOKEN_ACC  100.00
UAS  89.37

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download pt_core_news_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/nl#nl_core_news_sm

File checksum: c9e646547381e9b7c3f1e91b041f0e7f1d488d08ab058c6c51bf861cec38941e

Dutch multi-task CNN trained on the Universal Dependencies and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name nl_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 27 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Universal Dependencies, Wikipedia
License CC BY-SA 4.0
Author Explosion AI

Accuracy

Type Score
ENTS_F  87.28
ENTS_P  86.87
ENTS_R  87.69
LAS  77.57
TAGS_ACC  91.53
TOKEN_ACC  100.00
UAS  83.50

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download nl_core_news_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/it#it_core_news_sm

File checksum: 6cdaace95334a98d579fdaf0d1d885df9743430c0628d9a344d9fd2559dcce9a

Italian multi-task CNN trained on the Universal Dependencies and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name it_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 27 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Universal Dependencies, Wikipedia
License CC BY-NC-SA 3.0
Author Explosion AI

Accuracy

Type Score
ENTS_F  81.25
ENTS_P  81.51
ENTS_R  81.00
LAS  87.09
TAGS_ACC  96.08
TOKEN_ACC  100.00
UAS  90.73

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download it_core_news_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/fr#fr_core_news_sm

File checksum: c8c522b0c4fec7e58da8bf1f9344025304102e2e879e59b7c53d18fc15c292c4

French multi-task CNN trained on the French Sequoia (Universal Dependencies) and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name fr_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 31 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources Sequoia Corpus (UD), Wikipedia
License LGPL
Author Explosion AI

Accuracy

Type Score
ENTS_F  67.32
ENTS_P  67.81
ENTS_R  66.83
LAS  85.65
TAGS_ACC  94.42
TOKEN_ACC  100.00
UAS  88.79

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download fr_core_news_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/fr#fr_core_news_md

File checksum: 6d120daf4f024dc1d8c0523a6e10191f1163f9fb6dfbdac646c15d959dbe6d08

French multi-task CNN trained on the French Sequoia (Universal Dependencies) and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name fr_core_news_md
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 99 MB
Pipeline  tagger, parser, ner
Vectors 579447 keys, 20000 unique vectors (300 dimensions)
Sources Sequoia Corpus (UD), Wikipedia
License LGPL
Author Explosion AI

Accuracy

Type Score
ENTS_F  70.42
ENTS_P  70.99
ENTS_R  69.85
LAS  86.00
TAGS_ACC  94.96
TOKEN_ACC  100.00
UAS  88.65

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download fr_core_news_md

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/es#es_core_news_sm

File checksum: 9198c75af6a9415075fdfd67bc9382a0efde2d12cc2cfdc34c48e4f414e12605

Spanish multi-task CNN trained on the AnCora and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name es_core_news_sm
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 27 MB
Pipeline  tagger, parser, ner
Vectors 0 keys, 0 unique vectors (0 dimensions)
Sources AnCora, Wikipedia
License GPL
Author Explosion AI

Accuracy

Type Score
ENTS_F  89.43
ENTS_P  89.51
ENTS_R  89.35
LAS  87.17
TAGS_ACC  96.91
TOKEN_ACC  100.00
UAS  90.13

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download es_core_news_sm

explosion-bot released this Jul 10, 2018

Details: https://spacy.io/models/es#es_core_news_md

File checksum: 63e48fe8edd1ddecfe1febc1412ef32fe93ced4ad7d32b4c27d1c30ec89b9e35

Spanish multi-task CNN trained on the AnCora and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.

Feature Description
Name es_core_news_md
Version 2.1.0a0
spaCy >=2.1.0a0
Model size 87 MB
Pipeline  tagger, parser, ner
Vectors 533736 keys, 20000 unique vectors (50 dimensions)
Sources AnCora, Wikipedia
License GPL
Author Explosion AI

Accuracy

Type Score
ENTS_F  89.52
ENTS_P  89.58
ENTS_R  89.46
LAS  87.96
TAGS_ACC  97.18
TOKEN_ACC  100.00
UAS  90.73

Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

Installation

pip install spacy-nightly
spacy download es_core_news_md