Skip to content

Commit

Permalink
Merge pull request #425 from allenai/spacy_32_upgrade
Browse files Browse the repository at this point in the history
Spacy 32 upgrade
  • Loading branch information
dakinggg committed Mar 10, 2022
2 parents cc0ace9 + 8ff659f commit cc1a717
Show file tree
Hide file tree
Showing 17 changed files with 367 additions and 148 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
Expand Up @@ -18,7 +18,7 @@ WORKDIR /work
COPY requirements.in .

RUN pip install -r requirements.in
RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz
RUN python -m spacy download en_core_web_sm
RUN python -m spacy download en_core_web_md

Expand Down
18 changes: 9 additions & 9 deletions README.md
Expand Up @@ -19,7 +19,7 @@ pip install scispacy
to install a model (see our full selection of available models below), run a command like the following:

```bash
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz
```

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy.
Expand Down Expand Up @@ -76,14 +76,14 @@ pip install CMD-V(to paste the copied URL)

| Model | Description | Install URL
|:---------------|:------------------|:----------|
| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz)|
| en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz)|
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz)|
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz)|
| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_craft_md-0.4.0.tar.gz)|
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_jnlpba_md-0.4.0.tar.gz)|
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz)|
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz)|
| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz)|
| en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz)|
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz)|
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz)|
| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz)|
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz)|
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz)|
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz)|


## Additional Pipeline Components
Expand Down
23 changes: 13 additions & 10 deletions configs/base_ner.cfg
@@ -1,3 +1,6 @@
[vars]
include_static_vectors = null

[paths]
vectors = null
init_tok2vec = null
Expand Down Expand Up @@ -31,26 +34,26 @@ moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 2500, 2500, 2500]
include_static_vectors = true
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
rows = [5000, 2500, 2500, 2500, 100]
include_static_vectors = ${vars.include_static_vectors}

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
Expand Down Expand Up @@ -82,7 +85,7 @@ dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 7
Expand Down
16 changes: 8 additions & 8 deletions configs/base_ner_scibert.cfg
Expand Up @@ -5,7 +5,7 @@ parser_tagger_path = null
vocab_path = null

[system]
gpu_allocator = null
gpu_allocator = "pytorch"
seed = 0

[nlp]
Expand All @@ -31,7 +31,7 @@ moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
Expand All @@ -40,17 +40,17 @@ use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 2500, 2500, 2500]
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
rows = [5000, 2500, 2500, 2500, 100]
include_static_vectors = false

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
Expand Down Expand Up @@ -83,7 +83,7 @@ dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 7
Expand Down
19 changes: 11 additions & 8 deletions configs/base_parser_tagger.cfg
@@ -1,3 +1,6 @@
[vars]
include_static_vectors = null

[paths]
genia_train = "project_data/genia_train.spacy"
genia_dev = "project_data/genia_dev.spacy"
Expand Down Expand Up @@ -35,7 +38,7 @@ moves = null
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
Expand Down Expand Up @@ -64,17 +67,17 @@ upstream = "*"
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 2500, 2500, 2500]
include_static_vectors = true
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
rows = [5000, 2500, 2500, 2500, 100]
include_static_vectors = ${vars.include_static_vectors}

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
Expand Down Expand Up @@ -106,7 +109,7 @@ dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 20
Expand Down
23 changes: 8 additions & 15 deletions configs/base_parser_tagger_scibert.cfg
Expand Up @@ -7,7 +7,7 @@ init_tok2vec = null
vocab_path = null

[system]
gpu_allocator = "pytorch"
gpu_allocator = null
seed = 0

[nlp]
Expand Down Expand Up @@ -36,12 +36,12 @@ moves = null
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
use_upper = false
nO = null

[components.parser.model.tok2vec]
Expand Down Expand Up @@ -69,9 +69,10 @@ max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
@architectures = "spacy-transformers.TransformerModel.v3"
name = "allenai/scibert_scivocab_uncased"
tokenizer_config = {"use_fast": true}
mixed_precision = true

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
Expand Down Expand Up @@ -105,7 +106,7 @@ dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 8
Expand All @@ -120,8 +121,8 @@ get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 16
stop = 64
start = 4
stop = 12
compound = 1.001
t = 0.0

Expand Down Expand Up @@ -157,14 +158,6 @@ ents_r = 0.0
[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null

[initialize.components]

[initialize.tokenizer]

[initialize.before_init]
@callbacks = "replace_tokenizer"
23 changes: 13 additions & 10 deletions configs/base_specialized_ner.cfg
@@ -1,3 +1,6 @@
[vars]
include_static_vectors = null

[paths]
vectors = null
init_tok2vec = null
Expand Down Expand Up @@ -33,26 +36,26 @@ moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 2500, 2500, 2500]
include_static_vectors = true
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
rows = [5000, 2500, 2500, 2500, 100]
include_static_vectors = ${vars.include_static_vectors}

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
Expand Down Expand Up @@ -82,7 +85,7 @@ dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 7
Expand Down
33 changes: 16 additions & 17 deletions docs/index.md
Expand Up @@ -17,15 +17,14 @@ pip install <Model URL>

| Model | Description | Install URL
|:---------------|:------------------|:----------|
| en_core_sci_sm | A full spaCy pipeline for biomedical data. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz)|
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz)|
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz)|
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz)|
| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_craft_md-0.4.0.tar.gz)|
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_jnlpba_md-0.4.0.tar.gz)|
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz)|
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz)|

| en_core_sci_sm | A full spaCy pipeline for biomedical data. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz)|
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz)|
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz)|
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz)|
| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz)|
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz)|
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz)|
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz)|



Expand All @@ -35,18 +34,18 @@ Our models achieve performance within 3% of published state of the art dependenc

| model | UAS | LAS | POS | Mentions (F1) | Web UAS |
|:---------------|:----|:------|:------|:---|:---|
| en_core_sci_sm | 89.54| 87.62 | 98.32 | 68.15 | 87.62 |
| en_core_sci_md | 89.61| 87.77 | 98.56 | 69.64 | 88.05 |
| en_core_sci_lg | 89.63| 87.81 | 98.56 | 69.61 | 88.08 |
| en_core_sci_scibert | 92.03| 90.25 | 98.91 | 67.91 | 92.21 |
| en_core_sci_sm | 89.27| 87.33 | 98.29 | 68.05 | 87.61 |
| en_core_sci_md | 89.86| 87.92 | 98.43 | 69.32 | 88.05 |
| en_core_sci_lg | 89.54| 87.66 | 98.29 | 69.52 | 87.68 |
| en_core_sci_scibert | 92.28| 90.83 | 98.93 | 67.84 | 92.63 |


| model | F1 | Entity Types|
|:---------------|:-----|:--------|
| en_ner_craft_md | 76.11|GGP, SO, TAXON, CHEBI, GO, CL|
| en_ner_jnlpba_md | 71.62| DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| en_ner_bc5cdr_md | 84.49| DISEASE, CHEMICAL|
| en_ner_bionlp13cg_md | 77.75| AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
| en_ner_craft_md | 78.35|GGP, SO, TAXON, CHEBI, GO, CL|
| en_ner_jnlpba_md | 70.89| DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| en_ner_bc5cdr_md | 84.70| DISEASE, CHEMICAL|
| en_ner_bionlp13cg_md | 76.79| AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |


### Example Usage
Expand Down

0 comments on commit cc1a717

Please sign in to comment.