
Rewrite for spaCy v3 #173

Closed
wants to merge 279 commits into from

Conversation

honnibal
Member

@honnibal honnibal commented Apr 30, 2020

The current spacy-transformers goes to quite some effort to work around limitations in spaCy v2 and Thinc v7. It also takes on a lot of tasks that the Transformers library now handles itself. The limitations in Thinc (and the fact that the library was undocumented!) were particularly painful, because transformers really aren't very useful unless you can get in and fiddle with the model architectures.

With the forthcoming spaCy v3, Thinc v8, and Huggingface's constant awesome improvements, things are now much nicer, so we can make this library much, much smaller.

I also want to make a slightly different trade-off in the library. Previously we tried to do a lot and offer a lot in the extension attributes. This made it hard to keep up with all of the different transformer models as they're released. It also sometimes meant that the wrapper could get in the way of the underlying transformers models.

The new trade-off is to simply do less, at least in terms of the alignments and extension attributes. We now offer just one extension attribute, doc._.trf_data, which provides a spacy_transformers.types.TransformerData object, a simple dataclass that holds the tensor outputs for the doc, the tokens data, and alignment information.
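As a rough sketch of what that dataclass holds (the field names below are illustrative assumptions for this description, not the exact definition in spacy_transformers.types):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TransformerData:
    tokens: Dict[str, Any]   # the tokenizer output from transformers
    tensors: List[Any]       # model outputs, e.g. the last hidden states
    align: List[List[int]]   # spaCy token index -> wordpiece indices

# One doc, three wordpieces, 2d vectors; spaCy token 0 aligns to wordpiece 1.
data = TransformerData(
    tokens={"input_ids": [[101, 7592, 102]]},
    tensors=[[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]],
    align=[[1]],
)
# Collect the wordpiece vectors aligned to spaCy token 0.
token0_vectors = [data.tensors[0][i] for i in data.align[0]]
# token0_vectors == [[0.3, 0.4]]
```

The alignment list is what lets you map between the transformer's wordpiece tokenization and spaCy's own tokens without the library having to hard-code anything per model.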

If you want more extension attributes, it's easy to design and set them yourself, by providing a custom annotation_setter function. Your function will receive a batch of documents and a FullTransformerBatch object that holds the input and output objects passed from the transformers library -- so you know you'll be able to implement whatever you need.
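For instance, a custom annotation_setter might mean-pool each doc's token vectors into a single extension attribute. This is a hypothetical sketch (the trf_pooled extension name is made up, and SimpleNamespace objects stand in for Doc and FullTransformerBatch):

```python
from types import SimpleNamespace

def set_pooled_output(docs, trf_output):
    # Store a mean-pooled vector per doc under a custom extension;
    # trf_output.tensors stands in for the transformer's hidden states.
    for doc, tensor in zip(docs, trf_output.tensors):
        doc._.trf_pooled = [sum(col) / len(tensor) for col in zip(*tensor)]

# Minimal stand-ins: one doc with two token vectors of width 2.
doc = SimpleNamespace(_=SimpleNamespace())
batch = SimpleNamespace(tensors=[[[1.0, 2.0], [3.0, 4.0]]])
set_pooled_output([doc], batch)
# doc._.trf_pooled == [2.0, 3.0]
```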

The previous version also went to some effort to rebatch data by sentence, to allow prediction on long documents. I still believe in this idea, but hard-coding for it could easily get in the way. Instead, the transformers now let you provide a function to map a batch of documents into a batch of Span objects. You can even have spans that overlap, or which only cover subsets of the Doc objects. The doc._.trf_data object will tell you which spans the transformers data refers to, making it easy to use the output.
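A span-mapping function only has to turn a batch of Doc objects into a flat list of Span objects. As a sketch, here is a fixed-window version (the function name and windowing strategy are illustrative, and plain lists stand in for Doc objects, since list slices behave like spans for this purpose):

```python
def get_strided_spans(docs, window=4):
    # Map a batch of doc-like sequences to a flat list of span-like slices.
    # Spans may cover only part of a doc; overlapping windows would also
    # be legal under the scheme described above.
    spans = []
    for doc in docs:
        for start in range(0, len(doc), window):
            spans.append(doc[start:start + window])
    return spans

# With a 10-token "doc", a window of 4 yields spans of 4 + 4 + 2 tokens.
spans = get_strided_spans([list("abcdefghij")])
# len(spans) == 3
```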

The workflow for training models with spacy-transformers is also dramatically better, using the improvements from spaCy v3 and Thinc v8. The main workflow is to write a config file, using Thinc's new config system.

You can find two early example config files here:

You run the config files with the examples/train_from_config.py script (in future you'll actually use spacy train-from-config).
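For a flavour of what such a config looks like, here is a hypothetical fragment in Thinc's config format (the section layout and registered-function names are assumptions based on the description above, not copied from the example files):

```ini
[model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"

[model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```

The @-prefixed keys resolve to functions in Thinc's registry, so swapping the span-getter or the transformer architecture is a one-line config change rather than a code change.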

I'm not that satisfied with the names of everything yet, but all the pieces are in place and it works (although I still need to tune the hyper-parameters to get better models).

The Transformer pipeline component lets you run the transformer once to set the doc._.trf_data extension, while also letting downstream components use the transformer features and pass gradients back to the transformer, allowing easy multitask learning. I'm hoping we can have a pipeline where one transformer model is shared across the whole pipeline, including tagging, parsing, NER, morphology, coref and SRL.
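The multitask idea can be reduced to a toy sketch: the shared transformer produces one output, each downstream component computes its own gradient with respect to that output, and those gradients accumulate before the transformer's backward pass (everything below is illustrative, not the library's API):

```python
# Gradient buffer for the shared transformer's output.
shared_grad = [0, 0, 0]

def backprop_to_transformer(d_output):
    # Each downstream component adds its gradient into the shared buffer;
    # the transformer then does a single backward pass over the sum.
    for i, g in enumerate(d_output):
        shared_grad[i] += g

backprop_to_transformer([1, 2, 3])  # e.g. gradient from the tagger
backprop_to_transformer([3, 2, 1])  # e.g. gradient from the parser
# shared_grad == [4, 4, 4]
```

Because the forward pass runs once and the gradients sum, the transformer's cost is amortized across every task that listens to it.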

TODO

- [ ] Fix entry-points
- [ ] Finalize naming
- [ ] Implement serialization for the Huggingface tokenizer object
- [ ] Make sure the transformer models serialize and deserialize correctly
- [ ] Improve tests
- [ ] Remove previous examples
- [ ] Documentation
- [ ] Find good hyper-parameters for pipeline

@honnibal
Member Author

This branch has diverged too much from the version on master, since it targets spaCy v3. It doesn't really make sense to do a merge here. Instead I've labelled this "develop".

When we're ready to release, we'll rename the stuff on master to something like spacy-v2.x and then just switch master over to the state of develop.

@honnibal honnibal closed this Jun 28, 2020
@svlandeg svlandeg deleted the feature/spacy-v3 branch August 25, 2021 09:30
Labels: enhancement (New feature or request)
2 participants