
Rewrite for spaCy v3 #173

Closed
wants to merge 279 commits into from

Conversation

honnibal
Member

@honnibal honnibal commented Apr 30, 2020

The current spacy-transformers goes to quite some effort to work around limitations in spaCy v2 and Thinc v7. It also takes on a lot of tasks that the Transformers library now handles itself. The limitations in Thinc (and the fact that the library was undocumented!) were particularly painful, because transformers really aren't very useful unless you can get in and fiddle with the model architectures.

With the forthcoming spaCy v3, Thinc v8, and Huggingface's constant awesome improvements, things are now much nicer, so we can make this library much, much smaller.

I also want to make a slightly different trade-off in the library. Previously we tried to do a lot and offer a lot in the extension attributes. This made it hard to keep up with all of the different transformer models as they're released. It also sometimes meant that the wrapper could get in the way of the underlying transformers models.

The new trade-off is to simply do less, at least in terms of the alignments and extension attributes. We now offer just one extension attribute, doc._.trf_data, which provides a spacy_transformers.types.TransformerData object, a simple dataclass that holds the tensor outputs for the doc, the tokens data, and alignment information.
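As a rough sketch of what that dataclass holds (the field names below are illustrative assumptions for this description, not the exact definition in spacy_transformers.types):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TransformerData:
    tokens: Dict[str, Any]   # the tokenizer output from transformers
    tensors: List[Any]       # model outputs, e.g. the last hidden states
    align: List[List[int]]   # spaCy token index -> wordpiece indices

# One doc, three wordpieces, 2d vectors; spaCy token 0 aligns to wordpiece 1.
data = TransformerData(
    tokens={"input_ids": [[101, 7592, 102]]},
    tensors=[[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]],
    align=[[1]],
)
# Collect the wordpiece vectors aligned to spaCy token 0.
token0_vectors = [data.tensors[0][i] for i in data.align[0]]
# token0_vectors == [[0.3, 0.4]]
```

The alignment list is what lets you map between the transformer's wordpiece tokenization and spaCy's own tokens without the library having to hard-code anything per model.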

If you want more extension attributes, it's easy to design and set them yourself, by providing a custom annotation_setter function. Your function will receive a batch of documents and a FullTransformerBatch object that holds the input and output objects passed from the transformers library -- so you know you'll be able to implement whatever you need.
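For instance, a custom annotation_setter might mean-pool each doc's token vectors into a single extension attribute. This is a hypothetical sketch (the trf_pooled extension name is made up, and SimpleNamespace objects stand in for Doc and FullTransformerBatch):

```python
from types import SimpleNamespace

def set_pooled_output(docs, trf_output):
    # Store a mean-pooled vector per doc under a custom extension;
    # trf_output.tensors stands in for the transformer's hidden states.
    for doc, tensor in zip(docs, trf_output.tensors):
        doc._.trf_pooled = [sum(col) / len(tensor) for col in zip(*tensor)]

# Minimal stand-ins: one doc with two token vectors of width 2.
doc = SimpleNamespace(_=SimpleNamespace())
batch = SimpleNamespace(tensors=[[[1.0, 2.0], [3.0, 4.0]]])
set_pooled_output([doc], batch)
# doc._.trf_pooled == [2.0, 3.0]
```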

The previous version also went to some effort to rebatch data by sentence, to allow prediction on long documents. I still believe in this idea, but hard-coding for it could easily get in the way. Instead, the transformers now let you provide a function to map a batch of documents into a batch of Span objects. You can even have spans that overlap, or which only cover subsets of the Doc objects. The doc._.trf_data object will tell you which spans the transformers data refers to, making it easy to use the output.
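A span-mapping function only has to turn a batch of Doc objects into a flat list of Span objects. As a sketch, here is a fixed-window version (the function name and windowing strategy are illustrative, and plain lists stand in for Doc objects, since list slices behave like spans for this purpose):

```python
def get_strided_spans(docs, window=4):
    # Map a batch of doc-like sequences to a flat list of span-like slices.
    # Spans may cover only part of a doc; overlapping windows would also
    # be legal under the scheme described above.
    spans = []
    for doc in docs:
        for start in range(0, len(doc), window):
            spans.append(doc[start:start + window])
    return spans

# With a 10-token "doc", a window of 4 yields spans of 4 + 4 + 2 tokens.
spans = get_strided_spans([list("abcdefghij")])
# len(spans) == 3
```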

The workflow for training models with spacy-transformers is also dramatically better, using the improvements from spaCy v3 and Thinc v8. The main workflow is to write a config file, using Thinc's new config system.

You can find two early example config files here:

You run the config files with the examples/train_from_config.py script (in future you'll actually use spacy train-from-config).
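For a flavour of what such a config looks like, here is a hypothetical fragment in Thinc's config format (the section layout and registered-function names are assumptions based on the description above, not copied from the example files):

```ini
[model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"

[model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```

The @-prefixed keys resolve to functions in Thinc's registry, so swapping the span-getter or the transformer architecture is a one-line config change rather than a code change.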

I'm not that satisfied with the names of everything yet, but all the pieces are in place and it works (although I still need to tune the hyper-parameters to get better models).

The Transformer pipeline component lets you run the transformer once to set the doc._.trf_data extension, while also letting downstream components use the transformer features and pass gradients back to the transformer, allowing easy multitask learning. I'm hoping we can have a pipeline where one transformer model is shared across the whole pipeline, including tagging, parsing, NER, morphology, coref and SRL.
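The multitask idea can be reduced to a toy sketch: the shared transformer produces one output, each downstream component computes its own gradient with respect to that output, and those gradients accumulate before the transformer's backward pass (everything below is illustrative, not the library's API):

```python
# Gradient buffer for the shared transformer's output.
shared_grad = [0, 0, 0]

def backprop_to_transformer(d_output):
    # Each downstream component adds its gradient into the shared buffer;
    # the transformer then does a single backward pass over the sum.
    for i, g in enumerate(d_output):
        shared_grad[i] += g

backprop_to_transformer([1, 2, 3])  # e.g. gradient from the tagger
backprop_to_transformer([3, 2, 1])  # e.g. gradient from the parser
# shared_grad == [4, 4, 4]
```

Because the forward pass runs once and the gradients sum, the transformer's cost is amortized across every task that listens to it.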

TODO

- [ ] Fix entry-points
- [ ] Finalize naming
- [ ] Implement serialization for the Huggingface tokenizer object
- [ ] Make sure the transformer models serialize and deserialize correctly
- [ ] Improve tests
- [ ] Remove previous examples
- [ ] Documentation
- [ ] Find good hyper-parameters for pipeline

@honnibal
Member Author

This branch has diverged too much from the version on master, since it targets spaCy v3. It doesn't really make sense to do a merge here. Instead I've labelled this "develop".

When we're ready to release, we'll rename the stuff on master to something like spacy-v2.x and then just switch master over to the state of develop.

@honnibal honnibal closed this Jun 28, 2020
@svlandeg svlandeg deleted the feature/spacy-v3 branch August 25, 2021 09:30
Labels: enhancement (New feature or request)
2 participants