
Fixing Transformer IO #281

Closed
wants to merge 4 commits

Conversation

@svlandeg (Member) commented Jul 19, 2021

Note: some tests are adjusted in this PR. This was done first, before implementing the code changes, as we were already aware that the initialize statements shouldn't be there; cf. explosion/spaCy#8319 and explosion/spaCy#8566.

Description

Before this PR, the HF transformer model was loaded through set_pytorch_transformer (stored in model.attrs["set_transformer"]), but only in the initialize call of TransformerModel. Unfortunately, this meant that saving/loading a transformer-based pipeline was effectively broken: you needed to call initialize on a previously trained pipeline, which isn't the normal spaCy API. It also broke the usual from_bytes / to_bytes API.
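Roughly, the pre-PR pattern looked like this (a simplified sketch with illustrative names, not the actual spacy-transformers internals):

```python
from thinc.api import Model
from transformers import AutoModel


def set_pytorch_transformer(model: Model, transformer) -> None:
    # Attach the HF model to the Thinc model; refuse to do it twice.
    if model.attrs["has_transformer"]:
        raise ValueError("Transformer layer already set")
    model.attrs["transformer"] = transformer
    model.attrs["has_transformer"] = True


def forward(model: Model, tokens, is_train: bool):
    # Simplified forward pass: just call the wrapped HF model.
    output = model.attrs["transformer"](**tokens)
    return output, lambda dY: dY


def init(model: Model, X=None, Y=None) -> Model:
    # The HF weights are only loaded here, at initialization time.
    set_pytorch_transformer(model, AutoModel.from_pretrained(model.attrs["name"]))
    return model


def TransformerModel(name: str) -> Model:
    # Note: at construction time the model has no transformer layer yet.
    return Model(
        "transformer",
        forward,
        init=init,
        attrs={
            "name": name,
            "has_transformer": False,
            "set_transformer": set_pytorch_transformer,
        },
    )
```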

Furthermore, the from_disk / to_disk functionality worked with a "listener" transformer, because the transformer pipeline component saved out the PyTorch files directly. However, this did not work for the "inline" transformer, where the TransformerModel is used directly rather than via the pipeline component.
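For illustration (a hypothetical snippet, assuming spacy-transformers is installed; not code from this PR):

```python
import spacy

# "Listener"-style usage: the Transformer pipeline component owns the model
# and writes out the PyTorch weights itself, so a disk round-trip works.
nlp = spacy.blank("en")
nlp.add_pipe("transformer")   # default config; downloads an HF model
nlp.initialize()
nlp.to_disk("./trf_pipeline")
nlp2 = spacy.load("./trf_pipeline")

# "Inline" usage: the TransformerModel sits directly inside another
# component's model, so there is no Transformer component to save the
# weights, and serialization has to go through Thinc's to_bytes/from_bytes,
# which was exactly the part that required an initialize() call first.
```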

I've looked at various solutions, but the proposal in this PR is the only one I got working for all use cases: basically, we need to load/define the transformer model in the constructor, as we do for any other spaCy component.
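In terms of the earlier sketch, the change looks roughly like this (again illustrative, reusing forward and set_pytorch_transformer from above):

```python
from thinc.api import Model
from transformers import AutoModel


def TransformerModel(name: str) -> Model:
    model = Model(
        "transformer",
        forward,
        attrs={
            "name": name,
            "has_transformer": False,
            "set_transformer": set_pytorch_transformer,
        },
    )
    # The HF model is loaded and attached at construction time, so the
    # network is fully defined before any to_bytes()/from_bytes() call
    # and no initialize() step is needed for serialization.
    set_pytorch_transformer(model, AutoModel.from_pretrained(name))
    return model
```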

Unfortunately, with this proposal, old code that still contains the previous initialize calls will crash, complaining that the transformer model can't be set twice. I think this is actually the correct behaviour in the end, but it might break some people's code. The fix is obvious/easy, though.
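Continuing the hypothetical sketch above, the failure mode would look something like this:

```python
from transformers import AutoModel

model = TransformerModel("roberta-base")   # HF model already attached here
# Old-style code that sets the transformer again at initialization time:
model.attrs["set_transformer"](model, AutoModel.from_pretrained("roberta-base"))
# -> ValueError: Transformer layer already set
# The fix is simply to drop the redundant initialize-time setup.
```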

But we'll have to discuss which version bump we want to do when releasing this.

Fixes explosion/spaCy#8319
Fixes explosion/spaCy#8566

@svlandeg added the bug (Something isn't working) and feat / serialize (Feature: Serialization, saving and loading) labels on Jul 19, 2021
@svlandeg requested a review from honnibal on July 19, 2021 17:00
@svlandeg (Member, Author)

Ugh, I only just realised that this will break loading the pretrained models too :'(
So if we want this solution, it'd require a major version bump and training new trf models.

@honnibal (Member)

This feels quite wrong. How come we don't have this problem with other pytorch-wrapped models in thinc? Can we do whatever we're doing there for the transformers? Maybe we need a new shim class just for transformers?

@svlandeg marked this pull request as draft on July 22, 2021 05:46
@svlandeg (Member, Author) commented Jul 22, 2021

The main difference from other models is that for the transformer models here, we're adding layers / defining the model only in the initialize call. I don't really see how that can work with the to_bytes/from_bytes API that we have normally, which assumes the internal layers are correctly defined before calling from_bytes. This is why the tests needed initialize calls in between (cf. test_transformer_model_tobytes).
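With the pre-PR, initialize-based model, the round-trip the tests had to use looked roughly like this (simplified; not the literal code of test_transformer_model_tobytes):

```python
model = TransformerModel("roberta-base")
model.initialize()      # only now are the HF layers actually created
data = model.to_bytes()

model2 = TransformerModel("roberta-base")
model2.initialize()     # without this, from_bytes() has no layers to restore into
model2.from_bytes(data)
```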

I guess it might be possible to work with a Transformers shim class, but won't the final solution look the same, i.e. making sure the network is defined on construction rather than at initialization?
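For reference, a dedicated shim could look very roughly like this (purely a hypothetical sketch, class name and details assumed; not what was eventually implemented):

```python
from io import BytesIO

import torch
from thinc.api import PyTorchShim
from transformers import AutoConfig, AutoModel


class HFTransformerShim(PyTorchShim):  # hypothetical class name
    def to_bytes(self) -> bytes:
        # Serialize both the HF config and the weights, so the network can
        # be rebuilt from bytes alone.
        buf = BytesIO()
        torch.save(
            {"config": self._model.config.to_dict(), "state": self._model.state_dict()},
            buf,
        )
        return buf.getvalue()

    def from_bytes(self, bytes_data: bytes):
        msg = torch.load(BytesIO(bytes_data))
        # Rebuild the architecture from the stored config first, then restore
        # the weights, without needing a pre-existing, initialized network.
        config = AutoConfig.for_model(**msg["config"])
        self._model = AutoModel.from_config(config)
        self._model.load_state_dict(msg["state"])
        return self
```

The point would be that the shim itself knows how to rebuild the HF network from the serialized config, so from_bytes no longer depends on a previously initialized model.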

@svlandeg (Member, Author) commented Aug 9, 2021

Closing in favour of #285

@svlandeg closed this on Aug 9, 2021