
Fixing Transformer IO #281

Closed
wants to merge 4 commits

Conversation

@svlandeg (Member) commented Jul 19, 2021

Note: some tests are adjusted in this PR. This was done first, before implementing the code changes, as we were already aware that the initialize statements shouldn't be there; cf. explosion/spaCy#8319 and explosion/spaCy#8566.

Description

Before this PR, the HF transformer model was loaded through set_pytorch_transformer (stored in model.attrs["set_transformer"]), but only in the initialize call of TransformerModel. Unfortunately, this meant that saving/loading a transformer-based pipeline was effectively broken: you needed to call initialize on a previously trained pipeline, which isn't the normal spaCy API. It also broke the usual from_bytes / to_bytes API.
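Roughly, the pre-PR pattern looked like this (a simplified sketch with illustrative names, not the actual spacy-transformers internals):

```python
from thinc.api import Model
from transformers import AutoModel


def set_pytorch_transformer(model: Model, transformer) -> None:
    # Attach the HF model to the Thinc model; refuse to do it twice.
    if model.attrs["has_transformer"]:
        raise ValueError("Transformer layer already set")
    model.attrs["transformer"] = transformer
    model.attrs["has_transformer"] = True


def forward(model: Model, tokens, is_train: bool):
    # Simplified forward pass: just call the wrapped HF model.
    output = model.attrs["transformer"](**tokens)
    return output, lambda dY: dY


def init(model: Model, X=None, Y=None) -> Model:
    # The HF weights are only loaded here, at initialization time.
    set_pytorch_transformer(model, AutoModel.from_pretrained(model.attrs["name"]))
    return model


def TransformerModel(name: str) -> Model:
    # Note: at construction time the model has no transformer layer yet.
    return Model(
        "transformer",
        forward,
        init=init,
        attrs={
            "name": name,
            "has_transformer": False,
            "set_transformer": set_pytorch_transformer,
        },
    )
```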

Furthermore, the from_disk / to_disk functionality worked with a "listener" transformer, because the transformer pipeline component saved out the PyTorch files directly. However, this did not work for the "inline" transformer, where the TransformerModel is used directly rather than via the pipeline component.
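For illustration (a hypothetical snippet, assuming spacy-transformers is installed; not code from this PR):

```python
import spacy

# "Listener"-style usage: the Transformer pipeline component owns the model
# and writes out the PyTorch weights itself, so a disk round-trip works.
nlp = spacy.blank("en")
nlp.add_pipe("transformer")   # default config; downloads an HF model
nlp.initialize()
nlp.to_disk("./trf_pipeline")
nlp2 = spacy.load("./trf_pipeline")

# "Inline" usage: the TransformerModel sits directly inside another
# component's model, so there is no Transformer component to save the
# weights, and serialization has to go through Thinc's to_bytes/from_bytes,
# which was exactly the part that required an initialize() call first.
```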

I've looked at various solutions, but the proposal in this PR is the only one I got working for all use cases: basically, we need to load/define the transformer model in the constructor, as we do for any other spaCy component.
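In terms of the earlier sketch, the change looks roughly like this (again illustrative, reusing forward and set_pytorch_transformer from above):

```python
from thinc.api import Model
from transformers import AutoModel


def TransformerModel(name: str) -> Model:
    model = Model(
        "transformer",
        forward,
        attrs={
            "name": name,
            "has_transformer": False,
            "set_transformer": set_pytorch_transformer,
        },
    )
    # The HF model is loaded and attached at construction time, so the
    # network is fully defined before any to_bytes()/from_bytes() call
    # and no initialize() step is needed for serialization.
    set_pytorch_transformer(model, AutoModel.from_pretrained(name))
    return model
```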

Unfortunately, with this proposal, old code that still contains the previous initialize calls will crash, complaining that the transformer model can't be set twice. I think this is actually the correct behaviour in the end, but it might break some people's code. The fix is obvious/easy, though.
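Continuing the hypothetical sketch above, the failure mode would look something like this:

```python
from transformers import AutoModel

model = TransformerModel("roberta-base")   # HF model already attached here
# Old-style code that sets the transformer again at initialization time:
model.attrs["set_transformer"](model, AutoModel.from_pretrained("roberta-base"))
# -> ValueError: Transformer layer already set
# The fix is simply to drop the redundant initialize-time setup.
```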

But we'll have to discuss which version bump we want to do when releasing this.

Fixes explosion/spaCy#8319
Fixes explosion/spaCy#8566

@svlandeg added the bug (Something isn't working) and feat / serialize (Feature: Serialization, saving and loading) labels on Jul 19, 2021
@svlandeg requested a review from honnibal on July 19, 2021 17:00
@svlandeg (Member, Author)

Ugh, I only just realised that this will break loading the pretrained models too :'(
So if we want this solution, it'd require a major version bump and training new trf models.

@honnibal (Member)

This feels quite wrong. How come we don't have this problem with other pytorch-wrapped models in thinc? Can we do whatever we're doing there for the transformers? Maybe we need a new shim class just for transformers?

@svlandeg marked this pull request as draft on July 22, 2021 05:46
@svlandeg (Member, Author) commented Jul 22, 2021

The main difference from other models is that for the transformer models here, we're adding layers / defining the model only in the initialize call. I don't really see how that can work with the to_bytes/from_bytes API that we have normally, which assumes the internal layers are correctly defined before calling from_bytes. This is why the tests needed initialize calls in between (cf. test_transformer_model_tobytes).
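With the pre-PR, initialize-based model, the round-trip the tests had to use looked roughly like this (simplified; not the literal code of test_transformer_model_tobytes):

```python
model = TransformerModel("roberta-base")
model.initialize()      # only now are the HF layers actually created
data = model.to_bytes()

model2 = TransformerModel("roberta-base")
model2.initialize()     # without this, from_bytes() has no layers to restore into
model2.from_bytes(data)
```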

I guess it might be possible to work with a Transformers shim class, but won't the final solution look the same, i.e. making sure the network is defined on construction rather than at initialization?
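For reference, a dedicated shim could look very roughly like this (purely a hypothetical sketch, class name and details assumed; not what was eventually implemented):

```python
from io import BytesIO

import torch
from thinc.api import PyTorchShim
from transformers import AutoConfig, AutoModel


class HFTransformerShim(PyTorchShim):  # hypothetical class name
    def to_bytes(self) -> bytes:
        # Serialize both the HF config and the weights, so the network can
        # be rebuilt from bytes alone.
        buf = BytesIO()
        torch.save(
            {"config": self._model.config.to_dict(), "state": self._model.state_dict()},
            buf,
        )
        return buf.getvalue()

    def from_bytes(self, bytes_data: bytes):
        msg = torch.load(BytesIO(bytes_data))
        # Rebuild the architecture from the stored config first, then restore
        # the weights, without needing a pre-existing, initialized network.
        config = AutoConfig.for_model(**msg["config"])
        self._model = AutoModel.from_config(config)
        self._model.load_state_dict(msg["state"])
        return self
```

The point would be that the shim itself knows how to rebuild the HF network from the serialized config, so from_bytes no longer depends on a previously initialized model.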

@svlandeg (Member, Author) commented Aug 9, 2021

Closing in favour of #285

@svlandeg closed this on Aug 9, 2021