
IO for transformer component #178

Merged
merged 8 commits into explosion:feature/spacy-v3 from feature/v3-io on May 18, 2020

Conversation

svlandeg (Member)

IO

I've been going in circles a bit with this, trying to fit it into the IO mechanisms we decided on for the config refactor for spaCy 3 ...

This PR: Transformer(Pipe) implements to_disk and from_disk, and stores the internal tokenizer & transformer objects using the standard Hugging Face transformers IO mechanisms. In the nlp/transformer output directory, this results in a model folder with the files:

  • config.json
  • pytorch_model.bin
  • special_tokens_map.json
  • tokenizer_config.json
  • vocab.txt

This folder can be read back using the spacy.TransformerFromFile.v1 architecture for the model and then calling from_disk on the pipeline component (which happens automatically when the nlp object is loaded from a config).
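For reference, a minimal sketch of that round-trip, assuming a pipe that holds the Hugging Face tokenizer and transformer directly; the helper names here are illustrative, not the PR's actual API:

```python
from pathlib import Path

from transformers import AutoModel, AutoTokenizer


def transformer_to_disk(model, tokenizer, path: Path) -> None:
    # Write the HF artifacts into a "model" subfolder: config.json,
    # pytorch_model.bin, special_tokens_map.json, tokenizer_config.json, vocab.txt
    model_dir = path / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(model_dir))
    tokenizer.save_pretrained(str(model_dir))


def transformer_from_disk(path: Path):
    # Re-load both objects from the folder written above
    model_dir = path / "model"
    return AutoModel.from_pretrained(str(model_dir)), AutoTokenizer.from_pretrained(str(model_dir))
```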

If users download a model by using the spacy.TransformerByName.v2 architecture, then on nlp.to_disk we need a little hack that rewrites that architecture to the from-file one. This is done by directly modifying nlp.config when the component is created with from_nlp. It feels hacky, but I'm not sure how else to prevent multiple downloads.
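A rough sketch of what that rewrite could look like; the helper name and exact config keys are assumptions, only the two architecture names come from this PR:

```python
def swap_to_from_file_architecture(nlp, name: str = "transformer") -> None:
    # Rewrite the model architecture in nlp.config so that a serialized
    # pipeline loads its weights from disk instead of re-downloading them.
    model_cfg = nlp.config["components"][name]["model"]
    if model_cfg.get("@architectures") == "spacy.TransformerByName.v2":
        model_cfg["@architectures"] = "spacy.TransformerFromFile.v1"
        model_cfg.pop("name", None)  # assumed by-name parameter; drop if present
```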

Other fixes

  • updated the config files to be in line with the latest version of the v3 branch
  • moved install_extensions to the transformer pipe's init, where I think it makes more sense, and added force=True to prevent warnings/errors when calling it multiple times (I don't think that matters?); see the sketch below
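For illustration, the idempotent-registration pattern looks roughly like this; the extension name trf_data is an assumption, not necessarily what the pipe registers:

```python
from spacy.tokens import Doc


def install_extensions() -> None:
    # force=True makes re-registration a no-op, so constructing the pipe
    # several times in one process doesn't raise "extension already exists".
    Doc.set_extension("trf_data", default=None, force=True)
```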

@svlandeg mentioned this pull request May 14, 2020
@honnibal (Member)

Nice work! I agree that the problem is fundamentally kind of awkward. I wasn't really sure what the best approach would be. It's tough to accommodate the two competing requirements: things should work with the default Hugging Face model installations, while also letting us ship trained models in a self-contained way.

I'm just thinking a bit more about how we handle the architectures for the trained pipeline. We need to make sure the trained model doesn't call into TransformerByName, because the weights will be updated during training -- we need it to load the weights we're providing. Does it do that? Sorry, I should really just look at the code more carefully.

@svlandeg (Member, Author) commented May 14, 2020

So yes, the idea in this PR is really that TransformerByName is called only once upon creation (if you have it in your config), and then this architecture is "swapped" in the config for a "TransformerFromFile" one. That should (in theory) make sure we're always accessing the trained weights from file.

I tried to test this with a trained model, but there's an indexing bug somewhere in the backprop code you're refactoring. I had a look at it, but since the code base is still changing I thought I'd hold off for a bit ;-) This PR does update the config files though, so we can test faster.

@svlandeg (Member, Author) commented May 15, 2020

[EDIT]: removed previous comment that was here. After going full circle once more, I do think this is the best approach so far. We need some kind of annotation for the Model in the config file - spaCy expects it. We need a dummy Model architecture in place that gets properly initialized when calling from_disk on the component - just like the other spaCy components get an empty Model upon init, which is then initialized with the correct weights from the directory.
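To illustrate the pattern (not the PR's actual code), a dummy Thinc model could look like this; the name and the identity forward pass are placeholders:

```python
from thinc.api import Model


def transformer_from_file_placeholder() -> Model:
    # Identity forward pass as a stand-in: it lets the config be parsed and
    # the component constructed, while the real transformer weights are
    # attached later, when from_disk() runs on the pipeline component.
    def forward(model, X, is_train):
        return X, lambda dY: dY

    return Model("transformer-from-file", forward)
```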

But always happy to discuss alternatives ;-)

@svlandeg marked this pull request as draft May 18, 2020 11:27
@svlandeg marked this pull request as ready for review May 18, 2020 14:06
@svlandeg added the enhancement (New feature or request) and feat / serialize (Serialization, saving and loading) labels May 18, 2020
@honnibal (Member)

Okay, great, let's merge this :)

@honnibal merged commit 9e15dc4 into explosion:feature/spacy-v3 May 18, 2020
@svlandeg deleted the feature/v3-io branch May 18, 2020 20:44