
IO for transformer component #178

Merged
merged 8 commits into explosion:feature/spacy-v3 from feature/v3-io on May 18, 2020

Conversation

svlandeg (Member)

IO

I've been going in circles a bit with this, trying to fit it into the IO mechanisms we decided on for the config refactor for spaCy 3 ...

This PR: Transformer(Pipe) implements to_disk and from_disk, and stores the internal tokenizer & transformer objects using the standard Hugging Face transformers IO mechanisms. In the nlp/transformer output directory, this results in a model folder with the files:

  • config.json
  • pytorch_model.bin
  • special_tokens_map.json
  • tokenizer_config.json
  • vocab.txt

This folder can be read back using the spacy.TransformerFromFile.v1 architecture for the model and then calling from_disk on the pipeline component (which happens automatically when the nlp object is loaded from a config).
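For reference, a minimal sketch of that round-trip, assuming a pipe that holds the Hugging Face tokenizer and transformer directly; the helper names here are illustrative, not the PR's actual API:

```python
from pathlib import Path

from transformers import AutoModel, AutoTokenizer


def transformer_to_disk(model, tokenizer, path: Path) -> None:
    # Write the HF artifacts into a "model" subfolder: config.json,
    # pytorch_model.bin, special_tokens_map.json, tokenizer_config.json, vocab.txt
    model_dir = path / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(model_dir))
    tokenizer.save_pretrained(str(model_dir))


def transformer_from_disk(path: Path):
    # Re-load both objects from the folder written above
    model_dir = path / "model"
    return AutoModel.from_pretrained(str(model_dir)), AutoTokenizer.from_pretrained(str(model_dir))
```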

If users download a model by using the spacy.TransformerByName.v2 architecture, then on nlp.to_disk we need a little hack that rewrites that architecture to the from-file one. This is done by directly modifying nlp.config when the component is created with from_nlp. It feels hacky, but I'm not sure how else to prevent multiple downloads.
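A rough sketch of what that rewrite could look like; the helper name and exact config keys are assumptions, only the two architecture names come from this PR:

```python
def swap_to_from_file_architecture(nlp, name: str = "transformer") -> None:
    # Rewrite the model architecture in nlp.config so that a serialized
    # pipeline loads its weights from disk instead of re-downloading them.
    model_cfg = nlp.config["components"][name]["model"]
    if model_cfg.get("@architectures") == "spacy.TransformerByName.v2":
        model_cfg["@architectures"] = "spacy.TransformerFromFile.v1"
        model_cfg.pop("name", None)  # assumed by-name parameter; drop if present
```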

Other fixes

  • updated the config files to be in line with the latest version of the v3 branch
  • moved install_extensions to the transformer pipe's init, where I think it makes more sense, and added force=True to prevent warnings/errors when calling it multiple times (I don't think that matters?); see the sketch below
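For illustration, the idempotent-registration pattern looks roughly like this; the extension name trf_data is an assumption, not necessarily what the pipe registers:

```python
from spacy.tokens import Doc


def install_extensions() -> None:
    # force=True makes re-registration a no-op, so constructing the pipe
    # several times in one process doesn't raise "extension already exists".
    Doc.set_extension("trf_data", default=None, force=True)
```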

@svlandeg mentioned this pull request May 14, 2020
@honnibal (Member)

Nice work! I agree that the problem is fundamentally kind of awkward. I wasn't really sure what the best approach would be. It's tough to accommodate the two competing requirements: things should work with the default Hugging Face model installations, while also letting us ship trained models in a self-contained way.

I'm just thinking a bit more about how we handle the architectures for the trained pipeline. We need to make sure the trained model doesn't call into TransformerByName, because the weights will be updated during training -- we need it to load the weights we're providing. Does it do that? Sorry, I should really just look at the code more carefully.

@svlandeg (Member, Author) commented May 14, 2020

So yes, the idea in this PR is really that TransformerByName is called only once upon creation (if you have it in your config), and then this architecture is "swapped" in the config for a "TransformerFromFile" one. That should (in theory) make sure we're always accessing the trained weights from file.

I tried to test this with a trained model, but there's an indexing bug somewhere in the backprop code you're refactoring. I had a look at it, but since the code base is still changing I thought I'd hold off for a bit ;-) This PR does update the config files though, so we can test faster.

@svlandeg (Member, Author) commented May 15, 2020

[EDIT]: removed previous comment that was here. After going full circle once more, I do think this is the best approach so far. We need some kind of annotation for the Model in the config file - spaCy expects it. We need a dummy Model architecture in place that gets properly initialized when calling from_disk on the component - just like the other spaCy components get an empty Model upon init, which is then initialized with the correct weights from the directory.
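To illustrate the pattern (not the PR's actual code), a dummy Thinc model could look like this; the name and the identity forward pass are placeholders:

```python
from thinc.api import Model


def transformer_from_file_placeholder() -> Model:
    # Identity forward pass as a stand-in: it lets the config be parsed and
    # the component constructed, while the real transformer weights are
    # attached later, when from_disk() runs on the pipeline component.
    def forward(model, X, is_train):
        return X, lambda dY: dY

    return Model("transformer-from-file", forward)
```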

But always happy to discuss alternatives ;-)

@svlandeg marked this pull request as draft May 18, 2020 11:27
@svlandeg marked this pull request as ready for review May 18, 2020 14:06
@svlandeg added the enhancement (New feature or request) and feat / serialize (Serialization, saving and loading) labels May 18, 2020
@honnibal (Member)

Okay, great, let's merge this :)

@honnibal merged commit 9e15dc4 into explosion:feature/spacy-v3 May 18, 2020
@svlandeg deleted the feature/v3-io branch May 18, 2020 20:44