IO for transformer component #178
Conversation
Nice work! I agree that the problem is fundamentally kind of awkward. I wasn't really sure what the best approach would be. It's tough to accommodate the two requirements of wanting things to work with the models in the Huggingface-default installations, while also letting us ship trained models in a self-contained way. I'm just thinking a bit more about how we handle the architectures for the trained pipeline. We need to make sure the trained model doesn't call into …
So yes, the idea in this PR is really that … I tried to test this with a trained model, but there's an indexing bug somewhere in the backprop code you're refactoring. I tried looking at it, but as the code base is changing I thought I'd wait for a second ;-) This PR does update the config files though, so we can test faster.
[EDIT]: removed previous comment that was here. After going full-circle once more, I do think this is the best approach so far. We need some kind of annotation for the … But always happy to discuss alternatives ;-)
Okay, great, let's merge this :)
IO
Been going in circles a bit with this, trying to puzzle it into the IO mechanisms we decided on for the config refactor for spaCy 3 ...
This PR:
- `Transformer(Pipe)` knows how to do `to_disk` and `from_disk`, and stores the internal tokenizer & transformer object using huggingface `transformers`' standard IO mechanisms. In the `nlp/transformer` output directory, this results in a folder `model` with the serialized files (a rough sketch of the idea follows after this list).
- This folder can be read using the `spacy.TransformerFromFile.v1` architecture for the model, and then calling `from_disk` on the pipeline component (which happens automatically when reading the `nlp` object from a config).
- If users want to download a model by using the architecture `spacy.TransformerByName.v2`, then when calling `nlp.to_disk` we need to do a little hack, rewriting that architecture to the one from file. This is done by directly modifying `nlp.config` when the component is created with `from_nlp` (see the second sketch below). This feels hacky, but I'm not sure how else to prevent multiple downloads.
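To make the first point concrete, here is a minimal sketch of how the component could delegate its IO to huggingface's own `save_pretrained`/`from_pretrained`. This is not the actual implementation: the class name, the attribute names (`hf_model`, `hf_tokenizer`) and the `model` subfolder layout are assumptions for illustration only.

```python
# Minimal sketch, not the real Transformer pipe: attribute names and the
# "model" subfolder are assumptions for illustration.
from pathlib import Path

from transformers import AutoModel, AutoTokenizer


class TransformerPipeSketch:
    """Delegates component IO to huggingface's standard serialization."""

    def __init__(self, hf_model=None, hf_tokenizer=None):
        self.hf_model = hf_model
        self.hf_tokenizer = hf_tokenizer

    def to_disk(self, path, **kwargs):
        # save_pretrained() writes config.json, the weights and the vocab
        # files into the given directory (here: <component dir>/model).
        model_dir = Path(path) / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        self.hf_model.save_pretrained(str(model_dir))
        self.hf_tokenizer.save_pretrained(str(model_dir))

    def from_disk(self, path, **kwargs):
        # from_pretrained() accepts a local directory as well as a model
        # name, so reading back the serialized copy needs no download.
        model_dir = Path(path) / "model"
        self.hf_model = AutoModel.from_pretrained(str(model_dir))
        self.hf_tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
        return self
```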
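And a rough sketch of the architecture-rewriting hack from the last point. The config layout follows spaCy 3's `components.<name>.model` structure; the `path` parameter of `spacy.TransformerFromFile.v1` is a hypothetical name here, not the real signature.

```python
# Rough sketch of the from_nlp hack: once the model has been downloaded,
# point the config at the serialized copy so reloading the pipeline does
# not trigger another download. The "path" parameter name is hypothetical.
def rewrite_architecture(nlp, component_name="transformer"):
    model_cfg = nlp.config["components"][component_name]["model"]
    if model_cfg.get("@architectures") == "spacy.TransformerByName.v2":
        model_cfg["@architectures"] = "spacy.TransformerFromFile.v1"
        model_cfg["path"] = f"{component_name}/model"
```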
Other fixes

- Moved `install_extensions` to the init of the transformer pipe, where I think it makes more sense. Added `force=True` to prevent warnings/errors when calling it multiple times (I don't think that matters?).
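For reference, `force=True` here refers to the `force` flag on spaCy's extension registration, which overwrites an already-registered extension instead of raising. A small sketch of what `install_extensions` amounts to; the extension name is just an example, not necessarily the one the package actually registers.

```python
# Sketch of install_extensions(); the extension name is an example only.
from spacy.tokens import Doc


def install_extensions():
    # force=True overwrites an existing extension instead of raising, so
    # calling this from the pipe's __init__ multiple times is safe.
    Doc.set_extension("trf_data", default=None, force=True)
```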