How to load a pretrained model from huggingface and use it in fairseq? #2666
me too, hope for answers |
It should be straightforward to wrap huggingface models in the corresponding fairseq abstractions. We've done this for the gpt2 language model implementation in huggingface: https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py It'd be great to add more wrappers for other model types (e.g., FairseqEncoderModel for BERT-like models) and also to generalize it to load arbitrary pretrained models from huggingface (e.g., using AutoModel). PRs are welcome! 😄 |
@myleott With the suggested approach, can we use a pretrained Hugging Face checkpoint directly? I feel like we would specifically need to change the data-preprocessing steps.
|
Fairseq doesn’t really do any preprocessing. If you want to apply tokenization or BPE, that should happen outside of fairseq, then you can feed the resulting text into fairseq-preprocess/train. Steps might be:
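The flow above (tokenize/BPE outside fairseq, then feed plain text to fairseq-preprocess/train) can be sketched as follows. The trivial whitespace tokenizer here is a stand-in assumption; in practice you would call a real BPE or Hugging Face tokenizer at that point.

```python
# Sketch: tokenize raw text OUTSIDE fairseq, then write one line of
# space-separated tokens per sentence -- the plain-text format that
# fairseq-preprocess consumes. The toy tokenizer is a stand-in for a
# real BPE / Hugging Face tokenizer.
import os
import tempfile


def toy_tokenize(line):
    # Stand-in tokenizer: plain whitespace split. Replace with e.g.
    # BertTokenizer.tokenize(...) in a real pipeline.
    return line.strip().split()


raw_sentences = ["Hello world !", "fairseq reads  plain text"]

out_path = os.path.join(tempfile.mkdtemp(), "train.tok.en")
with open(out_path, "w", encoding="utf-8") as f:
    for line in raw_sentences:
        f.write(" ".join(toy_tokenize(line)) + "\n")

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['Hello world !', 'fairseq reads plain text']
```

The resulting token files are what you would then pass to fairseq-preprocess to build the binarized dataset and dictionary before running fairseq-train.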
|
@myleott Is it necessary to go through fairseq-preprocess? How about just using the output of the Hugging Face tokenizer (raw text like "您好, 世界" as the tokenizer's input, a dict of tensors as output) as the model's input?

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_path)  # model_path: local checkpoint dir
model = BertModel.from_pretrained(model_path)
input_texts = ["您好, 世界"]
inputs = tokenizer(input_texts, padding=True, return_tensors='pt')
print("inputs:{}".format(inputs))
```

got:

```
inputs:{
 'input_ids': tensor([[ 101, 8701,  102,    0],
                      [ 101, 2644, 1962,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0],
                           [0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 0],
                           [1, 1, 1, 1]])}
```

Thank you! |
You can do it, but it will slow down your training, especially the data-feeding part.
|
Is there an example of using the code in https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py ? It seems like this is only a wrapper; is there more that needs to be done if we want to load the pretrained GPT-2 model from Hugging Face? Thank you! |
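For what it's worth, a guess at how the wrapper is meant to be invoked: since `hf_gpt2.py` registers a fairseq model, training would go through the usual fairseq-train CLI. This is an untested sketch; the data path and hyperparameter flags are assumptions for illustration, and whether the wrapper actually loads pretrained GPT-2 weights (rather than building a fresh model from the GPT-2 config) is exactly the question raised above.

```shell
# Hypothetical sketch (untested): train a language model through the
# hf_gpt2 wrapper. Data path and hyperparameters are placeholders.
fairseq-train data-bin/my-lm-data \
    --task language_modeling \
    --arch hf_gpt2 \
    --optimizer adam --lr 1e-4 \
    --max-tokens 2048
```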
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment! |
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you! |
I want to load bert-base-chinese from Hugging Face (or Google's BERT release) and fine-tune it with fairseq. How can I do that? Thanks a lot!