
How to load a pretrained model from huggingface and use it in fairseq? #2666

Closed
ttzHome opened this issue Sep 28, 2020 · 10 comments

Comments

@ttzHome

ttzHome commented Sep 28, 2020

I want to load bert-base-chinese from huggingface (or Google's BERT) and fine-tune it with fairseq. How can I do that? Thanks a lot!

@jia-zhuang

me too, hope for answers

@myleott
Contributor

myleott commented Sep 29, 2020

It should be straightforward to wrap huggingface models in the corresponding fairseq abstractions. We've done this for the gpt2 language model implementation in huggingface: https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py

It'd be great to add more wrappers for other model types (e.g., FairseqEncoderModel for BERT-like models) and also to generalize it to load arbitrary pretrained models from huggingface (e.g., using AutoModel).

PRs are welcome! 😄
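
A rough sketch of what such a FairseqEncoderModel wrapper around AutoModel could look like (the model name "hf_auto_encoder" and the --hf-model-name flag are made up for illustration; this is not existing fairseq code):

    from fairseq.models import FairseqEncoder, FairseqEncoderModel, register_model
    from transformers import AutoModel

    class HFPretrainedEncoder(FairseqEncoder):
        def __init__(self, dictionary, hf_model_name):
            super().__init__(dictionary)
            # load the pretrained weights from the Hugging Face hub
            self.hf_model = AutoModel.from_pretrained(hf_model_name)

        def forward(self, src_tokens, src_lengths=None, **kwargs):
            out = self.hf_model(input_ids=src_tokens)
            return {"encoder_out": out.last_hidden_state}

    @register_model("hf_auto_encoder")  # hypothetical name
    class HFPretrainedEncoderModel(FairseqEncoderModel):
        @staticmethod
        def add_args(parser):
            parser.add_argument("--hf-model-name", default="bert-base-chinese",
                                help="any encoder checkpoint on the Hugging Face hub")

        @classmethod
        def build_model(cls, args, task):
            return cls(HFPretrainedEncoder(task.source_dictionary, args.hf_model_name))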

@shamanez

shamanez commented Oct 6, 2020

@myleott With the suggested approach, can we use a pretrained huggingface checkpoint?

I feel like we would need to specifically change the data preprocessing steps:

  1. Tokenization
  2. The fairseq-preprocess step (here I don't understand how to create dict.txt)

@myleott
Contributor

myleott commented Oct 10, 2020

Fairseq doesn’t really do any preprocessing. If you want to apply tokenization or BPE, that should happen outside of fairseq, then you can feed the resulting text into fairseq-preprocess/train.

Steps might be:

  1. start with raw text training data
  2. use huggingface to tokenize and apply BPE. Get back a text file with BPE tokens separated by spaces
  3. feed the output of step 2 into fairseq-preprocess, which will tensorize it and generate dict.txt (a sketch of steps 2 and 3 follows below)
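
A minimal sketch of steps 2 and 3, assuming a BERT-style checkpoint and the transformers library; the file names and the exact fairseq-preprocess flags are just one possible choice:

    # step 2: tokenize raw text with a Hugging Face tokenizer and write the
    # subword pieces separated by spaces, one sentence per line
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

    with open("train.raw", encoding="utf-8") as fin, \
            open("train.bpe", "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = tokenizer.tokenize(line.strip())
            fout.write(" ".join(pieces) + "\n")

    # step 3 (shell): fairseq-preprocess tensorizes the data and writes dict.txt
    #   fairseq-preprocess --only-source --trainpref train.bpe --destdir data-bin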

@CheungZeeCn

CheungZeeCn commented Oct 27, 2020

@myleott Is it necessary to go through fairseq-preprocess?
How about just using the output of the Hugging Face tokenizer (raw text such as "您好, 世界" as the tokenizer's input, a dict of tensors as the output) directly as the model's input?

    from transformers import BertModel, BertTokenizer

    model_path = "bert-base-chinese"  # path or hub name of the pretrained checkpoint
    tokenizer = BertTokenizer.from_pretrained(model_path)
    model = BertModel.from_pretrained(model_path)
    input_texts = ["您好, 世界"]
    inputs = tokenizer(input_texts, padding=True, return_tensors='pt')
    print("inputs:{}".format(inputs))

got:

    inputs:{'input_ids': tensor([[ 101, 2644, 1962,  117,  686, 4518,  102]]),
            'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
            'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
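
These tensors can then be passed straight to the model (a minimal sketch of the forward call, assuming the same model as in the snippet above):

    # forward pass on the tokenizer's output, outside of fairseq
    outputs = model(**inputs)
    last_hidden = outputs[0]  # final hidden states, shape (batch, seq_len, hidden_size)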

Thank you!

@shamanez

shamanez commented Oct 27, 2020 via email

@CheungZeeCn

CheungZeeCn commented Oct 27, 2020

Is there an example of using the code in https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py ?
@myleott @shamanez

It seems that this is only a wrapper; is there more that needs to be done if we want to load the pretrained gpt2 model from hugging face?

Thank you!

@CheungZeeCn

CheungZeeCn commented Dec 10, 2020

Hi guys, here is my code for exactly this task: HERE. Please check whether it helps!
@ttzHome @shamanez

cc @myleott

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jul 21, 2021
@stale

stale bot commented May 2, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

@stale stale bot closed this as completed May 2, 2022