Save tokenizer in conversion script #128

Merged
merged 4 commits into main from feat-save-tokenizer on Oct 7, 2021

Conversation

@jaketae (Member) commented Oct 7, 2021

This PR implements the following:

  • Accept tokenizer_type and tokenizer_name_or_path as conversion script arguments
  • Save pretrained tokenizers alongside the converted model
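
For illustration only, a minimal sketch of what the two changes might look like (not the actual diff); the argparse usage, AutoTokenizer, and the output_folder argument are assumptions:

# hypothetical sketch: two new CLI arguments plus a tokenizer save step
import argparse
from transformers import AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--tokenizer_type", type=str, default=None)
parser.add_argument("--tokenizer_name_or_path", type=str, default=None)
parser.add_argument("--output_folder", type=str, required=True)
args = parser.parse_args()

if args.tokenizer_name_or_path is not None:
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path)
    # writes the tokenizer files next to the converted model weights
    tokenizer.save_pretrained(args.output_folder)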

@jaketae jaketae linked an issue Oct 7, 2021 that may be closed by this pull request
@jaketae jaketae requested a review from stas00 October 7, 2021 06:39
@stas00 (Member) commented Oct 7, 2021

Thank you for working on it, @jaketae

FYI, in the future you can add:

Fixes: #126

to the OP, and it'll automatically close the issue when the PR is merged.


Oh, sorry, perhaps my spec wasn't clear enough. The args are already in the checkpoint's args:

ds_args = ds_checkpoint.get_args()

No need to pass them in a second time.
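
For illustration, a minimal sketch of reading the tokenizer settings from the stored checkpoint args instead of new CLI flags; the attribute names (tokenizer_type, vocab_file, merge_file) are assumed from Megatron-LM's argument names:

# assuming ds_args is the object returned by ds_checkpoint.get_args() above,
# and that it carries Megatron-LM's tokenizer settings under these names
tokenizer_type = getattr(ds_args, "tokenizer_type", None)
vocab_file = getattr(ds_args, "vocab_file", None)
merge_file = getattr(ds_args, "merge_file", None)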


In addition to the original spec, it looks like we also need to set tokenizer_class in the config object; it should correspond to the HF Transformers tokenizer class. It's probably safe to use the fast tokenizers, so it'd be one of GPT2TokenizerFast or T5TokenizerFast.

Please see huggingface/transformers#13906 for context
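
A rough sketch of that suggestion, assuming the Megatron tokenizer type name ("GPT2BPETokenizer") and a standard Transformers config object; names are placeholders, not the actual implementation:

# map the Megatron tokenizer type to an HF fast tokenizer class name
if ds_args.tokenizer_type == "GPT2BPETokenizer":   # assumed Megatron type name
    config.tokenizer_class = "GPT2TokenizerFast"
else:
    config.tokenizer_class = "T5TokenizerFast"
config.save_pretrained(output_folder)  # tokenizer_class is written into config.json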

@jaketae (Member, Author) commented Oct 7, 2021

@stas00 Thanks for the feedback!

Would hard-coding the tokenizer class name be preferable over something like type(tokenizer).__name__ (or some other way of getting the class name from a Python object)? I'm wondering how robust the assumption is that we would always be using one of the two tokenizers.


I used to put "Fixes X", until I realized that one can explicitly link an issue to a PR. I think it achieves the same thing, namely closing the issue when a PR is merged. But I'll keep that in mind.

@stas00 (Member) commented Oct 7, 2021

I used to put "Fixes X", until I realized that one can explicitly link an issue to a PR. I think it achieves the same thing, namely closing the issue when a PR is merged. But I'll keep that in mind.

Oh, you did it manually, I see. I missed that. I guess this is just a convention we use at HF, so the reviewers quickly see which issue(s) it's resolving. I can now see the linked issue in the right bar. So either way works, it's just further away from the OP and not always immediately obvious.

I myself always start a PR with something like:

"This PR is addressing issue #xxx , "

but that's just my personal convention.

Would hard-coding the tokenizer class name be preferable over something like type(tokenizer).__name__ (or some other way of getting the class name from a Python object)? I'm wondering how robust the assumption is that we would always be using one of the two tokenizers.

That's even better - I forgot that we were creating the corresponding tokenizer anyway to get its files, so by all means yes - your proposal is great!

Thank you, @jaketae

@jaketae (Member, Author) commented Oct 7, 2021

@stas00 Thanks for the feedback! I've updated the code so that it saves the tokenizer_class field in config.json. The class name is dynamically fetched as proposed in my previous comment.
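
Roughly, the approach described above amounts to something like the following sketch (variable names are placeholders):

# derive the class name from the tokenizer object instead of hard-coding it
config.tokenizer_class = type(tokenizer).__name__   # e.g. "GPT2TokenizerFast"
config.save_pretrained(output_folder)               # stored in config.json as "tokenizer_class"
tokenizer.save_pretrained(output_folder)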

I'll make sure to link relevant issues more explicitly in future PRs. Thank you!

@stas00 (Member) left a comment

Fantastic, thank you, @jaketae

Just a small suggestion inside.

The test will be part of #121 once that is merged.

Review thread on tools/convert_checkpoint/deepspeed_to_transformers.py (outdated, resolved)
@stas00 (Member) commented Oct 7, 2021

The following comment is for a new Issue/PR:

Since you started to use an auto-formatter, let's add the config you used to something we can all run automatically, so we all use the same setup. I highly recommend replicating the HF Transformers setup, since it's already done and has been thought through.

Again, we can do that only for the test suite, since the main code needs to remain in the same format so that we can easily sync with the original Megatron-LM and Megatron-DeepSpeed trees.

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@jaketae jaketae merged commit 23dded0 into main Oct 7, 2021
@jaketae jaketae deleted the feat-save-tokenizer branch October 7, 2021 19:38
@jaketae (Member, Author) commented Oct 7, 2021

@stas00 Sounds great! I think HF uses black and isort. So maybe I'll just add a few lines to the Makefile to make sure that the test and tool directories are correctly formatted. I'll open an issue and a corresponding PR in the next day or two. Thanks!

@stas00 (Member) commented Oct 7, 2021

It does, but I meant that we should copy the specific configs both use:

  • black's config is in pyproject.toml
  • isort's is in setup.cfg
  • and most importantly, we have to require the same minimum versions of each (best to sync with transformers), since different versions lead to very different outcomes even with the same config, so we developers need to stay in sync.

@stas00 (Member) commented Oct 7, 2021

tools is in Megatron-LM, so not sure about that... and the original Meg-DS uses it too... so best to leave it alone.

@jaketae (Member, Author) commented Oct 7, 2021

tools is in Megatron-LM, so not sure about that... and the original Meg-DS uses it too... so best to leave it alone.

I wasn't aware of that; I'll make sure to remove them from the formatted directories. I was thinking maybe the conversion directory could be included.

I'll also make sure to check the formatter configuration files. Thanks for the heads up!

@stas00 (Member) commented Oct 7, 2021

I was thinking maybe the conversion directory could be included.

Sure, and then we can ask the original Meg-DS to sync with ours. Let's just not forget that if we reformat Tunji's files.

Successfully merging this pull request may close these issues:

  • auto-add tokenizer files to the converted model