Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion) #19

Closed
andreaskoepf opened this issue Aug 10, 2023 · 2 comments · Fixed by #27

andreaskoepf (Contributor) commented Aug 10, 2023

The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) should be generated as part of the megatron2hf conversion script.

As a temporary solution I created a create_hf_tokenizer_config.py script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers, with support for additional custom tokens.
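
For reference, a minimal sketch of what such a config-generation step could look like (not the actual create_hf_tokenizer_config.py; the paths and the custom-token list are placeholders, and it assumes LlamaTokenizerFast can be built from the sentencepiece file, which requires sentencepiece and protobuf to be installed):

```python
# Minimal sketch: build a HF tokenizer whose ids match the Megatron-LLM
# sentencepiece tokenizer plus custom tokens, then write the config files
# next to the converted weights.
from transformers import LlamaTokenizerFast

# Base sentencepiece model used during Megatron training (placeholder path).
tokenizer = LlamaTokenizerFast(vocab_file="tokenizer.model")

# Placeholder custom tokens; they must be appended in the same order as they
# were passed to the Megatron tokenizer (vocab_extra_ids_list) so the token
# ids stay consistent between training and inference.
custom_tokens = ["<custom_token_0>", "<custom_token_1>"]
tokenizer.add_special_tokens({"additional_special_tokens": custom_tokens})

# Writes tokenizer_config.json, special_tokens_map.json, tokenizer.json
# (and tokenizer.model, since the tokenizer was built from a sentencepiece file).
tokenizer.save_pretrained("converted_model/")
```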

Additionally I noticed the following points:

  • Unlike _SentencePieceTokenizer, the _FalconTokenizer doesn't add special tokens like <CLS>, <SEP>, <EOD>, <MASK>, and it also uses the standard EOS token (<|endoftext|>) as the EOD token.
  • For _SentencePieceTokenizer, the use of custom tokens is tied to adding the built-in special tokens (<CLS>, <SEP>, <EOD>, <MASK> are added when new_tokens == True), even though they might not be used (eod should always be mapped to eos (</s>), since it is used by get_ltor_masks_and_position_ids() when reset_position_ids or reset_attention_mask are True).
  • SentencePieceTokenizer requires a vocab file, and the check for it should not be excluded here only to perform the same check a few lines below.
andreaskoepf changed the title Generate Huggingface tokenizer configuration as part of weight conversion → Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion) Aug 10, 2023
AleHD (Collaborator) commented Aug 13, 2023

Could you please elaborate on the second point? I think that depending on the settings used when the data was tokenized (i.e. whether new_tokens=True or not), during training the code will look for either the <eos> or the <eod> token, right? Sorry if I misunderstood something.

AleHD linked a pull request Aug 13, 2023 that will close this issue
andreaskoepf (Contributor, Author) commented Aug 14, 2023

Could you please elaborate on the second point?

Defining custom tokens (passed via vocab_extra_ids_list) currently implies the addition of the built-in special tokens <CLS>, <SEP>, <EOD>, <MASK>. Adding these built-in special tokens is not always necessary. I suggest that the ctor's new_tokens parameter should only control whether the built-in standard tokens are added and not influence the addition of tokens specified via vocab_extra_ids_list. The function _add_special_token() currently checks the new_tokens argument, and it is also used for adding the entries from vocab_extra_ids_list.
Regarding eod: the current implementation of the eod property already returns _eos_id if _eod_id is None, so nothing needs to be changed there.
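
To make the suggested split concrete, here is a small standalone sketch (a hypothetical helper, not the actual _SentencePieceTokenizer code) in which new_tokens gates only the built-in specials while entries from vocab_extra_ids_list are always appended:

```python
# Standalone sketch of the proposed behaviour: `new_tokens` gates only the
# built-in special tokens; custom tokens are registered unconditionally.
BUILTIN_SPECIALS = ("<CLS>", "<SEP>", "<EOD>", "<MASK>")

def build_special_vocab(base_vocab_size, vocab_extra_ids_list=None, new_tokens=False):
    """Return a {token: id} map appended after the base sentencepiece vocab."""
    special_vocab = {}
    next_id = base_vocab_size
    if new_tokens:  # built-in specials are optional
        for tok in BUILTIN_SPECIALS:
            special_vocab[tok] = next_id
            next_id += 1
    for tok in (vocab_extra_ids_list or []):  # custom tokens: always added
        special_vocab[tok] = next_id
        next_id += 1
    return special_vocab

# Example: custom tokens without the built-in specials.
print(build_special_vocab(32000, ["<custom_token_0>", "<custom_token_1>"], new_tokens=False))
# {'<custom_token_0>': 32000, '<custom_token_1>': 32001}
```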

Background to EOD: The eod token appears at several locations in the Megatron code and can be used to separate documents within a sequence. For example, GPTDataset potentially concatenates several documents, and if EOD tokens were added via preprocess_data.py, they can further be used for attention masking and position-id resetting in get_ltor_masks_and_position_ids().
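
As a simplified illustration of that resetting behaviour (a standalone sketch, not the real get_ltor_masks_and_position_ids implementation):

```python
# Simplified sketch: position ids restart at 0 after each EOD token inside a
# packed sequence of concatenated documents.
import torch

def reset_position_ids(tokens: torch.Tensor, eod_id: int) -> torch.Tensor:
    out = torch.empty_like(tokens)
    offset = 0
    for i, tok in enumerate(tokens.tolist()):
        out[i] = i - offset
        if tok == eod_id:  # the next document starts counting from 0 again
            offset = i + 1
    return out

# Two documents packed into one sequence (here eod_id=2 stands in for </s>).
print(reset_position_ids(torch.tensor([5, 6, 7, 2, 8, 9]), eod_id=2))
# tensor([0, 1, 2, 3, 0, 1])
```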
