Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion) #19

Closed
andreaskoepf opened this issue Aug 10, 2023 · 2 comments · Fixed by #27

andreaskoepf (Contributor) commented Aug 10, 2023

The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) should be generated as part of the megatron2hf conversion script.

As a temporary solution I created a create_hf_tokenizer_config.py script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers, with support for additional custom tokens.
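
For reference, a minimal sketch of what such a config-generation step could look like (not the actual create_hf_tokenizer_config.py; the paths and the custom-token list are placeholders, and it assumes LlamaTokenizerFast can be built from the sentencepiece file, which requires sentencepiece and protobuf to be installed):

```python
# Minimal sketch: build a HF tokenizer whose ids match the Megatron-LLM
# sentencepiece tokenizer plus custom tokens, then write the config files
# next to the converted weights.
from transformers import LlamaTokenizerFast

# Base sentencepiece model used during Megatron training (placeholder path).
tokenizer = LlamaTokenizerFast(vocab_file="tokenizer.model")

# Placeholder custom tokens; they must be appended in the same order as they
# were passed to the Megatron tokenizer (vocab_extra_ids_list) so the token
# ids stay consistent between training and inference.
custom_tokens = ["<custom_token_0>", "<custom_token_1>"]
tokenizer.add_special_tokens({"additional_special_tokens": custom_tokens})

# Writes tokenizer_config.json, special_tokens_map.json, tokenizer.json
# (and tokenizer.model, since the tokenizer was built from a sentencepiece file).
tokenizer.save_pretrained("converted_model/")
```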

Additionally I noticed the following points:

  • Unlike _SentencePieceTokenizer, the _FalconTokenizer doesn't add special tokens like <CLS>, <SEP>, <EOD>, <MASK>, and it also uses the standard EOS token (<|endoftext|>) as the EOD token.
  • For _SentencePieceTokenizer, the use of custom tokens is tied to adding the built-in special tokens (<CLS>, <SEP>, <EOD>, <MASK> are added when new_tokens == True), even though they might not be used (eod should always be mapped to eos (</s>), since it is used by get_ltor_masks_and_position_ids() when reset_position_ids or reset_attention_mask are True).
  • SentencePieceTokenizer requires a vocab file, and the check for it should not be excluded here only to perform the same check a few lines below.
andreaskoepf changed the title Generate Huggingface tokenizer configuration as part of weight conversion → Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion) Aug 10, 2023
AleHD (Collaborator) commented Aug 13, 2023

Could you please elaborate on the second point? I think that depending on the settings used when the data was tokenized (i.e. whether new_tokens=True or not), during training the code will look for either the <eos> or the <eod> token, right? Sorry if I misunderstood something.

AleHD linked a pull request Aug 13, 2023 that will close this issue
andreaskoepf (Contributor, Author) commented Aug 14, 2023

Could you please elaborate on the second point?

Defining custom tokens (passed via vocab_extra_ids_list) currently implies the addition of the built-in special tokens <CLS>, <SEP>, <EOD>, <MASK>. Adding these built-in special tokens is not always necessary. I suggest that the ctor's new_tokens parameter should only control whether the built-in standard tokens are added and not influence the addition of tokens specified via vocab_extra_ids_list. The function _add_special_token() currently checks the new_tokens argument, and it is also used for adding the entries from vocab_extra_ids_list.
Regarding eod: the current implementation of the eod property already returns _eos_id if _eod_id is None, so nothing needs to be changed there.
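
To make the suggested split concrete, here is a small standalone sketch (a hypothetical helper, not the actual _SentencePieceTokenizer code) in which new_tokens gates only the built-in specials while entries from vocab_extra_ids_list are always appended:

```python
# Standalone sketch of the proposed behaviour: `new_tokens` gates only the
# built-in special tokens; custom tokens are registered unconditionally.
BUILTIN_SPECIALS = ("<CLS>", "<SEP>", "<EOD>", "<MASK>")

def build_special_vocab(base_vocab_size, vocab_extra_ids_list=None, new_tokens=False):
    """Return a {token: id} map appended after the base sentencepiece vocab."""
    special_vocab = {}
    next_id = base_vocab_size
    if new_tokens:  # built-in specials are optional
        for tok in BUILTIN_SPECIALS:
            special_vocab[tok] = next_id
            next_id += 1
    for tok in (vocab_extra_ids_list or []):  # custom tokens: always added
        special_vocab[tok] = next_id
        next_id += 1
    return special_vocab

# Example: custom tokens without the built-in specials.
print(build_special_vocab(32000, ["<custom_token_0>", "<custom_token_1>"], new_tokens=False))
# {'<custom_token_0>': 32000, '<custom_token_1>': 32001}
```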

Background to EOD: The eod token appears at several locations in the Megatron code and can be used to separate documents within a sequence. For example, GPTDataset potentially concatenates several documents, and if EOD tokens were added via preprocess_data.py, they can further be used for attention masking and position-id resetting in get_ltor_masks_and_position_ids().
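
As a simplified illustration of that resetting behaviour (a standalone sketch, not the real get_ltor_masks_and_position_ids implementation):

```python
# Simplified sketch: position ids restart at 0 after each EOD token inside a
# packed sequence of concatenated documents.
import torch

def reset_position_ids(tokens: torch.Tensor, eod_id: int) -> torch.Tensor:
    out = torch.empty_like(tokens)
    offset = 0
    for i, tok in enumerate(tokens.tolist()):
        out[i] = i - offset
        if tok == eod_id:  # the next document starts counting from 0 again
            offset = i + 1
    return out

# Two documents packed into one sequence (here eod_id=2 stands in for </s>).
print(reset_position_ids(torch.tensor([5, 6, 7, 2, 8, 9]), eod_id=2))
# tensor([0, 1, 2, 3, 0, 1])
```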
