Generate HuggingFace tokenizer configuration as part of megatron2hf.py (weight conversion) #19
The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (`special_tokens_map.json`, `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`) should be generated as part of the megatron2hf conversion script. As a temporary solution I created a `create_hf_tokenizer_config.py` script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers and with support for additional custom tokens.
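A minimal sketch of what such a script could do is shown below. This is not the actual `create_hf_tokenizer_config.py`; the function name, the `extra_tokens` parameter, the placeholder custom tokens, and the use of `LlamaTokenizer` are assumptions, chosen only to illustrate wrapping an existing SentencePiece model so that the token ids stay aligned:

```python
# Sketch only: wrap an existing SentencePiece model in a HF tokenizer so that
# the base vocabulary ids stay identical, append custom tokens in the same
# order they were added on the Megatron side, and write the HF config files.
from pathlib import Path

from transformers import LlamaTokenizer  # assumes a SentencePiece/Llama-style model


def write_hf_tokenizer_config(spm_model: str, out_dir: str, extra_tokens=None) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Reuse the original tokenizer.model so the base vocabulary ids are unchanged.
    tokenizer = LlamaTokenizer(vocab_file=spm_model)

    # Custom tokens must be appended in the same order as during training,
    # otherwise their ids won't match the Megatron-LLM tokenizer.
    if extra_tokens:
        tokenizer.add_tokens(list(extra_tokens), special_tokens=True)

    # Writes tokenizer_config.json, special_tokens_map.json, tokenizer.model, ...
    tokenizer.save_pretrained(out)


if __name__ == "__main__":
    write_hf_tokenizer_config("tokenizer.model", "hf_tokenizer",
                              extra_tokens=["<special_0>", "<special_1>"])
```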
Additionally I noticed the following points:

- In contrast to the `_SentencePieceTokenizer`, the `_FalconTokenizer` doesn't add special tokens like `<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` and uses the standard EOS token (`<|endoftext|>`) also as the EOD token.
- For the `_SentencePieceTokenizer`, the use of custom tokens is tied to adding the special tokens (`<CLS>`, `<SEP>`, `<EOD>`, `<MASK>` are added when `new_tokens == True`) even though they might not be used. The eod token should always be mapped to eos (`</s>`), since it is used by `get_ltor_masks_and_position_ids()` when `reset_position_ids` or `reset_attention_mask` are `True`. A small sanity check for the generated configuration is sketched below.