Add Tokenizer Module (Pre-trained Tokenizer) #167
Conversation
Quick question: How would the tokenizers interact with the …
Is the file …
texar/torch/modules/tokenizers/pretrained_bert_tokenizer_utils.py
This is a standard file used in the testing of …
@gpengzhi Regarding …
I think we can. I will change it accordingly.
Pls see the above unresolved comments
Looks good once the last comment on docstring is resolved.
Pls update bert, gpt2, xlnet examples to use these tokenizers, and delete unnecessary utils in those examples.
Do the updated examples reproduce the previous results?
Yes
Once the method name is updated, it's good to merge |
The design of the pre-trained tokenizer module borrows ideas from both pytorch-transformers and our `pretrained` module. To initialize a tokenizer, you can specify `pretrained_model_name`, specify `hparams`, or directly load from the vocab file, as in the sketch below.
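Below is a minimal sketch of the three initialization paths. The import path, the class name `BERTTokenizer`, and the exact hparams keys are assumptions for illustration; please check the module docstrings for the actual API.

```python
# A hedged sketch of the three ways to construct a pre-trained tokenizer.
# The import path, class name, and hparams keys below are assumptions.
from texar.torch.modules import BERTTokenizer  # assumed import path

# 1. Specify `pretrained_model_name` (downloads/caches the vocab of that checkpoint).
tokenizer = BERTTokenizer(pretrained_model_name="bert-base-uncased")

# 2. Specify `hparams` instead.
tokenizer = BERTTokenizer(
    hparams={"pretrained_model_name": "bert-base-uncased"})

# 3. Directly load from a local vocab file (hypothetical path and key).
tokenizer = BERTTokenizer(
    hparams={"pretrained_model_name": None,
             "vocab_file": "./bert-base-uncased-vocab.txt"})
```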
For downstream tasks, you can add new tokens to the tokenizer and get the latest vocabulary size, which can be used to define the downstream model (see the sketch below).
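As a rough illustration (the method name `add_tokens` and the use of `len()` for the updated size are assumptions, not confirmed in this PR):

```python
# Hedged sketch: extend the vocabulary for a downstream task.
# `add_tokens` and `len(tokenizer)` are assumed names/behaviors.
num_added = tokenizer.add_tokens(["[E1]", "[E2]"])  # hypothetical new special tokens
new_vocab_size = len(tokenizer)  # assumed to include the newly added tokens

# The downstream model can then size its embedding / output layers
# with `new_vocab_size`.
```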
You can also `save` the tokenizer to a directory (the vocab file, special token file, added token file, and config file will be saved) or load one from a directory, as sketched below.
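A hedged sketch of saving and reloading; `save` is named above, while the `load` classmethod and its signature are assumptions:

```python
# Hedged sketch: persist the tokenizer and restore it later.
tokenizer.save("./my_tokenizer/")  # writes vocab, special-token, added-token, and config files

# Restoring from the same directory (assumed classmethod and signature).
tokenizer = BERTTokenizer.load("./my_tokenizer/")
```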
Basically, we provide four core functions for each tokenizer: text-to-token, text-to-id, id-to-text, and token-to-text.
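For example (the `map_*` method names below are assumptions based on the four conversions named above):

```python
# Hedged sketch of the four core conversions.
tokens = tokenizer.map_text_to_token("a pre-trained tokenizer")  # text   -> tokens
ids = tokenizer.map_text_to_id("a pre-trained tokenizer")        # text   -> ids
text_a = tokenizer.map_id_to_text(ids)                           # ids    -> text
text_b = tokenizer.map_token_to_text(tokens)                     # tokens -> text
```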