
Add Tokenizer Module (Pre-trained Tokenizer) #167

Merged

gpengzhi merged 23 commits into asyml:master on Aug 29, 2019

Conversation

gpengzhi
Collaborator

The design of the pre-trained tokenizer module borrows ideas from both pytorch-transformers and our pre-trained module.

To initialize a tokenizer, you can specify pretrained_model_name

tokenizer = BERTTokenizer(pretrained_model_name='bert-base-uncased')

or specify hparams

tokenizer = BERTTokenizer(hparams={'pretrained_model_name': 'bert-base-uncased'})

or directly load from the vocab file

tokenizer = BERTTokenizer(hparams={'pretrained_model_name': None, 'vocab_file': 'path_to_vocab'})

For downstream tasks, you can add new tokens to the tokenizer

tokenizer.add_tokens([...])

and get the latest vocabulary size, which can be used to define the downstream model

current_vocab_size = len(tokenizer)
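
For example, a typical downstream workflow might look like the following sketch (the new token strings, the hidden size, and the classifier layer are purely illustrative, and the import path is an assumption):

import torch.nn as nn
from texar.torch.data import BERTTokenizer  # import path assumed

tokenizer = BERTTokenizer(pretrained_model_name='bert-base-uncased')
tokenizer.add_tokens(['[ENT1]', '[ENT2]'])  # illustrative domain-specific tokens

# Use the updated size when defining layers that depend on the vocabulary.
current_vocab_size = len(tokenizer)
output_layer = nn.Linear(768, current_vocab_size)  # 768 = BERT-base hidden size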

You can also save the tokenizer to a directory (vocab file, special token file, added token file, and config file will be saved)

tokenizer.save('path-to-directory')

or load one from a directory

tokenizer = BERTTokenizer.load('path-to-directory')
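
For example, a save/load round trip might look like this (the temporary directory is just for illustration):

import tempfile

save_dir = tempfile.mkdtemp()
tokenizer.save(save_dir)  # writes vocab, special-token, added-token, and config files

restored = BERTTokenizer.load(save_dir)
assert len(restored) == len(tokenizer)  # added tokens are preserved as well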

Basically, we provide four core functions for each tokenizer: text-to-token, text-to-id, id-to-token, and id-to-text.

tokens = tokenizer(inputs=text, task='text-to-token')
ids = tokenizer(inputs=text, task='text-to-id')
tokens = tokenizer(inputs=ids, task='id-to-token')
text = tokenizer(inputs=ids, task='id-to-text')
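
For example, these tasks can be chained to round-trip a sentence (the sample text is arbitrary, and the exact sub-tokens depend on the BERT vocabulary):

text = "Texar provides pre-trained tokenizers."

tokens = tokenizer(inputs=text, task='text-to-token')  # word-piece tokens, e.g. ['tex', '##ar', ...]
ids = tokenizer(inputs=text, task='text-to-id')        # corresponding vocabulary ids
recovered = tokenizer(inputs=ids, task='id-to-text')   # detokenized text (lower-cased for the uncased model)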

@ZhitingHu
Member

Quick question: how would the tokenizers interact with the tx.data.Vocabulary class? Perhaps this is what should be considered when adding non-pretrained tokenizers (I remember we discussed this in our weekly sync-up)?

@ZhitingHu
Member

Is the file test_sentencepiece.model intended? It is 247 KB; make it smaller if possible.

@gpengzhi
Collaborator Author

Is the file test_sentencepiece.model intended? It is 247 KB; make it smaller if possible.

This is a standard file used in the testing of google/sentencepiece. We can reuse it when we add SentencePieceTokenizer (which provides BPE and unigram language model functionality).
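
If we do need a smaller file, a tiny model could also be trained directly with the sentencepiece library, roughly like this (the corpus file and vocab size are placeholders, not values from this PR):

import sentencepiece as spm

# Train a tiny unigram model purely as a test fixture.
spm.SentencePieceTrainer.Train(
    '--input=tiny_corpus.txt --model_prefix=test_sentencepiece '
    '--vocab_size=1000 --model_type=unigram')
# Produces test_sentencepiece.model and test_sentencepiece.vocab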

@huzecong
Collaborator

@gpengzhi Regarding test_sentencepiece.model, can we download the file during testing? We do so for RecordData where we download a couple of images.

@gpengzhi
Collaborator Author

@gpengzhi Regarding test_sentencepiece.model, can we download the file during testing? We do so for RecordData where we download a couple of images.

I think we can. I will change it accordingly.
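
Something along these lines in the test setup should work (the URL is a placeholder for wherever the fixture ends up being hosted):

import os
import unittest
import urllib.request


class SentencePieceTokenizerTest(unittest.TestCase):

    def setUp(self):
        # Download the fixture at test time instead of committing it to the repo,
        # similar to how the RecordData tests fetch their sample images.
        self.model_path = 'test_sentencepiece.model'
        if not os.path.exists(self.model_path):
            urllib.request.urlretrieve(
                'https://example.com/test_sentencepiece.model', self.model_path)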


@ZhitingHu left a comment


Pls see the above unresolved comments


@ZhitingHu left a comment


Looks good once the last comment on docstring is resolved.

@ZhitingHu
Member

Pls update bert, gpt2, xlnet examples to use these tokenizers, and delete unnecessary utils in those examples

@ZhitingHu
Member

Do the updated examples reproduce the previous results?

@gpengzhi
Collaborator Author

Do the updated examples reproduce the previous results?

Yes

Review threads (all resolved) on:
examples/bert/utils/data_utils.py
examples/gpt-2/utils/data_utils.py
examples/xlnet/utils/data_utils.py
examples/xlnet/xlnet_generation_ipython.py
texar/torch/data/tokenizers/pretrained_gpt2_tokenizer.py
texar/torch/data/tokenizers/pretrained_bert_tokenizer.py
texar/torch/utils/utils.py

@gpengzhi mentioned this pull request Aug 29, 2019
@ZhitingHu
Member

Once the method name is updated, it's good to merge

@gpengzhi merged commit 070d0d3 into asyml:master on Aug 29, 2019
@gpengzhi deleted the tokenizer branch on November 1, 2019