
Add Tokenizer Module (Pre-trained Tokenizer) #167

Merged

gpengzhi merged 23 commits into asyml:master on Aug 29, 2019

Conversation

gpengzhi
Collaborator

The design of the pre-trained tokenizer module borrows ideas from both pytorch-transformers and our pre-trained module.

To initialize a tokenizer, you can specify pretrained_model_name

tokenizer = BERTTokenizer(pretrained_model_name='bert-base-uncased')

or specify hparams

tokenizer = BERTTokenizer(hparams={'pretrained_model_name': 'bert-base-uncased'})

or directly load from the vocab file

tokenizer = BERTTokenizer(hparams={'pretrained_model_name': None, 'vocab_file': 'path_to_vocab'})

For downstream tasks, you can add new tokens to the tokenizer

tokenizer.add_tokens([...])

and get the latest vocabulary size, which can be used to define the downstream model

current_vocab_size = len(tokenizer)
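
For example, a typical downstream workflow might look like the following sketch (the new token strings, the hidden size, and the classifier layer are purely illustrative, and the import path is an assumption):

import torch.nn as nn
from texar.torch.data import BERTTokenizer  # import path assumed

tokenizer = BERTTokenizer(pretrained_model_name='bert-base-uncased')
tokenizer.add_tokens(['[ENT1]', '[ENT2]'])  # illustrative domain-specific tokens

# Use the updated size when defining layers that depend on the vocabulary.
current_vocab_size = len(tokenizer)
output_layer = nn.Linear(768, current_vocab_size)  # 768 = BERT-base hidden size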

You can also save the tokenizer to a directory (vocab file, special token file, added token file, and config file will be saved)

tokenizer.save('path-to-directory')

or load one from a directory

tokenizer = BERTTokenizer.load('path-to-directory')
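
For example, a save/load round trip might look like this (the temporary directory is just for illustration):

import tempfile

save_dir = tempfile.mkdtemp()
tokenizer.save(save_dir)  # writes vocab, special-token, added-token, and config files

restored = BERTTokenizer.load(save_dir)
assert len(restored) == len(tokenizer)  # added tokens are preserved as well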

Basically, we provide four core functions for each tokenizer: text-to-token, text-to-id, id-to-token, and id-to-text.

tokens = tokenizer(inputs=text, task='text-to-token')
ids = tokenizer(inputs=text, task='text-to-id')
tokens = tokenizer(inputs=ids, task='id-to-token')
text = tokenizer(inputs=ids, task='id-to-text')
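
For example, these tasks can be chained to round-trip a sentence (the sample text is arbitrary, and the exact sub-tokens depend on the BERT vocabulary):

text = "Texar provides pre-trained tokenizers."

tokens = tokenizer(inputs=text, task='text-to-token')  # word-piece tokens, e.g. ['tex', '##ar', ...]
ids = tokenizer(inputs=text, task='text-to-id')        # corresponding vocabulary ids
recovered = tokenizer(inputs=ids, task='id-to-text')   # detokenized text (lower-cased for the uncased model)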

@ZhitingHu
Member

Quick question: how would the tokenizers interact with the tx.data.Vocabulary class? Perhaps this is what should be considered when adding non-pretrained tokenizers (I remember we discussed this in our weekly sync-up)?

@ZhitingHu
Member

Is the file test_sentencepiece.model intended? It is 247 KB; make it smaller if possible.

@gpengzhi
Collaborator Author

Is the file test_sentencepiece.model intended? It is 247 KB; make it smaller if possible.

This is a standard file used in the testing of google/sentencepiece. We can reuse it when we add SentencePieceTokenizer (which provides BPE and unigram language model functionality).
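
If we do need a smaller file, a tiny model could also be trained directly with the sentencepiece library, roughly like this (the corpus file and vocab size are placeholders, not values from this PR):

import sentencepiece as spm

# Train a tiny unigram model purely as a test fixture.
spm.SentencePieceTrainer.Train(
    '--input=tiny_corpus.txt --model_prefix=test_sentencepiece '
    '--vocab_size=1000 --model_type=unigram')
# Produces test_sentencepiece.model and test_sentencepiece.vocab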

@huzecong
Collaborator

@gpengzhi Regarding test_sentencepiece.model, can we download the file during testing? We do so for RecordData where we download a couple of images.

@gpengzhi
Collaborator Author

@gpengzhi Regarding test_sentencepiece.model, can we download the file during testing? We do so for RecordData where we download a couple of images.

I think we can. I will change it accordingly.
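
Something along these lines in the test setup should work (the URL is a placeholder for wherever the fixture ends up being hosted):

import os
import unittest
import urllib.request


class SentencePieceTokenizerTest(unittest.TestCase):

    def setUp(self):
        # Download the fixture at test time instead of committing it to the repo,
        # similar to how the RecordData tests fetch their sample images.
        self.model_path = 'test_sentencepiece.model'
        if not os.path.exists(self.model_path):
            urllib.request.urlretrieve(
                'https://example.com/test_sentencepiece.model', self.model_path)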


@ZhitingHu left a comment


Pls see the above unresolved comments


@ZhitingHu left a comment


Looks good once the last comment on docstring is resolved.

@ZhitingHu
Member

Pls update bert, gpt2, xlnet examples to use these tokenizers, and delete unnecessary utils in those examples

@ZhitingHu
Member

Do the updated examples reproduce the previous results?

@gpengzhi
Collaborator Author

Do the updated examples reproduce the previous results?

Yes

Review threads (all resolved) on:
examples/bert/utils/data_utils.py
examples/gpt-2/utils/data_utils.py
examples/xlnet/utils/data_utils.py
examples/xlnet/xlnet_generation_ipython.py
texar/torch/data/tokenizers/pretrained_gpt2_tokenizer.py
texar/torch/data/tokenizers/pretrained_bert_tokenizer.py
texar/torch/utils/utils.py

@gpengzhi mentioned this pull request Aug 29, 2019
@ZhitingHu
Member

Once the method name is updated, it's good to merge

@gpengzhi merged commit 070d0d3 into asyml:master on Aug 29, 2019
@gpengzhi deleted the tokenizer branch on November 1, 2019