Add GPT-2 modules & Polish GPT-2 examples (#99)
* Add GPT2 modules and Polish GPT2 example
gpengzhi committed Jul 12, 2019
1 parent 4f414cd commit c04990d
Showing 21 changed files with 2,064 additions and 338 deletions.
20 changes: 20 additions & 0 deletions docs/code/modules.rst
@@ -64,6 +64,11 @@ Encoders
.. autoclass:: texar.modules.BertEncoder
:members:

:hidden:`GPT2Encoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Encoder
:members:

:hidden:`Conv1DEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DEncoder
@@ -119,6 +124,11 @@ Decoders
.. autoclass:: texar.modules.AttentionRNNDecoderOutput
:members:

:hidden:`GPT2Decoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Decoder
:members:

:hidden:`TransformerDecoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.TransformerDecoder
@@ -192,6 +202,11 @@ Classifiers
.. autoclass:: texar.modules.BertClassifier
:members:

:hidden:`GPT2Classifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Classifier
:members:

:hidden:`Conv1DClassifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DClassifier
@@ -225,3 +240,8 @@ Pre-trained
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.BertBase
:members:

:hidden:`GPT2Base`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Base
:members:
156 changes: 37 additions & 119 deletions examples/gpt-2/README.md
@@ -1,38 +1,25 @@
# GPT-2: Pre-trained Language Model

This is a Texar implementation of the [OpenAI GPT-2 (Generative Pre-Training)](https://github.com/openai/gpt-2) language model, which allows loading official pre-trained model parameters, generating samples, fine-tuning the model, etc.
This is a Texar PyTorch implementation of the [OpenAI GPT-2 (Generative Pre-Training)](https://github.com/openai/gpt-2) language model, which allows loading official pre-trained model parameters, generating samples, fine-tuning the model, etc.

With Texar, building the GPT-2 model is as simple as creating a [`TransformerDecoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerdecoder) instance. We can initialize the parameters of the TransformerDecoder using a pre-trained GPT-2 checkpoint by calling `init_gpt2_checkpoint(path_to_gpt2_checkpoint)`.
Texar provides ready-to-use modules including [`GPT2Decoder`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2decoder), [`GPT2Encoder`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2encoder), [`GPT2Classifier`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2classifier), etc. This example shows the use of `GPT2Decoder` for generation tasks.
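
For orientation, here is a minimal sketch of constructing the decoder (mirroring the snippet near the end of this README; `gpt2_hparams` is a placeholder for a GPT-2 hyperparameter dict, e.g. assembled from the fields in `configs/config_model_117M.py`):

```python
from texar.modules import GPT2Decoder

# `gpt2_hparams` stands for a GPT-2 hyperparameter dict (an assumption here),
# e.g. built from the fields in configs/config_model_117M.py.
decoder = GPT2Decoder(hparams=gpt2_hparams)

# The decoder exposes its embedders, which later snippets in this README reuse.
word_embedder = decoder.word_embedder
position_embedder = decoder.position_embedder
```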

In sum, this example showcases:

* Constructing and using pre-trained GPT-2 models in Texar
* Using GPT-2 to generate text samples with or without context
* **Training or fine-tuning** the model with **distributed GPUs**
* **Training or fine-tuning** the model with **distributed GPUs** (coming soon)
* Examples of other use cases

## Quick Start (I) - Generation with the Pre-trained Model

### Download GPT-2 Pre-trained Model

Download the GPT-2 `117M` model checkpoint with the following command:
```
sh gpt2_pretrained_models/download_model.sh model_117M
```
By default, it will download a pretrained model named `model_117M` to `gpt2_pretrained_models/`.

To download the GPT-2 `345M` model checkpoint, use:
```
sh gpt2_pretrained_models/download_model.sh model_345M
```

### Usage
| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

#### Interactive mode (to generate samples with context)

This mode will initialize an interactive interface, which allows users to type in the context sentence. The model then generates a continuation of the context. Top-K sample decoding is used. By default, the GPT-2 `117M` model is used.
This mode will initialize an interactive interface, which allows users to type in the context sentence. The model then generates a continuation of the context. The example supports both Top-K and Top-P sample decoding. By default, the GPT-2 `117M` model with Top-K sample decoding is used.

```
python gpt2_generate_main.py --is_interactive \
@@ -45,26 +32,45 @@ Here:

- `is_interactive`: Specifies interactive mode.
- `max_decoding_length`: The maximum number of tokens in the sample. **Note that this includes tokens in the context**.
- `nsamples`: Number of samples to generate for each input.

For *top-k decoding*:

- `temperature`: Softmax temperature of top-k sample decoding. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to `0.7`.
- `top_k`: Number of top most likely candidates from a vocab distribution in each decoding step. Defaults to `40`.
- `nsamples`: Number of samples to generate for each input.
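
To make the `temperature` and `top_k` flags above concrete, here is a generic PyTorch sketch of one top-k sampling step (an illustration only, not the internal implementation of `gpt2_generate_main.py`):

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 40,
                 temperature: float = 0.7) -> torch.Tensor:
    """Illustrative top-k sampling for one decoding step.
    `logits` has shape [batch_size, vocab_size]; returns sampled token ids."""
    logits = logits / temperature                     # soften or sharpen the distribution
    top_values, top_indices = torch.topk(logits, k)   # keep the k most likely tokens
    probs = F.softmax(top_values, dim=-1)             # renormalize over those k candidates
    choice = torch.multinomial(probs, num_samples=1)  # sample one candidate per row
    return top_indices.gather(-1, choice)             # map back to vocabulary ids
```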

To use the GPT-2 `345M` model, specify `--pretrain_checkpoint` and `--config_model`:
For *top-p decoding*:
- `top_p`: Select tokens with cumulative probability of at most `top_p` as candidates for sampling. Do not specify it if you want to use top-k decoding.

To use the GPT-2 `345M` model, specify `--config_model`:

```
python gpt2_generate_main.py --is_interactive \
--max_decoding_length=100 \
--temperature=0.7 \
--top_k=40 \
--config_model=configs.config_model_345M \
--pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt
--config_model=configs.config_model_345M
```

Here:

- `pretrain_checkpoint`: Path to the model checkpoint. Defaults to `gpt2_pretrained_models/model_117M/model.ckpt`.
- `config_model`: Model configuration file. Defaults to `configs.config_model_117M`.

To use Top-P sample decoding, specify `--top_p`:

```
python gpt2_generate_main.py --is_interactive \
--max_decoding_length=100 \
--temperature=0.7 \
--top_p=0.9 \
--config_model=configs.config_model_345M
```

Here:

- `top_p`: Select tokens with cumulative probability of at most `p` when arranged in decreasing order. Defaults to `None`.
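
For comparison with the top-k sketch above, here is a generic PyTorch sketch of one top-p (nucleus) sampling step (again an illustration only, not the script's internal code):

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Illustrative top-p (nucleus) sampling for one decoding step.
    `logits` has shape [batch_size, vocab_size]; returns sampled token ids."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Mask tokens once the cumulative probability exceeds p, always keeping
    # at least the single most likely token.
    drop = cumulative > p
    drop[..., 1:] = drop[..., :-1].clone()
    drop[..., 0] = False
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    probs = F.softmax(sorted_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return sorted_indices.gather(-1, choice)
```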


**Example input:**
```
Model input >>> Micheal Jordan is the greatest player in history !
@@ -101,7 +107,7 @@ Here:
- `nsamples`: Total number of samples to generate; must be divisible by `batch_size`.
- `batch_size`: Each iteration generates `batch_size` number of samples.
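
A small plain-Python illustration of how the two flags relate (values arbitrary):

```python
nsamples, batch_size = 10, 5
assert nsamples % batch_size == 0, "nsamples must be divisible by batch_size"

samples = []
for step in range(nsamples // batch_size):
    # Each iteration would emit `batch_size` generated samples; strings stand in here.
    samples.extend(f"sample_{step}_{i}" for i in range(batch_size))
print(len(samples))  # 10
```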

To use the GPT-2 `345M` model, specify `--pretrain_checkpoint` and `--config_model` as above.
To use the GPT-2 `345M` model, specify `--config_model` as above.

**Example output:**

@@ -116,112 +122,24 @@ in this way".

## Quick Start (II) - Fine-tune the Pre-trained Model

This section shows how we can fine-tune the pre-trained GPT2 model and use the resulting model for generation.

First of all, **download** the pre-trained model [as above](https://github.com/asyml/texar/tree/master/examples/gpt-2#download-gpt-2-pre-trained-model).

### Prepare data

We first preprocess data with the GPT-2 BPE encoding.

A toy dataset is provided under [`data/toy/`](data/toy) which includes `train.txt`, `dev.txt`, and `test.txt`. This example will fit the GPT-2 model on `train.txt`, evaluate perplexity on `dev.txt`, and do continuation generation using `test.txt` as the context.

Run the following cmd to transform the data into [TFRecord](https://www.tensorflow.org/tutorials/load_data/tf_records) format and perform processing such as truncation, BPE encoding, adding special tokens, etc:

```
python prepare_data.py --data_dir data/toy
[--max_seq_length=128]
[--tfrecord_output_dir=data/toy]
[--pretrain_model_dir=gpt2_pretrained_models/model_117M]
```
- `data_dir`: The directory of raw data, wherein data files must be named 'train.txt', 'dev.txt', or 'test.txt'. It is *not* necessary to provide all three files.
- `max_seq_length`: The maximum sequence length after BPE encoding. This includes GPT-2 special tokens that will be automatically added. Longer sequences will be trimmed.
- `tfrecord_output_dir`: The output path where the resulting TFRecord files will be placed. By default, it is set to the same as `data_dir`.
- `pretrain_model_dir`: The downloaded pretrained model directory, wherein the vocabulary files are used for data processing.

The above cmd will output TFRecord files in the specified output directory. E.g., if `train.txt` is provided under `data_dir`, the output file `train.tf_record` will be produced under `tfrecord_output_dir`.
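
For intuition, here is a minimal TensorFlow sketch of writing one such record (the feature names `text_ids` and `length`, and the example ids, are assumptions for illustration, not necessarily what `prepare_data.py` produces):

```python
import tensorflow as tf

def write_record(writer, token_ids, max_seq_length=128):
    # Truncate to the maximum sequence length, then serialize the BPE token ids.
    token_ids = list(token_ids)[:max_seq_length]
    feature = {
        "text_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=token_ids)),
        "length": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[len(token_ids)])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

with tf.io.TFRecordWriter("data/toy/train.tf_record") as writer:
    write_record(writer, token_ids=[50256, 318, 257, 1332])  # arbitrary BPE ids
```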

### Train and Evaluate

For **single-GPU** training (and evaluation), run the following cmd. The cmd fine-tunes the pre-trained GPT-2 parameters and evaluates perplexity on the dev set.
```
python gpt2_train_main.py --do_train --do_eval
[--config_train=configs.config_train]
[--output_dir=output]
```
Here:

- `config_train`: Configurations of GPT-2 training, including data and optimization hyperparameters. By default, the config file [`configs/config_train.py`](configs/config_train.py) is used. Remember to specify the correct data path if you are using your own data.
- `output_dir`: The output path where checkpoints are saved.

By default, the GPT-2 `117M` model is used. To use the GPT-2 `345M` model instead, specify relevant FLAGS as below:
```
python gpt2_train_main.py --do_train --do_eval \
--config_model=configs.config_model_345M \
--pretrain_model_dir=gpt2_pretrained_models/model_345M \
--pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt
[--config_train=configs.config_train]
[--output_dir=output]
```
where `--pretrain_checkpoint` is necessary only when you want to load the pretrained checkpoint, and is ignored if `--checkpoint` is specified.

Please see the FLAGS in the code for more options.

For **multi-GPU training** on one or multiple machines, you may first install the prerequisite OpenMPI and Horovod packages, as detailed in the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example.

Then run the following cmd for training and evaluation. The cmd trains the model locally with 2 GPUs. Evaluation is performed with the single rank-0 GPU.
```
mpirun -np 2 \
-H localhost:2\
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl tcp,self \
-mca btl_tcp_if_include ens3 \
python gpt2_train_main.py --do_train --do_eval --distributed
[--config_train=configs.config_train]
[--output_dir=output]
```
The key configurations of multi-GPU training:

* `-np`: total number of processes
* `-H`: IP addresses of different servers and the number of processes used in each server. For example, `-H 192.168.11.22:1,192.168.33.44:1`
* `-mca`: sets the MPI communication interface. Use the setting specified above to avoid possible multiprocessing and network communication issues.

- The above configuration uses the `ens3` network interface. If this interface does not work in your environment (e.g., yielding the error message `Unknown interface name`), you may want to use a different interface (run `ifconfig` to see alternative interfaces in your environment).

Please refer to the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example for more details on the other multi-GPU configurations.

Make sure to specify the `--distributed` flag as above for multi-GPU training.


### Restore and Test

```
python gpt2_train_main.py --do_test --checkpoint=output/model.ckpt
[--config_train=config_train]
[--output_dir=output]
```

The output is by default saved in `output/test_samples.tsv`, where each line contains the context text and the generated continuation (separated with TAB).
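
A quick way to inspect that file (a sketch based only on the format described above):

```python
# Each line of output/test_samples.tsv is "<context>\t<generated continuation>".
with open("output/test_samples.tsv", "r", encoding="utf-8") as f:
    for line in f:
        context, continuation = line.rstrip("\n").split("\t", maxsplit=1)
        print(f"CONTEXT: {context}\nCONTINUATION: {continuation}\n")
```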

Coming soon!

## Other Use Cases

Texar's `TransformerDecoder` (and other RNN-based decoders) easily supports common, advanced, or customized use, such as:
Texar's `GPT2Decoder` (and other RNN-based decoders) easily supports common, advanced, or customized use, such as:

* Sample or continuation generation
* Greedy / (top-k) sample / Gumbel-softmax / beam-search / ... / your-customized decoding
* Training / fine-tuning in (un)conditional settings
* Perplexity evaluation

**For example**, after creating the language model
```python
def _embedding_fn(ids, times):
return word_embedder(ids) + pos_embedder(times)

```python
decoder = GPT2Decoder(hparams=gpt2_hparams)

decoder = TransformerDecoder(
output_layer=tf.transpose(word_embedder.embedding),
hparams=gpt2_hparams)
def _embedding_fn(ids, times):
return decoder.word_embedder(ids) + decoder.position_embedder(times)
```
We can do

@@ -258,7 +176,7 @@ output, output_length = decoder(
**Ex. Use 3): Fine-tuning for conditional generation**

```python
tgt_embed = word_embedder(truth_target[:, :-1]) + pos_embedder(sequence_length=tgt_len-1)
tgt_embed = decoder.word_embedder(truth_target[:, :-1]) + decoder.position_embedder(sequence_length=tgt_len-1)

output = decoder(
memory=source_hidden_states,
3 changes: 2 additions & 1 deletion examples/gpt-2/configs/config_model_117M.py
@@ -1,5 +1,6 @@
"""Texar config file of the GPT-2 model_117M model.
"""
pretrained_model_name = "117M"

vocab_size = 50257
dim = 768
@@ -8,7 +9,7 @@
"dim": dim,
}

pos_embed = {
position_embed = {
'dim': dim
}
position_size = 1024
3 changes: 2 additions & 1 deletion examples/gpt-2/configs/config_model_345M.py
@@ -1,5 +1,6 @@
"""Texar config file of the GPT-2 model_345M model.
"""
pretrained_model_name = "345M"

vocab_size = 50257
dim = 1024
@@ -8,7 +9,7 @@
"dim": dim,
}

pos_embed = {
position_embed = {
"dim": dim
}
position_size = 1024
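
For reference, the config files above are plain Python modules, so their fields can be inspected directly (a sketch; the printed values are taken from the `345M` file above):

```python
import importlib

# Load the model config selected via --config_model.
config_model = importlib.import_module("configs.config_model_345M")

print(config_model.pretrained_model_name)  # "345M"
print(config_model.dim)                    # 1024-dimensional hidden/embedding size
print(config_model.position_size)          # 1024 positions
```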
