Add GPT-2 modules & Polish GPT-2 examples (#99)
* Add GPT2 modules and Polish GPT2 example
gpengzhi committed Jul 12, 2019
1 parent 4f414cd commit c04990d
Showing 21 changed files with 2,064 additions and 338 deletions.
20 changes: 20 additions & 0 deletions docs/code/modules.rst
@@ -64,6 +64,11 @@ Encoders
.. autoclass:: texar.modules.BertEncoder
:members:

:hidden:`GPT2Encoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Encoder
:members:

:hidden:`Conv1DEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DEncoder
@@ -119,6 +124,11 @@ Decoders
.. autoclass:: texar.modules.AttentionRNNDecoderOutput
:members:

:hidden:`GPT2Decoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Decoder
:members:

:hidden:`TransformerDecoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.TransformerDecoder
@@ -192,6 +202,11 @@ Classifiers
.. autoclass:: texar.modules.BertClassifier
:members:

:hidden:`GPT2Classifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Classifier
:members:

:hidden:`Conv1DClassifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DClassifier
@@ -225,3 +240,8 @@ Pre-trained
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.BertBase
:members:

:hidden:`GPT2Base`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GPT2Base
:members:
156 changes: 37 additions & 119 deletions examples/gpt-2/README.md
@@ -1,38 +1,25 @@
# GPT-2: Pre-trained Language Model

This is a Texar implementation of the [OpenAI GPT-2 (Generative Pre-Training)](https://github.com/openai/gpt-2) language model, which allows loading official pre-trained model parameters, generating samples, fine-tuning the model, etc.
This is a Texar PyTorch implementation of the [OpenAI GPT-2 (Generative Pre-Training)](https://github.com/openai/gpt-2) language model, which allows loading official pre-trained model parameters, generating samples, fine-tuning the model, etc.

With Texar, building the GPT-2 model is as simple as creating a [`TransformerDecoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerdecoder) instance. We can initialize the parameters of the TransformerDecoder using a pre-trained GPT-2 checkpoint by calling `init_gpt2_checkpoint(path_to_gpt2_checkpoint)`.
Texar provides ready-to-use modules including [`GPT2Decoder`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2decoder), [`GPT2Encoder`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2encoder), [`GPT2Classifier`](https://texar-pytorch.readthedocs.io/en/latest/code/modules.html#gpt2classifier), etc. This example shows the use of `GPT2Decoder` for generation tasks.
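
For orientation, here is a minimal sketch of constructing the decoder (mirroring the snippet near the end of this README; `gpt2_hparams` is a placeholder for a GPT-2 hyperparameter dict, e.g. assembled from the fields in `configs/config_model_117M.py`):

```python
from texar.modules import GPT2Decoder

# `gpt2_hparams` stands for a GPT-2 hyperparameter dict (an assumption here),
# e.g. built from the fields in configs/config_model_117M.py.
decoder = GPT2Decoder(hparams=gpt2_hparams)

# The decoder exposes its embedders, which later snippets in this README reuse.
word_embedder = decoder.word_embedder
position_embedder = decoder.position_embedder
```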

In sum, this example showcases:

* Constructing and using pre-trained GPT-2 models in Texar
* Using GPT-2 to generate text samples with or without context
* **Training or fine-tuning** the model with **distributed GPUs**
* **Training or fine-tuning** the model with **distributed GPUs** (coming soon)
* Examples of other use cases

## Quick Start (I) - Generation with the Pre-trained Model

### Download GPT-2 Pre-trained Model

Download the GPT-2 `117M` model checkpoint with the following command:
```
sh gpt2_pretrained_models/download_model.sh model_117M
```
By default, it will download a pretrained model named `model_117M` to `gpt2_pretrained_models/`.

To download the GPT-2 `345M` model checkpoint, use:
```
sh gpt2_pretrained_models/download_model.sh model_345M
```

### Usage
| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

#### Interactive mode (to generate samples with context)

This mode will initialize an interactive interface, which allows users to type in the context sentence. The model then generates a continuation of the context. Top-K sample decoding is used. By default, the GPT-2 `117M` model is used.
This mode will initialize an interactive interface, which allows users to type in the context sentence. The model then generates a continuation of the context. The example supports both Top-K and Top-P sample decoding. By default, the GPT-2 `117M` model with Top-K sample decoding is used.

```
python gpt2_generate_main.py --is_interactive \
@@ -45,26 +32,45 @@ Here:

- `is_interactive`: Specifies interactive mode.
- `max_decoding_length`: The maximum number of tokens in the sample. **Note that this includes tokens in the context**.
- `nsamples`: Number of samples to generate for each input.

For *top-k decoding*:

- `temperature`: Softmax temperature of top-k sample decoding. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to `0.7`.
- `top_k`: Number of top most likely candidates from a vocab distribution in each decoding step. Defaults to `40`.
- `nsamples`: Number of samples to generate for each input.
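
To make the `temperature` and `top_k` flags above concrete, here is a generic PyTorch sketch of one top-k sampling step (an illustration only, not the internal implementation of `gpt2_generate_main.py`):

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 40,
                 temperature: float = 0.7) -> torch.Tensor:
    """Illustrative top-k sampling for one decoding step.
    `logits` has shape [batch_size, vocab_size]; returns sampled token ids."""
    logits = logits / temperature                     # soften or sharpen the distribution
    top_values, top_indices = torch.topk(logits, k)   # keep the k most likely tokens
    probs = F.softmax(top_values, dim=-1)             # renormalize over those k candidates
    choice = torch.multinomial(probs, num_samples=1)  # sample one candidate per row
    return top_indices.gather(-1, choice)             # map back to vocabulary ids
```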

To use the GPT-2 `345M` model, specify `--pretrain_checkpoint` and `--config_model`:
For *top-p decoding*:
- `top_p`: Select tokens with cumulative probability of at most `top_p` as candidates for sampling. Do not specify it if you want to use top-k decoding.

To use the GPT-2 `345M` model, specify `--config_model`:

```
python gpt2_generate_main.py --is_interactive \
--max_decoding_length=100 \
--temperature=0.7 \
--top_k=40 \
--config_model=configs.config_model_345M \
--pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt
--config_model=configs.config_model_345M
```

Here:

- `pretrain_checkpoint`: Path to the model checkpoint. Defaults to `gpt2_pretrained_models/model_117M/model.ckpt`.
- `config_model`: Model configuration file. Defaults to `configs.config_model_117M`.

To use Top-P sample decoding, specify `--top_p`:

```
python gpt2_generate_main.py --is_interactive \
--max_decoding_length=100 \
--temperature=0.7 \
--top_p=0.9 \
--config_model=configs.config_model_345M
```

Here:

- `top_p`: Select tokens with cumulative probability of at most `p` when arranged in decreasing order. Defaults to `None`.
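
For comparison with the top-k sketch above, here is a generic PyTorch sketch of one top-p (nucleus) sampling step (again an illustration only, not the script's internal code):

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Illustrative top-p (nucleus) sampling for one decoding step.
    `logits` has shape [batch_size, vocab_size]; returns sampled token ids."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Mask tokens once the cumulative probability exceeds p, always keeping
    # at least the single most likely token.
    drop = cumulative > p
    drop[..., 1:] = drop[..., :-1].clone()
    drop[..., 0] = False
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    probs = F.softmax(sorted_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return sorted_indices.gather(-1, choice)
```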


**Example input:**
```
Model input >>> Micheal Jordan is the greatest player in history !
@@ -101,7 +107,7 @@ Here:
- `nsamples`: Total number of samples to generate; must be divisible by `batch_size`.
- `batch_size`: Each iteration generates `batch_size` number of samples.
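
A small plain-Python illustration of how the two flags relate (values arbitrary):

```python
nsamples, batch_size = 10, 5
assert nsamples % batch_size == 0, "nsamples must be divisible by batch_size"

samples = []
for step in range(nsamples // batch_size):
    # Each iteration would emit `batch_size` generated samples; strings stand in here.
    samples.extend(f"sample_{step}_{i}" for i in range(batch_size))
print(len(samples))  # 10
```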

To use the GPT-2 `345M` model, specify `--pretrain_checkpoint` and `--config_model` as above.
To use the GPT-2 `345M` model, specify `--config_model` as above.

**Example output:**

@@ -116,112 +122,24 @@ in this way".

## Quick Start (II) - Fine-tune the Pre-trained Model

This section shows how we can fine-tune the pre-trained GPT2 model and use the resulting model for generation.

First of all, **download** the pre-trained model [as above](https://github.com/asyml/texar/tree/master/examples/gpt-2#download-gpt-2-pre-trained-model).

### Prepare data

We first preprocess data with the GPT-2 BPE encoding.

A toy dataset is provided under [`data/toy/`](data/toy) which includes `train.txt`, `dev.txt`, and `test.txt`. This example will fit the GPT-2 model on `train.txt`, evaluate perplexity on `dev.txt`, and do continuation generation using `test.txt` as the context.

Run the following cmd to transform the data into [TFRecord](https://www.tensorflow.org/tutorials/load_data/tf_records) format and perform processing such as truncation, BPE encoding, adding special tokens, etc:

```
python prepare_data.py --data_dir data/toy
[--max_seq_length=128]
[--tfrecord_output_dir=data/toy]
[--pretrain_model_dir=gpt2_pretrained_models/model_117M]
```
- `data_dir`: The directory of raw data, wherein data files must be named 'train.txt', 'dev.txt', or 'test.txt'. It is *not* necessary to provide all three files.
- `max_seq_length`: The maximum sequence length after BPE encoding. This includes GPT-2 special tokens that will be automatically added. Longer sequences will be trimmed.
- `tfrecord_output_dir`: The output path where the resulting TFRecord files will be placed. By default, it is set to the same as `data_dir`.
- `pretrain_model_dir`: The downloaded pretrained model directory, wherein the vocabulary files are used for data processing.

The above cmd will output TFRecord files in the specified output directory. E.g., if `train.txt` is provided under `data_dir`, the output file `train.tf_record` will be produced under `tfrecord_output_dir`.
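
For intuition, here is a minimal TensorFlow sketch of writing one such record (the feature names `text_ids` and `length`, and the example ids, are assumptions for illustration, not necessarily what `prepare_data.py` produces):

```python
import tensorflow as tf

def write_record(writer, token_ids, max_seq_length=128):
    # Truncate to the maximum sequence length, then serialize the BPE token ids.
    token_ids = list(token_ids)[:max_seq_length]
    feature = {
        "text_ids": tf.train.Feature(
            int64_list=tf.train.Int64List(value=token_ids)),
        "length": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[len(token_ids)])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

with tf.io.TFRecordWriter("data/toy/train.tf_record") as writer:
    write_record(writer, token_ids=[50256, 318, 257, 1332])  # arbitrary BPE ids
```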

### Train and Evaluate

For **single-GPU** training (and evaluation), run the following cmd. The cmd fine-tunes the pre-trained GPT-2 parameters and evaluates perplexity on the dev set.
```
python gpt2_train_main.py --do_train --do_eval
[--config_train=configs.config_train]
[--output_dir=output]
```
Here:

- `config_train`: Configurations of GPT-2 training, including data and optimization hyperparameters. By default, the config file [`configs/config_train.py`](configs/config_train.py) is used. Remember to specify the correct data path if you are using your own data.
- `output_dir`: The output path where checkpoints are saved.

By default, the GPT-2 `117M` model is used. To use the GPT-2 `345M` model instead, specify relevant FLAGS as below:
```
python gpt2_train_main.py --do_train --do_eval \
--config_model=configs.config_model_345M \
--pretrain_model_dir=gpt2_pretrained_models/model_345M \
--pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt
[--config_train=configs.config_train]
[--output_dir=output]
```
where `--pretrain_checkpoint` is necessary only when you want to load the pretrained checkpoint, and is ignored if `--checkpoint` is specified.

Please see the FLAGS in the code for more options.

For **multi-GPU training** on one or multiple machines, you may first install the prerequisite OpenMPI and Horovod packages, as detailed in the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example.

Then run the following cmd for training and evaluation. The cmd trains the model locally with 2 GPUs. Evaluation is performed with the single rank-0 GPU.
```
mpirun -np 2 \
-H localhost:2\
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl tcp,self \
-mca btl_tcp_if_include ens3 \
python gpt2_train_main.py --do_train --do_eval --distributed
[--config_train=configs.config_train]
[--output_dir=output]
```
The key configurations of multi-GPU training:

* `-np`: total number of processes
* `-H`: IP addresses of different servers and the number of processes used in each server. For example, `-H 192.168.11.22:1,192.168.33.44:1`
* `-mca`: sets the MPI communication interface. Use the setting specified above to avoid possible multiprocessing and network communication issues.

- The above configuration uses the `ens3` network interface. If this interface does not work in your environment (e.g., yielding the error message `Unknown interface name`), you may want to use a different interface (run `ifconfig` to see alternative interfaces in your environment).

Please refer to the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example for more details on the other multi-GPU configurations.

Make sure to specify the `--distributed` flag as above for multi-GPU training.


### Restore and Test

```
python gpt2_train_main.py --do_test --checkpoint=output/model.ckpt
[--config_train=config_train]
[--output_dir=output]
```

The output is by default saved in `output/test_samples.tsv`, where each line contains the context text and the generated continuation (separated with TAB).
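
A quick way to inspect that file (a sketch based only on the format described above):

```python
# Each line of output/test_samples.tsv is "<context>\t<generated continuation>".
with open("output/test_samples.tsv", "r", encoding="utf-8") as f:
    for line in f:
        context, continuation = line.rstrip("\n").split("\t", maxsplit=1)
        print(f"CONTEXT: {context}\nCONTINUATION: {continuation}\n")
```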

Coming soon!

## Other Use Cases

Texar's `TransformerDecoder` (and other RNN-based decoders) easily supports common, advanced, or customized use, such as:
Texar's `GPT2Decoder` (and other RNN-based decoders) easily supports common, advanced, or customized use, such as:

* Sample or continuation generation
* Greedy / (top-k) sample / Gumbel-softmax / beam-search / ... / your-customized decoding
* Training / fine-tuning in (un)conditional settings
* Perplexity evaluation

**For example**, after creating the language model
```python
def _embedding_fn(ids, times):
return word_embedder(ids) + pos_embedder(times)

```python
decoder = GPT2Decoder(hparams=gpt2_hparams)

decoder = TransformerDecoder(
output_layer=tf.transpose(word_embedder.embedding),
hparams=gpt2_hparams)
def _embedding_fn(ids, times):
return decoder.word_embedder(ids) + decoder.position_embedder(times)
```
We can do

@@ -258,7 +176,7 @@ output, output_length = decoder(
**Ex. Use 3): Fine-tuning for conditional generation**

```python
tgt_embed = word_embedder(truth_target[:, :-1]) + pos_embedder(sequence_length=tgt_len-1)
tgt_embed = decoder.word_embedder(truth_target[:, :-1]) + decoder.position_embedder(sequence_length=tgt_len-1)

output = decoder(
memory=source_hidden_states,
3 changes: 2 additions & 1 deletion examples/gpt-2/configs/config_model_117M.py
@@ -1,5 +1,6 @@
"""Texar config file of the GPT-2 model_117M model.
"""
pretrained_model_name = "117M"

vocab_size = 50257
dim = 768
@@ -8,7 +9,7 @@
"dim": dim,
}

pos_embed = {
position_embed = {
'dim': dim
}
position_size = 1024
3 changes: 2 additions & 1 deletion examples/gpt-2/configs/config_model_345M.py
@@ -1,5 +1,6 @@
"""Texar config file of the GPT-2 model_345M model.
"""
pretrained_model_name = "345M"

vocab_size = 50257
dim = 1024
@@ -8,7 +9,7 @@
"dim": dim,
}

pos_embed = {
position_embed = {
"dim": dim
}
position_size = 1024
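
For reference, the config files above are plain Python modules, so their fields can be inspected directly (a sketch; the printed values are taken from the `345M` file above):

```python
import importlib

# Load the model config selected via --config_model.
config_model = importlib.import_module("configs.config_model_345M")

print(config_model.pretrained_model_name)  # "345M"
print(config_model.dim)                    # 1024-dimensional hidden/embedding size
print(config_model.position_size)          # 1024 positions
```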
