
Added gpt-2 #91

Merged
merged 1 commit into from Feb 28, 2019
2 changes: 2 additions & 0 deletions .gitignore
Expand Up @@ -255,3 +255,5 @@ simple-examples.tgz
/examples/bert/bert_pretrained_models/
!/examples/bert/bert_pretrained_models/download_model.sh
/examples/bert/output
/examples/gpt-2/gpt2_pretrained_models/
!/examples/gpt-2/gpt2_pretrained_models/download_model.sh
150 changes: 150 additions & 0 deletions examples/gpt-2/README.md
@@ -0,0 +1,150 @@
# GPT-2: Pre-trained Language Model

This is a Texar implementation of the [OpenAI GPT-2 (Generative Pre-Training)](https://github.com/openai/gpt-2) language model, which lets you load the official pre-trained model parameters, generate samples, and more.

With Texar, building the GPT-2 model is as simple as creating a [`TransformerDecoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerdecoder) instance. We can initialize the parameters of the `TransformerDecoder` from a pre-trained GPT-2 checkpoint by calling `init_gpt2_checkpoint(path_to_gpt2_checkpoint)`.
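
A minimal sketch of that workflow, assuming the `117M` config module under `gpt2_config_lib/` and the example's `init_gpt2_checkpoint` helper; `path_to_gpt2_checkpoint` stands for the downloaded checkpoint directory, and the exact import paths are illustrative:

```python
import texar as tx

# Texar-style hyperparameters for the 117M model (see gpt2_config_lib/).
from gpt2_config_lib import config_model_117M as config

# GPT-2 is a word embedder feeding a TransformerDecoder.
embedder = tx.modules.WordEmbedder(
    vocab_size=config.vocab_size, hparams=config.embed)
decoder = tx.modules.TransformerDecoder(embedder, hparams=config.decoder)

# Load the official OpenAI weights into the Texar variables
# (helper provided by this example).
init_gpt2_checkpoint(path_to_gpt2_checkpoint)
```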

In sum, this example showcases:

* Constructing and using pre-trained GPT-2 models in Texar
* Using GPT-2 to generate text samples with or without context
* Examples of other use cases

## Quick Start
### Download GPT-2 Pre-trained Model

Download the GPT-2 model checkpoint with the following command:
```
sh gpt2_pretrained_models/download_model.sh 117M
```
By default, it will download a pretrained model named `117M` to `gpt2_pretrained_models/`.

### Usage
| WARNING: Samples are unfiltered and may contain offensive content. |
| --- |

#### Interactive mode (to generate samples with context)

This mode launches an interactive interface where you type in a context sentence; the model then generates a continuation of the context. Top-k sample decoding is used.

```
python generative_pretraining_main.py --is_interactive \
--max_decoding_length=100 \
--temperature=0.7 \
--top_k=40
```

Here:

- `is_interactive`: Specifies interactive mode.
- `max_decoding_length`: The maximum number of tokens in the sample. **Note that this includes tokens in the context**.
- `temperature`: Softmax temperature of top-k sample decoding. Larger values (above 1.0) produce more random samples, while smaller values push the sampling distribution towards the argmax; must be strictly greater than 0. Defaults to `0.7`. See the short sketch after this list.
- `top_k`: Number of top most-likely candidates retained from the vocabulary distribution at each decoding step. Defaults to `40`.
- `nsamples`: Number of samples to generate for each input.
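
As referenced above, here is a minimal NumPy sketch of how `top_k` and `temperature` interact during sampling (illustrative only, not the example's actual decoding code):

```python
import numpy as np

def top_k_sample(logits, k=40, temperature=0.7):
    """Sample one token id from the temperature-scaled top-k distribution."""
    # Keep only the k largest logits; mask everything else out.
    top_ids = np.argsort(logits)[-k:]
    masked = np.full(len(logits), -np.inf)
    masked[top_ids] = logits[top_ids]
    # temperature < 1 sharpens the distribution (towards the argmax);
    # temperature > 1 flattens it (more random samples).
    scaled = masked / temperature
    probs = np.exp(scaled - scaled[top_ids].max())
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)
```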

**Example input:**
```
Model input >>> Micheal Jordan is the greatest player in history !
```
**Example output:**
```
======================================== SAMPLE 1 ========================================

He's the one who has made all the difference. He's a true legend. He's a great athlete,
a great athlete. He's a great athlete. I'm so happy for him. I'm so happy for his family,
the family, and I'm so happy for him. I'm so happy for his teammates, his teammates, and
I'm so happy for him.

The last time we saw him on stage, he

================================================================================
```

#### Non-interactive mode (to generate samples from scratch)

This mode generates a batch of samples from scratch.

```
python generative_pretraining_main.py \
--nsamples=1 \
--batch_size=1 \
--max_decoding_length=100 \
--temperature=0.7 \
--top_k=40
```

Here:

- `nsamples`: Total number of samples to generate; must be divisible by `batch_size` (see the sketch after this list).
- `batch_size`: Each iteration generates `batch_size` samples.
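
A trivial sketch of the divisibility constraint referenced above (hypothetical values; the actual generation loop lives in `generative_pretraining_main.py`):

```python
nsamples, batch_size = 8, 4
assert nsamples % batch_size == 0, "nsamples must be divisible by batch_size"

# Generation proceeds in full batches:
for _ in range(nsamples // batch_size):
    pass  # each iteration produces batch_size samples
```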

**Example output:**

```
"A new government and a healthy economy have a chance to take this up."

After he said the election's outcome in the House was important and had helped to build
confidence in the House, former Ukip leader Nigel Farage spoke about working to boost
the economy, saying the vote for the "lefties" and others "were bad optics for Labour
in this way".
```

## Other Use Cases

Texar's `TransformerDecoder` (and other RNN-based decoders) readily supports common, advanced, and customized use cases, such as:

* Sample or continuation generation
* Greedy / (top-k) sample / Gumbel-softmax / beam-search / ... / your-customized decoding
* Training / fine-tuning in (un)conditional settings
* Perplexity evaluation

**For example**, after creating the language model
```python
decoder = TransformerDecoder(embedder, hparams=gpt2_hparams)
```
we can do the following:

**Ex. Use 1): Continuation generation w/ greedy decoding**

```python
output, output_length = decoder(
    context=ctx,
    context_sequence_length=ctx_len,
    decoding_strategy='infer_greedy',
    end_token=end_token)

sample_id = output.sample_id
logits = output.logits
```

**Ex. Use 2): Top-k sample decoding**

```python
topk_helper = tx.modules.TopKSampleEmbeddingHelper(
    embedding=embedder,
    start_tokens=ctx[:, 0],
    end_token=end_token,
    top_k=20,
    softmax_temperature=0.7)

output, output_length = decoder(
    context=ctx,
    context_sequence_length=ctx_len,
    helper=topk_helper)
```

**Ex. Use 3): Fine-tuning for conditional generation**

```python
output, output_length = decoder(
    memory=source_hidden_states,
    memory_sequence_length=src_len,
    inputs=truth_target,
    sequence_length=tgt_len - 1,
    decoding_strategy='train_greedy')

loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=truth_target[:, 1:],
    logits=output.logits,
    sequence_length=tgt_len - 1)
```
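
The perplexity-evaluation use case listed earlier is not spelled out in this example; here is a minimal sketch under the same setup as Ex. 3. The averaging flags are standard Texar loss options, but treat the exact fragment as illustrative rather than as code from this PR:

```python
import tensorflow as tf

# Mean negative log-likelihood per token over the teacher-forced targets,
# reusing `output`, `truth_target`, and `tgt_len` from Ex. 3.
nll = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=truth_target[:, 1:],
    logits=output.logits,
    sequence_length=tgt_len - 1,
    average_across_timesteps=True,
    sum_over_timesteps=False)

perplexity = tf.exp(nll)  # lower is better
```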
3 changes: 3 additions & 0 deletions examples/gpt-2/gpt2_config_lib/README.md
@@ -0,0 +1,3 @@
### Configuration files of GPT-2 models in Texar style

For example, `config_model_117M.py` is the Texar configuration file corresponding to the `117M` model downloaded from [GPT-2 official release](https://github.com/openai/gpt-2).
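
A quick illustration of how these config modules are consumed (the import path is assumed; the asserted values come straight from `config_model_117M.py`):

```python
from gpt2_config_lib import config_model_117M as config

# The module exposes plain Python dicts mirroring OpenAI's 117M
# hyperparameters: 12 blocks, 12 attention heads, hidden size 768.
assert config.vocab_size == 50257
assert config.decoder["num_blocks"] == 12
assert config.decoder["multihead_attention"]["num_heads"] == 12
```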
Empty file.
56 changes: 56 additions & 0 deletions examples/gpt-2/gpt2_config_lib/config_model_117M.py
@@ -0,0 +1,56 @@
"""Texar config file of the GPT-2 117M model.
"""

vocab_size = 50257
dim = 768

embed = {
    "dim": dim,
}

decoder = {
    "scale_embeds": False,
    "dim": dim,
    "num_blocks": 12,
    "multihead_attention": {
        "use_bias": True,
        "num_units": dim,
        "num_heads": 12,
        "output_dim": dim,
    },
    "position_embedder_type": "simple",
    "position_size": 1024,
    "position_embedder_hparams": {
        "dim": dim,
    },
    "initializer": {
        "type": "variance_scaling_initializer",
        "kwargs": {
            "scale": 1.0,
            "mode": "fan_avg",
            "distribution": "uniform",
        },
    },
    "poswise_feedforward": {
        "layers": [
            {
                "type": "Dense",
                "kwargs": {
                    "name": "conv1",
                    "units": dim * 4,
                    "activation": "gelu",
                    "use_bias": True,
                }
            },
            {
                "type": "Dense",
                "kwargs": {
                    "name": "conv2",
                    "units": dim,
                    "use_bias": True,
                }
            }
        ],
        "name": "ffn",
    },
}