## Text generation using tensor2tensor on Cloud ML Engine

This notebook illustrates using the <a href="https://github.com/tensorflow/tensor2tensor">tensor2tensor</a> library to do from-scratch, distributed training of a poetry model. Then, the trained model is used to complete new poems.
<p/>
### Install tensor2tensor, and specify Google Cloud Platform project and bucket

Install the necessary packages. tensor2tensor will give us the Transformer model. Project Gutenberg gives us access to historical poems.

In [3]:
%bash
pip freeze | grep tensor

tensorflow==1.4.1
tensorflow-tensorboard==0.4.0rc3


In [None]:
%bash
pip install --upgrade tensor2tensor gutenberg

In [5]:
%bash
pip freeze | grep tensor

tensor2tensor==1.4.2
tensorflow==1.4.1
tensorflow-tensorboard==0.4.0rc3


In [1]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# this is what this notebook is demonstrating
PROBLEM = 'poetry_line_problem'

# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['PROBLEM'] = PROBLEM

In [5]:
! gcloud config set project $PROJECT

Updated property [core/project].


Updates are available for some Cloud SDK components.  To install them,
please run:
  $ gcloud components update



### Download data

We will get some <a href="https://www.gutenberg.org/wiki/Poetry_(Bookshelf)">poetry anthologies</a> from Project Gutenberg.

In [4]:
%bash
rm -rf data/poetry
mkdir -p data/poetry

In [26]:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
import re

books = [
  # bookid, skip N lines
  (19221, 223),
  (15553, 522) 
]

with open('data/poetry/raw.txt', 'w') as ofp:
  for (id_nr, toskip) in books:
    text = strip_headers(load_etext(id_nr)).strip()
    lines = text.split('\n')[toskip:]
    # any line that is all upper case is a title or author name
    for line in lines:
      if len(line) > 0 and line.upper() != line:
        cleaned = re.sub('[^a-z\'\-]+', ' ', line.strip().lower())
        ofp.write(cleaned)
        ofp.write('\n')
      else:
        ofp.write('\n')

In [14]:
!wc -l data/poetry/*.txt

22544 data/poetry/raw.txt


## Create training dataset

We are going to train a machine learning model to write poetry given a starting point. We'll give it one line, and it will tell us the next line. So, naturally, we will train it on real poetry. Our feature will be a line of a poem and the label will be the next line of that poem.
<p>
Our training dataset will consist of two files.  The first file will consist of the input lines of poetry and the other file will consist of the corresponding output lines, one output line per input line.
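
To make the pairing concrete, here is a minimal in-memory sketch of the feature/label construction described above. The helper name `make_pairs` is hypothetical and the sample lines are taken from the anthology; the actual cell that follows writes the pairs to files and also splits them randomly into train and test sets.

```python
def make_pairs(lines):
    """Return (input_line, output_line) pairs for consecutive lines of a poem.

    Empty lines mark poem boundaries, so no pair spans two poems.
    """
    pairs = []
    prev_line = ''  # nothing has been seen before the first line
    for curr_line in lines:
        curr_line = curr_line.strip()
        if prev_line and curr_line:
            pairs.append((prev_line, curr_line))
        prev_line = curr_line
    return pairs


sample = [
    "spring the sweet spring is the year's pleasant king",
    "then blooms each thing then maids dance in a ring",
    "",  # poem boundary
    "phoebus arise",
    "and paint the sable skies",
]
for feature, label in make_pairs(sample):
    print(feature + ' -> ' + label)
```

Note that the blank line prevents the last line of the first poem from being paired with the first line of the second.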

In [17]:
from random import random
with open('data/poetry/raw.txt', 'r') as rawfp,\
  open('data/poetry/train_input.txt', 'w') as train_infp,\
  open('data/poetry/train_output.txt', 'w') as train_outfp,\
  open('data/poetry/test_input.txt', 'w') as test_infp,\
  open('data/poetry/test_output.txt', 'w') as test_outfp:
    
    prev_line = ''  # no previous line exists before the first line is read
    for curr_line in rawfp:
        curr_line = curr_line.strip()
        # poems break at empty lines, so this ensures we train only
        # on lines of the same poem
        if len(prev_line) > 0 and len(curr_line) > 0:
          if random() < 0.9:        
            train_infp.write(prev_line + '\n')
            train_outfp.write(curr_line + '\n')
          else:        
            test_infp.write(prev_line + '\n')
            test_outfp.write(curr_line + '\n')
        prev_line = curr_line      

In [18]:
!head -5 data/poetry/*.txt

==> data/poetry/raw.txt <==



spring the sweet spring is the year's pleasant king 
then blooms each thing then maids dance in a ring 

==> data/poetry/test_input.txt <==
phoebus arise
that she may thy career with roses spread
your furious chiding stay
when i have seen the hungry ocean gain
when i have seen such interchange of state

==> data/poetry/test_output.txt <==
and paint the sable skies
the nightingales thy coming eachwhere sing
let zephyr only breathe
advantage on the kingdom of the shore
or state itself confounded to decay

==> data/poetry/train_input.txt <==
spring the sweet spring is the year's pleasant king
then blooms each thing then maids dance in a ring
cold doth not sting the pretty birds do sing
the palm and may make country houses gay
lambs frisk and play the shepherds pipe all day

==> data/poetry/train_output.txt <==
then blooms each thing then maids dance in a ring
cold doth not sting the pretty birds do sing
cuckoo jug-jug pu-we to-

In [19]:
!tar cvfz poetrydata.tgz data/poetry

data/poetry/
data/poetry/raw.txt
data/poetry/test_input.txt
data/poetry/test_output.txt
data/poetry/train_input.txt
data/poetry/train_output.txt


### Set up problem
The Problem in tensor2tensor is where you specify parameters like the size of your vocabulary and where to get the training data from.
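
Since the Problem defined below reads its training data from the `poetrydata.tgz` archive created earlier, it can be worth confirming that the archive really contains the files the Problem will ask for. This is a minimal sketch using only the standard-library `tarfile` module; the helper name `missing_members` is hypothetical and not part of tensor2tensor.

```python
import os
import tarfile


def missing_members(archive_path, expected):
    """Return the expected member names that are absent from the archive."""
    with tarfile.open(archive_path, 'r:gz') as archive:
        present = set(archive.getnames())
    return [name for name in expected if name not in present]


expected_files = [
    'data/poetry/train_input.txt',
    'data/poetry/train_output.txt',
    'data/poetry/test_input.txt',
    'data/poetry/test_output.txt',
]
# Only run the check if the archive from the previous step is present.
if os.path.exists('poetrydata.tgz'):
    print(missing_members('poetrydata.tgz', expected_files))
```

An empty list means every file referenced by the Problem is in the archive.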

In [8]:
%bash
rm -rf poetry
mkdir -p poetry/trainer

In [7]:
%writefile poetry/trainer/problem.py
import os
import tensorflow as tf
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

TRAIN_DATASETS = [
    [
        "{}/poetrydata.tgz".format(os.getcwd()),
        ("data/poetry/train_input.txt",
         "data/poetry/train_output.txt")
    ],
]
TEST_DATASETS = [
    [
        "{}/poetrydata.tgz".format(os.getcwd()),
        ("data/poetry/test_input.txt",
         "data/poetry/test_output.txt")
    ],
]

@registry.register_problem
class PoetryLineProblem(translate.TranslateProblem):
  @property
  def targeted_vocab_size(self):
    return 2**12  # 4096

  @property
  def vocab_name(self):
    return "vocab.poetry_anthology"
  
  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size, sources=TRAIN_DATASETS)
    datasets = TRAIN_DATASETS if train else TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "wmt_ende_tok_%s" % tag)
    return translate.token_generator(data_path + ".lang1", data_path + ".lang2",
                           symbolizer_vocab, text_encoder.EOS_ID)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_TOK

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_TOK

# Smaller than the typical translate model, and with more regularization
@registry.register_hparams
def transformer_poetry():
  hparams = transformer.transformer_base()
  hparams.num_hidden_layers = 2
  hparams.hidden_size = 128
  hparams.filter_size = 512
  hparams.num_heads = 4
  hparams.attention_dropout = 0.6
  hparams.layer_prepostprocess_dropout = 0.6
  hparams.learning_rate = 0.05
  return hparams

Overwriting poetry/trainer/problem.py


In [10]:
%%writefile poetry/trainer/__init__.py
from . import problem

Writing poetry/trainer/__init__.py


In [11]:
%%writefile poetry/setup.py
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
  'tensor2tensor'
]

setup(
    name='poetry',
    version='0.1',
    author = 'Google',
    author_email = 'training-feedback@cloud.google.com',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Poetry Line Problem',
    requires=[]
)

Writing poetry/setup.py


## Generate training data 

Our problem (translation) requires the creation of text sequences from the training dataset.  This is done using t2t-datagen and the Problem defined in the previous section. 

In [None]:
%bash
DATA_DIR=./t2t_data
TMP_DIR=$DATA_DIR/tmp
rm -rf $DATA_DIR $TMP_DIR
mkdir -p $DATA_DIR $TMP_DIR
# Generate data
t2t-datagen \
  --t2t_usr_dir=./poetry/trainer \
  --problem=$PROBLEM \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR

## Provide Cloud ML Engine access to data

Copy the data to Google Cloud Storage, and then give the Cloud ML Engine service account access to it.

In [None]:
%bash
DATA_DIR=./t2t_data
gsutil -m rm -r gs://${BUCKET}/poetry/
gsutil -m cp ${DATA_DIR}/${PROBLEM}* ${DATA_DIR}/vocab* gs://${BUCKET}/poetry/data

In [None]:
%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print response['serviceAccount']")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

## Train model as a Python package

To submit the training job to Cloud Machine Learning Engine, we need a Python module with a main(). We'll use the t2t-trainer that is distributed with tensor2tensor as the main.

In [None]:
%bash
wget https://raw.githubusercontent.com/tensorflow/tensor2tensor/master/tensor2tensor/bin/t2t-trainer
mv t2t-trainer poetry/trainer/t2t-trainer.py

In [16]:
!touch poetry/__init__.py

In [17]:
!find poetry

poetry
poetry/__init__.py
poetry/setup.py
poetry/trainer
poetry/trainer/__init__.py
poetry/trainer/__init__.pyc
poetry/trainer/problem.py
poetry/trainer/problem.pyc
poetry/trainer/t2t-trainer.py


Let's test that the Python package works. Since we are running this locally, I'll try it out on a subset of the original data.

In [None]:
%bash
BASE=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/subset
gsutil -m rm -r $OUTDIR
gsutil -m cp \
    ${BASE}/${PROBLEM}-train-0008* \
    ${BASE}/${PROBLEM}-dev-00000*  \
    ${BASE}/vocab* \
    $OUTDIR

Note: the following will work only if you are running Datalab on a beefy machine, for example, if you started Datalab on a machine with a GPU. Otherwise, don't be alarmed if your process is killed.

In [None]:
%bash
OUTDIR=./trained_model
rm -rf $OUTDIR
export PYTHONPATH=${PYTHONPATH}:${PWD}/poetry
python -m trainer.t2t-trainer \
  --data_dir=gs://${BUCKET}/poetry/subset \
  --problems=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_poetry \
  --output_dir=$OUTDIR --job-dir=$OUTDIR --train_steps=10

## Train on Cloud ML Engine

Once we have a working Python package, training on a Cloud ML Engine GPU is straightforward.
To run on a single GPU, you would specify 
```
--scale-tier=BASIC_GPU
...
--train_steps=5000
--worker_gpu=1
```

In [None]:
%bash
OUTDIR=gs://${BUCKET}/poetry/model
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --module-name=trainer.t2t-trainer \
   --package-path=${PWD}/poetry/trainer \
   --job-dir=$OUTDIR \
   --runtime-version=1.4 \
   -- \
  --data_dir=gs://${BUCKET}/poetry/data \
  --problems=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_poetry \
  --output_dir=$OUTDIR \
  --train_steps=5000 --worker_gpu=1

The job took about <b>20 minutes</b> for me and ended with these evaluation metrics:
<pre>
Saving dict for global step 6000: global_step = 6000, loss = 4.98682, metrics-poetry_line_problem/accuracy = 0.191315, metrics-poetry_line_problem/accuracy_per_sequence = 0.0, metrics-poetry_line_problem/accuracy_top5 = 0.319305, metrics-poetry_line_problem/approx_bleu_score = 0.00794831, metrics-poetry_line_problem/neg_log_perplexity = -5.50358, metrics-poetry_line_problem/rouge_2_fscore = 0.0171307, metrics-poetry_line_problem/rouge_L_fscore = 0.187759
</pre>
Notice that accuracy_per_sequence is 0. Considering that we are asking the NN to be rather creative, that doesn't surprise me. Why am I looking at accuracy_per_sequence and not the other metrics? Because it is more appropriate for the problem we are solving; metrics like BLEU score are better suited to translation.
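
To see why a per-sequence accuracy of zero can coexist with a per-token accuracy of about 0.19, here is a toy sketch of the two measures. It is an illustration only, not tensor2tensor's implementation (which works on padded tensors of subword ids); the function names and the example predictions are made up for this sketch.

```python
def token_accuracy(predictions, targets):
    """Fraction of target tokens predicted correctly, position by position."""
    correct = 0
    total = 0
    for pred, target in zip(predictions, targets):
        for i, token in enumerate(target):
            total += 1
            if i < len(pred) and pred[i] == token:
                correct += 1
    return float(correct) / total if total else 0.0


def sequence_accuracy(predictions, targets):
    """Fraction of sequences where every token matches the target exactly."""
    if not targets:
        return 0.0
    exact = sum(1 for pred, target in zip(predictions, targets) if pred == target)
    return float(exact) / len(targets)


targets = [
    'and paint the sable skies'.split(),
    'let zephyr only breathe'.split(),
]
predictions = [
    'and paint the silver skies'.split(),  # one wrong token
    'let zephyr only breathe'.split(),     # exact match
]
print(token_accuracy(predictions, targets))
print(sequence_accuracy(predictions, targets))
```

A single wrong token makes the whole line count as a miss for the per-sequence measure, which is why it is so much harder to move than the per-token accuracy.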

In [None]:
%bash
gsutil ls gs://${BUCKET}/poetry/model

## Training longer

Let's train on 4 GPUs for 75,000 steps. Does the model improve?
Note the change from above; I am specifying:
```
--scale-tier=CUSTOM --config four_gpus.json 
...
--train_steps=75000
--worker_gpu=4
```

In [20]:
%writefile four_gpus.json
{
  "trainingInput": {
    "scaleTier": "CUSTOM",
    "masterType": "complex_model_m_gpu"
  }
}

Writing four_gpus.json


In [None]:

XXX  This takes 12 hours on 4 GPUs. Remove this line if you are sure you want to do this.

%bash
OUTDIR=gs://${BUCKET}/poetry/model_full
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=CUSTOM --config four_gpus.json \
   --module-name=trainer.t2t-trainer \
   --package-path=${PWD}/poetry/trainer \
   --job-dir=$OUTDIR \
   --runtime-version=1.4 \
   -- \
  --data_dir=gs://${BUCKET}/poetry/data \
  --problems=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_poetry \
  --output_dir=$OUTDIR \
  --train_steps=75000 --worker_gpu=4

This job took <b>12 hours</b> for me and ended with these metrics:
<pre>
global_step = 76000, loss = 4.99763, metrics-poetry_line_problem/accuracy = 0.219792, metrics-poetry_line_problem/accuracy_per_sequence = 0.0192308, metrics-poetry_line_problem/accuracy_top5 = 0.37618, metrics-poetry_line_problem/approx_bleu_score = 0.017955, metrics-poetry_line_problem/neg_log_perplexity = -5.38725, metrics-poetry_line_problem/rouge_2_fscore = 0.0325563, metrics-poetry_line_problem/rouge_L_fscore = 0.210618
</pre>
At least the accuracy per sequence is no longer zero; it is now 0.0192308. Note that we are using a relatively small dataset (12K lines), which is *tiny* in the world of natural language problems.
<p>
To set your expectations correctly: a high-performing translation model needs 400 million lines of input and takes a whole day on a TPU pod!
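
One way to compare the two runs on a more intuitive scale is to convert the reported `neg_log_perplexity` values into perplexities. The sketch below assumes the metric is a mean per-token log-likelihood in natural-log units, so that perplexity is `exp(-neg_log_perplexity)`; that interpretation is an assumption on my part, so treat the numbers as indicative.

```python
import math


def perplexity(neg_log_perplexity):
    """Convert a negative log-perplexity (assumed natural log) to perplexity."""
    return math.exp(-neg_log_perplexity)


short_run = perplexity(-5.50358)  # value reported by the 5,000-step job
long_run = perplexity(-5.38725)   # value reported by the 75,000-step job
print('short run: %.1f, long run: %.1f' % (short_run, long_run))
```

Lower is better, so the longer run is an improvement, though a modest one.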

## Batch-predict

How will our poetry model do when faced with Rumi's spiritual couplets?

In [1]:
%writefile data/poetry/rumi.txt
Where did the handsome beloved go?
I wonder, where did that tall, shapely cypress tree go?
He spread his light among us like a candle.
Where did he go? So strange, where did he go without me?
All day long my heart trembles like a leaf.
All alone at midnight, where did that beloved go?
Go to the road, and ask any passing traveler — 
That soul-stirring companion, where did he go?
Go to the garden, and ask the gardener — 
That tall, shapely rose stem, where did he go?
Go to the rooftop, and ask the watchman — 
That unique sultan, where did he go?
Like a madman, I search in the meadows!
That deer in the meadows, where did he go?
My tearful eyes overflow like a river — 
That pearl in the vast sea, where did he go?
All night long, I implore both moon and Venus — 
That lovely face, like a moon, where did he go?
If he is mine, why is he with others?
Since he’s not here, to what “there” did he go?
If his heart and soul are joined with God,
And he left this realm of earth and water, where did he go?
Tell me clearly, Shams of Tabriz,
Of whom it is said, “The sun never dies” — where did he go?

Overwriting data/poetry/rumi.txt


Let's write out the odd-numbered lines. We'll compare how close our model can get to the beauty of Rumi's second lines given his first.

In [2]:
%bash
awk 'NR % 2 == 1' data/poetry/rumi.txt | tr '[:upper:]' '[:lower:]' | sed "s/[^a-z\'-\ ]//g" > data/poetry/rumi_leads.txt
head -3 data/poetry/rumi_leads.txt

where did the handsome beloved go
he spread his light among us like a candle
all day long my heart trembles like a leaf
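
For readers who prefer Python to shell pipelines, the selection and cleaning step above can be sketched as follows. This is an approximate equivalent, not a byte-for-byte port of the `awk | tr | sed` command; the function name `lead_lines` is hypothetical.

```python
import re


def lead_lines(text):
    """Return the cleaned odd-numbered lines (1st, 3rd, ...) of the text."""
    leads = []
    for index, line in enumerate(text.splitlines()):
        if index % 2 == 0:  # index 0 is line 1, so even indexes are odd lines
            # keep only lowercase letters, apostrophes, hyphens and spaces
            cleaned = re.sub(r"[^a-z' -]", '', line.lower())
            leads.append(cleaned.strip())
    return leads


couplets = (
    "Where did the handsome beloved go?\n"
    "I wonder, where did that tall, shapely cypress tree go?\n"
    "He spread his light among us like a candle.\n"
    "Where did he go? So strange, where did he go without me?"
)
print(lead_lines(couplets))
```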


In [None]:
%bash
# same as the above training job ...
TOPDIR=gs://${BUCKET}
OUTDIR=${TOPDIR}/poetry/model_full  # or ${TOPDIR}/poetry/model
DATADIR=${TOPDIR}/poetry/data
MODEL=transformer
HPARAMS=transformer_poetry

# the file with the input lines
DECODE_FILE=data/poetry/rumi_leads.txt

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATADIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$OUTDIR \
  --t2t_usr_dir=./poetry/trainer \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE

<b>Note:</b> if you get an error such as "AttributeError: 'HParams' object has no attribute 'problems'", please <b>Reset Session</b>, run the cell that defines PROBLEM, and then run the above cell again.

In [14]:
%bash  
DECODE_FILE=data/poetry/rumi_leads.txt
cat ${DECODE_FILE}.*.decodes

and the old familiar faces
and gave him low
i'll borrow
the rapid of the valleys of hall
and all the world is gone
and let me to thee
i'll borrow
and thou art thou art gone
and nothing more
a famous victory
and he is marching on his hard heart
and many a passage


Some of these are still phrases and not complete sentences. This indicates that we might need to train longer or otherwise improve the training. We need to diagnose the model ...
<p>
### Diagnosing training run
<p>
Let's diagnose the training run to see what we'd improve the next time around.

In [15]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://{}/poetry/model_full'.format(BUCKET))

18780

In [7]:
from google.datalab.ml import TensorBoard
TensorBoard().stop(13280)
print 'stopped TensorBoard'

stopped TensorBoard


<table>
<tr>
<td><img src="diagrams/poetry_loss.png"/></td>
<td><img src="diagrams/poetry_acc.png"/></td>
</tr>
</table>
Looking at the loss curve, it is clear that we are overfitting (note that the orange training curve is well below the blue eval curve). Both loss curves and the accuracy-per-sequence curve, which is our key evaluation measure, plateau after 40k steps. (The red curve is a faster way of computing the evaluation metric and can be ignored.) So, how do we improve the model? Well, we need to reduce overfitting and make sure the eval metrics keep improving as long as the training loss is going down.
<p>
What we really need to do is to get more data, but if that's not an option, we could try reducing the size of the NN and increasing the dropout regularization. We could also do hyperparameter tuning on the dropout and network sizes.
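
As a starting point for such tuning, here is a small sketch that enumerates candidate settings as comma-separated `name=value` strings. The hyperparameter names are the ones set in `transformer_poetry` above; the candidate values and the helper name `hparams_overrides` are illustrative assumptions, and I am assuming the strings would be passed to the trainer through its `--hparams` flag on top of `--hparams_set=transformer_poetry`.

```python
import itertools


def hparams_overrides(dropouts, hidden_sizes):
    """Yield one override string per dropout/hidden-size combination."""
    for dropout, hidden_size in itertools.product(dropouts, hidden_sizes):
        yield ','.join([
            'attention_dropout=%s' % dropout,
            'layer_prepostprocess_dropout=%s' % dropout,
            'hidden_size=%d' % hidden_size,
            # keep the 4x ratio of filter_size to hidden_size used in transformer_poetry
            'filter_size=%d' % (4 * hidden_size),
        ])


# Illustrative candidates: the current setting plus smaller/more regularized ones.
for overrides in hparams_overrides([0.6, 0.7], [64, 128]):
    print('--hparams=' + overrides)
```

Each printed line corresponds to one training job; comparing their eval curves in TensorBoard would show which combination overfits least.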

## Serving poetry

[TBD]

How would you serve these predictions? The easiest way would be to take t2t-decoder, wrap it in a Python Flask web application, and run it on a GCE instance with a GPU. It's just Python code, after all.
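
A rough sketch of that idea follows. It simply shells out to `t2t-decoder` with the same flags as the batch-predict cell, so every request pays the cost of loading the model; a real service would keep the model in memory. The function names, the `/complete` route, the `lines` JSON field and the `serving_leads.txt` file name are all hypothetical, and Flask is assumed to be installed.

```python
import glob
import subprocess


def build_decode_command(bucket, decode_file, problem='poetry_line_problem'):
    """Build the t2t-decoder argument list used in the batch-predict cell."""
    top_dir = 'gs://{}'.format(bucket)
    return [
        't2t-decoder',
        '--data_dir={}/poetry/data'.format(top_dir),
        '--problems={}'.format(problem),
        '--model=transformer',
        '--hparams_set=transformer_poetry',
        '--output_dir={}/poetry/model_full'.format(top_dir),
        '--t2t_usr_dir=./poetry/trainer',
        '--decode_hparams=beam_size=4,alpha=0.6',
        '--decode_from_file={}'.format(decode_file),
    ]


def complete_lines(bucket, lines, decode_file='serving_leads.txt'):
    """Write the input lines to a file, run t2t-decoder, and read the decodes."""
    with open(decode_file, 'w') as ofp:
        ofp.write('\n'.join(lines) + '\n')
    subprocess.check_call(build_decode_command(bucket, decode_file))
    # the decoder output is located with the same <file>.*.decodes pattern
    # that the notebook uses to display the batch predictions
    decoded_path = sorted(glob.glob(decode_file + '.*.decodes'))[-1]
    with open(decoded_path) as ifp:
        return [line.strip() for line in ifp]


def create_app(bucket):
    """Create a Flask app exposing POST /complete."""
    from flask import Flask, jsonify, request  # imported lazily on purpose

    app = Flask(__name__)

    @app.route('/complete', methods=['POST'])
    def complete():
        lines = request.get_json()['lines']
        return jsonify(completions=complete_lines(bucket, lines))

    return app
```

Calling `create_app(BUCKET).run(host='0.0.0.0', port=8080)` would start the development server; posting `{"lines": ["where did the handsome beloved go"]}` to `/complete` would then return the model's next line.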

Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.