# Text generation using tensor2tensor on Cloud AI Platform

This notebook illustrates the use of the [tensor2tensor](https://github.com/tensorflow/tensor2tensor) library to do from-scratch, distributed training of a Transformer model. The trained model is then used to complete new poems.

## Install tensor2tensor, and specify Google Cloud Platform project and bucket

Install the necessary packages. tensor2tensor will give us the Transformer model. Project Gutenberg gives us access to historical poems.

**p.s.** Note that this notebook uses Python 2 because the Project Gutenberg package relies on BSD-DB, which was deprecated in Python 2.6 and removed from the standard library in Python 3. tensor2tensor itself can be used on Python 3.

In [None]:
%%bash
pip freeze | grep tensor

In [None]:
%%bash
pip install tensor2tensor==1.13.1 tensorflow==1.13.1 tensorflow-serving-api==1.13 gutenberg
pip install tensorflow_hub

# install from source
#git clone https://github.com/tensorflow/tensor2tensor.git
#cd tensor2tensor
#yes | pip install --user -e .

In [None]:
%%bash
pip freeze | grep tensor

In [None]:
import os
PROJECT = "cloud-training-demos" # Replace with your Project ID
BUCKET = "cloud-training-demos-ml" # Replace with your bucket name
REGION = "us-central1" # Replace with your bucket region

# this is what this notebook is demonstrating
PROBLEM = "poetry_line_problem"

# for bash
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["PROBLEM"] = PROBLEM

# os.environ["PATH"] = os.environ["PATH"] + ":" + os.getcwd() + "/tensor2tensor/tensor2tensor/bin"

In [None]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## Download data

We will get some [poetry anthologies](https://www.gutenberg.org/wiki/Poetry_(Bookshelf)) from Project Gutenberg.

In [None]:
%%bash
rm -rf data/poetry
mkdir -p data/poetry

In [None]:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
import re

books = [
    # bookid, skip N lines, title
    (26715, 1000, "Victorian songs"),
    (30235, 580, "Baldwin collection"),
    (35402, 710, "Swinburne collection"),
    (574, 15, "Blake"),
    (1304, 172, "Bulchevys collection"),
    (19221, 223, "Palgrave-Pearse collection"),
    (15553, 522, "Knowles collection")
]

with open("data/poetry/raw.txt", "w") as ofp:
    lineno = 0
    for (id_nr, toskip, title) in books:
        startline = lineno
        text = strip_headers(load_etext(id_nr)).strip()
        lines = text.split("\n")[toskip:]
        # any line that is all upper case is a title or author name
        # also don't want any lines with years (numbers)
        for line in lines:
            if (len(line) > 0
                and line.upper() != line
                and not re.match(".*[0-9]+.*", line)
                and len(line) < 50
                ):
                cleaned = re.sub("[^a-z\"\-]+", " ", line.strip().lower())
                ofp.write(cleaned)
                ofp.write("\n")
                lineno += 1
            else:
                ofp.write("\n")
        print("Wrote lines {} to {} from {}".format(startline, lineno, title))

In [None]:
!wc -l data/poetry/*.txt
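
To see what the filter in the download cell keeps and drops, here is a minimal sketch that applies the same conditions to a few hand-written lines. The helper name `clean_line` and the sample lines are ours, for illustration only; the conditions and the regular expression mirror the cell above.

```python
import re

def clean_line(line, max_len=50):
    """Return the cleaned line, or None if the filter would skip it."""
    if (len(line) > 0
            and line.upper() != line           # all-caps lines are titles or author names
            and not re.search(r"[0-9]", line)  # lines with digits usually carry years
            and len(line) < max_len):
        # replace each run of characters other than a-z, '"' and '-' with one space
        return re.sub(r'[^a-z"\-]+', " ", line.strip().lower())
    return None

# hand-written sample lines (not taken from the downloaded books)
samples = [
    "THE TYGER",
    "Tyger Tyger, burning bright,",
    "Published in 1794",
]
for s in samples:
    print(repr(clean_line(s)))
```

Only the second sample survives. Note that trailing punctuation becomes a trailing space, which is harmless here because later cells call `strip()` on every line.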

## Create training dataset

We are going to train an ML model to write poetry given a starting point. We'll give it one line, and it's going to tell us the next line. So, naturally, we will train it on real poetry. Our feature will be a line of a poem and the label will be the next line of that poem.

Our training dataset will consist of two files. The first file will consist of the input lines of poetry and the other file will consist of the corresponding output lines, one output line per input line.

In [None]:
with open("data/poetry/raw.txt", "r") as rawfp,\
  open("data/poetry/input.txt", "w") as infp,\
  open("data/poetry/output.txt", "w") as outfp:
    
    prev_line = ""
    for curr_line in rawfp:
        curr_line = curr_line.strip()
        # poems break at empty lines, so this ensures we train only on lines of the same poem
        if len(prev_line) > 0 and len(curr_line) > 0:
            infp.write(prev_line + "\n")
            outfp.write(curr_line + "\n")
        prev_line = curr_line

In [None]:
!head -5 data/poetry/*.txt
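
The pairing logic is easiest to check on a tiny in-memory example. This is a minimal sketch of the same loop as a generator; the name `make_pairs` and the placeholder lines are ours, not part of the notebook's data.

```python
def make_pairs(lines):
    """Yield (input, target) pairs of consecutive non-empty lines."""
    prev_line = ""
    for curr_line in lines:
        curr_line = curr_line.strip()
        # an empty line marks a break between poems, so no pair spans it
        if prev_line and curr_line:
            yield prev_line, curr_line
        prev_line = curr_line

# placeholder text: two "poems" separated by an empty line
raw = ["line a1", "line a2", "line a3", "", "line b1", "line b2"]
pairs = list(make_pairs(raw))
for inp, tgt in pairs:
    print("{} -> {}".format(inp, tgt))
```

Three pairs come out. `line a3` is never used as an input and `line b1` is never used as a target, because the empty line between them resets the pairing.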

We do not need to generate the data beforehand $-$ instead, we can have tensor2tensor create the training dataset for us. So, in the code below, we'll use only `data/poetry/raw.txt` $-$ this makes the model easier to productionise: simply keep collecting raw data and generate the training/test data at the time of training.

## Set up problem

The Problem in tensor2tensor is where you specify parameters like the size of your vocabulary and where to get the training data from.

In [None]:
%%bash
rm -rf poetry
mkdir -p poetry/trainer

In [None]:
%%writefile poetry/trainer/problem.py
import os
import tensorflow as tf
from tensor2tensor.utils import registry
from tensor2tensor.models import transformer
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import text_problems
from tensor2tensor.data_generators import generator_utils

tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file

@registry.register_problem
class PoetryLineProblem(text_problems.Text2TextProblem):
    """Predict next line of poetry from the last line from Gutenberg texts.
    """
    
    @property
    def approx_vocab_size(self):
        return 2**13 # ~8k
    
    @property
    def is_generate_per_split(self):
        # generate_samples is called once; its output is sharded into TRAIN and EVAL for us
        return False
    
    @property
    def dataset_splits(self):
        """Splits of data to produce and number of output shards for each.
        """
        # 10% evaluation data
        return [{
            "split": problem.DatasetSplit.TRAIN,
            "shards": 90,
        }, {
            "split": problem.DatasetSplit.EVAL,
            "shards": 10
        }]
    
    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        with open("data/poetry/raw.txt", "r") as rawfp:
            prev_line = ""
            for curr_line in rawfp:
                curr_line = curr_line.strip()
                # poems break at empty lines, so this ensures we train only on lines of the same poem
                if len(prev_line) > 0 and len(curr_line) > 0:
                    yield {
                        "inputs": prev_line,
                        "targets": curr_line
                    }
                prev_line = curr_line
            
# Smaller than the typical translate model, and with more regularisation
@registry.register_hparams
def transformer_poetry():
    hparams = transformer.transformer_base()
    hparams.num_hidden_layers = 2
    hparams.hidden_size = 128
    hparams.filter_size = 512
    hparams.num_heads = 4
    hparams.attention_dropout = 0.6
    hparams.layer_prepostprocess_dropout = 0.6
    hparams.learning_rate = 0.05
    return hparams

@registry.register_hparams
def transformer_poetry_tpu():
    hparams = transformer_poetry()
    transformer.update_hparams_for_tpu(hparams)
    return hparams

# hyperparameter tuning ranges
@registry.register_ranged_hparams
def transformer_poetry_range(rhp):
    rhp.set_float("learning_rate", 0.05, 0.25, scale=rhp.LOG_SCALE)
    rhp.set_int("num_hidden_layers", 2, 4)
    rhp.set_discrete("hidden_size", [128, 256, 512])
    rhp.set_float("attention_dropout", 0.4, 0.7)

In [None]:
%%writefile poetry/trainer/__init__.py
from . import problem

In [None]:
%%writefile poetry/setup.py
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    "tensor2tensor"
]

setup(
    name="poetry",
    version="0.1",
    author="Google",
    author_email="training-feedback@cloud.google.com",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description="Poetry Line Problem",
    requires=[]
)

In [None]:
!touch poetry/__init__.py

In [None]:
!find poetry

## Generate training data

Our problem (translation) requires the creation of text sequences from the training dataset. This is done using `t2t-datagen` and the Problem defined in the previous section.

In [None]:
%%bash
DATA_DIR=./t2t_data
TMP_DIR=$DATA_DIR/tmp
rm -rf $DATA_DIR $TMP_DIR
mkdir -p $DATA_DIR $TMP_DIR
# Generate data
t2t-datagen \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --data_dir=$DATA_DIR \
    --tmp_dir=$TMP_DIR

Let's check to see the files that were output.

In [None]:
!ls t2t_data | head
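
The listing should show the shard files that `dataset_splits` asked for: 90 train shards and 10 dev shards. The sketch below is a rough illustration of why that gives about 10% evaluation data. It assumes the samples are spread round-robin across all shard files and that files follow the `<problem>-<split>-NNNNN-of-NNNNN` naming pattern; both are our understanding of tensor2tensor rather than something this notebook verifies, and the helper names are hypothetical.

```python
from collections import Counter

def shard_names(problem, train_shards=90, dev_shards=10):
    """Build file names in the assumed <problem>-<split>-NNNNN-of-NNNNN pattern."""
    train = ["{}-train-{:05d}-of-{:05d}".format(problem, i, train_shards)
             for i in range(train_shards)]
    dev = ["{}-dev-{:05d}-of-{:05d}".format(problem, i, dev_shards)
           for i in range(dev_shards)]
    return train + dev

def round_robin(num_samples, names):
    """Count how many samples each file receives under round-robin assignment."""
    counts = Counter()
    for i in range(num_samples):
        counts[names[i % len(names)]] += 1
    return counts

names = shard_names("poetry_line_problem")
num_samples = 10000  # arbitrary count, for illustration only
counts = round_robin(num_samples, names)
dev_total = sum(n for name, n in counts.items() if "-dev-" in name)
print("{} ... {}".format(names[0], names[-1]))
print("eval fraction: {:.2f}".format(dev_total / float(num_samples)))
```

With 10 of 100 files belonging to the dev split, one sample in ten lands in evaluation data, which matches the "10% evaluation data" comment in `problem.py`.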

## Provide Cloud AI Platform access to data

Copy the data to Google Cloud Storage, and then provide access to the data.

In [None]:
%%bash
DATA_DIR=./t2t_data
gsutil -m rm -r gs://${BUCKET}/poetry/
gsutil -m cp ${DATA_DIR}/${PROBLEM}* ${DATA_DIR}/vocab* gs://${BUCKET}/poetry/data

In [None]:
%%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud AI Platform Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

## Train model locally on subset of data

Let's run it locally on a subset of the data to make sure it works.

In [None]:
%%bash
BASE=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/subset
gsutil -m rm -r $OUTDIR
gsutil -m cp \
    ${BASE}/${PROBLEM}-train-0008* \
    ${BASE}/${PROBLEM}-dev-00000*  \
    ${BASE}/vocab* \
    $OUTDIR

Note: the following will work only if you are running Jupyter on a reasonably powerful machine. Don't be alarmed if your process is killed.

In [None]:
%%bash
DATA_DIR=gs://${BUCKET}/poetry/subset
OUTDIR=./trained_model
rm -rf $OUTDIR
t2t-trainer \
    --data_dir=gs://${BUCKET}/poetry/subset \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --hparams_set=transformer_poetry \
    --output_dir=$OUTDIR --job-dir=$OUTDIR --train_steps=10

## Option 1: Train model locally on full dataset (use if running on Notebook instance with a GPU)

You can train on the full dataset if you are on a Google Cloud Notebook Instance with a P100 or better GPU

In [None]:
%%bash
LOCALGPU="--train_steps=7500 --worker_gpu=1 --hparams_set=transformer_poetry"

DATA_DIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model
gsutil -m rm -rf $OUTDIR
t2t-trainer \
    --data_dir=$DATA_DIR \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --hparams_set=transformer_poetry \
    --output_dir=$OUTDIR ${LOCALGPU}

## Option 2: Train on Cloud AI Platform

tensor2tensor has a convenient `--cloud_mlengine` option to kick off the training on the managed service. It uses the Cloud AI Platform Python API, rather than requiring you to use gcloud to submit the job.

Note: your project needs P100 quota in the region

In [None]:
%%bash
GPU="--train_steps=7500 --cloud_mlengine --worker_gpu=1 --hparams_set=transformer_poetry"

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "'Y'" | t2t-trainer \
    --data_dir=$DATADIR \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --output_dir=$OUTDIR \
    ${GPU}

## Option 3: Train on a directly-connected TPU

If you are running on a VM directly connected to a Cloud TPU, you can run t2t-trainer directly. Unfortunately, you won't see any output from Jupyter while the program is running.

In [None]:
%%bash
TPU="--train_steps=7500 --use_tpu=True --cloud_tpu_name=laktpu --hparams_set=transformer_poetry_tpu"

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model_tpu
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "'Y'" | t2t-trainer \
    --data_dir=$DATADIR \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --output_dir=$OUTDIR \
    ${TPU}

In [None]:
%%bash
gsutil ls gs://${BUCKET}/poetry/model_tpu

## Option 4: Training longer

Let's train on 4 GPUs for 75,000 steps. Note the change in the last line of the job.

In [None]:
%%bash

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model_full2
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "'Y'" | t2t-trainer \
    --data_dir=$DATADIR \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --hparams_set=transformer_poetry \
    --output_dir=$OUTDIR \
    --train_steps=75000 --cloud_mlengine --worker_gpu=4

So that your expectations are set correctly: a high-performing translation model needs 400 million lines of input and takes a whole day on a TPU pod!

## Check trained model

In [None]:
%%bash
gsutil ls gs://${BUCKET}/poetry/model

## Batch predict

How will your poetry model do when faced with Rumi's spiritual couplets?

In [None]:
%%writefile data/poetry/rumi.txt
Where did the handsome beloved go?
I wonder, where did that tall, shapely cypress tree go?
He spread his light among us like a candle.
Where did he go? So strange, where did he go without me?
All day long my heart trembles like a leaf.
All alone at midnight, where did that beloved go?
Go to the road, and ask any passing traveler — 
That soul-stirring companion, where did he go?
Go to the garden, and ask the gardener — 
That tall, shapely rose stem, where did he go?
Go to the rooftop, and ask the watchman — 
That unique sultan, where did he go?
Like a madman, I search in the meadows!
That deer in the meadows, where did he go?
My tearful eyes overflow like a river — 
That pearl in the vast sea, where did he go?
All night long, I implore both moon and Venus — 
That lovely face, like a moon, where did he go?
If he is mine, why is he with others?
Since he’s not here, to what “there” did he go?
If his heart and soul are joined with God,
And he left this realm of earth and water, where did he go?
Tell me clearly, Shams of Tabriz,
Of whom it is said, “The sun never dies” — where did he go?

Let's write out the odd-numbered lines. We'll compare how close our model can get to the beauty of Rumi's second lines given his first.

In [None]:
%%bash
awk "NR % 2 == 1" data/poetry/rumi.txt | tr "[:upper:]" "[:lower:]" | sed "s/[^a-z\'-\ ]//g" > data/poetry/rumi_leads.txt
head -3 data/poetry/rumi_leads.txt
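
For readers less familiar with `awk`, `tr` and `sed`, here is a minimal Python sketch of what the pipeline above is meant to do as we read it: keep the odd-numbered lines, lowercase them, and drop everything except letters, apostrophes, hyphens and spaces. The function name `make_leads` is ours, and the exact character set is an assumption; the `sed` bracket expression in the cell may keep some additional punctuation.

```python
import re

def make_leads(lines):
    """Keep the 1st, 3rd, 5th, ... lines, lowercased and stripped of other characters."""
    leads = []
    for lineno, line in enumerate(lines, start=1):
        if lineno % 2 == 1:
            # assumption: keep only a-z, apostrophe, hyphen and space
            leads.append(re.sub(r"[^a-z'\- ]", "", line.strip().lower()))
    return leads

# the first four lines of rumi.txt from the cell above
couplets = [
    "Where did the handsome beloved go?",
    "I wonder, where did that tall, shapely cypress tree go?",
    "He spread his light among us like a candle.",
    "Where did he go? So strange, where did he go without me?",
]
for lead in make_leads(couplets):
    print(lead)
```

The result matches the form of the training inputs: lowercase text with no sentence punctuation.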

In [None]:
%%bash
TOPDIR=gs://${BUCKET}
OUTDIR=${TOPDIR}/poetry/model
DATADIR=${TOPDIR}/poetry/data
MODEL=transformer
HPARAMS=transformer_poetry

# The file with the input lines
DECODE_FILE=data/poetry/rumi_leads.txt

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
    --data_dir=$DATADIR \
    --problem=$PROBLEM \
    --model=$MODEL \
    --hparams_set=$HPARAMS \
    --output_dir=$OUTDIR \
    --t2t_usr_dir=./poetry/trainer \
    --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
    --decode_from_file=$DECODE_FILE
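
A note on the two decode settings: `beam_size` is the number of candidate lines kept at each decoding step, and `alpha` controls how strongly longer candidates are favoured. The sketch below assumes the GNMT-style length penalty `((5 + length) / 6) ** alpha`, which to our understanding is what tensor2tensor's beam search uses. The helper names and the candidate scores are made up for illustration.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rank(candidates, alpha):
    """Sort (tokens, log_prob) candidates by length-normalised score, best first."""
    return sorted(candidates,
                  key=lambda c: c[1] / length_penalty(len(c[0]), alpha),
                  reverse=True)

# made-up candidates: (tokens, total log-probability)
candidates = [
    (["the", "end"], -2.0),
    (["and", "all", "the", "stars", "were", "still"], -2.4),
]
print(" ".join(rank(candidates, alpha=0.0)[0][0]))  # no length normalisation
print(" ".join(rank(candidates, alpha=0.6)[0][0]))  # the ALPHA value used above
```

With `alpha=0.0` the shorter candidate wins on raw log-probability; with `alpha=0.6` the longer one overtakes it. If the decoded lines look too clipped, raising `alpha` is one knob to try.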

In [None]:
%%bash
DECODE_FILE=data/poetry/rumi_leads.txt
cat ${DECODE_FILE}.*.decodes

Some of these are still phrases and not complete sentences. This indicates that we might need to train longer or better somehow. We need to diagnose the model...

## Diagnosing training run

Let's diagnose the training run to see what we'd improve the next time around. (Note that this package may not be present on Jupyter; run `pip install pydatalab` if necessary.)

**Monitor training with TensorBoard**

To activate TensorBoard within the JupyterLab UI, navigate to **"File" - "New Launcher"**. Then double-click the "TensorBoard" icon on the bottom row.

TensorBoard 1 will appear in the new tab. Navigate through the three tabs to see the active TensorBoard. The "Graphs" and "Projector" tabs offer very interesting information including the ability to replay the tests.

You may close the TensorBoard tab when you are finished exploring.

We need to reduce overfitting and make sure the eval metrics keep going down as long as the loss is also going down.

What we really need is more data, but if that's not an option, we could try to reduce the size of the neural network and increase the dropout regularisation. We could also do hyperparameter tuning on the dropout and network sizes.

## Hyperparameter tuning

tensor2tensor also supports hyperparameter tuning on Cloud AI Platform. Note the addition of autotune flags.

The `transformer_poetry_range` was registered in `problem.py` above.

In [None]:
%%bash

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model_hparam
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "'Y'" | t2t-trainer \
    --data_dir=$DATADIR \
    --t2t_usr_dir=./poetry/trainer \
    --problem=$PROBLEM \
    --model=transformer \
    --hparams_set=transformer_poetry \
    --output_dir=$OUTDIR \
    --hparams_range=transformer_poetry_range \
    --autotune_objective="metrics-poetry_line_problem/accuracy_per_sequence" \
    --autotune_maximize \
    --autotune_max_trials=4 \
    --autotune_parallel_trials=4 \
    --train_steps=7500 --cloud_mlengine --worker_gpu=4

Let's try predicting with this optimised model.

In [None]:
%%bash
BEST_TRIAL=xx # change as needed
TOPDIR=gs://${BUCKET}
OUTDIR=${TOPDIR}/poetry/model_hparam/$BEST_TRIAL
DATADIR=${TOPDIR}/poetry/data
MODEL=transformer
HPARAMS=transformer_poetry

# the file with the input lines
DECODE_FILE=data/poetry/rumi_leads.txt

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
    --data_dir=$DATADIR \
    --problem=$PROBLEM \
    --model=$MODEL \
    --hparams_set=$HPARAMS \
    --output_dir=$OUTDIR \
    --t2t_usr_dir=./poetry/trainer \
    --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
    --decode_from_file=$DECODE_FILE \
    --hparams="num_hidden_layers=4,hidden_size=512"

In [None]:
%%bash
DECODE_FILE=data/poetry/rumi_leads.txt
cat ${DECODE_FILE}.*.decodes

## Serving model

There are two ways of serving predictions:

1. Use Cloud AI Platform $-$ this is serverless and you don't have to manage any infrastructure
2. Use Kubeflow on Google Kubernetes Engine $-$ this uses clusters but will also work on-prem on your own Kubernetes cluster

In either case, you need to export the model first and have TensorFlow Serving serve the model. The model, however, expects to see *encoded* (i.e. preprocessed) data. So, we'll do that encoding in the Python Flask application (in AppEngine Flex) that serves the user interface.

In [None]:
%%bash
TOPDIR=gs://${BUCKET}
OUTDIR=${TOPDIR}/poetry/model_full2
DATADIR=${TOPDIR}/poetry/data
MODEL=transformer
HPARAMS=transformer_poetry
BEAM_SIZE=4
ALPHA=0.6

t2t-exporter \
    --model=$MODEL \
    --hparams_set=$HPARAMS \
    --problem=$PROBLEM \
    --t2t_usr_dir=./poetry/trainer \
    --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
    --data_dir=$DATADIR \
    --output_dir=$OUTDIR

In [None]:
%%bash
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/poetry/model_full2/export | tail -1)
echo $MODEL_LOCATION
saved_model_cli show --dir $MODEL_LOCATION --tag_set serve --signature_def serving_default

### Cloud AI Platform

In [None]:
%%writefile mlengine.json
description: Poetry service on AI Platform
autoScaling:
    minNodes: 1 # We don't want this model to autoscale down to zero

In [None]:
%%bash
MODEL_NAME="poetry"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/poetry/model_full2/export | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} \
        --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=1.13 --config=mlengine.json

### Kubeflow

Follow these instructions:

- On the GCP console, launch a Google Kubernetes Engine (GKE) cluster named `poetry` with **2 nodes**, each of which is a **n1-standard-2** (2 vCPUs, 7.5 GB memory) VM
- On the GCP console, click on the Connect button for your cluster, and choose the Cloud Shell option
- In Cloud Shell, run:

    `git clone https://github.com/GoogleCloudPlatform/training-data-analyst`
    
    `cd training-data-analyst/courses/machine_learning/deepdive/09_sequence`
- Look at `./setup_kubeflow.sh` and modify as appropriate

### AppEngine

What's deployed in Cloud AI Platform or Kubeflow is only the TensorFlow model. We still need a preprocessing service. That is done using AppEngine. Edit `application/app.yaml` appropriately.

In [None]:
!cat application/app.yaml

In [None]:
%%bash
cd application
#gcloud app create # if this is your first app
#gcloud app deploy --quiet --stop-previous-version app.yaml

Now visit https://mlpoetry-dot-cloud-training-demos.appspot.com and try out the prediction app!