# MNIST Image Classification with TensorFlow on Cloud AI Platform

This lab demonstrates how to implement different image models on MNIST using the `tf.keras` API.

**Learning objectives**

1. Understand how to build a Dense Neural Network (DNN) for image classification
2. Understand how to use dropout for image classification
3. Understand how to use Convolutional Neural Networks (CNN)
4. Know how to deploy and use an image classification model using Cloud AI Platform

In [None]:
from datetime import datetime
import os

PROJECT = "your-project-id-here"
BUCKET = "your-bucket-id-here"
REGION = "us-central1"
MODEL_TYPE = "cnn"

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["MODEL_TYPE"] = MODEL_TYPE
os.environ["TFVERSION"] = "2.1"
os.environ["IMAGE_URI"] = os.path.join("gcr.io", PROJECT, "mnist_models")

## Building a dynamic model

Model code needs to be packaged as a Python module in order to run it on Cloud AI Platform.

The boilerplate structure for this module has already been set up in the `mnist_models` folder. The module lives in the sub-folder `trainer` and is designated as a Python package by the empty `__init__.py` file (`mnist_models/trainer/__init__.py`). It still needs a model and a trainer to run it, so let's make them.

Let's start with the trainer file. This file parses command-line arguments and feeds them into the model.

In [None]:
%%writefile mnist_models/trainer/task.py
import argparse
import json
import os
import sys

from . import model

def _parse_arguments(argv):
    """Parses command-line arguments"""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_type",
        help="Which model type to use",
        type=str,
        default="linear")
    parser.add_argument(
        "--epochs",
        help="The number of epochs to train on",
        type=int,
        default=10)
    parser.add_argument(
        "--steps_per_epoch",
        help="The number of steps per epoch to train on",
        type=int,
        default=100)
    parser.add_argument(
        "--job-dir",
        help="Directory where to save the given model",
        type=str,
        default="mnist/")
    return parser.parse_known_args(argv)

def main():
    """Parses command line arguments and kicks off model training"""
    args = _parse_arguments(sys.argv[1:])[0]
    
    # Configure path for hyperparameter tuning: each trial writes to its own
    # sub-directory so that trials do not overwrite each other's output
    trial_id = json.loads(
        os.environ.get("TF_CONFIG", "{}")).get("task", {}).get("trial", "")
    output_path = args.job_dir if not trial_id else os.path.join(
        args.job_dir, trial_id)

    model_layers = model.get_layers(args.model_type)
    image_model = model.build_model(model_layers, output_path)
    model_history = model.train_and_evaluate(
        image_model, args.epochs, args.steps_per_epoch, output_path)
    
if __name__ == "__main__":
    main()
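
The argument handling in the trainer can be tried on its own. The sketch below is a reduced illustration, not the trainer itself: `parse_arguments` keeps only two of the flags, `trial_id_from_tf_config` is a hypothetical helper name, and the bucket path and `--extra` flag are made up. It shows that `argparse` exposes `--job-dir` as `args.job_dir`, that `parse_known_args` returns unrecognized flags instead of failing, and how a trial ID would be read from a `TF_CONFIG`-style JSON string.

```python
import argparse
import json


def parse_arguments(argv):
    """Reduced copy of the trainer's parser, kept to two flags for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_type", type=str, default="linear")
    # argparse turns the dash in "--job-dir" into an underscore: args.job_dir
    parser.add_argument("--job-dir", type=str, default="mnist/")
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    return parser.parse_known_args(argv)


def trial_id_from_tf_config(tf_config):
    """Hypothetical helper: reads task.trial from a TF_CONFIG JSON string."""
    return json.loads(tf_config or "{}").get("task", {}).get("trial", "")


# "gs://some-bucket/job" and "--extra=1" are made-up values for the demo
args, unknown = parse_arguments(
    ["--job-dir=gs://some-bucket/job", "--model_type=cnn", "--extra=1"])
print(args.job_dir, args.model_type, unknown)

print(repr(trial_id_from_tf_config('{"task": {"trial": "3"}}')))
print(repr(trial_id_from_tf_config("")))
```

Because unrecognized flags are returned rather than rejected, extra arguments passed to the job do not crash the trainer.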

Next, let's group non-model functions into a `util` file to keep the model file simple. We'll copy over the `scale` and `load_dataset` functions from the `MNIST_linear` lab.

In [None]:
%%writefile mnist_models/trainer/util.py
import tensorflow as tf

def scale(image, label):
    """Scales images from a 0-255 int range to a 0-1 float range"""
    image = tf.cast(image, tf.float32)
    image /= 255
    image = tf.expand_dims(image, -1)
    return image, label

def load_dataset(data, training=True, buffer_size=5000, batch_size=100, nclasses=10):
    """Loads MNIST data set into a tf.data.Dataset"""
    (x_train, y_train), (x_test, y_test) = data
    x = x_train if training else x_test
    y = y_train if training else y_test
    # One-hot encode the classes
    y = tf.keras.utils.to_categorical(y, nclasses)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.map(scale).batch(batch_size)
    if training:
        dataset = dataset.shuffle(buffer_size).repeat()
    return dataset
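
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what scaling and one-hot encoding do to the data. `to_one_hot` and `scale_pixels` are hypothetical stand-ins written for this illustration; the lab itself uses `tf.keras.utils.to_categorical` and TensorFlow ops.

```python
def to_one_hot(labels, nclasses=10):
    """Pure-Python stand-in for one-hot encoding integer class labels."""
    return [[1.0 if i == label else 0.0 for i in range(nclasses)]
            for label in labels]


def scale_pixels(row):
    """Maps 0-255 integer pixel values to 0-1 floats."""
    return [pixel / 255 for pixel in row]


# Two labels with four classes: each label becomes a row with a single 1.0
print(to_one_hot([3, 0], nclasses=4))

# Three example pixel values: black, dark gray, white
print(scale_pixels([0, 51, 255]))
```

One-hot labels are what the `categorical_crossentropy` loss used later expects, which is why the labels are encoded before they are placed in the dataset.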

Finally, let's code the models! The `tf.keras` API accepts a list of `layers` to build a `model` object, so we can create a dictionary of layer lists keyed by the model types we want to use. The file below has three functions: `get_layers`, `build_model`, and `train_and_evaluate`. We will build the structure of our model in `get_layers`.

These models progressively build on each other.

In [None]:
%%writefile mnist_models/trainer/model.py
import os
import shutil

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Softmax

from . import util

# Image variables
WIDTH = 28
HEIGHT = 28


def get_layers(
    model_type,
    nclasses=10,
    hidden_layer_1_neurons=400,
    hidden_layer_2_neurons=100,
    dropout_rate=0.25,
    num_filters_1=64,
    kernel_size_1=3,
    pooling_size_1=2,
    num_filters_2=32,
    kernel_size_2=3,
    pooling_size_2=2):
    """Constructs layers for a Keras model based on a dict of model types"""
    model_layers = {
        "linear": [
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(nclasses),
            tf.keras.layers.Softmax()
        ],
        "dnn": [
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(hidden_layer_1_neurons, activation="relu"),
            tf.keras.layers.Dense(hidden_layer_2_neurons, activation="relu"),
            tf.keras.layers.Dense(nclasses),
            tf.keras.layers.Softmax()
        ],
        "dnn_dropout": [
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(hidden_layer_1_neurons, activation="relu"),
            tf.keras.layers.Dense(hidden_layer_2_neurons, activation="relu"),
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(nclasses),
            tf.keras.layers.Softmax()
        ],
        "cnn": [
            tf.keras.layers.Conv2D(num_filters_1, kernel_size=kernel_size_1, activation="relu",
                                   input_shape=(WIDTH, HEIGHT, 1)),
            tf.keras.layers.MaxPooling2D(pooling_size_1),
            tf.keras.layers.Conv2D(num_filters_2, kernel_size=kernel_size_2, activation="relu"),
            tf.keras.layers.MaxPooling2D(pooling_size_2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(hidden_layer_1_neurons, activation="relu"),
            tf.keras.layers.Dense(hidden_layer_2_neurons, activation="relu"),
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(nclasses),
            tf.keras.layers.Softmax()
        ]
        
    }
    return model_layers[model_type]

def build_model(layers, output_dir):
    """Compiles Keras model for image classification"""
    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    return model

def train_and_evaluate(model, num_epochs, steps_per_epoch, output_dir):
    """Loads MNIST data, trains the compiled Keras model, and exports it"""
    mnist = tf.keras.datasets.mnist.load_data()
    train_data = util.load_dataset(mnist)
    validation_data = util.load_dataset(mnist, training=False)
    
    callbacks = []
    if output_dir:
        tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=output_dir)
        callbacks = [tensorboard_callback]
        
    history = model.fit(
        train_data,
        validation_data=validation_data,
        epochs=num_epochs,
        steps_per_epoch=steps_per_epoch,
        verbose=2,
        callbacks=callbacks)
    
    if output_dir:
        export_path = os.path.join(output_dir, "keras_export")
        model.save(export_path, save_format="tf")
        
    return history
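
It can help to trace how the `cnn` layers change the image dimensions. The sketch below works through the arithmetic in plain Python, assuming the Keras defaults that the code above relies on: `Conv2D` uses a stride of 1 with `"valid"` padding, and `MaxPooling2D` uses a stride equal to its pool size. The helper names are hypothetical and exist only for this illustration.

```python
def conv_output_size(size, kernel_size):
    """Width/height after a stride-1 convolution with 'valid' padding."""
    return size - kernel_size + 1


def pool_output_size(size, pool_size):
    """Width/height after max pooling with stride equal to the pool size."""
    return size // pool_size


def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


size = 28
size = conv_output_size(size, 3)   # 28 -> 26
size = pool_output_size(size, 2)   # 26 -> 13
size = conv_output_size(size, 3)   # 13 -> 11
size = pool_output_size(size, 2)   # 11 -> 5
flattened = size * size * 32       # 5 * 5 * 32 feature values

total_params = (
    conv_params(3, 1, 64)          # first Conv2D, 1 input channel
    + conv_params(3, 64, 32)       # second Conv2D, 64 input channels
    + dense_params(flattened, 400)
    + dense_params(400, 100)
    + dense_params(100, 10))

print(size, flattened, total_params)
```

With the default hyperparameters, most of the parameters sit in the first dense layer after `Flatten`, not in the convolutions.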

## Local training

With everything set up, let's run the code locally to test it.

In [None]:
!python3 -m mnist_models.trainer.test

Now that the model code runs locally, the next step is to run it on Cloud AI Platform. Before that, we can run it locally as a Python module from the command line.

The cell below exports some of our variables to the environment so the command line can read them, and creates a job directory name that includes a timestamp.

In [None]:
current_time = datetime.now().strftime("%y%m%d_%H%M%S")
model_type = "cnn"

os.environ["MODEL_TYPE"] = model_type
os.environ["JOB_DIR"] = "mnist_models/models/{}_{}/".format(
    model_type, current_time)
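
As a small illustration of the two ideas in that cell, the sketch below formats a fixed, made-up timestamp with the same `strftime` pattern and then shows that a value placed in `os.environ` is visible to a child process, which is how the `%%bash` cells see these variables. `DEMO_JOB_DIR` is a hypothetical variable name chosen so the demo does not overwrite the real `JOB_DIR`.

```python
import os
import subprocess
import sys
from datetime import datetime

# A fixed, made-up time so the result is reproducible
fixed_time = datetime(2020, 3, 5, 14, 7, 9)
stamp = fixed_time.strftime("%y%m%d_%H%M%S")
job_dir = "mnist_models/models/{}_{}/".format("cnn", stamp)
print(job_dir)

# Child processes inherit the parent's environment variables
os.environ["DEMO_JOB_DIR"] = job_dir
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_JOB_DIR'])"],
    capture_output=True, text=True, check=True)
child_value = result.stdout.strip()
print(child_value == job_dir)
```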

The cell below runs the local version of the code. The `epochs` and `steps_per_epoch` flags, defined in our `mnist_models/trainer/task.py` file, can be changed to train for longer or shorter.

In [None]:
%%bash
python3 -m mnist_models.trainer.task \
    --job-dir=$JOB_DIR \
    --epochs=5 \
    --steps_per_epoch=50 \
    --model_type=$MODEL_TYPE

## Training on the cloud

Since the version of TensorFlow we're using is not yet released on Cloud AI Platform, we can instead use a [Deep Learning Container](https://cloud.google.com/ai-platform/deep-learning-containers/docs/overview), which also lets us take advantage of libraries and apps not normally packaged with AI Platform. Below is a simple Dockerfile that copies our code into a TensorFlow 2 environment.

In [None]:
%%writefile mnist_models/Dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-cpu
COPY mnist_models/trainer /mnist_models/trainer
ENTRYPOINT ["python3", "-m", "mnist_models.trainer.task"]

The commands below build the image and ship it off to Google Cloud so it can be used on Cloud AI Platform. Cloud Build must be enabled.

In [None]:
!docker build -f mnist_models/Dockerfile -t $IMAGE_URI ./

In [None]:
!docker push $IMAGE_URI

Finally, we can kick off the [Cloud AI Platform training job](https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training). We can pass in our Docker image using the `--master-image-uri` flag.

In [None]:
current_time = datetime.now().strftime("%y%m%d_%H%M%S")
model_type = "cnn"

os.environ["MODEL_TYPE"] = model_type
os.environ["JOB_DIR"] = "gs://{}/mnist_{}_{}/".format(
    BUCKET, model_type, current_time)
os.environ["JOB_NAME"] = "mnist_{}_{}".format(
    model_type, current_time)
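
Job names on AI Platform are expected to start with a letter and contain only letters, digits, and underscores, which is why the name above is built with underscores. That constraint is stated here as an assumption worth checking against the current documentation. A minimal sketch of a check, with `make_job_name` as a hypothetical helper and a made-up timestamp:

```python
import re
from datetime import datetime


def make_job_name(model_type, now):
    """Builds a job name and rejects characters assumed to be disallowed."""
    name = "mnist_{}_{}".format(model_type, now.strftime("%y%m%d_%H%M%S"))
    # Assumed rule: a leading letter, then letters, digits, or underscores
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError("Invalid job name: {}".format(name))
    return name


print(make_job_name("cnn", datetime(2020, 3, 5, 14, 7, 9)))
```

A model type containing a dash, such as a hypothetical `dnn-dropout`, would raise `ValueError` here before the job is ever submitted.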

In [None]:
%%bash
echo $JOB_DIR $REGION $JOB_NAME
gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket=gs://$BUCKET \
    --region=$REGION \
    --master-image-uri=$IMAGE_URI \
    --scale-tier=BASIC_GPU \
    --job-dir=$JOB_DIR \
    -- \
    --model_type=$MODEL_TYPE

## Deploying and predicting with the model

Once you have a model you're proud of, let's deploy it! All we need to do is give AI Platform the location of the model. The code below uses the Keras export path of the previous job, but `${JOB_DIR}keras_export/` can always be changed to a different path.

In [None]:
%%bash
MODEL_NAME="mnist"
MODEL_VERSION=${MODEL_TYPE}
MODEL_LOCATION=${JOB_DIR}keras_export/
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
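# Remove any previous deployment; --quiet skips the confirmation prompts.
# Both commands fail harmlessly if nothing has been deployed yet.
gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet
gcloud ai-platform models delete ${MODEL_NAME} --quiet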
gcloud config set ai_platform/region global
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} \
    --model ${MODEL_NAME} \
    --origin ${MODEL_LOCATION} \
    --framework tensorflow \
    --runtime-version=2.1

To predict with our model, let's take one of the example images.

In [None]:
import json, codecs
import tensorflow as tf
import matplotlib.pyplot as plt
from mnist_models.trainer import util

HEIGHT = 28
WIDTH = 28
IMGNO = 12

mnist = tf.keras.datasets.mnist.load_data()
(x_train, y_train), (x_test, y_test) = mnist
test_image = x_test[IMGNO]

# Write the image as one JSON instance; the context manager closes the file
jsondata = test_image.reshape(HEIGHT, WIDTH, 1).tolist()
with open("test.json", "w", encoding="utf-8") as f:
    json.dump(jsondata, f)
plt.imshow(test_image.reshape(HEIGHT, WIDTH));
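
The file written above holds a single instance: one nested list of shape 28 x 28 x 1 on one line. The sketch below reproduces that layout without TensorFlow, using an all-zero stand-in image and a temporary directory (both are assumptions made for this illustration), and reads the file back to confirm the shape. It assumes the later prediction step reads one JSON instance per line.

```python
import json
import os
import tempfile

HEIGHT, WIDTH = 28, 28

# Stand-in for a real MNIST image: 28 rows x 28 columns x 1 channel of zeros
image = [[[0] for _ in range(WIDTH)] for _ in range(HEIGHT)]

path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(image, f)  # no indent, so the instance stays on one line

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

instance = json.loads(lines[0])
print(len(lines), len(instance), len(instance[0]), len(instance[0][0]))
```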