# Using custom containers with AI Platform Training

Objectives:

1. Learn how to create a training and a validation split with BigQuery
2. Learn how to wrap a machine learning model into a Docker container and train it on AI Platform
3. Learn how to use the hyperparameter tuning engine on Google Cloud to find the best hyperparameters
4. Learn how to deploy a trained machine learning model to Google Cloud as a REST API and query it

Main steps:

1. Create the training script
2. Package training script into a Docker Image
3. Build and push training image to Google Cloud Container Registry

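Shell tricks: set the project, enable the required APIs, and grant the Cloud Build service account the Editor role.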

export PROJECT_ID=$(gcloud config get-value core/project)
gcloud config set project $PROJECT_ID

gcloud services enable \
cloudbuild.googleapis.com \
container.googleapis.com \
cloudresourcemanager.googleapis.com \
iam.googleapis.com \
containerregistry.googleapis.com \
containeranalysis.googleapis.com \
ml.googleapis.com \
dataflow.googleapis.com

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
CLOUD_BUILD_SERVICE_ACCOUNT="${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$CLOUD_BUILD_SERVICE_ACCOUNT \
  --role roles/editor

### Create a GKE cluster from the shell

gcloud container clusters create cluster-1 --zone us-central1-a --cluster-version 1.18.20 --machine-type n1-standard-2 --enable-basic-auth --scopes=https://www.googleapis.com/auth/cloud-platform

In [10]:
import json
import os
import numpy as np
import pandas as pd
import pickle
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

In [1]:
REGION = 'us-central1'
PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
BUCKET = 'gs://' + PROJECT_ID
print(BUCKET)

ARTIFACT_STORE = BUCKET # + 'kubeflowpipelines-default'

DATA_ROOT='{}/data'.format(ARTIFACT_STORE)
JOB_DIR_ROOT='{}/jobs'.format(ARTIFACT_STORE)
TRAINING_FILE_PATH='{}/{}'.format(DATA_ROOT, 'dataset.csv')
# VALIDATION_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'validation', 'dataset.csv')
print(TRAINING_FILE_PATH)
OUTPUT_DIR = '{}/models'.format(ARTIFACT_STORE)
print(OUTPUT_DIR)


gs://qwiklabs-gcp-01-43b0d7048e07
gs://qwiklabs-gcp-01-43b0d7048e07/data/dataset.csv
gs://qwiklabs-gcp-01-43b0d7048e07/models


In [4]:
!gsutil cp -r data $BUCKET

Copying file://data/data.json [Content-Type=application/json]...
Copying file://data/dataset_eval.csv [Content-Type=text/csv]...                 
Copying file://data/dataset.csv [Content-Type=text/csv]...                      
/ [3 files][ 40.4 KiB/ 40.4 KiB]                                                
Operation completed over 3 objects/40.4 KiB.                                     


In [15]:
!mkdir tensorflow_trainer_image

mkdir: cannot create directory ‘tensorflow_trainer_image’: File exists


In [5]:
%%writefile ./tensorflow_trainer_image/train.py

"""TensorFlow training script."""

import pickle
import subprocess
import sys
import fire
import tensorflow as tf
import datetime
import os

import pandas as pd
import numpy as np

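def load_dataset(pattern, window_size=30, batch_size=16, shuffle_buffer=100):
    """
    Description: read a CSV time series and build a windowed tf.data.Dataset.
    Input:
      - pattern: path of the CSV file to read
      - window_size: number of consecutive values used as model input
      - batch_size: number of windows per training batch
      - shuffle_buffer: size of the buffer used to shuffle the windows

    Output: dataset of (inputs, label) batches, where the label is the
      value that follows each window
    """
    
    # read data: the series is taken from the second column of the CSV
    data = pd.read_csv(pattern)
    time = np.array(data.times)  # timestamps column (not used for training)
    series = np.array(data.values)[:,1].astype('float32')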
    
    dataset = tf.data.Dataset.from_tensor_slices(series)
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1])) # x and y (last one)
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset

def train_evaluate(training_dataset_path, 
                   # validation_dataset_path,
                   window_size,
                   batch_size,
                   epochs,
                   lr,
                   # num_train_examples, num_evals, 
                   output_dir):
    """
    Description: train a small dense network on the windowed series and
    export it as a SavedModel under output_dir.
    """
    
    EPOCHS = epochs
    LR = lr
    
    l0 = tf.keras.layers.Dense(2*window_size+1, input_shape=[window_size], activation='relu')
    l2 = tf.keras.layers.Dense(1)
    model = tf.keras.models.Sequential([l0, l2])
    
    # SGD with momentum; the learning rate comes from the --lr argument
    optimizer = tf.keras.optimizers.SGD(learning_rate=LR, momentum=0.9)
    model.compile(loss="mse", optimizer=optimizer, metrics=['mae'])
    
    # load data
    trainds = load_dataset(pattern=training_dataset_path, window_size=window_size, batch_size=batch_size)
    # evalds = load_dataset(pattern=validation_dataset_path, mode='eval')
    
    history = model.fit(trainds, epochs=EPOCHS, verbose=0)
    
    EXPORT_PATH = os.path.join(output_dir, datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    tf.saved_model.save(obj=model, export_dir=EXPORT_PATH)  # with default serving function
    
    print("Exported trained model to {}".format(EXPORT_PATH))
    
if __name__ == '__main__':
    fire.Fire(train_evaluate)

Overwriting ./tensorflow_trainer_image/train.py
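
To make the windowing in `load_dataset` concrete, here is a minimal pure-Python sketch of the same idea: each window of `window_size` values becomes the model input, and the value that follows it becomes the label. The helper name `make_windows` is hypothetical and is not part of `train.py`; it leaves out shuffling and batching.

```python
def make_windows(series, window_size):
    """Return (inputs, label) pairs from a sliding window with shift 1."""
    pairs = []
    # A full window needs window_size inputs plus one label, so incomplete
    # windows at the end are dropped (this mirrors drop_remainder=True).
    for start in range(len(series) - window_size):
        window = series[start:start + window_size + 1]
        pairs.append((window[:-1], window[-1]))
    return pairs


for inputs, label in make_windows([10, 20, 30, 40, 50], window_size=2):
    print(inputs, "->", label)
```

With five values and `window_size=2` this yields three pairs; the last two values alone cannot form a full window and are skipped.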


## Package TensorFlow Training Script into a Docker Image

In [6]:
%%writefile ./tensorflow_trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire tensorflow==2.1.1
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Overwriting ./tensorflow_trainer_image/Dockerfile


## Build the TensorFlow Trainer Image

In [7]:
TF_IMAGE_NAME='tensorflow_trainer_image'
TF_IMAGE_TAG='latest'
TF_IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, TF_IMAGE_NAME, TF_IMAGE_TAG)

In [8]:
!gcloud builds submit --tag $TF_IMAGE_URI $TF_IMAGE_NAME

Creating temporary tarball archive of 2 file(s) totalling 2.5 KiB before compression.
Uploading tarball of [tensorflow_trainer_image] to [gs://qwiklabs-gcp-01-43b0d7048e07_cloudbuild/source/1631460341.83998-8b4d6e59834c4fc19e148543f468e634.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/qwiklabs-gcp-01-43b0d7048e07/locations/global/builds/105775d5-70aa-43f7-b4d9-89539b369917].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/105775d5-70aa-43f7-b4d9-89539b369917?project=821318692321].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "105775d5-70aa-43f7-b4d9-89539b369917"

FETCHSOURCE
Fetching storage object: gs://qwiklabs-gcp-01-43b0d7048e07_cloudbuild/source/1631460341.83998-8b4d6e59834c4fc19e148543f468e634.tgz#1631460342050308
Copying gs://qwiklabs-gcp-01-43b0d7048e07_cloudbuild/source/1631460341.83998-8b4d6e59834c4fc19e148543f468e634.tgz#1631460342050308...
/ [1 files][  1.2 KiB/  1.2 KiB]           

## Submit an AI Platform training job

In [12]:
JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = "{}/{}".format(JOB_DIR_ROOT, JOB_NAME)
SCALE_TIER = "BASIC"

WINDOW_SIZE = 30
BATCH_SIZE = 16 
EPOCHS = 10
LR = 1e-3

!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=$JOB_DIR \
--master-image-uri=$TF_IMAGE_URI \
--scale-tier=$SCALE_TIER \
-- \
--training_dataset_path=$TRAINING_FILE_PATH \
--window_size=$WINDOW_SIZE \
--epochs=$EPOCHS \
--batch_size=$BATCH_SIZE \
--output_dir=$OUTPUT_DIR \
--lr=$LR

Job [JOB_20210912_153757] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe JOB_20210912_153757

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs JOB_20210912_153757
jobId: JOB_20210912_153757
state: QUEUED
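
Everything after the bare `--` in the submit command is forwarded to the container's entrypoint, where `fire.Fire(train_evaluate)` maps each flag to the function parameter of the same name. As a rough illustration of that mapping, the sketch below parses the same flags with `argparse` from the standard library. This is an assumed stand-in, not how `fire` works internally, and the bucket name `example-bucket` is made up.

```python
import argparse


def parse_trainer_args(argv):
    """Parse the flags that train_evaluate expects, with explicit types."""
    parser = argparse.ArgumentParser(description="Illustrative stand-in for fire")
    parser.add_argument("--training_dataset_path", required=True)
    parser.add_argument("--window_size", type=int, required=True)
    parser.add_argument("--batch_size", type=int, required=True)
    parser.add_argument("--epochs", type=int, required=True)
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--output_dir", required=True)
    # vars() turns the namespace into a dict keyed by parameter name
    return vars(parser.parse_args(argv))


args = parse_trainer_args([
    "--training_dataset_path=gs://example-bucket/data/dataset.csv",
    "--window_size=30",
    "--epochs=10",
    "--batch_size=16",
    "--output_dir=gs://example-bucket/models",
    "--lr=0.001",
])
print(args)
```

The resulting dictionary could be passed to the training function as `train_evaluate(**args)`, which is why the flag names have to match the parameter names exactly.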


In [14]:
!gcloud ai-platform jobs describe $JOB_NAME

createTime: '2021-09-12T15:37:59Z'
etag: _v0deQoujVg=
jobId: JOB_20210912_153757
startTime: '2021-09-12T15:42:24Z'
state: RUNNING
trainingInput:
  args:
  - --training_dataset_path=gs://qwiklabs-gcp-01-43b0d7048e07/data/dataset.csv
  - --window_size=30
  - --epochs=10
  - --batch_size=16
  - --output_dir=gs://qwiklabs-gcp-01-43b0d7048e07/models
  - --lr=0.001
  - \
  jobDir: gs://qwiklabs-gcp-01-43b0d7048e07/jobs/JOB_20210912_153757
  masterConfig:
    imageUri: gcr.io/qwiklabs-gcp-01-43b0d7048e07/tensorflow_trainer_image:latest
  region: us-central1
trainingOutput: {}

View job in the Cloud Console at:
https://console.cloud.google.com/mlengine/jobs/JOB_20210912_153757?project=qwiklabs-gcp-01-43b0d7048e07

View logs at:
https://console.cloud.google.com/logs?resource=ml_job%2Fjob_id%2FJOB_20210912_153757&project=qwiklabs-gcp-01-43b0d7048e07
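
Instead of re-running `describe` by hand, the job state can be polled until it reaches a terminal value. The sketch below shows one possible loop. `wait_for_job` and `get_state` are hypothetical names: `get_state` stands for any callable that returns the `state` field shown above, and here it is replaced by a fake that replays a fixed sequence so the example runs without a cloud project.

```python
import time

# Job states after which AI Platform Training stops changing the job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_state() until it returns a terminal state, then return it."""
    for attempt in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    raise TimeoutError("job did not finish after {} polls".format(max_polls))


# Stand-in for a real status lookup: replays a fixed sequence of states.
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

Passing `sleep` as a parameter keeps the loop testable; in a notebook the default `time.sleep` would be used.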


## Hyperparameter tuning

In [None]:
%%writefile tensorflow_trainer_image/hptuning_config.yaml

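trainingInput:
  hyperparameters:
    # train.py optimizes MSE and tracks MAE, so the goal is to minimize;
    # train.py must report this metric tag for tuning to work
    goal: MINIMIZE
    maxTrials: 4
    maxParallelTrials: 4
    hyperparameterMetricTag: mae
    enableTrialEarlyStopping: TRUE
    params:
    # each parameterName must match a train.py command-line argument
    - parameterName: window_size
      type: DISCRETE
      discreteValues: [
          200,
          500
          ]
    - parameterName: lr
      type: DOUBLE
      minValue:  0.00001
      maxValue:  0.001
      scaleType: UNIT_LINEAR_SCALE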
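
The tuning service passes each tuned value to the container as a command-line flag named after its `parameterName`, so every name in the config has to match an argument that `train_evaluate` accepts. The sketch below is one way to check this before submitting a job. It is only an illustration: the config is written as a hand-made Python dict rather than loaded from the YAML file, `train_evaluate` is a stub that copies the parameter names from `train.py`, and `unmatched_parameter_names` is a hypothetical helper.

```python
import inspect


def train_evaluate(training_dataset_path, window_size, batch_size, epochs, lr, output_dir):
    """Stub with the same parameter names as train.py; the body is omitted."""


def unmatched_parameter_names(hptuning_config, train_fn):
    """Return the parameterName values that train_fn does not accept."""
    accepted = set(inspect.signature(train_fn).parameters)
    params = hptuning_config["trainingInput"]["hyperparameters"]["params"]
    return [p["parameterName"] for p in params if p["parameterName"] not in accepted]


# Hand-written example config: one name matches, one does not.
config = {
    "trainingInput": {
        "hyperparameters": {
            "params": [
                {"parameterName": "window_size", "type": "DISCRETE"},
                {"parameterName": "alpha", "type": "DOUBLE"},
            ]
        }
    }
}
print(unmatched_parameter_names(config, train_evaluate))
```

A name such as `alpha` is flagged because the function has no parameter with that name; an empty list means every tuned parameter can be delivered to the script.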