<h1> Scaling up ML using GCP ML Engine </h1>

<h4>Scripts to train the classifier on a local machine as well as in the cloud.</h4>

<h2> Environment variables for project and bucket </h2>

Note that:
<ol>
<li> Your project id is the *unique* string that identifies your project (not the project name). You can find this from the GCP Console dashboard's Home page.  My dashboard reads:  <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available). A common pattern is to prefix the bucket name with the project id, so that it is unique. Also, for cost reasons, you might want to use a single-region bucket. </li>
</ol>
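The naming pattern from item 2 can be sketched in Python. The helper name `suggest_bucket_name` and the `ml` suffix are hypothetical. The check covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit) and does not tell you whether the name is actually available.

```python
import re

# Basic Cloud Storage naming rules only; availability is NOT checked here.
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$')


def suggest_bucket_name(project_id, suffix='ml'):
    """Prefix a (hypothetical) suffix with the project id, e.g. 'my-proj-ml'."""
    name = '{}-{}'.format(project_id, suffix).lower()
    if not _BUCKET_RE.match(name):
        raise ValueError('Not a valid bucket name: {!r}'.format(name))
    return name


print(suggest_bucket_name('cloud-training-demos'))
```

Creating the bucket from the console remains the simplest way to confirm that the suggested name is free.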
<b>Change the cell below</b> to reflect your Project ID and bucket name.

In [4]:
import os
PROJECT = 'xxxxxxxxxxxxxxxxxx' # REPLACE WITH YOUR PROJECT ID
REGION = 'xxxxxxxxxxxxxxxxxxx' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.
BUCKET = 'twitter-sentiment-classifier' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.

In [5]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.13'  # TensorFlow version
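Before running any of the bash cells, it can help to confirm that the placeholder values were actually replaced. The following is a minimal sketch; the function name `unset_settings` is made up for illustration, and it assumes the placeholders are strings made only of the letter `x`, as in the cell above.

```python
import os

REQUIRED = ('PROJECT', 'BUCKET', 'REGION', 'TFVERSION')


def unset_settings(env, required=REQUIRED):
    """Return the names that are missing, empty, or still an 'xxxx' placeholder."""
    bad = []
    for name in required:
        value = env.get(name, '')
        if not value or set(value) == {'x'}:
            bad.append(name)
    return bad


# An empty list means every setting has been filled in.
print(unset_settings(os.environ))
```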

In [7]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


Allow the Cloud ML Engine service account to read/write to the bucket containing training data.

In [17]:
%%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET
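The same authorization steps can be expressed in Python, which makes the JSON handling easier to inspect. This is only a sketch: the function names are hypothetical and the sample response is made up. The only field read from the response is `serviceAccount`, as in the bash cell.

```python
import json


def service_account_from_config(response_text):
    """Extract the 'serviceAccount' field from the getConfig JSON response."""
    return json.loads(response_text)['serviceAccount']


def acl_commands(svc_account, bucket):
    """Build the same three gsutil invocations as the bash cell, as argument lists."""
    target = 'gs://{}'.format(bucket)
    return [
        ['gsutil', '-m', 'defacl', 'ch', '-u', svc_account + ':R', target],
        ['gsutil', '-m', 'acl', 'ch', '-u', svc_account + ':R', '-r', target],
        ['gsutil', '-m', 'acl', 'ch', '-u', svc_account + ':W', target],
    ]


# Made-up response for illustration; a real one comes from the getConfig call.
sample = '{"serviceAccount": "service-123@example.iam.gserviceaccount.com"}'
account = service_account_from_config(sample)
for cmd in acl_commands(account, 'twitter-sentiment-classifier'):
    print(' '.join(cmd))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)` on a machine where `gsutil` is installed and authenticated.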

<h2> 1.1 Running the Python module from the command-line </h2>

Run the training module locally with this command:

In [12]:
%%bash
rm -rf output/
mkdir output
export PYTHONPATH=${PYTHONPATH}:${PWD}
python -m src.task \
   --root_path=${PWD} \
   --train_data_path='data/train' \
   --val_data_path='data/test'  \
   --resources_path='resources' \
   --output_dir='output' \
   --job-dir='./tmp'

In [16]:
%%bash
ls $PWD/output/lstm
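The flags passed to `src.task` above imply an argument parser along the following lines. The actual `src/task.py` is not shown in this notebook, so this is an assumed sketch: the `required` settings and the `--train_steps` default are guesses made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Hypothetical src.task entry point')
    for flag in ('--root_path', '--train_data_path', '--val_data_path',
                 '--resources_path', '--output_dir'):
        parser.add_argument(flag, required=True)
    # --job-dir is also passed to the module by gcloud, so it has to be accepted.
    parser.add_argument('--job-dir', default=None)
    # Assumed default; the cloud job in section 1.3 passes --train_steps explicitly.
    parser.add_argument('--train_steps', type=int, default=1000)
    return parser


args = build_parser().parse_args([
    '--root_path=.', '--train_data_path=data/train', '--val_data_path=data/test',
    '--resources_path=resources', '--output_dir=output', '--job-dir=./tmp',
])
print(args.output_dir, args.job_dir, args.train_steps)
```

Note that `argparse` exposes `--job-dir` as `args.job_dir`, while the underscore flags keep their names unchanged.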

<h2> 1.2 Running locally using gcloud </h2>
Run the job locally using the gcloud command-line tool. Make sure the Google Cloud SDK is installed on your machine.

In [21]:
%%bash
rm -rf output/
mkdir output
gcloud ml-engine local train \
   --module-name=src.task \
   --package-path=${PWD}/src \
   -- \
   --root_path=${PWD} \
   --train_data_path='data/train' \
   --val_data_path='data/test'  \
   --resources_path='resources' \
   --output_dir='output' \
   --job-dir='./tmp'

Process is terminated.


<h2> 1.3 Submit training job using gcloud </h2>

First copy the training data to the cloud.  Then, launch a training job.

After you submit the job, go to the cloud console (http://console.cloud.google.com) and select <b>Machine Learning | Jobs</b> to monitor progress.  

<b>Note:</b> Don't be concerned if the notebook stalls (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud.  Use the Cloud Console link (above) to monitor the job.

In [19]:
%%bash
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/output/
gsutil -m cp -r ${PWD}/data/* gs://${BUCKET}/data/  # -r so that subdirectories are copied too

In [20]:
%%bash
OUTDIR=gs://${BUCKET}/output
JOBNAME=train_job_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=src.task \
   --package-path=${PWD}/src \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=$TFVERSION \
   -- \
   --root_path="gs://${BUCKET}" \
   --train_data_path="gs://${BUCKET}/data/train" \
   --val_data_path="gs://${BUCKET}/data/test"  \
   --resources_path="gs://${BUCKET}/resources" \
   --output_dir=$OUTDIR \
   --train_steps=10000
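The job name built in the cell above can be reproduced in Python, for example when jobs are submitted from a script. This is a sketch under one assumption, stated in the code: job names must start with a letter and contain only letters, digits and underscores, which is why the timestamp uses underscores rather than dashes. The helper name `make_job_name` is hypothetical.

```python
import re
from datetime import datetime, timezone


def make_job_name(prefix='train_job', now=None):
    """Mirror `train_job_$(date -u +%y%m%d_%H%M%S)` from the bash cell."""
    now = now or datetime.now(timezone.utc)
    name = '{}_{}'.format(prefix, now.strftime('%y%m%d_%H%M%S'))
    # Assumption: names start with a letter and use only letters, digits, underscores.
    if not re.match(r'^[A-Za-z][A-Za-z0-9_]*$', name):
        raise ValueError('Unexpected job name: {!r}'.format(name))
    return name


print(make_job_name(now=datetime(2019, 5, 1, 12, 30, 0, tzinfo=timezone.utc)))
```

Because the timestamp has one-second resolution, two jobs submitted within the same second would collide; a different prefix per experiment avoids that.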