## Setup

In [None]:
! python3 -m virtualenv env

In [None]:
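# note: each `!` command runs in its own subshell, so this activation does not persist
# for later cells; activate the environment in a terminal before starting Jupyter instead
! source env/bin/activate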

In [None]:
! pip3 install tensorflow==2.2

In [None]:
! pip install jupyter notebook

In [None]:
! git clone https://github.com/cloudacademy/aiplatform-intro.git

In [None]:
%cd aiplatform-intro/iris/trainer

## train data

In [None]:
! python3 iris.py --job-dir export

output:

4/4 - training steps; the accuracy shown is on the train data

1/1 - evaluation step; the accuracy shown is on the test data (much lower than the train accuracy if the model overfits)

## train locally

Go to the parent folder of the folder that contains the training script (iris.py), not into the trainer folder itself.
- script location: tensorflow1/aiplatform-intro/iris/trainer/iris.py

cd aiplatform-intro/iris

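- local
    - train on the local machine instead of on AI Platform
- module-name: trainer.iris
    - trainer is the folder (the package)
    - iris is the script iris.py, with the .py dropped
- package-path: trainer
    - the folder that holds the package
- job-dir
    - the folder that receives the results (the export folder), relative to the current directory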
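
The rule for deriving `--module-name` from the script path can be sketched in Python. The helper `module_name_from_path` is hypothetical (it is not part of gcloud); it only mirrors the rule in the list above: take the path relative to the parent folder, turn separators into dots, and drop the `.py`.

```python
from pathlib import PurePosixPath


def module_name_from_path(script_path: str) -> str:
    """Turn a script path (relative to the package's parent folder) into a dotted module name."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py file, got {script_path!r}")
    # drop the .py suffix and join the folder/file parts with dots
    return ".".join(path.with_suffix("").parts)


print(module_name_from_path("trainer/iris.py"))  # trainer.iris
```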

In [None]:
! gcloud ai-platform local train --module-name trainer.iris --package-path trainer --job-dir export

## install GCP SDK

https://cloud.google.com/sdk/docs/install#deb

## setup GCP

In [18]:
# see list of projects
! gcloud projects list

PROJECT_ID              NAME            PROJECT_NUMBER
graceful-smithy-315106  va-google-auth  349991753941
keras1-316117           keras1          646436777927
stoked-aloe-316710      tensorflow2     683223887003


In [9]:
! gcloud config get-value project

stoked-aloe-316710


In [10]:
PROJECT_ID = 'stoked-aloe-316710'

In [11]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


In [12]:
# check default project_id
! gcloud config list core/project

[core]
project = stoked-aloe-316710

Your active configuration is: [default]


## bucket

In [18]:
# BUCKET_NAME must be globally unique, so prefix it with the project ID
# REGION takes only the region code, without the location label,
# e.g. 'asia-southeast1', not 'asia-southeast1 (Singapore)'

BUCKET_NAME = PROJECT_ID + '_bucket1'
REGION = 'us-central1'
BUCKET = 'gs://' + BUCKET_NAME
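
A rough sketch of how the bucket URI could be built with a basic format check first. The helper names are hypothetical, and the regular expression is a simplified version of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, `-`, `_` and `.`, starting and ending with a letter or digit). It cannot check global uniqueness; only creating the bucket can.

```python
import re

# simplified check; the full Cloud Storage naming rules cover more cases
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified format check."""
    return bool(_BUCKET_RE.match(name))


def bucket_uri(project_id: str, suffix: str = "_bucket1") -> str:
    """Prefix the suffix with the project ID and return the gs:// URI."""
    name = project_id + suffix
    if not looks_like_valid_bucket_name(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return "gs://" + name


print(bucket_uri("stoked-aloe-316710"))  # gs://stoked-aloe-316710_bucket1
```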

In [19]:
# if the bucket does not exist yet, create it:
! gsutil mb -l $REGION $BUCKET

Creating gs://stoked-aloe-316710_bucket1/...


In [1]:
# check that the bucket was created
# https://cloud.google.com/storage/docs/gsutil/commands/ls
! gsutil ls

gs://stoked-aloe-316710_bucket1/


## train model via ai platform

In [56]:
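# auth on gcloud: the `!` prefix is missing below, hence the SyntaxError; `gcloud auth login` is interactive, so run it in a terminal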

gcloud auth login

SyntaxError: invalid syntax (<ipython-input-56-992908956a87>, line 3)

To run a TensorFlow script on AI Platform, it must first be packaged as a Python package.
- the trainer folder needs an `__init__.py` file
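
A minimal sketch of that packaging check, run in a throwaway directory so nothing in the real project is touched. The helper `is_python_package` is hypothetical; it only tests for the `__init__.py` file mentioned above.

```python
import tempfile
from pathlib import Path


def is_python_package(folder: Path) -> bool:
    """A folder counts as a regular package if it contains an __init__.py file."""
    return (folder / "__init__.py").is_file()


with tempfile.TemporaryDirectory() as tmp:
    trainer = Path(tmp) / "trainer"
    trainer.mkdir()
    (trainer / "iris.py").write_text("# training script placeholder\n")
    before = is_python_package(trainer)
    (trainer / "__init__.py").touch()  # an empty file is enough
    after = is_python_package(trainer)

print(before, after)  # False True
```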

In [24]:
from datetime import datetime
dt= datetime.now()
dt_str= dt.strftime('%m%d%y_%H%M')
print(dt_str)

061321_1103


In [25]:
# job_name cannot be repeated across jobs. use timestamp to make it unique
JOB_NAME = 'iris1_' + dt_str
JOB_DIR = 'gs://' + BUCKET_NAME + '/job1'

In [26]:
JOB_NAME

'iris1_061321_1103'
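
The timestamp trick can be wrapped in a small helper. `make_job_name` is a hypothetical name, and the validation rule (start with a letter, then only letters, digits and underscores) is my understanding of the AI Platform job ID constraints, so treat it as an assumption.

```python
import re
from datetime import datetime


def make_job_name(prefix: str, now: datetime) -> str:
    """Append an mmddyy_HHMM timestamp so repeated submissions get distinct names."""
    name = f"{prefix}_{now.strftime('%m%d%y_%H%M')}"
    # assumed rule: a letter first, then letters, digits and underscores only
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid job name: {name!r}")
    return name


print(make_job_name("iris1", datetime(2021, 6, 13, 11, 3)))  # iris1_061321_1103
```

The timestamp has minute resolution, so two jobs submitted within the same minute would still collide; adding `%S` to the format avoids that.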

In [27]:
%pwd

'/home/galen/Desktop/tensorflow2'

In [28]:
%cd aiplatform-intro/iris

/home/galen/Desktop/tensorflow2/aiplatform-intro/iris


In [30]:
# better to run this in a terminal, where the job is easier to monitor

!gcloud ai-platform jobs submit training $JOB_NAME  --module-name trainer.iris   --package-path trainer   --staging-bucket $BUCKET  --region $REGION  --python-version 3.7   --runtime-version 2.2   --job-dir $JOB_DIR 

^C


Command killed by keyboard interrupt



In [None]:
# see if job created
! gsutil ls $BUCKET

gs://stoked-aloe-316710_bucket1/iris1_061321_1103/


In [55]:
# get info about a job. this was run after the job had finished; completion also shows in the log as 'task completed'
! gcloud ai-platform jobs describe iris1_061221_1924

createTime: '2021-06-12T20:33:44Z'
endTime: '2021-06-12T20:41:49Z'
etag: DOMtBVi7g68=
jobId: iris1_061221_1924
startTime: '2021-06-12T20:39:17Z'
state: SUCCEEDED
trainingInput:
  jobDir: gs://keras1-316117_bucket1/job1
  packageUris:
  - gs://keras1-316117_bucket1/iris1_061221_1924/827c1b4382400915233bf183725708023f8636e5f988d42b6431194ebe23e75b/trainer-0.0.0.tar.gz
  pythonModule: trainer.iris
  pythonVersion: '3.7'
  region: asia-southeast1
  runtimeVersion: '2.2'
trainingOutput:
  consumedMLUnits: 0.07

View job in the Cloud Console at:
https://console.cloud.google.com/mlengine/jobs/iris1_061221_1924?project=keras1-316117

View logs at:
https://console.cloud.google.com/logs?resource=ml_job%2Fjob_id%2Firis1_061221_1924&project=keras1-316117
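
The timestamps in the `describe` output can be turned into durations. This sketch copies the three values from the output above; reading the gap between `createTime` and `startTime` as queueing/provisioning time is my interpretation, not something the output states.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two timestamps in the format gcloud prints."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


create_time = "2021-06-12T20:33:44Z"
start_time = "2021-06-12T20:39:17Z"
end_time = "2021-06-12T20:41:49Z"

queue_seconds = seconds_between(create_time, start_time)  # before the job started
run_seconds = seconds_between(start_time, end_time)       # while the job was running
print(queue_seconds, run_seconds)  # 333 152
```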


### terminal-only method to do all above

In [2]:
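# note: each `!` line runs in its own subshell, so PROJECT, BUCKET and REGION are gone
# by the time `gsutil mb` runs, which leads to the error below.
# run these lines in one terminal session (or one %%bash cell) instead
! PROJECT=$(gcloud config list project --format 'value(core.project)')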
! BUCKET=gs://${PROJECT}-aiplatform 
! REGION=asia-southeast1
! gsutil mb -l $REGION $BUCKET

CommandException: Incorrect option(s) specified. Usage:

  gsutil mb [-b (on|off)] [-c <class>] [-l <location>] [-p <proj_id>]
            [--retention <time>] gs://<bucket_name>...

For additional help run:
  gsutil help mb


In [4]:
%cd aiplatform-intro/iris

/home/galen/Desktop/tensorflow1/aiplatform-intro/iris


In [5]:
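# note: `! JOB=job1` sets JOB only inside its own subshell, so $JOB is empty in the next
# command, and so are $REGION and $BUCKET from the cell above;
# an empty $REGION would explain the `--region: expected one argument` error below
! JOB=job1

# a backslash-newline is removed entirely, so keep a space before each `\`
# (or at the start of the next line); otherwise adjacent arguments run together
! gcloud ai-platform jobs submit training $JOB \
  --module-name trainer.iris \
  --package-path trainer \
  --staging-bucket $BUCKET \
  --region $REGION \
  --python-version 3.7 \
  --runtime-version 2.2 \
  --job-dir $BUCKET/$JOB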

ERROR: (gcloud.ai-platform.jobs.submit.training) argument --region: expected one argument
Usage: gcloud ai-platform jobs submit training JOB [optional flags] [-- USER_ARGS ...]
  optional flags may be  --async | --config | --help | --job-dir | --kms-key |
                         --kms-keyring | --kms-location | --kms-project |
                         --labels | --master-accelerator | --master-image-uri |
                         --master-machine-type | --module-name |
                         --package-path | --packages |
                         --parameter-server-accelerator |
                         --parameter-server-count |
                         --parameter-server-image-uri |
                         --parameter-server-machine-type | --python-version |
                         --region | --runtime-version | --scale-tier |
                         --service-account | --staging-bucket | --stream-logs |
                         --use-chief-in-tf-config |
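
The `--region: expected one argument` error above is consistent with `$REGION` being empty: in a notebook, a variable assigned with `! VAR=...` does not survive into the next `!` line. One way around this, sketched below, is to keep the values in Python and assemble the argument list there. `build_submit_command` and the bucket name are hypothetical; the flags are the ones already used in this notebook.

```python
import shlex


def build_submit_command(job, bucket, region):
    """Assemble the gcloud arguments as a list so no shell variables are needed."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job,
        "--module-name", "trainer.iris",
        "--package-path", "trainer",
        "--staging-bucket", bucket,
        "--region", region,
        "--python-version", "3.7",
        "--runtime-version", "2.2",
        "--job-dir", f"{bucket}/{job}",
    ]


# placeholder values for illustration only
cmd = build_submit_command("job1", "gs://my-project-aiplatform", "asia-southeast1")
print(shlex.join(cmd))
# to actually submit: import subprocess; subprocess.run(cmd, check=True)
```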

# Feature engineering

In [10]:
! pip3 install numpy pandas scikit-learn

/home/galen/Desktop/tensorflow1


In [21]:
%cd /home/galen/Desktop/tensorflow1/aiplatform-intro/pets

/home/galen/Desktop/tensorflow1/aiplatform-intro/pets


In [14]:
! gcloud ai-platform local train --module-name trainer.pets --package-path trainer --job-dir export

7383 train examples
1846 validation examples
2308 test examples
2021-06-13 07:53:57.303801: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy 0.7435008883476257


- the code has no hidden layers, so this is just a linear model, not a deep model
    - should experiment with features and with deep neural networks to see what works well
- embedding columns (solve the high-dimensionality problem of one-hot columns)
- feature engineering by creating categorical variables from numerical variables
- bucketized columns
- crossed-feature columns (several columns combined)
    - need a hash_bucket size to limit the number of combinations, especially if the combined columns have many categories
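
The ideas behind bucketized and crossed columns can be shown in plain Python. This is an illustration of the concept only, not how `tf.feature_column` implements it; the function names, the boundaries and the example values are all hypothetical.

```python
import bisect
import hashlib


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into (boundaries sorted ascending)."""
    return bisect.bisect_right(boundaries, value)


def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of categorical values into a fixed number of buckets."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()  # stable across runs, unlike built-in hash()
    return int(digest, 16) % hash_bucket_size


age_boundaries = [1, 3, 5]  # buckets: below 1, 1 to 2, 3 to 4, 5 and above
print(bucketize(0, age_boundaries), bucketize(3, age_boundaries), bucketize(7, age_boundaries))  # 0 2 3
print(crossed_bucket(["Dog", 2], hash_bucket_size=1000))  # an index in the range 0..999
```

However many distinct combinations exist, the crossed feature never needs more than `hash_bucket_size` slots, at the cost of occasional collisions between unrelated combinations.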

# Hyperparameter tuning

Hyperparameters are settings for a training run that are set ahead of time.

e.g.
- batch size
- number of hidden layers

Parameters are the weights the model learns during training.
Hyperparameters are set manually and do not change during training.

AI Platform can tune hyperparameters automatically.
- Bayesian search is the default method
- tune fewer hyperparameters to keep the search efficient
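
A toy example of the parameter/hyperparameter distinction, in plain Python. The weight `w` is a parameter (learned), while `learning_rate` and `epochs` are hyperparameters (fixed per run). The search here is a simple exhaustive try-each-value loop for illustration, not the Bayesian method mentioned above; the data and candidate values are made up.

```python
def train(learning_rate, epochs):
    """Fit y = w * x by gradient descent; return the learned weight and the final loss."""
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    w = 0.0  # parameter: updated during training
    for _ in range(epochs):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, loss


# hyperparameters: chosen before each run and never changed by training itself
candidates = [0.001, 0.01, 0.1]
results = {lr: train(lr, epochs=20) for lr in candidates}
best_lr = min(results, key=lambda lr: results[lr][1])
print(best_lr)  # 0.1
```

Each extra hyperparameter multiplies the number of runs a search like this needs, which is one reason to tune only a few.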

## Distributed training

- most real-world models take too long to train on 1 machine
- use the --scale-tier flag to run a distributed job

#### training cluster

- a group of VMs, each called a training instance or node
- dependencies are installed on each instance
- when the trainer script runs, each running copy is called a replica
- 1 of the replicas is designated as the master
- some replicas are designated as workers, each running part of the job
- some replicas are designated as parameter servers
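
Each replica can find out its own role from the `TF_CONFIG` environment variable, which, to my knowledge, AI Platform sets on every node of a distributed job. The sketch below parses such a value; the function name and the example hostnames are hypothetical.

```python
import json
import os


def replica_role(tf_config_json):
    """Return (task_type, task_index) from a TF_CONFIG JSON string."""
    config = json.loads(tf_config_json or "{}")
    task = config.get("task", {})
    # with no TF_CONFIG (e.g. a single-machine run) treat the process as the only master
    return task.get("type", "master"), task.get("index", 0)


# made-up example value for a cluster with 1 master, 2 workers and 1 parameter server
example = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

print(replica_role(example))  # ('worker', 1)
print(replica_role(os.environ.get("TF_CONFIG")))  # role of the current process
```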

#### 2 types of distributed training
- synchronous
    - all workers keep copy of parameters, and parameters updated at end of every training step
- asynchronous
    - workers run independently and send parameter updates to parameter servers
    
Use tf.distribute.Strategy to choose between them.
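
The difference between the two modes can be illustrated with a toy simulation in plain Python. This is not `tf.distribute`; it is a simplified model with two "workers", made-up data, and the asynchronous case reduced to updates being applied one at a time.

```python
import math


def gradient(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # one equal-sized shard per worker

w, lr = 0.0, 0.01

# synchronous step: every worker computes a gradient from the same w,
# the gradients are averaged, and every copy of w receives the same update
worker_grads = [gradient(w, sx, sy) for sx, sy in shards]
sync_grad = sum(worker_grads) / len(worker_grads)
w_sync = w - lr * sync_grad

# asynchronous-style step: updates are applied one at a time, so the second
# worker's gradient is computed from a w that the first worker already changed
w_async = w
for sx, sy in shards:
    w_async -= lr * gradient(w_async, sx, sy)

print(math.isclose(sync_grad, gradient(w, xs, ys)))  # True
print(w_sync, w_async)
```

With equal-sized shards the averaged synchronous gradient matches the full-batch gradient, while the asynchronous-style result depends on the order in which updates arrive.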

## Deploy model on AI Platform

- an AI Platform model is not the same as a TensorFlow model
- it is a resource that holds the different versions of a trained model

why version?
- Versioning can help you ensure that you don’t break users who are dependent on a specific version of your model when you publish a new version. Depending on your use case, you can also serve different model versions to a subset of your users, for example, to run an experiment.


https://blog.tensorflow.org/2020/04/how-to-deploy-tensorflow-2-models-on-cloud-ai-platform.html

In [19]:
%pwd

'/home/galen/Desktop/tensorflow1/aiplatform-intro/pets'

In [20]:
%cd /home/galen/Desktop/tensorflow1/aiplatform-intro/iris

/home/galen/Desktop/tensorflow1/aiplatform-intro/iris


#### create model

In [9]:
# prediction is only available in the regions listed at https://cloud.google.com/ai-platform/prediction/docs/regions,
# which are not the same as the training regions. us-central1 is used here as an example.

# the model resource name (iris_model1) must be unique within a project

! gcloud ai-platform models create iris_model1 --regions='us-central1'

Using endpoint [https://ml.googleapis.com/]
Created ai platform model [projects/stoked-aloe-316710/models/iris_model1].


The model now shows in the AI Platform console, with no versions yet.

#### create version of the model

In [11]:
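# origin: the Cloud Storage folder from the previous training run that directly contains saved_model.pb,
# which is used to create the version. the FAILED_PRECONDITION error below may mean it was not found at this path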

! gcloud ai-platform versions create v1 \
  --model iris_model1 \
  --runtime-version 2.2 \
  --region global \
  --staging-bucket gs://stoked-aloe-316710_bucket1 \
  --origin gs://stoked-aloe-316710_bucket1/job1

Using endpoint [https://ml.googleapis.com/]
ERROR: (gcloud.ai-platform.versions.create) FAILED_PRECONDITION: Framework can not be identified from model path.


#### get online prediction

- returns predictions quickly, as a real-time response

In [14]:
%cd /home/galen/Desktop/tensorflow1/aiplatform-intro/iris

/home/galen/Desktop/tensorflow1/aiplatform-intro/iris


In [16]:
# json-request is the local path of the test dataset: /home/galen/Desktop/tensorflow1/aiplatform-intro/iris/test.json
# normally predict is called from an app, which translates the raw scores into a result

! gcloud ai-platform predict \
  --model iris_model1 \
  --version v1 \
  --region global \
  --json-request test.json

Using endpoint [https://ml.googleapis.com/]
DENSE_2
[8.927209854125977, 5.369498252868652, -7.129472732543945]
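
A sketch of how an app might translate these scores into a label. It assumes the three numbers are raw, unnormalised scores (they are not in the range 0 to 1) and that the classes follow the usual Iris order; both are assumptions to confirm against the training script.

```python
import math

# assumption: the usual Iris class order; check the label encoding in the training script
CLASS_NAMES = ["setosa", "versicolor", "virginica"]


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [8.927209854125977, 5.369498252868652, -7.129472732543945]
probs = softmax(scores)
best = max(range(len(probs)), key=probs.__getitem__)
print(CLASS_NAMES[best], round(probs[best], 3))  # setosa 0.972
```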


#### get batch prediction

- for big jobs
- takes longer to start up
- prediction files written to Cloud Storage instead of console
- cheaper

In [None]:
gcloud ai-platform jobs submit prediction
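
The command above is incomplete as written. A sketch of a fuller call, assembled in Python so it can be inspected before running: the job name and the Cloud Storage paths are hypothetical placeholders, and the flags are the ones I understand `gcloud ai-platform jobs submit prediction` to require.

```python
import shlex


def build_batch_predict_command(job, model, version, input_paths, output_path, region):
    """Assemble a `gcloud ai-platform jobs submit prediction` call as an argument list."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "prediction", job,
        "--model", model,
        "--version", version,
        "--data-format", "text",       # newline-delimited JSON instances
        "--input-paths", input_paths,  # Cloud Storage path(s) to the input files
        "--output-path", output_path,  # Cloud Storage folder for the result files
        "--region", region,
    ]


cmd = build_batch_predict_command(
    job="iris_batch1",  # hypothetical job name
    model="iris_model1",
    version="v1",
    input_paths="gs://stoked-aloe-316710_bucket1/batch/test.json",  # hypothetical path
    output_path="gs://stoked-aloe-316710_bucket1/batch/output",     # hypothetical path
    region="us-central1",
)
print(shlex.join(cmd))
# to actually submit: import subprocess; subprocess.run(cmd, check=True)
```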