 ============================================================================== \
 Copyright 2021 Google LLC. This software is provided as-is, without warranty \
 or representation for any use or purpose. Your use of it is subject to your \
 agreement with Google. \
 ============================================================================== 
 
 Author: Elvin Zhu, Chanchal Chatterjee \
 Email: elvinzhu@google.com \
<img src="img/google-cloud-icon.jpg" alt="Drawing" style="width: 200px;"/>

### Import packages

In [20]:
# Use %cd so the directory change persists; !cd only affects a throwaway subshell.
%cd /home/jupyter/vapit/ai-platform-tf
!python3 -m pip install -r ./requirements.txt --user
!python3 -m pip install google-cloud-aiplatform
!python3 -m pip install google-cloud-storage==1.32
!gcloud components update --quiet
!python3 -m pip install build


Collecting google-cloud-storage<2.0.0dev,>=1.32.0
  Using cached google_cloud_storage-1.38.0-py2.py3-none-any.whl (103 kB)
Collecting google-resumable-media<2.0dev,>=0.6.0
  Using cached google_resumable_media-1.3.0-py2.py3-none-any.whl (75 kB)
Installing collected packages: google-resumable-media, google-cloud-storage
  Attempting uninstall: google-resumable-media
    Found existing installation: google-resumable-media 0.5.1
    Uninstalling google-resumable-media-0.5.1:
      Successfully uninstalled google-resumable-media-0.5.1
  Attempting uninstall: google-cloud-storage
    Found existing installation: google-cloud-storage 1.23.0
    Uninstalling google-cloud-storage-1.23.0:
      Successfully uninstalled google-cloud-storage-1.23.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-cloud 0.1.13 requires tensorboard>=2.3.0, but you have tensor

In [21]:
# Import packages

import json
import logging
import pandas as pd
import numpy as np
from datetime import datetime
from pytz import timezone
from googleapiclient import discovery
from google.cloud import aiplatform

### Configure Global Variables

List your current GCP project name

In [22]:
!gcloud config list --format 'value(core.project)' 2>/dev/null

cchatterjee-sandbox


Configure your system variables

In [23]:
# Configure your global variables
PROJECT = 'cchatterjee-sandbox'  # Replace with your project ID
USER = 'cchatterjee'             # Replace with your user name
BUCKET_NAME = 'vapit_data'       # Replace with your gcs bucket name

FOLDER_NAME = 'tf_models'
TIMEZONE = 'US/Pacific'
REGION = 'us-central1'
PACKAGE_URIS = f"gs://{BUCKET_NAME}/trainer/tensorflow/trainer-0.1.tar.gz" 
TRAIN_FEATURE_PATH = f"gs://{BUCKET_NAME}/tf_data/mortgage_structured_x_train.csv" 
TRAIN_LABEL_PATH = f"gs://{BUCKET_NAME}/tf_data/mortgage_structured_y_train.csv" 
TEST_FEATURE_PATH = f"gs://{BUCKET_NAME}/tf_data/mortgage_structured_x_test.csv" 
TEST_LABEL_PATH = f"gs://{BUCKET_NAME}/tf_data/mortgage_structured_y_test.csv"


Create your bucket

In [24]:
!gsutil mb -l $REGION gs://$BUCKET_NAME 

Creating gs://vapit_data/...
ServiceException: 409 A Cloud Storage bucket named 'vapit_data' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Build python package and upload to your bucket

In [25]:
# Use %cd so the build runs in the package directory; !cd does not persist.
%cd /home/jupyter/vapit/ai-platform-tf
!python3 -m build
!gsutil cp ./dist/trainer-0.1.tar.gz $PACKAGE_URIS

Found existing installation: setuptools 47.1.0
Uninstalling setuptools-47.1.0:
  Successfully uninstalled setuptools-47.1.0
Collecting setuptools>=40.8.0
  Using cached setuptools-57.0.0-py3-none-any.whl (821 kB)
Collecting wheel
  Using cached wheel-0.36.2-py2.py3-none-any.whl (35 kB)
Installing collected packages: setuptools, wheel
Successfully installed setuptools-57.0.0 wheel-0.36.2
You should consider upgrading via the '/tmp/build-env-_3d0_c_1/bin/python -m pip install --upgrade pip' command.
running egg_info
writing trainer.egg-info/PKG-INFO
writing dependency_links to trainer.egg-info/dependency_links.txt
writing requirements to trainer.egg-info/requires.txt
writing top-level names to trainer.egg-info/top_level.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
running sdist
running egg_info
writing trainer.egg-info/PKG-INFO
writing dependency_links to trainer.egg-info/dependency_links.txt
writing requirements to tra

In [26]:
# Freddie Mac public mortgage data (do not change)
INPUT_DATA = "gs://tuti_asset/datasets/mortgage_structured.csv" # public mortgage data 
TARGET_COLUMN = "TARGET" # Column name for target labels

-----------
### Dataset preprocessing

Preprocess the input data by:

    1. Dropping the unique ID column;
    2. Converting categorical columns into one-hot encodings;
    3. Counting the number of unique classes;
    4. Splitting the data into train and test sets;
    5. Saving the processed data to GCS.
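
The five steps above can be sketched with pandas on a tiny in-memory frame. This is only an illustration, not the actual contents of `preprocessing.py`: the `credit_score` column, all values, the 75/25 split ratio, and the local output directory are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame; "credit_score" and all values are hypothetical.
raw = pd.DataFrame({
    "LOAN_SEQUENCE_NUMBER": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"],
    "channel": ["R", "B", "R", "C", "B", "R", "C", "R"],
    "credit_score": [700, 650, 720, 610, 690, 705, 640, 730],
    "TARGET": [0, 1, 0, 2, 0, 3, 1, 0],
})

ID_COLUMN = "LOAN_SEQUENCE_NUMBER"
TARGET_COLUMN = "TARGET"
CATEGORICAL_COLUMNS = ["channel"]

# 1. Drop the unique ID column, which is not a useful feature.
df = raw.drop(columns=[ID_COLUMN])

# 2. One-hot encode the categorical feature columns.
features = pd.get_dummies(df.drop(columns=[TARGET_COLUMN]), columns=CATEGORICAL_COLUMNS)
labels = df[TARGET_COLUMN]

# 3. Count the number of unique classes.
n_classes = labels.nunique()

# 4. Split into train and test sets (the 75/25 ratio is an assumption).
x_train = features.sample(frac=0.75, random_state=0)
x_test = features.drop(index=x_train.index)
y_train = labels.loc[x_train.index]
y_test = labels.loc[x_test.index]

# 5. Save the processed data (a local temp directory stands in for GCS here).
out_dir = tempfile.mkdtemp()
x_train.to_csv(os.path.join(out_dir, "x_train.csv"), index=False)
x_test.to_csv(os.path.join(out_dir, "x_test.csv"), index=False)
y_train.to_csv(os.path.join(out_dir, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(out_dir, "y_test.csv"), index=False)

print(n_classes, x_train.shape, x_test.shape)
```

The class count matters downstream because it determines the width of the model's output layer.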

In [27]:
!python3 preprocessing.py \
    --input_file $INPUT_DATA \
    --x_train_name $TRAIN_FEATURE_PATH \
    --x_test_name $TEST_FEATURE_PATH \
    --y_train_name $TRAIN_LABEL_PATH \
    --y_test_name $TEST_LABEL_PATH \
    --target_column $TARGET_COLUMN

INFO:root:Preprocessing raw data:
INFO:root: => Drop id column:
INFO:root: => One hot encoding categorical features
INFO:root: => Count number of classes
INFO:root: => Perform train/test split
INFO:root:Reading raw data file: gs://tuti_asset/datasets/mortgage_structured.csv
INFO:root:Drop unique id column which is not an useful feature for ML: LOAN_SEQUENCE_NUMBER
INFO:root:Convert categorical columns into one-hot encodings
INFO:root:categorical feature: first_time_home_buyer_flag
INFO:root:categorical feature: occupancy_status
INFO:root:categorical feature: channel
INFO:root:categorical feature: property_state
INFO:root:categorical feature: property_type
INFO:root:categorical feature: loan_purpose
INFO:root:categorical feature: seller_name
INFO:root:categorical feature: service_name
INFO:root:Count number of unique classes ...
INFO:root:No. of Classes: 4
INFO:root:Perform train/test split ...
INFO:root:Get feature/label shapes ...
INFO:root:x_train shape = (93639, 149)
INFO:root:x_tes

------
### Training with Google Vertex AI 

For the full article, please visit: https://cloud.google.com/vertex-ai/docs

Where Vertex AI fits in the ML workflow \
The diagram below gives a high-level overview of the stages in an ML workflow. The blue-filled boxes indicate where Vertex AI provides managed services and APIs:

<img src="img/ml-workflow.svg" alt="Drawing">

As the diagram indicates, you can use Vertex AI to manage the following stages in the ML workflow:

- Train an ML model on your data:
 - Train model
 - Evaluate model accuracy
 - Tune hyperparameters
 
 
- Deploy your trained model.

- Send prediction requests to your model:
 - Online prediction
 - Batch prediction (for TensorFlow only)
 
 
- Monitor the predictions on an ongoing basis.

- Manage your models and model versions.


#### Train locally

Before submitting training jobs to Vertex AI, you can test your train.py code in the local environment. You can run the Python script directly from the command line; another, possibly better, option is the `gcloud ai-platform local train` command, which helps verify that your entire Python package is ready to be submitted to remote VMs.

In [28]:
# Train on local machine with python command
!python3 trainer/train.py \
    --job-dir ./models \
    --train_feature_name $TRAIN_FEATURE_PATH \
    --train_label_name $TRAIN_LABEL_PATH \
    --test_feature_name $TEST_FEATURE_PATH \
    --test_label_name $TEST_LABEL_PATH

2021-06-15 21:23:55.324645: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-06-15 21:23:55.324787: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-06-15 21:23:55.324815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Namespace(batch_size=4, depth=3, dropout_rate=0.02, epochs=1, job_dir='./models', learnin

------
### Hyperparameter Tuning

To use hyperparameter tuning in your training job you must perform the following steps:

- Specify the hyperparameter tuning configuration for your training job by including a `study_spec` (the metric to optimize and the parameters to search) in your hyperparameter tuning job request.

- Include the following code in your training application:

 - Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial.
 - Add your hyperparameter metric to the summary for your graph.
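
As a rough sketch of the argument-parsing step, a training script might expose each tunable hyperparameter as a command-line flag. The flag names match the ones passed to the trainer in this notebook; the `parse_args` helper itself and the `learning_rate` default are assumptions, and the other defaults simply mirror the values visible in the local run's `Namespace` output.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed on the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Illustrative trainer arguments")
    parser.add_argument("--job-dir", default="./models", help="Output directory")
    parser.add_argument("--depth", type=int, default=3)
    parser.add_argument("--dropout_rate", type=float, default=0.02)
    # The learning_rate default below is an assumption for this sketch.
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


# Each tuning trial would invoke the script with a different set of values.
args = parse_args(["--depth", "6", "--learning_rate", "0.0003"])
print(args.depth, args.learning_rate, args.batch_size, args.job_dir)
```

Note that argparse exposes `--job-dir` as `args.job_dir`, since hyphens in flag names are converted to underscores.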


In [29]:
# Give each job a unique name by combining a prefix with a timestamp.
JOBNAME_HPT = 'tensorflow_train_{}_{}_hpt'.format(
    USER,
    datetime.now(timezone(TIMEZONE)).strftime("%m%d%y_%H%M")
    ) # define unique job name

# All tuning trials write their outputs under a shared job directory.
JOB_DIR_HPT = 'gs://{}/{}/jobdir'.format(
    BUCKET_NAME,
    FOLDER_NAME,
    ) # define job dir on GCS

print("JOB_NAME_HPT = ", JOBNAME_HPT)
print("JOB_DIR_HPT = ", JOB_DIR_HPT)

JOB_NAME_HPT =  tensorflow_train_cchatterjee_061521_1424_hpt
JOB_DIR_HPT =  gs://vapit_data/tf_models/jobdir


### Submit the hyperparameter tuning job to Vertex AI

In [30]:
executor_image_uri = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-2:latest'
python_module =  "trainer.train_hpt"
api_endpoint = "us-central1-aiplatform.googleapis.com"
machine_type = "n1-standard-4"

# The AI Platform services require regional API endpoints.
client_options = {"api_endpoint": api_endpoint}
# Initialize client that will be used to create and send requests.
# This client only needs to be created once, and can be reused for multiple requests.
client = aiplatform.gapic.JobServiceClient(client_options=client_options)
       
# study_spec
metric = {
    "metric_id": "accuracy",
    "goal": aiplatform.gapic.StudySpec.MetricSpec.GoalType.MAXIMIZE,
}

depth = {
        "parameter_id": "depth",
        "integer_value_spec": {"min_value": 1, "max_value": 10},
        "scale_type": aiplatform.gapic.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,
}
dropout_rate = {
        "parameter_id": "dropout_rate",
        "double_value_spec": {"min_value": 0.001, "max_value": 0.1},
        "scale_type": aiplatform.gapic.StudySpec.ParameterSpec.ScaleType.UNIT_LOG_SCALE,
}
learning_rate = {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 0.00001, "max_value": 0.01},
        "scale_type": aiplatform.gapic.StudySpec.ParameterSpec.ScaleType.UNIT_LOG_SCALE,
}
batch_size = {
        "parameter_id": "batch_size",
        "integer_value_spec": {"min_value": 1, "max_value": 16},
        "scale_type": aiplatform.gapic.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,
}
epochs = {
        "parameter_id": "epochs",
        "integer_value_spec": {"min_value": 1, "max_value": 5},
        "scale_type": aiplatform.gapic.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,
}

# trial_job_spec
machine_spec = {
    "machine_type": machine_type,
}
worker_pool_spec = {
    "machine_spec": machine_spec,
    "replica_count": 1,
    "python_package_spec": {
        "executor_image_uri": executor_image_uri,
        "package_uris": [PACKAGE_URIS],
        "python_module": python_module,
        "args": [
            '--job-dir',
            JOB_DIR_HPT,
            '--train_feature_name',
            TRAIN_FEATURE_PATH,
            '--train_label_name',
            TRAIN_LABEL_PATH,
            '--test_feature_name',
            TEST_FEATURE_PATH,
            '--test_label_name',
            TEST_LABEL_PATH,
        ],
    },
}

# hyperparameter_tuning_job
hyperparameter_tuning_job = {
    "display_name": JOBNAME_HPT,
    "max_trial_count": 4,
    "parallel_trial_count": 2,
    "study_spec": {
        "metrics": [metric],
        "parameters": [depth, dropout_rate, learning_rate, batch_size, epochs],
#         "algorithm": aiplatform.gapic.StudySpec.Algorithm.RANDOM_SEARCH,
    },
    "trial_job_spec": {"worker_pool_specs": [worker_pool_spec]},
}
parent = f"projects/{PROJECT}/locations/{REGION}"
response = client.create_hyperparameter_tuning_job(
    parent=parent, hyperparameter_tuning_job=hyperparameter_tuning_job
)
print("response:", response)
job_name_hpt = response.name.split('/')[-1]


response: name: "projects/901951554789/locations/us-central1/hyperparameterTuningJobs/1276704523661869056"
display_name: "tensorflow_train_cchatterjee_061521_1424_hpt"
study_spec {
  metrics {
    metric_id: "accuracy"
    goal: MAXIMIZE
  }
  parameters {
    parameter_id: "depth"
    integer_value_spec {
      min_value: 1
      max_value: 10
    }
    scale_type: UNIT_LINEAR_SCALE
  }
  parameters {
    parameter_id: "dropout_rate"
    double_value_spec {
      min_value: 0.001
      max_value: 0.1
    }
    scale_type: UNIT_LOG_SCALE
  }
  parameters {
    parameter_id: "learning_rate"
    double_value_spec {
      min_value: 1e-05
      max_value: 0.01
    }
    scale_type: UNIT_LOG_SCALE
  }
  parameters {
    parameter_id: "batch_size"
    integer_value_spec {
      min_value: 1
      max_value: 16
    }
    scale_type: UNIT_LINEAR_SCALE
  }
  parameters {
    parameter_id: "epochs"
    integer_value_spec {
      min_value: 1
      max_value: 5
    }
    scale_type: UNIT_LINEAR_

#### Check the status of the tuning job with the Vertex AI client

Send an API request to Vertex AI to get detailed information about the job. The most interesting piece of information is the set of hyperparameter values from the trial with the best performance metric.

In [33]:
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)
name = client.hyperparameter_tuning_job_path(
    project=PROJECT,
    location=REGION,
    hyperparameter_tuning_job=job_name_hpt,
)
response = client.get_hyperparameter_tuning_job(name=name)
print("Job status = ", response.state)
print("response:", response)
# print("response state: ", str(response.state))
if "JobState.JOB_STATE_SUCCEEDED" == str(response.state):
    print("Job state succeeded.")


Job status =  JobState.JOB_STATE_SUCCEEDED
response: name: "projects/901951554789/locations/us-central1/hyperparameterTuningJobs/1276704523661869056"
display_name: "tensorflow_train_cchatterjee_061521_1424_hpt"
study_spec {
  metrics {
    metric_id: "accuracy"
    goal: MAXIMIZE
  }
  parameters {
    parameter_id: "depth"
    integer_value_spec {
      min_value: 1
      max_value: 10
    }
    scale_type: UNIT_LINEAR_SCALE
  }
  parameters {
    parameter_id: "dropout_rate"
    double_value_spec {
      min_value: 0.001
      max_value: 0.1
    }
    scale_type: UNIT_LOG_SCALE
  }
  parameters {
    parameter_id: "learning_rate"
    double_value_spec {
      min_value: 1e-05
      max_value: 0.01
    }
    scale_type: UNIT_LOG_SCALE
  }
  parameters {
    parameter_id: "batch_size"
    integer_value_spec {
      min_value: 1
      max_value: 16
    }
    scale_type: UNIT_LINEAR_SCALE
  }
  parameters {
    parameter_id: "epochs"
    integer_value_spec {
      min_value: 1
      max_

#### Get the hyperparameters associated with the best metrics

In [34]:
max_ind = 0
max_val = 0
for ind, trials in enumerate(response.trials):
    value = trials.final_measurement.metrics[0].value
    print("Metrics Value (larger is better):", value)
    if value > max_val:
        max_val = value
        max_ind = ind
        
param_dict = {}
for params in response.trials[max_ind].parameters:
    param_dict[params.parameter_id] = params.value

print(param_dict)

depth=str(int(param_dict['depth']))
dropout_rate=str(param_dict['dropout_rate'])
learning_rate=str(param_dict['learning_rate'])
batch_size=str(int(param_dict['batch_size']))
epochs=str(int(param_dict['epochs']))


Metrics Value (larger is better): 0.9526052474975586
Metrics Value (larger is better): 0.9338843822479248
Metrics Value (larger is better): 0.9288544058799744
Metrics Value (larger is better): 0.9516867995262146
{'batch_size': 9.0, 'depth': 6.0, 'dropout_rate': 0.010000000000000002, 'epochs': 3.0, 'learning_rate': 0.00031622776601683794}
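
The selection loop above starts from `max_val = 0`, which works for accuracy but would miss the best trial if every metric were negative. A minimal alternative sketch, using plain dictionaries as stand-ins for the trial objects (the `best_trial` helper and all values below are illustrative, not taken from the job):

```python
def best_trial(trials, maximize=True):
    """Return the trial with the best metric (hypothetical helper)."""
    if not trials:
        raise ValueError("no trials to compare")
    pick = max if maximize else min
    return pick(trials, key=lambda trial: trial["metric"])


# Illustrative stand-ins for response.trials; the numbers are made up.
trials = [
    {"metric": 0.91, "parameters": {"depth": 2.0, "epochs": 1.0}},
    {"metric": 0.95, "parameters": {"depth": 6.0, "epochs": 3.0}},
    {"metric": 0.93, "parameters": {"depth": 4.0, "epochs": 5.0}},
]

best = best_trial(trials)
# Integer-valued parameters come back as floats, so cast before formatting.
depth_arg = str(int(best["parameters"]["depth"]))
epochs_arg = str(int(best["parameters"]["epochs"]))
print(best["metric"], depth_arg, epochs_arg)
```

Passing `maximize=False` covers metrics such as loss, where smaller is better.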


------
### Training with Tuned Parameters

Once your hyperparameter tuning job is done, you can take the best combination of hyperparameters from your trials and start a single training job on Vertex AI to train your final model.

In [36]:
# Give each job a unique name by combining a prefix with a timestamp.
JOBNAME_TRN = 'tensorflow_train_{}_{}'.format(
    USER,
    datetime.now(timezone(TIMEZONE)).strftime("%m%d%y_%H%M")
    )
# We use the job names as folder names to store outputs.
JOB_DIR_TRN = 'gs://{}/{}/{}'.format(
    BUCKET_NAME,
    FOLDER_NAME,
    JOBNAME_TRN,
    )

print("JOB_NAME_TRN = ", JOBNAME_TRN)
print("JOB_DIR_TRN = ", JOB_DIR_TRN)

JOB_NAME_TRN =  tensorflow_train_cchatterjee_061521_1449
JOB_DIR_TRN =  gs://vapit_data/tf_models/tensorflow_train_cchatterjee_061521_1449


In [37]:
executor_image_uri = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-2:latest'
python_module = "trainer.train"
api_endpoint = "us-central1-aiplatform.googleapis.com"
machine_type = "n1-standard-4"
        
# The AI Platform services require regional API endpoints.
client_options = {"api_endpoint": api_endpoint}
# Initialize client that will be used to create and send requests.
# This client only needs to be created once, and can be reused for multiple requests.
client = aiplatform.gapic.JobServiceClient(client_options=client_options)
custom_job = {
    "display_name": JOBNAME_TRN,
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": machine_type,
                },
                "replica_count": 1,
                "python_package_spec": {
                    "executor_image_uri": executor_image_uri,
                    "package_uris": [PACKAGE_URIS],
                    "python_module": python_module,
                    "args": [
                        '--job-dir',
                        JOB_DIR_TRN,
                        '--train_feature_name',
                        TRAIN_FEATURE_PATH,
                        '--train_label_name',
                        TRAIN_LABEL_PATH,
                        '--test_feature_name',
                        TEST_FEATURE_PATH,
                        '--test_label_name',
                        TEST_LABEL_PATH,
                        '--depth',
                        depth,
                        '--dropout_rate',
                        dropout_rate,
                        '--learning_rate',
                        learning_rate,
                        '--batch_size',
                        batch_size,
                        '--epochs',
                        epochs
                    ],
                },
            }
        ]
    },
}
parent = f"projects/{PROJECT}/locations/{REGION}"
response = client.create_custom_job(parent=parent, custom_job=custom_job)
print("response:", response)
job_id_trn = response.name.split('/')[-1]


response: name: "projects/901951554789/locations/us-central1/customJobs/276905406385618944"
display_name: "tensorflow_train_cchatterjee_061521_1449"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    python_package_spec {
      executor_image_uri: "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-2:latest"
      package_uris: "gs://vapit_data/trainer/tensorflow/trainer-0.1.tar.gz"
      python_module: "trainer.train"
      args: "--job-dir"
      args: "gs://vapit_data/tf_models/tensorflow_train_cchatterjee_061521_1449"
      args: "--train_feature_name"
      args: "gs://vapit_data/data_split/mortgage_structured_x_train.csv"
      args: "--train_label_name"
      args: "gs://vapit_data/data_split/mortgage_structured_y_train.csv"
      args: "--test_feature_name"
      args: "gs://vapit_data/data_split/mortgage_structured_x_test.csv"
      a

Check the training job status

In [39]:
# check the training job status
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)
name = client.custom_job_path(
    project=PROJECT,
    location=REGION,
    custom_job=job_id_trn,
)
response = client.get_custom_job(name=name)
print(response.state)


JobState.JOB_STATE_SUCCEEDED
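
The status check above returns a single snapshot, so in practice it is usually repeated until the job reaches a terminal state. A minimal polling sketch follows; `wait_for_job` is a hypothetical helper, and the state sequence is simulated so the example runs without any cloud calls.

```python
import time

# Terminal states considered in this sketch (names follow the JobState enum).
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, timeout_seconds=3600.0):
    """Call get_state() repeatedly until it returns a terminal state name."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Simulated state sequence standing in for repeated status requests.
_fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_job(lambda: next(_fake_states), poll_seconds=0)
print(final_state)
```

In the notebook, `get_state` could presumably wrap the existing request, for example `lambda: client.get_custom_job(name=name).state.name`.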


--------
### Deploy the Model

Vertex AI provides tools to upload your trained ML model to the cloud, so that you can send prediction requests to the model.

In order to deploy your trained model on Vertex AI, you must save your trained model using the tools provided by your machine learning framework. This involves serializing the information that represents your trained model into a file which you can deploy for prediction in the cloud.

Then you upload the saved model to a Cloud Storage bucket, and create a model resource on Vertex AI, specifying the Cloud Storage path to your saved model.

When you deploy your model, you can also provide custom code (beta) to customize how it handles prediction requests.



#### Import model artifacts to Vertex AI 

When you import a model, you associate it with a container for Vertex AI to run prediction requests. You can use pre-built containers provided by Vertex AI, or use your own custom containers that you build and push to Container Registry or Artifact Registry.

You can use a pre-built container if your model meets the following requirements:

- Trained in Python 3.7 or later
- Trained using TensorFlow, scikit-learn, or XGBoost
- Exported to meet framework-specific requirements for one of the pre-built prediction containers

The list of pre-built prediction container images is available at:

https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers

In [40]:
MODEL_NAME = "my_first_tensorflow_model"

response = aiplatform.Model.upload(
    display_name = MODEL_NAME,
    serving_container_image_uri = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-2:latest',
    artifact_uri = JOB_DIR_TRN,
)

model_id = response.name.split('/')[-1]


INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/901951554789/locations/us-central1/models/6670121319205437440/operations/4476090121321447424
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/901951554789/locations/us-central1/models/6670121319205437440
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/901951554789/locations/us-central1/models/6670121319205437440')


#### Create Endpoint

You need the endpoint ID to deploy the model.

In [41]:
MODEL_ENDPOINT_DISPLAY_NAME = "my_first_tensorflow_model_endpoint"

aiplatform.init(project=PROJECT, location=REGION)
endpoint = aiplatform.Endpoint.create(
    display_name=MODEL_ENDPOINT_DISPLAY_NAME, project=PROJECT, location=REGION,
)

endpoint_id = endpoint.resource_name.split('/')[-1]

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/901951554789/locations/us-central1/endpoints/7226359853650804736/operations/3336679415596711936
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/901951554789/locations/us-central1/endpoints/7226359853650804736
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/901951554789/locations/us-central1/endpoints/7226359853650804736')


#### Deploy Model to the endpoint

You must deploy a model to an endpoint before that model can be used to serve online predictions; deploying a model associates physical resources with the model so it can serve online predictions with low latency. An undeployed model can serve batch predictions, which do not have the same low latency requirements.

In [None]:
MODEL_NAME = "my_first_tensorflow_model"
DEPLOYED_MODEL_DISPLAY_NAME = "my_first_tensorflow_model_deployed"
aiplatform.init(project=PROJECT, location=REGION)

model = aiplatform.Model(model_name=model_id)

# Deploy the uploaded model to the endpoint.
# sync=True blocks until the deployment finishes.
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=DEPLOYED_MODEL_DISPLAY_NAME,
    machine_type = "n1-standard-4",
    sync=True
)

print(model.display_name)
print(model.resource_name)


INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/901951554789/locations/us-central1/endpoints/7226359853650804736
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/901951554789/locations/us-central1/endpoints/7226359853650804736/operations/71569685753102336


------
### Send inference requests to your model

Vertex AI provides the services you need to request predictions from your model in the cloud.

There are two ways to get predictions from trained models: online prediction (sometimes called HTTP prediction) and batch prediction. In both cases, you pass input data to a cloud-hosted machine-learning model and get inferences for each data instance.

Vertex AI online prediction is a service optimized to run your data through hosted models with as little latency as possible. You send small batches of data to the service and it returns your predictions in the response.
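
Sending small batches amounts to slicing the input rows into fixed-size chunks and issuing one request per chunk. A minimal sketch of that pattern is shown below; `iter_batches`, `predict_in_batches`, and the fake predict function are hypothetical stand-ins so the example runs without an endpoint.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(rows, predict_fn, batch_size=16):
    """Send rows to predict_fn one batch at a time and collect the results."""
    predictions = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


def fake_predict(batch):
    """Stand-in for an endpoint call: returns one value per input row."""
    return [sum(row) for row in batch]


rows = [[i, i + 1] for i in range(40)]
result = predict_in_batches(rows, fake_predict, batch_size=16)
print(len(result), [len(b) for b in iter_batches(rows, 16)])
```

The last batch is simply shorter when the row count is not a multiple of the batch size.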

#### Call Google API for online inference

In [172]:
from google.api_core import exceptions as gapi_exceptions

# Load test features
x_test = pd.read_csv(TEST_FEATURE_PATH)
#y_test = pd.read_csv(TEST_LABEL_PATH)

# Fill NaN values with zeros (prediction cannot handle NaN values for now)
x_test = x_test.fillna(0)

pprobas = []
batch_size = 16
n_samples = min(160, x_test.shape[0])
print("batch_size=", batch_size)
print("n_samples=", n_samples)

aiplatform.init(project=PROJECT, location=REGION)
# Create the endpoint handle once and reuse it for every request.
endpoint = aiplatform.Endpoint(endpoint_id)

for i in range(0, n_samples, batch_size):
    j = min(i+batch_size, n_samples)
    print("Processing samples", i, j)
    try:
        # The request itself can fail, so it belongs inside the try block.
        response = endpoint.predict(instances=x_test.iloc[i:j].values.tolist())
        for prediction_ in response.predictions:
            pprobas.append(prediction_)
    except gapi_exceptions.GoogleAPICallError as err:
        # Something went wrong; log the details and stop.
        logging.error('There was an error getting predictions: %s', err)
        break


batch_size= 16
n_samples= 160
Processing samples 0 16
Processing samples 16 32
Processing samples 32 48
Processing samples 48 64
Processing samples 64 80
Processing samples 80 96
Processing samples 96 112
Processing samples 112 128
Processing samples 128 144
Processing samples 144 160


In [173]:
pprobas

[[0.999993324, 6.66599499e-06, 2.10555697e-08, 1.56754349e-11],
 [0.959246, 0.0303262044, 0.00902626477, 0.00140149589],
 [0.998992503, 0.00096288725, 4.39716168e-05, 7.54880546e-07],
 [0.98639828, 0.0114482045, 0.0019893162, 0.00016414019],
 [0.982392251, 0.0144744227, 0.00285905367, 0.00027423],
 [0.987437427, 0.0106448252, 0.00177782716, 0.000140015341],
 [0.999819219, 0.000177532143, 3.26836289e-06, 1.92990157e-08],
 [0.600835562, 0.130947724, 0.163055345, 0.105161272],
 [0.990599334, 0.00814621802, 0.00117642956, 7.81044e-05],
 [0.960262358, 0.029676374, 0.00872572884, 0.00133555022],
 [0.960214734, 0.0297069345, 0.00873977784, 0.00133861112],
 [0.744032502, 0.117052749, 0.0955036134, 0.0434111357],
 [0.998733, 0.00120375201, 6.19822822e-05, 1.22523193e-06],
 [0.982206941, 0.014611885, 0.00290118647, 0.000279968837],
 [0.994075775, 0.00528915646, 0.000604568864, 3.0496949e-05],
 [0.971631289, 0.0221424922, 0.00552800531, 0.000698168762],
 [0.992645741, 0.00648015738, 0.00082669960
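
Each row above holds one score per class. Assuming the scores are class probabilities over the four classes counted during preprocessing, the predicted label is the index of the largest entry. A small sketch using the first three rows of the output (`predicted_class` is a hypothetical helper):

```python
def predicted_class(probabilities):
    """Return the index of the largest probability in a single prediction row."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# First three rows copied from the prediction output above.
sample_probas = [
    [0.999993324, 6.66599499e-06, 2.10555697e-08, 1.56754349e-11],
    [0.959246, 0.0303262044, 0.00902626477, 0.00140149589],
    [0.998992503, 0.00096288725, 4.39716168e-05, 7.54880546e-07],
]

labels = [predicted_class(row) for row in sample_probas]
print(labels)
```

With NumPy already imported in this notebook, `np.argmax(pprobas, axis=1)` gives the same result for the full list.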

#### Call the gcloud CLI for online inference

In [174]:
# Load test feature and labels
x_test = pd.read_csv(TEST_FEATURE_PATH)
#y_test = pd.read_csv(TEST_LABEL_PATH)

# Fill nan value with zeros (Prediction lacks the ability to handle nan values for now)
x_test = x_test.fillna(0)

# Create a temporary json file to contain data to be predicted
JSON_TEMP = 'tf_test_data.json' # temp json file name to hold the inference data
batch_size = 100                # data batch size
start = 0
end = min(start+batch_size, len(x_test))
body={'instances': x_test.iloc[start:end].values.tolist()}
# body = json.dumps(body).encode().decode()
with open(JSON_TEMP, 'w') as fp:
    fp.write(json.dumps(body))


In [175]:
!gcloud beta ai endpoints predict $endpoint_id \
  --region=$REGION \
  --json-request=$JSON_TEMP


Using endpoint [https://us-central1-prediction-aiplatform.googleapis.com/]
[[0.999993324, 6.66598862e-06, 2.10555697e-08, 1.56754054e-11], [0.959246, 0.0303262044, 0.00902626477, 0.00140149589], [0.998992503, 0.00096288725, 4.39716168e-05, 7.54880546e-07], [0.98639828, 0.0114482101, 0.00198931806, 0.000164140336], [0.982392251, 0.0144744301, 0.00285905506, 0.00027423], [0.987437427, 0.0106448252, 0.00177782716, 0.000140015341], [0.999819219, 0.000177532143, 3.26836289e-06, 1.92990157e-08], [0.600835562, 0.130947724, 0.163055405, 0.105161317], [0.990599334, 0.00814621802, 0.00117642956, 7.81044e-05], [0.960262358, 0.0296763889, 0.00872573722, 0.0013355515], [0.960214734, 0.0297069345, 0.00873977784, 0.00133861112], [0.744032502, 0.117052749, 0.0955036134, 0.0434111357], [0.998733, 0.00120375201, 6.19822822e-05, 1.22523193e-06], [0.982206941, 0.014611885, 0.00290118647, 0.000279968837], [0.994075775, 0.00528915646, 0.000604568864, 3.0496949e-05], [0.971631289, 0.0221424922, 0.00552800531

#### Call Google API for batch inference

In [119]:
# Write batch data to file in GCS

import shutil
import os

# Clean current directory
DATA_DIR = './batch_data'
shutil.rmtree(DATA_DIR, ignore_errors=True)
os.makedirs(DATA_DIR)

n_samples = min(1000,x_test.shape[0])
nFiles = 10
nRecsPerFile = min(1000,n_samples//nFiles)
print("n_samples =", n_samples)
print("nFiles =", nFiles)
print("nRecsPerFile =", nRecsPerFile)

# Create nFiles files with nRecsPerFile records each
for i in range(nFiles):
    with open(f'{DATA_DIR}/unkeyed_batch_{i}.json', "w") as file:
        for z in range(nRecsPerFile):
            print(f'{{"dense_input": {np.array(x_test)[i*nRecsPerFile+z].tolist()}}}', file=file)
            #print(f'{{"{model_layers[0]}": {np.array(x_test)[i*nRecsPerFile+z].tolist()}}}', file=file)
            #key = f'key_{i}_{z}'
            #print(f'{{"image": {x_test_images[z].tolist()}, "key": "{key}"}}', file=file)

# Write batch data to gcs file
!gsutil -m cp -r ./batch_data gs://$BUCKET_NAME/$FOLDER_NAME/
    
# Remove old batch prediction results
!gsutil -m rm -r gs://$BUCKET_NAME/$FOLDER_NAME/batch_predictions


n_samples = 1000
nFiles = 10
nRecsPerFile = 100
Copying file://./batch_data/unkeyed_batch_9.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_7.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_5.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_2.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_8.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_0.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_1.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_6.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_3.json [Content-Type=application/json]...
Copying file://./batch_data/unkeyed_batch_4.json [Content-Type=application/json]...
/ [10/10 files][870.2 KiB/870.2 KiB] 100% Done                                  
Operation completed over 10 obj
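
The batch input files are JSON Lines: one JSON object per line, keyed by the model's input name. Below is a minimal sketch of the same sharding logic that serializes with `json.dumps` instead of an f-string. `write_jsonl_shards` is a hypothetical helper, the rows are made up, and `dense_input` is the input name used in the cell above. Like the original loop, it drops any remainder rows that do not fill a whole file.

```python
import json
import os
import tempfile


def write_jsonl_shards(rows, out_dir, n_files, input_name="dense_input"):
    """Write rows into n_files JSON Lines files, one instance per line."""
    os.makedirs(out_dir, exist_ok=True)
    per_file = len(rows) // n_files  # remainder rows are dropped, as in the loop above
    paths = []
    for i in range(n_files):
        path = os.path.join(out_dir, f"unkeyed_batch_{i}.json")
        with open(path, "w") as fh:
            for row in rows[i * per_file:(i + 1) * per_file]:
                fh.write(json.dumps({input_name: row}) + "\n")
        paths.append(path)
    return paths


# Illustrative rows; real rows would come from x_test.values.tolist().
rows = [[float(i), float(i + 1)] for i in range(6)]
shard_paths = write_jsonl_shards(rows, tempfile.mkdtemp(), n_files=2)

with open(shard_paths[0]) as fh:
    first_shard = [json.loads(line) for line in fh]
print(len(shard_paths), first_shard[0])
```

Serializing through `json.dumps` keeps every line valid JSON even when a row contains values whose Python representation is not valid JSON.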

In [127]:
JOBNAME_BATCH = 'tensorflow_batch_{}_{}'.format(
    USER,
    datetime.now(timezone(TIMEZONE)).strftime("%m%d%y_%H%M")
    )
# We use the job names as folder names to store outputs.
JOB_DIR_BATCH = 'gs://{}/{}/{}'.format(
    BUCKET_NAME,
    FOLDER_NAME,
    JOBNAME_BATCH,
    )

INPUT_PATH='gs://' + BUCKET_NAME + '/' + FOLDER_NAME + '/batch_data/*'
OUTPUT_PATH='gs://' + BUCKET_NAME + '/' + FOLDER_NAME + '/batch_predictions'

print("JOB_NAME_BATCH = ", JOBNAME_BATCH)
print("JOB_DIR_BATCH = ", JOB_DIR_BATCH)


JOB_NAME_BATCH =  tensorflow_batch_cchatterjee_061521_1054
JOB_DIR_BATCH =  gs://vapit_data/tf_models/tensorflow_batch_cchatterjee_061521_1054


In [None]:
aiplatform.init(project=PROJECT, location=REGION)

my_model = aiplatform.Model(model_name=model_id)

# Make SDK batch_predict method call
batch_prediction_job = my_model.batch_predict(
    instances_format="jsonl",
    predictions_format="jsonl",
    job_display_name=JOBNAME_BATCH,
    gcs_source=INPUT_PATH,
    gcs_destination_prefix=OUTPUT_PATH,
    model_parameters=None,
    machine_type="n1-standard-4",
    starting_replica_count=1,
    max_replica_count=1,
    sync=True,
)
print(batch_prediction_job.display_name)
print(batch_prediction_job.resource_name)
print(batch_prediction_job.state)


INFO:google.cloud.aiplatform.jobs:Creating BatchPredictionJob
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob created. Resource name: projects/901951554789/locations/us-central1/batchPredictionJobs/2005161763389046784
INFO:google.cloud.aiplatform.jobs:To use this BatchPredictionJob in another session:
INFO:google.cloud.aiplatform.jobs:bpj = aiplatform.BatchPredictionJob('projects/901951554789/locations/us-central1/batchPredictionJobs/2005161763389046784')
INFO:google.cloud.aiplatform.jobs:View Batch Prediction Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/batch-predictions/2005161763389046784?project=901951554789
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob projects/901951554789/locations/us-central1/batchPredictionJobs/2005161763389046784 current state:
JobState.JOB_STATE_RUNNING
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob projects/901951554789/locations/us-central1/batchPredictionJobs/2005161763389046784 current state:
JobState.JOB_STAT

In [129]:
print("errors")
!gsutil cat $OUTPUT_PATH/prediction.errors_stats-00000-of-00001
print("batch prediction results")
!gsutil cat $OUTPUT_PATH/prediction.results-00000-of-00010


errors
CommandException: No URLs matched: gs://vapit_data/tf_models/batch_predictions/prediction.errors_stats-00000-of-00001
batch prediction results
CommandException: No URLs matched: gs://vapit_data/tf_models/batch_predictions/prediction.results-00000-of-00010
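
One possible explanation for the "No URLs matched" messages above, assuming the job succeeded, is that Vertex AI writes batch results into a job-specific subdirectory under the destination prefix rather than directly at the prefix. Under that assumption, the result files can be found by matching one directory level deeper. The sketch below filters a hypothetical object listing; `select_result_files` and the subdirectory name are made up for illustration.

```python
from fnmatch import fnmatch


def select_result_files(object_names, prefix):
    """Pick prediction result files located one directory below prefix."""
    pattern = prefix.rstrip("/") + "/*/prediction.results-*"
    return sorted(name for name in object_names if fnmatch(name, pattern))


OUTPUT_PREFIX = "gs://vapit_data/tf_models/batch_predictions"

# Hypothetical listing; the subdirectory name is a placeholder.
listing = [
    OUTPUT_PREFIX + "/prediction-example-subdir/prediction.results-00000-of-00002",
    OUTPUT_PREFIX + "/prediction-example-subdir/prediction.results-00001-of-00002",
    OUTPUT_PREFIX + "/prediction-example-subdir/prediction.errors_stats-00000-of-00001",
]

result_files = select_result_files(listing, OUTPUT_PREFIX)
print(result_files)
```

From the notebook, a wildcard such as `!gsutil cat $OUTPUT_PATH/*/prediction.results-*` would be the equivalent check, again assuming that directory layout.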
