In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Custom model batch prediction with feature filtering 
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fprediction%2Fcustom_batch_prediction_feature_filter.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview


This tutorial demonstrates how to use the Vertex AI SDK for Python to train a custom tabular classification model and perform batch prediction with feature filtering. This means that you can run batch prediction on a list of selected features or exclude a list of features from prediction.
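
To make the idea of feature filtering concrete, the sketch below imitates it locally on a single record. It is only an illustration, not the Vertex AI implementation: `filter_instance` and its parameters are hypothetical names, and it assumes the documented behavior that an included-fields list sets the order of values for array instances, while an excluded-fields list keeps the record's own order.

```python
from typing import Any, Optional


def filter_instance(
    record: dict[str, Any],
    included_fields: Optional[list[str]] = None,
    excluded_fields: Optional[list[str]] = None,
    instance_type: str = "object",
) -> Any:
    """Local illustration of feature filtering on one input record."""
    if included_fields and excluded_fields:
        raise ValueError("Set included_fields or excluded_fields, not both.")

    if included_fields:
        # Keep only the listed fields, in the order they are listed.
        # Names that are missing from the record are skipped.
        names = [name for name in included_fields if name in record]
    elif excluded_fields:
        # Keep every field except the listed ones, in the record's own order.
        names = [name for name in record if name not in excluded_fields]
    else:
        names = list(record)

    if instance_type == "array":
        return [record[name] for name in names]
    return {name: record[name] for name in names}


row = {"island": 0, "culmen_length_mm": -1.35, "body_mass_g": -0.91, "sex": 0}

by_inclusion = filter_instance(
    row, included_fields=["culmen_length_mm", "body_mass_g"], instance_type="array"
)
by_exclusion = filter_instance(
    row, excluded_fields=["island", "sex"], instance_type="array"
)
print(by_inclusion)
print(by_exclusion)
```

Both calls select the same two features, which is why a given selection can usually be written either as a list to include or as a list to exclude.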

Learn more about [Vertex AI Batch Prediction](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions).

### Objective

In this notebook, you learn how to create a custom-trained model from a Python script in a Docker container using the Vertex AI SDK for Python, and then run a batch prediction job by including or excluding a list of features. 

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Cloud Storage
- Vertex AI managed Datasets
- Vertex AI Training
- Vertex AI Batch Prediction

The steps performed include:

- Create a Vertex AI custom `TrainingPipeline` for training a model.
- Train a TensorFlow model.
- Send a batch prediction job.

### Dataset

The dataset used for this tutorial is the penguins dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). The tutorial uses the fields `culmen_length_mm`, `culmen_depth_mm`, `flipper_length_mm`, and `body_mass_g` from the dataset to predict the penguin species (`species`).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [BigQuery pricing](https://cloud.google.com/bigquery/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Get Started

### Install Vertex AI SDK for Python and other required packages

In [None]:
# Install the packages
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-storage \
                                 google-cloud-bigquery \
                                 pyarrow \
                                 db-dtypes

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

Authenticate your environment on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
PROJECT_ID = "gurkomal-playground"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

When you submit a training job using the Cloud SDK, you upload a Python package
containing your training code to a Cloud Storage bucket. Vertex AI runs
the code from this package. In this tutorial, Vertex AI also saves the
trained model that results from your job in the same bucket. Using this model artifact, you can then
create a Vertex AI Model resource and use it for prediction.

In [3]:
BUCKET_URI = f"gs://bq-test-gurkomal-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [3]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://bq-test-gurkomal-gurkomal-playground-unique/...


### Import libraries

In [5]:
import json

import numpy as np
from google.cloud import aiplatform, bigquery

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.


In [6]:
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

### Prepare the test data

Load the penguin features from a local `output.csv` file, write the first 10 rows to a JSON Lines file, and copy that file to your Cloud Storage bucket.

In [65]:
import pandas as pd

# Read CSV file into a DataFrame
df = pd.read_csv('output.csv')

# Display the first few rows of the DataFrame to verify
df.head()

,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,-1.351846,0.627215,-1.210563,-0.909144,0
1,0,-0.766694,0.982683,-1.210563,0.550092,1
2,0,-0.565548,0.881121,-1.210563,-0.381335,1
3,0,0.458467,0.373309,-0.639777,-0.878096,0
4,0,-1.223844,-0.185283,-0.639777,-1.499048,0


In [66]:
data_file_name = "my_output.jsonl"
df.head(10).to_json(data_file_name, orient="records", lines=True)

In [67]:
!gsutil cp "{data_file_name}" "{BUCKET_URI}/{data_file_name}"

Copying file://my_output.jsonl [Content-Type=application/octet-stream]...
/ [1 files][  1.3 KiB/  1.3 KiB]                                                
Operation completed over 1 objects/1.3 KiB.                                      


In [69]:
# Check that the file was uploaded to Cloud Storage
!gsutil cat "{BUCKET_URI}/{data_file_name}"

{"island":0,"culmen_length_mm":-1.3518456,"culmen_depth_mm":0.6272151,"flipper_length_mm":-1.2105628,"body_mass_g":-0.909144,"sex":0}
{"island":0,"culmen_length_mm":-0.7666939,"culmen_depth_mm":0.98268336,"flipper_length_mm":-1.2105628,"body_mass_g":0.5500921,"sex":1}
{"island":0,"culmen_length_mm":-0.56554765,"culmen_depth_mm":0.88112074,"flipper_length_mm":-1.2105628,"body_mass_g":-0.3813352,"sex":1}
{"island":0,"culmen_length_mm":0.45846736,"culmen_depth_mm":0.3733094,"flipper_length_mm":-0.63977706,"body_mass_g":-0.8780964,"sex":0}
{"island":0,"culmen_length_mm":-1.2238436,"culmen_depth_mm":-0.18528321,"flipper_length_mm":-0.63977706,"body_mass_g":-1.499048,"sex":0}
{"island":0,"culmen_length_mm":-0.14497007,"culmen_depth_mm":0.6779964,"flipper_length_mm":-0.63977706,"body_mass_g":-0.13295458,"sex":1}
{"island":0,"culmen_length_mm":0.53161156,"culmen_depth_mm":-0.28684488,"flipper_length_mm":-0.63977706,"body_mass_g":-1.8716189,"sex":0}
{"island":0,"culmen_length_mm":1.1899068,"cul
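
Each line of the uploaded file is a standalone JSON object with one key per feature, which is the layout that `to_json(orient="records", lines=True)` writes. As a rough sketch of that format using only the standard library, the following writes and re-reads a small JSON Lines file. The helper names `write_jsonl` and `read_jsonl` and the temporary file path are hypothetical, and the two sample records reuse a few values from the output above.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


sample = [
    {"island": 0, "culmen_length_mm": -1.3518456, "sex": 0},
    {"island": 0, "culmen_length_mm": -0.7666939, "sex": 1},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(sample, path)
rows = read_jsonl(path)

# Every row should expose the same feature names.
same_keys = all(row.keys() == rows[0].keys() for row in rows)
print(len(rows), same_keys)
```

A check like `same_keys` can be a cheap way to catch rows with missing features before a batch job is submitted.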

### Send the BatchPredictionJob request using REST API

Now that you have test data, you can send a batch prediction request using the REST API. To do that, create a JSON request body with the following information:

- `BATCH_JOB_NAME`: Display name for the batch prediction job.
- `MODEL_URI`: The URI for the Model resource to use for making predictions.
- `INPUT_FORMAT`: The format of your input data: bigquery, jsonl, csv, tf-record, tf-record-gzip, or file-list.
- `INPUT_URI`: Cloud Storage URI of your input data. May contain wildcards.
- `OUTPUT_URI`: Cloud Storage URI of a directory where you want Vertex AI to save output.
- `MACHINE_TYPE`: The machine resources to be used for this batch prediction job.
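
Before building the request, it can help to sanity-check these values. The sketch below is one possible check, not part of the Vertex AI SDK: `check_batch_params` is a hypothetical helper, the allowed formats are copied from the list above, and the `gs://` test assumes both input and output live in Cloud Storage, as they do in this tutorial.

```python
ALLOWED_INPUT_FORMATS = {
    "bigquery", "jsonl", "csv", "tf-record", "tf-record-gzip", "file-list",
}


def check_batch_params(input_format: str, input_uri: str, output_uri: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if input_format not in ALLOWED_INPUT_FORMATS:
        problems.append(f"unsupported input format: {input_format!r}")
    # Assumption: this tutorial reads from and writes to Cloud Storage only.
    for label, uri in (("input", input_uri), ("output", output_uri)):
        if not uri.startswith("gs://"):
            problems.append(f"{label} URI is not a Cloud Storage URI: {uri!r}")
    return problems


ok = check_batch_params("jsonl", "gs://my-bucket/my_output.jsonl", "gs://my-bucket/output")
bad = check_batch_params("json", "my_output.jsonl", "gs://my-bucket/output")
print(ok)
print(bad)
```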

This example creates a JSON request that uses `includedFields` to list the features sent to the model. The same `instanceConfig` block also accepts `excludedFields`, which lists the features to drop instead, so a given selection can be expressed either way.

Learn more about [requesting a batch prediction](https://cloud.google.com/vertex-ai/docs/predictions/get-predictions#api_1).<br>
Learn more about [`instanceConfig`](https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/projects.locations.batchPredictionJobs#instanceconfig).
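
As a minimal sketch of the two filtering styles, the snippet below builds an `instanceConfig` fragment once with `includedFields` and once with `excludedFields`. The helper `build_instance_config` is hypothetical, and dropping `island` and `sex` is only an illustration; whether a model accepts the reduced input depends on how it was trained.

```python
import json


def build_instance_config(included=None, excluded=None, instance_type="array"):
    """Build the instanceConfig part of a batch prediction request body."""
    if included and excluded:
        # Assumption: the two lists are alternatives, so only one may be set.
        raise ValueError("Use includedFields or excludedFields, not both.")
    config = {"instanceType": instance_type}
    if included:
        config["includedFields"] = list(included)
    elif excluded:
        config["excludedFields"] = list(excluded)
    return config


columns = [
    "island", "culmen_length_mm", "culmen_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex",
]
to_drop = ["island", "sex"]

# Two ways to describe the same selection of features.
by_inclusion = build_instance_config(included=[c for c in columns if c not in to_drop])
by_exclusion = build_instance_config(excluded=to_drop)

print(json.dumps(by_inclusion, indent=2))
print(json.dumps(by_exclusion, indent=2))
```

Listing fields to exclude is usually shorter when only a few columns need to be dropped, while listing fields to include also pins down the order of values for array instances.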

In [71]:
BATCH_JOB_NAME = "keras-batch"
# MODEL_URI = model.resource_name
MODEL_URI = "projects/506365831141/locations/us-central1/models/2574608731018887168"
INPUT_FORMAT = "jsonl"
INSTANCE_TYPE = "array"
INPUT_URI = f"{BUCKET_URI}/{data_file_name}"
OUTPUT_FORMAT = "jsonl"
OUTPUT_URI = f"{BUCKET_URI}/output"
MACHINE_TYPE = "n1-standard-2"

# Create a list of columns to be included
INCLUDED_FIELDS = list(df.columns)

### Create JSON body requests

In [73]:
request_with_included_fields = {
    "displayName": f"{BATCH_JOB_NAME}-included_fields",
    "model": MODEL_URI,
    "inputConfig": {
        "instancesFormat": INPUT_FORMAT,
        # `uris` is a list of Cloud Storage URIs
        "gcsSource": {"uris": [INPUT_URI]},
    },
    "outputConfig": {
        "predictionsFormat": OUTPUT_FORMAT,
        "gcsDestination": {"outputUriPrefix": OUTPUT_URI},
    },
    "dedicatedResources": {
        "machineSpec": {
            "machineType": MACHINE_TYPE,
        }
    },
    "instanceConfig": {
        "includedFields": INCLUDED_FIELDS,
        "instanceType": INSTANCE_TYPE,
    },
}

with open("request_with_included_fields.json", "w") as outfile:
    json.dump(request_with_included_fields, outfile)

### Send the requests

To send the request, specify the API version you want to use. This tutorial uses `v1beta1` so that the request can include `instanceConfig`.
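
The endpoint used by the `curl` command in this section follows a fixed pattern built from the location, API version, and project ID. The sketch below reproduces that pattern and prepares an equivalent POST request with the standard library, without sending it. The helper names, the `my-project` project ID, and the placeholder token are hypothetical.

```python
import json
import urllib.request


def batch_prediction_url(project_id, location, api_version="v1beta1"):
    """Return the batchPredictionJobs collection URL for a project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/{api_version}"
        f"/projects/{project_id}/locations/{location}/batchPredictionJobs"
    )


def build_post_request(url, body, access_token):
    """Prepare (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


url = batch_prediction_url("my-project", "us-central1")
request = build_post_request(url, {"displayName": "demo"}, access_token="PLACEHOLDER_TOKEN")
print(request.full_url)
print(request.get_method())
# Sending is left out on purpose: urllib.request.urlopen(request) would call the API.
```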

#### Include fields

Send the request with `includedFields`:

In [74]:
! curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request_with_included_fields.json \
  https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{LOCATION}/batchPredictionJobs

{
  "name": "projects/506365831141/locations/us-central1/batchPredictionJobs/8688026319281192960",
  "displayName": "keras-batch-included_fields",
  "model": "projects/506365831141/locations/us-central1/models/2574608731018887168",
  "inputConfig": {
    "instancesFormat": "jsonl",
    "gcsSource": {
      "uris": [
        "gs://bq-test-gurkomal-gurkomal-playground-unique/my_output.jsonl"
      ]
    }
  },
  "outputConfig": {
    "predictionsFormat": "jsonl",
    "gcsDestination": {
      "outputUriPrefix": "gs://bq-test-gurkomal-gurkomal-playground-unique/output"
    }
  },
  "dedicatedResources": {
    "machineSpec": {
      "machineType": "n1-standard-2"
    }
  },
  "manualBatchTuningParameters": {},
  "state": "JOB_STATE_PENDING",
  "createTime": "2024-08-31T04:25:35.285706Z",
  "updateTime": "2024-08-31T04:25:35.285706Z",
  "instanceConfig": {
    "instanceType": "array",
    "includedFields": [
      "island",
      "culmen_length_mm",
      "culmen_depth_mm",
      "flipper_l

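The response above reports `JOB_STATE_PENDING` because the job runs asynchronously after it is created. One way to wait for it is to fetch the job repeatedly until its state is terminal. The sketch below shows that loop with an injected `fetch_job` callable so it runs without calling the API. The helper name, the polling defaults, and the exact set of terminal states are assumptions to check against the `JobState` reference.

```python
import time

# Assumed terminal values of the job "state" field.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
    "JOB_STATE_PARTIALLY_SUCCEEDED",
}


def wait_for_job(fetch_job, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal state, then return it."""
    for _ in range(max_polls):
        job = fetch_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("Job did not reach a terminal state in time.")


# Stand-in for a GET on the job's "name": replays canned responses.
canned = iter([
    {"state": "JOB_STATE_PENDING"},
    {"state": "JOB_STATE_RUNNING"},
    {"state": "JOB_STATE_SUCCEEDED"},
])
final = wait_for_job(lambda: next(canned), poll_seconds=0, sleep=lambda _: None)
print(final["state"])
```

In a real run, `fetch_job` would issue an authenticated GET on the job's `name` from the creation response and return the parsed JSON.
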
### Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this notebook.

In [None]:
# Warning: Setting this to True deletes the bucket and everything in it
delete_bucket = True

if delete_bucket:
    ! gsutil rm -r $BUCKET_URI