# Vertex AI SDK for Python: AutoML tabular training and prediction

Inspired by: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl-tabular-classification.ipynb

In [None]:
# Install the packages
! pip3 install --quiet --upgrade google-cloud-aiplatform \
                                 google-cloud-storage

In [None]:
!python.exe -m pip install --upgrade pip

# Set Google Cloud project information

# Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
from google.oauth2 import service_account
from google.cloud import storage
import uuid # for generating unique bucket names

# Path to your service account key file
service_account_key_path = '<enter the path of the json key file of the Service Account>'

# Initialize the client
storage_client = storage.Client.from_service_account_json(service_account_key_path)

# Declare variables
PROJECT_ID = "<enter your project ID>"  # Replace with your GCP project ID
unique_suffix = str(uuid.uuid4())[:8]  # Generate a unique suffix for the bucket name
bucket_name = f"{PROJECT_ID}-bucket-{unique_suffix}"  # Combine project ID and unique suffix
LOCATION = "asia-south1"  # Mumbai region


# Variable to hold the bucket URI
BUCKET_URI = None

# Function to create a bucket
def create_bucket():
    global BUCKET_URI
    try:
        # Create the bucket with a specific location
        bucket = storage_client.bucket(bucket_name)
        bucket.location = LOCATION
        bucket = storage_client.create_bucket(bucket, project=PROJECT_ID)
        print(f"Bucket {bucket.name} created successfully in location {bucket.location}.")
        
        # Assign the bucket URI
        BUCKET_URI = f"gs://{bucket.name}"
        print(f"BUCKET_URI is set to: {BUCKET_URI}")
    except Exception as e:
        print(f"Error: {e}")

# Call the function to create the bucket
create_bucket()

print(BUCKET_URI)
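
Bucket creation fails if the generated name violates the Cloud Storage naming rules, so it can help to check the name before calling the API. The following is a minimal sketch and not part of the original notebook: the helpers `make_bucket_name` and `is_valid_bucket_name` are hypothetical, and the check covers only the basic rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit).

```python
import re
import uuid

# Basic Cloud Storage naming rules only; this sketch does not cover every
# restriction (for example, reserved prefixes or the rules for dotted names).
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def make_bucket_name(project_id: str, suffix: str) -> str:
    """Combine a project ID and a suffix, lowercased, as the cell above does."""
    return f"{project_id}-bucket-{suffix}".lower()


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic length and character checks."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


# "my-project" is a placeholder project ID used only for illustration.
candidate = make_bucket_name("my-project", str(uuid.uuid4())[:8])
print(candidate, is_valid_bucket_name(candidate))
```

Running a check like this before `create_bucket()` surfaces naming problems locally instead of as an API error.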

# Copy dataset into your Cloud Storage bucket

In [None]:
IMPORT_FILE = "petfinder-tabular-classification.csv"
! gsutil cp gs://cloud-samples-data/ai-platform-unified/datasets/tabular/{IMPORT_FILE} {BUCKET_URI}/data/

gcs_source = f"{BUCKET_URI}/data/{IMPORT_FILE}"



In [None]:
# Function to list files in the bucket
def list_files_in_bucket(bucket_name, prefix=None):
    try:
        # list_blobs accepts the bucket name directly; prefix narrows the listing
        blobs = storage_client.list_blobs(bucket_name, prefix=prefix)
        print(f"Files in bucket '{bucket_name}':")
        for blob in blobs:
            print(f"- {blob.name}")
    except Exception as e:
        print(f"Error: {e}")

list_files_in_bucket(bucket_name, prefix="data/")

# Import Vertex AI SDK for Python

Import the Vertex AI SDK into your Python environment and initialize it with your project ID and location.

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Tutorial 

## Create a Managed tabular dataset from a CSV

This section creates a managed tabular dataset from the CSV file stored in your Cloud Storage bucket.

In [None]:
ds = dataset = aiplatform.TabularDataset.create(
    display_name="petfinder-tabular-dataset",
    gcs_source=gcs_source,
)

ds.resource_name

## Launch a training job to create a model

Once you've defined the training job, you can train a model. AutoML does not need a custom training script: you only specify the prediction type and the column transformations. The **run** method creates a training pipeline that trains and creates a model object. After the training pipeline completes, the run method returns the model object.

<span style="color:red">Note: Running this step will take more than an hour to train the model.</span>

In [None]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="train-petfinder-automl-1",
    optimization_prediction_type="classification",
    column_transformations=[
        {"categorical": {"column_name": "Type"}},
        {"numeric": {"column_name": "Age"}},
        {"categorical": {"column_name": "Breed1"}},
        {"categorical": {"column_name": "Color1"}},
        {"categorical": {"column_name": "Color2"}},
        {"categorical": {"column_name": "MaturitySize"}},
        {"categorical": {"column_name": "FurLength"}},
        {"categorical": {"column_name": "Vaccinated"}},
        {"categorical": {"column_name": "Sterilized"}},
        {"categorical": {"column_name": "Health"}},
        {"numeric": {"column_name": "Fee"}},
        {"numeric": {"column_name": "PhotoAmt"}},
    ],
)

# Training can take an hour or more to run
model = job.run(
    dataset=ds,
    target_column="Adopted",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    model_display_name="adopted-prediction-model",
    disable_early_stopping=False,
)
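
The `column_transformations` list above repeats the same dictionary shape for every column. As a sketch only, the list could be generated from a column-to-type mapping, and the split fractions could be sanity-checked before the job is submitted. The names `COLUMN_TYPES`, `build_column_transformations`, and `splits_sum_to_one` are hypothetical and not part of the SDK.

```python
import math

# Column name -> transformation type, mirroring the list passed to
# AutoMLTabularTrainingJob in the cell above.
COLUMN_TYPES = {
    "Type": "categorical",
    "Age": "numeric",
    "Breed1": "categorical",
    "Color1": "categorical",
    "Color2": "categorical",
    "MaturitySize": "categorical",
    "FurLength": "categorical",
    "Vaccinated": "categorical",
    "Sterilized": "categorical",
    "Health": "categorical",
    "Fee": "numeric",
    "PhotoAmt": "numeric",
}


def build_column_transformations(column_types):
    """Build the list of {type: {"column_name": name}} dictionaries."""
    return [{kind: {"column_name": name}} for name, kind in column_types.items()]


def splits_sum_to_one(training, validation, test):
    """Return True if the three fractions sum to 1 within float tolerance."""
    return math.isclose(training + validation + test, 1.0)


column_transformations = build_column_transformations(COLUMN_TYPES)
print(column_transformations[:2])
print(splits_sum_to_one(0.8, 0.1, 0.1))
```

Keeping the mapping in one dictionary makes it easier to spot a column that was left out or given the wrong type.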

## Deploy your model
Before you use your model to make predictions, you need to deploy it to an endpoint. You can do this by calling the deploy method on the model resource. This method does two things:

- Creates an endpoint resource to which the model resource is deployed.
- Deploys the model resource to the endpoint resource.

NOTE: Wait until the model FINISHES deployment before proceeding to prediction.

In [None]:
endpoint = model.deploy(
    machine_type="n1-standard-4",
)

## Predict on the endpoint
This sample instance is taken from an observation in which Adopted = Yes.

Note that the values are all strings. Because the original data was in CSV format, every value is treated as a string. The transformations you defined when creating your AutoMLTabularTrainingJob tell Vertex AI how to convert the inputs to their defined types.
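
When a record holds native Python types, its values need to be converted to strings before it is sent to the endpoint. A minimal sketch, assuming a hypothetical helper `to_instance` (not an SDK function) and a made-up record:

```python
def to_instance(record):
    """Return a copy of the record with every value converted to a string."""
    return {name: str(value) for name, value in record.items()}


# Illustrative record with native Python types; the values are made up.
raw_record = {"Type": "Cat", "Age": 3, "Fee": 100, "PhotoAmt": 2}
instance = to_instance(raw_record)
print(instance)
```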

In [None]:
prediction = endpoint.predict(
    [
        {
            "Type": "Cat",
            "Age": "3",
            "Breed1": "Tabby",
            "Gender": "Male",
            "Color1": "Black",
            "Color2": "White",
            "MaturitySize": "Small",
            "FurLength": "Short",
            "Vaccinated": "No",
            "Sterilized": "No",
            "Health": "Healthy",
            "Fee": "100",
            "PhotoAmt": "2",
        }
    ]
)

print(prediction)
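
The printed object is verbose. As a sketch, assuming each entry in `prediction.predictions` behaves like a dictionary with parallel `classes` and `scores` lists (the shape AutoML tabular classification models typically return; verify it against your own output), the most likely label can be picked as follows. The helper `top_class` and the sample values are hypothetical.

```python
def top_class(single_prediction):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(single_prediction["classes"], single_prediction["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Made-up values for illustration; real scores come from endpoint.predict().
sample = {"classes": ["Yes", "No"], "scores": [0.72, 0.28]}
label, score = top_class(sample)
print(f"Predicted {label} with score {score:.2f}")
```

With a live endpoint, the same helper would be called as `top_class(prediction.predictions[0])`, provided the response has the assumed shape.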

## Undeploy the model

To undeploy your model resource from the serving endpoint resource, use the endpoint's undeploy method with the following parameter:

deployed_model_id: The model deployment identifier returned by the prediction service when the model resource is deployed. You can retrieve the deployed_model_id using the prediction object's deployed_model_id property.

In [None]:
endpoint.undeploy(deployed_model_id=prediction.deployed_model_id)

Through the steps in this notebook, we achieved the following tasks for classification of tabular data:

- Create a Vertex AI model training job.
- Train an AutoML Tabular model.
- Deploy the model resource to a serving endpoint resource.
- Make a prediction by sending data.
- Undeploy the model resource.

## Cleaning up
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Training Job
- Model
- Endpoint
- Cloud Storage Bucket

#### Note: You must delete any model resources deployed to the endpoint resource before deleting the endpoint resource.

In [None]:
# Warning: Setting this to true will delete everything in your bucket
delete_bucket = False

# Delete the training job
job.delete()

# Delete the model
model.delete()

# Delete the endpoint
endpoint.delete()

if delete_bucket:
    ! gsutil -m rm -r $BUCKET_URI