# Hello ML engineers!

In this short tutorial, we will build an AutoML model that performs regression on the *House Prices* dataset.

In [None]:
from google.cloud import aiplatform
from google.cloud import bigquery
import json
import pandas as pd

## Initialize Vertex AI SDK

- `PROJECT_ID` can be found in the Google Cloud Console, by going to the menu -> Cloud Overview -> Dashboard.
- `LOCATION` is your choice, but since your BigQuery dataset is in Europe, you will probably want your models and versioned dataset to be located in Europe as well.

In [None]:
# Task 1: adapt the following variables
PROJECT_ID = ...
LOCATION = ...

Below, we initialize the Vertex AI SDK so that it works in the correct project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=LOCATION)

## Create a tabular dataset

Choose a BigQuery source for your dataset. The format should be `bq://{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}`.

In [None]:
# Task 2: adapt the following variables
BQ_SOURCE = ...
DATASET_DISPLAY_NAME = "my-small-dataset"

If you do not wish to use the dataset you created during the data part, feel free to use the following source:
`bq://data-night-2023-bigquery.ML_house_prices.training_set`
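
The source format described above can be assembled from its three parts. The sketch below is only an illustration: `build_bq_source` is a hypothetical helper, not part of the Vertex AI SDK, and the identifiers are those of the sample source.

```python
def build_bq_source(project_id: str, dataset_id: str, table_id: str) -> str:
    """Assemble a BigQuery source URI of the form bq://PROJECT.DATASET.TABLE."""
    parts = (project_id, dataset_id, table_id)
    # Every part is required; an empty one would produce a malformed URI.
    for part in parts:
        if not part:
            raise ValueError("project, dataset and table IDs must all be non-empty")
    return "bq://" + ".".join(parts)


example_source = build_bq_source(
    "data-night-2023-bigquery", "ML_house_prices", "training_set"
)
print(example_source)
```

If you use your own dataset, the same three parts are shown in the BigQuery console for your table.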

Then, you need to create the tabular dataset using the Vertex AI SDK:

In [None]:
# Task 3: create a tabular dataset
# clue: https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview
dataset = ...

And wait for the job to complete...

In [None]:
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

## Create an AutoML training job

Now, adapt the variables below to configure the AutoML training job:

In [None]:
# Task 4: adapt the following variables
TRAINING_JOB_DISPLAY_NAME = "my-small-training-job"
TARGET_COLUMN = ...
OPTIMIZATION_PREDICTION_TYPE = "regression"
OPTIMIZATION_OBJECTIVE = ...
TRAINING_FRACTION_SPLIT = ...
VALIDATION_FRACTION_SPLIT = (1.0 - TRAINING_FRACTION_SPLIT) / 2.0
TEST_FRACTION_SPLIT = VALIDATION_FRACTION_SPLIT
BUDGET_MILLI_NODE_HOURS = 1000 # 1,000 milli node hours = 1 node hour
MODEL_DISPLAY_NAME = "my-small-model"
DISABLE_EARLY_STOPPING = False
SYNC = False
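
Because the validation and test fractions are both derived from the training fraction, the three splits always add up to 1. The sketch below checks this for a hypothetical training fraction of 0.8. That value is an assumption chosen only for illustration, not the expected answer to Task 4.

```python
import math

example_training_fraction = 0.8  # hypothetical value, for illustration only
example_validation_fraction = (1.0 - example_training_fraction) / 2.0
example_test_fraction = example_validation_fraction

total = (
    example_training_fraction
    + example_validation_fraction
    + example_test_fraction
)
# Compare with a tolerance, since floating-point arithmetic is not exact.
assert math.isclose(total, 1.0)

budget_milli_node_hours = 1000
# 1,000 milli node hours correspond to 1 node hour.
budget_node_hours = budget_milli_node_hours / 1000
print(example_validation_fraction, example_test_fraction, budget_node_hours)
```

Whatever training fraction you pick, the remainder is shared equally between validation and test.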

And finally, create the AutoML training job:

In [None]:
# Task 5: create an AutoML training job
# clue: https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview
tabular_regression_job = ...
model = ...

In [None]:
model.wait()

## (Partial) congrats!

You have successfully launched the training of your model! Maybe without even looking at the documentation? ;-)
The training will continue for quite some time, so to speed things up, we will use a pre-trained model for the next part.

Be sure to **stop the training**, otherwise you won't be able to run the next steps!

Now, will you be able to generate explainable predictions? Let's see... MWAH AH AH AH AH!

## Generate explainable predictions

The model that we will use in this section is already deployed to an endpoint.

In [None]:
endpoint_id = "projects/data-night-2023-bigquery/locations/europe-west1/endpoints/6646099189161787392"

endpoint = aiplatform.Endpoint(endpoint_id)
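
The endpoint is referenced by its full resource name, which follows the pattern `projects/{project}/locations/{location}/endpoints/{endpoint}`. As a rough sketch, the name can be split into its parts with plain string handling. `parse_endpoint_name` is a hypothetical helper written for this illustration, not an SDK function.

```python
def parse_endpoint_name(resource_name: str) -> dict:
    """Split projects/{project}/locations/{location}/endpoints/{endpoint}."""
    segments = resource_name.split("/")
    expected_keys = ("projects", "locations", "endpoints")
    # A valid name alternates fixed keys and values: key, value, key, value, ...
    if len(segments) != 6 or tuple(segments[0::2]) != expected_keys:
        raise ValueError(f"unexpected endpoint resource name: {resource_name!r}")
    project, location, endpoint = segments[1::2]
    return {"project": project, "location": location, "endpoint": endpoint}


parsed = parse_endpoint_name(
    "projects/data-night-2023-bigquery/locations/europe-west1/endpoints/6646099189161787392"
)
print(parsed["location"])
```

This shows that the pre-trained model is served from `europe-west1`.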

Let's retrieve the validation set from BigQuery:

In [None]:
bigquery_client = bigquery.Client(project=PROJECT_ID)

df = bigquery_client.query("""
    SELECT * FROM
    `data-night-2023-bigquery.ML_house_prices.validation_set`
""").to_dataframe()

In [None]:
df.describe()

We will use only one row from this validation set.

In [None]:
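# Pick one random row from the validation set.
row = df.sample()
# Drop the price column so that it is not sent to the model.
row = row.drop("SalesPrice", axis=1)
# Cast every value to a string before building the instance.
row = row.astype(str)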

In [None]:
# Task 6: create an instance dict
# it complies with the following schema:
# [{"COLUMN_A": VALUE_A, "COLUMN_B": VALUE_B, ..., "COLUMN_Z": VALUE_Z}]
instance_dict = [{...}]
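
To make the schema concrete, here is a minimal sketch that uses a plain Python dict in place of the DataFrame row. The column names and values are hypothetical and the helper `to_instance` is an assumption made for this illustration. It shows the expected shape (a list with one dict of string values), not the solution for `row` itself.

```python
def to_instance(record: dict) -> dict:
    """Return a copy of the record with every value converted to a string."""
    return {column: str(value) for column, value in record.items()}


# Hypothetical record: these column names are placeholders, not the real schema.
example_record = {"COLUMN_A": 8450, "COLUMN_B": 2003, "COLUMN_C": "RL"}

# The endpoint receives a list of instances, here a list with a single dict.
example_instances = [to_instance(example_record)]
print(example_instances)
```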

Let's generate explanations for the row we have selected.

In [None]:
response = endpoint.explain(instances=instance_dict, parameters={})

for explanation in response.explanations:
    attributions = explanation.attributions
    for attribution in attributions:
        print("  attribution")
        print("   baseline_output_value:", attribution.baseline_output_value)
        print("   instance_output_value:", attribution.instance_output_value)
        print("   approximation_error:", attribution.approximation_error)

# `attribution` still refers to the last attribution printed by the loop above.
feature_attributions = pd.Series(dict(attribution.feature_attributions))

# Label each feature with the value it takes in the selected row.
feature_attributions.index = [
    f"{column} ({row.iloc[0].at[column]})" for column in feature_attributions.index
]

# Keep only attributions whose magnitude exceeds 1% of 181_085.181, sorted for plotting.
IMPORTANCE_THRESHOLD = 181_085.181 * 0.01
feature_attributions = feature_attributions[feature_attributions.abs() > IMPORTANCE_THRESHOLD].sort_values()

We can plot the feature attributions to check how each individual feature influences the prediction positively or negatively.

In [None]:
feature_attributions.plot.barh()
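
The filtering step applied before the plot can be illustrated in plain Python. This is a sketch with hypothetical feature names and attribution values. Only the threshold is taken from the notebook.

```python
example_attributions = {  # hypothetical feature names and values
    "feature_a": 2500.0,
    "feature_b": -3000.0,
    "feature_c": 120.5,
}
importance_threshold = 181_085.181 * 0.01  # same threshold as in the notebook

# Keep features whose absolute attribution exceeds the threshold,
# then sort ascending so that negative contributions come first.
important = sorted(
    (
        (name, value)
        for name, value in example_attributions.items()
        if abs(value) > importance_threshold
    ),
    key=lambda item: item[1],
)
print(important)
```

A negative attribution means the feature pushed the prediction below the baseline, a positive one means it pushed the prediction above it.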

## Conclusion

Congrats! You have now completed this tutorial. Thanks to Google Cloud, you were able to create an AutoML model that performs regression on the House Prices dataset. You then generated explainable predictions from a deployed model and saw how feature attributions show the influence of each feature on a prediction.