# Train a scikit-learn model with Vertex AI SDK 2.0 and Bigframes

## Overview

This tutorial demonstrates how to train a scikit-learn model with Vertex AI local-to-remote training, using Vertex AI SDK 2.0 and BigQuery DataFrames (bigframes) as the data source.

Learn more about [bigframes](https://cloud.google.com/bigquery/docs/).

### Objective

In this tutorial, you learn to use `Vertex AI SDK 2.0` with bigframes as the input data source.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Training`
- `Vertex AI Remote Training`


The steps performed include:

- Initialize a dataframe from a BigQuery table and split the dataset.
- Perform feature transformations as Vertex AI remote training jobs.
- Train the model remotely and evaluate the model locally.

**Local-to-remote training**

``` python
import vertexai
from my_module import MyModelClass

vertexai.preview.init(remote=True, project="my-project", location="my-location", staging_bucket="gs://my-bucket")

# Wrap the model class with `vertexai.preview.remote`
MyModelClass = vertexai.preview.remote(MyModelClass)

# Instantiate the class
model = MyModelClass(...)

# Optionally set the remote config
model.fit.vertex.remote_config.display_name = "MyModelClass-remote-training"
model.fit.vertex.remote_config.staging_bucket = "gs://my-bucket"

# This `fit` call will be executed remotely
model.fit(...)
```
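
To build intuition for the wrapping pattern shown above, here is a toy, pure-Python sketch: a class decorator that routes `fit` through a dispatcher controlled by a global flag. It is not the Vertex AI SDK's implementation. `toy_init`, `toy_remote`, and `TinyModel` are hypothetical names, and the "remote" branch only records where the call would have run.

```python
import functools

_STATE = {"remote": False}  # stands in for a global "remote mode" flag


def toy_init(remote: bool) -> None:
    """Toggle between local and (pretend) remote execution."""
    _STATE["remote"] = remote


def toy_remote(cls):
    """Return a subclass whose fit() records where it would have run."""

    class Wrapped(cls):
        @functools.wraps(cls.fit)
        def fit(self, *args, **kwargs):
            # A real SDK would serialize the instance and submit a job here;
            # this toy only records the chosen mode and runs locally.
            self.ran_on = "remote" if _STATE["remote"] else "local"
            return super().fit(*args, **kwargs)

    Wrapped.__name__ = cls.__name__
    return Wrapped


class TinyModel:
    """Hypothetical model: 'training' just stores the mean of the inputs."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self


TinyModel = toy_remote(TinyModel)

toy_init(remote=True)
model = TinyModel().fit([1.0, 2.0, 3.0])
print(model.ran_on, model.mean_)
```

The point of the pattern is that the calling code (`model.fit(...)`) stays the same in both modes; only the flag decides where the work happens.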

### Dataset

This tutorial uses the <a href="https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html">IRIS dataset</a>. The task is to predict the iris species from sepal and petal measurements.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), 
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Setup
- Follow the steps in [confluence](https://confluence.e-loreal.com/display/BTDPPAIML/2.2.1.0+Setup)

## Installation

(Only needed if you are not using the pre-installed kernel.)

In [7]:
# Install the packages
! pip3 install bigframes

Note: in the recorded run of this cell, pip skipped the bigframes releases it listed because they require Python 3.9 or later, so use a kernel running Python >=3.9.

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [1]:
PROJECT_ID = ""  # Replace with your project ID before running the cells below
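
The shell cells that follow interpolate `PROJECT_ID`, and an empty value makes them fail. A small check like the sketch below could be run first. The helper name `check_project_id` is hypothetical, and the format rule (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen) reflects the standard project ID format; legacy domain-scoped IDs would not pass it.

```python
import re

# Standard project ID shape: a letter, then 4-28 letters/digits/hyphens,
# then a letter or digit (6-30 characters in total).
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def check_project_id(project_id: str) -> str:
    """Return project_id unchanged, or raise ValueError if it looks wrong."""
    if not project_id:
        raise ValueError("PROJECT_ID is empty; set it before running gcloud.")
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"{project_id!r} is not a valid project ID format.")
    return project_id


# "my-sample-project" is a made-up ID used only for illustration.
print(check_project_id("my-sample-project"))
```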

In [2]:
# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [3]:
REGION = "europe-west1" 

### Authenticate your Google Cloud account


**In the workstation terminal, run the following command:**<br>

<code>gcloud auth login</code>

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [4]:
BUCKET_URI = f"gs://{PROJECT_ID}-big-frames-bucket"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [5]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
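
If `PROJECT_ID` is empty, the shell drops the empty value and the `-p` flag will likely consume the bucket URI instead, leaving `gsutil mb` without a bucket argument. The sketch below builds the same command as an argument list and rejects empty values before anything is executed. `gsutil_mb_args` is a hypothetical helper, and the project and bucket names are made up.

```python
import shlex


def gsutil_mb_args(region: str, project_id: str, bucket_uri: str) -> list[str]:
    """Build the argument list for `gsutil mb`, rejecting empty values."""
    for name, value in (("region", region), ("project_id", project_id)):
        if not value:
            raise ValueError(f"{name} must not be empty")
    if not bucket_uri.startswith("gs://") or len(bucket_uri) <= len("gs://"):
        raise ValueError("bucket_uri must look like gs://<bucket_name>")
    return ["gsutil", "mb", "-l", region, "-p", project_id, bucket_uri]


args = gsutil_mb_args(
    "europe-west1",
    "my-sample-project",
    "gs://my-sample-project-big-frames-bucket",
)
# Print the command for review; this sketch does not run it.
print(shlex.join(args))
```

The resulting list could be passed to `subprocess.run(args, check=True)` once the values have been reviewed.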

### Import libraries and define constants

In [9]:
import vertexai
import vertex_utils

REMOTE_JOB_NAME = "sdk2-bigframes-sklearn"
REMOTE_JOB_BUCKET = f"{BUCKET_URI}/{REMOTE_JOB_NAME}"


## Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=BUCKET_URI,
)

In [None]:
vertex_helper = vertex_utils.VertexHelper(staging_bucket=BUCKET_URI, remote_job_name=REMOTE_JOB_NAME)

## Prepare the dataset

Now load the Iris dataset and split the data into train and test sets.

In [None]:
import bigframes.pandas as bf

# Load the public Iris table into a BigQuery DataFrame
df = bf.read_gbq("bigquery-public-data.ml_datasets.iris")

# Encode the species names as integer class IDs
species_categories = {
    "versicolor": 0,
    "virginica": 1,
    "setosa": 2,
}
df["species"] = df["species"].map(species_categories)

# Assign an index column name
index_col = "index"
df.index.name = index_col


pdf = df.to_pandas()

In [None]:
pdf

In [None]:
from bigframes.ml.model_selection import train_test_split as bf_train_test_split

feature_columns = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
label_columns = df[["species"]]
train_X, test_X, train_y, test_y = bf_train_test_split(
    feature_columns, label_columns, test_size=0.2
)

print("X_train size: ", train_X.size)
print("X_test size: ", test_X.size)
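
To make `test_size=0.2` concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split over row indices. It assumes a 150-row table (the size of the classic Iris dataset) and uses a hypothetical helper, `split_indices`. It illustrates the idea only and is not how bigframes performs the split, so the exact counts bigframes returns may differ.

```python
import random


def split_indices(n_rows: int, test_size: float, seed: int = 0):
    """Shuffle row indices and cut them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=150, test_size=0.2)
print(len(train_idx), len(test_idx))  # 120 30
```

If `.size` follows the pandas convention of counting cells rather than rows, a 120-row frame with four feature columns would report 480.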

## Feature transformation

Next, apply feature transformations to the data using the Vertex AI remote training service.

First, re-initialize Vertex AI to enable remote training.

In [None]:
# Switch to remote mode for training
vertexai.preview.init(remote=True)

### Execute remote job for fit_transform() on training data

Next, indicate that the `StandardScaler` class is to be executed remotely. Then set up the data transform and call the `fit_transform()` method, which is executed remotely.

### The vanilla way

In [None]:
from sklearn.preprocessing import StandardScaler

# Wrap classes to enable Vertex remote execution
StandardScaler = vertexai.preview.remote(StandardScaler)

# Instantiate transformer
transformer = StandardScaler()

# Set training config
transformer.fit_transform.vertex.remote_config.display_name = (
    f"{REMOTE_JOB_NAME}-fit-transformer-bigframes"
)
transformer.fit_transform.vertex.remote_config.staging_bucket = REMOTE_JOB_BUCKET
transformer.fit_transform.vertex.remote_config.machine_type = 'e2-standard-8'

# Execute transformer on Vertex (train_X is bigframes.dataframe.DataFrame, X_train is np.array)
X_train = transformer.fit_transform(train_X)

### Remote transform on test data

In [None]:
# Transform the test dataset before calculating the test score
transformer.transform.vertex.remote_config.display_name = (
    REMOTE_JOB_NAME + "-transformer"
)
transformer.transform.vertex.remote_config.staging_bucket = REMOTE_JOB_BUCKET

# Execute transformer on Vertex (test_X is bigframes.dataframe.DataFrame, X_test is np.array)
X_test = transformer.transform(test_X)
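
As a reminder of what the two calls compute, the sketch below standardizes a single column in pure Python: `fit` learns the mean and the population standard deviation from the training data, and `transform` reuses those statistics on the test data. `TinyScaler` is a hypothetical single-column stand-in written for illustration, and the numbers are made up.

```python
from statistics import fmean, pstdev


class TinyScaler:
    """Single-column stand-in for the standardization StandardScaler applies."""

    def fit(self, column):
        self.mean_ = fmean(column)
        # Population standard deviation (ddof=0); fall back to 1.0 for a
        # constant column to avoid dividing by zero.
        self.scale_ = pstdev(column) or 1.0
        return self

    def transform(self, column):
        return [(x - self.mean_) / self.scale_ for x in column]

    def fit_transform(self, column):
        return self.fit(column).transform(column)


train_col = [4.0, 5.0, 6.0, 7.0, 8.0]  # made-up sepal lengths
test_col = [6.0, 9.0]

scaler = TinyScaler()
scaled_train = scaler.fit_transform(train_col)
scaled_test = scaler.transform(test_col)  # reuses the training mean and scale
print(scaled_train)
print(scaled_test)
```

Fitting only on the training split keeps information about the test split out of the learned statistics, which is why the test data goes through `transform` rather than `fit_transform`.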

### Using vertex helper

In [None]:
from sklearn.preprocessing import StandardScaler

# Wrap the class
StandardScaler = vertex_helper.wrap_model(StandardScaler)

# Instantiate as usual
transformer = StandardScaler()

# Enable remote execution for all of the instance's methods
transformer = vertex_helper.enable_for_remote(transformer)

X_train = transformer.fit_transform(train_X)

In [None]:
X_test = transformer.transform(test_X)
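
The `vertex_utils` module is not shown in this tutorial, so its internals are unknown. One plausible shape for a helper of this kind is sketched below: it applies the same `remote_config` settings to every wrapped method it finds, which is the boilerplate the vanilla way repeats per method. `configure_remote_methods` and the fake objects are hypothetical, and the sketch uses stand-ins instead of real SDK objects so it runs anywhere.

```python
from types import SimpleNamespace

REMOTE_METHODS = ("fit", "fit_transform", "transform", "predict")


def configure_remote_methods(estimator, job_name, staging_bucket):
    """Set the same remote_config fields on every wrapped method found."""
    configured = []
    for name in REMOTE_METHODS:
        method = getattr(estimator, name, None)
        vertex = getattr(method, "vertex", None)
        config = getattr(vertex, "remote_config", None)
        if config is None:
            continue  # method is missing or not wrapped for remote execution
        config.display_name = f"{job_name}-{name.replace('_', '-')}"
        config.staging_bucket = staging_bucket
        configured.append(name)
    return configured


def _fake_method():
    """Stand-in for a wrapped method exposing .vertex.remote_config."""
    return SimpleNamespace(vertex=SimpleNamespace(remote_config=SimpleNamespace()))


fake_scaler = SimpleNamespace(fit_transform=_fake_method(), transform=_fake_method())
done = configure_remote_methods(
    fake_scaler, "sdk2-bigframes-sklearn", "gs://my-bucket/sdk2-bigframes-sklearn"
)
print(done)
print(fake_scaler.fit_transform.vertex.remote_config.display_name)
```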

## Remote training

First, train the scikit-learn model as a remote training job:

- Wrap `LogisticRegression` so that it can run as a remote training job.
- Call `fit` on the `LogisticRegression` instance locally, which launches the remote training job.

### Vanilla way

In [None]:
from sklearn.linear_model import LogisticRegression

# Wrap classes to enable Vertex remote execution
LogisticRegression = vertexai.preview.remote(LogisticRegression)

# Instantiate model, warm_start=True for uptraining
model = LogisticRegression(warm_start=True)

# Set training config
model.fit.vertex.remote_config.display_name = REMOTE_JOB_NAME + "-sklearn-model"
model.fit.vertex.remote_config.staging_bucket = REMOTE_JOB_BUCKET

# Train the model on Vertex; train_X and train_y are bigframes DataFrames
model.fit(train_X, train_y)

### Vertex helper way

In [None]:
from sklearn.linear_model import LogisticRegression
# wrap the class
LogisticRegression = vertex_helper.wrap_model(LogisticRegression)

# instantiate as usual
model = LogisticRegression(warm_start=True)

# enable the instance for remote execution
model = vertex_helper.enable_for_remote(model)

# Train the model on Vertex; train_X and train_y are bigframes DataFrames
model.fit(train_X, train_y)

## Remote prediction

Obtain predictions from the trained model.

In [None]:
# Switch to remote mode for prediction
vertexai.preview.init(remote=True)

# Predict on the test set remotely
predictions = model.predict(test_X)

print(f"Remote predictions: {predictions}")
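
The predictions are integer class IDs because the labels were encoded with `species_categories` earlier. A minimal sketch for turning them back into species names is shown below. `decode_predictions` is a hypothetical helper and the input list is made up for illustration.

```python
species_categories = {"versicolor": 0, "virginica": 1, "setosa": 2}

# Invert the encoding so integer class IDs map back to species names.
species_names = {code: name for name, code in species_categories.items()}


def decode_predictions(predicted_codes):
    """Translate predicted class IDs into species names."""
    return [species_names[int(code)] for code in predicted_codes]


print(decode_predictions([2, 0, 1, 1]))
```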

## Local evaluation

Score model results locally.

In [None]:
# Convert the bigframes DataFrames to pandas DataFrames for local evaluation
train_X_pd = train_X.to_pandas().reset_index(drop=True)
train_y_pd = train_y.to_pandas().reset_index(drop=True)

test_X_pd = test_X.to_pandas().reset_index(drop=True)
test_y_pd = test_y.to_pandas().reset_index(drop=True)

In [None]:
# Switch to local mode for testing
vertexai.preview.init(remote=False)

# Evaluate model's accuracy score
print(f"Train accuracy: {model.score(train_X_pd, train_y_pd)}")

print(f"Test accuracy: {model.score(test_X_pd, test_y_pd)}")
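
For a scikit-learn classifier, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below spells out that calculation in pure Python. The helper name `accuracy` and the label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


labels = [0, 1, 2, 2, 1]
predicted = [0, 1, 2, 1, 1]
print(accuracy(labels, predicted))  # 0.8
```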

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
import os

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI