Author: Luke Shulman 

Version Date: 12/22/2022 
# Build a DataRobot ML model and deploy from Google Cloud Platform

<img src="https://storage.googleapis.com/public-artifacts-datarobot/e2e_logos/DR%20and%20GCP%20Better%20Together.svg" width=200 />

In this notebook, you will build an ML model using a combination of GCP services and DataRobot. It covers an end-to-end workflow that includes sourcing the data through exploratory data analysis, model development, and deployment.

DataRobot recommends running this notebook in Google Colaboratory or Vertex AI Workbench, which both provide hosted notebooks with automatic configuration of Google services. Everything else in this notebook should work in any Jupyter environment with properly configured authentication. 

### Import libraries

In [1]:
import os
import pandas as pd

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/google")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"
    


In [None]:
!pip install {USER_FLAG} --upgrade  google-cloud-resource-manager google-cloud-bigquery google-cloud-storage datarobot pandas altair google-cloud-secret-manager google-auth

### Configure a Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute and storage costs.

2. [Ensure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the BigQuery, Secret Manager, and Cloud Storage APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,storage_component,secretmanager.googleapis.com).

4. If you are running this notebook locally, then install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the correct project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands and interpolates Python variables into these commands when they are prefixed with `$` or wrapped in `{}`, as in `{USER_FLAG}` above.
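Step 5 refers to pointing the Cloud SDK at your project. As a minimal sketch, the command can also be assembled in plain Python instead of a `!` shell line. `gcloud config set project` is the standard Cloud SDK command; the helper name `gcloud_set_project_command` and the project ID shown are placeholders invented for this illustration.

```python
import shlex


def gcloud_set_project_command(project_id):
    """Build the argument list for `gcloud config set project <id>`."""
    if not project_id or not project_id.strip():
        raise ValueError("project_id must be a non-empty string")
    return ["gcloud", "config", "set", "project", project_id.strip()]


# Placeholder project ID; substitute your own PROJECT_ID.
command = gcloud_set_project_command("my-example-project")
print(shlex.join(command))
```

To execute the command rather than print it, the list can be passed to `subprocess.run(command, check=True)` on a machine where the Cloud SDK is installed.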

In [5]:
# Set project constants for Google
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB: 
  from google.colab import auth
  auth.authenticate_user()
  credentials, _ = google.auth.default()
  #@title Enter GCP/BigQuery Project ID
  PROJECT_ID = 'datarobot-vertex-pipelines' #@param{type:"string"}
elif IS_GOOGLE_CLOUD_NOTEBOOK: # Likely using vertex or dataproc
  import google.auth
  credentials, project = google.auth.default()
  PROJECT_ID = project
else: # Project running locally 
  from google import auth
  credentials, project = auth.default()
  PROJECT_ID = project

if IN_COLAB:
  #@title Enter GCP/BigQuery Project Number
  PROJECT_NUMBER = 'ENTER YOUR PROJECT NUMBER HERE' #@param{type:"string"}
else:
  PROJECT_NUMBER = 'ENTER YOUR PROJECT NUMBER HERE'   # The ID number for your project

## Import data

This example uses loan data from a public dataset. To support the demonstration, you first load the data into a BigQuery table, which then serves as the data source for DataRobot modeling.

In [6]:
from google.cloud import bigquery
import requests
from tempfile import TemporaryFile
import pandas as pd

# Construct a BigQuery client object
client = bigquery.Client(project=PROJECT_ID)

# Create the dataset if needed
dataset_name = "dr_sample_data" 

client.create_dataset(dataset_name, exists_ok=True)

full_table_name = client.project + "." + dataset_name + "." + "lending_club"

print(f'''Data will be written to {full_table_name}''')

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True
)


with TemporaryFile() as tmpfile: 
    r = requests.get("https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv")
    r.raise_for_status()  # Stop early if the download failed
    tmpfile.write(r.content)
    tmpfile.seek(0)
    load_job = client.load_table_from_file(
        tmpfile, full_table_name, job_config=job_config
    )  # Make an API request



load_job.result()

destination_table = client.get_table(full_table_name)
print("Loaded {} rows.".format(destination_table.num_rows))

Data will be written to datarobot-vertex-pipelines.dr_sample_data.lending_club
Loaded 10000 rows.


In [7]:
from google.cloud import secretmanager
import datarobot as dr

api_secret =  f"projects/{PROJECT_NUMBER}/secrets/DR_API_KEY/versions/1"
endpoint = f"projects/{PROJECT_NUMBER}/secrets/DR_ENDPOINT/versions/1"
secrets = secretmanager.SecretManagerServiceClient(credentials=credentials)

DR_API_KEY = secrets.access_secret_version(name=api_secret).payload.data.decode('UTF-8')
DR_ENDPOINT = secrets.access_secret_version(name=endpoint).payload.data.decode('UTF-8')


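# Named dr_client so the BigQuery `client` created earlier stays available for later queries
dr_client = dr.Client(
    token=DR_API_KEY, 
    endpoint=DR_ENDPOINT,
    user_agent_suffix='AIA-E2E-GCP-6' # Optional, but helps DataRobot improve this workflow
)

dr.client._global_client = dr_client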

<datarobot.rest.RESTClientObject at 0x7fb77649ffa0>
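The cell above assumes that two secrets, `DR_API_KEY` and `DR_ENDPOINT`, already exist in Secret Manager under your project and that version `1` of each holds the value. The sketch below only shows how those resource names are assembled. The helper name `secret_version_name` and the project number are hypothetical; `latest` is Secret Manager's alias for the newest version of a secret.

```python
def secret_version_name(project_number, secret_id, version="1"):
    """Return the Secret Manager resource name for one secret version."""
    for label, value in (("project_number", project_number), ("secret_id", secret_id)):
        if not str(value).strip():
            raise ValueError(f"{label} must not be empty")
    return f"projects/{project_number}/secrets/{secret_id}/versions/{version}"


# Placeholder project number, for illustration only.
print(secret_version_name("123456789012", "DR_API_KEY"))
# projects/123456789012/secrets/DR_API_KEY/versions/1
print(secret_version_name("123456789012", "DR_ENDPOINT", version="latest"))
# projects/123456789012/secrets/DR_ENDPOINT/versions/latest
```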

### Register data in the AI Catalog

To register the data with DataRobot, you will need to authorize DataRobot to access BigQuery data. As this requires user authorization, it must be enabled via the GUI. To authorize DataRobot to access data in BigQuery, follow these steps: 

1. In the AI Catalog, select **Add New Data Connection** and choose BigQuery.

<img src="https://storage.googleapis.com/public-artifacts-datarobot/e2e_logos/dr_new_data_connection.jpg" width=300 />

<span style="font-size:7;font-weight:100;"><i>Create a new Data connection in DataRobot</i></span>

<img src="https://storage.googleapis.com/public-artifacts-datarobot/e2e_logos/BigQueryEnjoy.jpg" width=300 />

<span style="font-size:7;font-weight:100;"><i>Select the BigQuery connection</i></span>

2. Name the connection "BigQuery," select the driver, and then enter your GCP project ID (saved in this notebook) in the text field as shown: 

<img src="https://storage.googleapis.com/public-artifacts-datarobot/e2e_logos/BigQuery.jpg" width=300 />

3. Once the connection is saved, select **Test Data Connection**. This prompts you to authorize the DataRobot connection to BigQuery using your GCP Account. 

More information on this process can be found in the [DataRobot BigQuery Documentation](https://app.datarobot.com/docs/data/connect-data/data-sources/dc-bigquery.html).

Once this process is complete, you can use the DataRobot API to access BigQuery datasets. 

To facilitate data access, DataRobot defines the following entities:
    
- *Data store:* The system that holds the data; in this case, BigQuery. You created this in the previous step. 
- *Data source:* The query or table with the data. In this case, `dr_sample_data.lending_club`. 
- *Dataset:* A registered dataset for ML projects.

The following snippet creates all three of these assets.

In [8]:
# Look up the data connection created in the previous step.
# Set DATA_STORE_NAME to the name you gave the connection (for example, "BigQuery").
from IPython.display import display, HTML


DATA_STORE_NAME = 'DataRobot BigQuery Vertex'
data_store = [ds for ds in dr.DataStore.list() if ds.canonical_name == DATA_STORE_NAME][0]
# Assumes a stored credential named 'bigquery-oauth'; adjust the name if yours differs
credential = [cred for cred in dr.Credential.list() if cred.name == 'bigquery-oauth'][0]

# Register the table as a data source
params = dr.DataSourceParameters(
    table=full_table_name,  # From creating the table above
    data_store_id=data_store.id
)

data_source = dr.DataSource.create(data_source_type='jdbc', canonical_name='Test BigQuery', params=params)

# Create a dataset in the AI Catalog from the data source
data_set = dr.Dataset.create_from_data_source(data_source_id=data_source.id, credential_id=credential.credential_id)

HTML(f'''<div style="text-align:center;padding:.75rem;"> 
    <a href="{data_set.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Dataset in DataRobot</a>
</div>''')
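The lookups in the cell above index the filtered list with `[0]`, which raises a bare `IndexError` when no data store or credential has the expected name. A slightly more defensive lookup might look like the sketch below. The helper `find_one` is hypothetical and not part of the DataRobot client, and the objects are stand-ins for what `dr.DataStore.list()` would return.

```python
from types import SimpleNamespace


def find_one(items, attribute, expected):
    """Return the single item whose `attribute` equals `expected`."""
    matches = [item for item in items if getattr(item, attribute, None) == expected]
    if not matches:
        available = sorted(str(getattr(item, attribute, None)) for item in items)
        raise LookupError(f"No item with {attribute}={expected!r}; found: {available}")
    if len(matches) > 1:
        raise LookupError(f"{len(matches)} items share {attribute}={expected!r}")
    return matches[0]


# Stand-in objects; in the notebook these would come from dr.DataStore.list().
stores = [
    SimpleNamespace(id="a1", canonical_name="DataRobot BigQuery Vertex"),
    SimpleNamespace(id="b2", canonical_name="Other connection"),
]
print(find_one(stores, "canonical_name", "DataRobot BigQuery Vertex").id)  # a1
```

The error message lists the names that were found, which makes a mismatch between the GUI name and the notebook constant easier to spot.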


With the dataset registered in the AI Catalog, you can quickly review key statistics for every feature. 

In [None]:
features_from_dr = data_set.get_all_features()

pd.DataFrame(
    [
        {
            "Feature Name": f.name,
            "Feature Type": f.feature_type,
            "Unique Count": f.unique_count,
            "NA Count": f.na_count,
            "Mean": f.mean,
            "Median": f.median,
        }
        for f in features_from_dr
    ]
)

| | Feature Name | Feature Type | Unique Count | NA Count | Mean | Median |
|---|---|---|---|---|---|---|
| 0 | addr_state | Categorical | 50 | 0.0 | | |
| 1 | annual_inc | Categorical | 1901 | 1.0 | | |
| 2 | delinq_2yrs | Categorical | 10 | 5.0 | | |
| 3 | desc | Text | 6761 | 3230.0 | | |
| 4 | dti | Numeric | 2585 | 0.0 | 13.34 | 13.41 |
| 5 | earliest_cr_line | Date | 463 | 5.0 | 1997-05-30 | 1998-07-01 |
| 6 | emp_length | Categorical | 11 | 259.0 | | |
| 7 | emp_title | Text | 8214 | 592.0 | | |
| 8 | funded_amnt | Numeric | 727 | 0.0 | 10765.97 | 9250 |
| 9 | grade | Categorical | 7 | 0.0 | | |


## Initiate Autopilot

With the dataset registered in the AI Catalog, you can start a project to predict `is_bad`, an indicator that the loan was not repaid.

In [None]:
project = dr.Project.create_from_dataset(
    dataset_id=data_set.id,
)


try:
    project.analyze_and_model(target="is_bad")
except dr.errors.AsyncTimeoutError:
    print("Don't worry if it times out, the process is async and will continue to run")


HTML(
    f"""
<div style="text-align:center;padding:.75rem;"> 
    <a href="{project.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Project in DataRobot</a>
</div>"""
)
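Because modeling continues on the server after the cell returns, later cells can run before any model exists. The following is a generic polling sketch, not DataRobot's implementation; `wait_until` and its parameters are hypothetical names chosen for this example. The DataRobot client also provides `project.wait_for_autopilot()`, which blocks until Autopilot finishes.

```python
import time


def wait_until(is_done, timeout=60.0, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call `is_done()` every `interval` seconds until it returns True.

    Returns True on success and False if `timeout` seconds elapse first.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in check that succeeds on the third call.
calls = {"count": 0}


def fake_check():
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until(fake_check, timeout=5.0, interval=0.01))  # True
```

In the notebook, the predicate could be a small function that checks whether `project.get_models()` returns a non-empty list, under the assumption that one finished model is enough to continue.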


## Evaluate the model 

As DataRobot runs Autopilot, you can access the models on the Leaderboard using the `get_models` method. By default, this method returns models sorted by their performance, so the top-performing model is easy to find. You can also call the `get_top_model` helper.

In [None]:
top_model = project.get_top_model()

display(
    HTML(
        f"""
<div style="text-align:center;padding:.75rem;"> 
    <a href="{top_model.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">{top_model.model_type}</a>
</div>"""
    )
)


pd.DataFrame(top_model.metrics)


| | AUC | Area Under PR Curve | FVE Binomial | Gini Norm | Kolmogorov-Smirnov | LogLoss | Max MCC | RMSE | Rate@Top10% | Rate@Top5% | Rate@TopTenth% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| validation | 0.69996 | 0.26553 | 0.0745 | 0.39992 | 0.30703 | 0.3576 | 0.22021 | 0.32611 | 0.2875 | 0.2875 | 1.0 |
| crossValidation | 0.6767 | 0.252758 | 0.058966 | 0.3534 | 0.272936 | 0.362702 | 0.199336 | 0.327324 | 0.28375 | 0.33 | 0.9 |
| holdout | | | | | | | | | | | |
| training | | | | | | | | | | | |
| backtestingScores | | | | | | | | | | | |
| backtesting | | | | | | | | | | | |
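Several partitions in the output above are empty, so it can help to pull out only the scores that exist for one partition. The sketch below assumes `top_model.metrics` is a dictionary keyed by metric name and then by partition, which is what the table layout suggests, and that missing scores are `None`. The helper name `scores_for_partition` is hypothetical; the two metrics are copied from the output above.

```python
def scores_for_partition(metrics, partition):
    """Return {metric: score} for one partition, skipping missing scores."""
    return {
        name: by_partition[partition]
        for name, by_partition in metrics.items()
        if by_partition.get(partition) is not None
    }


# Two metrics copied from the output above; holdout has no scores yet.
metrics = {
    "AUC": {"validation": 0.69996, "crossValidation": 0.6767, "holdout": None},
    "LogLoss": {"validation": 0.3576, "crossValidation": 0.362702, "holdout": None},
}

print(scores_for_partition(metrics, "crossValidation"))
# {'AUC': 0.6767, 'LogLoss': 0.362702}
print(scores_for_partition(metrics, "holdout"))
# {}
```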


### Build an ROC curve

Beyond the Leaderboard, you can access any analysis that DataRobot produces out of the box for every model. In the following cell, you reproduce the ROC curve by calling the top model's `get_roc_curve` method.

In [None]:
import altair as alt

roc_object = top_model.get_roc_curve(source="crossValidation")
roc = pd.DataFrame(roc_object.roc_points)


base_line = pd.DataFrame({"x": [0, 1], "y": [0, 1]})

curve = (
    alt.Chart(roc, title="ROC Curve For DataRobot Top Model")
    .mark_line()
    .encode(x="false_positive_rate:Q", y="true_positive_rate:Q")
)

ref_line = (
    alt.Chart(base_line)
    .mark_line(color="black", strokeDash=[8, 4])
    .encode(x="x:Q", y="y:Q")
)

curve + ref_line
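The area under the plotted curve is the AUC metric reported on the Leaderboard. As a rough sketch, it can be approximated from ROC points with the trapezoidal rule. The function name `auc_from_roc_points` is hypothetical, and the two point sets are textbook cases (a chance-level diagonal and a perfect classifier) rather than values from this model; the keys match the `false_positive_rate` and `true_positive_rate` fields used in the chart.

```python
def auc_from_roc_points(points):
    """Integrate the true positive rate over the false positive rate (trapezoidal rule)."""
    ordered = sorted(
        (p["false_positive_rate"], p["true_positive_rate"]) for p in points
    )
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Diagonal reference line: a classifier no better than chance.
chance = [
    {"false_positive_rate": 0.0, "true_positive_rate": 0.0},
    {"false_positive_rate": 1.0, "true_positive_rate": 1.0},
]
# Perfect classifier: reaches TPR 1.0 before any false positives.
perfect = [
    {"false_positive_rate": 0.0, "true_positive_rate": 0.0},
    {"false_positive_rate": 0.0, "true_positive_rate": 1.0},
    {"false_positive_rate": 1.0, "true_positive_rate": 1.0},
]
print(auc_from_roc_points(chance))   # 0.5
print(auc_from_roc_points(perfect))  # 1.0
```

Passing `roc_object.roc_points` to the same function should give a value close to the cross-validation AUC, with small differences possible because of how the points are sampled.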

### Feature Impact 

To demonstrate model explainability, you can trigger and get the feature impact values of any model with the `get_or_request_feature_impact` function.

In [None]:
#### Retrieve Feature Impact ####
feature_impacts = (
    top_model.get_or_request_feature_impact()
)  # Triggers the Feature Impact calculation if it has not run yet
FI_df = pd.DataFrame(feature_impacts)  # Convert to a DataFrame

FI_df = FI_df.sort_values(by="impactNormalized", ascending=False).head(
    10
)  # Keep the ten most impactful features

alt.Chart(
    FI_df, title="Feature Impact Chart for Top DataRobot Model"
).mark_bar().encode(x="impactNormalized:Q", y=alt.Y("featureName:N", sort="-x"))


## Deploy a model

After you select a top model, you can deploy it with a few API calls. 

In [None]:
prediction_server = dr.PredictionServer.list()[0] # Deploy to the first prediction server 

deployment = dr.Deployment.create_from_learning_model(
    model_id=top_model.id,
    description="Test Google End to End Deployment",
    prediction_threshold=0.5,
    label="Test Google End to End",
    default_prediction_server_id=prediction_server.id
)

deployment.update_drift_tracking_settings(
    target_drift_enabled=True, feature_drift_enabled=True
)


HTML(
    f"""
<div style="text-align:center;padding:.75rem;"> 
    <a href="{deployment.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Deployment in DataRobot</a>
</div>"""
)


### Run batch predictions

There are two ways of making batch predictions with the deployment. The first is to use the user OAuth JDBC connection you created in previous steps. DataRobot stores the results, and you can download them directly. 

In [None]:
from tempfile import TemporaryFile

intake_settings = {
    'type': 'jdbc',
    'query': f'''SELECT * from {full_table_name};''', 
    'data_store_id': data_store.id,
    'credential_id': credential.credential_id,
}



job = dr.BatchPredictionJob.score(
    deployment.id, 
    intake_settings=intake_settings

)


with TemporaryFile() as tmpfile:
    job_csv = job.get_result_when_complete()
    tmpfile.write(job_csv)
    tmpfile.seek(0)
    result = pd.read_csv(tmpfile)
    
result

You can also use a service account to write data back to GCP directly. A service account is preferred here because it allows these jobs to be scheduled and run automatically, server to server, without a user signing in. 
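Before registering a service account key with DataRobot, it can be worth checking that the JSON file is the kind of key you expect. The sketch below checks a few fields that Google service account key files contain (`type`, `project_id`, `private_key`, `client_email`). The helper name `check_service_account_key` is hypothetical, and the example key holds fake placeholder values.

```python
import json

REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json):
    """Parse a service account key and confirm the core fields are present."""
    key = json.loads(raw_json)
    missing = [field for field in REQUIRED_KEY_FIELDS if not key.get(field)]
    if missing:
        raise ValueError(f"Service account key is missing: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected credential type: {key['type']!r}")
    return key


# Placeholder key with fake values, for illustration only.
example = json.dumps({
    "type": "service_account",
    "project_id": "my-example-project",
    "private_key": "PLACEHOLDER-NOT-A-REAL-KEY",
    "client_email": "scorer@my-example-project.iam.gserviceaccount.com",
})
print(check_service_account_key(example)["client_email"])
```

The returned dictionary has the same shape as the `json_credential` value loaded in the next cell.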

In [None]:
import json 
from pathlib import Path
json_credential = json.loads(Path("PATH TO YOUR JSON SERVICE CREDENTIAL").read_text()) # You can obtain your service credentials in a number of ways

# Register the service account key with DataRobot (run once, then reuse the stored credential)
google_cloud_credential = dr.Credential.create_gcp(name='GCP Key Credential Test', gcp_key=json_credential, description="For GCP Batch Access")

job = dr.BatchPredictionJob.score(
    deployment.id,  # The deployment ID
    intake_settings={
        'type': 'bigquery',
        'dataset': 'dr_sample_data',
        'table': 'lending_club',
        'bucket': 'model-staging-dr-demo',  # A bucket is required
        'credential_id': google_cloud_credential.credential_id,
    },
    output_settings={
        'type': 'bigquery',
        'dataset': 'dr_sample_data',
        'table': 'lending_club_predictions',
        'bucket': 'model-staging-dr-demo',  # A bucket is required
        'credential_id': google_cloud_credential.credential_id,
    },
)

job.get_result_when_complete()

In [None]:
query =f"""
    SELECT count(*) as total_rows, avg(cast(is_bad_PREDICTION as numeric)) as avg_prediction from dr_sample_data.lending_club_predictions
"""
query_job = client.query(query)  # Make an API request

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index
    print(f"Result: {row['total_rows']} rows with avg of {row['avg_prediction']}")

In [9]:
# CLEAN UP
# Run this cell to remove the dataset and data source created during this session.
# Uncomment the remaining lines to also remove the deployment, project, and credential.

dr.Dataset.delete(data_set.id)
data_source.delete()
# deployment.delete()
# project.delete()
# google_cloud_credential.delete()