# End-to-end modeling workflow with Azure

Author: Brent Hinks (2023-01-27)

## Overview

This notebook illustrates an end-to-end data science workflow using DataRobot. The workflow ingests a dataset hosted in an Azure blob container, trains a series of models using DataRobot's AutoML capabilities, deploys a recommended model, and sets up a batch prediction job that writes predictions back to the original container.

In this notebook you'll cover the following steps:

- Acquiring a training dataset from an Azure storage container
- Building a new DataRobot project
- Deploying a recommended model
- Scoring via the Batch Prediction API
- Writing results back to a source Azure container

## Setup

Prior to execution, ensure that the following dependencies are available in your notebook environment:

- **datarobot**, provided via PyPI (Python library used to communicate with the DataRobot platform)
- **azure-storage-blob**, provided via PyPI and imported as `azure.storage.blob` (Python library used to access Azure storage services)
- **pandas**, provided via PyPI (common data science library)
- **Azure CLI**, used to authenticate to Azure. You can reference [installation instructions](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) for more information.
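
As a quick sanity check before running the notebook, the sketch below reports which of the Python dependencies listed above cannot be imported in the current environment. The helper name `find_missing` is hypothetical and uses only the standard library.

```python
import importlib.util


def find_missing(modules):
    """Return the module names from `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised when a parent package (for example "azure") is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Import names of the packages listed in the Setup section
REQUIRED = ["datarobot", "azure.storage.blob", "pandas"]
missing = find_missing(REQUIRED)
print("Missing packages:", missing or "none")
```

The Azure CLI is a command-line tool rather than a Python package, so it is not covered by this check.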

### Import libraries

The first cell of the notebook imports the necessary packages. The cells that follow set up the connection to the DataRobot platform and collect the Azure settings. You can also supply optional values to reuse an existing project and deployment. If they are omitted, a new Autopilot session is started and a new deployment is created using DataRobot's recommended model.

In [None]:
import datarobot as dr
import pandas as pd

from azure.storage.blob import BlobServiceClient
from io import StringIO

### Connect to DataRobot

In [None]:
# Set DataRobot connection info here
DATAROBOT_API_TOKEN = ""
DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2"

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix="AIA-E2E-AZURE-78",  # Optional, but helps DataRobot improve this workflow
)
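
If you would rather not paste the API token into the notebook, one option is to read the connection values from environment variables. The following is a minimal sketch under that assumption: the helper `read_setting` is hypothetical, and the variable names are a convention you would need to set up yourself.

```python
import os


def read_setting(name, default=None):
    """Return environment variable `name`, or `default` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise RuntimeError("Set the {} environment variable first".format(name))


# Falls back to the endpoint used in this notebook when the variable is unset
DATAROBOT_ENDPOINT = read_setting(
    "DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2"
)

# The token has no sensible default, so a missing value raises an error:
# DATAROBOT_API_TOKEN = read_setting("DATAROBOT_API_TOKEN")
```

Failing loudly on a missing token tends to be easier to debug than an authentication error raised later by the client.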

### Bind credentials

In [None]:
# Set Azure blob connection info here
AZURE_STORAGE_ACCOUNT = ""
AZURE_STORAGE_CONTAINER = ""

# Find this value by following the "Access keys" link from your storage account in the Azure console
AZURE_STORAGE_ACCESS_KEY = ""

# Provide dataset filenames and the modeling target feature
AZURE_INPUT_FILE = "input.csv"
AZURE_OUTPUT_FILE = "scored.csv"
AZURE_INPUT_TARGET = "target"

# Set name for Azure credentials in DataRobot
DR_CREDENTIAL_NAME = "Azure_{}".format(AZURE_STORAGE_ACCOUNT)

# Optional: supply IDs to reuse an existing project and deployment
project_id = None
deployment_id = None

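Before running the next cell, which creates the storage service client, run `az login` from your terminal to establish an authenticated session to Azure. The client picks up that session through `DefaultAzureCredential` from the **azure-identity** package (also available via PyPI). The signed-in identity needs read access to the container.

In [None]:
from azure.identity import DefaultAzureCredential

account_url = "https://{}.blob.core.windows.net".format(AZURE_STORAGE_ACCOUNT)
# Without a credential the client connects anonymously, which only works for public containers
blob_service_client = BlobServiceClient(account_url, credential=DefaultAzureCredential())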

### Import data

Load the dataset stored in your Azure container into a pandas dataframe.

In [None]:
container_client = blob_service_client.get_container_client(container=AZURE_STORAGE_CONTAINER)
downloaded_blob = container_client.download_blob(AZURE_INPUT_FILE)

# Read the whole blob and decode it as UTF-8 text before parsing the CSV
df = pd.read_csv(StringIO(downloaded_blob.readall().decode("utf-8")))
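
Before uploading the data, it can help to confirm that the target feature is actually present in the loaded file. The sketch below uses a hypothetical helper, `missing_columns`, and a stand-in header list; in the notebook you would pass `df.columns` and `[AZURE_INPUT_TARGET]` instead.

```python
def missing_columns(columns, required):
    """Return the names in `required` that are absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Stand-in header for illustration only
example_columns = ["id", "feature_a", "target"]

print(missing_columns(example_columns, ["target"]))  # []
print(missing_columns(example_columns, ["label"]))   # ['label']
```

An empty list means every required column was found.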

Ensure that the proper Azure credentials are stored in DataRobot. This credential can be reused later to automate data reads and writes in scoring jobs. The next cell checks for an existing credential matching the name you provided above and creates a new one if none is found.

In [None]:
# Look up an existing credential by name
credential = None
for cred in dr.Credential.list():
    if cred.name == DR_CREDENTIAL_NAME:
        credential = cred

# Create the credential if it does not exist yet
if credential is None:
    credential = dr.Credential.create_azure(
        name=DR_CREDENTIAL_NAME,
        azure_connection_string="DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
        ),
    )

credential
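
The lookup-by-name pattern used in the cell above can also be written more compactly with `next()`. This is a sketch with stand-in objects rather than real DataRobot credentials, and `find_by_name` is a hypothetical helper. Note one difference: it returns the first match, whereas the loop above keeps the last one.

```python
from types import SimpleNamespace


def find_by_name(items, name):
    """Return the first item whose `.name` equals `name`, or None if absent."""
    return next((item for item in items if item.name == name), None)


# Stand-in objects; in the notebook the list would come from dr.Credential.list()
stored = [SimpleNamespace(name="Azure_a"), SimpleNamespace(name="Azure_b")]

match = find_by_name(stored, "Azure_b")
absent = find_by_name(stored, "Azure_c")
print(match is stored[1], absent is None)  # True True
```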

## Modeling

### Create a project

Create a new project in DataRobot and upload the data stored in your dataframe. After that you will set the target and start the AutoML process.

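If a `project_id` was supplied above, the existing project is loaded instead and Autopilot is not run again.

In [None]:
# Create a project without setting the target, or load the existing one
if project_id is None:
    project = dr.Project.create(project_name="New Test Project (Azure)", sourcedata=df)
else:
    project = dr.Project.get(project_id)
print(project.id)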

### Initiate Autopilot

In [None]:
if project_id is None:
    mode = dr.enums.AUTOPILOT_MODE.QUICK

    project.analyze_and_model(
        target=AZURE_INPUT_TARGET,
        mode=mode,
        worker_count=-1,  # -1 requests the maximum number of modeling workers available to your account
        max_wait=600,
    )
    # When you get control back, EDA is finished and model jobs are in flight

In [None]:
if project_id is None:
    # This is helpful if you want to keep execution serial:
    project.wait_for_autopilot()

    # Otherwise you can periodically ask the project for its current Autopilot status:
    # project.stage
    # project.get_model_jobs()
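
If you prefer to poll rather than block, a generic loop with a timeout is one way to do it. The sketch below is not part of the DataRobot client: `poll_until` is a hypothetical helper, and the completion check shown in the trailing comment is an assumption you should verify against your client version.

```python
import time


def poll_until(is_done, interval=30.0, timeout=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `is_done()` every `interval` seconds until it returns True.

    Returns True on success, or False once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demonstration with a fake check that succeeds on the third call
attempts = {"count": 0}


def fake_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(poll_until(fake_check, interval=0.01, timeout=5))  # True

# In the notebook, the check might look like this (assumed status key):
# poll_until(lambda: project.get_status()["autopilot_done"])
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.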

## Select and deploy a model

Review DataRobot's model recommendations and select one for deployment. If `deployment_id` was supplied above, skip this step.

In [None]:
if deployment_id is None:
    # List every recommendation, then pick the model recommended for deployment
    print(dr.ModelRecommendation.get_all(project.id))
    rec = dr.ModelRecommendation.get(
        project_id=project.id,
        recommendation_type=dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT,
    )
    selection = rec.get_model()

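Once you are satisfied with the model, you can deploy it programmatically. The next cell creates the deployment, sets an association ID column, and enables target and feature drift tracking.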

In [None]:
if deployment_id is None:
    # The prediction server lookup is only needed on the DataRobot multi-tenant SaaS environment
    prediction_server = dr.PredictionServer.list()[0]
    deployment = dr.Deployment.create_from_learning_model(
        model_id=selection.id,
        label="New Test Deployment (Azure)",
        description="Some extra data that I can use to search later.",
        default_prediction_server_id=prediction_server.id,  # Only needed on multi-tenant SaaS
    )
    deployment.update_association_id_settings(
        column_names=["id"],
        required_in_prediction_requests=False,
    )
    deployment.update_drift_tracking_settings(
        target_drift_enabled=True,
        feature_drift_enabled=True,
    )
else:
    deployment = dr.Deployment.get(deployment_id)

print(deployment.id)

## Make batch predictions

Create a batch prediction job that will read in your training dataset, produce scores with optional explanations, and write the results back to the original container. If any errors occur along the way, get details from `job.get_status()` to assist in troubleshooting.

In [None]:
job = dr.BatchPredictionJob.score(
    deployment=deployment.id,
    intake_settings={
        "type": "azure",
        "url": "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_INPUT_FILE
        ),
        "credential_id": credential.credential_id,
    },
    output_settings={
        "type": "azure",
        "url": "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_OUTPUT_FILE
        ),
        "credential_id": credential.credential_id,
    },
    # Uncomment the next line to include prediction explanations.
    # max_explanations=3,
    passthrough_columns_set='all'
)
job.wait_for_completion()
job.get_status()
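
The intake and output URLs above follow the same pattern, so a small helper can keep them consistent. This is a sketch that mirrors the notebook's format string. The helper name `blob_url` and the sample account and container names are placeholders.

```python
def blob_url(account, container, blob_name):
    """Build the HTTPS URL of a blob from its account, container, and name."""
    return "https://{}.blob.core.windows.net/{}/{}".format(
        account, container, blob_name.lstrip("/")
    )


# Placeholder values for illustration only
print(blob_url("myaccount", "mycontainer", "scored.csv"))
# https://myaccount.blob.core.windows.net/mycontainer/scored.csv
```

In the notebook, the same helper could be called with `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_CONTAINER`, and either `AZURE_INPUT_FILE` or `AZURE_OUTPUT_FILE`.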