<center><H1>End to End DataRobot AutoML workflow with Amazon S3</H1></center>

<table border="0" cellspacing="0" cellpadding="0">
<td><img src="https://www.datarobot.com/wp-content/uploads/2021/08/DataRobot-logo-color.svg" height=200px width=200px>
</td>
<td><font size=10> + </font> </td>
<td> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/bc/Amazon-S3-Logo.svg/1712px-Amazon-S3-Logo.svg.png" height=100px width=100px> </td>
</table>

Author: Biju Krishnan

[API reference documentation](https://docs.datarobot.com/en/docs/api/reference/index.html)


<font>
This example notebook outlines the following tasks: <p>
<ol>
<li> Read Parquet files from an Amazon S3 bucket into a pandas DataFrame using the AWS Wrangler Python library. </li>
<li> Upload the DataFrame as a dataset to DataRobot's AI Catalog. </li>
<li> Initiate a DataRobot AutoML project with the dataset. </li>
<li> Deploy the top-performing model to a DataRobot prediction server. </li>
<li> Make batch predictions with a test dataset. </li>
</ol>
<p>
The files stored in S3 used for training can be in any format supported by the AWS Wrangler Python library. For batch predictions, DataRobot supports Parquet and CSV.
</font>

## Setup

### Import libraries

In [None]:
import datarobot as dr 
import pandas as pd
from io import StringIO
import boto3
import awswrangler as wr  # This notebook uses AWS Wrangler because it makes it easy to read multiple files from an S3 bucket

### Bind variables

In [None]:
# Bind variables
# These variables can also be fetched from a secret store or config file

DATAROBOT_ENDPOINT = "https://app.eu.datarobot.com/api/v2"
# The URL varies with your hosting option; this example uses the DataRobot EU Managed AI Cloud

DATAROBOT_API_TOKEN = "<INSERT YOUR DataRobot API Token>"
# To find your API token, click the avatar icon and then </> Developer Tools

client = dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix='AIA-E2E-AWS-14',  # Optional, but helps DataRobot improve this workflow
)

dr.client._global_client = client

AWS_KEY = '<INSERT YOUR AWS ACCESS KEY>' # Enter your AWS Key ID
AWS_SECRET = '<INSERT YOUR AWS SECRETS>' # Enter your AWS Secret  
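
The comment in the cell above mentions that these values can come from a secret store or config file. As one possible variant, the sketch below reads them from environment variables instead of hardcoding them. The helper name `read_setting` is hypothetical, and the variable names in the usage comments are assumptions about how your environment is configured.

```python
import os


def read_setting(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset or empty.

    Raises KeyError when the variable is missing and no default is given,
    so a misconfigured notebook fails early instead of sending empty credentials.
    """
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise KeyError(f"Environment variable {name!r} is not set")


# Possible usage in this notebook (variable names are assumptions):
# DATAROBOT_API_TOKEN = read_setting("DATAROBOT_API_TOKEN")
# DATAROBOT_ENDPOINT = read_setting("DATAROBOT_ENDPOINT", "https://app.eu.datarobot.com/api/v2")
# AWS_KEY = read_setting("AWS_ACCESS_KEY_ID")
# AWS_SECRET = read_setting("AWS_SECRET_ACCESS_KEY")
```

Keeping secrets out of the notebook this way also avoids committing them by accident when the notebook is shared.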

### Connect to DataRobot

You can read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html).

In [None]:
dr.Client(
    token=DATAROBOT_API_TOKEN,
    endpoint=DATAROBOT_ENDPOINT,
)

In [None]:
# Create a boto3 session for connecting to AWS
# The next cell uses this session to read files from S3

my_session = boto3.Session(
    aws_access_key_id=AWS_KEY,
    aws_secret_access_key=AWS_SECRET,
    # aws_session_token = <Optional>
)

## Import data

<font>
<p>
For illustration purposes, the training dataset containing patient visits to a hospital is stored in an S3 bucket named <code>e2eaccelerator09122022</code> under the path <code>s3://e2eaccelerator09122022/training/input/</code>.
<pre><code><font color=grey size=1>
aws s3 ls s3://e2eaccelerator09122022/training/input/
2022-12-09 09:55:47          0
2022-12-09 09:56:15     267017 10k_diabetes.parquet
</font></code></pre>
<p>
The input folder contains only one file in this scenario; however, the code also works when the folder contains multiple files.
</font>
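
An S3 path such as the one above combines a bucket name and a key prefix. If you need the two parts separately (for example, to list the folder with another tool), the following minimal sketch splits the URL with the standard library. The helper name `split_s3_url` is hypothetical and not part of any library used in this notebook.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/prefix/' into (bucket, key_or_prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url!r}")
    # The path starts with '/', which is not part of the S3 key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_url("s3://e2eaccelerator09122022/training/input/")
print(bucket)  # e2eaccelerator09122022
print(prefix)  # training/input/
```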

In [None]:
# Read parquet files from an S3 bucket into a pandas dataframe using AWS Wrangler

s3_training_input = "s3://e2eaccelerator09122022/training/input/"
# dataset=True reads the path as a Parquet dataset, loading every file under the prefix
df = wr.s3.read_parquet(path=s3_training_input, dataset=True, boto3_session=my_session)
df.head()

### Create a dataset

Create a dataset in the AI Catalog to use it for project creation.

In [None]:
datarobot_dataset = dr.Dataset.create_from_in_memory_data(data_frame=df, fname="10K diabetes E2E accelerator")
datarobot_dataset.id

### Create a project and initiate Autopilot

In [None]:
# This cell will take several minutes to complete execution
# Creates an AutoML project named "E2E Demo Amazon S3" with "readmitted" as the target column
# Quick mode is the training mode used in this example; other modes are also available


EXISTING_PROJECT_ID = None # If you've already created a project, replace None with the ID here

if EXISTING_PROJECT_ID is None:
    # Create project and pass in data
    project = dr.Project.create_from_dataset(datarobot_dataset.id,
                                             project_name='E2E Demo Amazon S3')

    # Set the project target and start Autopilot in Quick mode
    # No metric is passed, so DataRobot uses its default metric for the target (expected to be LogLoss here)
    # worker_count=-1 requests the maximum number of available workers
    project.analyze_and_model(target='readmitted',
                              mode=dr.AUTOPILOT_MODE.QUICK,
                              worker_count=-1)
else:
    # Fetch the existing project
    project = dr.Project.get(EXISTING_PROJECT_ID)

project.wait_for_autopilot(check_interval=30)
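
The `check_interval` argument above controls how often the client checks whether Autopilot has finished. As a generic illustration of that polling pattern (not DataRobot's actual implementation), the sketch below repeatedly calls a status function until it reports completion or a timeout expires. The name `wait_until` and all of its parameters are hypothetical.

```python
import time


def wait_until(is_done, check_interval=30, timeout=3600,
               sleep=time.sleep, clock=time.monotonic):
    """Call `is_done()` every `check_interval` seconds until it returns True.

    Raises TimeoutError if `timeout` seconds pass before that happens.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while not is_done():
        if clock() >= deadline:
            raise TimeoutError(f"Not finished after {timeout} seconds")
        sleep(check_interval)
    return True


# Simulated status checks: unfinished twice, then finished
statuses = iter([False, False, True])
print(wait_until(lambda: next(statuses), check_interval=0, timeout=5))  # True
```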

Once the AutoML project is complete, select the top-performing model on the Leaderboard based on the chosen metric for deployment.

In [None]:
def sorted_by_metric(models, test_set, metric):
    # Keep only models that have a score for this metric on the given partition
    models_with_score = [model for model in models if
                         model.metrics[metric][test_set] is not None]

    # Ascending sort: the best model comes first only when lower scores are better (e.g. LogLoss)
    return sorted(models_with_score,
                  key=lambda model: model.metrics[metric][test_set])

models = project.get_models()

# The metric this project was optimized for
metric = project.metric

# Get the top-performing model by cross-validation score
model_top = sorted_by_metric(models, 'crossValidation', metric)[0]

print(f'The top-performing model is {model_top} using metric {metric}')
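
The helper above sorts in ascending order, which suits error metrics where lower is better. If a project is optimized for a metric where higher is better, the sort direction has to flip. The sketch below shows one way to handle both cases, using plain dictionaries as stand-ins for the models' `metrics` attribute. The function `rank_models`, the `HIGHER_IS_BETTER` set, and all scores are hypothetical and for illustration only.

```python
# Metrics where a larger value means a better model (illustrative, not exhaustive)
HIGHER_IS_BETTER = {"AUC", "Accuracy", "R Squared"}


def rank_models(models, partition, metric):
    """Return the models that have a score for `metric` on `partition`, best first.

    Each model is a dict shaped like
    {"name": ..., "metrics": {metric: {partition: score}}}.
    """
    scored = [
        m for m in models
        if m["metrics"].get(metric, {}).get(partition) is not None
    ]
    return sorted(
        scored,
        key=lambda m: m["metrics"][metric][partition],
        reverse=metric in HIGHER_IS_BETTER,
    )


# Made-up scores for illustration only
candidates = [
    {"name": "A", "metrics": {"LogLoss": {"crossValidation": 0.61},
                              "AUC": {"crossValidation": 0.74}}},
    {"name": "B", "metrics": {"LogLoss": {"crossValidation": 0.59},
                              "AUC": {"crossValidation": 0.72}}},
    {"name": "C", "metrics": {"LogLoss": {"crossValidation": None},
                              "AUC": {"crossValidation": None}}},
]

print(rank_models(candidates, "crossValidation", "LogLoss")[0]["name"])  # B
print(rank_models(candidates, "crossValidation", "AUC")[0]["name"])      # A
```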

### Deploy a model

Note that the steps in the following sections require licensed DataRobot MLOps features. Contact your DataRobot account representative if any of these features are missing.

In [None]:
# Get the prediction server
prediction_server = dr.PredictionServer.list()[0]

# Create a deployment
deployment = dr.Deployment.create_from_learning_model(
    model_top.id, label='E2E Amazon S3 Test', description='Model trained on 10k diabetes dataset',
    default_prediction_server_id=prediction_server.id)
deployment.id

### Make predictions

<font family=verdana>
DataRobot's batch predictions API can read from and write to Amazon S3 storage directly.
<p>
<i>Note: Parquet support for batch predictions is still in preview mode. Contact your DataRobot representative to enable the feature flags for trial.</i>
</font>

In [None]:
# To run a batch prediction job, the AWS credentials must be stored in the DataRobot credentials manager
# Creating a credential that is already stored raises an error,
# so first look for an existing credential with this name

DR_CREDENTIAL_NAME = "AWS S3 Credentials"  # Choose any name you like
credential_id = None  # Defined up front so the check below also works when no credentials are stored
for cred in dr.Credential.list():
    if cred.name == DR_CREDENTIAL_NAME:
        credential_id = cred.credential_id
        break

# Create the credential only if no stored credential has this name
if credential_id is None:
    credential = dr.Credential.create_s3(
        name=DR_CREDENTIAL_NAME,
        aws_access_key_id=AWS_KEY,
        aws_secret_access_key=AWS_SECRET,
        # aws_session_token=<Optional>
    )
    credential_id = credential.credential_id

print(credential_id)
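
The look-up-then-create logic in the cell above is a general pattern that can be factored into a small helper. The sketch below demonstrates it with stand-in objects rather than real DataRobot credentials. The helper name `find_or_create` and the sample IDs are hypothetical.

```python
from types import SimpleNamespace


def find_or_create(existing, name, create):
    """Return the first item in `existing` whose .name equals `name`.

    If there is no match (including when `existing` is empty), call `create()`.
    """
    for item in existing:
        if item.name == name:
            return item
    return create()


# Stand-ins for objects returned by a credential listing (illustrative only)
stored = [SimpleNamespace(name="Other", credential_id="id-1")]

cred = find_or_create(
    stored,
    "AWS S3 Credentials",
    create=lambda: SimpleNamespace(name="AWS S3 Credentials", credential_id="id-2"),
)
print(cred.credential_id)  # id-2
```

In the notebook, `existing` would be the result of `dr.Credential.list()` and `create` would wrap the `dr.Credential.create_s3` call shown above.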

### Batch predictions snippet

The snippet below provides sample code that demonstrates how to make batch predictions, reading input from and writing output to Amazon S3.

In [None]:
# Patch the client-side S3 settings check to allow extra keys
# This relies on a private attribute and may change between client versions
dr.BatchPredictionJob._s3_settings = dr.BatchPredictionJob._s3_settings.allow_extra("*")

# Use the patched batch job class to score
job = dr.BatchPredictionJob.score(
    deployment=deployment.id,
    intake_settings={
        'type': 's3',
        'credential_id': credential_id,
        'format': 'csv',  # Can also be 'parquet'
        'url': 's3://e2eaccelerator09122022/predictions/input/10k_diabetes_test.csv',  # Can be a folder path or a file, depending on the format
    },
    output_settings={
        'type': 's3',
        'credential_id': credential_id,
        'format': 'parquet',  # Can also be 'csv'
        'url': 's3://e2eaccelerator09122022/predictions/output/10k_diabetes_test.parquet',  # Must point to a file, not a folder path
    },
)
    },
)

job.wait_for_completion()
job.get_status()
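
The intake and output settings above share the same shape and a few constraints noted in the comments: the format is CSV or Parquet, and the output URL must point to a file. A small builder can check these before a job is submitted. This is only a sketch of such a check; the function `s3_settings` and its `require_file` parameter are hypothetical, and the validation rules are taken from the comments in this notebook rather than from the DataRobot API.

```python
SUPPORTED_FORMATS = {"csv", "parquet"}


def s3_settings(url, credential_id, file_format, require_file=False):
    """Build an S3 settings dict shaped like the ones passed to the scoring call above."""
    file_format = file_format.lower()
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {file_format!r}")
    if not url.startswith("s3://"):
        raise ValueError(f"Not an S3 URL: {url!r}")
    # An output location must name a file, so reject folder-style URLs
    if require_file and url.endswith("/"):
        raise ValueError("URL must point to a file, not a folder path")
    return {
        "type": "s3",
        "url": url,
        "credential_id": credential_id,
        "format": file_format,
    }


# 'example-credential-id' is a placeholder, not a real credential
intake = s3_settings(
    "s3://e2eaccelerator09122022/predictions/input/10k_diabetes_test.csv",
    "example-credential-id", "csv")
output = s3_settings(
    "s3://e2eaccelerator09122022/predictions/output/10k_diabetes_test.parquet",
    "example-credential-id", "parquet", require_file=True)
print(intake["format"], output["format"])  # csv parquet
```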

<font family=verdana>
<p>
The output of the batch predictions is available under the path <code>s3://e2eaccelerator09122022/predictions/output/</code>.
<pre><code><font color=grey size=1>
aws s3 ls s3://e2eaccelerator09122022/predictions/output/
2022-12-09 11:35:32          0
2022-12-09 14:09:28      21244 10k_diabetes_test.parquet
</font></code></pre>
</font>
