## Scoring Big Datasets - Batch Prediction API

### Scope

This notebook shows how to use DataRobot's Batch Prediction API from Python to get predictions from a deployed DataRobot model.

### Background

The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API and can be consumed using a REST-enabled client or the DataRobot Python Public API bindings.

The main features of the API include:

- Flexible options for intake and output.
- Support for streaming local files, so scoring can start while the upload is still in progress and results can be downloaded at the same time.
- Ability to score large datasets from, and to, Amazon S3 buckets.
- Connection to external data sources using JDBC with bidirectional streaming of scoring data and results.
- A mix of intake and output options; for example, scoring from a local file to an S3 target.
- Protection against prediction server overload with a concurrency control level option.
- Inclusion of Prediction Explanations (with an option to add thresholds).
- Support for passthrough columns to correlate scored data with source data.
- Addition of prediction warnings in the output.

(If you are a DataRobot customer, you can find more information in the in-app documentation for "Batch Prediction API.")

### Requirements

- Python version 3.7.3
- DataRobot API version 2.19.0

Small adjustments might be needed depending on the Python version and DataRobot API version you are using.

Full documentation of the Python package can be found here: https://datarobot-public-api-client.readthedocs-hosted.com

It is assumed you already have a DataRobot <code>Deployment</code> object.

#### Scoring a local CSV file

For the script below to work, you must provide the API token of your account (<code>token</code>), the <code>deployment_id</code>, and the API <code>endpoint</code> (for the Python client, https://app.datarobot.com/api/v2 for NA Managed AI Cloud users and https://app.eu.datarobot.com/api/v2 for EU Managed AI Cloud users).

Lastly, you need to provide the CSV file that will be used as input.

By default, DataRobot will expect a CSV file with this format:

- delimiter: , (comma)
- quotechar: "
- encoding: utf-8
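
If a file uses a different delimiter, quote character, or encoding, one option is to rewrite it into the default format before scoring. The following is a minimal sketch using only the Python standard library; the function name `rewrite_as_default_csv` and its parameters are hypothetical and not part of the DataRobot client.

```python
import csv


def rewrite_as_default_csv(src_path, dst_path, delimiter, quotechar, encoding):
    """Rewrite a CSV file as comma-delimited, double-quoted, UTF-8 text.

    Hypothetical helper (not a DataRobot API). Returns the number of rows
    written, header included.
    """
    rows = 0
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, newline="", encoding=encoding) as src:
        with open(dst_path, "w", newline="", encoding="utf-8") as dst:
            reader = csv.reader(src, delimiter=delimiter, quotechar=quotechar)
            writer = csv.writer(dst, delimiter=",", quotechar='"')
            for row in reader:
                writer.writerow(row)
                rows += 1
    return rows
```

The rewritten file can then be passed as the input file without any extra CSV configuration.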

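In [None]:
import datarobot as dr

# Configure the client once; later calls reuse this connection.
dr.Client(
    endpoint="YOUR_ENDPOINT",
    token="YOUR_TOKEN",
)

deployment_id = "..."

input_file = "to_predict.csv"
output_file = "predicted.csv"

# Score the local file and write the results to output_file.
# passthrough_columns_set="all" copies every input column into the output.
job = dr.BatchPredictionJob.score_to_file(
    deployment_id,
    input_file,
    output_file,
    passthrough_columns_set="all"
)

print("started scoring...", job)
# Block until the job has finished.
job.wait_for_completion()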
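
Once the job has finished, a quick sanity check on the output file can catch truncated or mismatched results. The sketch below uses only the standard library and rests on two assumptions that are not stated in the source: that passthrough columns keep their original names in the output, and that the output has one row per input row (this may not hold for every model type). `check_scored_file` is a hypothetical helper.

```python
import csv


def check_scored_file(input_path, output_path):
    """Compare an input CSV with its scored output.

    Returns (missing_columns, input_rows, output_rows), where
    missing_columns lists input columns absent from the output header.
    Hypothetical helper, not part of the DataRobot client.
    """
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        input_header = next(reader)
        input_rows = sum(1 for _ in reader)
    with open(output_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        output_header = next(reader)
        output_rows = sum(1 for _ in reader)
    missing = [col for col in input_header if col not in output_header]
    return missing, input_rows, output_rows


# Example usage after scoring (file names as in the cell above):
# missing, n_in, n_out = check_scored_file("to_predict.csv", "predicted.csv")
# if missing or n_in != n_out:
#     print("unexpected output:", missing, n_in, n_out)
```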

#### Requesting Prediction Explanations
To get Prediction Explanations alongside predictions, add the explanation parameters to the job configuration.

In [None]:
job = dr.BatchPredictionJob.score_to_file(
    deployment_id,
    input_file,
    output_file,
    max_explanations=10,
    threshold_high=0.5,
    threshold_low=0.15,
)
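
To illustrate how the two thresholds interact, the sketch below models which rows would receive explanations. It assumes that explanations are computed only for predictions at or above `threshold_high` or at or below `threshold_low`; whether the boundaries are inclusive is an assumption here, and `is_explained` is a hypothetical helper that runs locally, not a DataRobot call.

```python
def is_explained(prediction, threshold_low=None, threshold_high=None):
    """Return True if a row with this prediction would get explanations.

    Assumed semantics: explain rows outside the (low, high) band.
    Hypothetical helper for illustration only.
    """
    if threshold_low is None and threshold_high is None:
        return True  # no thresholds set: every row is explained
    below = threshold_low is not None and prediction <= threshold_low
    above = threshold_high is not None and prediction >= threshold_high
    return below or above


# With the thresholds used above, only the extreme predictions qualify.
predictions = [0.05, 0.30, 0.90]
explained = [p for p in predictions if is_explained(p, 0.15, 0.5)]
```

Under these assumptions, a prediction of 0.30 falls between the two thresholds and would be returned without explanations.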

#### Custom CSV Format
If your CSV file does not match the default CSV format, you can change the expected format by setting <code>csv_settings</code>:

In [None]:
job = dr.BatchPredictionJob.score_to_file(
    deployment_id,
    input_file,
    output_file,
    csv_settings={
        'delimiter': ';',
        'quotechar': '\'',
        'encoding': 'ms_kanji',
    },
)
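
When the format of a file is not known in advance, the standard library's `csv.Sniffer` can suggest a delimiter and quote character from a text sample. The following is a rough sketch: `guess_csv_settings` is a hypothetical helper, the sniffer is heuristic and can guess wrong on unusual data, and the encoding is not detected and must be supplied by the caller.

```python
import csv


def guess_csv_settings(sample_text, encoding="utf-8"):
    """Guess delimiter and quotechar from a text sample.

    Returns a dict shaped like the csv_settings argument shown above.
    Hypothetical helper; csv.Sniffer is heuristic.
    """
    dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t|")
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "encoding": encoding,  # not detected here; supplied by the caller
    }


sample = "'id';'name';'city'\n'1';'Ann';'Oslo'\n'2';'Bob';'Rome'\n"
settings = guess_csv_settings(sample)
```

It is worth inspecting the guessed settings before passing them on, since a wrong guess would lead to misparsed rows.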

#### End-to-end scoring of CSV files on S3 using the DataRobot Python client

To score data that sits in an S3 Bucket:

In [None]:
import datarobot as dr

dr.Client(
    endpoint="YOUR_ENDPOINT",
    token="YOUR_TOKEN",
)

deployment_id = "..."
credential_id = "..."

s3_csv_input_file = 's3://my-bucket/data/to_predict.csv'
s3_csv_output_file = 's3://my-bucket/data/predicted.csv'

job = dr.BatchPredictionJob.score_s3(
    deployment_id,
    source_url=s3_csv_input_file,
    destination_url=s3_csv_output_file,
    credential=credential_id
)

print("started scoring...", job)
job.wait_for_completion()
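
Because a malformed URL only surfaces once the job is submitted, it can help to validate the `s3://` URLs locally first. This is a minimal sketch using `urllib.parse`; `split_s3_url` is a hypothetical helper, and the check covers only the URL shape, not whether the bucket or object exists.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key).

    Raises ValueError if the URL is not an s3:// object URL.
    Hypothetical helper, not part of the DataRobot client.
    """
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid s3:// object URL: {url!r}")
    return parsed.netloc, key


bucket, key = split_s3_url("s3://my-bucket/data/to_predict.csv")
```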

The same functionality is available for `score_azure` and `score_gcp`. You can also specify the credential object itself, instead of a credential ID:

In [None]:
credentials = dr.Credential.get(credential_id)

job = dr.BatchPredictionJob.score_s3(
    deployment_id,
    source_url=s3_csv_input_file,
    destination_url=s3_csv_output_file,
    credential=credentials,
)

#### Prediction Explanations

You can include Prediction Explanations by adding the desired Prediction Explanation parameters to the job configuration:

In [None]:
job = dr.BatchPredictionJob.score_s3(
    deployment_id,
    source_url=s3_csv_input_file,
    destination_url=s3_csv_output_file,
    credential=credential_id,
    max_explanations=10,
    threshold_high=0.5,
    threshold_low=0.15,
)

#### End-to-end scoring from a JDBC PostgreSQL database using the DataRobot Python client
The following reads a scoring dataset from the table `public.scoring_data` and saves the scored data back to `public.scored_data` (assuming that table already exists).

In [None]:
import datarobot as dr

dr.Client(
    endpoint="YOUR_ENDPOINT",
    token="YOUR_TOKEN",
)

deployment_id = "..."
credential_id = "..."
datastore_id = "..."

intake_settings = {
    'type': 'jdbc',
    'table': 'scoring_data',
    'schema': 'public',
    'data_store_id': datastore_id,
    'credential_id': credential_id,
}

output_settings = {
    'type': 'jdbc',
    'table': 'scored_data',
    'schema': 'public',
    'data_store_id': datastore_id,
    'credential_id': credential_id,
    'statement_type': 'insert'
}

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings=intake_settings,
    output_settings=output_settings,
)

print("started scoring...", job)
job.wait_for_completion()
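
The intake and output dictionaries above differ only in the table name and the `statement_type` key, so they can be built with a small factory function. This is a sketch of one way to reduce the duplication; `jdbc_settings` is a hypothetical helper that returns plain dictionaries with the same keys used above.

```python
def jdbc_settings(table, schema, data_store_id, credential_id, statement_type=None):
    """Build a JDBC intake or output settings dict.

    Pass statement_type (for example 'insert') only for output settings.
    Hypothetical helper; the keys mirror the dictionaries shown above.
    """
    settings = {
        "type": "jdbc",
        "table": table,
        "schema": schema,
        "data_store_id": data_store_id,
        "credential_id": credential_id,
    }
    if statement_type is not None:
        settings["statement_type"] = statement_type
    return settings


# Placeholder IDs, as in the cell above.
intake = jdbc_settings("scoring_data", "public", "...", "...")
output = jdbc_settings("scored_data", "public", "...", "...", statement_type="insert")
```

The resulting `intake` and `output` dictionaries can be passed as `intake_settings` and `output_settings` in the same way as the hand-written ones.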

#### AI Catalog to CSV File Scoring
When using the AI Catalog for intake, you need the `dataset_id` of an already created dataset.

In [None]:
import datarobot as dr

dr.Client(
    endpoint="YOUR_ENDPOINT",
    token="YOUR_TOKEN",
)

deployment_id = "..."
dataset_id = "..."

# Fetch the AI Catalog dataset; the intake settings take the Dataset object itself.
dataset = dr.Dataset.get(dataset_id)

job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'dataset',
        'dataset': dataset,
    },
    output_settings={
        'type': 'localFile',
    },
)

job.wait_for_completion()