## Batch Prediction API

### Scope

This notebook explains how to use DataRobot's Batch Prediction API to get predictions from a deployed DataRobot model.

### Background

The Batch Prediction API provides flexible options for intake and output when scoring large datasets using the prediction servers you have already deployed. The API is exposed through the DataRobot Public API and can be consumed using a REST-enabled client or the DataRobot Python Public API bindings.

The main features of the API include:

- Flexible options for intake and output.
- Support for streaming local files, with the ability to start scoring while the upload is still in progress and to download results at the same time.
- Ability to score large datasets from, and to, Amazon S3 buckets.
- Connection to external data sources using JDBC with bidirectional streaming of scoring data and results.
- A mix of intake and output options; for example, scoring from a local file to an S3 target.
- Protection against prediction server overload with a concurrency control level option.
- Inclusion of Prediction Explanations (with an option to add thresholds).
- Support for passthrough columns to correlate scored data with source data.
- Addition of prediction warnings in the output.

### Requirements

- Python version 3.7.3
- DataRobot API version 2.26.0

Small adjustments might be needed depending on the Python version and DataRobot API version you are using.

Full documentation of the Batch Prediction API can be found here: https://docs.datarobot.com/en/docs/predictions/batch/batch-prediction-api/index.html

It is assumed you already have a DataRobot `Deployment` object.

### Step 1: Connecting to DataRobot

To initiate scoring jobs through the Batch Prediction API, you need two things:

- A connection to DataRobot, established with `datarobot.Client`.
- Your `DEPLOYMENT_ID` string. The easiest way to find it is to open the deployment in the user interface and copy the ID from the URL. In the example below, everything after `deployments/` is the deployment ID: `https://app.eu.datarobot.com/deployments/232315iijdfsafw`

In [1]:
import datarobot as dr

dr.Client(endpoint='YOUR_ENDPOINT/api/v2', token='YOUR_TOKEN')
deployment_id = "YOUR_DEPLOYMENT_ID"
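If you prefer not to copy the ID by hand, it can be pulled out of a pasted URL. The following is a minimal sketch, not part of the DataRobot client: `deployment_id_from_url` is a hypothetical helper, and it assumes the ID is the path segment that directly follows `deployments/`, so a trailing page name in the URL is ignored.

```python
from urllib.parse import urlparse


def deployment_id_from_url(url):
    """Return the path segment that follows 'deployments' in a DataRobot URL."""
    # Split the URL path into its non-empty segments
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        return segments[segments.index("deployments") + 1]
    except (ValueError, IndexError):
        # 'deployments' is missing, or nothing follows it
        raise ValueError(f"No deployment ID found in URL: {url!r}") from None


example_id = deployment_id_from_url(
    "https://app.eu.datarobot.com/deployments/232315iijdfsafw"
)
print(example_id)  # 232315iijdfsafw
```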

### Step 2: Confirming Ingestion and Output 

DataRobot's Batch Prediction API lets you score data from and to multiple sources. For easy scoring, take advantage of the `credentials` and `data sources` you have already set up through the UI. `Credentials` are essentially stored usernames and passwords, while `data sources` are the databases you have already connected to, such as Snowflake.

The example code below shows how to list your `credentials` and `data sources`.


Full list of [input options](https://docs.datarobot.com/en/docs/predictions/batch/batch-prediction-api/intake-options.html)

Full list of [output options](https://docs.datarobot.com/en/docs/predictions/batch/batch-prediction-api/output-options.html)

In [2]:
# List all credentials
dr.Credential.list()

[Credential('6064670wqdww', 'DATAROBOT', 'basic'),
 Credential('606c17dfdwwdwwd', 'github-application-oauth', 'oauth'),
 Credential('607efadbaddwwdwwd', 'TheoPetropoulos', 's3'),
 Credential('6156f50adwdwdwdwdwdw', 'SnowflakeCredentials', 'basic')]

In the example above, you can see that I have several credentials, including my `GitHub` credentials, some `SnowflakeCredentials`, and `s3` credentials. The alphanumeric string on the left is the ID of the credential. I can use that ID to access the credential through the API.
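Looking up an ID by credential name avoids copying it from the printed list. The sketch below is a hypothetical helper, not a DataRobot function. It assumes each credential object exposes `name` and `credential_id` attributes, and it uses stand-in objects so it runs without a connection; in the notebook you would pass the result of `dr.Credential.list()` instead.

```python
from types import SimpleNamespace


def find_credential_id(credentials, name):
    """Return the ID of the first credential whose name matches `name`."""
    for cred in credentials:
        if cred.name == name:
            return cred.credential_id
    raise LookupError(f"No credential named {name!r}")


# Stand-in objects mirroring two entries from the listing above
# (assumed attribute names: `credential_id` and `name`)
sample_credentials = [
    SimpleNamespace(credential_id="6064670wqdww", name="DATAROBOT"),
    SimpleNamespace(credential_id="6156f50adwdwdwdwdwdw", name="SnowflakeCredentials"),
]

snowflake_credential_id = find_credential_id(sample_credentials, "SnowflakeCredentials")
print(snowflake_credential_id)
```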

In [6]:
# List all datastores and print the ID of the first one
data_stores = dr.DataStore.list()
print(data_stores[0].id)

60646dsddsdsaffa


In the example above, `dr.DataStore.list()` returns all the datastores (I only have a Snowflake connection), and indexing into that list gives access to the ID of each datastore.

## Examples

Below are some examples of how to use the Batch Prediction API. The `intake_settings` and `output_settings` can be changed to suit your needs, so you can *mix and match* them as much as you want to get the outcome you prefer. To switch a source or a target, only the corresponding settings dictionary needs to change.
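As an illustration of mixing and matching, here is a minimal sketch that pairs a local file intake with an S3 output. The two builder functions are hypothetical conveniences, not part of the DataRobot client, and the bucket name and credential ID are placeholders. The dictionary keys follow the CSV and S3 examples later in this section.

```python
def local_file_intake(path):
    """Intake settings for a local CSV file."""
    return {"type": "localFile", "file": path}


def s3_settings(url, credential_id):
    """Settings for an S3 object, usable as intake or output."""
    return {"type": "s3", "url": url, "credential_id": credential_id}


# Local file in, S3 out (bucket and credential ID are placeholders)
job_kwargs = {
    "intake_settings": local_file_intake("inputfile.csv"),
    "output_settings": s3_settings(
        "s3://YOUR_BUCKET/scored.csv", "YOUR_CREDENTIAL_ID"
    ),
}

# With a client connection in place, the job would be submitted as:
# dr.BatchPredictionJob.score(deployment_id, **job_kwargs)
```

Swapping the target for another output type means replacing only the `output_settings` entry; the intake side stays untouched.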

### Scoring from CSV to CSV

In [None]:
# Scoring without Prediction Explanations
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': 'inputfile.csv'  # Path, pandas DataFrame, or file-like object
    },
    output_settings={
        'type': 'localFile',
        'file': 'outputfile.csv'
    }
)

# Scoring with Prediction Explanations
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'localFile',
        'file': 'inputfile.csv'  # Path, pandas DataFrame, or file-like object
    },
    output_settings={
        'type': 'localFile',
        'file': 'outputfile.csv'
    },
    max_explanations=3  # Compute explanations for up to this many features
)
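The other features listed under Background (explanation thresholds, passthrough columns, concurrency control, prediction warnings) are passed as keyword arguments in the same way as `max_explanations`. The sketch below collects them in one place. The parameter names reflect my understanding of the Python client's `score` signature and should be checked against the client version you have installed. `scoring_options` is a hypothetical helper and `loan_id` is a made-up column name.

```python
# Keyword names assumed to be accepted by dr.BatchPredictionJob.score;
# verify against your installed client version.
ALLOWED_OPTIONS = {
    "max_explanations",
    "threshold_high",
    "threshold_low",
    "passthrough_columns",
    "num_concurrent",
    "prediction_warning_enabled",
}


def scoring_options(**options):
    """Validate option names and drop any option left as None."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise TypeError(f"Unknown scoring options: {sorted(unknown)}")
    return {key: value for key, value in options.items() if value is not None}


options = scoring_options(
    max_explanations=3,               # explanations per row
    threshold_high=0.8,               # explain only predictions above this value
    threshold_low=0.2,                # explain only predictions below this value
    passthrough_columns=["loan_id"],  # hypothetical column copied to the output
    num_concurrent=4,                 # limit concurrent scoring requests
    prediction_warning_enabled=None,  # left unset, so it is dropped
)

# With a client connection in place:
# dr.BatchPredictionJob.score(deployment_id, intake_settings=..., output_settings=..., **options)
```

Rejecting unknown names up front catches typos before a job is submitted.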

### Scoring from S3 to S3

In [None]:
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 's3',
        'url': 's3://theos-test-bucket/lending_club_scoring.csv',
        'credential_id': 'YOUR_CREDENTIAL_ID_FROM_ABOVE',
    },
    output_settings={
        'type': 's3',
        'url': 's3://theos-test-bucket/lending_club_scored2.csv',
        'credential_id': 'YOUR_CREDENTIAL_ID_FROM_ABOVE'
    }
)

### Scoring from JDBC to JDBC

In [None]:
# `data_store` and `cred` are assumed to be a DataStore and a Credential
# retrieved beforehand, e.g. from dr.DataStore.list() and dr.Credential.list()
dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={
        'type': 'jdbc',
        'table': 'table_name',
        'schema': 'public',
        'dataStoreId': data_store.id,  # ID of the datastore to read from
        'credentialId': cred.credential_id  # Credentials for that datastore
    },
    output_settings={
        'type': 'jdbc',
        'table': 'table_name',
        'schema': 'public',
        'statementType': 'insert',
        'dataStoreId': data_store.id,
        'credentialId': cred.credential_id
    }
)