![title](media/DataRobot.png)

### DataRobot provides R and Python packages to access different functionalities of the API
1 - Project   
2 - Model    
3 - Predict    
4 - Partitioning      
5 - Feature List

Full documentation of the Python package can be found here: https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.18.0/

Full documentation of the R package can be found here: https://cran.r-project.org/web/packages/datarobot/index.html

## Getting started
You can install datarobot using pip from any computer with internet access!

In [None]:
!pip install datarobot

### Loading the libraries

In [1]:
import pandas as pd
import datarobot as dr

### Credentials
To use the DataRobot API, you first need to connect to it. To ensure that only authorized users access the API, you must authenticate with your username and password or with an API token.
You also need to ensure your "API Access" configuration is ON (please ask your administrator if not).

To find your API Token, visit https://app.eu.datarobot.com/ , log in and follow the instructions below:

![title](media/credentials_1.png)

![title](media/credentials_2.png)

![title](media/credentials_3.png)

In [2]:
endpoint = 'https://app.eu.datarobot.com/api/v2'
# Put your API token here (never share a real token in a notebook)
api_token = 'YOUR_API_TOKEN'
dr.Client(token=api_token, endpoint=endpoint)

<datarobot.rest.RESTClientObject at 0x1159fd0b8>
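A token pasted into a cell is easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, reading the credentials from environment variables. The helper `load_credentials` and the variable names `DATAROBOT_API_TOKEN` and `DATAROBOT_ENDPOINT` are assumptions chosen for this illustration; adapt them to your own setup.

```python
import os

# Endpoint used elsewhere in this notebook; serves as the fallback value.
DEFAULT_ENDPOINT = 'https://app.eu.datarobot.com/api/v2'


def load_credentials(env=None):
    """Return (endpoint, api_token) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    api_token = env.get('DATAROBOT_API_TOKEN')
    if not api_token:
        raise RuntimeError('Set DATAROBOT_API_TOKEN before running this notebook')
    endpoint = env.get('DATAROBOT_ENDPOINT', DEFAULT_ENDPOINT)
    return endpoint, api_token


# Usage (needs the datarobot package and a valid token):
# endpoint, api_token = load_credentials()
# dr.Client(token=api_token, endpoint=endpoint)
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.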

## Read the dataset

In [40]:
diabetes_data = pd.read_csv('data/Diabetes_train.csv', encoding='ISO-8859-1')

In [4]:
diabetes_data.head(5)

Unnamed: 0,readmitted,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,...,glyburide.metformin,glipizide.metformin,glimepiride.pioglitazone,metformin.rosiglitazone,metformin.pioglitazone,change,diabetesMed,diag_1_desc,diag_2_desc,diag_3_desc
0,No,Caucasian,Female,[50-60],?,Elective,Discharged to home,Physician Referral,1,CP,...,No,No,No,No,No,No,No,Spinal stenosis in cervical region,Spinal stenosis in cervical region,"Effusion of joint, site unspecified"
1,No,Caucasian,Female,[20-30),[50-75),Urgent,Discharged to home,Physician Referral,2,UN,...,No,No,No,No,No,No,No,"First-degree perineal laceration, unspecified ...","Diabetes mellitus of mother, complicating preg...",Sideroblastic anemia
2,Yes,Caucasian,Male,[80-90),?,Not Available,Discharged/transferred to home with home healt...,,7,MC,...,No,No,No,No,No,No,Yes,Pneumococcal pneumonia [Streptococcus pneumoni...,"Congestive heart failure, unspecified",Hyperosmolality and/or hypernatremia
3,No,AfricanAmerican,Female,[50-60),?,Emergency,Discharged to home,Transfer from another health care facility,4,UN,...,No,No,No,No,No,No,Yes,Cellulitis and abscess of face,Streptococcus infection in conditions classifi...,Diabetes mellitus without mention of complicat...
4,No,AfricanAmerican,Female,[50-60),?,Emergency,Discharged to home,Emergency Room,5,?,...,No,No,No,No,No,Ch,Yes,"Bipolar I disorder, single manic episode, unsp...",Diabetes mellitus without mention of complicat...,Depressive type psychosis


In [5]:
diabetes_data.describe(include=['O'])

Unnamed: 0,readmitted,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,payer_code,medical_specialty,...,glyburide.metformin,glipizide.metformin,glimepiride.pioglitazone,metformin.rosiglitazone,metformin.pioglitazone,change,diabetesMed,diag_1_desc,diag_2_desc,diag_3_desc
count,10000,10000,10000,10000,10000,9279,9531,9064,10000,10000,...,10000,10000,10000,10000,10000,10000,10000,9998,9941,9792
unique,2,6,2,11,8,6,21,10,16,53,...,4,2,1,1,1,2,2,457,429,460
top,No,Caucasian,Female,[70-80),?,Emergency,Discharged to home,Emergency Room,?,?,...,No,No,No,No,No,No,Yes,Coronary atherosclerosis of unspecified type o...,Diabetes mellitus without mention of complicat...,Diabetes mellitus without mention of complicat...
freq,6035,7359,5398,2595,9592,4905,6056,4940,5341,4100,...,9944,9998,10000,10000,10000,5724,7478,735,684,1276


### Start a DataRobot project!

In [8]:
project = dr.Project.start(diabetes_data,                 #Pandas dataframe with data. Could also pass a file path
                           project_name='Diabetes Model', #Name of the project
                           target='readmitted',           #Target of the project
                           metric='LogLoss',              #Optimization metric (Default value)
                           worker_count=-1,               #Number of workers to use. -1 means every worker available
                           autopilot_on=True)             #Run on autopilot (Default value)

print(project.id, project.project_name)

5dd16fbe435de51a560954d0 Diabetes Model


### Interacting with autopilot

In [None]:
project.pause_autopilot() #Pausing autopilot

In [None]:
project.unpause_autopilot() #Unpausing autopilot

In [7]:
#Wait until autopilot is complete. Note: this blocks the
#notebook until autopilot finishes
project.wait_for_autopilot()

In progress: 18, queued: 22 (waited: 0s)
In progress: 18, queued: 22 (waited: 1s)
In progress: 18, queued: 22 (waited: 1s)
In progress: 18, queued: 22 (waited: 2s)
In progress: 18, queued: 22 (waited: 4s)
In progress: 18, queued: 22 (waited: 6s)
In progress: 18, queued: 22 (waited: 9s)
In progress: 18, queued: 22 (waited: 16s)
In progress: 18, queued: 20 (waited: 30s)
In progress: 18, queued: 10 (waited: 50s)
In progress: 19, queued: 3 (waited: 71s)
In progress: 18, queued: 3 (waited: 91s)
In progress: 9, queued: 1 (waited: 112s)


KeyboardInterrupt: 
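The run above had to be interrupted by hand because the wait blocks the notebook. A minimal sketch of a polling loop with an upper time limit is shown below. The helper `wait_until` is hypothetical and not part of the datarobot package; the commented usage assumes that `project.get_status()` returns a dictionary with an `autopilot_done` key, so check this against the documentation for your client version. The client's own `wait_for_autopilot` may also accept a timeout argument.

```python
import time


def wait_until(is_done, timeout=600.0, check_interval=20.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or timeout seconds have passed.

    Returns True if the condition was met, False if the time limit was hit.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(check_interval)


# Usage (assumption: get_status() exposes an 'autopilot_done' flag):
# finished = wait_until(lambda: project.get_status()['autopilot_done'], timeout=300)
# if not finished:
#     print('Autopilot is still running; check again later.')
```

Returning `False` on timeout, rather than raising, lets the notebook carry on and report progress instead of stopping with an error.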

### Where to find the project ID?
![title](media/model_id.png)

### What if I don't want to use my browser?

In [17]:
print(dr.Project.list())

[Project(Diabetes Model), Project(Diabetes API)]


In [18]:
for p in dr.Project.list():
    print(p, p.id)

Project(Diabetes Model) 5dd16fbe435de51a560954d0
Project(Diabetes API) 5dd16e337a1f0d19f059c0cf
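Once the project names and IDs are listed, a project can be looked up by name instead of copying its ID by hand. The sketch below uses a hypothetical helper, `find_projects_by_name`, and stand-in objects so that it runs without a DataRobot connection; it relies only on the `id` and `project_name` attributes already used in this notebook.

```python
from types import SimpleNamespace


def find_projects_by_name(projects, name):
    """Return all projects whose project_name equals name (exact match)."""
    return [p for p in projects if p.project_name == name]


# Stand-in objects mirroring the two projects listed above.
demo_projects = [
    SimpleNamespace(id='5dd16fbe435de51a560954d0', project_name='Diabetes Model'),
    SimpleNamespace(id='5dd16e337a1f0d19f059c0cf', project_name='Diabetes API'),
]
matches = find_projects_by_name(demo_projects, 'Diabetes API')
print([p.id for p in matches])

# Usage against the real API:
# matches = find_projects_by_name(dr.Project.list(), 'Diabetes Model')
```

The helper returns a list because project names are not guaranteed to be unique.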


### Interacting with the leaderboard

In [None]:
project.open_leaderboard_browser()

In [None]:
#  Set worker count higher. This will fail if you don't have 20 workers.
project.set_worker_count(20)

In [None]:
#  More jobs go into the queue at each stage of autopilot.
#  This gets the jobs that are currently in progress or queued.
project.get_model_jobs()

### Pick another project

In [None]:
project = dr.Project.get('5dd16fbe435de51a560954d0') # A project that has already finished autopilot

### Take a look at finished models

In [None]:
print(project.get_models())

In [None]:
#Pick the best non-blender model
best_model = project.get_models()[2]

print(best_model)
print(best_model.metrics['AUC'])
print(best_model.metrics['Gini Norm'])
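Picking a model by its position in the list is fragile, because the leaderboard order changes as models finish. A minimal sketch of a more explicit selection follows. The helper `best_non_blender` is hypothetical, and it assumes that blender models can be recognised by the word "Blender" in their `model_type`; the `metrics[metric][partition]` layout is the one used elsewhere in this notebook. The demo objects and their scores are made-up stand-ins so the sketch runs offline.

```python
from types import SimpleNamespace


def best_non_blender(models, metric='AUC', partition='validation',
                     higher_is_better=True):
    """Return the best-scoring model whose model_type does not mention 'Blender'."""
    candidates = [
        m for m in models
        if 'blender' not in m.model_type.lower()
        and m.metrics.get(metric, {}).get(partition) is not None
    ]
    if not candidates:
        raise ValueError('No scored non-blender model found')

    def score(model):
        return model.metrics[metric][partition]

    return max(candidates, key=score) if higher_is_better else min(candidates, key=score)


# Made-up stand-ins for leaderboard entries.
demo_models = [
    SimpleNamespace(model_type='Some Blender', metrics={'AUC': {'validation': 0.72}}),
    SimpleNamespace(model_type='Model A', metrics={'AUC': {'validation': 0.70}}),
    SimpleNamespace(model_type='Model B', metrics={'AUC': {'validation': 0.68}}),
]
print(best_non_blender(demo_models).model_type)

# Usage against the real API:
# best_model = best_non_blender(project.get_models())
```

For error metrics such as LogLoss, where lower is better, pass `metric='LogLoss', higher_is_better=False`.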

In [None]:
#Load the ROC curve points into a dataframe, ready for plotting
roc = best_model.get_roc_curve('crossValidation')
roc_df = pd.DataFrame(roc.roc_points)
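As a quick sanity check on the ROC points, the area under the curve can be approximated directly from them. The sketch below uses the trapezoidal rule and assumes each ROC point is a dictionary with `false_positive_rate` and `true_positive_rate` keys; verify the key names against `roc.roc_points` in your client version. The helper `auc_from_roc_points` is hypothetical.

```python
def auc_from_roc_points(roc_points):
    """Approximate the area under the ROC curve with the trapezoidal rule."""
    # Sort by false positive rate so the curve is traversed left to right.
    pts = sorted(
        (p['false_positive_rate'], p['true_positive_rate']) for p in roc_points
    )
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Usage with the object from the cell above:
# print(auc_from_roc_points(roc.roc_points))
```

The result should be close to the AUC reported in `best_model.metrics` for the same partition; a large gap usually means the points come from a different data source.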

### Creating a custom feature list

In [28]:
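feature_lists = project.get_featurelists() #All feature lists in the project
inform_features = feature_lists[1]         #Assumes index 1 is the "Informative Features" list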

print('How many features are informative?')
print(len(inform_features.features))

# What are those features?
inform_features.features

#Create a new feature list
without_weight = list(set(inform_features.features) - {'weight'})
without_weight_fl = project.create_featurelist('without_weight',
                                              without_weight)

How many features are informative?
40


['readmitted',
 'race',
 'gender',
 'age',
 'weight',
 'admission_type_id',
 'discharge_disposition_id',
 'admission_source_id',
 'time_in_hospital',
 'payer_code',
 'medical_specialty',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'diag_1',
 'diag_2',
 'diag_3',
 'number_diagnoses',
 'max_glu_serum',
 'A1Cresult',
 'metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'glipizide',
 'glyburide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'insulin',
 'glyburide_metformin',
 'change',
 'diabetesMed',
 'diag_1_desc',
 'diag_2_desc',
 'diag_3_desc']
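Building the new list with a set difference discards the original column order. If the order matters, for example when comparing feature lists by eye, an order-preserving filter is a small change. The helper name `drop_features` below is hypothetical.

```python
def drop_features(features, to_drop):
    """Return features without the names in to_drop, keeping the original order."""
    excluded = set(to_drop)
    return [name for name in features if name not in excluded]


print(drop_features(['race', 'gender', 'age', 'weight'], ['weight']))

# Usage with the objects from the cell above:
# without_weight = drop_features(inform_features.features, ['weight'])
```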

### Retraining a model on the custom feature list

In [35]:
#Retrain the best model using the custom feature list
best_model.train(featurelist_id=without_weight_fl.id)

ClientError: 422 client error: {'message': 'The endpoint does not support retraining blender on a different featurelist'}

### Which one is the best?

In [36]:
best_model.blueprint_id

'ff0fd70aff222fc804a803438f7f2ed7'

In [21]:
models_rerun = [model for model in project.get_models() 
                    if model.blueprint_id == best_model.blueprint_id and model.sample_pct > 63]


In [None]:
models_rerun

In [None]:
accuracies = [[model.featurelist.name, model.metrics['AUC']['validation']] for model in models_rerun]
accuracies

### Train on 100% of Data

In [None]:
project.unlock_holdout()
id_for_best_retrained_model = best_model.train(sample_pct=100)

## Predictions
There are 2 ways to generate predictions from any model using the DataRobot API:
#### Modelling API
You can use the modelling API if you work in Python or R. It provides functions and objects for requesting predictions, used in the same way as the calls above.
#### Prediction API
Any model can be deployed and called through the Prediction API, which is a simple REST API. Click on a model in the UI, then "Deploy Model" and "Activate now". You will then have access to a Python code snippet to help you interact with it.

### Using the Modelling API

In [43]:
test_df = pd.read_csv('data/Diabetes_test.csv', encoding='ISO-8859-1') #Load testing data
test_df.drop('readmitted', inplace=True, axis=1) #Dropping target column

prediction_data = project.upload_dataset(test_df)
predict_job = best_model.request_predictions(prediction_data.id)
result = predict_job.get_result_when_complete()

In [None]:
result

### Using the prediction API

![title](media/prediction_api.png)

See demonstration.

### Batch scoring script
If you're predicting on a large amount of batch data, a single API call might not be efficient, and you might want to split your data into multiple chunks before generating predictions.

DataRobot provides a Python script to help you do it: https://pypi.python.org/pypi/datarobot_batch_scoring

Note this is only available in Python 3.
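
The chunking idea itself can be sketched in a few lines. This is only an illustration of splitting rows into fixed-size batches; it does not describe how the batch scoring script works internally, and the helper name `chunked` is hypothetical.

```python
from itertools import islice


def chunked(rows, chunk_size):
    """Yield consecutive lists of at most chunk_size items from rows."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


for batch in chunked(range(5), 2):
    print(batch)
```

Each batch could then be sent as its own prediction request, which keeps individual calls small and makes a failed batch cheap to retry.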