# About this Notebook

This notebook provides an introduction to the `datarobot` Python package, highlighting the following key details of its use:

* connecting to the DataRobot modeling engine from a Python session
* creating a new modeling project in the DataRobot modeling engine
* retrieving the results from a DataRobot modeling project
* generating predictions from any DataRobot model

To illustrate this, we will focus on the problem of predicting airline delays.

---

# Table of Contents

1. [The DataRobot Modeling Engine](#The-DataRobot-Modeling-Engine)
2. [Connecting to DataRobot](#Connecting-to-DataRobot)
3. [Sample Data for this Exercise](#Sample-Data-for-this-Exercise)
4. [Creating a new Project](#Creating-a-new-Project)
5. [Making Predictions with DataRobot](#Making-Predictions-with-DataRobot)
6. [Conclusion](#Conclusion)

---
## The DataRobot Modeling Engine

The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments.

The DataRobot modeling engine is organized around *modeling projects*, each based on a single data source, a single target variable to be predicted, and a single metric to be optimized in fitting and ranking project models. This information is sufficient to create a project, identified by a unique alphanumeric **project_id** label, and start the DataRobot Autopilot.

This notebook uses our official Python client package, which wraps the DataRobot REST API in an easy-to-use library. The Python package supports Python 2 and 3 and is hosted on [PyPI](https://pypi.python.org/pypi/datarobot). This tutorial makes heavy use of this library, so we will take a moment to make sure it is installed:

In [1]:
# You only need to run this cell if you did **not** install the datarobot
# package via the SageMaker Lifecycle Configuration
try:
    import datarobot
except ImportError:
    import sys
    !{sys.executable} -m pip install datarobot

<div class="alert alert-box alert-info">
<em>Note</em>: if you are <strong>not</strong> using DataRobot Cloud, we recommend you pin the package version to the one that matches your DataRobot Enterprise install version. For example:
<br>
<code>!{sys.executable} -m pip install datarobot==2.8.1</code>
<br>
<br>
Your support representative can help you determine which version you should use (the package version does <strong>not</strong> follow the same versioning scheme as the DataRobot modeling engine).

</div>
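
Before pinning, it can help to see which version of the client is currently installed. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of the `datarobot` package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version('datarobot')
if version is None:
    print('datarobot is not installed in this environment')
else:
    print(f'datarobot client version: {version}')
```

The reported version can then be compared with the one your support representative recommends.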

---
## Connecting to DataRobot

To access the DataRobot modeling engine, it is necessary to establish an authenticated connection. The necessary information is an **endpoint** - the URL address of the specific DataRobot server being used - and a **token**, a previously validated access token.

### Endpoint 
**endpoint** depends on the DataRobot modeling engine installation (cloud-based or on-prem) you are using. Contact your DataRobot admin for the correct endpoint to use. 

Please update the variable below accordingly.
<div class="alert alert-box alert-info">
<em>Note</em>: If you are using DataRobot Cloud, you do <strong>not</strong> need to make any changes.
</div>

In [2]:
endpoint = 'https://app.datarobot.com/api/v2'

### Token
**token** is unique for each DataRobot modeling engine account and can be obtained by logging into the DataRobot web app and browsing to the account profile section. It is a string of letters and numbers. In the DataRobot UI, click the person icon (top right), open "Settings", and look for "API Token". You can also go to https://app.datarobot.com/account/me to see it, replacing _app.datarobot.com_ with your own host if you are not using the cloud installation.

Enter your token below.

In [3]:
token = 'YOUR_TOKEN_HERE'

assert token != 'YOUR_TOKEN_HERE'

### Setting the configuration in code
The next cell will configure the datarobot package to use the provided endpoint and token. This only needs to be done once per session.

In [4]:
import datarobot as dr 
dr.Client(endpoint=endpoint, token=token)

<datarobot.rest.RESTClientObject at 0x7ff232a7c940>

The connection to DataRobot should now be ready. 

<div class="alert alert-box alert-info">
<em>Note</em>: Setting these values in code is not the only way to configure the DataRobot package. See the package documentation for additional configuration options.
</div>
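
One common alternative is to keep the token out of the notebook and read it from the environment. The sketch below is only an illustration: the helper `load_credentials` is hypothetical, and the variable names `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` are assumptions that should be confirmed against the package documentation for your client version.

```python
import os

DEFAULT_ENDPOINT = 'https://app.datarobot.com/api/v2'


def load_credentials(env=None):
    """Read the endpoint and token from environment-style variables."""
    env = os.environ if env is None else env
    # Assumed variable names; check the package documentation.
    endpoint_value = env.get('DATAROBOT_ENDPOINT', DEFAULT_ENDPOINT)
    token_value = env.get('DATAROBOT_API_TOKEN')
    if not token_value:
        raise RuntimeError('Set DATAROBOT_API_TOKEN before connecting.')
    return endpoint_value, token_value


# Demonstration with an explicit mapping instead of the real environment.
demo_endpoint, demo_token = load_credentials(
    {'DATAROBOT_API_TOKEN': 'example-token'})
print(demo_endpoint)
```

The returned pair could then be passed to `dr.Client(endpoint=..., token=...)` exactly as in the cell above.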

## Sample Data for this Exercise

### Background

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict, before scheduling a flight, whether it is likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained on the day of the flight.

---

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.
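
The delay rule described above can be written down in a few lines. This is a sketch of the stated definition only, not the script that was used to assemble the dataset; the function name and the example times are made up for illustration.

```python
from datetime import datetime, timedelta

DELAY_THRESHOLD = timedelta(minutes=15)


def was_delayed(scheduled, actual):
    """Return True if departure was at least 15 minutes after schedule."""
    return actual - scheduled >= DELAY_THRESHOLD


example_scheduled = datetime(2013, 2, 1, 9, 55)
# 14 minutes late: below the threshold.
print(was_delayed(example_scheduled, datetime(2013, 2, 1, 10, 9)))
# Exactly 15 minutes late: counts as delayed.
print(was_delayed(example_scheduled, datetime(2013, 2, 1, 10, 10)))
```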

In addition to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. The daily rainfall for each day in 2013 and 2014 was taken from the recorded daily summaries of water-equivalent precipitation at the Boston Logan station. For some days, the QCLCD reports trace amounts of rainfall, which were recorded as 0 inches of rain.
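
As an illustration of the trace-amount rule, the sketch below converts a raw precipitation entry to inches. It assumes that a trace amount is marked with the letter `T` in the raw data; that marker and the helper name are assumptions, so check the QCLCD documentation before relying on them.

```python
def rainfall_inches(raw_value):
    """Convert a raw precipitation entry to inches, treating trace as 0."""
    text = str(raw_value).strip()
    if text.upper() == 'T':  # assumed marker for a trace amount
        return 0.0
    return float(text)


print(rainfall_inches('0.25'))
print(rainfall_inches('  T'))
```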

We have collected and stored this data in Amazon S3, and we can read that data into a Pandas dataframe for analysis.

In [5]:
import pandas as pd
pd.options.display.max_rows = 10  # default display is too verbose

dataset_loc = 'https://s3.amazonaws.com/datarobot-public-datasets-redistributable'

df = pd.read_csv(f"{dataset_loc}/logan-US-2013.csv")
df

Unnamed: 0,was_delayed,daily_rainfall,did_rain,Carrier Code,Date (MM/DD/YYYY),Flight Number,Tail Number,Destination Airport,Scheduled Departure Time
0,False,0.0,False,US,02/01/2013,225,N662AW,PHX,16:20
1,False,0.0,False,US,02/01/2013,280,N822AW,PHX,06:00
2,False,0.0,False,US,02/01/2013,303,N653AW,CLT,09:35
3,True,0.0,False,US,02/01/2013,604,N640AW,PHX,09:55
4,False,0.0,False,US,02/01/2013,722,N715UW,PHL,18:30
...,...,...,...,...,...,...,...,...,...
18294,False,0.0,False,US,07/31/2013,2137,N948UW,LGA,17:00
18295,False,0.0,False,US,07/31/2013,2139,N963UW,LGA,18:00
18296,False,0.0,False,US,07/31/2013,2141,N956UW,LGA,19:00
18297,False,0.0,False,US,07/31/2013,2143,N947UW,LGA,20:00


### Dataset Structure

Each row in the assembled dataset contains the following columns:

- was_delayed
    - boolean
    - whether the flight was delayed
- daily_rainfall
    - float
    - the amount of rain, in inches, on the day of the flight
- did_rain
    - bool
    - whether it rained on the day of the flight
- Carrier Code
    - str
    - the carrier code of the airline - US for all entries in assembled dataset
- Date
    - str (MM/DD/YYYY format)
    - the date of the flight
- Flight Number
    - str
    - the flight number for the flight
- Tail Number
    - str
    - the tail number of the aircraft
- Destination Airport
    - str
    - the three-letter airport code of the destination airport
- Scheduled Departure Time
    - str
    - the 24-hour scheduled departure time of the flight, in the origin airport's timezone


We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data:

In [6]:
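def prepare_modeling_dataset(df):
    """Replace the raw date column with day-of-week and month features."""
    date_column_name = 'Date (MM/DD/YYYY)'
    # Parse with an explicit format so dates are never read as DD/MM/YYYY.
    date = pd.to_datetime(df[date_column_name], format='%m/%d/%Y')
    modeling_df = df.drop(date_column_name, axis=1)
    days = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat',
            6: 'Sun'}
    # dayofweek is 0 for Monday through 6 for Sunday.
    modeling_df['day_of_week'] = date.dt.dayofweek.map(days)
    modeling_df['month'] = date.dt.month
    return modeling_df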

modeling_df = prepare_modeling_dataset(df)
modeling_df

Unnamed: 0,was_delayed,daily_rainfall,did_rain,Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,day_of_week,month
0,False,0.0,False,US,225,N662AW,PHX,16:20,Fri,2
1,False,0.0,False,US,280,N822AW,PHX,06:00,Fri,2
2,False,0.0,False,US,303,N653AW,CLT,09:35,Fri,2
3,True,0.0,False,US,604,N640AW,PHX,09:55,Fri,2
4,False,0.0,False,US,722,N715UW,PHL,18:30,Fri,2
...,...,...,...,...,...,...,...,...,...,...
18294,False,0.0,False,US,2137,N948UW,LGA,17:00,Weds,7
18295,False,0.0,False,US,2139,N963UW,LGA,18:00,Weds,7
18296,False,0.0,False,US,2141,N956UW,LGA,19:00,Weds,7
18297,False,0.0,False,US,2143,N947UW,LGA,20:00,Weds,7
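
To sanity-check the derived columns, the same transformation can be reproduced for a single date with the standard library. This is a small sketch for verification only; `date_features` is a hypothetical helper, and `weekday()` uses the same Monday-is-0 convention as the pandas `dayofweek` attribute.

```python
from datetime import datetime

DAY_NAMES = ['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']


def date_features(date_string):
    """Return (day_of_week, month) for a date in MM/DD/YYYY format."""
    parsed = datetime.strptime(date_string, '%m/%d/%Y')
    return DAY_NAMES[parsed.weekday()], parsed.month


print(date_features('02/01/2013'))  # first row of the 2013 data
print(date_features('07/31/2013'))  # last row of the 2013 data
```

The two results should agree with the `day_of_week` and `month` values shown for the first and last rows above.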


---
## Creating a new Project

One of the most common and important uses of the **datarobot** package is the creation of a new modeling project. This task is supported by the following three functions:

* __dr.Project.start__ creates a new project, generating a unique alphanumeric project identifier (__projectId__), uploading the modeling data, and allowing the specification of a project name and the target to model;
* __project.wait_for_autopilot__ waits for DataRobot to finish building models;
* __project.get_models__ retrieves information on the models DataRobot built, once Autopilot is complete.

The **DataRobot Autopilot** builds, evaluates, and summarizes a collection of models. While the Autopilot is running, intermediate results are saved in a list that is updated until the project completes. The last stage of the modeling process constructs *blender* models, ensemble models that combine two or more of the best-performing individual models in various different ways. These models are ranked in the same way as the individual models and are included in the final project list.
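
To give a feel for what a blender does, here is a minimal sketch of the simplest kind of ensemble: averaging the predicted probabilities of several models row by row. It illustrates the general idea only and is not DataRobot's implementation; the function name and the probabilities are made up.

```python
def average_blend(*model_probabilities):
    """Average per-row probabilities from several models."""
    if not model_probabilities:
        raise ValueError('at least one model is required')
    n_models = len(model_probabilities)
    # zip(*...) groups the predictions that belong to the same row.
    return [sum(row) / n_models for row in zip(*model_probabilities)]


model_a = [0.10, 0.40, 0.80]  # made-up probabilities from one model
model_b = [0.20, 0.20, 0.60]  # made-up probabilities from another model
blended = average_blend(model_a, model_b)
print([round(p, 2) for p in blended])
```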

In [7]:
project = dr.Project.start(modeling_df,                  # Specify the dataframe we want to model with. (This can also be a path to a CSV or URL.)
                           project_name='Airline Delay', # Give the project a name.
                           target='was_delayed')         # Give the name of the variable specifying the target (the value we want to predict).
project.id

'5ace47d5e3cd9b53aacbd799'

### Configuring the number of workers
The DataRobot platform can run multiple modeling tasks in parallel by leveraging our pool of workers. You can set the number of workers that a project should use.

<div class="alert alert-box alert-warning">
Depending on the number of workers available to you per your license with DataRobot, the following cell might not succeed. It may also not be using all of your available workers.
</div>

In [8]:
# Set the worker count to the max your account allows to speed up training
project.set_worker_count(4)

Project(Airline Delay)
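
If you know how many workers your license allows, one way to avoid requesting too many is to clamp the requested count first. This is a hedged sketch: `clamp_worker_count` is a hypothetical helper, and `licensed_max` is a number you supply yourself, not something read from DataRobot.

```python
def clamp_worker_count(desired, licensed_max):
    """Return a worker count between 1 and the licensed maximum."""
    if licensed_max < 1:
        raise ValueError('licensed_max must be at least 1')
    return max(1, min(desired, licensed_max))


print(clamp_worker_count(desired=8, licensed_max=4))
```

The result could then be passed to `project.set_worker_count(...)` as in the cell above.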

You can view the progress of Autopilot in the DataRobot web UI or explore more aspects of the project you just created. The cell below should output a link to the newly created project:

In [9]:
print(project.get_leaderboard_ui_permalink())

https://app.datarobot.com/projects/5ace47d5e3cd9b53aacbd799/models


You can also wait for the modeling to finish via the API...

In [10]:
%%time
project.wait_for_autopilot()

In progress: 4, queued: 29 (waited: 0s)
In progress: 4, queued: 29 (waited: 1s)
In progress: 4, queued: 29 (waited: 1s)
In progress: 4, queued: 29 (waited: 2s)
In progress: 4, queued: 29 (waited: 3s)
In progress: 4, queued: 29 (waited: 5s)
In progress: 4, queued: 29 (waited: 9s)
In progress: 4, queued: 29 (waited: 16s)
In progress: 4, queued: 28 (waited: 29s)
In progress: 4, queued: 24 (waited: 49s)
In progress: 4, queued: 21 (waited: 70s)
In progress: 4, queued: 18 (waited: 90s)
In progress: 4, queued: 16 (waited: 111s)
In progress: 4, queued: 12 (waited: 131s)
In progress: 4, queued: 11 (waited: 152s)
In progress: 4, queued: 8 (waited: 172s)
In progress: 4, queued: 5 (waited: 193s)
In progress: 4, queued: 3 (waited: 213s)
In progress: 4, queued: 2 (waited: 233s)
In progress: 1, queued: 0 (waited: 254s)
In progress: 4, queued: 12 (waited: 274s)
In progress: 4, queued: 10 (waited: 294s)
In progress: 4, queued: 8 (waited: 314s)
In progress: 4, queued: 5 (waited: 335s)
In progress: 4, qu
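
The output above shows status checks spaced further apart as time goes on. A generic version of that pattern, polling with a growing but capped delay, might look like the sketch below. It is not the client's own implementation; `wait_until`, its parameters, and the fake status check are all hypothetical.

```python
import time


def wait_until(is_done, max_wait=3600, first_delay=1.0, max_delay=20.0,
               sleep=time.sleep):
    """Poll is_done() with a growing delay; return total delay requested."""
    waited = 0.0
    delay = first_delay
    while not is_done():
        if waited >= max_wait:
            raise TimeoutError(f'gave up after {waited:.0f}s')
        sleep(delay)
        waited += delay
        # Double the delay each round, up to the cap.
        delay = min(delay * 2, max_delay)
    return waited


# Demonstration with a fake check that succeeds on the fourth poll and a
# no-op sleep, so the example finishes instantly.
fake_polls = iter([False, False, False, True])
total_waited = wait_until(lambda: next(fake_polls),
                          sleep=lambda seconds: None)
print(total_waited)
```

Passing `sleep` as a parameter keeps the function easy to test, since a no-op can stand in for real waiting.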

### Retrieving project results

We can then use the API to interact with the `project` object to get data, such as the list of models built.

In [11]:
models = project.get_models()
len(models)

61

Here we can see DataRobot built approximately 60 different predictive models automatically in under 30 minutes. Cool! Let's get some more information on what happened, such as getting all the models, their unique IDs, the logarithmic loss for each model on the cross validation segment, the percent of train data the model was trained on, and the type of model.

Some models don't have a cross-validation score, as DataRobot only runs cross validation for the best models found via a simple holdout set. `LogLoss CV` tells us the logarithmic loss on a five-fold cross validation and `LogLoss` tells us the logarithmic loss on just the validation set (first fold). Lower values are better. Below we will look at the top 10 models in the _leaderboard_.

In [12]:
model_list = [{'id': m.id,
               'LogLoss CV': m.metrics['LogLoss']['crossValidation'],
               'LogLoss': m.metrics['LogLoss']['validation'],
               'type': m.model_type,
               'samplePct': m.sample_pct} for m in models]
model_df = pd.DataFrame(model_list)
model_df.sort_values(by='LogLoss CV', inplace=True)
model_df.reset_index(drop=True, inplace=True)
model_df.head(10)

Unnamed: 0,LogLoss,LogLoss CV,id,samplePct,type
0,0.27244,0.27147,5ace4af8c2674ebe4a33bf64,64.0035,ENET Blender
1,0.27244,0.271474,5ace4af7c2674ebe4a33bf5e,64.0035,AVG Blender
2,0.27256,0.271688,5ace4af8c2674ebe4a33bf62,64.0035,ENET Blender
3,0.27206,0.27174,5ace4af8c2674ebe4a33bf60,64.0035,Advanced AVG Blender
4,0.27228,0.272182,5ace49c4c2674e86aedf5e28,64.0035,Light Gradient Boosted Trees Classifier with E...
5,0.27357,0.272282,5ace49c4c2674e86aedf5e27,64.0035,eXtreme Gradient Boosted Trees Classifier with...
6,0.27417,0.272626,5ace49c4c2674e86aedf5e22,64.0035,eXtreme Gradient Boosted Trees Classifier with...
7,0.27421,0.273586,5ace49c4c2674e86aedf5e25,64.0035,eXtreme Gradient Boosted Trees Classifier with...
8,0.27501,0.274132,5ace49c4c2674e86aedf5e29,64.0035,Light Gradient Boosting on ElasticNet Predicti...
9,0.27678,0.274976,5ace49c4c2674e86aedf5e26,64.0035,Gradient Boosted Trees Classifier with Early S...
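
For readers unfamiliar with the metric, the sketch below computes binary logarithmic loss from scratch on made-up values. It follows the standard textbook formula and is intended only to show what the leaderboard numbers measure; DataRobot computes its own scores on the server.

```python
import math


def log_loss(actuals, probabilities, eps=1e-15):
    """Mean binary logarithmic loss; lower is better."""
    if not actuals or len(actuals) != len(probabilities):
        raise ValueError('inputs must be non-empty and the same length')
    total = 0.0
    for y, p in zip(actuals, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actuals)


# An uninformative model that always predicts 0.5.
print(round(log_loss([1, 0], [0.5, 0.5]), 4))
# A confident and correct model scores much lower.
print(round(log_loss([1, 0], [0.9, 0.1]), 4))
```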


In [13]:
best_model_id = model_df.iloc[0, :]['id']
best_model_id

'5ace4af8c2674ebe4a33bf64'

In [14]:
best_model = dr.Model.get(project.id, best_model_id)
best_model

Model('ENET Blender')
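
Taking the first row works here because pandas `sort_values` places missing values last by default, so models without a cross-validation score end up at the bottom. The same selection can be made explicit, as in this sketch with made-up records (the helper name and model identifiers are hypothetical):

```python
def best_by_cv(records, key='LogLoss CV'):
    """Return the record with the lowest score, ignoring missing scores."""
    scored = [r for r in records if r.get(key) is not None]
    if not scored:
        raise ValueError(f'no record has a value for {key!r}')
    return min(scored, key=lambda r: r[key])


example_records = [
    {'id': 'model-a', 'LogLoss CV': 0.2750},
    {'id': 'model-b', 'LogLoss CV': None},  # cross validation not run
    {'id': 'model-c', 'LogLoss CV': 0.2715},
]
print(best_by_cv(example_records)['id'])
```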

As we said before, DataRobot uses built-in cross-validation and holdout to judge models. Before making predictions, we will want to train the DataRobot model on the maximum amount of data it has. To do this, we unlock the holdout set using `project.unlock_holdout()` and then retrain the model on 100% of the data given to DataRobot using `model.train`. Once training has started, we use `dr.models.modeljob.wait_for_async_model_creation` to pause until the model has been built.

In [15]:
%%time
project.unlock_holdout()
job_id_for_retraining_best_model = best_model.train(sample_pct=100)
best_model = dr.models.modeljob.wait_for_async_model_creation(project.id, job_id_for_retraining_best_model)

CPU times: user 184 ms, sys: 8 ms, total: 192 ms
Wall time: 55.9 s


---
## Making Predictions with DataRobot

Now that we have some basic information on all the models, we can make predictions with the best model.

Let's load some test data: flights from the year 2014. We need to prepare the prediction data the same way as the training data, and then we can upload it to DataRobot and get predictions.

In [16]:
test_df = pd.read_csv(f"{dataset_loc}/logan-US-2014.csv")
test_df

Unnamed: 0,was_delayed,daily_rainfall,did_rain,Carrier Code,Date (MM/DD/YYYY),Flight Number,Tail Number,Destination Airport,Scheduled Departure Time
0,False,0.0,False,US,02/01/2014,450,N809AW,PHX,10:00
1,False,0.0,False,US,02/01/2014,553,N814AW,PHL,07:00
2,False,0.0,False,US,02/01/2014,582,N820AW,PHX,06:10
3,False,0.0,False,US,02/01/2014,601,N678AW,PHX,16:20
4,False,0.0,False,US,02/01/2014,657,N662AW,CLT,09:45
...,...,...,...,...,...,...,...,...,...
18437,False,0.0,False,US,07/31/2014,2155,N950UW,LGA,16:00
18438,False,0.0,False,US,07/31/2014,2157,N955UW,LGA,17:00
18439,True,0.0,False,US,07/31/2014,2159,N948UW,LGA,18:00
18440,False,0.0,False,US,07/31/2014,2161,N958UW,LGA,19:00


In [17]:
%%time
test_df = prepare_modeling_dataset(test_df)
prediction_dataset = project.upload_dataset(test_df)
predict_job = best_model.request_predictions(prediction_dataset.id)
predictions = predict_job.get_result_when_complete()

CPU times: user 3.86 s, sys: 48 ms, total: 3.9 s
Wall time: 44.9 s


In [18]:
results_df = pd.DataFrame(predictions)
results_df

Unnamed: 0,positive_probability,prediction,row_id,class_0.0,class_1.0
0,0.045607,0.0,0,0.954393,0.045607
1,0.016567,0.0,1,0.983433,0.016567
2,0.013047,0.0,2,0.986953,0.013047
3,0.069339,0.0,3,0.930661,0.069339
4,0.035004,0.0,4,0.964996,0.035004
...,...,...,...,...,...
18437,0.125973,0.0,18437,0.874027,0.125973
18438,0.203456,0.0,18438,0.796544,0.203456
18439,0.195740,0.0,18439,0.804260,0.195740
18440,0.261509,0.0,18440,0.738491,0.261509


Predictions come back letting us know the probability of both the binary classes, the overall positive probability (same as the probability of class = 1.0), and the overall class prediction (0 or 1, based on the probability). We can then combine the class prediction with the original data.

In [19]:
pd.concat([test_df.reset_index(), results_df['prediction'].reset_index()],
          axis=1).drop('index', axis=1)

Unnamed: 0,was_delayed,daily_rainfall,did_rain,Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,day_of_week,month,prediction
0,False,0.0,False,US,450,N809AW,PHX,10:00,Sat,2,0.0
1,False,0.0,False,US,553,N814AW,PHL,07:00,Sat,2,0.0
2,False,0.0,False,US,582,N820AW,PHX,06:10,Sat,2,0.0
3,False,0.0,False,US,601,N678AW,PHX,16:20,Sat,2,0.0
4,False,0.0,False,US,657,N662AW,CLT,09:45,Sat,2,0.0
...,...,...,...,...,...,...,...,...,...,...,...
18437,False,0.0,False,US,2155,N950UW,LGA,16:00,Thurs,7,0.0
18438,False,0.0,False,US,2157,N955UW,LGA,17:00,Thurs,7,0.0
18439,True,0.0,False,US,2159,N948UW,LGA,18:00,Thurs,7,0.0
18440,False,0.0,False,US,2161,N958UW,LGA,19:00,Thurs,7,0.0
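
With actual and predicted values side by side, a natural next step is to count how often they agree. The sketch below uses made-up labels and probabilities and assumes a 0.5 cutoff for turning a probability into a class; the threshold DataRobot actually applies may differ, and both helper names are hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0.0/1.0 predictions."""
    return [1.0 if p >= threshold else 0.0 for p in probabilities]


def confusion_counts(actuals, predicted):
    """Count true/false positives and negatives."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for actual, pred in zip(actuals, predicted):
        if pred and actual:
            counts['tp'] += 1
        elif pred and not actual:
            counts['fp'] += 1
        elif actual:
            counts['fn'] += 1
        else:
            counts['tn'] += 1
    return counts


toy_actuals = [False, False, True, True]      # made-up labels
toy_probabilities = [0.05, 0.62, 0.20, 0.81]  # made-up probabilities
toy_predictions = classify(toy_probabilities)
print(confusion_counts(toy_actuals, toy_predictions))
```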


---
## Conclusion
This concludes our brief overview of the `datarobot` Python client. If you continue on in this series of notebooks, we will dig deeper into analyzing the models we just built in this tutorial. We will also explore prediction results in more detail. In addition, we have a complete [API Reference](https://datarobot-public-api-client.readthedocs-hosted.com/) of the Python client.

To avoid duplicating the work you've accomplished in this notebook (i.e. creating a project and training an optimal model), the code below will save away some important details that other notebooks in this series will rely on.

In [20]:
import json

output_file = 'intro-notebook-output.json'  # the other notebooks will look for this file
output = {
    'project-id': project.id,
    'best-model-id': best_model.id,
    'predict-dataset-id': prediction_dataset.id,
}

with open(output_file, 'w') as fp:
    json.dump(output, fp)
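
A later notebook could read the file back along these lines. This is a sketch, assuming the three keys written above; `load_notebook_output` is a hypothetical helper, and the demonstration uses a temporary file with placeholder identifiers so it does not touch the real output.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {'project-id', 'best-model-id', 'predict-dataset-id'}


def load_notebook_output(path):
    """Load the saved identifiers and check that the expected keys exist."""
    with open(path) as handle:
        saved = json.load(handle)
    missing = EXPECTED_KEYS - set(saved)
    if missing:
        raise KeyError(f'missing keys: {sorted(missing)}')
    return saved


# Demonstration using a temporary file and placeholder identifiers.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'intro-notebook-output.json')
    with open(demo_path, 'w') as handle:
        json.dump({'project-id': 'p', 'best-model-id': 'm',
                   'predict-dataset-id': 'd'}, handle)
    demo_saved = load_notebook_output(demo_path)
print(demo_saved['project-id'])
```

The loaded identifiers can then be passed to calls such as `dr.Model.get`, used earlier in this notebook.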

Congratulations! You are now ready to move on to the other notebooks in this series.

### Next Steps
To continue learning about DataRobot and its API, please continue on to the next notebook in this tutorial: [Diving Deeper into Models](Diving Deeper into DataRobot Models.ipynb).