### 4. Python API Training - Beyond AutoPilot [Solution]

<b>Author:</b> Thodoris Petropoulos <br>
<b>Contributors:</b> Rajiv Shah

This is the fourth exercise in the `Python API Training for DataRobot` course. It shows how to use the repository and advanced tuning to build models that outperform the ones AutoPilot produces.

Here are the sections of the notebook, with the estimated time to complete each:

1. Connect to DataRobot. [3min]
2. Retrieve the Project created during the `Feature Selection Curves` exercise. [5min]
3. Retrieve validation and cross-validation AUC score for best current model. [7min]
4. Run models with `Keras` within their `model_type` (64 `sample_pct`). [15min]
5. Check whether you created a model with a better validation score. [10min]
6. Sort all models by cross validation score. [15min]
7. Retrieve a specific model and change a specific hyperparameter.

Each section will have specific instructions so do not worry if things are still blurry!

As always, consult:

- [API Documentation](https://datarobot-public-api-client.readthedocs-hosted.com)
- [Samples](https://github.com/datarobot-community/examples-for-data-scientists)
- [Tutorials](https://github.com/datarobot-community/tutorials-for-data-scientists)

The last two links should provide you with the snippets you need to complete most of these exercises.

<b>Data</b>

The dataset we will be using throughout these exercises is the well-known `readmissions dataset`. You can access it or directly download it through DataRobot's public S3 bucket [here](https://s3.amazonaws.com/datarobot_public_datasets/10k_diabetes.csv).

### Import Libraries
Add imports here as you discover which libraries you need. The DataRobot package is already imported for your convenience.

In [1]:
import datarobot as dr

#Proposed Libraries needed
import pandas as pd
import time

### 1. Connect to DataRobot. [3min]

In [2]:
#Possible solution
dr.Client(config_path='../../github/config.yaml')

<datarobot.rest.RESTClientObject at 0x116ad9cc0>

### 2. Retrieve the Project created during the `Feature Selection Curves` exercise. [5min]

Retrieve the project you created using the readmissions dataset and save it into a variable called `project`.

**Hint**: To use a project created in DataRobot, you can either list all of the available projects using the Python API or find the ID in the web interface. For example, if you are logged into DataRobot, your browser will be pointing to a link such as this: `https://YOUR_HOSTNAME/projects/PROJECT_ID/models/MODEL_ID`. Just copy and paste the `PROJECT_ID`.
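If you would rather not copy the ID by hand, the URL layout described in the hint can be parsed with the standard library. The sketch below is only an illustration: `project_id_from_url` is a hypothetical helper (not part of the DataRobot client), and the IDs in the example URL are made up. It assumes nothing beyond the `/projects/PROJECT_ID/...` path layout quoted above.

```python
from urllib.parse import urlparse


def project_id_from_url(url):
    """Return the path segment that follows 'projects' in the URL.

    Hypothetical helper: it only assumes the layout quoted in the hint,
    /projects/PROJECT_ID/models/MODEL_ID.
    """
    segments = [s for s in urlparse(url).path.split('/') if s]
    if 'projects' not in segments:
        raise ValueError("no 'projects' segment in URL: " + url)
    index = segments.index('projects')
    if index + 1 >= len(segments):
        raise ValueError('URL ends before the project ID: ' + url)
    return segments[index + 1]


# Made-up IDs, used only to show the call.
example_url = 'https://YOUR_HOSTNAME/projects/abc123/models/def456'
print(project_id_from_url(example_url))  # prints abc123
```

The returned string could then be passed to `dr.Project.get`, as in the proposed solution below.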

In [4]:
#Proposed Solution
project = dr.Project.get('YOUR_PROJECT_ID')

### 3. Retrieve validation and cross-validation AUC score for best current model. [7min]

In [13]:
#Proposed Solution
best_model = project.get_models()[0] #First one will be the best model based on validation score.

print(best_model)
print(best_model.metrics['AUC']['validation'])
print(best_model.metrics['AUC']['crossValidation'])

Model('Advanced AVG Blender')
0.70851
0.7043820000000001


### 4. Run models with `Keras` within their `model_type` (64 `sample_pct`). [15min]
Run the first 3 available Keras blueprints.

**Hint**: To see models that are in the repository, call the method `get_blueprints` on the DataRobot Project object.
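One detail worth getting right is the order of operations: filter the repository by keyword first, and only then keep the first three matches. A minimal sketch of that selection step follows. `first_matching_blueprints` is a hypothetical helper, and the stand-in objects carry made-up model types; in the notebook the list would come from `project.get_blueprints()`.

```python
from types import SimpleNamespace


def first_matching_blueprints(blueprints, keyword, limit=3):
    """Return up to `limit` blueprints whose model_type contains `keyword`."""
    matches = [bp for bp in blueprints if keyword in bp.model_type]
    return matches[:limit]


# Stand-in objects that only mimic the `model_type` attribute.
fake_blueprints = [
    SimpleNamespace(model_type='Elastic-Net Classifier'),
    SimpleNamespace(model_type='Keras Slim Residual Network'),
    SimpleNamespace(model_type='Random Forest Classifier'),
    SimpleNamespace(model_type='Keras Deep Network'),
]

selected = first_matching_blueprints(fake_blueprints, 'Keras')
print([bp.model_type for bp in selected])
```

Slicing the repository before filtering would instead return only the Keras blueprints that happen to sit among the first three entries, which may be none.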

In [12]:
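#Proposed Solution
#Filter the repository for Keras blueprints first, then keep the first 3.
keras_blueprints = [bp for bp in project.get_blueprints() if 'Keras' in bp.model_type][:3]

for blueprint in keras_blueprints:
    project.train(blueprint, sample_pct=64)

#Wait until no jobs remain in the project's queue.
while len(project.get_all_jobs()) > 0:
    time.sleep(5)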

### 5. Check whether you created a model with a better validation score. [10min]

**Hint**: You will have to ask DataRobot to send you the latest models again to see which is the current best model.

In [14]:
#Proposed Solution
best_model = project.get_models()[0] #First one will be the best model based on validation score.

print(best_model)
print(best_model.metrics['AUC']['validation'])
print(best_model.metrics['AUC']['crossValidation'])

Model('Advanced AVG Blender')
0.70851
0.7043820000000001


### 6. Sort all models by cross validation score. [15min]

**Hint 1**: Use the `search_params` parameter of the `get_models` method to retrieve the models you are looking for.

**Hint 2**: Cross validation will not be calculated for all of the models. You can either calculate cross validation for every model or, to save time, use the `crossValidation` score only for the models where it is already available.
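The second option in Hint 2 can be sketched without pandas. The example below is an illustration under stated assumptions: `rank_by_cross_validation` is a hypothetical helper, the stand-in models and their scores are made up, and the only thing assumed about a real model is the `metrics[metric]['crossValidation']` layout used elsewhere in this notebook. Models whose cross-validation score is missing are skipped rather than sorted.

```python
from types import SimpleNamespace


def rank_by_cross_validation(models, metric='AUC'):
    """Return (model_type, score) pairs sorted from highest to lowest score.

    Models without a cross-validation score for `metric` are left out.
    """
    scored = []
    for model in models:
        score = model.metrics.get(metric, {}).get('crossValidation')
        if score is not None:
            scored.append((model.model_type, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in models with made-up scores; real ones would come from project.get_models().
fake_models = [
    SimpleNamespace(model_type='Model A', metrics={'AUC': {'crossValidation': 0.69}}),
    SimpleNamespace(model_type='Model B', metrics={'AUC': {'crossValidation': None}}),
    SimpleNamespace(model_type='Model C', metrics={'AUC': {'crossValidation': 0.70}}),
]

for model_type, score in rank_by_cross_validation(fake_models):
    print(model_type, score)
```

Sorting in descending order suits AUC, where higher is better; for an error metric such as LogLoss the order would need to be reversed.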

In [18]:
# Proposed Solution

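#Get all models
models = project.get_models()

#Collect the cross validation score of every model that has one.
#A list of rows is used instead of a dictionary keyed by model_type,
#so models that share a model type do not overwrite each other.
results = []

for model in models:
    cv_score = model.metrics['AUC']['crossValidation']
    if cv_score is not None:
        results.append({'model_type': model.model_type, 'crossValidationAucScore': cv_score})

results_df = pd.DataFrame(results, columns=['model_type', 'crossValidationAucScore'])
results_df.sort_values(by='crossValidationAucScore', ascending=False).head()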

### 7. Retrieve a specific model and change a specific hyperparameter.

**Instructions**:

Find a model with the below characteristics:

- Model Type = `eXtreme Gradient Boosted Trees Classifier with Early Stopping`
- Feature list = `Informative Features`

Tune:

- Change `learning_rate` to 0.08

**Hint**: There is a script in [Samples](https://github.com/datarobot-community/examples-for-data-scientists) that can help you with hyperparameter tuning.
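Before setting a hyperparameter you need to know which task of the blueprint exposes it. Rather than relying on a task's position in the list, one option is to search every task for the parameter name. The sketch below assumes only the two session methods used in this notebook, `get_task_names()` and `get_parameter_names(task)`. `find_tasks_with_parameter` and `FakeTuningSession` are hypothetical names, and the task names are made up.

```python
def find_tasks_with_parameter(session, parameter_name):
    """Return the names of all tasks that expose `parameter_name`."""
    return [
        task
        for task in session.get_task_names()
        if parameter_name in session.get_parameter_names(task)
    ]


class FakeTuningSession:
    """Stand-in with made-up task names, used only to exercise the helper."""

    def __init__(self, parameters_by_task):
        self._parameters_by_task = parameters_by_task

    def get_task_names(self):
        return list(self._parameters_by_task)

    def get_parameter_names(self, task_name):
        return self._parameters_by_task[task_name]


session = FakeTuningSession({
    'Imputation task': ['threshold'],
    'Boosted trees task': ['learning_rate', 'max_depth'],
})
print(find_tasks_with_parameter(session, 'learning_rate'))
```

With a real tuning session, the helper would be called on the object returned by `start_advanced_tuning_session()`. If it returns more than one task, you still have to decide which one to tune.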

In [31]:
# Proposed Solution

for model in models:
    if (model.model_type == 'eXtreme Gradient Boosted Trees Classifier with Early Stopping'
            and model.featurelist_name == 'Informative Features'):
        model_to_change = model
        break

#Start tuning procedure
tune = model_to_change.start_advanced_tuning_session()

#Identify the task that this hyperparameter belongs to ('Gradient Boosted Greedy Trees Classifier with Early Stopping')
task = tune.get_task_names()[2]

#List all of the parameters for the specific task
print(tune.get_parameter_names(task))

#Now that we know the name of the parameter we want to change is `learning_rate`,
#we can use the set_parameter method together with the run method to start modeling.
tune.set_parameter(
    task_name=task,
    parameter_name='learning_rate',
    value=0.08)

#Initiate tuning and get job
job = tune.run()

#Get model
new_model = job.get_result_when_complete(max_wait=10000)
print(new_model.metrics['AUC']['validation'])

['base_margin_initialize', 'colsample_bylevel', 'colsample_bytree', 'interval', 'learning_rate', 'max_bin', 'max_delta_step', 'max_depth', 'min_child_weight', 'min_split_loss', 'missing_value', 'n_estimators', 'num_parallel_tree', 'random_state', 'reg_alpha', 'reg_lambda', 'scale_pos_weight', 'smooth_interval', 'subsample', 'tree_method']
0.71061
