<a href="https://colab.research.google.com/github/datarobot-community/DRU-MLOps/blob/master/10Jun2022 - API Python I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introductions

1. Any preferred name you'd like to go by?
2. What experience do you have with Python?
3. What experience do you already have with using the DataRobot Python package?

# What to expect

- The class is broken into two sessions each covering one half-day block.
    - Each session consists of two modules.
    - Each module should run around 90 minutes, including a 10-minute break near the top of each hour.
    - Within each module, multiple topics will be covered.
- While the specific timings of each module may differ, overall we will plan to follow this format:
    - Motivation for the topic covered will first be given.
    - We'll examine how this is done in Python using the datarobot Python package and other packages like `pandas`, `matplotlib`, etc. You'll write the code as it is presented to help you start to become fluent in the syntax.
    - Then you'll practice what you've just learned on a different problem/context by programming yourself.
        - You can decide whether you'd like to try the beginner, intermediate, or advanced option.
        - Ideally, you try all three!
    - We'll go over ways to solve the particular problem and discuss solutions.
    - We'll then head into covering the next topic in the module.
- Overall, we want you to be able to practice your learning immediately to check for understanding and get help from us as needed to get things working.

# Agenda

Session A
- [Module 1.1 - Preliminaries](#Module-1.1:-Preliminaries)
- [Module 1.2 - Model interpretation basics](#Module-1.2:-Model-interpretation-basics)

Session B
- [Module 2.1 - Further model interpretation and understanding](#Module-2.1:-Further-model-interpretation-and-understanding)
- [Module 2.2 - Advanced feature selection](#Module-2.2:-Advanced-feature-selection)

A few themes run through this class. First, we'll go over how to connect Python to the DataRobot client. Then we'll see how to make a DataRobot project. Inside that project, we'll look into featurelists and blueprints before building models and then evaluating and understanding them with the help of some visualizations. These main ideas are also covered in the DataRobot Python cheatsheet included in the course ZIP file. We don't go over everything on that sheet in this class, but it can be a useful resource. Now that we've talked about the general concepts of the course, I'd like to cover the main expectations.

## Learning objectives

By the end of this mission, you will be able to:

- Connect to the DataRobot client using an API key
- Create a project in DataRobot programmatically
- Set the target feature for a DataRobot project
- Download and review DataRobot created featurelists
- Extract and customize the features of a featurelist
- Review the properties of a DataRobot project Repository
- Build a new model from a blueprint and custom featurelist
- Start autopilot with the maximum number of workers


- List all models trained during autopilot
- Create a custom function in Python to extract Leaderboard results
- Get training predictions for a model
- Create a custom lift chart to aggregate predicted and actual results
- Retrieve Feature Impact for top performing models
- Build a model with reduced features based on Feature Impact
- Make predictions on scoring data

Overall, the purpose of this class is to show ways to access DataRobot programmatically. These examples may not be the best use cases for you directly, but our goal is for you to be able to apply what you learn here to your own projects.

---

## Let's begin by uploading a few resources we will need into the Colab environment:

1. Training dataset: **shot_logs_wed.csv**
2. Scoring dataset: **shot_logs_sat.csv**
3. Configuration file: **drconfig.yaml**
4. Requirements file: **colab_requirements.txt**

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
!ls

Now we will install the Python modules we need.

In [None]:
!pip install -r colab_requirements.txt -q

---

# SESSION A: DataRobot Modeling Basics with the API

### Connect to the DataRobot client using the API key

To use Python with DataRobot you first need to establish a connection between your machine and the DataRobot instance. The recommended way to do that is by creating a `.yaml` file with your credentials. Here, we have chosen the filename `drconfig.yaml`; we have already placed this file in the Colab environment.

The `.yaml` file is basically a text file containing two lines:

`token: "YOUR_API_TOKEN"`  
`endpoint: "YOUR_HOSTNAME"`

For this class we will be using `endpoint: https://app.datarobot.com/api/v2`

In [None]:
# Show the YAML file contents after updating
# Not necessary but a nice check
!cat drconfig.yaml

In [None]:
# import the datarobot package
import datarobot as dr

In [None]:
# connect to datarobot using your saved credentials
dr.Client(config_path = 'drconfig.yaml')

In [None]:
# Another option (not needed if running the previous cell works)
# Fill in the token argument with your created token
dr.Client(token = "YOUR_API_TOKEN", endpoint = "https://app.datarobot.com/api/v2")

### Explore data using `pandas`

In [None]:
# load pandas
import pandas as pd

To make it easier to follow the difference between user-defined names and API-defined functionality, we will use shorter **camelCase** notation for the former and **snake_case** for the latter.
So, **userDef** vs **api_defined**.

In [None]:
# load data from text file
shotData = pd.read_csv('shot_logs_wed.csv')

# view first few rows of DataFrame
shotData.head()

### Question A

- What is the Unit of Analysis for this DataFrame?

ANSWER: A shot taken (by an NBA player)

### Question B

- Which feature would be a good target for us to use for modeling?

### Start preliminary modeling steps


There are a few different ways to get a project started with the Python package. One way is to use the `datarobot.Project.create()` method. You can specify the `sourcedata` as a URL, as a file, or as a Python DataFrame as we will do here. Let's also import the `date` class from the `datetime` module to get today's date.

In [None]:
# Create name of project
from datetime import date
shotProjectName = 'NBA Shots ' + date.today().strftime(format = "%Y-%m-%d")
shotProjectName

In [None]:
# Create a project in DataRobot
shotProject = ___.___.___(
    sourcedata = ___,
    project_name = ___
)

Go to <https://app.datarobot.com/manage-projects> and find the "NBA Shots" project that was just created. Click on it and then go to the **Data** tab. Many of the things you've done in the app, like starting a new project, are also available via the API! You can also keep the app open to check for different tasks being started programmatically.

### Question C


- What error is caused by uncommenting and running the above code? Why?

In [None]:
# Show available optimization metrics
shotProject.___(feature_name = '___') #['available_metrics']

Now that we have the data uploaded and we've examined potential metrics, we can get the second round of exploratory data analysis and preliminary modeling steps kicked off using the `.set_target()` method. We'll use manual mode for now with the `mode` argument below. We'll kick off autopilot at the end of this Session A. 

We also set the `positive_class` to `"made"` here so that we can interpret the model predictions to be in terms of the chance of a shot being made instead of the chance of a shot being missed.

In [None]:
# Prepare for modeling in manual mode
shotProject.___(
    target = "___",
    metric = "___",
    positive_class = "___",
    mode = "___"
)

### Getting help


In addition to the API package documentation you saw before, remember to use `?` too. This will open up the docstring and signature for a function/method in its own pane in Jupyter.

In [None]:
?dr.Project.create

In [None]:
?shotProject.set_target

You can also investigate the code for the function/method by using two question marks:

In [None]:
??dr.Project.create

You can also access help by pressing Shift + Tab twice on your keyboard on a particular line of code in Jupyter. In other words, hold down the Shift key and then press Tab twice. Try this Shift + Tab sequence on the next line.

In [None]:
dr.Project.

In Jupyter you can also bring up many potential options for methods/attributes of an object by pressing the Tab key after you enter a period. Press Tab on the next line (and wait a few seconds) to see what's available with the datarobot package.

In [None]:
dr.

So that we can see all of the output a particular Jupyter cell will produce and not just the last output, we can change the following setting. This will be particularly helpful as you complete your exercises.
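The setting itself is not shown in this copy of the notebook. One common way to get this behavior, assuming the standard IPython kernel that Colab and Jupyter use, is the `ast_node_interactivity` option. The sketch below is guarded so that it also runs outside a notebook; the `setting` variable is only there to show the result.

```python
# Assumption: the notebook runs on an IPython kernel (true for Colab and Jupyter).
try:
    from IPython.core.interactiveshell import InteractiveShell

    # Display the value of every top-level expression in a cell,
    # not only the last one (the default value is "last_expr").
    InteractiveShell.ast_node_interactivity = "all"
    setting = InteractiveShell.ast_node_interactivity
except ImportError:
    # Outside IPython there is nothing to configure.
    setting = None
```

To go back to the default behavior, set the option to `"last_expr"` again.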

### YOUR TASK (Exercise 1)

### Download and review DataRobot created featurelists

In [None]:
# Get a list of all current featurelists for the `shotProject`
shotFeaturelists = shotProject.___()
shotFeaturelists

Note that `Informative Features - Leakage Removed` has been created here. If you haven't noticed this already, there are two features that are detected as target leakage if the target is `SHOT_RESULT`.

QUESTION: Look in https://app.datarobot.com at the `shotProject` to identify them. Which two are they?

### Extract the features of a featurelist

In [None]:
# Extract name and features of each feature list as a dictionary
featurelistsDict = {featurelist.name: featurelist.features 
                      for featurelist in shotFeaturelists}

# Examine this dictionary
featurelistsDict

We will be building a featurelist in this part of the class and so we'd like to be able to extract the features that correspond to a particular featurelist. We could choose a different attribute here to have as the value corresponding to the key of `name`.
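As a small sketch of choosing a different attribute, the same comprehension pattern can map each featurelist name to its `id` instead of its features. The `FakeFeaturelist` class and its values below are hypothetical stand-ins so the example runs without a DataRobot connection.

```python
from collections import namedtuple

# Hypothetical stand-in for the featurelist objects returned by the API;
# only the attributes used below (id, name, features) are modeled.
FakeFeaturelist = namedtuple("FakeFeaturelist", ["id", "name", "features"])

toyFeaturelists = [
    FakeFeaturelist("fl-001", "Raw Features", ["SHOT_DIST", "FGM", "PTS_TYPE"]),
    FakeFeaturelist("fl-002", "Informative Features", ["SHOT_DIST", "FGM"]),
]

# Same comprehension pattern, but mapping each name to its id
nameToId = {featurelist.name: featurelist.id for featurelist in toyFeaturelists}
nameToId
```

A name-to-id lookup like this can be handy later, since several API calls take a featurelist id rather than a name.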

In [None]:
leakageRemovedFeatures = featurelistsDict.get(
    "Informative Features - Leakage Removed"
)
leakageRemovedFeatures

We now have a list in Python laying out the names of the features in the Informative Features - Leakage Removed featurelist in DataRobot.

If the overall goal is to predict if a shot in a game will be made or missed, there are a couple other features in our data currently that we won't know at the time of prediction and are thus target leakage. Can you identify which ones they are?

In [None]:
leakageRemovedFeatures.remove('___')
leakageRemovedFeatures.remove('___')
leakageRemovedFeatures

In [None]:
# Create new featurelist in DataRobot
modFeatlist = shotProject.___(
    name = 'Modified Informative Features - Leakage Removed', 
    features = ___
)
modFeatlist

### YOUR TASK (Exercise 2)

___

## Module 1.2: Model interpretation basics

## Learning objectives

By the end of this mission, you will be able to:

- ~~Connect to the DataRobot client using an API key~~
- ~~Create a project in DataRobot programmatically~~
- ~~Set the target feature for a DataRobot project~~
- ~~Download and review DataRobot created featurelists~~
- ~~Extract and customize the features of a featurelist~~
- Review the properties of a DataRobot project Repository
- Build a new model from a blueprint and custom featurelist
- Start autopilot with the maximum number of workers


- List all models trained during autopilot
- Create a custom function in Python to extract Leaderboard results
- Get training predictions for a model
- Create a custom lift chart to aggregate predicted and actual results
- Retrieve Feature Impact for top performing models
- Build a model with reduced features based on Feature Impact
- Make predictions on scoring data

### Review properties of the Repository

Recall that the Repository contains candidate blueprints corresponding to what DataRobot has determined to be recipes that should create high performing models when combined with data.

In [None]:
# Retrieve all blueprints for the NBA shots project
shotProjectBlueprints = shotProject.___()

In [None]:
# See what a few model algorithm names are for blueprints
shotProjectBlueprints[4:8]

In [None]:
# Review the structure of the Blueprint class
?dr.Blueprint

In [None]:
# Create DataFrame of some information about blueprints
bpDF = pd.DataFrame(
    [(blueprint.id, blueprint.model_type, blueprint.processes) 
     for blueprint in shotProjectBlueprints],
    columns = ["blueprint_id", "model_type", "processes"]
)
bpDF

In [None]:
# Filter to focus only on the Regularized Logistic Regression blueprints
rlrBpDF = bpDF[bpDF['model_type'].str.contains("Regularized Logistic")]
rlrBpDF

For the sake of class time, let's just go with the blueprint in the last row here since its preprocessing steps look relatively simple. We are almost to building a model with this blueprint and to do so we'll need to keep track of the `blueprint_id`:

In [None]:
# Get blueprint_id from the last row
# (iloc is "index locator")
bpIdToBuild = rlrBpDF.iloc[___].___
bpIdToBuild 

In [None]:
# Review the documentation of the `.train()` method.
?shotProject.train

Notice that `featurelist_id` is an argument to `.train()`:

In [None]:
# Show the id of the newly created Featurelist
___

In [None]:
# Build a model using the last blueprint given above
# and the Modified Informative Features - Leakage Removed 
# featurelist we created
# Note that this will take a few seconds to complete
rlrModelJobId = shotProject.___(
    trainable = ___,
    featurelist_id = ___,
    sample_pct = 64,
    scoring_type = "validation"
)
rlrModelJobId

In [None]:
# Get the Job object from the job ID
rlrModelJob = dr.models.Job.get(
    project_id = shotProject.id, 
    job_id = ___
)
rlrModelJob

In [None]:
# Retrieve model when finished building
rlrModel = ___.get_result_when_complete()

Note that what is returned from `.train()` is a model job id. On the Leaderboard, this corresponds to the number after the M. We've only built one model here, but if we went to the Leaderboard right now we would see the number above next to the M for this Regularized Logistic Regression model.

### YOUR TASK (Exercise 3)

Now that we have a model trained and validated, we can start to evaluate the fit of the model and understand which features are most impactful to the model. We'll start by looking at the Feature Impact for this model.

### Download Feature Impact data

In [None]:
# Request Feature Impact be calculated and then download it
# If Feature Impact has already been calculated for a model, this
# function will just get it too
# Note this may take a bit to compute
fi = ___

# Save Feature Impact in pandas DataFrame
fiDF = pd.DataFrame(fi)

# View the Feature Impact for this model
fiDF

We also get `impactUnnormalized` here, which can be useful in seeing the absolute Feature Impact of a particular feature instead of relative to the top one.
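To make the relationship between the two columns concrete, here is a minimal sketch with made-up numbers. It assumes, as we understand it, that the normalized value is the unnormalized value divided by that of the top feature; `feature_b` and `feature_c` are hypothetical names.

```python
# Illustrative numbers only, not output from a real model
toyImpact = [
    {"featureName": "SHOT_DIST", "impactUnnormalized": 0.5},
    {"featureName": "feature_b", "impactUnnormalized": 0.25},
    {"featureName": "feature_c", "impactUnnormalized": 0.125},
]

# Assumption: normalized impact is relative to the most impactful feature
topImpact = max(row["impactUnnormalized"] for row in toyImpact)
for row in toyImpact:
    row["impactNormalized"] = row["impactUnnormalized"] / topImpact

[(row["featureName"], row["impactNormalized"]) for row in toyImpact]
```

The top feature always ends up at 1.0 on the normalized scale, so the unnormalized column is the one to use when comparing absolute impact across models.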

### YOUR TASK (Exercise 4)

### Kicking off autopilot

To help us prepare for the content in Session B, let's start autopilot on this NBA Shots project. We'll continue reviewing many of the model interpretation tools in Session B and explore how to look at the Leaderboard as a pandas DataFrame as well.

In [None]:
# Start autopilot using the Featurelist we created
shotProject.___(
    featurelist_id = ___
)

In [None]:
# Set number of workers
shotProject.set_worker_count(
    worker_count = -1
)

QUESTION: What do you think `worker_count = -1` specifies for the number of workers?

**Important note**

It is best practice, after you kick off modeling with Quick mode or Autopilot, to also run the `.wait_for_autopilot()` method. This ensures that tasks that depend on modeling being finished aren't run until the chosen autopilot mode has completed. Run the code below AFTER you've started the model building.

Running `.wait_for_autopilot()` will also block any other Python commands from being run so use it cautiously. Oftentimes it's best to do it just before you take a break.

In [None]:
shotProject.wait_for_autopilot()

### YOUR TASK (Exercise 5)

___

___

# SESSION B: DataRobot Model Interpretation and Understanding with the API

## Learning objectives

By the end of this mission, you will be able to:

- ~~Connect to the DataRobot client using an API key~~
- ~~Create a project in DataRobot programmatically~~
- ~~Set the target feature for a DataRobot project~~
- ~~Download and review DataRobot created featurelists~~
- ~~Extract and customize the features of a featurelist~~
- ~~Review the properties of a DataRobot project Repository~~
- ~~Build a new model from a blueprint and custom featurelist~~
- ~~Start autopilot with the maximum number of workers~~


- List all models trained during autopilot
- Create a custom function in Python to extract Leaderboard results
- Get training predictions for a model
- Create a custom lift chart to aggregate predicted and actual results
- Retrieve Feature Impact for top performing models
- Build a model with reduced features based on Feature Impact
- Make predictions on scoring data

### Resuming

Let's make sure we are back at where we left off at the end of Session A.

## Let's upload resources into the Colab environment:

1. Training dataset: **shot_logs_wed.csv**
2. Scoring dataset: **shot_logs_sat.csv**
3. Configuration file: **drconfig.yaml**
4. Requirements file: **colab_requirements.txt**

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
!ls

Let's install the Python modules we need.

In [None]:
!pip install -r colab_requirements.txt -q

In [None]:
# re-import packages
import datarobot as dr
import pandas as pd
from datetime import date

In [None]:
# reconnect to dr
dr.Client(config_path = 'drconfig.yaml')

You will need a way to get the Project id from DataRobot to continue your progress. One way to retrieve the Project id is by finding your current project at https://app.datarobot.com under Manage Projects. If you examine the URL now, you can get the Project id.

`https://app.datarobot.com/projects/<PROJECT ID>/`
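If you'd rather not copy the id by hand, a small helper can pull it out of a pasted URL. This is only a sketch: `projectIdFromUrl` is a hypothetical helper, the id below is made up, and it assumes the URL follows the pattern shown above.

```python
from urllib.parse import urlparse

def projectIdFromUrl(url):
    """Return the path segment that follows 'projects' in an app URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    return parts[parts.index("projects") + 1]

# Hypothetical id, for illustration only
exampleUrl = "https://app.datarobot.com/projects/0123456789abcdef01234567/models"
exampleProjectId = projectIdFromUrl(exampleUrl)
exampleProjectId
```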

In [None]:
shotProject = dr.Project.get(
    project_id = "___"
)

## Module 2.1: Further model interpretation and understanding 

### Explore models that have been built in Autopilot

We concluded Session A by running autopilot to build out many different models for this NBA Shots project. Let's now start to investigate these models.

In [None]:
# Get list of trained models
shotProjectModels = shotProject.___()

len(shotProjectModels)

In [None]:
# Pick an arbitrary model, e.g., the 7th model (index 6)
someModel = shotProjectModels[___]

In [None]:
# What's the name of this model?
someModel.___

In [None]:
# What are the available metric scores for this model?
someModel.___

In [None]:
# What is the optimization metric used for this project?
shotProject.___

In [None]:
# How do we get a specific model's metrics for the current project metric?
someModel.___[shotProject.___]

In [None]:
# What is the name of the featurelist for this model?
someModel.___

In [None]:
# What percentage of the data was used for training with this model?
someModel.___

This is fine for one model, but will be harder to analyze for comparing multiple models. Let's build a function to extract the Leaderboard as a DataFrame!

### Create a custom function to extract Leaderboard results

In [None]:
def retrieveLeaderboard(project, metric = None):
    
    # Set default metric if not passed
    if metric is None:
        metric = project.metric
        
    # Create an empty list to store the leaderboard to start
    leaderboard = []
    
    # Get all of the models trained so far as we did above
    models = project.___()
    
    # Iterate over each of the models extracting different
    # pieces of relevant information as we did above
    for model in models:
        
        # Store the results for a specific metric
        temp = model.metrics[metric]
        
        # Store the name of the model algorithm
        temp["algorithm"] = model.model_type
        
        # Store the id of the model
        temp["model_id"] = model.___
        
        # Store the feature list used to create the model
        temp["featurelist"] = model.___
        
        # Store what % was used for training
        temp["sample_pct"] = model.sample_pct
        
        # Append this list to leaderboard and move to next model
        leaderboard.append(temp)
        
    # Store leaderboard list as a pandas DataFrame    
    leaderboardDf = pd.DataFrame(leaderboard)[
        ["algorithm", "model_id", "featurelist", "sample_pct", 
         "validation", "crossValidation", "holdout"]
    ]
    
    # Return this leaderboard to explore further
    return ___  

To make the resulting table a little easier to read, we can set a pandas option to show five decimal places and extend the maximum column width.
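To see what the float format string does before handing it to pandas, we can apply it to a couple of plain Python floats. The variable name `formatFloat` is just for this illustration.

```python
# The same kind of format string that pandas accepts for float display
formatFloat = '{:,.5f}'.format

small = formatFloat(0.6931471805599453)   # '0.69315'
large = formatFloat(1234.5)               # '1,234.50000'
small, large
```

The `,` adds a thousands separator and `.5f` fixes the output at five digits after the decimal point.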

In [None]:
# Change pandas display format for floating point numbers
pd.options.display.float_format = '{:,.5f}'.format

In [None]:
# Increase maximum column width
pd.set_option('display.max_colwidth', 70)

### Identify key pieces of information about the Leaderboard

In [None]:
shotLb = ___(project = ___)
shotLb

In [None]:
# To report scores for a different metric than the default (for this project, LogLoss)
retrieveLeaderboard(shotProject, '___')

Since the 80% and 100% models use different techniques to calculate approximations for validation and cross-validation scores, let's turn our focus only to models using 64% or smaller for training sample size and examine the top performing models there. For the sake of computation time throughout class, let's also remove Blender models going forward. This also gives us a chance to play a bit with pandas DataFrames.
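Here is a minimal sketch of combining conditions on a pandas DataFrame, using a toy table with made-up rows and hypothetical column names (`model`, `trainPct`) rather than the real Leaderboard.

```python
import pandas as pd

# Made-up rows, for illustration only
toyLb = pd.DataFrame({
    "model": ["AVG Blender", "Light Gradient Boosting", "Elastic-Net Classifier"],
    "trainPct": [64.0, 80.0, 64.0],
})

# Building each mask separately avoids precedence surprises:
# & binds more tightly than <=, so the one-line form needs parentheses
isSmallSample = toyLb["trainPct"] <= 64
isBlender = toyLb["model"].str.contains("Blender")

toyFiltered = toyLb[isSmallSample & ~isBlender]
toyFiltered
```

Only the row that satisfies both conditions (small sample, not a blender) survives the filter.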

In [None]:
# Note the use of:
# & for "and" (use | for "or") 
# ~ (the invert operator) for "not"
shot64 = shotLb[
    (shotLb.___ <= 64) &  ~(shotLb.___.str.contains("___"))
]
shot64

Next, we can get the model ID of the top model and then retrieve it using the `.get()` method.

In [None]:
top64ModelId = ___.___.___
top64ModelId

In [None]:
top64Model = ___.___.___(
    project = ___,
    model_id = ___
)
top64Model

### YOUR TASK (Exercise 6)

### Download validation predictions

While there are functions available in the datarobot Python package to download the data to create many different model interpretation plots you see in DataRobot such as `get_lift_chart()`, `get_confusion_chart()`, etc., we can also create our own custom plots that can directly relate to our business use case. To do so, we'll need to use the `.request_training_predictions()` method with `.get_result_when_complete()` and specify which subset of our historical data we'd like predictions for. We'll focus on downloading predictions from the validation partition but there are other options too.

In [None]:
?top64Model.request_training_predictions

In [None]:
trainPredJob = top64Model.request_training_predictions(
    data_subset = 'validationAndHoldout'
)

In [None]:
top64ModelPreds = ___.get_result_when_complete()
top64ModelPreds

In [None]:
top64PredsDF = ___.___()

In [None]:
top64PredsDF

Now that we have both the Validation and Holdout predictions for each row, let's focus on the subset of this data we are most interested in.

In [None]:
validationPreds = top64PredsDF[top64PredsDF.partition_id != "Holdout"][["row_id", "class_made"]]
validationPreds

In [None]:
len(validationPreds.index)

This value of 842 rows can be used as a check for when we filter `shotData` to focus only on the `row_id` values given above.

### (Re)load actual shot results

In [None]:
# Load data again from text file (using row id as index)
shotData = pd.read_csv('shot_logs_wed.csv')
# Focus on columns of interest
shotDataSmall = shotData[["SHOT_DIST", "FGM"]]
shotDataSmall

### Join prediction and actual data together

Now we are ready to link our two tables together so that we can compare the actual values (`FGM`, the numeric counterpart of `SHOT_RESULT`) with the predicted values for each row in the validation set.
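The join used here matches the index of one table against a column of the other. A toy sketch of that pattern, with made-up values and hypothetical column names (`made`, `rowKey`, `prob`):

```python
import pandas as pd

# Made-up values, for illustration only.
# The default index 0..3 of toyActuals plays the role of the row id.
toyActuals = pd.DataFrame({"made": [1, 0, 1, 1]})
toyPreds = pd.DataFrame({"rowKey": [3, 1], "prob": [0.40, 0.55]})

toyJoined = pd.merge(
    left = toyActuals,
    right = toyPreds,
    left_index = True,     # match the left table's index ...
    right_on = "rowKey"    # ... against this column of the right table
)
toyJoined
```

Because `pd.merge` performs an inner join by default, only the rows that have a prediction are kept, which is why the row count of the result is a useful check.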

In [None]:
predAndAct = pd.merge(
    left = ___,
    right = ___,
    left_index = True,
    right_on = "row_id"
)
predAndAct

### YOUR TASK (Exercise 7)

### Create binned version of shot distance

Let's use our knowledge (maybe newfound!) of the NBA basketball court and types of shots to understand how the percentage of made shots changes as we bin the values of `SHOT_DIST` into categories that we define.

Next we can create some bins and give them names corresponding to different shot distances. 
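As a quick sketch of how `pd.cut` assigns values to bins, here are a few made-up distances run through the same edges and labels used in this section.

```python
import pandas as pd

# Made-up shot distances (in feet), for illustration only
toyDist = pd.Series([2.5, 4.0, 10.0, 23.75, 26.0])

toyCats = pd.cut(
    x = toyDist,
    bins = [0, 4, 15, 23.75, 30, 94],
    labels = ["Layup 2", "Mid-range 2", "Long-range 2", "3", "4"]
)
list(toyCats)
```

By default each bin includes its right edge but not its left edge, so 4.0 lands in "Layup 2" and 23.75 in "Long-range 2". A distance of exactly 0 would fall outside the first bin unless `include_lowest = True` is passed.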

In [None]:
shotLabels = ["Layup 2", "Mid-range 2", "Long-range 2", "3", "4"]
cutRanges = [0, 4, 15, 23.75, 30, 94]

In [None]:
predAndAct["shotRangeCat"] = pd.cut(
    x = predAndAct["___"],
    bins = ___,
    labels = ___
)
predAndAct

### Find average predicted and actual values across each bin

Now we are ready to compute the average predicted and actual values across each shot range.
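The aggregation below uses pandas "named aggregation", where each output column is written as `new_name = (source_column, function)`. A toy sketch with made-up values and hypothetical column names (`zone`, `prob`, `made`):

```python
import pandas as pd

# Made-up values, for illustration only
toyShots = pd.DataFrame({
    "zone": ["near", "near", "far", "far"],
    "prob": [0.75, 0.25, 0.5, 0.25],
    "made": [1, 0, 0, 1],
})

# Each keyword becomes a column in the result
toyBinned = toyShots.groupby("zone").agg(
    avgProb = ("prob", "mean"),
    avgMade = ("made", "mean"),
    n = ("made", "count")
)
toyBinned
```

Keeping a count column next to the averages helps flag bins with too few rows to trust.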

In [None]:
binned = predAndAct.groupby("___").agg(
    predicted = ("___", "___"),
    actual = ("___", "mean"),
    count = ("FGM", "count")
)
binned

### Create visualization of model interpretation results

In [None]:
# Load matplotlib and increase size of plots
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

In [None]:
plt.plot(
    ___,
    ___,
    '+-'
);

In [None]:
plt.plot(
    binned.index,
    binned.predicted,
    '+-'
)
plt.plot(
    binned.index,
    binned.___,
    '___',
    fillstyle = 'none'
);

In [None]:
plt.plot(
    binned.index,
    binned.predicted,
    '+-'
)
plt.plot(
    binned.index,
    binned.___,
    'o-',
    fillstyle = 'none'
)
plt.legend(labels = ['___', '___']);

### YOUR TASK (Exercise 8)

___

## Module 2.2: Advanced feature selection

By the end of this mission, you will be able to:

- ~~Connect to the DataRobot client using an API key~~
- ~~Create a project in DataRobot programmatically~~
- ~~Set the target feature for a DataRobot project~~
- ~~Download and review DataRobot created featurelists~~
- ~~Extract and customize the features of a featurelist~~
- ~~Review the properties of a DataRobot project Repository~~
- ~~Build a new model from a blueprint and custom featurelist~~
- ~~Start autopilot with the maximum number of workers~~



- ~~List all models trained during autopilot~~
- ~~Create a custom function in Python to extract Leaderboard results~~
- ~~Get training predictions for a model~~
- ~~Create a custom lift chart to aggregate predicted and actual results~~
- Retrieve Feature Impact for top performing models
- Build a model with reduced features based on Feature Impact
- Make predictions on scoring data

### Retrieve Feature Impact for top performing models

Feature Impact provides a way to see which features are most impactful for a particular model. A more advanced way of identifying which features might be most important to predicting the target of interest is to review Feature Impact across many different models. That will be the goal of this module.

Let's first retrieve the top three models from those built with 64% or less of the training data that are not blenders and were trained with our modified leakage-removed featurelist. These will be sorted based on validation score. This could be done by filtering the `shot64` DataFrame, but it's sometimes helpful to work directly with the model objects.

In [None]:
# Get all models
allModels = shotProject.get_models()

# Focus on top 3 models
filteredModels = [
    model 
    for model in allModels 
    if ("Blender" not in model.model_type) and
       (model.sample_pct < 65)
][0:3]
filteredModels

Next let's retrieve the Feature Impact scores and rankings for these three models.

In [None]:
# Create DataFrame to store results
allImpact = pd.DataFrame()

# Walk through a for loop iterating over each model
for model in ___:
    
    # This can take a minute (for each)
    featureImpact = model.___(max_wait = 600)
    
    # Ready to be converted to DF
    df = pd.___(featureImpact)
    # Track model name and ID for bookkeeping purposes
    df['model_type'] = model.model_type
    df['model_id'] = model.id
    
    # By sorting and re-indexing, the new index becomes our 'ranking'
    df = df.sort_values(by = 'impactUnnormalized', ascending = False)
    df = df.reset_index(drop = True)
    df['rank'] = df.index.values
    
    # Add to our master list of all models' feature ranks
    allImpact = pd.___([allImpact, df], ignore_index = True)

### YOUR TASK (Exercise 9)

### Build a new model with the top 10 features from across these models

In [None]:
# Examine allImpact DataFrame
allImpact

First let's determine the top 10 features from the collection of models in the `allImpact` DataFrame.
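Before doing this with pandas, here is the idea in plain Python on made-up data: collect each feature's rank from every model, take the median, and keep the features with the best (lowest) median rank. Using the median rank is one possible choice of aggregation, and `feature_b`, `feature_c`, and the model names are hypothetical.

```python
from statistics import median

# Made-up (model, feature, rank) rows; rank 0 is the most impactful feature
toyRanks = [
    ("model_a", "SHOT_DIST", 0), ("model_a", "feature_b", 1), ("model_a", "feature_c", 2),
    ("model_b", "SHOT_DIST", 0), ("model_b", "feature_c", 1), ("model_b", "feature_b", 2),
    ("model_c", "feature_b", 0), ("model_c", "SHOT_DIST", 1), ("model_c", "feature_c", 2),
]

# Step 1: group the ranks by feature
ranksByFeature = {}
for _, feature, rank in toyRanks:
    ranksByFeature.setdefault(feature, []).append(rank)

# Step 2: aggregate with the median and sort, best (lowest) first
medianRank = {feature: median(ranks) for feature, ranks in ranksByFeature.items()}
ordered = sorted(medianRank, key = medianRank.get)

# Steps 3 and 4: keep only the names of the top N features (here, 2)
topFeats = ordered[:2]
topFeats
```

The median is less sensitive than the mean to one model ranking a feature unusually high or low.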

In [None]:
# Step 1: Rank impact across models with a grouping feature and an aggregation function
# Step 2: Sort the values with the most impactful at the top
# Step 3: Grab only the top X features (here, we'll use 10)
# Step 4: Pull just the values of the indexing column (the grouping feature)

allImpact#.groupby('___').median(numeric_only = True)#.sort_values('___')#.head(___)#.index.values

Then let's extract a list of these features from the `allImpact` DataFrame.

In [None]:
# Turn the array made by the steps above into a list
top10Feats = list(
    allImpact
    .groupby('___').median(numeric_only = True)
    .sort_values('___').head(___)
    .index.values
)
top10Feats

We can now create a featurelist from these features with the goal to see the performance of a newly created model on this featurelist.

In [None]:
top10Aggregated = ___.___(
    name = 'top 10 Aggregated', 
    features = ___
)

Let's see how a newly created model based on the `bestModel` in `filteredModels` performs on this `top 10 Aggregated` featurelist.

In [None]:
bestModel = filteredModels[0]
bestModel.___(
    featurelist_id = ___.___,
    scoring_type = "crossValidation"
)

After the new model finishes building with cross-validation, we can go to the Leaderboard in the app to compare the model built with the **top 10 Aggregated** featurelist to the similar one built with **Modified Informative Features - Leakage Removed**.

Alternatively, we can view the Leaderboard with our `retrieveLeaderboard()` function:

In [None]:
retrieveLeaderboard(___).head(___)

We see that the validation and cross-validation scores are similar. If we are concerned with speed of prediction more than just model performance, we may want to consider using this newly created model to make predictions going forward.

### YOUR TASK (Exercise 10)

### Make predictions on new data

The `shot_logs_sat.csv` file that you uploaded to the Colab environment can be used for making predictions. First, we load the dataset:

In [None]:
datasetFromPath = shotProject.___("shot_logs_sat.csv")

Let's go into https://app.datarobot.com to get the `model_id` from the URL of the newly created model from `bestModel` in `filteredModels` on the `top 10 Aggregated` featurelist. We'll make predictions using this model next.

In [None]:
prodModel = dr.Model.get(
    project = ___,
    model_id = "___"
)

Next, we set the prediction threshold to whichever value we feel comfortable saying is a made shot:
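To see what a threshold does to the output, here is a minimal sketch that turns made-up probabilities into labels. It assumes that a probability at or above the threshold is labeled as the positive class, and it uses `"missed"` as a stand-in name for the negative class.

```python
# Made-up predicted probabilities of the positive class ("made")
toyProbs = [0.25, 0.5, 0.75]
threshold = 0.5

# Assumption: at or above the threshold counts as the positive class
toyLabels = ["made" if prob >= threshold else "missed" for prob in toyProbs]
toyLabels
```

Raising the threshold makes the model more conservative about calling a shot made; lowering it does the opposite.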

In [None]:
___.set_prediction_threshold(___)

Lastly, we request predictions and get the results:

In [None]:
predictJob = prodModel.___(datasetFromPath.id)
predictions = predictJob.get_result_when_complete()
predictions

___

### YOUR TASK (Exercise 11)

___

### Learning objectives

By now, you should be able to:

- Connect to the DataRobot client using an API key
- Create a project in DataRobot programmatically
- Set the target feature for a DataRobot project
- Download and review DataRobot created featurelists
- Extract and customize the features of a featurelist
- Review the properties of a DataRobot project Repository
- Build a new model from a blueprint and custom featurelist
- Start autopilot with the maximum number of workers


- List all models trained during autopilot
- Create a custom function in Python to extract Leaderboard results
- Get training predictions for a model
- Create a custom lift chart to aggregate predicted and actual results
- Retrieve Feature Impact for top performing models
- Build a model with reduced features based on Feature Impact
- Make predictions on scoring data

# What's Next?

_Final project_ : Extend the idea of feature selection by working through this example on FIRE (Feature Importance Rank Ensembling) from [this](https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Feature%20Lists%20Manipulation/Python/FeatureSelection_using_Feature_Importance_Rank_Ensembling.ipynb) DataRobot Community Jupyter notebook on GitHub.

*Course Survey*: on university.datarobot.com [here](https://university.datarobot.com/advanced-datarobot-with-python/548199) or directly on SurveyMonkey [here](https://www.surveymonkey.com/r/APIIPython). Thanks for your feedback!

### Bonus: For NBA Shots - separating out Long 2s from Corner 3s

In [None]:
# For shots between 22 and 23.75, in order to separate long 2's from short 3's,
# we need to know what type of shot it is. So we need to add PTS_TYPE to shotDataSmall.

import numpy as np

#shot_data = pd.read_csv("shot_logs_wed.csv")
shotDataSmall = shotData[["SHOT_DIST", "FGM", "PTS_TYPE"]]

# Now we need to re-merge with the predictions to include PTS_TYPE
predAndAct = pd.merge(
    left = shotDataSmall,
    right = validationPreds,
    left_index = True,
    right_on = "row_id"
)

# Now split long-range 2s based on PTS_TYPE and create the corner 3
# (np.select uses the first condition that is True for each row, so order matters)
predAndAct["shortRangeCat"] = np.select(
    [
        predAndAct["SHOT_DIST"] <= 4,
        predAndAct["SHOT_DIST"] <= 15,
        (predAndAct["SHOT_DIST"] <= 23.75) & (predAndAct["PTS_TYPE"] == 2),
        (predAndAct["SHOT_DIST"] <= 23.75) & (predAndAct["PTS_TYPE"] == 3),
        predAndAct["SHOT_DIST"] <= 30,
        predAndAct["SHOT_DIST"] <= 94
    ],
    [
        "a Layup 2",
        "b Mid-range 2",
        "c Long-range 2",
        "d Corner 3",
        "e Standard 3",
        "f 4"
    ],
    default = "unknown"
)
predAndAct.head(20)                             