In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("../../data/credit-card-full.csv")
X = df[['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]
y = df[['SEX']].values.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Build Grid Search functions
### In data science it is a great idea to try building algorithms, models and processes 'from scratch' so you can really understand what is happening at a deeper level. Of course there are great packages and libraries for this work (and we will get to that very soon!) but building from scratch will give you a great edge in your data science work.

### In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise.

### You will have the X_train, X_test, y_train and y_test datasets available.

### Instructions
-    Build a function that takes two parameters called learning_rate and max_depth for the learning rate and maximum depth.
-    Add capability in the function to build a GBM model and fit it to the data with the input hyperparameters.
-    Have the function return the results of that model and the chosen hyperparameters (learning_rate and max_depth).

In [2]:
# Create the function
def gbm_grid_search(learning_rate, max_depth):

    # Create the model
    model = GradientBoostingClassifier(learning_rate=learning_rate, max_depth=max_depth)
    
    # Use the model to make predictions
    predictions = model.fit(X_train, y_train).predict(X_test)
    
    # Return the hyperparameters and score
    return([learning_rate, max_depth, accuracy_score(y_test, predictions)])

## Iteratively tune multiple hyperparameters
### In this exercise, you will build on the function you previously created to take in 2 hyperparameters, build a model and return the results. You will now use that to loop through some values and then extend this function and loop with another hyperparameter.

### The function gbm_grid_search(learning_rate, max_depth) is available in this exercise.

### If you need to remind yourself of the function, you can run the function print_func() that has been created for you.

### Instructions 1/3
-    Write a for-loop to test the values (0.01, 0.1, 0.5) for the learning_rate and (2, 4, 6) for the max_depth using the gbm_grid_search function you created, then print the results.

In [3]:
# Create the relevant lists
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]

# Create the for loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate,max_depth))

# Print the results
print(results_list)   

[[0.01, 2, 0.6004040404040404], [0.01, 4, 0.6139393939393939], [0.01, 6, 0.62], [0.1, 2, 0.6218181818181818], [0.1, 4, 0.6213131313131313], [0.1, 6, 0.6182828282828283], [0.5, 2, 0.6206060606060606], [0.5, 4, 0.6101010101010101], [0.5, 6, 0.5949494949494949]]
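
### A printed list of lists is hard to scan for the winner. As a minimal sketch (not part of the original exercise), the best combination can be pulled out with max, using the accuracy in the last position of each row as the key. The values below are copied from the output above.

```python
# Results copied from the loop output above: [learn_rate, max_depth, accuracy]
results_list = [
    [0.01, 2, 0.6004040404040404], [0.01, 4, 0.6139393939393939], [0.01, 6, 0.62],
    [0.1, 2, 0.6218181818181818], [0.1, 4, 0.6213131313131313], [0.1, 6, 0.6182828282828283],
    [0.5, 2, 0.6206060606060606], [0.5, 4, 0.6101010101010101], [0.5, 6, 0.5949494949494949],
]

# The accuracy is the last element of each row, so use it as the comparison key
best_learn_rate, best_max_depth, best_accuracy = max(results_list, key=lambda row: row[-1])
print(best_learn_rate, best_max_depth, round(best_accuracy, 4))
```

### Here the best accuracy comes from learn_rate=0.1 and max_depth=2, although several combinations are within a fraction of a percentage point of each other.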


### Instructions 2/3
-    Extend the gbm_grid_search function to include the hyperparameter subsample. Name this new function gbm_grid_search_extended.

In [4]:
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2,4,6]

# Extend the function input
def gbm_grid_search_extended(learn_rate, max_depth, subsample):

    # Extend the model creation section
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth, subsample=subsample)
    
    predictions = model.fit(X_train, y_train).predict(X_test)
    
    # Extend the return part
    return([learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)])       

### Instructions 3/3
-    Extend your loop to call gbm_grid_search_extended (available in your console), then test the values [0.4 , 0.6] for the subsample hyperparameter and print the results. max_depth_list & learn_rate_list are available in your environment.

In [5]:
results_list = []

# Create the new list to test
subsample_list = [0.4 , 0.6]

for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
    
        # Extend the for loop
        for subsample in subsample_list:

            # Extend the results to include the new hyperparameter
            results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))
            
# Print results
print(results_list)            

[[0.01, 2, 0.4, 0.6062626262626263], [0.01, 2, 0.6, 0.6025252525252526], [0.01, 4, 0.4, 0.6138383838383838], [0.01, 4, 0.6, 0.6135353535353535], [0.01, 6, 0.4, 0.6181818181818182], [0.01, 6, 0.6, 0.6194949494949495], [0.1, 2, 0.4, 0.6197979797979798], [0.1, 2, 0.6, 0.6188888888888889], [0.1, 4, 0.4, 0.6170707070707071], [0.1, 4, 0.6, 0.6176767676767677], [0.1, 6, 0.4, 0.6106060606060606], [0.1, 6, 0.6, 0.6129292929292929], [0.5, 2, 0.4, 0.6074747474747475], [0.5, 2, 0.6, 0.6097979797979798], [0.5, 4, 0.4, 0.6003030303030303], [0.5, 4, 0.6, 0.5993939393939394], [0.5, 6, 0.4, 0.5751515151515152], [0.5, 6, 0.6, 0.584949494949495]]
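
### Each new hyperparameter adds another level of nesting to the loop. One possible alternative, sketched here under assumptions, is itertools.product, which yields the same combinations in the same order from a flat loop. The helper run_grid and the stand-in placeholder_evaluate are hypothetical names used only for illustration; in the notebook you would pass gbm_grid_search_extended instead.

```python
from itertools import product

learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]
subsample_list = [0.4, 0.6]

def run_grid(evaluate, *value_lists):
    """Call evaluate once per combination; the rightmost list varies fastest."""
    return [evaluate(*combo) for combo in product(*value_lists)]

def placeholder_evaluate(learn_rate, max_depth, subsample):
    # Hypothetical stand-in that fits no model; swap in gbm_grid_search_extended
    return [learn_rate, max_depth, subsample, None]

grid = run_grid(placeholder_evaluate, learn_rate_list, max_depth_list, subsample_list)
print(len(grid))   # 3 * 3 * 2 combinations
print(grid[:2])
```

### The number of models grows multiplicatively with every list you add, which is the main cost of an exhaustive grid search.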


## GridSearchCV with Scikit Learn
### The GridSearchCV class from Scikit Learn provides many useful features to assist with efficiently undertaking a grid search. You will now put your learning into practice by creating a GridSearchCV object with certain parameters.

### The desired options are:
###### - A Random Forest Estimator, with the split criterion as 'entropy'
###### - 5-fold cross validation
###### - The hyperparameters max_depth (2, 4, 8, 15) and max_features ('auto' vs 'sqrt')
###### - Use roc_auc to score the models
###### - Use 4 cores for processing in parallel
###### - Ensure you refit the best model and return training scores
### You will have available X_train, X_test, y_train & y_test datasets.

### Instructions
-    Create a Random Forest estimator as specified in the context above.
-    Create a parameter grid as specified in the context above.
-    Create a GridSearchCV object as outlined in the context above, using the two elements created in the previous two instructions.

In [7]:
from sklearn.model_selection import GridSearchCV

In [11]:
# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

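# Create the parameter grid
# Note: 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; for classifiers it meant 'sqrt'
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']}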

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True, return_train_score=True)
print(grid_rf_class)

GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'),
             n_jobs=4,
             param_grid={'max_depth': [2, 4, 8, 15],
                         'max_features': ['auto', 'sqrt']},
             return_train_score=True, scoring='roc_auc')
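
### Before calling fit, it can help to estimate how much work the search implies. The sketch below uses only the standard library and the settings from this exercise to count the candidates and model fits; the variable names are illustrative and not part of scikit-learn.

```python
from itertools import product

param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']}
cv_folds = 5

# One candidate per combination of hyperparameter values
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds   # every candidate is fitted once per fold
n_total_fits = n_cv_fits + 1          # refit=True adds one final fit on all training data

print(n_candidates, n_cv_fits, n_total_fits)
```

### With n_jobs=4 these fits are spread across four worker processes, so the wall-clock time is shorter than the fit count alone suggests.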


## Exploring the grid search results
### You will now explore the cv_results_ attribute of the GridSearchCV object defined above. This is a dictionary that we can read into a pandas DataFrame, and it contains a lot of useful information about the grid search we just undertook.

### A reminder of the different column types in this property:

###### - time_ columns
###### - param_ columns (one for each hyperparameter) and the singular params column (with all hyperparameter settings)
###### - a train_score column for each cv fold including the mean_train_score and std_train_score columns
###### - a test_score column for each cv fold including the mean_test_score and std_test_score columns
###### - a rank_test_score column with a number from 1 to n (number of iterations) ranking the rows based on their mean_test_score
### Instructions
-    Read the cv_results_ property of the grid_rf_class GridSearchCV object into a data frame & print the whole thing out to inspect.
-    Extract & print the singular column containing a dictionary of all hyperparameters used in each iteration of the grid search.
-    Extract & print the row that had the best mean test score by indexing using the rank_test_score column.

In [23]:
rf_class = RandomForestClassifier(criterion='entropy')
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['sqrt']} 

grid_rf_class = GridSearchCV(
    estimator=rf_class, param_grid=param_grid,
    scoring='roc_auc', n_jobs=4, cv=5,
    refit=True, return_train_score=True)

grid_rf_class.fit(X_train, y_train)

In [24]:
# Read the cv_results property into a dataframe & print it out
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df)

# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:, ['params']]
print(column)

# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1 ]
print(best_row)

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.894760      0.027684         0.026475        0.001236   
1       1.458642      0.028673         0.028903        0.001102   
2       2.572547      0.033429         0.038203        0.002381   
3       4.856046      0.217442         0.067870        0.008103   

  param_max_depth param_max_features  \
0               2               sqrt   
1               4               sqrt   
2               8               sqrt   
3              15               sqrt   

                                      params  split0_test_score  \
0   {'max_depth': 2, 'max_features': 'sqrt'}           0.603618   
1   {'max_depth': 4, 'max_features': 'sqrt'}           0.623729   
2   {'max_depth': 8, 'max_features': 'sqrt'}           0.632343   
3  {'max_depth': 15, 'max_features': 'sqrt'}           0.633297   

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
0           0.601778           0.596628  ...  

## Analyzing the best results
### At the end of the day, we primarily care about the best-performing 'square' in a grid search. Luckily, Scikit Learn's GridSearchCV objects have a number of attributes that provide key information on just the best square (or row in cv_results_).

### Three properties you will explore are:

###### - best_score_ – The score (here ROC_AUC) from the best-performing square.
###### - best_index_ – The index of the row in cv_results_ containing information on the best-performing square.
###### - best_params_ – A dictionary of the parameters that gave the best score, for example 'max_depth': 10
### The grid search object grid_rf_class is available.

### A dataframe (cv_results_df) has been created from the cv_results_ for you on line 6. This will help you index into the results.
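
### To see how these three properties relate to cv_results_, here is a rough sketch on a toy dictionary. The scores and the name toy_cv_results are made up for illustration, and ties are ignored, so treat it as an approximation of the idea rather than scikit-learn's actual implementation.

```python
# Toy stand-in for cv_results_: a dict of equal-length columns (scores are made up)
toy_cv_results = {
    "params": [{"max_depth": 2}, {"max_depth": 4}, {"max_depth": 8}],
    "mean_test_score": [0.60, 0.63, 0.62],
}

scores = toy_cv_results["mean_test_score"]

# Rank 1 goes to the highest mean test score (ties are not handled in this sketch)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
rank_test_score = [0] * len(scores)
for rank, i in enumerate(order, start=1):
    rank_test_score[i] = rank

best_index = rank_test_score.index(1)                # analogous to best_index_
best_score = scores[best_index]                      # analogous to best_score_
best_params = toy_cv_results["params"][best_index]   # analogous to best_params_

print(rank_test_score, best_index, best_score, best_params)
```

### Note that best_params holds only the hyperparameters that were in the grid; anything left at its default value does not appear there.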

### Instructions
-    Extract and print out the ROC_AUC score from the best performing square in grid_rf_class.
-    Create a variable from the best-performing row by indexing into cv_results_df.
-    Create a variable, best_n_estimators by extracting the n_estimators parameter from the best-performing square in grid_rf_class and print it out.

In [25]:
# Print out the ROC_AUC score from the best-performing square
best_score = grid_rf_class.best_score_
print(best_score)

# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[[grid_rf_class.best_index_]]
print(best_row)

# Get the n_estimators value used by the best-performing square and print.
# n_estimators was not part of param_grid, so it is not a key in best_params_;
# read it from the refitted best estimator instead
best_n_estimators = grid_rf_class.best_estimator_.n_estimators
print(best_n_estimators)

0.6261485126962087
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
3       4.856046      0.217442          0.06787        0.008103   

  param_max_depth param_max_features  \
3              15               sqrt   

                                      params  split0_test_score  \
3  {'max_depth': 15, 'max_features': 'sqrt'}           0.633297   

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
3           0.623308           0.619782  ...         0.626149        0.005353   

   rank_test_score  split0_train_score  split1_train_score  \
3                1            0.975617            0.971523   

   split2_train_score  split3_train_score  split4_train_score  \
3            0.972765            0.977121            0.974948   

   mean_train_score  std_train_score  
3          0.974395         0.002007  

[1 rows x 22 columns]
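
### The mean_train_score and std_train_score columns in the output above can be checked by hand from the five split train scores of the best row. The sketch below uses only the standard library. Using the population standard deviation (pstdev) is an assumption on my part, chosen because it matches the value reported in the table.

```python
from statistics import fmean, pstdev

# split0..split4 train scores for the best row, copied from the output above
split_train_scores = [0.975617, 0.971523, 0.972765, 0.977121, 0.974948]

mean_train_score = fmean(split_train_scores)
std_train_score = pstdev(split_train_scores)   # population standard deviation

print(round(mean_train_score, 6), round(std_train_score, 6))
```

### The gap between the mean train score (about 0.97) and the mean test score (about 0.63) suggests that the depth-15 forest fits the training folds far more closely than it generalises.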

## Using the best results
### While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.

### We can access this object through the best_estimator_ property of our grid search object.

### Let's take a look inside the best_estimator_ property, make predictions, and generate evaluation scores. We will first use the default predict method (which returns class predictions), then switch to predict_proba to compute the ROC-AUC score, because ROC-AUC is calculated from probability scores rather than hard class labels. The slice [:,1] selects the probabilities of the positive class.
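
### To illustrate why probabilities matter here, the sketch below computes ROC-AUC as the fraction of (positive, negative) pairs that are ranked correctly, once from probabilities and once from thresholded labels. The function pairwise_roc_auc and the toy data are made up for illustration; in practice you would use sklearn.metrics.roc_auc_score.

```python
def pairwise_roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities of the positive class (made up)
y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard class predictions obtained by thresholding at 0.5
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_roc_auc(y_true, proba)
auc_from_labels = pairwise_roc_auc(y_true, hard)
print(auc_from_proba, auc_from_labels)
```

### Thresholding collapses most scores into ties, so the ranking information that ROC-AUC measures is partly lost; the two values differ even though the underlying model is the same.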

### You have available the X_test and y_test datasets to use and the grid_rf_class object from previous exercises.

### Instructions
-    Check the type of the best_estimator_ property.
-    Use the best_estimator_ property to make predictions on our test set.
-    Generate a confusion matrix and ROC_AUC score from our predictions.

In [26]:
# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked; this should be an array of class labels (1s and 2s in this dataset)
print(predictions[0:5])

# Now create a confusion matrix 
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score (roc_auc_score was not imported in the first cell)
from sklearn.metrics import roc_auc_score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:,1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))

<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[2 2 2 2 1]
Confusion Matrix 
 [[ 893 3063]
 [ 695 5249]]