# Tuning a CART's Hyperparameters  
Hyperparameter tuning is the search for the set of hyperparameters that gives a learning algorithm its best-performing model. "Best" is defined by a score function, which measures the agreement between the true labels and the model's predictions. In scikit-learn, it defaults to accuracy for classifiers and R-squared for regressors. Each candidate model's generalization performance is estimated using cross-validation.  
Why bother tuning hyperparameters? In scikit-learn, a model's default hyperparameters are not optimal for every problem, so they should be tuned to obtain the best model performance. Common approaches to hyperparameter tuning include grid search and random search.  
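
Random search is only named above, so here is a minimal sketch of it using scikit-learn's `RandomizedSearchCV`. The synthetic data (`X_demo`, `y_demo`), the sampling ranges, and `n_iter=8` are assumptions chosen for illustration; they are not taken from the datasets used later in this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the liver-patient dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Distributions to sample from, instead of a fixed grid of values
param_dist = {
    'max_depth': randint(2, 6),               # integers 2, 3, 4 or 5
    'min_samples_leaf': uniform(0.05, 0.15),  # floats in [0.05, 0.20]
}

# n_iter fixes how many candidates are tried, regardless of the search space size
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=8,
    scoring='roc_auc',
    cv=5,
    random_state=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Unlike grid search, the cost here is set by `n_iter` rather than by the number of values listed per hyperparameter.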
  
**Parameters**: Learned from data
- CART example: split-point of a node, split-feature of a node, etc.  
  
**Hyperparameters**: Not learned from data
- CART example: max_depth, min_samples_leaf, splitting criterion, etc.  
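
To make the distinction concrete, the sketch below fits a small tree on synthetic data (an assumption, used only for illustration) and prints one hyperparameter next to two learned parameters. The names `X_demo`, `y_demo` and `tree_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any labelled dataset would do)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# max_depth is a hyperparameter: it is set by you before fitting
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=1)
tree_demo.fit(X_demo, y_demo)
print('Hyperparameter max_depth:', tree_demo.get_params()['max_depth'])

# The split-feature and split-point of the root node are parameters:
# they are learned from the data during fit() and stored in tree_
root_feature = tree_demo.tree_.feature[0]
root_threshold = tree_demo.tree_.threshold[0]
print('Learned split-feature index of root:', root_feature)
print('Learned split-point of root:', root_threshold)
```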
  
---
  
Grid search cross validation
- In grid-search cross-validation, first you manually set a grid of discrete hyperparameter values.
- Then, you pick a metric for scoring model performance and you search exhaustively through the grid. For each set of hyperparameters, you evaluate each model's score. 
- The optimal hyperparameters are those for which the model achieves the best cross-validation score. 
- Note that grid-search suffers from the curse of dimensionality: the number of candidates is the product of the number of values tried for each hyperparameter. In other words: "The bigger the grid, the longer it takes to find the solution."
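
The steps above can be sketched by hand with `ParameterGrid` and `cross_val_score`. This is roughly the loop that `GridSearchCV` automates, not its actual implementation, and the synthetic data and the small grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Step 1: manually set a grid of discrete hyperparameter values
grid_demo = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.12, 0.14]}

# Step 2: search exhaustively, scoring every combination with 5-fold CV
results = []
for params in ParameterGrid(grid_demo):
    model = DecisionTreeClassifier(random_state=1, **params)
    mean_score = cross_val_score(model, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    results.append((mean_score, params))

# Step 3: the optimal hyperparameters have the best mean CV score
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```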
  
Getting the list of any model's hyperparameters
- This can be accomplished by using **print(model.get_params())**
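
As a small sketch, `get_params()` returns a plain dictionary, and its counterpart `set_params()` changes values on an existing estimator. The values set below are arbitrary examples.

```python
from sklearn.tree import DecisionTreeClassifier

dt_demo = DecisionTreeClassifier(random_state=1)

# get_params() returns a dict mapping hyperparameter names to current values
hyperparam_names = sorted(dt_demo.get_params().keys())
print(hyperparam_names)

# set_params() updates hyperparameters in place (example values only)
dt_demo.set_params(max_depth=3, min_samples_leaf=0.12)
print(dt_demo.get_params()['max_depth'])
```

The keys of this dictionary are the names you can use in a hyperparameter grid.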
  
Tuning can be expensive
- Computationally expensive
- Sometimes leads to only very slight improvements, so weigh the impact of tuning on the whole project
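
To get a feel for the computational cost before starting a search, you can count the model fits in advance. The sketch below does this for the two grids used later in this notebook; the helper `n_cv_fits` is a hypothetical name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The two grids used later in this notebook
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
}
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 1.0, 'sqrt'],
    'min_samples_leaf': [2, 10, 30],
}

def n_cv_fits(param_grid, cv):
    """Number of cross-validation fits: candidates in the grid times folds."""
    return len(ParameterGrid(param_grid)) * cv

dt_fits = n_cv_fits(params_dt, cv=5)  # 3 * 4 candidates, 5 folds each
rf_fits = n_cv_fits(params_rf, cv=3)  # 3 * 3 * 3 candidates, 3 folds each
print(dt_fits, rf_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set.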
  

## Tuning Hyperparameters for a DecisionTreeClassifier

### Load, Pre-process, and List Available Hyperparameters

In [2]:
# Importing pandas
import pandas as pd
# Import train_test_split
from sklearn.model_selection import train_test_split
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier


# Loading df
data = pd.read_csv('../_datasets/indian-liver-patient/indian_liver_patient_preprocessed.csv')
data = data.drop('Unnamed: 0', axis=1)

# Selecting data
X = data.iloc[:,:9].values
y = data.iloc[:,10].values

# Set seed for reproducibility
SEED = 1

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

# Instantiate dt
dt = DecisionTreeClassifier(random_state=SEED)

# Show the model's hyperparameters
print(dt.get_params())

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 1, 'splitter': 'best'}


### Setting Up the Hyperparameter Grid

In [3]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV


# Define params_dt
params_dt = {
    'max_depth' : [2,3,4],
    'min_samples_leaf' : [0.12, 0.14, 0.16, 0.18]
}

# Instantiate grid_dt
grid_dt = GridSearchCV(
    estimator= dt,
    param_grid= params_dt,
    scoring= 'roc_auc',
    cv= 5,
    n_jobs= -1
    )

# Fitting the grid to the training data
grid_dt.fit(X_train, y_train)                    
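
After fitting, the score of every candidate (not just the best one) is available in the `cv_results_` attribute. The sketch below shows one way to inspect it as a DataFrame. It uses synthetic data and a hypothetical `grid_demo` object so that it runs on its own; the same lines should work with `grid_dt`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the liver-patient dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={'max_depth': [2, 3, 4], 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
    scoring='roc_auc',
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results_df = pd.DataFrame(grid_demo.cv_results_)[columns].sort_values('rank_test_score')
print(results_df.head())
```

Looking at `std_test_score` next to `mean_test_score` helps you judge whether the gap between the top candidates is meaningful.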

### Evaluating the Optimal Model

In [4]:
# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters: {}'.format(best_hyperparams))

# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print('Best CV ROC AUC: {}'.format(best_CV_score))

# Extract the best estimator
best_model = grid_dt.best_estimator_
print('Best estimator: {}'.format(best_model))

Best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 0.12}
Best CV ROC AUC: 0.7273653424150937
Best estimator: DecisionTreeClassifier(max_depth=3, min_samples_leaf=0.12, random_state=1)


In [5]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.610


An untuned classification tree would achieve a ROC AUC score of 0.610.

## Tuning Hyperparameters for a RandomForest

### Load, Pre-process, and List Available Hyperparameters

In [6]:
# Import pandas
import pandas as pd
# Import RandomForest
from sklearn.ensemble import RandomForestRegressor


# Load df
data = pd.read_csv('../_datasets/bikes.csv')

# Seed
SEED = 1

# X,y split
X = data.loc[:, data.columns != 'cnt'].values  # Selecting all columns except the target
y = data['cnt'].values

# Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= SEED)

# Instantiate rf
rf = RandomForestRegressor(random_state=SEED)

# Show the model's hyperparameters
print(rf.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 1, 'verbose': 0, 'warm_start': False}


### Setting Up the Hyperparameter Grid

In [7]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV


# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators' : [100, 350, 500],
    'max_features' : ['log2', 1.0, 'sqrt'],  # 'auto' is deprecated; use None or 1.0 instead
    'min_samples_leaf' : [2, 10, 30]
}

# Instantiate grid_rf
grid_rf = GridSearchCV(
    estimator= rf,
    param_grid= params_rf,
    scoring= 'neg_mean_squared_error',
    cv= 3,
    verbose= 3,
    n_jobs= -1
    )

# Fitting the grid to the training data
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV 1/3] END max_features=log2, min_samples_leaf=2, n_estimators=100;, score=-10008.581 total time=   0.7s
[CV 2/3] END max_features=log2, min_samples_leaf=2, n_estimators=100;, score=-10820.669 total time=   0.7s
[CV 3/3] END max_features=log2, min_samples_leaf=2, n_estimators=100;, score=-11709.936 total time=   0.7s
[CV 1/3] END max_features=log2, min_samples_leaf=2, n_estimators=350;, score=-10040.637 total time=   2.3s
[CV 3/3] END max_features=log2, min_samples_leaf=2, n_estimators=350;, score=-11447.147 total time=   2.3s
[CV 2/3] END max_features=log2, min_samples_leaf=2, n_estimators=350;, score=-10429.290 total time=   2.4s
[CV 1/3] END max_features=log2, min_samples_leaf=10, n_estimators=100;, score=-15132.092 total time=   0.6s
[CV 1/3] END max_features=log2, min_samples_leaf=2, n_estimators=500;, score=-9817.009 total time=   3.3s
[CV 2/3] END max_features=log2, min_samples_leaf=10, n_estimators=100;, score=-1713

### Evaluating the Optimal Model

In [8]:
# Extract best hyperparameters from 'grid_rf'
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters: {}'.format(best_hyperparams))

# Extract best CV score from 'grid_rf'
best_CV_score = grid_rf.best_score_
print('Best CV score: {}'.format(best_CV_score))

# Extract the best estimator
best_model = grid_rf.best_estimator_
print('Best estimator: {}'.format(best_model))

Best hyperparameters: {'max_features': 1.0, 'min_samples_leaf': 2, 'n_estimators': 100}
Best CV score: -3086.4992029105765
Best estimator: RandomForestRegressor(min_samples_leaf=2, random_state=1)


In [9]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 51.779
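
The best CV score printed earlier is negative because `scoring='neg_mean_squared_error'` reports the negated MSE, so that a higher score always means a better model. To compare it with the test RMSE, flip the sign and take the square root. The sketch below uses the printed value directly; in the notebook you would pass `grid_rf.best_score_` instead.

```python
import math

# Best CV score as printed by the grid search (negated mean squared error)
best_cv_score = -3086.4992029105765

# Undo the negation to recover the MSE, then take the root to get RMSE
cv_mse = -best_cv_score
cv_rmse = math.sqrt(cv_mse)

print('CV RMSE of best model: {:.3f}'.format(cv_rmse))
```

This gives a CV RMSE of roughly 55.6, which is in the same units as the test RMSE of 51.779.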
