# Model Tuning

## Tuning a CART's Hyperparameters

To obtain better performance, the hyperparameters of a machine learning model should be tuned.

The optimal model is the one that yields the best score. In scikit-learn, the default score is accuracy for classifiers and R² for regressors. A model's generalization performance is estimated using cross-validation.
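
The default scorers and cross-validation can be seen together in a minimal sketch. It assumes scikit-learn's bundled breast cancer and diabetes datasets as stand-ins for the CSV files used in this notebook; the names `clf`, `reg`, `cv_acc`, and `cv_r2` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Assumption: bundled toy datasets stand in for the CSV files used in this notebook.
X_clf, y_clf = load_breast_cancer(return_X_y=True)
X_reg, y_reg = load_diabetes(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=4, random_state=1)
reg = DecisionTreeRegressor(max_depth=4, random_state=1)

# With no `scoring` argument, cross_val_score falls back to the estimator's
# own .score(): accuracy for classifiers, R^2 for regressors.
cv_acc = cross_val_score(clf, X_clf, y_clf, cv=5)
cv_r2 = cross_val_score(reg, X_reg, y_reg, cv=5)

print("CV accuracy: {:.3f} +/- {:.3f}".format(cv_acc.mean(), cv_acc.std()))
print("CV R^2:      {:.3f} +/- {:.3f}".format(cv_r2.mean(), cv_r2.std()))
```

Passing `scoring="roc_auc"` or `scoring="neg_mean_squared_error"` replaces the default metric, as the grid searches in this notebook do.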

Grid search is one hyperparameter tuning method. It suffers from the curse of dimensionality: every added hyperparameter multiplies the number of combinations, so the bigger the grid, the longer the search takes.
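
How quickly the grid grows can be checked before any model is fitted. The sketch below counts the candidates in the grid used in the next cell with `ParameterGrid`; the variable names `cv_folds`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one searched in the cell below.
params_dt = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [0.04, 0.06, 0.08],
    "max_features": [0.2, 0.4, 0.6, 0.8],
}
cv_folds = 10

n_candidates = len(ParameterGrid(params_dt))  # 4 * 3 * 4 combinations
n_fits = n_candidates * cv_folds              # one fit per candidate per fold

print(n_candidates, n_fits)  # 48 480
```

Adding one more hyperparameter with four values would multiply both numbers by four. With `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the whole training set.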

In [24]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Load the data and separate the features from the target
cancer = pd.read_csv("cancer.csv")
X = cancer.drop(["id", "Unnamed: 32", "diagnosis"], axis=1)
y = cancer["diagnosis"]

# Encode the diagnosis labels as integers
le = LabelEncoder()
y = le.fit_transform(y)
y = pd.Series(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                   stratify=y, random_state=1)

dt = DecisionTreeClassifier(random_state=1)
print(dt.get_params())

# Candidate values; floats are fractions of the samples / features
params_dt = {"max_depth": [3, 4, 5, 6],
             "min_samples_leaf": [0.04, 0.06, 0.08],
             "max_features": [0.2, 0.4, 0.6, 0.8]}

# 10-fold cross-validated grid search, scored by accuracy
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring="accuracy",
                       cv=10, n_jobs=-1)
grid_dt.fit(X_train, y_train)

best_hyperparams = grid_dt.best_params_
print("Best hyperparameters: ", best_hyperparams)
best_CV_score = grid_dt.best_score_
print("Best CV accuracy {:.3f}".format(best_CV_score))

# best_estimator_ is refit on the whole training set (refit=True by default)
best_model = grid_dt.best_estimator_
test_acc = best_model.score(X_test, y_test)
print("Test set accuracy of best model: {:.3f}".format(test_acc))

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 1, 'splitter': 'best'}
Best hyperparameters:  {'max_depth': 4, 'max_features': 0.2, 'min_samples_leaf': 0.06}
Best CV accuracy 0.935
Test set accuracy of best model: 0.906


### Set the tree's hyperparameter grid

In [28]:
params_dt = {
    "max_depth": [2,3,4],
    "min_samples_leaf": [0.12, 0.14, 0.16, 0.18]
}
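
The float values of `min_samples_leaf` in this grid are fractions of the training set rather than absolute counts: scikit-learn converts a float to `ceil(min_samples_leaf * n_samples)`. The sketch below checks that conversion on a fitted tree. It assumes the bundled breast cancer dataset as a stand-in for the liver data, and `frac`, `expected_min`, and `leaf_sizes` are illustrative names.

```python
import math

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Assumption: a bundled dataset stands in for the liver CSV.
X, y = load_breast_cancer(return_X_y=True)

frac = 0.12
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=frac, random_state=1)
tree.fit(X, y)

# A float min_samples_leaf is interpreted as ceil(frac * n_samples).
expected_min = math.ceil(frac * X.shape[0])

# Leaves are the nodes without children (children_left == -1).
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print("minimum allowed leaf size:", expected_min)
print("observed leaf sizes:", leaf_sizes)
```

Expressing the limit as a fraction keeps the grid meaningful when the training set size changes, for example across cross-validation folds.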

### Search for the optimal tree

In [39]:
liver = pd.read_csv("indian_liver_patient_preprocessed.csv", index_col=0)
X = liver.drop("Liver_disease", axis=1)
y = liver["Liver_disease"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) 
dt = DecisionTreeClassifier(random_state=1)

grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring="roc_auc",cv=5,n_jobs=-1, refit=True)
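
With `refit=True` (the default), `GridSearchCV` retrains the best candidate on the whole training set and exposes it as `best_estimator_`; prediction methods called on the search object delegate to that model. A minimal sketch of this behaviour, assuming the bundled breast cancer dataset in place of the liver CSV (the names `grid`, `proba_via_grid`, and `proba_via_best` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a bundled dataset stands in for the liver CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 3, 4],
                "min_samples_leaf": [0.12, 0.14, 0.16, 0.18]},
    scoring="roc_auc",
    cv=5,
    refit=True,
)
grid.fit(X_train, y_train)

# Both routes use the same refitted model.
proba_via_grid = grid.predict_proba(X_test)[:, 1]
proba_via_best = grid.best_estimator_.predict_proba(X_test)[:, 1]

print(np.allclose(proba_via_grid, proba_via_best))  # True
```

With `refit=False`, `best_estimator_` is not available and the winning parameters from `best_params_` would have to be fitted by hand.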


### Evaluate the optimal tree

In [40]:
from sklearn.metrics import roc_auc_score

# Run the grid search (5-fold CV, scored by ROC AUC)
grid_dt.fit(X_train, y_train)

# Tuned tree: predicted probability of the positive class on the test set
best_model = grid_dt.best_estimator_
y_pred_proba = best_model.predict_proba(X_test)[:,1]
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('Test set ROC AUC score of grid_dt: {:.3f}'.format(test_roc_auc))

# Baseline: the untuned tree with default hyperparameters
dt.fit(X_train, y_train)
y_pred_proba = dt.predict_proba(X_test)[:,1]
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('Test set ROC AUC score of dt: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score of grid_dt: 0.731
Test set ROC AUC score of dt: 0.598


For comparison, the untuned classification tree achieves a test-set ROC AUC of only 0.598 in the run above, against 0.731 for the tuned tree.

## Tuning an RF's Hyperparameters

Tuning a random forest is computationally expensive and sometimes yields very little improvement.

In [56]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=1)
print(rf.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 1, 'verbose': 0, 'warm_start': False}


### Set the hyperparameter grid of RF

In [69]:
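params_rf = {
    "n_estimators": [100, 350, 500],
    # "auto" means all features for a regressor; it was removed in
    # scikit-learn 1.3 (use 1.0 or None there)
    "max_features": ["log2", "auto", "sqrt"],
    "min_samples_leaf": [2, 10, 30]
}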

### Search for the optimal forest

In [70]:
rf = RandomForestRegressor(random_state=1)
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring="neg_mean_squared_error",
                       cv=3,
                       verbose=1,
                       n_jobs=-1)
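
scikit-learn scorers follow a higher-is-better convention, so `neg_mean_squared_error` reports the negated MSE and `best_score_` comes out negative. The sketch below recovers a cross-validated RMSE from it. It assumes synthetic data from `make_regression` in place of the bike dataset and a deliberately small grid to keep the run short; `cv_mse` and `cv_rmse` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic data stands in for bike.csv.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)

grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [20, 50], "min_samples_leaf": [2, 10]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# best_score_ is the mean negated MSE over the folds for the best candidate.
cv_mse = -grid.best_score_
cv_rmse = cv_mse ** 0.5

print("Best params:", grid.best_params_)
print("CV RMSE of best candidate: {:.3f}".format(cv_rmse))
```

Note that the square root of the mean fold MSE is not the same quantity as the mean of the per-fold RMSEs, so the two should not be compared directly.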

### Evaluate the optimal forest

In [71]:
from sklearn.metrics import mean_squared_error as MSE

bike = pd.read_csv("bike.csv")
X = bike.drop("cnt", axis=1)
y = bike["cnt"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6)
grid_rf.fit(X_train, y_train)
best_model = grid_rf.best_estimator_

# Predict the test set target values
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** 0.5

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.7s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   17.4s finished


Test RMSE of best model: 61.598
