### Tree hyperparameters
In the following exercises you'll revisit the Indian Liver Patient dataset which was introduced in a previous chapter.

Your task is to tune the hyperparameters of a classification tree. **Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.**

We have instantiated a DecisionTreeClassifier and assigned to dt with sklearn's default hyperparameters. You can inspect the hyperparameters of dt in your console.

Which of the following is not a hyperparameter of dt?

Search for the optimal tree
In this exercise, you'll perform grid search using 5-fold cross validation to find dt's optimal hyperparameters.
Note that because grid search is an exhaustive process, it may take a lot time to train the model.
Here you'll only be instantiating the GridSearchCV object without fitting it to the training set.
As discussed in the video, you can train such an object similar to any scikit-learn estimator by using the .fit() method:
```
grid_object.fit(X_train, y_train)
```
An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.

In [1]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Import train_test_split
from sklearn.model_selection import train_test_split

# Read 'indian_liver_patient_preprocessed.csv'
df = pd.read_csv('datasets/indian_liver_patient_preprocessed.csv')

# Create arrays for features and target variable
y = df['Liver_disease'].values
X = df.drop('Liver_disease', axis=1).values

# Set seed to 1 for reproducibility
SEED = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)

In [9]:
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

dt.fit(X_train, y_train)

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)

In [10]:
%%time
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

grid_dt.fit(X_train, y_train)

# Extract the best estimator
best_model = grid_dt.estimator

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.564
CPU times: total: 78.1 ms
Wall time: 74 ms


In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 'auto', 'sqrt'],
    'min_samples_leaf': [2, 10, 30]
}

rf = RandomForestClassifier(random_state=SEED)

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf, param_grid=params_rf, scoring='neg_mean_squared_error', cv=3, verbose=1, n_jobs=-1)

In [12]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test)

# Compute mse_test
mse_test = MSE(y_test, y_pred)

# Compute rmse_test
rmse_test = mse_test ** (1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))

Fitting 3 folds for each of 27 candidates, totalling 81 fits
Test RMSE of best model: 0.533
