In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Decision trees, the most commonly used XGBoost base learners, are unique in the machine learning landscape. Instead of multiplying column values by numeric weights, as in linear regression and logistic regression, decision trees split the data by asking questions about the columns. This process of splitting data into new groups via branching continues until the algorithm reaches a desired level of accuracy.
Decision trees are prone to overfitting the data. In other words, decision trees can map too closely to the training data, which shows up as high variance. Hyperparameter fine-tuning is one way to prevent overfitting. Another is to aggregate the predictions of many trees, a strategy that Random Forests and XGBoost employ.

### Decision Tree Model

In [3]:
#Load data (file was saved in local directory)
df_census = pd.read_csv('census_cleaned.csv')

In [4]:
#declare your predictor and target columns, X and y:
X = df_census.iloc[:,:-1]
y = df_census.iloc[:,-1]

In [2]:
from sklearn.model_selection import train_test_split

In [6]:
# train-test-split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
#random_state: choosing the seed of a pseudo-random number generator to ensure reproducible results

In [7]:
# The accuracy_score determines the number of correct predictions divided by the total number of predictions. 
# import the DecisionTreeClassifier and accuracy_score:
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

In [8]:
#Initialize a machine learning model with random_state=2 to ensure consistent results:
DecTree= DecisionTreeClassifier(random_state=2)

In [9]:
#Fit the model on the training set:
DecTree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=2)

In [10]:
#Make predictions for the test set:
y_pred = DecTree.predict(X_test)

In [11]:
#Compare predictions with the test set (accuracy_score expects the true labels first):
accuracy_score(y_test, y_pred)

0.8131679154894976

##### Gini Criterion:
Gini is the error method the decision tree uses to decide how splits should be made. The goal is to find a split that leads to the lowest error. A Gini index of 0 means 0 errors: every sample in the node belongs to one class. For two classes, the Gini index peaks at 0.5, which shows an equal distribution of elements and means the predictions are no better than random guessing; values approaching 1 are only possible when there are many classes. The closer to 0, the lower the error. At the root, a Gini of 0.364 means the training set is imbalanced: since the two-class Gini is 2p(1 - p), 0.364 corresponds to roughly 24 percent of the samples belonging to the minority class.
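The Gini calculation can be sketched in a few lines of plain Python. The function names `gini` and `split_gini` and the class counts below are illustrative assumptions, not part of scikit-learn or of the census data; the point is only to show how an impurity is computed from class counts and how a candidate split is scored by the weighted impurity of its two children.

```python
def gini(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def split_gini(left_counts, right_counts):
    """Weighted average of the children's Gini impurities for a candidate split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)


# Hypothetical node with 76 samples of class 0 and 24 samples of class 1
parent = [76, 24]
print("parent impurity:", round(gini(parent), 3))

# Hypothetical split of that node into two children
left, right = [70, 5], [6, 19]
print("impurity after split:", round(split_gini(left, right), 3))
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent, which is the case for the hypothetical split above.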

##### Bias:
A straight line generally has high bias. In machine learning, bias is a mathematical term that comes from estimating the error when applying the model to a real-life problem. The bias of the straight line is high because the predictions are restricted to the line and fail to account for changes in the data.
In many cases, a straight line is not complex enough to make accurate predictions. When this happens, we say that the machine learning model has underfit the data with high bias.

##### Variance: 
In machine learning, variance is a mathematical term indicating how much a model will change given a different set of training data. Formally, variance is the measure of the squared deviation between a random variable and its mean. Given a different set of nine data points for training, an eighth-degree polynomial will be completely different, resulting in high variance.
Models with high variance often overfit the data. These models do not generalize well to new data points because they have fit the training data too closely.
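The contrast between a straight line and an eighth-degree polynomial can be sketched without any libraries. The two training sets below are made up for illustration (the same nine x values with slightly different y values), and the helper names `fit_line` and `interpolate` are assumptions of this sketch. The polynomial passes through every training point, yet its prediction at a new point changes far more between the two training sets than the line's prediction does.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x_new):
    """Evaluate the polynomial of degree len(xs) - 1 that passes through
    every training point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


# Two hypothetical training sets: same nine x values, slightly different y values
xs = list(range(9))
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(9)]
ys_a = [x + e for x, e in zip(xs, noise)]
ys_b = [x - e for x, e in zip(xs, noise)]

x_new = 7.5  # a point that is in neither training set

line_preds = []
for ys in (ys_a, ys_b):
    slope, intercept = fit_line(xs, ys)
    line_preds.append(slope * x_new + intercept)

poly_preds = [interpolate(xs, ys, x_new) for ys in (ys_a, ys_b)]

line_gap = abs(line_preds[0] - line_preds[1])
poly_gap = abs(poly_preds[0] - poly_preds[1])
print("straight line, change in prediction:", round(line_gap, 3))
print("eighth-degree polynomial, change in prediction:", round(poly_gap, 3))
```

The line barely moves when the training set changes (low variance), while the polynomial swings widely (high variance) even though it fits both training sets exactly.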

#### Low bias, low variance:
Low variance means that a different training set will not result in a curve that differs by a significant amount. Low bias indicates that the error when applying this model to a real-world situation will not be too high. In machine learning, the combination of low variance and low bias is ideal. 

## Tuning decision tree hyperparameters

In machine learning, parameters are adjusted while the model is being trained. The weights in linear and logistic regression, for example, are parameters adjusted during the build phase to minimize errors. Hyperparameters, by contrast, are chosen in advance of the build phase. If no hyperparameters are selected, default values are used.

#### Decision Tree Regressor
Before selecting hyperparameters, let's start by finding a baseline score using a DecisionTreeRegressor and cross_val_score with the following steps:

In [3]:
# Load the 'bike_rentals_cleaned' dataset and split it into X_bikes (predictor columns) and y_bikes (target column):
df_bikes = pd.read_csv('bike_rentals_cleaned.csv')
X_bikes = df_bikes.iloc[:,:-1]
y_bikes = df_bikes.iloc[:,-1]

In [21]:
# Import the DecisionTreeRegressor and cross_val_score:
from sklearn.tree import DecisionTreeRegressor 
from sklearn.model_selection import cross_val_score

In [22]:
# Initialize DecisionTreeRegressor and fit the model in cross_val_score:
reg = DecisionTreeRegressor(random_state=2)

scores = cross_val_score(reg, X_bikes, y_bikes, scoring='neg_mean_squared_error', cv=5)

In [23]:
# Compute the root mean squared error (RMSE) and print the results:
rmse = np.sqrt(-scores)

print('RMSE mean: %0.2f' % (rmse.mean()))

RMSE mean: 1233.36


###### The RMSE is 1233.36. This is worse than the 972.06 obtained from Linear Regression and the 887.31 obtained by XGBoost.
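Because scikit-learn treats higher scores as better, scoring='neg_mean_squared_error' returns the negated MSE of each fold, which is why the sign is flipped before taking the square root. A minimal sketch with made-up fold scores (the numbers below are hypothetical, not results from the bike rentals data) shows the conversion, and also shows that averaging the per-fold RMSEs is not the same as taking the square root of the averaged MSE:

```python
import math

# Hypothetical negated MSE values, one per cross-validation fold
neg_mse_scores = [-250000.0, -1000000.0, -640000.0, -810000.0, -490000.0]

# Flip the sign and take the square root fold by fold, then average
fold_rmse = [math.sqrt(-score) for score in neg_mse_scores]
mean_of_rmse = sum(fold_rmse) / len(fold_rmse)

# Average the MSE first, then take a single square root
rmse_of_mean = math.sqrt(-sum(neg_mse_scores) / len(neg_mse_scores))

print("mean of fold RMSEs: %0.2f" % mean_of_rmse)
print("RMSE of mean MSE:   %0.2f" % rmse_of_mean)
```

The first form corresponds to np.sqrt(-scores).mean() as used with cross_val_score; the second corresponds to np.sqrt(-grid_reg.best_score_) as used later with GridSearchCV. The second is never smaller than the first, so the two numbers are not directly interchangeable.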

Is the model overfitting the data because the variance is too high?
We can check how well the decision tree makes predictions on the training set alone:

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
# train-test-split

X_train, X_test, y_train, y_test = train_test_split(X_bikes, y_bikes, random_state=2)

In [24]:
#The following code checks the error on the training set, before making predictions on the test set:
from sklearn.metrics import mean_squared_error

reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_train)

reg_mse = mean_squared_error(y_train, y_pred)
reg_rmse = np.sqrt(reg_mse)
reg_rmse

0.0

In [None]:
# An RMSE of 0.0 means that the model has perfectly fit every data point!
# This perfect score, combined with a cross-validation error of 1233.36, is strong evidence that the decision tree is overfitting the data with high variance.
# The training set is fit perfectly, but the held-out folds are missed badly.

# Hyperparameters may rectify the situation.

##### max_depth:
max_depth caps the depth of the tree, that is, the number of successive splits between the root and any leaf. By limiting max_depth to smaller numbers, variance is reduced, and the model generalizes better to new data.
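One way to see why a small max_depth restricts the model: each level of a binary tree can at most double the number of leaves, so a tree of depth d has at most 2^d leaves. The short sketch below (the helper name `max_leaves` is an assumption for illustration) prints that upper bound for a few depths.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves of a binary tree of the given depth."""
    return 2 ** max_depth


for depth in (2, 3, 4, 6, 8, 10, 20):
    print("max_depth=%d allows at most %d leaves" % (depth, max_leaves(depth)))
```

With only a few hundred training rows, a depth of 10 already allows more leaves than there are rows, so the tree is free to isolate individual samples, whereas a depth of 6 forces many samples to share a leaf.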

How can you choose the best number for max_depth?

You can always try max_depth=1, then max_depth=2, then max_depth=3, and so on, but this process would be exhausting. Instead, you may use a tool called GridSearchCV.

##### GridSearchCV:
GridSearchCV searches a grid of hyperparameters using cross-validation to deliver the best results.

GridSearchCV functions like any machine learning algorithm, meaning that it's fit on a training set and scored on a test set. The primary difference is that GridSearchCV checks every combination of hyperparameters in the grid before finalizing a model.
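To get a feel for how much work a grid implies, the combinations can be enumerated by hand with itertools.product. This is only a sketch of the idea, not how GridSearchCV is implemented internally; the grid values are the ones tried in this notebook and `cv_folds` is an illustrative variable name.

```python
from itertools import product

param_grid = {
    'max_depth': [None, 2, 3, 4, 6, 8, 10, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10, 20, 30],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is fit once per cross-validation fold
n_fits = len(combinations) * cv_folds
print("combinations:", len(combinations))
print("model fits with cv=%d: %d" % (cv_folds, n_fits))
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is the motivation for sampling-based searches on larger grids.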

In [25]:
#Import GridSearchCV and define a list of hyperparameters for max_depth as follows:
from sklearn.model_selection import GridSearchCV 
params = {'max_depth':[None,2,3,4,6,8,10,20]}

In [26]:
#initialize a DecisionTreeRegressor, and place it inside of GridSearchCV along with params and the scoring metric:
reg = DecisionTreeRegressor(random_state=2)
grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid_reg.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=2), n_jobs=-1,
             param_grid={'max_depth': [None, 2, 3, 4, 6, 8, 10, 20]},
             scoring='neg_mean_squared_error')

In [27]:
best_params = grid_reg.best_params_
print("Best params:", best_params)

Best params: {'max_depth': 6}


In [28]:
# training score (mean cross-validated RMSE for the best hyperparameters):
best_score = np.sqrt(-grid_reg.best_score_)
print("Training score: {:.3f}".format(best_score))

Training score: 951.398


In [29]:
# test score:
best_model = grid_reg.best_estimator_
y_pred = best_model.predict(X_test)
rmse_test = mean_squared_error(y_test, y_pred)**0.5
print('Test score: {:.3f}'.format(rmse_test))

Test score: 864.670


##### min_samples_leaf
min_samples_leaf provides a restriction by setting the minimum number of samples that a leaf must contain. As with max_depth, min_samples_leaf is designed to reduce overfitting. Increasing min_samples_leaf reduces variance.

In [32]:
# write a function that displays the best parameters, training score, and test score using GridSearchCV with DecisionTreeRegressor(random_state=2) assigned to reg as a default parameter:
def grid_search(params, reg=DecisionTreeRegressor(random_state=2)):
    grid_reg = GridSearchCV(reg, params, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
    grid_reg.fit(X_train, y_train)
    best_params = grid_reg.best_params_    
    print("Best params:", best_params)    
    best_score = np.sqrt(-grid_reg.best_score_)    
    print("Training score: {:.3f}".format(best_score))

    y_pred = grid_reg.predict(X_test)    
    rmse_test = mean_squared_error(y_test, y_pred)**0.5
    print('Test score: {:.3f}'.format(rmse_test))

In [33]:
X_train.shape

(548, 12)

In [34]:
#Let's try [1, 2, 4, 6, 8, 10, 20, 30] as the input of our grid_search:
grid_search(params={'min_samples_leaf':[1, 2, 4, 6, 8, 10, 20, 30]})

Best params: {'min_samples_leaf': 8}
Training score: 896.083
Test score: 855.620


###### Since the test score is better than the training score, variance has been reduced.

In [35]:
#put min_samples_leaf and max_depth together:
grid_search(params={'max_depth':[None,2,3,4,6,8,10,20],'min_samples_leaf':[1,2,4,6,8,10,20,30]})

Best params: {'max_depth': 6, 'min_samples_leaf': 2}
Training score: 870.396
Test score: 913.000


##### Even though the training score has improved, the test score has not. min_samples_leaf has decreased from 8 to 2, while max_depth has remained the same. Hyperparameters should not be chosen in isolation!!!

In [36]:
#limit min_samples_leaf to values of 3 or greater:
grid_search(params={'max_depth':[6,7,8,9,10],'min_samples_leaf':[3,5,7,9]})

Best params: {'max_depth': 9, 'min_samples_leaf': 7}
Training score: 888.905
Test score: 878.538


###### The test score has improved, from 913.000 to 878.538.

##### max_leaf_nodes
max_leaf_nodes is similar to min_samples_leaf. Instead of specifying the number of samples per leaf, it specifies the total number of leaves. So, max_leaf_nodes=10 means that the model cannot have more than 10 leaves. It could have fewer.

##### max_features
max_features is an effective hyperparameter for reducing variance. Instead of considering every possible feature for a split, it chooses from a select number of features each round.
It's standard to see max_features with the following options:

None is the default, which provides no limitations.
'sqrt' is the square root of the total number of features.
'log2' is the log of the total number of features in base 2. 32 columns resolves to 5 since 2^5 = 32.
'auto' appears in older code; recent scikit-learn releases deprecated and then removed it, and for a DecisionTreeClassifier it behaved like 'sqrt'.
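The options above can be turned into a concrete feature count with a little arithmetic. The helper below, `resolve_max_features`, is a hypothetical name for this sketch; it mirrors the rules described in the scikit-learn documentation (truncating to an integer and never going below one feature) rather than calling the library.

```python
import math


def resolve_max_features(option, n_features):
    """Number of features considered at each split for a given max_features setting."""
    if option is None:
        return n_features                      # no limitation
    if option == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if option == 'log2':
        return max(1, int(math.log2(n_features)))
    if isinstance(option, float):
        return max(1, int(option * n_features))  # a fraction of the features
    return option                              # an explicit integer count


for option in (None, 'sqrt', 'log2', 0.8, 3):
    print(repr(option), "->", resolve_max_features(option, 32))
```

For 32 columns, 'sqrt' and 'log2' both resolve to 5 features per split, while a fraction such as 0.8 keeps most of the columns available.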

##### min_samples_split
Another splitting technique is min_samples_split. As the name indicates, min_samples_split sets the minimum number of samples a node must contain before a split can be made. The default is 2, since two samples may be split into one sample each, ending as single leaves. If the limit is increased to 5, no further splits are permitted for nodes with fewer than five samples.
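The interaction between min_samples_split and min_samples_leaf can be sketched as a simple check. The function name `may_split` is an assumption for illustration; the rule it encodes is that a node needs at least min_samples_split samples, and enough samples that both children can still satisfy min_samples_leaf.

```python
def may_split(n_node_samples, min_samples_split=2, min_samples_leaf=1):
    """Whether a node holding n_node_samples is even eligible to be split."""
    enough_to_split = n_node_samples >= min_samples_split
    enough_for_two_leaves = n_node_samples >= 2 * min_samples_leaf
    return enough_to_split and enough_for_two_leaves


print(may_split(2))                          # defaults: two samples can still be split
print(may_split(4, min_samples_split=5))     # too few samples for a split
print(may_split(5, min_samples_split=5))     # exactly at the limit
print(may_split(15, min_samples_leaf=8))     # children could not both hold 8 samples
```

This only decides whether a split may be attempted; whether a worthwhile split actually exists is determined by the criterion.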

##### splitter
There are two options for splitter, 'random' and 'best'. Splitter tells the model how to select the feature to split each branch. The 'best' option, the default, selects the feature that results in the greatest gain of information. The 'random' option, by contrast, selects the split randomly.

Changing splitter to 'random' is a great way to prevent overfitting and diversify trees.

##### criterion
The criteria for splitting decision tree regressors and classifiers are different. The criterion provides the method the machine learning model uses to determine how splits should be made. It's the scoring method for splits. For each possible split, the criterion calculates a score and compares it to the other options. The split with the best score wins.

The options for decision tree regressors are mse (mean squared error), friedman_mse (which includes Friedman's adjustment), and mae (mean absolute error). The default is mse. In scikit-learn 1.0 and later, mse and mae are named squared_error and absolute_error.

For classifiers, gini, which was described earlier, and entropy usually give similar results.

##### min_impurity_decrease
min_impurity_decrease, which replaced the older min_impurity_split, allows a split only when the split decreases the impurity by at least this value.

Impurity is a measure of how mixed the classes are in each node. A node whose samples all belong to one class has an impurity of 0.0. Under the Gini criterion, a node with an 80/20 mix of two classes has an impurity of 0.32.

Impurity is an important idea in Decision Trees. Throughout the tree-building process, impurity should continually decrease. Splits that result in the greatest decrease of impurity are chosen for each node.

The default value is 0.0. This number can be increased so that trees stop building when a certain threshold is reached.
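The quantity compared against min_impurity_decrease is described in the scikit-learn documentation as a weighted impurity decrease: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), where N is the total number of samples and N_t the number of samples at the node. The sketch below implements that formula directly; the function names and the example numbers are assumptions for illustration.

```python
def weighted_impurity_decrease(n_total, n_node, impurity,
                               n_left, left_impurity,
                               n_right, right_impurity):
    """Impurity decrease of a split, weighted by the share of samples at the node."""
    return (n_node / n_total) * (
        impurity
        - (n_right / n_node) * right_impurity
        - (n_left / n_node) * left_impurity
    )


def split_allowed(decrease, min_impurity_decrease=0.0):
    """A split is kept only if its weighted decrease reaches the threshold."""
    return decrease >= min_impurity_decrease


# Hypothetical root split: 100 samples, perfectly separated into two pure children
root = weighted_impurity_decrease(100, 100, 0.5, 50, 0.0, 50, 0.0)

# Hypothetical split deep in the tree: few samples and only a modest improvement
deep = weighted_impurity_decrease(100, 20, 0.32, 10, 0.18, 10, 0.18)

print("root split decrease:", root, "allowed:", split_allowed(root, 0.05))
print("deep split decrease:", round(deep, 3), "allowed:", split_allowed(deep, 0.05))
```

Because the decrease is weighted by the fraction of samples reaching the node, raising min_impurity_decrease mostly prunes splits that affect few samples or improve purity only slightly.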

##### min_weight_fraction_leaf
min_weight_fraction_leaf is the minimum weighted fraction of the total weights required to be a leaf. According to the documentation, samples have equal weight when sample_weight is not provided.

For practical purposes, min_weight_fraction_leaf is another hyperparameter that reduces variance and prevents overfitting. The default is 0.0. Assuming equal weights, a restriction of 1%, 0.01, would require at least 5 of the 500 samples to be a leaf. 
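The arithmetic behind that example can be written out as a short sketch. It assumes equal weights of 1.0 for 500 samples, and the helper name `min_leaf_weight` is hypothetical; the idea is simply that the fraction is multiplied by the total weight to get the minimum weight a leaf must carry.

```python
def min_leaf_weight(min_weight_fraction_leaf, sample_weights):
    """Minimum total sample weight that a leaf must hold."""
    return min_weight_fraction_leaf * sum(sample_weights)


# 500 samples with equal weight, as when sample_weight is not provided
weights = [1.0] * 500
threshold = min_leaf_weight(0.01, weights)
print("minimum weight per leaf:", threshold)

# With equal weights, a leaf's weight is just its number of samples
for n_leaf_samples in (4, 6):
    print(n_leaf_samples, "samples ->", n_leaf_samples >= threshold)
```

With unequal sample weights the same fraction would translate into a different number of samples per leaf, which is the difference between this hyperparameter and min_samples_leaf.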

## Predicting heart disease – a case study

Develop a model and highlight two to three important features that doctors and nurses can focus on to improve patient health.
Use a decision tree classifier with fine-tuned hyperparameters. 
After the model has been built, interpret results using feature_importances_, an attribute that determines the most important features in predicting heart disease.

In [37]:
#Load data (file was saved in local directory)
df_heart = pd.read_csv('heart_disease.csv')
df_heart.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


'target' is binary, with 1 indicating that the patient has heart disease and 0 indicating that they do not.

In [38]:
# Split the data into training and test sets:
X = df_heart.iloc[:,:-1]
y = df_heart.iloc[:,-1]
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

#### Decision Tree Classifier

In [41]:
# import the DecisionTreeClassifier and accuracy_score:
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

In [43]:
#Use cross_val_score with a DecisionTreeClassifier:
model = DecisionTreeClassifier(random_state=2)
scores = cross_val_score(model, X, y, cv=5)
print('Accuracy:', np.round(scores, 2))
print('Accuracy mean: %0.2f' % (scores.mean()))

Accuracy: [0.74 0.85 0.77 0.73 0.7 ]
Accuracy mean: 0.76


#### RandomizedSearchCV

RandomizedSearchCV works in the same way as GridSearchCV, but instead of trying all hyperparameter combinations, it tries a fixed number of randomly chosen combinations. It's meant to find the best combinations in limited time.
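The idea of sampling a fixed number of combinations can be sketched with the standard library alone. This is an illustration of the concept, not scikit-learn's own sampling code; the small grid uses values that appear in this notebook, and the variable names are assumptions.

```python
import random
from itertools import product

param_grid = {
    'criterion': ['entropy', 'gini'],
    'splitter': ['random', 'best'],
    'max_depth': [None, 2, 4, 6, 8],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
}

# The full grid that an exhaustive search would have to evaluate
names = list(param_grid)
all_combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Draw a fixed number of distinct combinations; seeding makes the draw repeatable
rng = random.Random(2)
runs = 20
sampled = rng.sample(all_combinations, runs)

print("full grid size:", len(all_combinations))
print("combinations actually tried:", len(sampled))
print("first sampled combination:", sampled[0])
```

Only the sampled combinations are cross-validated, so the cost is controlled by the number of runs rather than by the size of the grid, at the price of possibly missing the single best combination.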

In [62]:
# Function that uses RandomizedSearchCV to return the best model along with the scores:
# The inputs are params (a dictionary of hyperparameters to test), runs (number of hyperparameter combinations to check), and clf (the classifier, a DecisionTreeClassifier by default)
from sklearn.model_selection import RandomizedSearchCV

def randomized_search_clf(params, runs=20, clf=DecisionTreeClassifier(random_state=2)):
    rand_clf = RandomizedSearchCV(clf, params, n_iter=runs, cv=5, n_jobs=-1, random_state=2)
    rand_clf.fit(X_train, y_train)
    best_model = rand_clf.best_estimator_
    # best_score_ is the mean cross-validated accuracy on the training set
    best_score = rand_clf.best_score_
    print("Training score: {:.3f}".format(best_score))

    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Test score: {:.3f}'.format(accuracy))
    return best_model

##### Choosing hyperparameters
Numbers have been chosen with the aim of reducing variance and trying an expansive range:

In [46]:
from sklearn.model_selection import RandomizedSearchCV 

In [47]:
# 'sqrt' is used for max_features in place of the 'auto' alias, which recent scikit-learn releases no longer accept.
randomized_search_clf(params={
    'criterion': ['entropy', 'gini'],
    'splitter': ['random', 'best'],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04],
    'min_impurity_decrease': [0.0, 0.0005, 0.005, 0.05, 0.10, 0.15, 0.2],
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': ['sqrt', 0.95, 0.90, 0.85, 0.80, 0.75, 0.70],
    'max_depth': [None, 2, 4, 6, 8],
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05],
})


Training score: 0.798
Test score: 0.855


DecisionTreeClassifier(criterion='entropy', max_depth=8, max_features=0.8,
                       max_leaf_nodes=45, min_samples_leaf=0.04,
                       min_samples_split=10, min_weight_fraction_leaf=0.05,
                       random_state=2)

##### Narrowing the range
Narrowing the range is one strategy to improve hyperparameters.
Using a baseline of max_depth=8 chosen from the best model, we may narrow the range to 7 through 9.
Another strategy is to stop checking hyperparameters whose defaults are working fine. 'entropy', for instance, is not recommended over 'gini' as the differences are very slight. min_impurity_split and min_impurity_decrease may also be left at their defaults.

In [74]:
# 'sqrt' is used for max_features in place of the 'auto' alias, which recent scikit-learn releases no longer accept.
randomized_search_clf(params={
    'max_depth': [None, 6, 7],
    'max_features': ['sqrt', 0.78],
    'max_leaf_nodes': [45, None],
    'min_samples_leaf': [1, 0.035, 0.04, 0.045, 0.05],
    'min_samples_split': [2, 9, 10],
    'min_weight_fraction_leaf': [0.0, 0.05, 0.06, 0.07],
}, runs=100)

Training score: 0.802
Test score: 0.868


DecisionTreeClassifier(max_depth=7, max_features=0.78, max_leaf_nodes=45,
                       min_samples_leaf=0.045, min_samples_split=9,
                       min_weight_fraction_leaf=0.06, random_state=2)

In [75]:
# For a proper baseline of comparison, it's essential to put the new model into cross_val_score.
model = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7, max_features=0.78, max_leaf_nodes=45, min_impurity_decrease=0.0,  min_samples_leaf=0.045, min_samples_split=9, min_weight_fraction_leaf=0.06,  random_state=2, splitter='best')
scores = cross_val_score(model, X, y, cv=5)
print('Accuracy:', np.round(scores, 2))
print('Accuracy mean: %0.2f' % (scores.mean()))

Accuracy: [0.82 0.9  0.8  0.8  0.78]
Accuracy mean: 0.82


##### The mean accuracy of 0.82 is six percentage points higher than the 0.76 of the default model.

#### Some further trials

In [153]:
# 'sqrt' is used for max_features in place of the 'auto' alias, which recent scikit-learn releases no longer accept.
randomized_search_clf(params={
    'max_depth': [7, 8, 9, 10],
    'max_features': ['sqrt', 0.80],
    'max_leaf_nodes': [45, None],
    'min_impurity_decrease': [0.0, 0.1],
    'min_samples_leaf': [1, 0.06, 0.07],
    'min_samples_split': [4, 6, 8, 10],
    'min_weight_fraction_leaf': [0.0, 0.05],
    'splitter': ['best'],
}, runs=100)

Training score: 0.802
Test score: 0.868


DecisionTreeClassifier(max_depth=10, max_features=0.8, max_leaf_nodes=45,
                       min_samples_leaf=0.06, min_samples_split=6,
                       random_state=2)

In [154]:
model = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10, max_features=0.8, max_leaf_nodes=45, min_samples_leaf=0.06,   min_samples_split=6,  random_state=2)
scores = cross_val_score(model, X, y, cv=5)
print('Accuracy:', np.round(scores, 2))
print('Accuracy mean: %0.2f' % (scores.mean()))

Accuracy: [0.82 0.9  0.8  0.8  0.78]
Accuracy mean: 0.82


### feature_importances_

In [56]:
best_clf = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,max_features=0.8, max_leaf_nodes=47,min_impurity_decrease=0.0, min_samples_leaf=1, min_samples_split=8,min_weight_fraction_leaf=0.05, random_state=2, splitter='best')

best_clf.fit(X, y)

DecisionTreeClassifier(max_depth=9, max_features=0.8, max_leaf_nodes=47,
                       min_samples_split=8, min_weight_fraction_leaf=0.05,
                       random_state=2)

In [57]:
best_clf.feature_importances_

array([0.04830121, 0.04008887, 0.47546568, 0.        , 0.        ,
       0.        , 0.        , 0.00976578, 0.        , 0.02445397,
       0.02316427, 0.1774694 , 0.20129082])

In [58]:
feature_dict = dict(zip(X.columns, best_clf.feature_importances_))

# Import operator 
import operator

#Sort dict by values (as list of tuples)
sorted(feature_dict.items(), key=operator.itemgetter(1), reverse=True)[0:3]


[('cp', 0.47546567857183675),
 ('thal', 0.20129082387838435),
 ('ca', 0.1774694042213901)]

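'cp': Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
'thal': A categorical test result, recorded in the original UCI data as normal, fixed defect, or reversible defect (not to be confused with 'thalach', the maximum heart rate achieved)
'ca': Number of major vessels (0-3) colored by fluoroscopy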

These numbers are each feature's share of the total impurity reduction achieved by the tree, normalized to sum to 1. So 'cp' accounts for about 48% of the reduction, which is more than 'thal' and 'ca' combined (about 38%).
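The comparison can be checked directly from the printed importances. The sketch below copies the values from the output above and assumes the column order shown by df_heart.head() (without the target column); it uses only the standard library.

```python
# Predictor columns, in the order shown by df_heart.head() (assumed to match X.columns)
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Values copied from the feature_importances_ output above
importances = [0.04830121, 0.04008887, 0.47546568, 0.0, 0.0, 0.0, 0.0,
               0.00976578, 0.0, 0.02445397, 0.02316427, 0.1774694, 0.20129082]

# Rank features from most to least important
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]
top3_share = sum(value for _, value in top3)

print("sum of all importances: %.3f" % sum(importances))
for name, value in top3:
    print("%-5s %.3f" % (name, value))
print("share held by the top three: %.3f" % top3_share)
```

Several features have an importance of exactly 0.0, which means this particular tree never split on them; it does not by itself show that they are medically irrelevant.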