**Decision-Tree for Classification**
___
Classification tree
- sequence of if-else questions about individual features
    - **objective**: infer class labels
- able to capture non-linear relationships between features and labels
- do not require feature scaling (e.g., Standardization)

In [2]:
#Train your first classification tree

#In this exercise you'll work with the Wisconsin Breast Cancer Dataset
#from the UCI machine learning repository. You'll predict whether a tumor
#is malignant or benign based on two features: the mean radius of the
#tumor (radius_mean) and its mean number of concave points (concave points_mean).

#The dataset is already loaded in your workspace and is split into 80%
#train and 20% test. The feature matrices are assigned to X_train and
#X_test, while the arrays of labels are assigned to y_train and y_test
#where class 1 corresponds to a malignant tumor and class 0 corresponds
#to a benign tumor. To obtain reproducible results, we also defined a
#variable called SEED which is set to 1.

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])

#################################################
#<script.py> output:
#    [0 0 0 1 0]

In [3]:
#Evaluate the classification tree

#Now that you've fit your first classification tree, it's time to evaluate
#its performance on the test set. You'll do so using the accuracy metric
#which corresponds to the fraction of correct predictions made on the test set.

#The trained model dt from the previous exercise is loaded in your workspace
#along with the test set features matrix X_test and the array of labels y_test.

# Import accuracy_score
from sklearn.metrics import accuracy_score

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

#################################################
#<script.py> output:
#    Test set accuracy: 0.89

In [None]:
#Logistic regression vs classification tree

#A classification tree divides the feature space into rectangular regions.
#In contrast, a linear model such as logistic regression produces only a
#single linear decision boundary dividing the feature space into two
#decision regions.

#We have written a custom function called plot_labeled_decision_regions()
#that you can use to plot the decision regions of a list containing two
#trained classifiers. You can type help(plot_labeled_decision_regions)
#in the IPython shell to learn more about this function.

#X_train, X_test, y_train, y_test, the model dt that you've trained in
#an earlier exercise, as well as the function plot_labeled_decision_regions()
#are available in your workspace.

# Import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# Instantiate logreg
logreg = LogisticRegression(random_state=1)

# Fit logreg to the training set
logreg.fit(X_train, y_train)

# Define a list called clfs containing the two classifiers logreg and dt
clfs = [logreg, dt]

# Review the decision regions of the two classifiers
plot_labeled_decision_regions(X_test, y_test, clfs)

![_images/10.1.svg](_images/10.1.svg)

Notice how the decision boundary produced by logistic regression is linear while the boundaries produced by the classification tree divide the feature space into rectangular regions.

**Classification-Tree Learning**
___
-  Terms
    1. **Decision-Tree**: data structure consisting of a hierarchy of nodes
    2. **Node**: question or prediction
    3. **Root**: *no* parent node, question giving rise to *two* child nodes
    4. **Internal Node**: *one* parent node, question giving rise to *two* child nodes
    5. **Leaf**: *one* parent node, *no* child nodes --> *prediction*
        - in each leaf, one class label is predominant
- Information Gain (IG)
    - at each non-leaf node, the tree asks a question about one feature and one split-point
    - the feature and split-point are chosen to maximize the information gain, i.e. the drop in impurity from the parent node to its two children
        - $IG(\text{feature}, \text{split-point}) = \text{Impurity}(\text{parent}) - \left(\frac{N_{left}}{N}\,\text{Impurity}(\text{left}) + \frac{N_{right}}{N}\,\text{Impurity}(\text{right})\right)$
        - criteria to measure the impurity of a node:
            - gini index
            - entropy
- Classification-Tree learning
    - nodes of classification tree are grown recursively; a node exists based on the state of its predecessors
    - At each non-leaf node data is split on:
        - feature and split-point to maximize IG(node)
        - if IG(node) = 0, or the tree has reached its maximum allowed depth, declare the node a leaf
___
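
To make the information-gain formula concrete, here is a minimal sketch in plain Python. The helper names (`gini`, `entropy`, `information_gain`) and the toy labels are assumptions made for illustration; this is not scikit-learn's internal implementation.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (hypothetical helper)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum_k p_k * log2(p_k) (hypothetical helper)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=gini):
    """Impurity(parent) minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Toy node with 4 samples of each class, and one candidate split
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

ig_gini = information_gain(parent, left, right, gini)
ig_entropy = information_gain(parent, left, right, entropy)

# A split that separates the classes perfectly recovers all of the parent's impurity
ig_perfect = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1], gini)

print(ig_gini, ig_entropy, ig_perfect)
```

A tree learner would evaluate this quantity for every candidate feature and split-point at a node and keep the pair with the largest gain.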

In [None]:
#Using entropy as a criterion

#In this exercise, you'll train a classification tree on the Wisconsin
#Breast Cancer dataset using entropy as an information criterion. You'll
#do so using all 30 features in the dataset, which is split into 80%
#train and 20% test.

#X_train as well as the array of labels y_train are available in your workspace.

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)

In [None]:
#Entropy vs Gini index
#In this exercise you'll compare the test set accuracy of dt_entropy to
#the accuracy of another tree named dt_gini. The tree dt_gini was trained
#on the same dataset using the same parameters except for the information
#criterion which was set to the gini index using the keyword 'gini'.

#X_test, y_test, dt_entropy, as well as accuracy_gini which corresponds to
#the test set accuracy achieved by dt_gini are available in your workspace.

#Notice how the two models achieve exactly the same accuracy. Most of the
#time, the gini index and entropy lead to the same results. The gini index
#is slightly faster to compute and is the default criterion used in the
#DecisionTreeClassifier model of scikit-learn.

# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Use dt_entropy to predict test set labels
y_pred = dt_entropy.predict(X_test)

# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)

# Print accuracy_entropy
print('Accuracy achieved by using entropy: ', accuracy_entropy)

# Print accuracy_gini
print('Accuracy achieved by using the gini index: ', accuracy_gini)

#################################################
#<script.py> output:
#    Accuracy achieved by using entropy:  0.929824561404
#    Accuracy achieved by using the gini index:  0.929824561404

**Decision tree for regression**
___
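- good for applications where the relationship between features and the target is non-linear
    - min_samples_leaf ==> minimum fraction (float) or number (int) of training samples that each leaf must contain
- Information criterion for regression-tree
    - $I(\text{node}) = \text{MSE}(\text{node}) = \frac{1}{N_{node}} \sum_{i \in node} \left(y^{(i)} - \hat{y}_{node}\right)^2$
    - $\hat{y}_{node} = \frac{1}{N_{node}} \sum_{i \in node} y^{(i)}$
- Prediction
    - $\hat{y}_{pred}(\text{leaf}) = \frac{1}{N_{leaf}} \sum_{i \in leaf} y^{(i)}$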
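
The node impurity and leaf prediction above can be worked through on a handful of numbers. The sketch below is plain Python; `node_mse`, `leaf_prediction` and the toy mpg-like targets are assumptions for illustration only.

```python
def node_mse(y):
    """Impurity of a regression node: mean squared deviation from the node mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)

def leaf_prediction(y):
    """A leaf predicts the mean target of the training samples it contains."""
    return sum(y) / len(y)

# Toy targets reaching a node, and one candidate split of that node
y_node = [18.0, 20.0, 22.0, 30.0, 34.0]
y_left, y_right = [18.0, 20.0, 22.0], [30.0, 34.0]

n = len(y_node)
weighted_children = (len(y_left) / n * node_mse(y_left)
                     + len(y_right) / n * node_mse(y_right))

# Impurity reduction achieved by the split (the regression analogue of IG)
reduction = node_mse(y_node) - weighted_children

print(node_mse(y_node), reduction, leaf_prediction(y_left), leaf_prediction(y_right))
```

The split is worthwhile here because the children are far more homogeneous than the parent, so the weighted child MSE is much smaller than the parent MSE.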

In [4]:
#Train your first regression tree

#In this exercise, you'll train a regression tree to predict the mpg
#(miles per gallon) consumption of cars in the auto-mpg dataset using
#all the six available features.

#The dataset is processed for you and is split into 80% train and 20%
#test. The features matrix X_train and the array y_train are available
#in your workspace.

# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
                           min_samples_leaf=0.13,
                           random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)

In [None]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Compute rmse_dt
rmse_dt = mse_dt**(1/2)

# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))

#################################################
#<script.py> output:
#    Test set RMSE of dt: 4.37

In [None]:
#Linear regression vs regression tree

#In this exercise, you'll compare the test set RMSE of dt to that
#achieved by a linear regression model. We have already instantiated a
#linear regression model lr and trained it on the same dataset as dt.

#The features matrix X_test, the array of labels y_test, the trained
#linear regression model lr, mean_squared_error function which was
#imported under the alias MSE and rmse_dt from the previous exercise
#are available in your workspace.

# Predict test set labels
y_pred_lr = lr.predict(X_test)

# Compute mse_lr
mse_lr = MSE(y_test, y_pred_lr)

# Compute rmse_lr
rmse_lr = mse_lr**(1/2)

# Print rmse_lr
print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr))

# Print rmse_dt
print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt))

#################################################
#<script.py> output:
#    Linear Regression test set RMSE: 5.10
#    Regression Tree test set RMSE: 4.37

**Generalization Error**
___
- Supervised learning - Under the hood
    - y = f(x) where f is unknown
    - **goal**: find a model that best approximates f
        - discard noise as much as possible
    - **end goal**: model should achieve a low predictive error on unseen datasets.
- Difficulties in approximating f(x)
    - **overfitting**: when model fits the noise in the training set.
        - predictive power on unseen datasets is low
    - **underfitting**: when model is not flexible enough to approximate f(x)
        - training set error is roughly equal to test set error, but both are relatively high
        - "like teaching calculus to a 3-year-old"
- Generalization error
    - does model generalize well on unseen data?
    - = bias^2 + variance + irreducible error
        - **bias**: error term that tells you how much, on average, the model's predictions differ from f(x)
            - high bias models lead to underfitting
        - **variance**: tells you how much the model's predictions change when it is trained on different training sets
            - high variance models lead to overfitting
    - model complexity (e.g., maximum tree depth) sets the flexibility of model
        - best model complexity = lowest generalization error
        - not enough depth = underfitting
        - too much depth = overfitting
    - bias-variance tradeoff
        - as one increases, the other decreases
        - irreducible error is constant
        - where is the sweet spot?
        - analogous to validity-reliability
___
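
The bias and variance terms can be estimated by simulation when the true function is known. The sketch below is a toy experiment under stated assumptions: the function `f`, the noise level, the two stand-in models (a constant predictor and a 1-nearest-neighbour predictor) and all helper names are made up for illustration.

```python
import random

def f(x):
    # The "unknown" target function; known here only because we simulate the data
    return x ** 2

def simulate(predict, x0, n_sets=2000, n=20, noise_sd=0.5, seed=1):
    """Train on many random training sets and collect the predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs = [rng.uniform(0, 2) for _ in range(n)]
        ys = [f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(predict(xs, ys, x0))
    mean_pred = sum(preds) / len(preds)
    bias_sq = (mean_pred - f(x0)) ** 2            # squared bias at x0
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias_sq, variance

def constant_model(xs, ys, x0):
    # Very rigid model: ignores x and predicts the training mean
    return sum(ys) / len(ys)

def nearest_neighbor_model(xs, ys, x0):
    # Very flexible model: copies the label of the closest training point
    return min(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[1]

bias_sq_const, var_const = simulate(constant_model, x0=1.8)
bias_sq_nn, var_nn = simulate(nearest_neighbor_model, x0=1.8)

print('constant model: bias^2 = {:.3f}, variance = {:.3f}'.format(bias_sq_const, var_const))
print('1-NN model:     bias^2 = {:.3f}, variance = {:.3f}'.format(bias_sq_nn, var_nn))
```

In this setup the rigid model shows the larger squared bias and the flexible model the larger variance, which is the tradeoff described above.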

**Diagnose bias and variance problems**
___
- Estimating the generalization error
    - cannot be done directly because:
        - f is unknown
        - usually have only one dataset
        - noise is unpredictable
    - Solution:
        - split data into training and test set
        - fit model to training set
        - evaluate error of model on test set
        - generalization error of model is roughly equivalent to test set error of model
        - evaluating model on training set provides a biased estimate as model has already seen all of these data points
        - Solution: cross-validation
- K-Fold Cross-Validation (CV)
    - training set is randomly split into K partitions (folds)
    - each fold is held out once for evaluation while the model is trained on the other K-1 folds
        - the error is computed on the held-out fold
        - this is repeated K times, once per fold
    - CV error = mean of the K fold errors
- if model suffers from **high variance**
    - CV error of model > training set error of model
    - model is overfitting the training set
        - decrease model complexity (e.g., decrease max depth, increase min samples per leaf)
        - gather more data
- if model suffers from **high bias**
    - CV error of model is roughly equivalent to training set error of model which is > desired error
    - model is underfitting the training set
        - increase model complexity (e.g., increase max depth, decrease min samples per leaf)
        - gather more relevant features
___
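
The K-fold procedure can be sketched by hand in plain Python. Everything here is an assumption for illustration: the helper names, the toy targets, and the stand-in "model" that simply predicts the mean of its training folds. In practice `cross_val_score` from scikit-learn does this work.

```python
import random

def k_fold_indices(n, k, seed=1):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cv_mse_mean_model(y, k=5, seed=1):
    """K-fold CV error of a baseline that predicts the mean of its training folds."""
    folds = k_fold_indices(len(y), k, seed)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train) / len(train)          # "fit" on the other k-1 folds
        y_held = [y[i] for i in held_out]
        errors.append(mse(y_held, [prediction] * len(y_held)))  # evaluate on held-out fold
    return sum(errors) / len(errors), errors

y = [float(v) for v in range(20)]
folds = k_fold_indices(len(y), k=5)
cv_error, fold_errors = cv_mse_mean_model(y, k=5)

print('CV MSE: {:.2f}'.format(cv_error))
```

Comparing this CV error with the training error of the same model is what drives the high-variance versus high-bias diagnosis listed above.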

In [None]:
#Instantiate the model

#In the following set of exercises, you'll diagnose the bias and variance
#problems of a regression tree. The regression tree you'll define in this
#exercise will be used to predict the mpg consumption of cars from the auto
#dataset using all available features.

#We have already processed the data and loaded the features matrix X and
#the array y in your workspace. In addition, the DecisionTreeRegressor
#class was imported from sklearn.tree.

# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=SEED)

In [None]:
#Evaluate the 10-fold CV error

#In this exercise, you'll evaluate the 10-fold CV Root Mean Squared Error
#(RMSE) achieved by the regression tree dt that you instantiated in the
#previous exercise.

#In addition to dt, the training data including X_train and y_train are
#available in your workspace. We also imported cross_val_score from
#sklearn.model_selection.

#Note that since cross_val_score has only the option of evaluating the
#negative MSEs, its output should be multiplied by negative one to obtain
#the MSEs. The CV RMSE can then be obtained by computing the square root
#of the average MSE.

#A very good practice is to keep the test set untouched until you are
#confident about your model's performance. CV is a great technique to
#get an estimate of a model's performance without affecting the test set.

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

#################################################
#<script.py> output:
#    CV RMSE: 5.14

In [None]:
#Evaluate the training error

#You'll now evaluate the training set RMSE achieved by the regression
#tree dt that you instantiated in a previous exercise.

#In addition to dt, X_train and y_train are available in your workspace.

#Note that in scikit-learn, the MSE of a model can be computed as follows:

#MSE_model = mean_squared_error(y_true, y_predicted)
#where we use the function mean_squared_error from the metrics module
#and pass it the true labels y_true as a first argument, and the predicted
#labels from the model y_predicted as a second argument.

#Notice how the training error is roughly equal to the 10-fold CV error
#you obtained in the previous exercise. When both also exceed the desired
#error, this pattern indicates that dt is underfitting (high bias).

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

#################################################
#<script.py> output:
#    Train RMSE: 5.15

**Ensemble Learning**
___
Classification and Regression Trees
- Advantages of CARTs
    - ability to describe non-linear dependencies
    - no need to standardize or normalize features
- Limitations of CARTs
    - can only produce orthogonal decision boundaries in classification
    - sensitive to small variations in the training set
    - unconstrained CARTs may have high variance and overfit the training set
        - **solution**: ensemble learning
- Ensemble learning
    - train different models on same dataset
    - each model makes predictions
    - a meta-model aggregates predictions of individual models
    - final prediction is more robust and less prone to errors
    - Voting Classifier
        - e.g., best 2 out of 3, or majority rules
___
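
Hard (majority) voting can be sketched in a few lines of plain Python. The function name `majority_vote` and the three prediction lists are hypothetical; they stand in for the outputs of three already-trained classifiers.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by most models.

    predictions_per_model: list of per-model prediction lists of equal length.
    """
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Hypothetical predictions of three classifiers on four samples
lr_pred = [1, 0, 1, 1]
knn_pred = [1, 1, 0, 1]
dt_pred = [0, 0, 1, 1]

ensemble_pred = majority_vote([lr_pred, knn_pred, dt_pred])
print(ensemble_pred)
```

With an odd number of binary classifiers there can be no ties, which is one reason three-model ensembles are a common starting point.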

In [None]:
#Define the ensemble

#In the following set of exercises, you'll work with the Indian Liver
#Patient Dataset from the UCI Machine learning repository.

#In this exercise, you'll instantiate three classifiers to predict
#whether a patient suffers from a liver disease using all the features
#present in the dataset.

#The classes LogisticRegression, DecisionTreeClassifier, and
#KNeighborsClassifier under the alias KNN are available in your workspace.

# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbors', knn), ('Classification Tree', dt)]

In [None]:
#Evaluate individual classifiers

#In this exercise you'll evaluate the performance of the models in the
#list classifiers that we defined in the previous exercise. You'll do
#so by fitting each classifier on the training set and evaluating its
#test set accuracy.

#The dataset is already loaded and preprocessed for you (numerical
#features are standardized) and it is split into 70% train and 30% test.
#The features matrices X_train and X_test, as well as the arrays of
#labels y_train and y_test are available in your workspace. In addition,
#we have loaded the list classifiers from the previous exercise, as well
#as the function accuracy_score() from sklearn.metrics.

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict y_pred
    y_pred = clf.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

#################################################
#<script.py> output:
#    Logistic Regression : 0.747
#    K Nearest Neighbors : 0.724
#    Classification Tree : 0.730

In [None]:
#Better performance with a Voting Classifier

#Finally, you'll evaluate the performance of a voting classifier that
#takes the outputs of the models defined in the list classifiers and
#assigns labels by majority voting.

#X_train, X_test,y_train, y_test, the list classifiers defined in a
#previous exercise, as well as the function accuracy_score from
#sklearn.metrics are available in your workspace.

# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

#################################################
#<script.py> output:
#   Voting Classifier: 0.753

**Bagging**
___
Bootstrap aggregation
- **Voting Classifier**
    - same training set
    - different algorithms
- **Bagging**
    - different subsets of the training set
    - one algorithm
    - reduces variance of individual models in the ensemble
    - bootstrapping refers to the activity of sampling with replacement
    - **Classification**
        - aggregates predictions by majority voting
        - BaggingClassifier in sklearn
    - **Regression**
        - aggregates predictions through averaging
        - BaggingRegressor in sklearn
___
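
Bootstrapping and aggregation can be illustrated without any library. In the sketch below, the helper `bootstrap_sample`, the toy targets, and the stand-in "model" (the mean of its bootstrap sample) are assumptions for illustration; `BaggingRegressor` would train a real estimator on each sample instead.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices from 0..n-1 with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(1)
y = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0, 2.0, 9.0, 5.0, 6.0]
n = len(y)

# Each model in the ensemble sees its own bootstrap sample of the training set
samples = [bootstrap_sample(n, rng) for _ in range(3)]

# Stand-in "model": predict the mean target of the bootstrap sample
model_preds = [sum(y[i] for i in s) / n for s in samples]

# Regression aggregates by averaging the individual predictions
bagged_pred = sum(model_preds) / len(model_preds)

print(model_preds, bagged_pred)
```

Because sampling is done with replacement, some indices appear several times in a sample and others not at all, so the individual models differ even though they share one algorithm.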

In [1]:
#Define the bagging classifier

#In the following exercises you'll work with the Indian Liver Patient
#dataset from the UCI machine learning repository. Your task is to predict
#whether a patient suffers from a liver disease using 10 features including
#Albumin, age and gender. You'll do so using a Bagging Classifier.

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1)

In [None]:
#Evaluate Bagging performance

#Now that you instantiated the bagging classifier, it's time to train
#it and evaluate its test set accuracy.

#The Indian Liver Patient dataset is processed for you and split into
#80% train and 20% test. The feature matrices X_train and X_test, as
#well as the arrays of labels y_train and y_test are available in your
#workspace. In addition, we have also loaded the bagging classifier bc
#that you instantiated in the previous exercise and the function
#accuracy_score() from sklearn.metrics.

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

#################################################
#<script.py> output:
#    Test set accuracy of bc: 0.71
#################################################
#A single tree dt would have achieved an accuracy of 63% which is
#8 percentage points lower than bc's accuracy!

**Out of Bag Evaluation**
___
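- OOB instances: training instances that a given model's bootstrap sample did not draw (on average roughly 37% of the training set per model, since about 63% is sampled)
- each model is evaluated only on instances it did not see during training; the OOB score aggregates these evaluations across the ensemble
- can be used to estimate model performance without cross-validation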
___
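
A quick simulation shows which instances end up out of bag for one model. This is a plain-Python sketch; the sample size and seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(1)
n = 1000

# One bootstrap sample: n draws with replacement from the n training indices
sample = [rng.randrange(n) for _ in range(n)]

in_bag = set(sample)
oob = [i for i in range(n) if i not in in_bag]   # instances this model never saw

oob_fraction = len(oob) / n
print('OOB fraction: {:.3f}'.format(oob_fraction))
```

For large `n` the fraction approaches `1/e`, about 0.37, so each model has a sizeable set of unseen instances that can play the role of a validation set.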

In [3]:
#Prepare the ground

#In the following exercises, you'll compare the OOB accuracy to the test
#set accuracy of a bagging classifier trained on the Indian Liver Patient
#dataset.

#In sklearn, you can evaluate the OOB accuracy of an ensemble classifier
#by setting the parameter oob_score to True during instantiation. After
#training the classifier, the OOB accuracy can be obtained by accessing
#the .oob_score_ attribute from the corresponding instance.

#In your environment, we have made available the class DecisionTreeClassifier
#from sklearn.tree.

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)

# Instantiate bc
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
bc = BaggingClassifier(base_estimator=dt,
                       n_estimators=50,
                       oob_score=True,
                       random_state=1)

In [None]:
#OOB Score vs Test Set Score

#Now that you instantiated bc, you will fit it to the training set and
#evaluate its test set and OOB accuracies.

#The dataset is processed for you and split into 80% train and 20% test.
#The feature matrices X_train and X_test, as well as the arrays of labels
#y_train and y_test are available in your workspace. In addition, we have
#also loaded the classifier bc instantiated in the previous exercise and
#the function accuracy_score() from sklearn.metrics.

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))

#################################################
#<script.py> output:
#    Test set accuracy: 0.698, OOB accuracy: 0.704

**Random Forests (RF)**
___
- ensemble method that uses a decision tree as its base estimator
- each estimator is trained on a different bootstrap sample having the same size as the training set
- RF introduces further randomization in the training of individual trees
    - *d* features are sampled at each split/node without replacement
    - *d* < total number of features
    - each tree is trained on a different bootstrap sample from the training set
- achieves lower variance than individual trees
- **Classification**
    - Aggregates predictions by majority voting
    - RandomForestClassifier in sklearn
- **Regression**
    - Aggregates predictions through averaging
    - RandomForestRegressor in sklearn
- Feature importance
    - Tree-based methods enable measuring of importance of each feature in prediction
    - in sklearn, this is how much the tree nodes use a particular feature (weighted average) to reduce impurity
    - accessed using attribute feature_importances_
___
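
The extra randomization in a random forest, sampling *d* candidate features at each split, can be sketched as follows. The feature names are borrowed from the bike-sharing exercise, and choosing *d* as the square root of the number of features is an assumption made for this illustration, not a statement about any particular library default.

```python
import math
import random

features = ['hr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed']

# Assumption for illustration: d = floor(sqrt(number of features))
d = max(1, int(math.sqrt(len(features))))

rng = random.Random(2)

# At every split, only a fresh random subset of d features is considered
candidates_per_split = [rng.sample(features, d) for _ in range(3)]

for split_number, candidates in enumerate(candidates_per_split, start=1):
    print('split {}: candidate features = {}'.format(split_number, candidates))
```

`random.sample` draws without replacement, so a feature cannot appear twice among the candidates of a single split.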

In [5]:
#Train an RF regressor
#In the following exercises you'll predict bike rental demand in the Capital
#Bikeshare program in Washington, D.C., using historical weather data from the
#Bike Sharing Demand dataset available through Kaggle. For this purpose, you
#will be using the random forests algorithm. As a first step, you'll define
#a random forests regressor and fit it to the training set.

#The dataset is processed for you and split into 80% train and 20% test.
#The features matrix X_train and the array y_train are available in your
#workspace.

# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
            random_state=2)

# Fit rf to the training set
rf.fit(X_train, y_train)

In [None]:
#Evaluate the RF regressor

#You'll now evaluate the test set RMSE of the random forests regressor
#rf that you trained in the previous exercise.

#The dataset is processed for you and split into 80% train and 20% test.
#The features matrix X_test, as well as the array y_test are available in
#your workspace. In addition, we have also loaded the model rf that you
#trained in the previous exercise.

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Predict the test set labels
y_pred = rf.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))

#################################################
#<script.py> output:
#    Test set RMSE of rf: 51.97
#################################################
#You can try training a single CART on the same dataset. The test set
#RMSE achieved by rf is significantly smaller than that achieved by a
#single CART!

In [None]:
#Visualizing features importances

#In this exercise, you'll determine which features were the most predictive
#according to the random forests regressor rf that you trained in a previous
#exercise.

#For this purpose, you'll draw a horizontal barplot of the feature importance
#as assessed by rf. Fortunately, this can be done easily thanks to plotting
#capabilities of pandas.

#We have created a pandas.Series object called importances containing the
#feature names as index and their importances as values. In addition,
#matplotlib.pyplot is available as plt and pandas as pd.

# Create a pd.Series of features importances
importances = pd.Series(data=rf.feature_importances_,
                      index= X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Features Importances')
plt.show()

#################################################
#In [1]: X_train.columns
#Out[1]:
#Index(['hr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'instant',
#       'mnth', 'yr', 'Clear to partly cloudy', 'Light Precipitation', 'Misty'],
#      dtype='object')

#In [2]: rf.feature_importances_
#Out[2]:
#array([  7.67785168e-01,   3.82731944e-03,   1.45434673e-01,
#         1.88155187e-02,   2.59739711e-02,   7.00403334e-03,
#         2.19497935e-02,   5.69578113e-04,   0.00000000e+00,
#         2.23504749e-03,   4.95581152e-03,   1.44908498e-03])
#################################################

![_images/10.2.svg](_images/10.2.svg)

Apparently, hr and workingday are the most important features according to rf. The importances of these two features add up to more than 90%!

**Adaboost**
___
- **Boosting**
    - ensemble method combining several weak learners to form a strong learner
    - **weak learner** - model doing slightly better than random guessing
    - example: decision stump (CART with maximum depth of 1)
    - predictors are trained sequentially, with each predictor trying to correct its predecessor's errors
    - Most popular methods
        - AdaBoost
        - Gradient Boosting
- **Adaboost**
    - stands for adaptive boosting
    - each predictor pays more attention to the instances wrongly predicted by its predecessor
    - achieved by changing the weights of training instances
    - each predictor is assigned a coefficient alpha
        - alpha depends on the predictor's training error
    - learning rate eta
        - between 0 and 1
        - used to shrink coefficient alpha of a trained predictor
    - Classification
        - Weighted majority voting
        - in sklearn: AdaBoostClassifier
    - Regression
        - Weighted average
        - in sklearn: AdaBoostRegressor
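
One round of the instance re-weighting can be sketched with the textbook binary AdaBoost update (labels in {-1, +1}). This is an illustrative assumption: the function name `adaboost_round` and the toy predictions are made up, and scikit-learn's `AdaBoostClassifier` uses its own variant of these formulas.

```python
from math import exp, log

def adaboost_round(y_true, y_pred, weights, eta=1.0):
    """One textbook AdaBoost round for labels in {-1, +1}.

    Returns the predictor's coefficient alpha and the re-normalized weights.
    Assumes the weighted error lies strictly between 0 and 1.
    """
    # Weighted training error of the current predictor
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Coefficient alpha, shrunk by the learning rate eta
    alpha = eta * 0.5 * log((1 - err) / err)
    # Misclassified instances (t * p = -1) gain weight, correct ones lose weight
    new_w = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]      # the second instance is misclassified
weights = [0.2] * 5              # start with uniform weights

alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(alpha, new_weights)
```

After the update the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.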

In [6]:
#Define the AdaBoost classifier

#In the following exercises you'll revisit the Indian Liver Patient
#dataset which was introduced in a previous chapter. Your task is to
#predict whether a patient suffers from a liver disease using 10 features
#including Albumin, age and gender. However, this time, you'll be training
#an AdaBoost ensemble to perform the classification task. In addition,
#given that this dataset is imbalanced, you'll be using the ROC AUC score
#as a metric instead of accuracy.

#As a first step, you'll start by instantiating an AdaBoost classifier.

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Instantiate ada
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1)


In [None]:
#Train the AdaBoost classifier

#Now that you've instantiated the AdaBoost classifier ada, it's time to
#train it. You will also predict the probabilities of obtaining the
#positive class in the test set. This can be done as follows:

#Once the classifier ada is trained, call the .predict_proba() method
#by passing X_test as a parameter and extract these probabilities by
#slicing all the values in the second column as follows:

#ada.predict_proba(X_test)[:,1]

#The Indian Liver dataset is processed for you and split into 80% train
#and 20% test. Feature matrices X_train and X_test, as well as the arrays
#of labels y_train and y_test are available in your workspace. In addition,
#we have also loaded the instantiated model ada from the previous exercise.

# Fit ada to the training set
ada.fit(X_train, y_train)

# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]

In [None]:
#Evaluate the AdaBoost classifier

#Now that you're done training ada and predicting the probabilities of
#obtaining the positive class in the test set, it's time to evaluate
#ada's ROC AUC score. Recall that the ROC AUC score of a binary classifier
#can be determined using the roc_auc_score() function from sklearn.metrics.

#The arrays y_test and y_pred_proba that you computed in the previous exercise
#are available in your workspace.

# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))

#################################################
#<script.py> output:
#    ROC AUC score: 0.71
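
To make the column slicing used above concrete, here is a minimal sketch on synthetic data (an assumption; it is not the Indian Liver dataset). The names `X_demo`, `y_demo` and `ada_demo` are hypothetical. It illustrates that `predict_proba()` returns one column per class, ordered as in `classes_`, so `[:, 1]` selects the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary data standing in for the real dataset (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Default base learner (a depth-1 tree); kept small so it runs quickly
ada_demo = AdaBoostClassifier(n_estimators=20, random_state=1)
ada_demo.fit(X_demo, y_demo)

proba = ada_demo.predict_proba(X_demo)

# One row per sample, one column per class, in the order of ada_demo.classes_
print(ada_demo.classes_)
print(proba.shape)

# Column 1 holds the estimated probability of class 1 (the positive class)
pos_proba = proba[:, 1]

# Each row of class probabilities sums to 1
print(np.allclose(proba.sum(axis=1), 1.0))
```

These positive-class probabilities, rather than hard class labels, are what `roc_auc_score()` expects as its second argument.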

**Gradient Boosting**
___
- sequential correction of predecessor's errors
- does not tweak weights of training instances
- each predictor is trained using its predecessor's residual errors as labels
- Gradient Boosted Trees: a CART is used as a base learner
- **shrinkage**
    - the prediction of each tree in the ensemble is shrunk by multiplying it by the learning rate eta (typically between 0 and 1)
- **prediction**
    - Regression
        - ypred = y1 + eta\*res1 +...+ eta\*resn
        - sklearn: GradientBoostingRegressor
    - Classification
        - sklearn: GradientBoostingClassifier
___
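
The notes above can be sketched by hand for regression with squared error. This is a minimal illustration under stated assumptions, not how `GradientBoostingRegressor` is implemented: the toy data, the names `X_toy`, `y_toy`, `eta` and `n_trees`, and the choice to start from the mean of the targets (instead of a first tree's prediction `y1`) are all assumptions made for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption: synthetic data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

eta = 0.1      # learning rate (shrinkage)
n_trees = 50   # number of boosting rounds

# Start from a constant prediction: the mean of the targets
y_hat = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - y_hat) ** 2)]
trees = []

for _ in range(n_trees):
    # Each tree is trained on the residuals left by its predecessors
    residuals = y_toy - y_hat
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Shrink the tree's contribution by eta before adding it
    y_hat = y_hat + eta * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - y_hat) ** 2))

print('Training MSE before boosting: {:.4f}'.format(mse_history[0]))
print('Training MSE after boosting:  {:.4f}'.format(mse_history[-1]))
```

The training error shrinks round after round because every tree corrects part of what the previous ones missed. A smaller `eta` usually needs more trees to reach the same fit.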

In [7]:
#Define the GB regressor

#You'll now revisit the Bike Sharing Demand dataset that was introduced
#in the previous chapter. Recall that your task is to predict the bike
#rental demand using historical weather data from the Capital Bikeshare
#program in Washington, D.C. For this purpose, you'll be using a gradient
#boosting regressor.

#As a first step, you'll start by instantiating a gradient boosting regressor
#which you will train in the next exercise.

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
            n_estimators=200,
            random_state=2)

In [None]:
#Train the GB regressor

#You'll now train the gradient boosting regressor gb that you instantiated
#in the previous exercise and predict test set labels.

#The dataset is split into 80% train and 20% test. Feature matrices X_train
#and X_test, as well as the arrays y_train and y_test are available in your
#workspace. In addition, we have also loaded the model instance gb that you
#defined in the previous exercise.

# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

In [None]:
#Evaluate the GB regressor

#Now that the test set predictions are available, you can use them to
#evaluate the test set Root Mean Squared Error (RMSE) of gb.

#y_test and predictions y_pred are available in your workspace.

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute MSE
mse_test = MSE(y_test, y_pred)

# Compute RMSE
rmse_test = mse_test**(1/2)

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))

#################################################
#<script.py> output:
#    Test set RMSE of gb: 52.065
#################################################

**Stochastic Gradient Boosting (SGB)**
___
- each CART is trained to find the best split points and features
    - may lead to CARTs using same split points or features
    - SGB mitigates these above effects
- each tree is trained on a random subset of rows of the training data
- instances (40-80% of the training set) are sampled without replacement
- features are sampled without replacement when choosing the best split points
- **Result**: further diversity of ensemble of trees
- **Effect**: adds further variance to the ensemble of trees
- residuals are multiplied by the learning rate eta for each tree in the ensemble
---
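
The two sampling steps described above can be sketched with NumPy. This is only an illustration of sampling without replacement: the sizes and the names `n_rows`, `n_features`, `row_idx` and `feat_idx` are hypothetical, and scikit-learn performs the equivalent draws internally (rows per tree via `subsample`, features per split via `max_features`).

```python
import numpy as np

rng = np.random.default_rng(2)

n_rows, n_features = 10, 4   # hypothetical training-set shape
subsample = 0.8              # fraction of rows given to each tree
max_features = 0.75          # fraction of features considered at a split

# Rows for one tree: drawn without replacement, so no row appears twice
row_idx = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)

# Features for one split: also drawn without replacement
feat_idx = rng.choice(n_features, size=int(max_features * n_features), replace=False)

print(sorted(row_idx.tolist()))
print(sorted(feat_idx.tolist()))
```

Because every tree sees a different subset of rows and candidate features, the trees are less likely to reuse the same split points.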

In [1]:
#Regression with SGB

#As in the exercises from the previous lesson, you'll be working with
#the Bike Sharing Demand dataset. In the following set of exercises,
#you'll solve this bike count regression problem using stochastic
#gradient boosting.

# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4,
                                subsample=0.9,
                                max_features=0.75,
                                n_estimators=200,
                                random_state=2)

#Train the SGB regressor

#In this exercise, you'll train the SGBR sgbr instantiated in the
#previous exercise and predict the test set labels. The bike sharing
#demand dataset is already loaded and processed for you; it is split into
#80% train and 20% test. The feature matrices X_train and X_test, the
#arrays of labels y_train and y_test, and the model instance sgbr that
#you defined in the previous exercise are available in your workspace.

# Fit sgbr to the training set
sgbr.fit(X_train, y_train)

# Predict test set labels
y_pred = sgbr.predict(X_test)

#Evaluate the SGB regressor

#You have prepared the ground to determine the test set RMSE of sgbr
#which you shall evaluate in this exercise.

#y_pred and y_test are available in your workspace.

# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute test set MSE
mse_test = MSE(y_test, y_pred)

# Compute test set RMSE
rmse_test = mse_test**(1/2)

# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))

#################################################
#<script.py> output:
#    Test set RMSE of sgbr: 49.979
#################################################

**Tuning a CART's Hyperparameters**
___
- **Optimal Model**: yields an optimal score (sklearn defaults: accuracy for classification, R squared for regression)
- Cross-validation is used to estimate the generalization performance
- Why tune hyperparameters?
    - default hyperparameters are not optimal for all problems
- Approaches include:
    - **Grid Search**
    - Random Search
    - Bayesian Optimization
    - Genetic Algorithms
    - ...
- **Grid Search**
    - manually set a grid of discrete hyperparameter values
    - set a metric for scoring model performance
    - search exhaustively through the grid
    - for each set of hyperparameters, evaluate the model's CV score
    - optimal hyperparameters achieve best CV score
    - *Note*: suffers from the curse of dimensionality
        - the bigger the grid, the longer it takes to find a solution
---
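
The grid search steps listed above can be written out as an explicit loop. This is a minimal sketch on synthetic data (an assumption); `X_demo`, `y_demo`, `grid` and `results` are hypothetical names, and the grid values mirror the ones used in the exercise that follows. `GridSearchCV` automates the same idea.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for a real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# 1. Manually set a grid of discrete hyperparameter values
grid = {'max_depth': [2, 3, 4],
        'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}

results = []
# 3. Search exhaustively through every combination in the grid
for max_depth, min_samples_leaf in product(grid['max_depth'],
                                           grid['min_samples_leaf']):
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=1)
    # 2. + 4. Score each candidate with 5-fold CV using the chosen metric
    cv_auc = cross_val_score(model, X_demo, y_demo,
                             scoring='roc_auc', cv=5).mean()
    results.append((cv_auc, max_depth, min_samples_leaf))

# 5. The optimal hyperparameters are those with the best mean CV score
best_auc, best_depth, best_leaf = max(results)
print('Combinations tried:', len(results))
print('Best CV ROC AUC: {:.3f} (max_depth={}, min_samples_leaf={})'.format(
    best_auc, best_depth, best_leaf))
```

Adding one more hyperparameter with k candidate values multiplies the number of combinations by k, which is why large grids become slow.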

In [4]:
#Tree hyperparameters

#In the following exercises you'll revisit the Indian Liver Patient
#dataset which was introduced in a previous chapter.

#Your task is to tune the hyperparameters of a classification tree.
#Given that this dataset is imbalanced, you'll be using the ROC AUC
#score as a metric instead of accuracy.

#We have instantiated a DecisionTreeClassifier and assigned it to dt
#with sklearn's default hyperparameters. You can inspect the
#hyperparameters of dt in your console.

#################################################
#In [2]: print(dt.get_params())
#{'class_weight': None, 'criterion': 'gini', 'max_depth': None,
# 'max_features': None, 'max_leaf_nodes': None,
# 'min_impurity_decrease': 0.0, 'min_impurity_split': None,
# 'min_samples_leaf': 1, 'min_samples_split': 2,
# 'min_weight_fraction_leaf': 0.0, 'presort': False,
# 'random_state': 1, 'splitter': 'best'}
#################################################

#Set the tree's hyperparameter grid

#In this exercise, you'll manually set the grid of hyperparameters
#that will be used to tune the classification tree dt and find the
#optimal classifier in the next exercise.

#Define params_dt
params_dt = {'max_depth' : [2, 3, 4],
             'min_samples_leaf' : [0.12, 0.14, 0.16, 0.18]}

#Search for the optimal tree

#In this exercise, you'll perform grid search using 5-fold cross
#validation to find dt's optimal hyperparameters. Note that because
#grid search is an exhaustive process, it may take a lot of time to
#train the model. Here you'll only be instantiating the GridSearchCV
#object without fitting it to the training set. As discussed in the
#video, you can train such an object similar to any scikit-learn
#estimator by using the .fit() method:

#grid_object.fit(X_train, y_train)

#An untuned classification tree dt as well as the dictionary
#params_dt that you defined in the previous exercise are available
#in your workspace.

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                      param_grid=params_dt,
                      scoring='roc_auc',
                      cv=5,
                      n_jobs=-1)

#Evaluate the optimal tree

#In this exercise, you'll evaluate the test set ROC AUC score of
#grid_dt's optimal model.

#In order to do so, you will first determine the probability of
#obtaining the positive label for each test set observation. You can
#use the method predict_proba() of an sklearn classifier to compute a
#2D array containing the probabilities of the negative and positive
#class-labels respectively along columns.

#The dataset is already loaded and processed for you (numerical
#features are standardized); it is split into 80% train and 20% test.
#X_test, y_test are available in your workspace. In addition, we have
#also loaded the trained GridSearchCV object grid_dt that you
#instantiated in the previous exercise. Note that grid_dt was trained
#as follows:

#grid_dt.fit(X_train, y_train)

# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

#################################################
#<script.py> output:
#    Test set ROC AUC score: 0.610
#################################################
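
Since the fitting step is not shown in the exercises above, here is a minimal end-to-end sketch on synthetic data (an assumption; the resulting scores say nothing about the Indian Liver Patient dataset). The names `grid_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical. It shows which attributes become available once a `GridSearchCV` object has been fitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, split 80% train / 20% test (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=1)

grid_demo = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1),
                         param_grid={'max_depth': [2, 3, 4],
                                     'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
                         scoring='roc_auc',
                         cv=5)
grid_demo.fit(X_tr, y_tr)

# Best hyperparameter combination and its mean cross-validated ROC AUC
print(grid_demo.best_params_)
print('Best CV ROC AUC: {:.3f}'.format(grid_demo.best_score_))

# With refit=True (the default), the best model is refit on all of X_tr,
# so the search object and best_estimator_ give the same predictions
proba_direct = grid_demo.predict_proba(X_te)[:, 1]
proba_best = grid_demo.best_estimator_.predict_proba(X_te)[:, 1]
```

Note that `best_score_` is a cross-validation estimate computed on the training folds; the test-set score is computed separately, as in the exercise above.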

**Tuning a RF's Hyperparameters**
___
- Hyperparameters of Random Forests Include:
    - CART hyperparameters
    - number of estimators
    - whether to bootstrap (bootstrap=True/False)
- Hyperparameter tuning is:
    - computationally expensive
    - sometimes leads to very slight improvement
___
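
To put a number on "computationally expensive", the following sketch counts the work implied by the grid and the 3-fold cross-validation used in the exercise that follows. It only does arithmetic on the grid; `params_rf_demo`, `cv_folds` and the other names are hypothetical.

```python
from itertools import product

# Same grid values as the exercise's params_rf
params_rf_demo = {'n_estimators': [100, 350, 500],
                  'max_features': ['log2', 'auto', 'sqrt'],
                  'min_samples_leaf': [2, 10, 30]}
cv_folds = 3

# Every combination of one value per hyperparameter
combos = list(product(*params_rf_demo.values()))
n_combos = len(combos)

# Each combination is fit once per CV fold
n_cv_fits = n_combos * cv_folds

# Each forest fit grows n_estimators trees (first element of each combo)
n_trees_grown = sum(combo[0] for combo in combos) * cv_folds

print(n_combos)       # 27
print(n_cv_fits)      # 81
print(n_trees_grown)  # 25650
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the full training set.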

In [None]:
#Random forests hyperparameters

#In the following exercises, you'll be revisiting the Bike Sharing
#Demand dataset that was introduced in a previous chapter. Recall
#that your task is to predict the bike rental demand using historical
#weather data from the Capital Bikeshare program in Washington, D.C.
#For this purpose, you'll be tuning the hyperparameters of a Random
#Forests regressor.

#We have instantiated a RandomForestRegressor called rf using sklearn's
#default hyperparameters. You can inspect the hyperparameters of rf in
#your console.

#################################################
#In [1]: print(rf.get_params())
#{'bootstrap': True, 'criterion': 'mse', 'max_depth': None,
#'max_features': 'auto', 'max_leaf_nodes': None,
#'min_impurity_decrease': 0.0, 'min_impurity_split': None,
#'min_samples_leaf': 1, 'min_samples_split': 2,
#'min_weight_fraction_leaf': 0.0, 'n_estimators': 10,
#'n_jobs': -1, 'oob_score': False, 'random_state': 2,
#'verbose': 0, 'warm_start': False}
#################################################

#Set the hyperparameter grid of RF

#In this exercise, you'll manually set the grid of hyperparameters
#that will be used to tune rf's hyperparameters and find the optimal
#regressor. For this purpose, you will be constructing a grid of
#hyperparameters and tune the number of estimators, the maximum
#number of features used when splitting each node and the minimum
#number of samples (or fraction) per leaf.

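# Define the dictionary 'params_rf'
# (newer scikit-learn releases no longer accept 'auto' for max_features;
#  for a regressor it meant "use all features", i.e. max_features=1.0)
params_rf = {'n_estimators':[100,350,500],
             'max_features':['log2','auto','sqrt'],
             'min_samples_leaf':[2,10,30]}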

#Search for the optimal forest

#In this exercise, you'll perform grid search using 3-fold cross
#validation to find rf's optimal hyperparameters. To evaluate each
#model in the grid, you'll be using the negative mean squared error
#metric.

#Note that because grid search is an exhaustive search process, it
#may take a lot of time to train the model. Here you'll only be
#instantiating the GridSearchCV object without fitting it to the
#training set. As discussed in the video, you can train such an
#object similar to any scikit-learn estimator by using the .fit()
#method:

#grid_object.fit(X_train, y_train)

#The untuned random forests regressor model rf as well as the
#dictionary params_rf that you defined in the previous exercise are
#available in your workspace.

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                      param_grid=params_rf,
                      scoring='neg_mean_squared_error',
                      cv=3,
                      verbose=1,
                      n_jobs=-1)

#Evaluate the optimal forest

#In this last exercise of the course, you'll evaluate the test set
#RMSE of grid_rf's optimal model.

#The dataset is already loaded and processed for you and is split
#into 80% train and 20% test. X_test, y_test and the function
#mean_squared_error from sklearn.metrics (under the alias MSE) are
#available in your environment. In addition, we have also
#loaded the trained GridSearchCV object grid_rf that you instantiated
#in the previous exercise. Note that grid_rf was trained as follows:

#grid_rf.fit(X_train, y_train)

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))

#################################################
#<script.py> output:
#    Test RMSE of best model: 50.569
#################################################
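
One detail of the 'neg_mean_squared_error' scoring used above: scikit-learn scorers follow a "greater is better" convention, so the MSE is reported with a negative sign. The sketch below converts such a score back to an RMSE. The value of `neg_mse_demo` is hypothetical and stands in for what `grid_rf.best_score_` would hold; it is not a result from the exercise.

```python
import math

# Hypothetical stand-in for grid_rf.best_score_ (a negative MSE)
neg_mse_demo = -2500.0

# Flip the sign to recover the MSE, then take the square root
cv_rmse_demo = math.sqrt(-neg_mse_demo)

print('CV RMSE: {:.3f}'.format(cv_rmse_demo))  # CV RMSE: 50.000
```

This cross-validated RMSE is measured on the training folds, so it can differ from the test-set RMSE computed in the exercise.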

**Summary**
___
- **Chapter 1**: Decision-Tree Learning
- **Chapter 2**: Generalization Error, Cross-Validation, Ensembling
- **Chapter 3**: Bagging and Random Forests
- **Chapter 4**: AdaBoost and Gradient Boosting
- **Chapter 5**: Model Tuning