# Decision Tree & Ensemble Learning

Classification And Regression Trees (CART for short) is a term introduced by [Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman) to refer to Decision Tree algorithms that can be used for classification or regression predictive modeling problems.

In this lab assignment, you will implement various ways to calculate impurity which is used to split data in constructing the decision trees and apply the Decision Tree and ensemble learning algorithms to solve two real-world problems: a classification one and a regression one. 

In [0]:
#import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

In [0]:
# helper functions used in this lab
def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5, contour=True):
    """
    Plot the decision boundary of a learnt classifier
    """
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=1)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

## Gini impurity and Entropy


#### Gini impurity

The CART algorithm recursively splits the training set into two subsets using a single feature k and a threshold $t_k$. The best feature and threshold are chosen to produce the purest subsets weighted by their size. **Gini impurity** measures the impurity of the data points in a set and is used to evaluate how good a split is when the CART algorithm searches for the best pair of feature and the threshold.

To compute Gini impurity for a set of items with J classes, suppose $i \in \{1, 2, \dots, J\}$ and let $p_i$ be the fraction of items labeled with class i in the set.
\begin{align}
I(p) = 1 - \sum_{i=1}^J p_i^2
\end{align}

The following function calculates the gini impurity for a given set of data points.

In [0]:
def gini_impurity(x):
    """
    This function calculate the Gini impurity for a given set of data points.

    Args:
    x: a numpy ndarray
    """
    unique, counts = np.unique(x, return_counts=True)
    probabilities = counts / sum(counts)
    gini = 1 - sum([p*p for p in probabilities])

    return gini

In [0]:
np.testing.assert_equal(0, gini_impurity(np.array([1, 1, 1])))
np.testing.assert_equal(0.5, gini_impurity(np.array([1, 0, 1, 0])))
np.testing.assert_equal(3/4, gini_impurity(np.array(['a', 'b', 'c', 'd'])))
np.testing.assert_almost_equal(2.0/3, gini_impurity(np.array([1, 2, 3, 1, 2, 3])))

#### Entropy

Another popular measure of impurity is called **entropy**, which measures the average information content of a message. Entropy is zero when all messages are identical. When it applied to CART, a set's entropy is zero when it contains instances of only one class. Entropy is calculated as follows:
\begin{align}
I(p) = - \sum_{i=1}^J p_i log_2{p_i}
\end{align}

<span style="color:orange">**Question 1: In this exercise, you will implement the entropy function.**

In [0]:
def entropy(x):
    probs = [np.mean(x == c) for c in set(x)]
    return np.sum(-p * np.log2(p) for p in probs)

In [6]:
np.testing.assert_equal(0, entropy(np.array([1, 1, 1])))
np.testing.assert_equal(1.0, entropy(np.array([1, 0, 1, 0])))
np.testing.assert_equal(2.0, entropy(np.array(['a', 'b', 'c', 'd'])))
np.testing.assert_almost_equal(1.58496, entropy(np.array([1, 2, 3, 1, 2, 3])), 4)

  This is separate from the ipykernel package so we can avoid doing imports until


---

## Iris dataset

The Iris data set contains the morphologic variation of Iris flowers of three related species (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each observation (see image below):
- Sepal.Length: sepal length in centimeters.
- Sepal.Width: sepal width in centimeters.
- Petal.Length: petal length in centimeters.
- Petal.Width: petal width in centimeters.

<table>
  <tr>
    <td><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg/180px-Kosaciec_szczecinkowaty_Iris_setosa.jpg" style="width:250px"></td>
    <td><img src="https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg" width="250px"></td>
    <td><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Iris_virginica.jpg/295px-Iris_virginica.jpg" width="250px"></td>
  </tr>
  <tr>
    <td>Iris setosa</td>
    <td>Iris versicolor</td>
    <td>Iris virginica</td>
  </tr>
</table>


In [7]:
# load the iris train and test data from CSV files
train = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x = train.iloc[:,0:4]
train_y = train.iloc[:,4]

test_x = test.iloc[:,0:4]
test_y = test.iloc[:,4]

# print the number of instances in each class
print(train_y.value_counts().sort_index())
print(test_y.value_counts().sort_index())

print("Number of train data: {}".format(len(train_y)))
print("Number of test data: {}".format(len(test_y)))

Iris-setosa        34
Iris-versicolor    32
Iris-virginica     39
Name: species, dtype: int64
Iris-setosa        16
Iris-versicolor    18
Iris-virginica     11
Name: species, dtype: int64
Number of train data: 105
Number of test data: 45


### Decision Tree Classifier

<span style="color:orange">**In this exercise, we will apply the Decision Tree classifier to classify the Iris flower data.**

#### Train and visualize a simple Decision Tree

<span style="color:orange">**Question 2: create a decision tree with max_depth of 2.**

In [0]:
# load the iris train and test data from CSV files
train = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x = train.iloc[:,0:4]
train_y = train.iloc[:,4]

test_x = test.iloc[:,0:4]
test_y = test.iloc[:,4]

In [9]:
# TODO: read the scikit-learn doc on DecisionTreeClassifier and train a Decision Tree with max depth of 2

parameters = {
    "max_depth": [2], 
    #"min_samples_split": [0.05, 0.1, 0.2]
}

dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, parameters, cv=3)
dtc_grid.fit(train_x, train_y)

# summarize the results of the grid search
print("The best score is {}".format(dtc_grid.best_score_))
print("The best hyper parameter setting is {}".format(dtc_grid.best_params_))

# model initialization
dt_model = DecisionTreeClassifier(max_depth=2)

# train the model
dt_model.fit(train_x, train_y)

The best score is 0.9428571428571427
The best hyper parameter setting is {'max_depth': 2}


DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Now let's visualize the decision tree we just trained on the iris dataset and see how it makes predictions. Note that if the following code does not work for you because the graphviz is missing, do not worry about it and you should still be able to move on.

In [10]:
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

#dot_data = StringIO()
#feature_names = train_x.columns
#class_names = train_y.unique()
#class_names.sort()
#export_graphviz(dtc, out_file=dot_data, feature_names=feature_names, class_names=class_names, filled=True, rounded=True)
#graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
#Image(graph.create_png())



Decision trees are easy to inteprete and is often referred to as *whitebox* machine learning algorithm. Let's see how this decision tree represented above makes predictions. Suppose you find an iris flower and want to classify it into setosa, versicolor or virginica. You start at the root node (the very top node in the tree). In this node, we check if the flower's patel length is smaller than or equal to 2.35 cm. If it is, we move to the left child and predict setosa to be its class. Otherwise, we move to the right child node. Then similarly we check if the petal length is smaller than or equal to 4.95 cm. If it is, we move to its left child node and predict versicolor to be its class. Otherwise, we move to its right child and predict virginica to be its class. 

#### Prediction with Decision tree

With this simple decision tree above, we can apply it to make predictions on the test dataset and evaluate its performance.

<span style="color:orange">**Question 3: make prediction using the trained decision tree model on the test data.**

In [11]:
# TODO: use the trained decision tree model to make predictions on the test data and evaluate the model performance.

train_z = dt_model.predict(train_x)
train_z_prob = dt_model.predict_proba(train_x)[:,1]

test_z = dt_model.predict(test_x)
test_z_prob = dt_model.predict_proba(test_x)[:,1]

print("model accuracy on train set: {}".format(accuracy_score(train_y, train_z)))
print("model confusion matrix (Train Set):\n {}".format(confusion_matrix(train_y, train_z, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])))

print("model accuracy on test set: {}".format(accuracy_score(test_y, test_z)))
print("model confusion matrix (Test Set):\n {}".format(confusion_matrix(test_y, test_z, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])))

model accuracy on train set: 0.9619047619047619
model confusion matrix (Train Set):
 [[34  0  0]
 [ 0 31  1]
 [ 0  3 36]]
model accuracy on test set: 0.9111111111111111
model confusion matrix (Test Set):
 [[16  0  0]
 [ 0 17  1]
 [ 0  3  8]]


#### Hyper-parameters

Hyper-parameter controls the complexity of the decision tree model. For example, the deeper the tree is, the more complex patterns the model will be able to capture. In this exercise, we train the decision trees with increasing number of maximum depth and plot its performance. We should see the accuracy of the training data increase as the tree grows deeper, but the accuracy on the test data might not as the model will eventually start to overfit and does not generalize well on the unseen test data.

<span style="color:orange">**Question 4: for each value of max_depth, we train a decision tree model and evaluate its accuracy on both train and test data, and plot both accuracies in the figure.**

In [0]:
# load the iris train and test data from CSV files
train_dtc = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test_dtc = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x_dtc = train.iloc[:,0:4]
train_y_dtc = train.iloc[:,4]

test_x_dtc = test.iloc[:,0:4]
test_y_dtc = test.iloc[:,4]

In [13]:
# TODO: train the decision tree model with various max_depth, make predictions and evaluate on both train and test data.

parameters = {
    "max_depth": [2, 3, 4, 5, 6] 
    # "min_samples_split": [0.05, 0.1, 0.2]
}

dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, parameters, cv=3)
dtc_grid.fit(train_x_dtc, train_y_dtc)

# summarize the results of the grid search
print("The best score is {}".format(dtc_grid.best_score_))
print("The best hyper parameter setting is {}".format(dtc_grid.best_params_))

The best score is 0.9619047619047619
The best hyper parameter setting is {'max_depth': 3}


In [0]:
# make prediction and evaluate the model performance on test data
test_z_dtc = dtc_grid.predict(test_x_dtc)
test_z_prob_dtc = dtc_grid.predict_proba(test_x_dtc)[:,1]

train_z_dtc = dtc_grid.predict(train_x_dtc)
train_z_prob_dtc = dtc_grid.predict_proba(train_x_dtc)[:,1]

In [15]:
print("model accuracy on train set: {}".format(accuracy_score(train_y_dtc, train_z_dtc)))
print("model confusion matrix (Train Set):\n {}".format(confusion_matrix(train_y, train_z, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])))

print("model accuracy on test set: {}".format(accuracy_score(test_y_dtc, test_z_dtc)))
print("model confusion matrix (Test Set):\n {}".format(confusion_matrix(test_y, test_z, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])))

model accuracy on train set: 0.9809523809523809
model confusion matrix (Train Set):
 [[34  0  0]
 [ 0 31  1]
 [ 0  3 36]]
model accuracy on test set: 0.9777777777777777
model confusion matrix (Test Set):
 [[16  0  0]
 [ 0 17  1]
 [ 0  3  8]]


In [0]:
# plot the decision boundary for decision tree classifier
#plot_decision_boundary(dtc_grid, train_x_dtc.values, train_y_dtc.values)

#### Fine-tune the decision tree classifier

Decision tree is a very powerful model with very few assumptions about the incoming training data (unlike linear models, which assume the data linear), however, it is more likely to overfit the data and won't generalize well to unseen data. To void overfitting, we need to restrict the decision tree's freedom during training via regularization (e.g. max_depth, min_sample_split, max_leaf_nodes and etc.).

To fine-tune the model and combat overfitting, use grid search with cross-validation (with the help of the GridSearchCV class) to find the best hyper-parameter settings for the DecisionTreeClassifier. In particular, we would like to fine-tune the following hyper-parameters:
- **criteria**: this defines how we measure the quality of a split. we can choose either "gini" for the Gini impurity or "entropy" for the information gain.
- **max_depth**: the maximum depth of the tree. This indicates how deep the tree can be. The deeper the tree, the more splits it has and it captures more information about the data. But meanwhile, deeper trees are more likely to overfit the data. For this practice, we will choose from {1, 2, 3} given there are only 4 features in the iris dataset.
- **min_samples_split**: This value represents the minimum number of samples required to split an internal node. The smaller this value is, the deeper the tree will grow, thus more likely to overfit. On the other hand, if the value is really large (the size of the training data in the extreme case), the tree will be very shallow and could suffer from underfit. In this practice, we choose from {0.01, 0.05, 0.1, 0.2}.

<span style="color:orange">**Question 5: Use grid search with 3-fold cross-validation to fine-tune the decision tree model and output the best hyper-parameters.**

In [0]:
# load the iris train and test data from CSV files
train = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x = train.iloc[:,0:4]
train_y = train.iloc[:,4]

test_x = test.iloc[:,0:4]
test_y = test.iloc[:,4]

In [18]:
# TODO: fine-tune the model, use grid search with 3-fold cross-validation.
parameters = {
    "max_depth": [2, 3, 4, 5, 6], 
    "min_samples_split": [0.05, 0.1, 0.2]
    #"criteria": []
}

dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, parameters, cv=3)
dtc_grid.fit(train_x, train_y)

# summarize the results of the grid search
print("The best score is {}".format(dtc_grid.best_score_))
print("The best hyper parameter setting is {}".format(dtc_grid.best_params_))

The best score is 0.9619047619047619
The best hyper parameter setting is {'max_depth': 3, 'min_samples_split': 0.05}


#### Prediction and Evaluation

Now we have a fine-tuned decision tree classifier based on the training data, let's apply this model to make predictions on the test data and evaluate its performance.

In [19]:
dtc_grid.predict(test_x)

print("model accuracy: {}".format(accuracy_score(test_y, test_z)))
print("model confusion matrix:\n {}".format(confusion_matrix(test_y, test_z, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])))

model accuracy: 0.9111111111111111
model confusion matrix:
 [[16  0  0]
 [ 0 17  1]
 [ 0  3  8]]


### Random Forest

**Question 6: Apply Random Forest together with Gridsearch to the Iris dataset and evaluate its accuracy.**

In [0]:
### TODO
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

In [21]:
# load the iris train and test data from CSV files
train_rf = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test_rf = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x_rf = train_rf.iloc[:,0:4]
train_y_rf = train_rf.iloc[:,4]

test_x_rf = test_rf.iloc[:,0:4]
test_y_rf = test_rf.iloc[:,4]

# print the number of instances in each class
print(train_y_rf.value_counts().sort_index())
print(test_y_rf.value_counts().sort_index())

Iris-setosa        34
Iris-versicolor    32
Iris-virginica     39
Name: species, dtype: int64
Iris-setosa        16
Iris-versicolor    18
Iris-virginica     11
Name: species, dtype: int64


In [22]:
parameters = {
    "n_estimators": [20, 40],
    "max_depth": [2, 4], 
    "min_samples_split": [0.05, 0.1, 0.2]
}

rfc_grid = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0), parameters, cv=3)
rfc_grid.fit(train_x_rf, train_y_rf)

# summarize the results of the grid search
print("The best score is {}".format(rfc_grid.best_score_))
print("The best hyper parameter setting is {}".format(rfc_grid.best_params_))

The best score is 0.9619047619047619
The best hyper parameter setting is {'max_depth': 2, 'min_samples_split': 0.05, 'n_estimators': 40}


In [23]:
#make prediction and evaluate the model performance on test data
test_z_rf = rfc_grid.predict(test_x_rf)
test_z_prob_rf = rfc_grid.predict_proba(test_x_rf)[:,1]

print("model accuracy: {}".format(accuracy_score(test_y_rf, test_z_rf)))

model accuracy: 0.9555555555555556


### Adaboost

**Question 7: Apply Adaboost together with Gridsearch to the Iris dataset and evaluate its accuracy.**

In [0]:
from sklearn.ensemble import AdaBoostClassifier

In [25]:
# load the iris train and test data from CSV files
train_ada = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test_ada = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x_ada = train_ada.iloc[:,0:4]
train_y_ada = train_ada.iloc[:,4]

test_x_ada = test_ada.iloc[:,0:4]
test_y_ada = test_ada.iloc[:,4]

# print the number of instances in each class
print(train_y_ada.value_counts().sort_index())
print(test_y_ada.value_counts().sort_index())

Iris-setosa        34
Iris-versicolor    32
Iris-virginica     39
Name: species, dtype: int64
Iris-setosa        16
Iris-versicolor    18
Iris-virginica     11
Name: species, dtype: int64


In [26]:
### TODO ADA BOOST
parameters = {
    "n_estimators": [20, 40],
    "learning_rate": [0.01, 0.1, 1, 10]
}

adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=4), random_state=0)
adaboost_grid = GridSearchCV(adaboost, parameters, cv=3)
adaboost_grid.fit(train_x_ada, train_y_ada)

# summarize the results of the grid search
print("The best score is {}".format(adaboost_grid.best_score_))
print("The best hyper parameter setting is {}".format(adaboost_grid.best_params_))

The best score is 0.9523809523809522
The best hyper parameter setting is {'learning_rate': 0.01, 'n_estimators': 20}


In [27]:
# TODO: make prediction and evaluate the model performance on test data
test_z_ada = adaboost_grid.predict(test_x_ada)
test_z_prob_ada = adaboost_grid.predict_proba(test_x_ada)[:,1]

print("model accuracy: {}".format(accuracy_score(test_y_ada, test_z_ada)))

model accuracy: 0.9777777777777777


### Gradient Boosting

**Question 8: Apply Boosting together with Gridsearch to the Iris dataset and evaluate its accuracy.**

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

In [29]:
# load the iris train and test data from CSV files
train_gb = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_train.csv')
test_gb = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/iris_test.csv')

train_x_gb = train_gb.iloc[:,0:4]
train_y_gb = train_gb.iloc[:,4]

test_x_gb = test_gb.iloc[:,0:4]
test_y_gb = test_gb.iloc[:,4]

# print the number of instances in each class
print(train_y_gb.value_counts().sort_index())
print(test_y_gb.value_counts().sort_index())

Iris-setosa        34
Iris-versicolor    32
Iris-virginica     39
Name: species, dtype: int64
Iris-setosa        16
Iris-versicolor    18
Iris-virginica     11
Name: species, dtype: int64


In [30]:
#fine-tune Gradient Boosted Trees using grid search with cross-validation (GridSearchCV).
parameters = {
    "loss":["deviance"],
    "learning_rate": [0.01, 0.1],
    "min_samples_split": [0.05, 0.1, 0.2],
    "max_depth":[2, 4],
    "n_estimators":[100]
}

gbc_grid = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, n_jobs=-1)
gbc_grid.fit(train_x_gb, train_y_gb)

# summarize the results of the grid search
print("The best score is {}".format(gbc_grid.best_score_))
print("The best hyper parameter setting is {}".format(gbc_grid.best_params_))

The best score is 0.9619047619047619
The best hyper parameter setting is {'learning_rate': 0.01, 'loss': 'deviance', 'max_depth': 2, 'min_samples_split': 0.1, 'n_estimators': 100}


---

In [31]:
#make prediction and evaluate the model performance on test data
test_z_gb = gbc_grid.predict(test_x_gb)
test_z_prob_gb = gbc_grid.predict_proba(test_x_gb)[:,1]

print("model accuracy: {}".format(accuracy_score(test_y_gb, test_z_gb)))

model accuracy: 0.9777777777777777


**BONUS POINT: we will apply the supervised learning models we learnt so far to predict the California housing prices.**

## California Housing Dataset

The California Housing dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). 

In [0]:
# Load train and test data from CSV files.
train_ca = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_train.csv')
test_ca = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_test.csv')

train_x_ca = train_ca.iloc[:,0:8]
train_y_ca = train_ca.iloc[:,8]

test_x_ca = test_ca.iloc[:,0:8]
test_y_ca = test_ca.iloc[:,8]

**Decision Tree**

In [0]:
# Load train and test data from CSV files.
train_dt = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_train.csv')
test_dt = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_test.csv')

train_x_dt = train_dt.iloc[:,0:8]
train_y_dt = train_dt.iloc[:,8]

test_x_dt = test_dt.iloc[:,0:8]
test_y_dt = test_dt.iloc[:,8]

In [34]:
# Build Decision Tree classifier
parameters = {
    "max_depth": [2, 4, 6 , 8], 
    "min_samples_split": [0.05, 0.1, 0.2]
}

dtc = DecisionTreeClassifier()
dtc_grid = GridSearchCV(dtc, parameters, cv=3)
dtc_grid.fit(train_x_dt, train_y_dt)

# summarize the results of the grid search
print("The best Decision Tree score is {}".format(dtc_grid.best_score_))
print("The best Decision Tree hyper parameter setting is {}".format(dtc_grid.best_params_))



The best Decision Tree score is 0.04900332225913621
The best Decision Tree hyper parameter setting is {'max_depth': 6, 'min_samples_split': 0.05}


In [35]:
# make prediction and evaluate the model performance on test data
test_z_dt = dtc_grid.predict(test_x_dt)
test_z_prob_dt = dtc_grid.predict_proba(test_x_dt)[:,1]

print("Decision Tree model accuracy: {}".format(accuracy_score(test_y_dt, test_z_dt)))

Decision Tree model accuracy: 0.04861111111111111


**Random Forest**

In [0]:
# Load train and test data from CSV files.
train_rf = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_train.csv')
test_rf = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_test.csv')

train_x_rf = train_rf.iloc[:,0:8]
train_y_rf = train_rf.iloc[:,8]

test_x_rf = test_rf.iloc[:,0:8]
test_y_rf = test_rf.iloc[:,8]

In [37]:
# TODO: fine-tune Random Forest classifier using grid search with cross-validation (GridSearchCV).
parameters = {
    "n_estimators": [20, 40],
    "max_depth": [2, 4], 
    "min_samples_split": [0.05, 0.1, 0.2]
}

rfc_grid = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0), parameters, cv=3)
rfc_grid.fit(train_x_rf, train_y_rf)

# summarize the results of the grid search
print("The best Random Forest score is {}".format(rfc_grid.best_score_))
print("The best Random Foresthyper parameter setting is {}".format(rfc_grid.best_params_))



The best Random Forest score is 0.04741140642303434
The best Random Foresthyper parameter setting is {'max_depth': 4, 'min_samples_split': 0.05, 'n_estimators': 20}


In [38]:
# TODO: make prediction and evaluate the model performance on test data
test_z_rf = rfc_grid.predict(test_x_rf)
test_z_prob_rf = rfc_grid.predict_proba(test_x_rf)[:,1]

print("Random Forest model accuracy: {}".format(accuracy_score(test_y_rf, test_z_rf)))

Random Forest model accuracy: 0.04699612403100775


**ADABOOST**

In [0]:
from sklearn.ensemble import AdaBoostClassifier

In [0]:
# Load train and test data from CSV files.
train_ada = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_train.csv')
test_ada = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_test.csv')

train_x_ada = train_ada.iloc[:,0:8]
train_y_ada = train_ada.iloc[:,8]

test_x_ada = test_ada.iloc[:,0:8]
test_y_ada = test_ada.iloc[:,8]

In [41]:
# TODO: fine-tune Adaboost with decision tree (max_depth=4) as base learners using grid search with cross-validation (GridSearchCV).
parameters = {
    "n_estimators": [20, 40],
    "learning_rate": [0.01, 0.1, 1, 10]
}

adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=4), random_state=0)
adaboost_grid = GridSearchCV(adaboost, parameters, cv=3)
adaboost_grid.fit(train_x_ada, train_y_ada)

# summarize the results of the grid search
print("The best AdaBoost score is {}".format(adaboost_grid.best_score_))
print("The best AdaBoost hyper parameter setting is {}".format(adaboost_grid.best_params_))



The best AdaBoost score is 0.04934939091915836
The best AdaBoost hyper parameter setting is {'learning_rate': 0.01, 'n_estimators': 40}


In [42]:
# TODO: make prediction and evaluate the model performance on test data
test_z_ada = adaboost_grid.predict(test_x_ada)
test_z_prob_ada = adaboost_grid.predict_proba(test_x_ada)[:,1]

print("model accuracy: {}".format(accuracy_score(test_y_ada, test_z_ada)))

model accuracy: 0.04941860465116279


**Gradient Boost**

In [0]:
# Load train and test data from CSV files.
train_gb = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_train.csv')
test_gb = pd.read_csv('https://raw.githubusercontent.com/zariable/data/master/housing_test.csv')

train_x_gb = train_gb.iloc[:,0:8]
train_y_gb = train_gb.iloc[:,8]

test_x_gb = test_gb.iloc[:,0:8]
test_y_gb = test_gb.iloc[:,8]

In [0]:
# TODO: fine-tune Gradient Boosted Trees using grid search with cross-validation (GridSearchCV).
parameters = {
    "loss":["deviance"],
    "learning_rate": [0.01, 0.1],
    "min_samples_split": [0.05, 0.1, 0.2],
    "max_depth":[2, 4],
    "n_estimators":[100]
}

gbc_grid = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, n_jobs=-1)
gbc_grid.fit(train_x_gb, train_y_gb)

# summarize the results of the grid search
print("The best Gradient Boost score is {}".format(gbc_grid.best_score_))
print("The best Gradient Boost hyper parameter setting is {}".format(gbc_grid.best_params_))



In [0]:
# TODO: make prediction and evaluate the model performance on test data
test_z_gb = gbc_grid.predict(test_x_gb)
test_z_prob_gb = gbc_grid.predict_proba(test_x_gb)[:,1]

print("Gradient Boost model accuracy: {}".format(accuracy_score(test_y_gb, test_z_gb)))

# Results Summary

In [0]:
# Decision Tree Results
print("The best Decision Tree score is {}".format(dtc_grid.best_score_))
print("The best Decision Tree hyper parameter setting is {}".format(dtc_grid.best_params_))
print("Decision Tree model accuracy: {}".format(accuracy_score(test_y_dt, test_z_dt)))

In [0]:
# Random Forest Results
print("The best Random Forest score is {}".format(rfc_grid.best_score_))
print("The best Random Foresthyper parameter setting is {}".format(rfc_grid.best_params_))
print("Random Forest model accuracy: {}".format(accuracy_score(test_y_rf, test_z_rf)))

In [0]:
# AdaBoost Results
print("The best AdaBoost score is {}".format(adaboost_grid.best_score_))
print("The best AdaBoost hyper parameter setting is {}".format(adaboost_grid.best_params_))
print("model accuracy: {}".format(accuracy_score(test_y_ada, test_z_ada)))

In [0]:
# Gradient Boost Results
print("The best Gradient Boost score is {}".format(gbc_grid.best_score_))
print("The best Gradient Boost hyper parameter setting is {}".format(gbc_grid.best_params_))
print("Gradient Boost model accuracy: {}".format(accuracy_score(test_y_gb, test_z_gb)))

### End of Assignment 2
---
