# AdaBoost
For this exercise you will implement AdaBoost from scratch and apply it to a spam dataset, classifying each example as spam or not spam. You can call DecisionTreeClassifier from sklearn (with max_depth=1 as the default in this exercise) to learn your base classifiers.

Here is how you train a decision tree classifier with sample weights.

```python
h = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
h.fit(X, Y, sample_weight=w)
```
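
To see what `sample_weight` does, the following sketch fits a depth-1 tree twice on a toy dataset that is invented here for illustration (it is not part of the assignment data). No single threshold classifies all three points correctly, so the stump gives up the point that carries the least weight.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, labels in {-1, 1}.
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([1, -1, 1])

def fit_weighted_stump(X, y, w):
    """Fit a depth-1 tree using per-sample weights w."""
    h = DecisionTreeClassifier(max_depth=1, random_state=0)
    h.fit(X, y, sample_weight=w)
    return h

# Low weight on the first point: the stump sacrifices it.
w_a = np.array([0.10, 0.45, 0.45])
pred_a = fit_weighted_stump(X_toy, y_toy, w_a).predict(X_toy)

# Low weight on the last point: the stump sacrifices that one instead.
w_b = np.array([0.45, 0.45, 0.10])
pred_b = fit_weighted_stump(X_toy, y_toy, w_b).predict(X_toy)

# Weighted error: total weight of the misclassified points.
err_a = np.sum(w_a * (pred_a != y_toy))
err_b = np.sum(w_b * (pred_b != y_toy))
```

The same labels produce two different stumps because only the weights changed, which is the mechanism AdaBoost relies on when it reweights the data between rounds.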

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [2]:
# accuracy computation
# this data is not highly imbalanced, so accuracy is an acceptable metric
def accuracy(y, pred):
    return np.sum(y == pred) / float(len(y)) 

In [3]:
def parse_spambase_data(filename):
    """ Given a filename return X and Y numpy arrays

    X is of size number of rows x num_features
    Y is an array of size the number of rows
    Y is the last element of each row. (Convert 0 to -1)
    """
    ### BEGIN SOLUTION
   

    ### END SOLUTION
    return X, Y
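
One possible shape for the parser, as a sketch only: it assumes the file is comma-separated with the 0/1 label in the last column, which the docstring suggests but does not state. The name `parse_label_last_csv` is hypothetical, and the demo reads from an in-memory string rather than the assignment files.

```python
import io
import numpy as np

def parse_label_last_csv(source):
    """Read comma-separated rows (assumed format); last column is the label."""
    data = np.loadtxt(source, delimiter=",", ndmin=2)
    X = data[:, :-1]
    Y = data[:, -1].copy()
    Y[Y == 0] = -1  # map label 0 to -1 so labels are in {-1, 1}
    return X, Y

# Two made-up rows: two features followed by a 0/1 label.
demo = io.StringIO("0.1,0.2,1\n0.3,0.4,0\n")
X_demo, Y_demo = parse_label_last_csv(demo)
print(X_demo.shape, Y_demo)
```

`np.loadtxt` accepts either a filename or a file-like object, so the same function would work on a path if the real files follow the assumed layout.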

In [4]:
y_test = np.array([1., -1., 1., 1., -1., -1., 1., 1., 1., -1.])
X, Y = parse_spambase_data("tiny.spam.train")
for i in range(len(y_test)): assert(y_test[i] == Y[i])
n, m = X.shape
assert(n == 10)
assert(m == 57)

In [5]:
def adaboost(X, y, num_iter, max_depth=1):
    """Given a numpy matrix X, an array y and num_iter, return trees and weights
   
    Input: X, y, num_iter
    Outputs: array of trees from DecisionTreeClassifier
             trees_weights array of floats
    Assumes y is {-1, 1}
    """
    trees = []
    trees_weights = [] 
    N, _ = X.shape
    d = np.ones(N) / N

    ### BEGIN SOLUTION
    
    
    ### END SOLUTION
    return trees, trees_weights
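
A minimal sketch of one common formulation of the boosting loop (discrete AdaBoost, where each tree's weight comes from its weighted error rate). The function name `adaboost_sketch` and the `eps` clipping, which keeps the tree weight finite when a stump makes no mistakes, are assumptions made here. This variant is not guaranteed to reproduce the exact values asserted elsewhere in the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, num_iter, max_depth=1, eps=1e-10):
    """Discrete AdaBoost for labels in {-1, 1}; returns (trees, trees_weights)."""
    trees, trees_weights = [], []
    N = X.shape[0]
    d = np.ones(N) / N  # start with uniform sample weights
    for _ in range(num_iter):
        h = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        h.fit(X, y, sample_weight=d)
        miss = (h.predict(X) != y).astype(float)
        # Weighted error rate, clipped away from 0 and 1 (assumption).
        err = np.clip(np.sum(d * miss) / np.sum(d), eps, 1 - eps)
        alpha = np.log((1 - err) / err)
        # Increase the weight of the points this tree got wrong.
        d = d * np.exp(alpha * miss)
        trees.append(h)
        trees_weights.append(alpha)
    return trees, trees_weights

# Tiny separable example: three points, two features.
x_small = np.array([[0, -1], [1, 0], [-1, 0]])
y_small = np.array([-1, 1, 1])
trees_s, weights_s = adaboost_sketch(x_small, y_small, 3)
# Weighted vote of the trees, thresholded at zero.
votes = sum(a * t.predict(x_small) for a, t in zip(weights_s, trees_s))
pred_small = np.sign(votes)
```

Because the error is divided by the current total weight, the sample weights do not need to be renormalized after each round in this formulation.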

In [18]:
X, Y = parse_spambase_data("tiny.spam.train")
trees, weights = adaboost(X, Y, 2)
y_hat_0 = trees[0].predict(X)
assert(len(trees) == 2)
assert(len(weights) == 2)
assert(isinstance(trees[0], DecisionTreeClassifier))
assert(np.array_equal(y_hat_0[:5], [1.,-1.,1, 1, -1]))

In [19]:
y_hat_0 = trees[0].predict(X)
assert(np.array_equal(y_hat_0[:5], [1.,-1.,1, 1, -1]))

In [27]:
def adaboost_predict(X, trees, trees_weights):
    """Given X, trees and weights predict Y
    """
    # X input, y output
    N, _ =  X.shape
    y = np.zeros(N)
    ### BEGIN SOLUTION
    
    
    ### END SOLUTION
    return y
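
Prediction in AdaBoost is typically a weighted vote: each tree's output in {-1, 1} is scaled by its weight, and the sign of the total is the predicted label. The sketch below illustrates only that arithmetic. `ThresholdStump` is a hypothetical stand-in for a fitted tree, and `weighted_vote_predict` and the weights are made up for the example.

```python
import numpy as np

class ThresholdStump:
    """Hypothetical stand-in for a fitted tree: +1 if feature > threshold."""
    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold

    def predict(self, X):
        return np.where(X[:, self.feature] > self.threshold, 1, -1)

def weighted_vote_predict(X, trees, trees_weights):
    """Sign of the weighted sum of the base predictions."""
    score = np.zeros(X.shape[0])
    for alpha, h in zip(trees_weights, trees):
        score += alpha * h.predict(X)
    return np.sign(score)

X_demo = np.array([[0.0], [1.0], [2.0]])
stumps = [ThresholdStump(0, 0.5), ThresholdStump(0, 1.5)]
alphas = [0.3, 0.7]
y_vote = weighted_vote_predict(X_demo, stumps, alphas)
```

On the middle point the two stumps disagree, and the one with the larger weight wins. Note that `np.sign` returns 0 when the votes cancel exactly, so a real implementation may want an explicit tie-breaking rule.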

In [28]:
x = np.array([[0, -1], [1, 0], [-1, 0]])
y = np.array([-1, 1, 1])
trees, weights = adaboost(x, y, 1)
pred = adaboost_predict(x, trees, weights)
assert(np.array_equal(pred, y))

In [29]:
X, Y = parse_spambase_data("spambase.train")
X_test, Y_test = parse_spambase_data("spambase.test")
trees, trees_weights = adaboost(X, Y, 10)
Yhat = adaboost_predict(X, trees, trees_weights)
Yhat_test = adaboost_predict(X_test, trees, trees_weights)
    
acc_test = accuracy(Y_test, Yhat_test)
acc_train = accuracy(Y, Yhat)
print("Train Accuracy %.4f" % acc_train)
print("Test Accuracy %.4f" % acc_test)
assert(np.around(acc_train, decimals=4)==0.9111)
assert(np.around(acc_test, decimals=4)==0.9190)

Train Accuracy 0.9111
Test Accuracy 0.9190


# Gradient boosting for regression with MSE loss
For this exercise you will implement a version of gradient boosting from scratch and apply it to predict rental prices. You can call DecisionTreeRegressor from sklearn to learn your base learners.
 
`tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)`

In [142]:
def load_dataset():
    dataset = np.loadtxt("rent-ideal.csv", delimiter=",", skiprows=1)
    y = dataset[:, -1]
    X = dataset[:, 0:-1]
    return X, y

In [138]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def gradient_boosting_mse(X, y, num_iter, max_depth=1, nu=0.1):
    """Given X, an array y and num_iter, return y_mean and trees
   
    Input: X, y, num_iter
           max_depth
           nu (the shrinkage)
    Outputs: y_mean, array of trees from DecisionTreeRegressor
    """
    trees = []
    N, _ = X.shape
    y_mean = np.mean(y)
    fm = y_mean
    ### BEGIN SOLUTION
    
    
    ### END SOLUTION
    return y_mean, trees   
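
For squared-error loss the negative gradient with respect to the current prediction is the residual `y - fm`, so each round can fit a regression tree to the residuals and add a shrunken copy of its output. The sketch below follows that idea under stated assumptions: the name `gradient_boosting_mse_sketch` and the synthetic data are invented here, and it is not guaranteed to reproduce the scores asserted elsewhere in the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_mse_sketch(X, y, num_iter, max_depth=1, nu=0.1):
    """Fit num_iter regression trees to successive residuals."""
    trees = []
    y_mean = np.mean(y)
    fm = np.full(len(y), y_mean)  # current prediction for every row
    for _ in range(num_iter):
        residual = y - fm  # negative gradient of 0.5 * (y - fm)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        fm = fm + nu * tree.predict(X)  # shrunken update
        trees.append(tree)
    return y_mean, trees

# Synthetic data, made up for this illustration.
X_syn = np.arange(20, dtype=float).reshape(-1, 1)
y_syn = X_syn[:, 0] ** 2
y_mean_syn, trees_syn = gradient_boosting_mse_sketch(X_syn, y_syn, 50, max_depth=2)

# Rebuild the training prediction from the stored trees (nu=0.1 as above).
fit_syn = y_mean_syn + 0.1 * sum(t.predict(X_syn) for t in trees_syn)
mse_baseline = np.mean((y_syn - y_mean_syn) ** 2)
mse_boosted = np.mean((y_syn - fit_syn) ** 2)
```

Comparing `mse_boosted` with `mse_baseline` shows how far the ensemble has moved from the constant mean prediction it started with.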

In [139]:
def gradient_boosting_predict(X, trees, y_mean,  nu=0.1):
    """Given X, trees, y_mean predict y_hat
    """
    ### BEGIN SOLUTION
    
    
    ### END SOLUTION
    return y_hat
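
At prediction time the ensemble output is the stored mean plus the shrunken sum of every tree's prediction, using the same `nu` that was used during training. The sketch below shows only that accumulation. `ConstantTree` is a hypothetical stand-in for a fitted regressor, and `boosted_predict_sketch` is an illustrative name.

```python
import numpy as np

class ConstantTree:
    """Hypothetical stand-in for a fitted regressor: predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(X.shape[0], self.value)

def boosted_predict_sketch(X, trees, y_mean, nu=0.1):
    """Mean prediction plus the shrunken sum of all tree outputs."""
    y_hat = np.full(X.shape[0], float(y_mean))
    for tree in trees:
        y_hat += nu * tree.predict(X)
    return y_hat

X_demo = np.zeros((4, 2))
demo_trees = [ConstantTree(5.0), ConstantTree(-2.0)]
y_hat_demo = boosted_predict_sketch(X_demo, demo_trees, y_mean=10.0)
```

With a mean of 10 and tree outputs of 5 and -2, every row evaluates to 10 + 0.1 * (5 - 2) = 10.3.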

In [140]:
X, y = load_dataset()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=3)

y_mean, trees = gradient_boosting_mse(X_train, y_train, 300, max_depth=7, nu=0.1)
assert(np.around(y_mean, decimals=4)==3434.7185)
y_hat_train = gradient_boosting_predict(X_train, trees, y_mean, nu=0.1)
assert(np.around(r2_score(y_train, y_hat_train), decimals=4)== 0.8993) 

In [141]:
y_hat = gradient_boosting_predict(X_val, trees, y_mean, nu=0.1)
assert(np.around(r2_score(y_val, y_hat), decimals=4)== 0.8399)