### Introduction

In this assignment, we will implement cross validation to pick the best depth (hyperparameter) for a regression tree. Before we get started, let's import a few packages that you might need. We will use the <a href="https://archive.ics.uci.edu/ml/datasets/Ionosphere">ION</a> dataset for regression. 

In [2]:
import numpy as np
from pylab import *
from numpy.matlib import repmat
import matplotlib.pyplot as plt
from scipy.io import loadmat
import time
import helper as h
%matplotlib notebook

data = loadmat("ion.mat")
xTr  = data['xTr'].T
yTr  = data['yTr'].flatten()
xTe  = data['xTe'].T
yTe  = data['yTe'].flatten()

We have also provided a regression tree implementation in helper.py. The following code cell shows you how to instantiate a regression tree, fit it, and use it to make predictions.

In [3]:
# Create a regression tree with no restriction on its depth
# if you want to create a tree of depth k
# then call h.RegressionTree(depth=k)
tree = h.RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(xTr, yTr)

# To use the trained regression tree to make predictions
pred = tree.predict(xTr)

We have also created a square loss function that takes in the prediction <code>pred</code> and the ground truth <code>truth</code> and returns the average square loss between the two.

In [4]:
def square_loss(pred, truth):
    return np.mean((pred - truth)**2)
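
As a quick sanity check, the loss can be worked out by hand on a tiny example. The vectors below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def square_loss(pred, truth):
    # Average of the squared differences between prediction and ground truth
    return np.mean((pred - truth) ** 2)

# Illustrative values only
pred = np.array([1.0, 2.0, 3.0])
truth = np.array([1.0, 2.0, 5.0])

# The squared errors are [0, 0, 4], so the average is 4/3
loss = square_loss(pred, truth)
print('{:.4f}'.format(loss))  # 1.3333
```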

Now we will look at the performance of our tree on both the training set and test set.

In [5]:
print('Training Loss: {:.4f}'.format(square_loss(tree.predict(xTr), yTr)))
print('Test Loss: {:.4f}'.format(square_loss(tree.predict(xTe), yTe)))

Training Loss: 0.0000
Test Loss: 0.6857


As you can see, our tree achieves zero loss on the training set but a nonzero loss on the test set. Clearly, our model is overfitting! To reduce overfitting, we need to control the depth of the tree; one way to pick the optimal depth is to do kFold Cross Validation. To do so, you will first implement <code>grid_search</code>, which finds the best depth given a training set and a validation set. Then you will implement <code>generate_kFold</code>, which generates the splits for kFold cross validation. Finally, you will combine the two functions to implement <code>cross_validation</code>.

Implement the function <code>grid_search</code>, which takes in a training set <code>xTr, yTr</code>, a validation set <code>xVal, yVal</code> and a list of tree depth candidates <code>depths</code>. Your job here is to fit a regression tree for each depth candidate on the training set <code>xTr, yTr</code>, evaluate the fitted tree on the validation set <code>xVal, yVal</code> and then pick the candidate that yields the lowest loss on the validation set. Note: in the event of a tie, return the smallest depth candidate.

In [9]:
def grid_search(xTr, yTr, xVal, yVal, depths):
    '''
    Input:
        xTr: nxd matrix
        yTr: n vector
        xVal: mxd matrix
        yVal: m vector
        depths: a list of len k
    Return:
        best_depth: the depth that yields the lowest loss on the validation set
        training_losses: a list of len k. the i-th entry corresponds to the training loss
                of the tree of depth depths[i]
        validation_losses: a list of len k. the i-th entry corresponds to the validation loss
                of the tree of depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    ### BEGIN SOLUTION
    # Fit one tree per candidate depth and record its losses
    for depth in depths:
        tree = h.RegressionTree(depth=depth)
        tree.fit(xTr, yTr)
        
        training_loss = square_loss(tree.predict(xTr), yTr)
        validation_loss = square_loss(tree.predict(xVal), yVal)
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    best_depth = depths[np.argmin(validation_losses)]
    ### END SOLUTION
    return best_depth, training_losses, validation_losses
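
The tie-breaking rule deserves a closer look. `np.argmin` returns the index of the first minimum, so `depths[np.argmin(validation_losses)]` gives the smallest tied depth only when `depths` is listed in ascending order. The sketch below uses made-up losses and a hypothetical helper, `pick_smallest_best_depth`, to show the difference; it is not part of the assignment's required interface, and it treats losses as tied only when they are exactly equal.

```python
import numpy as np

def pick_smallest_best_depth(depths, validation_losses):
    # Hypothetical helper: among all depths that reach the minimum validation
    # loss, return the smallest one, regardless of the order of `depths`.
    depths = np.asarray(depths)
    losses = np.asarray(validation_losses)
    tied = depths[losses == losses.min()]
    return tied.min()

# Illustrative losses with a tie between the second and third candidates
losses = [0.5, 0.3, 0.3, 0.4]

# Ascending candidates: np.argmin already lands on the smallest tied depth
ascending = [1, 2, 3, 4]
first_min = ascending[np.argmin(losses)]                    # 2

# Unsorted candidates: the first minimum is no longer the smallest depth
unsorted = [4, 3, 2, 1]
first_min_unsorted = unsorted[np.argmin(losses)]            # 3
smallest_tied = pick_smallest_best_depth(unsorted, losses)  # 2
```

With the candidate lists used in this notebook, which are in ascending order, both approaches agree.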
        

Now, implement the <code>generate_kFold</code> function, which takes in the number of training examples <code>n</code> and the number of folds <code>k</code> and returns a list of <code>k</code> folds where each fold takes the form <code>(training indices, validation indices)</code>.

For instance, if n = 3 and k = 3, then one possible output of the function is <code>[([0, 1], [2]), ([1, 2], [0]), ([0, 2], [1])]</code>. The indices are zero-based so that they can be used directly to index the rows of <code>xTr</code> and <code>yTr</code>.

In [6]:
def generate_kFold(n, k):
    '''
    Input:
        n: number of training examples
        k: number of folds
    Returns:
        kfold_indices: a list of len k. Each entry takes the form
        (training indices, validation indices)
    '''
    assert k >= 2
    kfold_indices = []
    
    ### BEGIN SOLUTION
    indices = np.random.permutation(n)
    fold_size = n // k
    
    fold_indices = [indices[i*fold_size: (i+1)*fold_size] for i in range(k - 1)]
    fold_indices.append(indices[(k-1)*fold_size:])
    
    
    for i in range(k):
        training_indices = [fold_indices[j] for j in range(k) if j != i]
        validation_indices = fold_indices[i]
        kfold_indices.append((np.concatenate(training_indices), validation_indices))
    ### END SOLUTION
    return kfold_indices
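
A k-fold split should satisfy a few invariants: within a fold the training and validation indices are disjoint, together they cover every example, and each example is validated on exactly once across the folds. The sketch below checks these on a hypothetical variant, `kfold_with_array_split`, which uses `np.array_split` to spread the remainder across the folds instead of placing it all in the last fold as the solution above does. The `seed` parameter is an assumption added for reproducibility.

```python
import numpy as np

def kfold_with_array_split(n, k, seed=0):
    # Hypothetical variant: shuffle 0..n-1, then let np.array_split cut the
    # permutation into k nearly equal parts (sizes differ by at most one).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    splits = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train, folds[i]))
    return splits

n = 10
splits = kfold_with_array_split(n, 3)

for train, val in splits:
    # Training and validation indices never overlap and together cover 0..n-1
    assert len(np.intersect1d(train, val)) == 0
    assert sorted(np.concatenate([train, val]).tolist()) == list(range(n))

# Every example is used for validation exactly once across the folds
all_val = np.concatenate([val for _, val in splits])
assert sorted(all_val.tolist()) == list(range(n))
```

The same three checks can be run against the output of `generate_kFold` by replacing the call that builds `splits`.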

Use <code>grid_search</code> to implement the <code>cross_validation</code> function, which takes in the training set <code>xTr, yTr</code>, a list of depth candidates <code>depths</code> and the list of folds <code>indices</code> generated by <code>generate_kFold</code>. Using <code>indices</code>, the function will do a grid search on each fold and return the depth that yields the lowest average validation loss across the folds. Note that in the event of a tie, the function should return the smallest depth candidate.

In [7]:
def cross_validation(xTr, yTr, depths, indices):
    '''
    Input:
        xTr: nxd matrix (training data)
        yTr: n vector (training data)
        depths: a list of len k
        indices: indices from generate_kFold
    Returns:
        best_depth: the depth that yields the lowest average validation loss
        training_losses: a list of len k. the i-th entry corresponds to the average training loss
                of the tree of depth depths[i]
        validation_losses: a list of len k. the i-th entry corresponds to the average validation loss
                of the tree of depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    ### BEGIN SOLUTION
    for train_indices, validation_indices in indices:
        xtrain, ytrain = xTr[train_indices], yTr[train_indices]
        xval, yval = xTr[validation_indices], yTr[validation_indices]
        
        _, training_loss, validation_loss = grid_search(xtrain, ytrain, xval, yval, depths)
        
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    training_losses = np.mean(training_losses, axis=0)
    validation_losses = np.mean(validation_losses, axis=0)
    
    best_depth = depths[np.argmin(validation_losses)]
    best_tree = h.RegressionTree(depth=best_depth)
    best_tree.fit(xTr, yTr)
    ### END SOLUTION
    
    return best_depth, training_losses, validation_losses
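
The averaging step can be pictured as a table with one row per fold and one column per depth candidate. Taking the mean with `axis=0` collapses the rows and leaves one average loss per depth. The numbers below are made up for illustration and are not results from the ION data.

```python
import numpy as np

depths = [1, 2, 3]

# Illustrative validation losses: one row per fold, one column per depth
per_fold_validation_losses = [
    [0.9, 0.5, 0.7],   # fold 0
    [0.8, 0.6, 0.6],   # fold 1
    [1.0, 0.4, 0.8],   # fold 2
]

# axis=0 averages down the rows, giving one mean loss per depth
mean_losses = np.mean(per_fold_validation_losses, axis=0)

# The depth is chosen from the averaged losses, not from any single fold
best_depth = depths[np.argmin(mean_losses)]
print(best_depth)  # 2
```

This is also why the per-fold best depth returned by `grid_search` is discarded in the solution above: the choice is made only after the losses have been averaged.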

In [10]:
### BEGIN HIDDEN TESTS
def grid_search_grader(xTr, yTr, xVal, yVal, depths):
    '''
    Input:
        xTr: nxd matrix
        yTr: n vector
        xVal: mxd matrix
        yVal: m vector
        depths: a list of len k
    Return:
        best_depth: the depth that yields the lowest loss on the validation set
        training_losses: a list of len k. the i-th entry corresponds to the training loss
                of the tree of depth depths[i]
        validation_losses: a list of len k. the i-th entry corresponds to the validation loss
                of the tree of depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    

    for i in depths:
        tree = h.RegressionTree(i)
        tree.fit(xTr, yTr)
        
        training_loss = square_loss(tree.predict(xTr), yTr)
        validation_loss = square_loss(tree.predict(xVal), yVal)
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    best_depth = depths[np.argmin(validation_losses)]

    return best_depth, training_losses, validation_losses

depths = [1,2,3,4,5]
k = len(depths)
best_depth, training_losses, validation_losses = grid_search(xTr, yTr, xTe, yTe, depths)
best_depth_grader, training_losses_grader, validation_losses_grader = grid_search_grader(xTr, yTr, xTe, yTe, depths)


# Check the length of the training loss
def grid_search_test1(training_losses, k):
    return (len(training_losses) == k) 

# Check the length of the validation loss
def grid_search_test2(validation_losses, k):
    return (len(validation_losses) == k)

# Check the argmin
def grid_search_test3(best_depth, validation_losses):
    return (best_depth == depths[np.argmin(validation_losses)])

def grid_search_test4(best_depth, best_depth_grader):
    return (best_depth == best_depth_grader)

def grid_search_test5(training_losses, training_losses_grader):
    return np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

def grid_search_test6(validation_losses, validation_losses_grader):
    return np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

assert grid_search_test1(training_losses, k), "[Failed] grid_search: the len(training_losses) != len(depths)"
assert grid_search_test2(validation_losses, k), "[Failed] grid_search: the len(validation_losses) != len(depths)"
assert grid_search_test3(best_depth, validation_losses), "[Failed] grid_search: Your best depth is not the minimizer of your validation loss" 
assert grid_search_test4(best_depth, best_depth_grader), "[Failed] grid_search: Your best depth does not match the optimal max depth!"
assert grid_search_test5(training_losses, training_losses_grader), "[Failed] grid_search: Some of your training losses are not right!"
assert grid_search_test6(validation_losses, validation_losses_grader), "[Failed] grid_search: Some of your validation losses are not right!"
### END HIDDEN TESTS

In [11]:
### BEGIN HIDDEN TESTS
kfold_indices = generate_kFold(1004, 5)

def generate_kFold_test1(kfold_indices):
    return len(kfold_indices) == 5

def generate_kFold_test2(kfold_indices):
    t = [((len(train_indices) + len(validation_indices)) == 1004) for (train_indices, validation_indices) in kfold_indices]
    return np.all(t)

def generate_kFold_test3(kfold_indices):
    ratio_test = []
    for (train_indices, validation_indices) in kfold_indices:
        ratio = len(validation_indices) / len(train_indices)
        ratio_test.append((ratio > 0.24 and ratio < 0.26))
    return np.all(ratio_test)

assert generate_kFold_test1(kfold_indices), "[Failed] generate_kFold: You did not generate k folds!"
assert generate_kFold_test2(kfold_indices), "[Failed] generate_kFold: The length of your train indices and validation indices does not add up!"
assert generate_kFold_test3(kfold_indices), "[Failed] generate_kFold: The ratio len(validation_indices) / len(train_indices) is incorrect"
### END HIDDEN TESTS
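
The bounds 0.24 and 0.26 in `generate_kFold_test3` follow from k = 5: with five equal folds, one fold validates and four train, so the ratio is 1/4 = 0.25. Since 1004 is not divisible by 5, the ratios deviate slightly from that value. The arithmetic below assumes the remainder is assigned to the last fold, as in the solution above; other valid splitting schemes give slightly different fold sizes.

```python
n, k = 1004, 5
fold_size = n // k  # 200

# The first k - 1 folds have fold_size examples; the last fold takes the rest
fold_sizes = [fold_size] * (k - 1) + [n - (k - 1) * fold_size]

# Ratio of validation examples to training examples for each fold
ratios = [size / (n - size) for size in fold_sizes]
print(['{:.4f}'.format(r) for r in ratios])
# ['0.2488', '0.2488', '0.2488', '0.2488', '0.2550']
```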

In [12]:
### BEGIN HIDDEN TESTS
def cross_validation_grader(xTr, yTr, depths, indices):
    '''
    Input:
        xTr: nxd matrix (training data)
        yTr: n vector (training data)
        depths: a list of len k
        indices: indices from generate_kFold
    Returns:
        best_depth: the depth that yields the lowest average validation loss
        training_losses: a list of len k. the i-th entry corresponds to the average training loss
                of the tree of depth depths[i]
        validation_losses: a list of len k. the i-th entry corresponds to the average validation loss
                of the tree of depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    for train_indices, validation_indices in indices:
        xtrain, ytrain = xTr[train_indices], yTr[train_indices]
        xval, yval = xTr[validation_indices], yTr[validation_indices]
        
        _, training_loss, validation_loss = grid_search(xtrain, ytrain, xval, yval, depths)
        
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    training_losses = np.mean(training_losses, axis=0)
    validation_losses = np.mean(validation_losses, axis=0)
    
    best_depth = depths[np.argmin(validation_losses)]
    best_tree = h.RegressionTree(depth=best_depth)
    best_tree.fit(xTr, yTr)
    
    return best_depth, training_losses, validation_losses

depths = [1,2,3,4,5]
k = len(depths)
indices = generate_kFold(len(xTr), 5)
best_depth, training_losses, validation_losses = cross_validation(xTr, yTr, depths, indices)
best_depth_grader, training_losses_grader, validation_losses_grader = cross_validation_grader(xTr, yTr, depths, indices)


# Check the length of the training loss
def cross_validation_test1(training_losses, k):
    return (len(training_losses) == k) 

# Check the length of the validation loss
def cross_validation_test2(validation_losses, k):
    return (len(validation_losses) == k)

# Check the argmin
def cross_validation_test3(best_depth, validation_losses):
    return (best_depth == depths[np.argmin(validation_losses)])

def cross_validation_test4(best_depth, best_depth_grader):
    return (best_depth == best_depth_grader)

def cross_validation_test5(training_losses, training_losses_grader):
    return np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

def cross_validation_test6(validation_losses, validation_losses_grader):
    return np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

assert cross_validation_test1(training_losses, k), "[Failed] cross_validation: the len(training_losses) != len(depths)"
assert cross_validation_test2(validation_losses, k), "[Failed] cross_validation: the len(validation_losses) != len(depths)"
assert cross_validation_test3(best_depth, validation_losses), "[Failed] cross_validation: Your best depth is not the minimizer of your validation loss" 
assert cross_validation_test4(best_depth, best_depth_grader), "[Failed] cross_validation: Your best depth does not match the optimal max depth!"
assert cross_validation_test5(training_losses, training_losses_grader), "[Failed] cross_validation: Some of your training losses are not right!"
assert cross_validation_test6(validation_losses, validation_losses_grader), "[Failed] cross_validation: Some of your validation losses are not right"
### END HIDDEN TESTS