### Introduction

In this assignment, we will implement cross validation to pick the best depth (a hyperparameter) for a regression tree. Before we get started, let's import a few packages that you might need. We will use the <a href="https://archive.ics.uci.edu/ml/datasets/Ionosphere">ION</a> dataset for regression.

In [None]:
import numpy as np
from pylab import *
from numpy.matlib import repmat
import matplotlib.pyplot as plt
from scipy.io import loadmat
import time
import helper as h
%matplotlib notebook

data = loadmat("ion.mat")
xTr  = data['xTr'].T
yTr  = data['yTr'].flatten()
xTe  = data['xTe'].T
yTe  = data['yTe'].flatten()

We have also provided a regression tree implementation in <code>helper.py</code>. The following code cell shows how to create, train, and use a regression tree.

In [None]:
# Create a regression tree with no restriction on its depth.
# To create a tree of depth k instead,
# call h.RegressionTree(depth=k)
tree = h.RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(xTr, yTr)

# To use the trained regression tree to make predictions
pred = tree.predict(xTr)

We have also defined a square loss function that takes the predictions <code>pred</code> and the ground truth <code>truth</code> and returns the average square loss between them.

In [None]:
def square_loss(pred, truth):
    return np.mean((pred - truth)**2)
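
As a quick sanity check of <code>square_loss</code>, the sketch below evaluates it on a few made-up numbers (they are illustrative values, not taken from the ION dataset) and compares a constant prediction against the label variance.

```python
import numpy as np

def square_loss(pred, truth):
    # Mean of the squared residuals, same definition as in the cell above
    return np.mean((pred - truth) ** 2)

# Made-up values, for illustration only
pred = np.array([1.0, -1.0, 0.5])
truth = np.array([1.0, 1.0, -1.0])

# Residuals are 0, -2 and 1.5, so the squares are 0, 4 and 2.25
loss = square_loss(pred, truth)
print(loss)  # mean of the squares: 6.25 / 3

# Predicting the label mean everywhere gives a loss equal to the label variance
baseline = square_loss(np.full_like(truth, truth.mean()), truth)
print(np.isclose(baseline, np.var(truth)))
```

The constant-mean baseline is a useful reference point: a tree whose test loss is no better than the label variance has learned nothing useful about the inputs.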

Now we will look at the performance of our tree on both the training set and the test set.

In [None]:
print('Training Loss: {:.4f}'.format(square_loss(tree.predict(xTr), yTr)))
print('Test Loss: {:.4f}'.format(square_loss(tree.predict(xTe), yTe)))

As you can see, our tree achieves zero loss on the training set but nonzero loss on the test set. Clearly, our model is overfitting! To reduce overfitting, we need to control the depth of the tree, and one way to pick the optimal depth is k-fold cross validation. To do so, you will first implement <code>grid_search</code>, which finds the best depth given a training set and a validation set. Then you will implement <code>generate_kFold</code>, which generates the splits for k-fold cross validation. Finally, you will combine the two functions to implement <code>cross_validation</code>.

Implement the <code>grid_search</code> function, which takes a training set <code>xTr, yTr</code>, a validation set <code>xVal, yVal</code>, and a list of depth candidates <code>depths</code>. Your job is to fit a regression tree for each depth candidate on the training set <code>xTr, yTr</code>, evaluate the fitted tree on the validation set <code>xVal, yVal</code>, and then pick the candidate that yields the lowest validation loss. Note: in the event of a tie, return the smallest depth candidate.

In [None]:
def grid_search(xTr, yTr, xVal, yVal, depths):
    '''
    Input:
        xTr: nxd matrix
        yTr: n vector
        xVal: mxd matrix
        yVal: m vector
        depths: a list of len k
    Return:
        best_depth: the depth that yields the lowest loss on the validation set
        training_losses: a list of len k. The i-th entry is the training loss
                of the tree with depth depths[i]
        validation_losses: a list of len k. The i-th entry is the validation loss
                of the tree with depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return best_depth, training_losses, validation_losses
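
The grid search described above can be sketched as follows. This is one possible approach, not the reference solution. Because <code>helper.py</code> is not shown here, the sketch uses a hypothetical stand-in model, <code>BinnedMeanRegressor</code>, and a hypothetical extra parameter <code>make_tree</code> so that the model can be swapped; in the notebook you would construct <code>h.RegressionTree(depth=depth)</code> instead. The data is synthetic.

```python
import numpy as np

class BinnedMeanRegressor:
    """Hypothetical stand-in for h.RegressionTree (not the helper.py model).

    Splits the range of feature 0 into 2**depth equal-width bins and predicts
    the mean training label of each bin, so a larger depth means more leaves.
    """
    def __init__(self, depth):
        self.depth = depth

    def _bin(self, x):
        n_bins = 2 ** self.depth
        scaled = (x[:, 0] - self.lo) / max(self.hi - self.lo, 1e-12)
        return np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)

    def fit(self, x, y):
        self.lo, self.hi = x[:, 0].min(), x[:, 0].max()
        bins = self._bin(x)
        # Bins without training points fall back to the global label mean
        self.means = np.full(2 ** self.depth, y.mean())
        for b in np.unique(bins):
            self.means[b] = y[bins == b].mean()

    def predict(self, x):
        return self.means[self._bin(x)]

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

def grid_search(xTr, yTr, xVal, yVal, depths, make_tree=BinnedMeanRegressor):
    training_losses, validation_losses = [], []
    for depth in depths:
        tree = make_tree(depth=depth)
        tree.fit(xTr, yTr)  # fit on the training split only
        training_losses.append(square_loss(tree.predict(xTr), yTr))
        validation_losses.append(square_loss(tree.predict(xVal), yVal))
    # Tie-break: among depths with the minimum validation loss, take the smallest
    best_loss = min(validation_losses)
    best_depth = min(d for d, l in zip(depths, validation_losses) if l == best_loss)
    return best_depth, training_losses, validation_losses

# Synthetic 1-D regression problem, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.3 * rng.normal(size=300)
depths = [0, 1, 2, 3, 4, 5, 6]
best_depth, tr, val = grid_search(x[:200], y[:200], x[200:], y[200:], depths)
print(best_depth)
```

Taking the minimum over all depths that attain the best loss, rather than relying on the order of <code>depths</code>, keeps the tie-break correct even when the candidates are not sorted.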
        

Implement the <code>generate_kFold</code> function, which takes the number of training examples <code>n</code> and the number of folds <code>k</code> and returns a list of <code>k</code> folds, where each fold takes the form <code>(training indices, validation indices)</code>. For instance, if n = 3 and k = 3, then one possible output of the function is <code>[([1, 2], [0]), ([0, 2], [1]), ([0, 1], [2])]</code>.

In [None]:
def generate_kFold(n, k):
    '''
    Input:
        n: number of training examples
        k: number of folds
    Returns:
        kfold_indices: a list of len k. Each entry takes the form
        (training indices, validation indices)
    '''
    assert k >= 2
    kfold_indices = []
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return kfold_indices
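
One way to generate the folds is sketched below. It assumes 0-based indices (so they can index NumPy arrays directly) and contiguous, unshuffled folds; both are choices of this sketch, not requirements stated in the assignment. If the examples are stored in a meaningful order, shuffling the indices first would be a sensible variation.

```python
import numpy as np

def generate_kFold(n, k):
    assert k >= 2
    # Split 0..n-1 into k nearly equal contiguous chunks
    folds = np.array_split(np.arange(n), k)
    kfold_indices = []
    for i in range(k):
        val_idx = folds[i]
        # Every chunk except the i-th one is used for training
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        kfold_indices.append((train_idx, val_idx))
    return kfold_indices

folds = generate_kFold(10, 3)
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Each example lands in exactly one validation fold, and no fold's training indices overlap its validation indices, which is the property the later averaging step relies on.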

Use <code>grid_search</code> to implement the <code>cross_validation</code> function, which takes the training set <code>xTr, yTr</code>, a list of depth candidates <code>depths</code>, and a list of indices generated by <code>generate_kFold</code>. Based on <code>indices</code>, the function runs a grid search on each fold and returns the depth with the lowest average validation loss across the folds. Note: in the event of a tie, return the smallest depth candidate.

In [None]:
def cross_validation(xTr, yTr, depths, indices):
    '''
    Input:
        xTr: nxd matrix (training data)
        yTr: n vector (training data)
        depths: a list of len k
        indices: indices from generate_kFold
    Returns:
        best_depth: the depth with the lowest average validation loss
        training_losses: a list of len k. The i-th entry is the average training loss
                of the tree with depth depths[i]
        validation_losses: a list of len k. The i-th entry is the average validation loss
                of the tree with depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return best_depth, training_losses, validation_losses
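
Putting the pieces together, the sketch below runs a grid search on every fold, averages the per-depth losses across folds, and applies the smallest-depth tie-break. It is a minimal illustration under stated assumptions, not the reference solution: <code>BinnedMeanRegressor</code> and the <code>make_tree</code> parameter are hypothetical stand-ins for <code>h.RegressionTree</code>, the data is synthetic, and the fold losses are averaged with equal weight even when fold sizes differ.

```python
import numpy as np

class BinnedMeanRegressor:
    """Hypothetical stand-in for h.RegressionTree: 2**depth bins on feature 0."""
    def __init__(self, depth):
        self.depth = depth

    def _bin(self, x):
        n_bins = 2 ** self.depth
        scaled = (x[:, 0] - self.lo) / max(self.hi - self.lo, 1e-12)
        return np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)

    def fit(self, x, y):
        self.lo, self.hi = x[:, 0].min(), x[:, 0].max()
        bins = self._bin(x)
        self.means = np.full(2 ** self.depth, y.mean())  # default for empty bins
        for b in np.unique(bins):
            self.means[b] = y[bins == b].mean()

    def predict(self, x):
        return self.means[self._bin(x)]

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

def grid_search(xTr, yTr, xVal, yVal, depths, make_tree):
    tr, val = [], []
    for depth in depths:
        tree = make_tree(depth=depth)
        tree.fit(xTr, yTr)
        tr.append(square_loss(tree.predict(xTr), yTr))
        val.append(square_loss(tree.predict(xVal), yVal))
    best = min(d for d, l in zip(depths, val) if l == min(val))
    return best, tr, val

def cross_validation(xTr, yTr, depths, indices, make_tree=BinnedMeanRegressor):
    fold_tr, fold_val = [], []
    for train_idx, val_idx in indices:
        # The per-fold best depth is ignored; only the loss curves are kept
        _, tr, val = grid_search(xTr[train_idx], yTr[train_idx],
                                 xTr[val_idx], yTr[val_idx], depths, make_tree)
        fold_tr.append(tr)
        fold_val.append(val)
    # Average over folds (axis 0), leaving one entry per depth candidate
    training_losses = np.mean(fold_tr, axis=0).tolist()
    validation_losses = np.mean(fold_val, axis=0).tolist()
    best_loss = min(validation_losses)
    best_depth = min(d for d, l in zip(depths, validation_losses) if l == best_loss)
    return best_depth, training_losses, validation_losses

# Synthetic data and 5 contiguous folds, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.3 * rng.normal(size=300)
chunks = np.array_split(np.arange(len(y)), 5)
indices = [(np.concatenate(chunks[:i] + chunks[i + 1:]), chunks[i]) for i in range(5)]
depths = [0, 1, 2, 3, 4, 5, 6]
best_depth, cv_tr, cv_val = cross_validation(x, y, depths, indices)
print(best_depth)
```

Note that the depth is chosen from the averaged validation curve rather than by voting over each fold's individual winner, which matches the "best average validation loss" wording of the task.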