<h2>About this Project</h2>
<p>In this project, you will implement cross validation to pick the best depth (hyperparameter) for a regression tree, again using the ION data set.</p>

<h3>Evaluation</h3>

<p><strong>This project must be successfully completed and submitted in order to receive credit for this course. Your score on this project will be included in your final grade calculation.</strong></p>
    
<p>You are expected to write code where you see <em># YOUR CODE HERE</em> within the cells of this notebook. Not all cells will be graded; code input cells followed by cells marked with <em>#Autograder test cell</em> will be graded. Upon submitting your work, the code you write at these designated positions will be assessed using an "autograder" that will run all test cells to assess your code. You will receive feedback from the autograder that will identify any errors in your code. Use this feedback to improve your code if you need to resubmit. Be sure not to change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with the autograder. Also, remember to execute all code cells sequentially, not just those you’ve edited, to ensure your code runs properly.</p>
    
<p>You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Q&A discussion board to engage with your peers or seek assistance from the instructor.</p>

<p>Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).</p>

<h3>Submit Code for Autograder Feedback</h3>

<p>Once you have completed your work on this notebook, you will submit your code for autograder review. Follow these steps:</p>

<ol>
  <li><strong>Save your notebook.</strong></li>
  <li><strong>Mark as Completed —</strong> In the blue menu bar along the top of this code exercise window, you’ll see a menu item called <strong>Education</strong>. In the <strong>Education</strong> menu, click <strong>Mark as Completed</strong> to submit your code for autograder/instructor review. This process will take a moment and a progress bar will show you the status of your submission.</li>
	<li><strong>Review your results —</strong> Once your work is marked as complete, the results of the autograder will automatically be presented in a new tab within the code exercise window. You can click on the assessment name in this feedback window to see more details regarding specific feedback/errors in your code submission.</li>
  <li><strong>Repeat, if necessary —</strong> The Jupyter notebook will always remain accessible in the first tabbed window of the exercise. To reattempt the work, you will first need to click <strong>Mark as Uncompleted</strong> in the <strong>Education</strong> menu and then proceed to make edits to the notebook. Once you are ready to resubmit, follow steps one through three. You can repeat this procedure as many times as necessary.</li>
</ol>

## Get Started

<p>Let's import a few packages that you will need. You will work with the <a href="https://archive.ics.uci.edu/ml/datasets/Ionosphere">ION</a> dataset for this project.</p> 

In [None]:
import numpy as np
from pylab import *
from numpy.matlib import repmat
import matplotlib.pyplot as plt
from scipy.io import loadmat
import sys
import time

%matplotlib notebook

sys.path.append('/home/codio/workspace/.guides/hf')
from helper import *

print('You\'re running python %s' % sys.version.split(' ')[0])

In [None]:
data = loadmat("ion.mat")
xTr  = data['xTr'].T
yTr  = data['yTr'].flatten()
xTe  = data['xTe'].T
yTe  = data['yTe'].flatten()

We have also provided a regression tree implementation, <code>RegressionTree</code>, in ``helper.py``. The following code cell shows you how to instantiate a regression tree, fit it, and use it to make predictions.

In [None]:
# Create a regression tree with no restriction on its depth.
# If you want to create a tree of depth k,
# call RegressionTree(depth=k) instead.
tree = RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(xTr, yTr)

# To use the trained regression tree to make predictions
pred = tree.predict(xTr)

We have also created a square loss function that takes in the prediction <code>pred</code> and ground truth <code>truth</code> and returns the average square loss between prediction and ground truth. 
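
To make the definition concrete, here is a small sketch you can check by hand. It uses plain Python lists instead of NumPy so that it runs outside the notebook; the name `square_loss_plain` and the numbers are made up for illustration and are not part of the assignment.

```python
def square_loss_plain(pred, truth):
    """Average of the squared differences between predictions and labels."""
    assert len(pred) == len(truth)
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)

# Made-up predictions and ground-truth values
pred = [1.0, -1.0, 0.5, -0.5]
truth = [1.0, -1.0, 1.0, 1.0]

# Squared errors are 0, 0, 0.25 and 2.25, so the average is 2.5 / 4
loss = square_loss_plain(pred, truth)
print(loss)
```

The result is zero only when every prediction matches its label exactly, which is why a zero training loss signals that the tree has memorized the training set.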

In [None]:
def square_loss(pred, truth):
    return np.mean((pred - truth)**2)

Now, look at the performance of your tree on both the training set and test set using the code cell below.

In [None]:
print('Training Loss: {:.4f}'.format(square_loss(tree.predict(xTr), yTr)))
print('Test Loss: {:.4f}'.format(square_loss(tree.predict(xTe), yTe)))

As you can see, your tree achieves zero loss on the training set but nonzero loss on the test set. Clearly, the model is overfitting! To reduce overfitting, you need to control the depth of the tree. One way to pick the optimal depth is k-fold cross validation. To do so, you will first implement <code>grid_search</code>, which finds the best depth given a training set and a validation set. Then you will implement <code>generate_kFold</code>, which generates the splits for k-fold cross validation. Finally, you will combine the two functions to implement <code>cross_validation</code>.

## Implement Cross Validation

### Part One [Graded]
Implement the function <code>grid_search</code>, which takes in a training set <code>xTr, yTr</code>, a validation set <code>xVal, yVal</code> and a list of tree depth candidates <code>depths</code>. Your job here is to fit a regression tree for each depth candidate on the training set <code>xTr, yTr</code>, evaluate the fitted tree on the validation set <code>xVal, yVal</code> and then pick the candidate that yields the lowest loss for the validation set. Note: in the event of a tie, return the smallest depth candidate.
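
A note on the tie-breaking rule: `np.argmin` returns the index of the first minimum, so `depths[np.argmin(validation_losses)]` gives the smallest tied depth only when `depths` is sorted in ascending order, as it is in the tests below. The following sketch shows one way to break ties without relying on the order of `depths`. The helper name `pick_best_depth` and the loss values are hypothetical.

```python
def pick_best_depth(depths, validation_losses):
    """Return the depth with the lowest validation loss; ties go to the smallest depth."""
    # Tuples compare element by element: loss first, then depth on equal loss
    best_loss, best_depth = min(zip(validation_losses, depths))
    return best_depth

depths = [5, 1, 3]                      # deliberately unsorted
validation_losses = [0.20, 0.35, 0.20]  # depths 5 and 3 tie exactly

best = pick_best_depth(depths, validation_losses)
print(best)
```

Here a tie means exact equality of the loss values; two losses that differ only by floating-point noise are not treated as tied.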

In [None]:
def grid_search(xTr, yTr, xVal, yVal, depths):
    '''
    Input:
        xTr: nxd matrix
        yTr: n vector
        xVal: mxd matrix
        yVal: m vector
        depths: a list of len k
    Return:
        best_depth: the depth that yields the lowest loss on the validation set
        training_losses: a list of len k. The i-th entry is the training loss
                of the tree with depth depths[i]
        validation_losses: a list of len k. The i-th entry is the validation loss
                of the tree with depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    ### BEGIN SOLUTION
    for depth in depths:
        tree = RegressionTree(depth=depth)
        tree.fit(xTr, yTr)
        
        training_loss = square_loss(tree.predict(xTr), yTr)
        validation_loss = square_loss(tree.predict(xVal), yVal)
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    best_depth = depths[np.argmin(validation_losses)]
    ### END SOLUTION
    return best_depth, training_losses, validation_losses

In [None]:
# The following tests check that your implementation of grid search returns the correct number of training and validation loss values and the correct best depth

depths = [1,2,3,4,5]
k = len(depths)
best_depth, training_losses, validation_losses = grid_search(xTr, yTr, xTe, yTe, depths)
best_depth_grader, training_losses_grader, validation_losses_grader = grid_search_grader(xTr, yTr, xTe, yTe, depths)

# Check the length of the training loss
def grid_search_test1():
    return (len(training_losses) == k) 

# Check the length of the validation loss
def grid_search_test2():
    return (len(validation_losses) == k)

# Check the argmin
def grid_search_test3():
    return (best_depth == depths[np.argmin(validation_losses)])

def grid_search_test4():
    return (best_depth == best_depth_grader)

def grid_search_test5():
    return np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

def grid_search_test6():
    return np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

runtest(grid_search_test1, 'grid_search_test1')
runtest(grid_search_test2, 'grid_search_test2')
runtest(grid_search_test3, 'grid_search_test3')
runtest(grid_search_test4, 'grid_search_test4')
runtest(grid_search_test5, 'grid_search_test5')
runtest(grid_search_test6, 'grid_search_test6')

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test1
### BEGIN HIDDEN TESTS

assert (len(training_losses) == k) 

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test2
### BEGIN HIDDEN TESTS

assert (len(validation_losses) == k)

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test3
### BEGIN HIDDEN TESTS

assert (best_depth == depths[np.argmin(validation_losses)])


### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test4
### BEGIN HIDDEN TESTS

assert (best_depth == best_depth_grader)

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test5
### BEGIN HIDDEN TESTS

assert np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs grid search test6
### BEGIN HIDDEN TESTS

assert np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

### END HIDDEN TESTS

### Part Two [Graded]

Now, implement the <code>generate_kFold</code> function, which takes in the number of training examples <code>n</code> and the number of folds <code>k</code> and returns a list of <code>k</code> folds where each fold takes the form <code>(training indices, validation indices)</code>.

For instance, if n = 3 and k = 3, then one possible output of the function is <code>[([0, 1], [2]), ([1, 2], [0]), ([0, 2], [1])]</code>. The indices are 0-based because they are used to index into <code>xTr</code> and <code>yTr</code>.
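
Before running the tests, you can sanity-check a split yourself. The sketch below assumes the indices are 0-based positions into the training arrays, and verifies that each fold's training and validation indices are disjoint and that every example lands in exactly one validation fold. The helper `is_valid_kfold` and the hand-written splits are hypothetical and not part of `helper.py`.

```python
def is_valid_kfold(splits, n):
    """Check that the validation folds partition range(n) and never overlap training."""
    seen = []
    for train, val in splits:
        train, val = list(train), list(val)
        # Training and validation indices must not share any example
        if set(train) & set(val):
            return False
        # Together they must cover all n examples exactly once
        if sorted(train + val) != list(range(n)):
            return False
        seen.extend(val)
    # Across the folds, every example is validated exactly once
    return sorted(seen) == list(range(n))

# A hand-written 3-fold split of 6 examples
good_splits = [([2, 3, 4, 5], [0, 1]),
               ([0, 1, 4, 5], [2, 3]),
               ([0, 1, 2, 3], [4, 5])]

# The same idea with 1-based indices, which would run off the end of the arrays
bad_splits = [([1, 2], [3]), ([2, 3], [1]), ([1, 3], [2])]

print(is_valid_kfold(good_splits, 6))
print(is_valid_kfold(bad_splits, 3))
```

The first split passes the check, while the 1-based one fails because index 3 does not exist when n = 3.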

In [None]:
def generate_kFold(n, k):
    '''
    Input:
        n: number of training examples
        k: number of folds
    Returns:
        kfold_indices: a list of len k. Each entry takes the form
        (training indices, validation indices)
    '''
    assert k >= 2
    kfold_indices = []
    
    ### BEGIN SOLUTION
    indices = np.random.permutation(n)
    fold_size = n // k
    
    fold_indices = [indices[i*fold_size: (i+1)*fold_size] for i in range(k - 1)]
    fold_indices.append(indices[(k-1)*fold_size:])
    
    
    for i in range(k):
        training_indices = [fold_indices[j] for j in range(k) if j != i]
        validation_indices = fold_indices[i]
        kfold_indices.append((np.concatenate(training_indices), validation_indices))
    ### END SOLUTION
    return kfold_indices

In [None]:
# The following tests check that your generate_kFold function returns the correct number of total indices, train and test indices, and the correct ratio

kfold_indices = generate_kFold(1004, 5)

def generate_kFold_test1():
    return len(kfold_indices) == 5

def generate_kFold_test2():
    t = [((len(train_indices) + len(test_indices)) == 1004) for (train_indices, test_indices) in kfold_indices]
    return np.all(t)

def generate_kFold_test3():
    ratio_test = []
    for (train_indices, validation_indices) in kfold_indices:
        ratio = len(validation_indices) / len(train_indices)
        ratio_test.append((ratio > 0.24 and ratio < 0.26))
    return np.all(ratio_test)

runtest(generate_kFold_test1, 'generate_kFold_test1')
runtest(generate_kFold_test2, 'generate_kFold_test2')
runtest(generate_kFold_test3, 'generate_kFold_test3')

In [None]:
# Autograder test cell - worth 1 point
# runs generate Kfold test1
### BEGIN HIDDEN TESTS

assert len(kfold_indices) == 5

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs generate Kfold test2
### BEGIN HIDDEN TESTS

t = [((len(train_indices) + len(test_indices)) == 1004) for (train_indices, test_indices) in kfold_indices]
assert np.all(t)

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs generate Kfold test3
### BEGIN HIDDEN TESTS

ratio_test = []
for (train_indices, validation_indices) in kfold_indices:
    ratio = len(validation_indices) / len(train_indices)
    ratio_test.append((ratio > 0.24 and ratio < 0.26))
assert np.all(ratio_test)

### END HIDDEN TESTS

### Part Three [Graded]

Use <code>grid_search</code> to implement the <code>cross_validation</code> function, which takes in the training set <code>xTr, yTr</code>, a list of depth candidates <code>depths</code>, and a list of folds <code>indices</code> generated by <code>generate_kFold</code>. Using <code>indices</code>, the function runs a grid search on each fold and returns the depth that yields the lowest average validation loss across the folds. Note that in the event of a tie, the function should return the smallest depth candidate.
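
The key step is that the losses are averaged across folds before the best depth is chosen; the per-fold winners are not put to a vote. The sketch below illustrates this with made-up loss values and plain Python lists. The helper `average_over_folds` is hypothetical; with NumPy the same averaging is `np.mean(losses, axis=0)`.

```python
def average_over_folds(per_fold_losses):
    """per_fold_losses[f][i] is the loss of depths[i] on fold f; average over folds."""
    n_folds = len(per_fold_losses)
    # zip(*rows) groups the losses of each depth candidate across all folds
    return [sum(col) / n_folds for col in zip(*per_fold_losses)]

depths = [1, 2, 3]
val_losses = [
    [0.875, 0.5, 0.375],  # fold 0: depth 3 looks best
    [0.25, 0.5, 0.875],   # fold 1: depth 1 looks best
]

# Best depth on each fold separately (list.index returns the first minimum)
per_fold_best = [depths[row.index(min(row))] for row in val_losses]

# Best depth after averaging the validation losses across folds
avg = average_over_folds(val_losses)
best_depth = depths[avg.index(min(avg))]

print(per_fold_best, avg, best_depth)
```

In this example the two folds disagree on the best depth, and the averaged losses select depth 2, which neither fold picked on its own.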

In [None]:
def cross_validation(xTr, yTr, depths, indices):
    '''
    Input:
        xTr: nxd matrix (training data)
        yTr: n vector (training data)
        depths: a list of len k
        indices: indices from generate_kFold
    Returns:
        best_depth: the best parameter 
        training_losses: a list of len k. The i-th entry is the average training loss
                of the tree with depth depths[i]
        validation_losses: a list of len k. The i-th entry is the average validation loss
                of the tree with depth depths[i]
    '''
    training_losses = []
    validation_losses = []
    best_depth = None
    
    ### BEGIN SOLUTION
    for train_indices, validation_indices in indices:
        xtrain, ytrain = xTr[train_indices], yTr[train_indices]
        xval, yval = xTr[validation_indices], yTr[validation_indices]
        
        _, training_loss, validation_loss = grid_search(xtrain, ytrain, xval, yval, depths)
        
        training_losses.append(training_loss)
        validation_losses.append(validation_loss)
    
    training_losses = np.mean(training_losses, axis=0)
    validation_losses = np.mean(validation_losses, axis=0)
    
    # np.argmin returns the first minimum, so ties go to the earliest entry in depths
    best_depth = depths[np.argmin(validation_losses)]
    ### END SOLUTION
    
    return best_depth, training_losses, validation_losses

In [None]:
# The following tests check that your implementation of cross_validation returns the correct number of training and validation losses, the correct "best depth" and the correct values for training and validation loss

depths = [1,2,3,4,5]
k = len(depths)
indices = generate_kFold(len(xTr), 5)
best_depth, training_losses, validation_losses = cross_validation(xTr, yTr, depths, indices)
best_depth_grader, training_losses_grader, validation_losses_grader = cross_validation_grader(xTr, yTr, depths, indices)

# Check the length of the training loss
def cross_validation_test1():
    return (len(training_losses) == k) 

# Check the length of the validation loss
def cross_validation_test2():
    return (len(validation_losses) == k)

# Check the argmin
def cross_validation_test3():
    return (best_depth == depths[np.argmin(validation_losses)])

def cross_validation_test4():
    return (best_depth == best_depth_grader)

def cross_validation_test5():
    return np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

def cross_validation_test6():
    return np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

runtest(cross_validation_test1, 'cross_validation_test1')
runtest(cross_validation_test2, 'cross_validation_test2')
runtest(cross_validation_test3, 'cross_validation_test3')
runtest(cross_validation_test4, 'cross_validation_test4')
runtest(cross_validation_test5, 'cross_validation_test5')
runtest(cross_validation_test6, 'cross_validation_test6')

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test1
### BEGIN HIDDEN TESTS

assert (len(training_losses) == k) 

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test2
### BEGIN HIDDEN TESTS

assert (len(validation_losses) == k)

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test3
### BEGIN HIDDEN TESTS

assert (best_depth == depths[np.argmin(validation_losses)])

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test4
### BEGIN HIDDEN TESTS

assert (best_depth == best_depth_grader)

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test5
### BEGIN HIDDEN TESTS

assert np.linalg.norm(np.array(training_losses) - np.array(training_losses_grader)) < 1e-7

### END HIDDEN TESTS

In [None]:
# Autograder test cell - worth 1 point
# runs cross validation test6
### BEGIN HIDDEN TESTS

assert np.linalg.norm(np.array(validation_losses) - np.array(validation_losses_grader)) < 1e-7

### END HIDDEN TESTS