<h2>About this Project</h2>
<p>In this project, you will implement a regression tree that predicts the severity of a patient's heart disease based on a set of attributes. The training data is in <code>heart_disease_train.csv</code> and the test data is in <code>heart_disease_test.csv</code>. Before you start, take a look at the two CSV files and at <code>attribute.txt</code>, which describes each attribute in the CSV files.</p>

<h3>Evaluation</h3>

<p>You are expected to write code where you see <em># YOUR CODE HERE</em> within the cells of this notebook. Upon submitting your work, the code you write at these designated positions will be assessed using an "autograder" that will run a series of tests on your code. You will receive instant feedback from the autograder that will identify issues with and errors in your code. Use this feedback to improve your code if you need to resubmit. Be sure not to change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with the autograder. Also, remember to execute all code cells, not just those you’ve edited, to ensure the code runs properly.</p>
    
<p>You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Q&A discussion board to engage with your peers or seek assistance from the instructor.</p>

<p><strong>This exercise must be successfully completed in order to receive credit for this course.</strong></p>

<p>Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).</p>

<h3>Submit Code for Autograder/Instructor Feedback</h3>

<p>Once you have completed your work on this notebook, you will submit your code for autograder/instructor review. Follow these steps:</p>

<ol>
<li>Save your notebook. Though the system should automatically save your progress, you should ensure the latest version of your work is saved before submitting. </li>
  <li>In the blue menu bar along the top of the code exercise window, you’ll see a menu item called <strong>Education</strong>. In the <strong>Education</strong> menu, click <strong>Mark as Completed</strong> to submit your code for autograder/instructor review. This process will take a moment and a progress bar will show you the status of your submission.</li>
	<li>Once your work is marked as complete, the results of the autograder will automatically be presented in a new tab within the code exercise window. You can click on the assessment name in this feedback window to see more details regarding specific feedback/errors in your code submission.</li>
  <li>The Jupyter notebook will always remain accessible in the first tabbed window of the exercise. To reattempt the work, you will first need to click <strong>Mark as Uncompleted</strong> in the <strong>Education</strong> menu and then proceed to make edits to the notebook. Once you are ready to resubmit, follow steps one through three. You can repeat this procedure as many times as necessary.</li>
</ol>

## Getting Started

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import helper as h

## Implement a Regression Tree

### Part One [Graded]

Now, implement a function called <code>load_data</code>, which loads the given <code>.csv</code> file and returns <code>X, y</code>, where <code>X</code> contains the patients' attributes and <code>y</code> contains the severity of each patient's heart disease. When <code>label=False</code>, the function returns only <code>X</code>.

In [3]:
def load_data(file='heart_disease_train.csv', label=True):
    '''
    Input:
        file: filename of the dataset
        label: a boolean to decide whether to return the labels or not
    Returns:
        X: patient attributes
        y: severity
    '''
    X = None
    y = None
    ### BEGIN SOLUTION
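    with open(file) as f:
        # The first line is the header row with the column names.
        columns = f.readline().rstrip().split(',')
        df = {name: [] for name in columns}

        # Every remaining line is one patient; store each value under its column.
        for line in f:
            for col_idx, value in enumerate(line.rstrip().split(',')):
                df[columns[col_idx]].append(float(value))

    if label:
        # Split off the 'label' column as y; the remaining columns form X.
        y = np.array(df['label'])
        columns.remove('label')
        X = np.vstack([df[name] for name in columns]).T
        return X, y
    else:
        # No labels requested: every column is treated as an attribute.
        X = np.vstack([df[name] for name in columns]).T
        return X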
    ### END SOLUTION

In [4]:
X, y = load_data()
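The loader above parses the file by hand. As a point of comparison, here is a minimal sketch of the same idea using `np.loadtxt`. The function name `load_csv_sketch` and the toy column names `age` and `chol` are made up for illustration; only the `label` column name comes from the solution above. It reads from an in-memory string so it can run without the course files.

```python
import io
import numpy as np

def load_csv_sketch(f, label=True):
    """Hypothetical alternative loader: f is an open text file (or StringIO)."""
    # Read the header row to learn the column names.
    header = f.readline().strip().split(',')
    # Parse the remaining rows as floats; ndmin=2 keeps a 2-D shape even for one row.
    data = np.loadtxt(f, delimiter=',', ndmin=2)
    if not label:
        return data
    # Pull out the 'label' column as y and drop it from X.
    idx = header.index('label')
    y = data[:, idx]
    X = np.delete(data, idx, axis=1)
    return X, y

# Toy data with made-up attribute names and values (not from the real dataset).
toy = "age,chol,label\n63,233,0\n67,286,2\n41,204,1\n"
X_toy, y_toy = load_csv_sketch(io.StringIO(toy))
print(X_toy.shape, y_toy)
```

Whichever approach you use, `X` should have one row per patient and `y` one entry per row of `X`.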

In [6]:
def load_data_grader(file='heart_disease_train.csv', label=True):
    '''
    Input:
        file: filename of the dataset
        label: a boolean to decide whether to return the labels or not
    Returns:
        X: patient attributes
        y: severity
    '''
    X = None
    y = None
    with open(file) as f:
        columns = f.readline().rstrip().split(',')
        df = {i: [] for i in columns}
        
        for i in f.readlines():
            for ind_j, j in enumerate(i.rstrip().split(',')):
                df[columns[ind_j]].append(float(j))
    
    
    if label:
        y = np.array(df['label'])
        columns.remove('label')
        X = np.vstack([df[i] for i in columns]).T
        return X, y
    else:
        X = np.vstack([df[i] for i in columns]).T
        return X

Xtrain, ytrain = load_data()
Xtrain_grader, ytrain_grader = load_data_grader()
Xtest = load_data(file='heart_disease_test.csv', label=False)

def load_data_test1(Xtrain, ytrain):
    return (len(Xtrain) == len(ytrain))

def load_data_test2(Xtrain, Xtrain_grader):
    return (len(Xtrain) == len(Xtrain_grader))

def load_data_test3(ytrain, ytrain_grader):
    y_unique = np.sort(np.unique(ytrain))
    y_grader_unique = np.sort(np.unique(ytrain_grader))
    
    if len(y_unique) != len(y_grader_unique):
        return False
    else:
        return np.linalg.norm(y_unique - y_grader_unique) < 1e-7
    
def load_data_test4(Xtrain, Xtrain_grader):
    Xtrain_flatten = np.sort(Xtrain.flatten())
    Xtrain_grader_flatten = np.sort(Xtrain_grader.flatten())
    return np.linalg.norm(Xtrain_flatten - Xtrain_grader_flatten) < 1e-7

### BEGIN HIDDEN TESTS
assert load_data_test1(Xtrain, ytrain), "[Failed] load_data: The number of examples does not match with number of labels"
assert load_data_test2(Xtrain, Xtrain_grader), "[Failed] load_data: You did not load the right number of examples"
assert load_data_test3(ytrain, ytrain_grader), "[Failed] load_data: The unique values in your labels are incorrect"
assert load_data_test4(Xtrain, Xtrain_grader), "[Failed] load_data: The values in your data are incorrect"
### END HIDDEN TESTS

Now, you will use the regression tree from the previous assignment for this prediction problem. As a reminder:

In [8]:
# Create a regression tree with no restriction on its depth.
# To create a tree of depth k instead, call h.RegressionTree(depth=k).
tree = h.RegressionTree(depth=np.inf)

# To fit/train the regression tree
tree.fit(X, y)

# To use the trained regression tree to make predictions
pred = tree.predict(X)

### Part Two [Graded]

In <code>test</code>, you will find the optimal regression tree for the dataset <code>heart_disease_train.csv</code> and return its predictions on <code>heart_disease_test.csv</code>. Your predictions will be evaluated using <code>square_loss</code>, and you will receive full credit if the test loss of your regression tree is less than 2. You may use any functions that you implemented in the previous project.

In [9]:
def square_loss(pred, truth):
    return np.mean((pred - truth)**2)
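To make the loss concrete, here is a small worked sketch on made-up numbers (the arrays are illustrative and not taken from the dataset). It also computes the loss of always predicting the mean label, which is a simple baseline that a useful tree should beat.

```python
import numpy as np

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

truth = np.array([0.0, 1.0, 2.0, 4.0])
pred = np.array([0.0, 1.0, 3.0, 2.0])

# Errors are 0, 0, 1, -2, so the squared errors are 0, 0, 1, 4 and their mean is 1.25.
loss = square_loss(pred, truth)

# Baseline: predict the mean label (1.75) for every example.
baseline = square_loss(np.full_like(truth, truth.mean()), truth)

print(loss, baseline)
```

Because the errors are squared, one prediction that is off by 2 costs four times as much as one that is off by 1.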

In [10]:
def test():
    '''
    Returns:
        prediction: the predictions of your regression tree on heart_disease_test.csv
    '''
    prediction = None
    Xtrain, ytrain = load_data(file='heart_disease_train.csv', label=True)
    Xtest = load_data(file='heart_disease_test.csv', label=False)
    
    ### BEGIN SOLUTION
    tree = h.RegressionTree(depth=2)
    tree.fit(Xtrain, ytrain)
    
    prediction = tree.predict(Xtest)
    ### END SOLUTION
    return prediction
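The solution above fixes the depth by hand. One common way to choose a depth is to hold out part of the training data and compare the validation loss across candidate depths. The sketch below illustrates that idea under stated assumptions: `TinyTree` is a hypothetical stand-in written only so the example runs on its own (it is not the course's `h.RegressionTree`, and its interpretation of `depth` may differ), and `select_depth` is an illustrative helper name. The data is synthetic.

```python
import numpy as np

def square_loss(pred, truth):
    return np.mean((pred - truth) ** 2)

class TinyTree:
    """Hypothetical stand-in with a depth/fit/predict interface (assumption)."""
    def __init__(self, depth=np.inf):
        self.depth = depth

    def fit(self, X, y):
        self.root = self._build(X, y, 0)

    def _build(self, X, y, level):
        # Make a leaf when the depth limit is reached or all labels agree.
        if level >= self.depth or np.all(y == y[0]):
            return float(np.mean(y))
        best = None
        for j in range(X.shape[1]):
            # Skipping the largest value keeps both sides of the split non-empty.
            for t in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= t
                sse = (np.sum((y[left] - y[left].mean()) ** 2)
                       + np.sum((y[~left] - y[~left].mean()) ** 2))
                if best is None or sse < best[0]:
                    best = (sse, j, t, left)
        if best is None:  # every feature is constant, so no split exists
            return float(np.mean(y))
        _, j, t, left = best
        return (j, t,
                self._build(X[left], y[left], level + 1),
                self._build(X[~left], y[~left], level + 1))

    def predict(self, X):
        return np.array([self._walk(self.root, x) for x in X])

    def _walk(self, node, x):
        # Internal nodes are tuples; leaves are plain floats.
        while isinstance(node, tuple):
            j, t, lo, hi = node
            node = lo if x[j] <= t else hi
        return node

def select_depth(make_tree, X, y, depths, val_fraction=0.25, seed=0):
    """Return the depth with the lowest held-out square loss, plus all losses."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = max(1, int(len(y) * val_fraction))
    val, train = order[:n_val], order[n_val:]
    losses = {}
    for d in depths:
        tree = make_tree(d)
        tree.fit(X[train], y[train])
        losses[d] = square_loss(tree.predict(X[val]), y[val])
    return min(losses, key=losses.get), losses

# Synthetic data: a noisy step function of the first feature.
rng = np.random.default_rng(1)
X_toy = rng.uniform(size=(80, 3))
y_toy = np.floor(4 * X_toy[:, 0]) + rng.normal(scale=0.3, size=80)

best_depth, val_losses = select_depth(TinyTree, X_toy, y_toy,
                                      depths=[1, 2, 3, 4, np.inf])
print(best_depth, val_losses)
```

In the notebook, the same helper could presumably be called with `lambda d: h.RegressionTree(depth=d)` in place of `TinyTree`, since the reminder cell shows that constructor taking a `depth` argument. An unrestricted tree usually fits the training set exactly, which is why the comparison is made on held-out data rather than on the training loss.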

In [1]:
gt = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4])

pred = test()
test_loss = square_loss(pred, gt)
print('Your test loss: {:0.4f}'.format(test_loss))
### BEGIN HIDDEN TESTS
assert (test_loss < 2.0), "[Failed] test: Your test loss is not less than 2.0"
### END HIDDEN TESTS
