### Cross validation

- In this notebook we will get practice using cross validation to perform hyperparameter tuning. 

- Sklearn [implements cross validation for you](https://scikit-learn.org/stable/modules/cross_validation.html), and it is totally fine to use the built-in implementation for future homeworks. 

- However, you will understand this concept much better if you practice implementing it yourself.
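
For reference, the built-in route mentioned above can look roughly like the sketch below. It assumes a small synthetic dataset (`X_demo`, `y_demo` are made-up stand-ins, not the diabetes data) and uses sklearn's `cross_val_score` with a k-nearest-neighbors classifier.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)

# cv=5 asks for 5 folds; the result holds one accuracy value per fold
fold_scores = cross_val_score(knn, X_demo, y_demo, cv=5)
mean_score = fold_scores.mean()
print(fold_scores.shape, mean_score)
```

The rest of the notebook builds the same idea by hand, so this is only a point of comparison.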

In [23]:
# Import needed modules
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import math

# configure matplotlib to show plots in the notebook itself
%matplotlib inline 

### Data

In this notebook, we will be using the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The data consists of patient records with a number of features, along with a binary label indicating whether the patient has diabetes. Note that all patients in the dataset are "females at least 21 years old of Pima Indian heritage." The `Outcome` column records whether a patient has diabetes.

In [107]:
df = pd.read_csv('diabetes.csv') #Load the dataset

# Let's go ahead and start with a two-dimensional dataset to build intuitions
X = df[['Glucose', 'BloodPressure']]
Y = df["Outcome"]

#### Break the data into training and test

- Divide the dataset into training and testing. Use the last 100
instances for test and the rest for training. Pandas [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) is helpful for this.
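
As a hint for the slicing pattern, here is a minimal sketch on a toy frame. The names `toy`, `toy_train`, `toy_test` and the value `n_test = 3` are assumptions for illustration; the exercise itself holds out 100 rows.

```python
import pandas as pd

# Toy stand-in frame (an assumption; the notebook itself loads diabetes.csv)
toy = pd.DataFrame({"Glucose": range(10), "Outcome": [0, 1] * 5})
n_test = 3  # the exercise uses 100

toy_train = toy.iloc[:-n_test]  # everything except the last n_test rows
toy_test = toy.iloc[-n_test:]   # only the last n_test rows
print(len(toy_train), len(toy_test))
```

Negative positions in `iloc` count from the end, which is why the same number appears in both slices.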

In [108]:
# answer cell. Your code here

#### Naive approach

Let's start off by tuning $K$ with a naive grid search. Loop over all values of $K$ from 2 to 75 and find the value of $K$ that gets the highest accuracy on the training set. This process is likely familiar from other assignments.

In [91]:
# score the classifier on all neighbors from K=2 to K=75
accuracy_record = []
neighbors = range(2, 76)  # K = 2, ..., 75 (range excludes its end value)

In [None]:
#Generate plot to pick a K value
from sklearn.neighbors import KNeighborsClassifier
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, accuracy_record, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Training Accuracy')
plt.show()

In [1]:
### Pick a value of K based on the plot, and compute the accuracy on the test set 

# your code here

###### Reflection 

What value of $K$ did you pick? Was your accuracy on the test set higher or lower than you expected? 

#### A better approach with cross validation

Now we will try that again using cross validation. Cross validation is important and useful! Coding it yourself will help you really understand the concept. So we will build up to an implementation in the rest of the notebook. Hopefully, it will get us to better test set accuracy. 

The first thing to do is to divide the training set into $K$ equal folds. (Here $K$ counts folds, not neighbors.) For each fold, we will need: 

- `X_held_out`: a matrix of held out training data (i.e. a pseudo test set)
- `Y_held_out`: a vector of held out labels

- `X_remainder`: training data not included in the held out set
- `Y_remainder`: labels for training data not included in the held out set

Let's start off by writing a function that takes the training set as input, along with the indexes of the held out data, and returns a dictionary with the four fields listed above.

In [112]:
def make_fold(X_train, Y_train, indexes_of_held_out_data):
    '''
    Take the training data and return a dictionary with four keys:
        "X_held_out", "Y_held_out", "X_remainder", "Y_remainder"
    Links: 
        - https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-basics
        - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
    '''
    return {"X_held_out": None, 
            "Y_held_out": None,
            "X_remainder": None,
            "Y_remainder": None}

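One possible way to fill in such a function is sketched below. It is not the only solution: the name `make_fold_sketch` is hypothetical, and the sketch assumes the held out indexes are index labels present in both the features and the labels, so that `.loc` and `.drop` (the two pandas tools linked in the docstring) can be used.

```python
import pandas as pd

def make_fold_sketch(X_train, Y_train, held_out_labels):
    # Assumes held_out_labels are index labels found in both X_train and Y_train
    return {"X_held_out": X_train.loc[held_out_labels],    # rows set aside
            "Y_held_out": Y_train.loc[held_out_labels],
            "X_remainder": X_train.drop(held_out_labels),  # all other rows
            "Y_remainder": Y_train.drop(held_out_labels)}

# Toy data, made up for illustration
X_toy = pd.DataFrame({"Glucose": [90, 120, 150, 100],
                      "BloodPressure": [70, 80, 85, 60]})
Y_toy = pd.Series([0, 1, 1, 0], name="Outcome")

fold = make_fold_sketch(X_toy, Y_toy, [1, 2])
print(fold["X_held_out"].shape, fold["X_remainder"].shape)
```

If the index labels of your training set do not match row positions, a positional variant based on `iloc` would be needed instead.
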
Now let's use the `make_fold` function to complete the `make_folds` function, which divides the data into $K$ folds.

In [3]:
def make_folds(_X_train, _Y_train, K):
    '''
    Divide the training set into K folds of (nearly) equal size using the make_fold function above

    Args:
        _X_train (pd.DataFrame): N rows and J feature columns
        _Y_train (pd.Series): N binary labels
        K (int): the number of folds

    Returns:
        Return a dictionary with keys 0 thru K-1. Each key maps to the dictionary returned by make_fold 
        for that fold. The held out subsets should not overlap
                              {0: {"X_held_out": fold_0_X_held_out, "Y_held_out": fold_0_Y_held_out,
                                   "X_remainder": fold_0_X_remainder, "Y_remainder": fold_0_Y_remainder},
                               1: {"X_held_out": fold_1_X_held_out, "Y_held_out": fold_1_Y_held_out,
                                   "X_remainder": fold_1_X_remainder, "Y_remainder": fold_1_Y_remainder}
                               ... 
                               }

        In some cases, the sizes of the folds will differ by 1 
        (e.g. if N=16 and K=3, you should have folds of sizes 5, 5 and 6)
    '''
    fold_size = int(len(_X_train)/K)
    N = len(_X_train)
    ou = {}
    for f in range(K):
        select_these = None # TODO: fill in with indexes to select
        ou[f] = make_fold(_X_train, _Y_train, select_these)
    return ou
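
The tricky part of the TODO above is choosing which rows go into each fold when N is not a multiple of K. The sketch below shows one way to compute contiguous positions whose fold sizes differ by at most 1. The helper name `fold_index_ranges` is hypothetical, and it returns row positions, which match index labels only if the training set has a default 0..N-1 index.

```python
def fold_index_ranges(n_rows, n_folds):
    """Hypothetical helper: contiguous row positions for each fold."""
    # Boundary f is floor(f * N / K), so consecutive boundaries are
    # floor(N / K) or ceil(N / K) apart
    bounds = [(f * n_rows) // n_folds for f in range(n_folds + 1)]
    return [list(range(bounds[f], bounds[f + 1])) for f in range(n_folds)]

ranges = fold_index_ranges(16, 3)
sizes = [len(r) for r in ranges]
print(sizes)  # [5, 5, 6]
```

Every row lands in exactly one fold, which is the non-overlap property the docstring asks for.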

### Perform cross validation! 

We are now ready to perform cross validation. In the code below, for each number of neighbors $K$:
1. Use your `make_folds` function to divide the training data into folds. (The folds can stay the same for every $K$, so you only need to make them once.)
2. Then, for each fold, train the classifier on the remaining data and "test" it on the held out validation data. Test is in quotes here because, remember, we are just dividing up the training set.
3. Average the performance over each of the folds to get the average held out performance.
4. Compute the training performance for each $K$.

Based on the average held out performance, pick the best $K$. 
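
The loop described above can be sketched end to end on synthetic data. Everything here is an assumption for illustration: `X_demo` and `y_demo` are made up, the folds come from `np.array_split` rather than your `make_folds`, and only three neighbor counts are tried.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Make the folds once; they are reused for every number of neighbors
n_folds = 3
all_positions = np.arange(len(X_demo))
fold_positions = np.array_split(all_positions, n_folds)

mean_held_out = {}
for k in (3, 5, 7):
    accuracies = []
    for held_out in fold_positions:
        remainder = np.setdiff1d(all_positions, held_out)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_demo[remainder], y_demo[remainder])          # train on the rest
        accuracies.append(knn.score(X_demo[held_out], y_demo[held_out]))
    mean_held_out[k] = float(np.mean(accuracies))              # average over folds

best_k = max(mean_held_out, key=mean_held_out.get)
print(mean_held_out, best_k)
```

Picking `best_k` from the held out averages, rather than from training accuracy, is the step that distinguishes this from the naive approach.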

In [None]:
### Your cross validation code here


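# make folds. These can stay the same for all K
folds = None # TODO: call make_folds on the training data


# loop over values of K
for k in range(3, 75): 
    
    accuracies = []
    knn = None # Instantiate a KNN classifier examining k neighbors

    # folds is a dictionary, so fold is a key; folds[fold] holds the data
    for fold in folds:
        
        # train the classifier on the remaining data
        # test on the held out data
        # Record the accuracy
        pass # TODO: replace with your code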

#### Best K?

Make another plot showing the validation accuracy and the training accuracy for different values of $K$. This is sometimes called a [validation curve](https://scikit-learn.org/stable/modules/learning_curve.html).

In [None]:
# Your plot here

#### Reflection

Based on your plot, which values of $K$ might be underfitting? Which values of $K$ might be overfitting?

#### Test accuracy

1. Which $K$ seems to do best on validation data? 
2. Using this value of $K$, compute your test accuracy again. What do you observe? (Remember, normally, you would compute test accuracy only once because you don't peek at the test data. But this notebook is just for practice.)