# EECS189 Project T Final Notebook
## Week 2: Testing/Training, Cross-Validation, and Bias-Variance
## This is the solution notebook, so all code will be written out. In the student notebook, specific lines will be left blank for students to fill in
    
Topic: 9. Training/Testing, Cross-Validation, Bias-Variance

By: Team Sean

In [3]:
# import necessary libraries

from helper_fns import *

# data organization libraries
import numpy as np
import pandas as pd

# data visualization libraries
import plotly.express as px
#import matplotlib.pyplot as plt

# modeling libraries
import sklearn as sk


# The Setup
[TODO paste the The Setup slide from the slide deck once that's finalized.

Something like:

YOU have been chosen to pioneer the development of a machine learning system that ranks startups for probability of success.  We give you profiles of previous startups, including their capital, resource costs, country, Public or Nonprofit status, the amount of success they achieved years later, and much more. Can you estimate the success rates for today’s slew of new (fictional) startups? The investment firms are waiting to hear about the insight you provide!
]

# Goals

Prediction models that are used in the industry must be able to maintain accuracy on previously unseen data. At this point in your EECS education, you have only learned model assesment within the context of data the model has already seen. This creates a problem, because if we assess our model with the same data that was used to fit it, then we may overestimate how well our model does at prediction. After completing this assigment, students will know the methodology behind improving prediction models so they are ready for use in the real world.

# Visualizing the Training Data

The first thing to do is always to find out what you're working with. We load the data X and labels y.

In [10]:
# TODO load X and y
X = np.array([[1, 2], [3, 4]]*10)
y = np.array([2, 3]*10)

# shuffle x and y?

# Features present in the data in order. These column names will help you interpret trends you see in the data
# This was originally nested alongside the numerical data, but OLS requires us to use a matrix instead of a dictionary
# TODO populate this with the actual column names
FEATURES = ['product name', 'price']

## TODO something like "plot some of the features of startups with respect to their success rates. What do you notice about the correlations? Are there correlations?"

In [6]:
def create_feat_selector(features):
    """
    Helper function to create a subset of the data that will only include certain features
    A full list of features is defined in FEATURES
    """
    for f in features:
        assert f in FEATURES, "'{}' is not defined in the varaible FEATURES!".format(f)
    def feat_selector(X):
        indices = [FEATURES.index(f) for f in features]
        return X[:, indices]
    return feat_selector

# Structuring Your Machine Learning Model

In this project, we will focus on the general process behind training many machine learning models. To illustrate this, let's pull out the SVM models you have learned about this week!

In particular, we will be using it as a classifier, so it will be called a Support Vector Classifier (SVC). Use [sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to implement the training step below, given hyperparameters C and gamma. You may find the examples in the documentation very useful.

In [7]:
def train(X_train, y_train, C=1.0, gamma=0.2):
    """
    X_train - Training data
    y_train - Training labels
    C - a hyperparameter for SVC
    gamma - a hyperparameter for SVC
    
    Return reg, an instance of LinearRegression.fit() that represents the trained model
    """
    ##### START #####
    reg = SVC(C=C, gamma=gamma).fit(X_train, y_train)
    ##### END #####
    return reg

## TODO something about changing up the features passed in X_train based on the visualizations above

# K-Fold Cross Validation

How good are your hyperparameters? How do we measure that? One thought is to estimate accuracy on the test set. But wait! The test set should only be run AFTER we're done training everything or else our final results will be fudged. Here's an idea: let's use our knowledge of k-fold cross validation to split up our training set into a "training set" and a "validation set", and measure accuracy on the validation set. Average the accuracy over all k folds. What is the accuracy of your model?

Perhaps [sklearn.model_selection.KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) would be useful.

TODO say WOW the featurizations that have low training error and high test error are exactly the ones that have low variance, that was calculated in the previous part! O:

# Measuring the Bias and Variance of Your Model

Is the model doing well?

As noted in lecture, bias is generally expressed as a model's tendency to approximate certain functions even if conflicting features are in the training set, and variance is generally expressed as a model's difference in performance on the test set given a different training set. Also remember the irreducible error is that which cannot be eliminated because it is in our inherently noisy measurements of the labels.

A mathematical formulation is below:

$$\text{Total Noise} = (E[h(x|D)] - f(x))^2 + Var(h(x|D)) + Var(Z)$$

where h(x|D) is the model's prediction given a training dataset, f(x) is the true label, and Z is the inherent noise in the labels. These terms are bias, variance, and irreducible error, respectively. A detailed derivation can be found [here](https://www.eecs189.org/static/notes/n5.pdf) or in the notes.

## The Game Plan

1. Since these values are evaluated over many different training datasets, let's structure this like k-fold cross validation so we can randomly sample datasets. Perhaps, you can use [sklearn.model_selection.KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html). **NOTE:** X_test must be the same for all datasets for the bias variance measurement corresponding to the above to be correct.

2. For each of the k splits, 
 - train your model using your selected features
 - record predictions for each test datapoint x

3. After gathering the above information, average the predicted label over the k splits for each input x to obtain E[h(x|D)] and combine with the appropriate y label f(x). Average these values over inputs x to get the bias

4. Compute the variance of predictions for each input x. Then average over inputs x to get the variance Var(h(x|D))

What is the bias of your model? The variance? 

In [13]:
# TODO this  function is COMPLETELY untested

def get_bias_variance(X, y, feat_selector):
    """
    X- the original training data
    y- the labels for the original training data
    feat_selector- a function created by create_function_selector()
    """
    
    predictions = []
    true_labels = []

    ##### START STEP 1 #####
    n_rest = int(X.shape[0] * 0.75)
    X_rest, X_test = X[:n_rest], X[n_rest:]
    y_rest, y_test = y[:n_rest], y[n_rest:]
    
    kf = sklearn.model_selection.KFold(n_splits=4)
    for train_index, test_index in kf.split(X_rest):
        X_train, y_train = X[train_index], y[train_index]
    ##### END STEP 1 #####
        ##### START STEP 2 #####
        reg = train(X_train, y_train)
        predictions.append(reg.predict(X_test))
        true_labels.append(y_test)
        ##### END STEP 2 #####
    
    ##### START STEP 3 #####
    bias = np.mean((np.mean(predictions, axis=0) - true_labels[0])**2)
    ##### END STEP 3 #####
    
    ##### START STEP 4 #####
    variance = np.std(np.mean(predictions, axis=0))
    ##### END STEP 4 #####
    
    return bias, variance

In [15]:
get_bias_variance(X, y, create_feat_selector(['product name']))

(0.0, 0.4898979485566356)

## TODO something about repeating the above until your model prediction accuracy is ~90% (base this percentage on the best staff solution)