In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import vapeplot 
%matplotlib inline

#### Goal
--------------------

The goal of this project is to build a classifier that predicts the final rankings of bakers.
The idea is to fit a separate model for each episode, using data from previous episodes in the model.
A classifier for episode 1 will therefore likely be bad at predicting the final outcome, but a classifier for episode 5 might accurately predict who will be in the top 3 and who might be eliminated in the next episode.



#### Technical Challenge Rankings
--------------------------
* tech_med : running median of the baker's technical challenge rankings over all episodes up to and including the current one
* tech_mean : same as `tech_med` but the running mean
* tech : technical challenge ranking for that episode
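
As a rough illustration of how such running features could be derived, the sketch below uses a pandas expanding window per baker. The `tech` values are taken from the Annetha rows in the data preview; the `toy` frame and the use of `expanding()` are assumptions for illustration, not the code that produced the TSV file.

```python
import pandas as pd

# Per-episode technical rankings for one baker (values from the data preview)
toy = pd.DataFrame({
    'season': [1] * 5,
    'baker': ['Annetha'] * 5,
    'episode': [1, 2, 3, 4, 5],
    'tech': [2, 7, 0, 0, 0],
})

# Order episodes within each baker, then accumulate statistics episode by episode
toy = toy.sort_values(['season', 'baker', 'episode'])
per_baker = toy.groupby(['season', 'baker'])['tech']
toy['tech_mean'] = per_baker.transform(lambda s: s.expanding().mean())
toy['tech_med'] = per_baker.transform(lambda s: s.expanding().median())
print(toy)
```

Under this assumption, the running mean and median reproduce the `tech_mean` and `tech_med` values shown for this baker in the preview.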

In [2]:
tech = pd.read_csv("../RESULTS/gbbo.techinical.data.20190907.tsv",sep='\t')
feats = ['tech_mean','tech_med','tech']
tech.head()

Unnamed: 0,season,baker,index,episode,tech_mean,tech_med,tech,place
0,1,Annetha,6,1,2.0,2.0,2,6
1,1,Annetha,6,2,4.5,4.5,7,6
2,1,Annetha,6,3,3.0,2.0,0,6
3,1,Annetha,6,4,2.25,1.0,0,6
4,1,Annetha,6,5,1.8,0.0,0,6


The data is not scaled or normalized.

Now apply a MinMax scaler so that each feature has a minimum value of 0 and a maximum value of 1.
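
For reference, min-max scaling maps each feature value `x` to `(x - min) / (max - min)`, computed per column. A minimal sketch on made-up values (the array `x` is illustrative, not the GBBO data) checks that the manual formula agrees with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy rankings (illustrative values, not the GBBO data)
x = np.array([[2.0], [7.0], [0.0], [4.0]])

# Manual min-max scaling, computed column by column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```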

In [3]:
from sklearn.preprocessing import MinMaxScaler
# copy so that scaling does not modify `tech` in place
mms = tech.copy()
scaler = MinMaxScaler()
# fit the scaler
scaler.fit(mms[feats])
# transform values
mms[feats] = scaler.transform(mms[feats])



In [4]:
from sklearn.preprocessing import QuantileTransformer
# copy so that the transform does not modify `tech` in place
qua = tech.copy()
scaler = QuantileTransformer(
    n_quantiles=10,
    random_state=42,
    ignore_implicit_zeros=True, # only applies to sparse matrices
)
# fit the scaler
scaler.fit(qua[feats])
# transform values
qua[feats] = scaler.transform(qua[feats])
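
With its default settings, a quantile transform maps each feature through its estimated cumulative distribution, so the output is roughly uniform on [0, 1] while the ordering of samples is kept. The sketch below shows this on a skewed toy feature; the array `x` and the random generator are illustrative assumptions, not the GBBO data.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Skewed toy feature (illustrative values, not the GBBO data)
x = rng.exponential(scale=2.0, size=(50, 1))

qt = QuantileTransformer(n_quantiles=10, random_state=42)
u = qt.fit_transform(x)

# Values land in [0, 1] and the ordering of the samples is preserved
order = np.argsort(x[:, 0])
is_monotone = bool(np.all(np.diff(u[order, 0]) >= -1e-12))
print(u.min(), u.max(), is_monotone)
```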



In [12]:
epi=sorted(list(set(tech['episode'])))
for x in epi:
    print('EPISODE:',x,
          '  Training Set Length: ',
          len(qua.loc[qua['episode']==x]))

EPISODE: 1   Training Set Length:  80
EPISODE: 2   Training Set Length:  80
EPISODE: 3   Training Set Length:  80
EPISODE: 4   Training Set Length:  80
EPISODE: 5   Training Set Length:  80
EPISODE: 6   Training Set Length:  80
EPISODE: 7   Training Set Length:  70
EPISODE: 8   Training Set Length:  70
EPISODE: 9   Training Set Length:  58
EPISODE: 10   Training Set Length:  58


#### Training Strategy
---------------------
For each episode, we evaluate the performance of each classifier using two different methods.

* Leave-One-Out (`loo`) 

* Stratified Cross Validation (`scv`)

For `loo` we set aside 1 season as the test set and train on the remaining seasons. We do this iteratively until every season has been used as the test set.

For `scv` we randomly subsample 1/5th of the dataset for testing and train on the remaining 4/5ths. Stratified means that the random subsampling keeps the same proportion of class labels (here, final rankings) in each split. Keeping the same proportion of class labels ensures that each label is represented in both the test and training sets. Like `loo`, this process is done iteratively so that each fold is evaluated.
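
One way to express both splitting schemes is with scikit-learn's `LeaveOneGroupOut` (using the season as the group) and `StratifiedKFold`. This is a sketch on made-up data; the arrays `seasons`, `X`, and `y` are hypothetical stand-ins, and the notebook's own `loo` code may be organized differently.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

# Toy data: 4 seasons x 10 bakers, 3 made-up rank classes (not the GBBO data)
seasons = np.repeat([1, 2, 3, 4], 10)
y = np.arange(40) % 3
X = np.random.default_rng(0).normal(size=(40, 3))

# `loo`: hold out one whole season at a time
loo_splits = list(LeaveOneGroupOut().split(X, y, groups=seasons))
for train_idx, test_idx in loo_splits:
    print('held-out season:', np.unique(seasons[test_idx]),
          'train size:', len(train_idx))

# `scv`: 5 stratified folds, each keeping roughly the same class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scv_splits = list(skf.split(X, y))
for train_idx, test_idx in scv_splits:
    print('test size:', len(test_idx),
          'class counts:', np.bincount(y[test_idx]))
```

Grouping by season keeps every baker from the held-out season out of the training set, which a plain per-row leave-one-out would not do.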

----------------

#### Classifiers 
----------------

We want to compare the performance of different classifiers on our data, since different methods suit different kinds of data (for example, linear vs. non-linear decision boundaries).

* Linear Support Vector Machine (SVM) 
* RBF SVM 
* Stochastic Gradient Descent (SGD)
* K Nearest Neighbors
* Gaussian Naive Bayes
* Decision Trees
* Random Forest
* Neural Network


In [17]:
# classifiers
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
clfs = {
    'Linear SVM' : SVC(kernel='linear'),
    'RBF SVM' : SVC(kernel='rbf'),
    'SGD' : SGDClassifier(),
    'KNN' : KNeighborsClassifier(),
    'Gaussian NB' : GaussianNB(),
    'Decision Trees' : DecisionTreeClassifier(),
    'Random Forest' : RandomForestClassifier(n_estimators=100),
    'Neural Network' : MLPClassifier(hidden_layer_sizes=(100,30),max_iter=1000)
}
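
A dictionary of classifiers like this can be looped over so that every model is scored with the same folds. The sketch below is only an illustration: the synthetic data from `make_classification`, the reduced `demo_clfs` dictionary, and the choice of accuracy as the metric are all assumptions, not the notebook's evaluation code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 3 features and 3 classes (not the GBBO features)
X, y = make_classification(n_samples=120, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# A small, hypothetical subset of classifiers to keep the sketch fast
demo_clfs = {
    'KNN': KNeighborsClassifier(),
    'Gaussian NB': GaussianNB(),
    'Decision Trees': DecisionTreeClassifier(random_state=0),
}

# Score every classifier on the same stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_acc = {}
for name, clf in demo_clfs.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    mean_acc[name] = scores.mean()
    print(f'{name}: {mean_acc[name]:.3f}')
```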

#### Performance Evaluation 
--------------------------
Record the R

In [None]:
# LOO 
def gbbo_loo(df,clfs):
    results={}
    seasons = sorted(list(set(df['season'])))
    for s in seasons:
        # split test and training set
        test = df.loc[df['season']==s].sample(frac=1.) # shuffles the data
        train= df.loc[df['season']!=s].sample(frac=1.)
        
        episodes= sorted(list(set(df['episode'])))
    