# Estimator extraction

### Content

+ [1. Notebook description](#1.-Notebook-Description)
+ [2. Extraction](#2.-Extraction)
    + [2.1 LDA](#2.1-LDA)
    + [2.2 SVM](#2.2-SVM)
    + [2.3 CSP](#2.3-CSP)
    + [2.4 KNN](#2.4-KNN)

---

# 1. Notebook Description

With the grid-search results aggregated into NumPy binary files for each model run, we now want to find, for each model group, the estimator hyperparameters that yielded the best scores. We know which models were the best from the plot notebooks, but it is not yet clear which estimator parameters produced those scores.
Every model, good or bad, selects a best parameter set in each run; with some thresholding, we want to isolate the most frequently selected sets for each model group (LDA, SVM, CSP and KNN).

---

**Imports:**

In [1]:
from digits.utils import dotdict

import numpy as np
import pandas as pd
import glob
from itertools import combinations

---
# 2. Extraction


Given the best model from the plot results, we first open the NumPy binary files containing the result data for that model for **each** subject and load only the `results` dictionary from each file into a variable of the same name.

The keys in this dictionary are tuples of the form `(i,j)`, where `i` and `j` are the digits we tested. Remember that we have **CV** and **final** results for each digit combination. Because we don't care about the subject ID here, we simply append the current **final** score and the associated parameter(s) to a list.

After reading in the results for all model and subject combinations, we create a pandas data frame from the lists we just filled.
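The loading pattern described above can be tried out on a synthetic file. This is a minimal sketch, not the project's actual data: the file name, the two digit pairs and the scores are made up, and a plain dict stands in for `dotdict` from `digits.utils`. It assumes each `.npz` file stores `results` as a pickled dictionary, which is why `allow_pickle=True` is passed to `np.load`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical result structure: (i, j) -> list of (final score, best params).
fake_results = {
    "data": {
        (0, 1): [(0.71, {"lda__shrinkage": 0.2})],
        (0, 2): [(0.64, {"lda__shrinkage": "auto"})],
    }
}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo_model_subject01.npz")
# Saving a dict wraps it in a 0-d object array, hence reshape(-1)[0] on load.
np.savez(path, results=fake_results)

scores, shrinks = [], []
with np.load(path, allow_pickle=True) as objs:
    results = objs["results"].reshape(-1)[0]

# Collect the final score and the winning parameter for every digit pair.
for comb, finals in results["data"].items():
    for score, params in finals:
        scores.append(score)
        shrinks.append(params["lda__shrinkage"])

df_demo = pd.DataFrame({"score": scores, "shrinkage": shrinks})
print(df_demo)
```

Each row of `df_demo` is one (digit pair, subject) result, so the subject ID is deliberately lost at this point.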

---
## 2.1 LDA

Extract parameters for the best LDA model.
The only variable estimator parameter here is the regularization parameter `shrinkage`. It can be set either to `auto` (Ledoit-Wolf) or to a value in `np.linspace(0,0.2,10)`.
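For orientation, the candidate values can be listed explicitly. The sketch below only illustrates the grid described above; the dictionary layout is an assumption based on the `lda__shrinkage` key used in this notebook, not the original grid-search configuration.

```python
import numpy as np

# Ten evenly spaced shrinkage values between 0 and 0.2, plus Ledoit-Wolf ('auto').
shrinkage_grid = ["auto"] + list(np.linspace(0, 0.2, 10))

# Assumed layout: a pipeline step named 'lda' with a 'shrinkage' parameter.
param_grid = {"lda__shrinkage": shrinkage_grid}

print(np.round(np.linspace(0, 0.2, 10), 6))
```

The rounded values are the ones that appear in the `shrinkage` index of the result table for this model.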

In [2]:
model = 'lda_ss2_201_700_fft_None_200_40bins_nopower'

shrinks = []
scores = []
for resfile in glob.glob('results/'+model+'*.npz'):
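    # The results dictionary is stored as a pickled object array (NumPy >= 1.16.3 requires allow_pickle=True)
    objs = np.load(resfile, allow_pickle=True)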
    results = dotdict(objs["results"].reshape(-1)[0])
    data = results.data
    for comb in combinations(np.arange(10), 2):
        for finalscore in data[comb]:
            scores.append(finalscore[0])
            shrink = finalscore[1]['lda__shrinkage']
            shrinks.append(shrink)

df_lda = pd.DataFrame({'score':scores, 'shrinkage':shrinks})



After we create the data frame, we describe the dataset by aggregating all scores (over subjects $\times$ combinations $=1440$) with six functions: `count`, `mean`, `median`, `std`, `min` and `max`. We can optionally threshold the results to list only the most common or best parameters. Because the aggregation creates a two-level column index, we have to specify a column with a tuple, e.g. `('score', 'count')` or `('score', 'mean')`.
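The two-level column index is easiest to see on a toy frame. The following sketch uses made-up scores and only three of the aggregation functions; `df_toy`, `stats` and `frequent` are illustrative names.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "score": [0.6, 0.8, 0.7, 0.9, 0.5],
    "shrinkage": [0.1, 0.1, 0.2, 0.2, 0.2],
})

# Aggregating with a dict of lists yields columns like ('score', 'count').
stats = df_toy.groupby("shrinkage").agg({"score": ["count", "mean", "median"]})
print(stats.columns.tolist())

# Columns are addressed with tuples, both for filtering and for sorting.
frequent = stats[stats[("score", "count")] > 2]
print(frequent.sort_values(by=("score", "mean"), ascending=False))
```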

In [3]:
dfg = df_lda.groupby(['shrinkage']).agg({'score': ['count', 'mean', 'median', 'std', 'min', 'max']})

In [4]:
threshold = max(dfg[('score', 'count')])/2
#threshold = 0
dftmp = dfg[dfg[('score', 'count')] > threshold]
dftmp.sort_values(by=('score','mean'), ascending=False)

| shrinkage | score count | score mean | score median | score std | score min | score max |
|---|---|---|---|---|---|---|
| 0.044444 | 168 | 0.720001 | 0.725795 | 0.109397 | 0.428571 | 0.947368 |
| 0.111111 | 146 | 0.718396 | 0.730769 | 0.109963 | 0.416667 | 0.964286 |
| 0.066667 | 156 | 0.714293 | 0.717955 | 0.101758 | 0.37037 | 0.931298 |
| 0.088889 | 156 | 0.713823 | 0.708644 | 0.100652 | 0.444444 | 0.961538 |
| 0.177778 | 143 | 0.711518 | 0.714286 | 0.099387 | 0.428571 | 1.0 |
| 0.155556 | 116 | 0.705833 | 0.692308 | 0.103993 | 0.384615 | 0.928571 |
| 0.2 | 195 | 0.703063 | 0.713178 | 0.103538 | 0.423077 | 0.962963 |
| 0.133333 | 135 | 0.697144 | 0.699187 | 0.100179 | 0.448276 | 0.892857 |
| 0.022222 | 141 | 0.658584 | 0.653846 | 0.114928 | 0.304348 | 0.961538 |


In [5]:
print(_.to_latex())

\begin{tabular}{lrrrrrr}
\toprule
{} & score &           &           &           &           &           \\
{} & count &      mean &    median &       std &       min &       max \\
shrinkage &       &           &           &           &           &           \\
\midrule
0.044444  &   168 &  0.720001 &  0.725795 &  0.109397 &  0.428571 &  0.947368 \\
0.111111  &   146 &  0.718396 &  0.730769 &  0.109963 &  0.416667 &  0.964286 \\
0.066667  &   156 &  0.714293 &  0.717955 &  0.101758 &  0.370370 &  0.931298 \\
0.088889  &   156 &  0.713823 &  0.708644 &  0.100652 &  0.444444 &  0.961538 \\
0.177778  &   143 &  0.711518 &  0.714286 &  0.099387 &  0.428571 &  1.000000 \\
0.155556  &   116 &  0.705833 &  0.692308 &  0.103993 &  0.384615 &  0.928571 \\
0.200000  &   195 &  0.703063 &  0.713178 &  0.103538 &  0.423077 &  0.962963 \\
0.133333  &   135 &  0.697144 &  0.699187 &  0.100179 &  0.448276 &  0.892857 \\
0.022222  &   141 &  0.658584 &  0.653846 &  0.114928 &  0.304348 &  0.961538 \\
\bottomrule
\end{tabular}

---
## 2.2 SVM

As with the [LDA](#2.1-LDA) above, we load all subject-specific results for a named model (the best one).
This time we create two data frames, because we ran two distinct grid searches in accordance with the method described by Keerthi & Lin (2003).

The resulting scores are 2-tuples, with the first element holding scores and values for the linear SVM and the second element holding the score for the subsequent non-linear SVM based on the best C from the first step.
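As a plausibility check on how the two steps are linked, the sketch below assumes that the second search follows the line $\log \sigma^2 = \log C - \log \tilde{C}$ proposed by Keerthi & Lin, where $\tilde{C}$ is the best linear `C`, and that the RBF kernel is parametrized with $\gamma = 1/(2\sigma^2)$. Under these assumptions $\gamma = \tilde{C}/(2C)$. The pairing of linear and RBF `C` values (taken from the result tables of this section) is a guess, and `gamma_on_line` is a hypothetical helper, not part of the original pipeline.

```python
import numpy as np


def gamma_on_line(c_linear, c_rbf):
    """gamma implied by sigma^2 = C / C_tilde and gamma = 1 / (2 * sigma^2)."""
    sigma2 = c_rbf / c_linear
    return 1.0 / (2.0 * sigma2)


# Assumed pairing: (best linear C, C of the subsequent RBF SVM).
pairs = [
    (3.79269e-07, 37.926902),
    (1.274275e-06, 127.427499),
    (4.281332e-06, 428.13324),
]

gammas = [gamma_on_line(c_lin, c_rbf) for c_lin, c_rbf in pairs]
print(gammas)
```

If the assumptions hold, every pair should give approximately the same $\gamma$, which can be compared with the `gamma` index of the RBF result table.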

First we'll extract only the linear scores.

In [6]:
model = 'svc_ss2_201_700_fft_None_200_40bins_nopower'

cs = []
scores = []
for resfile in glob.glob('results/'+model+'*.npz'):
    # Object arrays are pickled, so allow_pickle=True is needed on NumPy >= 1.16.3
    objs = np.load(resfile, allow_pickle=True)
    config = dotdict(objs["config"].reshape(-1)[0])
    results = dotdict(objs["results"].reshape(-1)[0])
    data = results.data
    for comb in combinations(np.arange(10), 2):
        (score, _) = data[comb]
        scores.append(score[0])
        cs.append(score[1]['svc__C'])

df_lin = pd.DataFrame({'score':scores, 'C':cs})

Then, with the same loop as above, we take the second element of the result tuple to save the score and the $\gamma$ and $C$ values for the RBF SVM.

In [7]:
scores = []
gammas = []
cs = []
for resfile in glob.glob('results/'+model+'*.npz'):
    # Object arrays are pickled, so allow_pickle=True is needed on NumPy >= 1.16.3
    objs = np.load(resfile, allow_pickle=True)
    config = dotdict(objs["config"].reshape(-1)[0])
    results = dotdict(objs["results"].reshape(-1)[0])
    data = results.data
    for comb in combinations(np.arange(10), 2):
        (_, score) = data[comb]
        scores.append(score[0])
        gammas.append(score[1]['svc__gamma'])
        cs.append(score[1]['svc__C'])

df_rbf = pd.DataFrame({'score':scores, 'gamma':gammas, 'C':cs})
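Since both loops walk over the same files and digit pairs, they could also be merged into a single pass. The sketch below shows the idea on a made-up stand-in for one subject's `results.data`; `demo_data` and its values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in: (i, j) -> ((linear score, params), (RBF score, params)).
demo_data = {
    comb: ((0.70, {"svc__C": 1e-6}),
           (0.68, {"svc__C": 100.0, "svc__gamma": 5e-9}))
    for comb in combinations(range(10), 2)
}

lin_rows, rbf_rows = [], []
for comb in combinations(range(10), 2):
    # Unpack both grid-search results of this digit pair at once.
    (lin_score, lin_params), (rbf_score, rbf_params) = demo_data[comb]
    lin_rows.append({"score": lin_score, "C": lin_params["svc__C"]})
    rbf_rows.append({"score": rbf_score,
                     "gamma": rbf_params["svc__gamma"],
                     "C": rbf_params["svc__C"]})

df_lin_demo = pd.DataFrame(lin_rows)
df_rbf_demo = pd.DataFrame(rbf_rows)
print(len(df_lin_demo), len(df_rbf_demo))
```

Using named tuple elements also avoids rebinding `_`, which IPython uses for the last cell output.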

Now we can look at each data frame, `df_lin` and `df_rbf`, to identify the best estimators.
First, the linear SVM, keeping only values of `C` that were selected more than half as often as the most frequently selected value.

In [8]:
dfg = df_lin.groupby(['C']).agg({'score': ['count', 'mean', 'median', 'std', 'min', 'max']})

In [9]:
threshold = max(dfg[('score', 'count')])/2
#threshold = 0
dftmp = dfg[dfg[('score', 'count')] > threshold]
dftmp.sort_values(by=('score','mean'), ascending=False)

| C | score count | score mean | score median | score std | score min | score max |
|---|---|---|---|---|---|---|
| 1.274275e-06 | 526 | 0.711809 | 0.72 | 0.098427 | 0.321429 | 0.947368 |
| 4.281332e-06 | 294 | 0.7027 | 0.707505 | 0.105202 | 0.36 | 0.925926 |
| 3.79269e-07 | 344 | 0.6992 | 0.702997 | 0.10446 | 0.407407 | 0.941176 |


In [25]:
#print(dftmp.sort_values(by=('score','mean'), ascending=False).to_latex())

Second, we'll look at the best RBF parameters, using the same count threshold (more than half the count of the most frequently selected combination).
This is the first time that thresholding is really handy, because there are potentially many $\gamma$-$C$ combinations.
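The group, threshold and sort steps repeat for every model, so they could be wrapped in a small helper. This is a sketch under the assumption that the threshold is always a fraction of the largest count; `summarize`, `STATS` and the toy data are illustrative names and values, not part of the original notebook.

```python
import pandas as pd

STATS = ["count", "mean", "median", "std", "min", "max"]


def summarize(df, keys, min_fraction=0.5):
    """Score statistics per parameter set, keeping only frequent winners."""
    stats = df.groupby(keys).agg({"score": STATS})
    counts = stats[("score", "count")]
    # Keep sets that won more often than min_fraction of the top count.
    kept = stats[counts > min_fraction * counts.max()]
    return kept.sort_values(by=("score", "mean"), ascending=False)


toy = pd.DataFrame({
    "score": [0.7, 0.6, 0.8, 0.9, 0.6, 0.4],
    "gamma": [5e-9] * 6,
    "C": [10.0, 10.0, 10.0, 100.0, 100.0, 1000.0],
})

summary = summarize(toy, ["gamma", "C"])
print(summary)
```

With `min_fraction=0` every parameter set that won at least once is kept, which corresponds to `threshold = 0` in the cells of this notebook.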

In [12]:
dfg = df_rbf.groupby(['gamma','C']).agg({'score': ['count', 'mean', 'median', 'std', 'min', 'max']})

In [13]:
threshold = max(dfg[('score', 'count')])/2
#threshold = 0
dftmp = dfg[dfg[('score', 'count')] > threshold]
dftmp.sort_values(by=('score','mean'), ascending=False)

| gamma | C | score count | score mean | score median | score std | score min | score max |
|---|---|---|---|---|---|---|---|
| 5e-09 | 37.926902 | 313 | 0.689934 | 0.694915 | 0.10859 | 0.407407 | 0.965517 |
| 5e-09 | 127.427499 | 503 | 0.68984 | 0.689655 | 0.092497 | 0.458333 | 0.962963 |
| 5e-09 | 428.13324 | 275 | 0.678704 | 0.682171 | 0.105359 | 0.321429 | 0.931298 |


In [24]:
#print(dftmp.sort_values(by=('score','mean'), ascending=False).to_latex())

---
## 2.3 CSP

In [16]:
model = 'cspflatlda_ss4_201_700_fft_None_80_40bins_nopower'

comps = []; scores = []; shrinks = []

for resfile in glob.glob('results/'+model+'*.npz'):
    # Object arrays are pickled, so allow_pickle=True is needed on NumPy >= 1.16.3
    objs = np.load(resfile, allow_pickle=True)
    config = dotdict(objs["config"].reshape(-1)[0])
    results = dotdict(objs["results"].reshape(-1)[0])
    data = results.data
    for comb in combinations(np.arange(10), 2):
        score = data[comb][0]
        scores.append(score[0])
        comps.append(score[1]['csp__n_components'])
        shrinks.append(score[1]['lda__shrinkage'])

dfcsp = pd.DataFrame({'score':scores, 'components':comps, 'shrinkage':shrinks})

In [17]:
dfg = dfcsp.groupby(['shrinkage','components']).agg({'score': ['count', 'mean', 'median', 'std', 'min', 'max'] })

In [18]:
#threshold = max(dfg[('score', 'count')])/2
threshold = 0  # keep every parameter combination; counts per combination are small here
dftmp = dfg[dfg[('score','count')] > threshold]
dftmp.sort_values(by=('score','mean'), ascending=False)

| shrinkage | components | score count | score mean | score median | score std | score min | score max |
|---|---|---|---|---|---|---|---|
| 0.270526315789 | 2 | 8 | 0.818002 | 0.830239 | 0.093608 | 0.651163 | 0.925373 |
| 0.583157894737 | 4 | 3 | 0.794462 | 0.821429 | 0.092255 | 0.691729 | 0.870229 |
| 0.374736842105 | 2 | 2 | 0.784188 | 0.784188 | 0.053788 | 0.746154 | 0.822222 |
| 0.270526315789 | 3 | 6 | 0.736283 | 0.695973 | 0.121917 | 0.592593 | 0.923077 |
| 0.635263157895 | 4 | 8 | 0.711846 | 0.720506 | 0.123421 | 0.511811 | 0.923077 |
| 0.531052631579 | 3 | 8 | 0.699045 | 0.699861 | 0.130354 | 0.481481 | 0.888889 |
| 0.791578947368 | 4 | 3 | 0.694103 | 0.692308 | 0.055022 | 0.640000 | 0.750000 |
| 0.426842105263 | 2 | 4 | 0.693240 | 0.677976 | 0.149881 | 0.547445 | 0.869565 |
| 0.166315789474 | 3 | 4 | 0.692644 | 0.736165 | 0.133191 | 0.500000 | 0.798246 |
| 0.426842105263 | 5 | 24 | 0.686706 | 0.681391 | 0.116309 | 0.440000 | 0.887218 |


In [23]:
#print(dftmp.sort_values(by=('score','mean'), ascending=False).to_latex())

---
## 2.4 KNN

In [26]:
#model = 'knn_ss10_201_600_raw'  # alternative model, overridden by the assignment below
model = 'knn_ss4_201_700_fft_None_80_40bins_nopower'

scores = []; ks = []

for resfile in glob.glob('results/'+model+'*.npz'):
    # Object arrays are pickled, so allow_pickle=True is needed on NumPy >= 1.16.3
    objs = np.load(resfile, allow_pickle=True)
    config = dotdict(objs["config"].reshape(-1)[0])
    results = dotdict(objs["results"].reshape(-1)[0])
    data = results.data
    for comb in combinations(np.arange(10), 2):
        score = data[comb][0]
        scores.append(score[0])
        ks.append(score[1]['knn__n_neighbors'])

df_knn = pd.DataFrame({'score':scores, 'k':ks})

In [27]:
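# Aggregate first: a threshold taken before this line would use the stale CSP `dfg`.
dfg = df_knn.groupby(['k']).agg({'score':  ['count', 'mean', 'median', 'std', 'min', 'max']})
#threshold = max(dfg[('score', 'count')])/2
threshold = 0  # the output below covers all 1440 results, so no k is filtered out
dftmp = dfg[dfg[('score', 'count')] > threshold]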
dftmp.sort_values(by=('score','mean'), ascending=False)

| k | score count | score mean | score median | score std | score min | score max |
|---|---|---|---|---|---|---|
| 13 | 254 | 0.577781 | 0.572039 | 0.083906 | 0.259259 | 0.821429 |
| 14 | 228 | 0.571825 | 0.576091 | 0.084505 | 0.26087 | 0.794118 |
| 12 | 104 | 0.571661 | 0.571429 | 0.087335 | 0.344828 | 0.884615 |
| 11 | 186 | 0.56981 | 0.566553 | 0.072728 | 0.346154 | 0.75 |
| 10 | 85 | 0.566837 | 0.571429 | 0.092984 | 0.37037 | 0.88 |
| 8 | 84 | 0.560006 | 0.545367 | 0.106202 | 0.307692 | 0.857143 |
| 7 | 147 | 0.55891 | 0.56 | 0.090985 | 0.24 | 0.777778 |
| 9 | 153 | 0.553672 | 0.555556 | 0.099214 | 0.241379 | 0.851852 |
| 5 | 127 | 0.544978 | 0.545455 | 0.080761 | 0.344828 | 0.777778 |
| 6 | 72 | 0.536363 | 0.539304 | 0.080111 | 0.25 | 0.714286 |


In [28]:
#print(dftmp.sort_values(by=('score','mean'), ascending=False).to_latex())