# Modeling exercise

## General Instructions

* Submission date: 25.4.2022
* Submission Method: Link to your solution notebook in [this sheet](https://docs.google.com/spreadsheets/d/1fTmjiVxzw_rM1hdh16enwUTtxzlHSJIiw41dJS2LKp0/edit?usp=sharing).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys; sys.path.append('../Modles and Modeling/src')
import numpy as np
import plotly_express as px

In [3]:
import pandas as pd
import numpy as np
import ipywidgets as widgets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

In [4]:
from datasets import make_circles_dataframe, make_moons_dataframe


## Fitting and Overfiting 

The goal of the following exercise is to:
* Observe overfitting due to insuffient data
* Observe Overfitting due to overly complex model
* Identify the overfitting point by looking at Train vs Test error dynamic
* Observe how noise levels effect the needed data samples and model capacity

To do so, you'll code an experiment in the first part, and analyze the experiment result in the second part.

### Building an experiment

Code:

1. Create data of size N with noise level of magnitude NL from datasets DS_NAME. 
1. Split it to training and validation data (no need for test set), use 80%-20%. 
1. Use Logistic regression and Choose one complex model of your choice: [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [SVM with RBF kernel](https://scikit-learn.org/stable/modules/svm.html) with different `gamma` values or [Random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with differnt number of `min_samples_split`. 
1. Train on the train set for different hyper parameter values. compute:
   1. Classification accuracy on the training set (TRE)
   1. Classification accuracy on the validation set (TESTE)
   1. The difference beteen the two above (E_DIFF)
1. Save DS_NAME, N, NL, CLF_NAME, K, TRE, TESTE, E_DIFF and the regularization/hyper param (K, gamma or min_samples_split and regularization value for the linear regression classifier)

Repeat for:
* DS_NAME in Moons, Circles
* N (number of samples) in [5, 10, 50, 100, 1000, 10000]
* NL (noise level) in [0, 0.1, 0.2, 0.3, 0.4, 0.5]
* For the complex model: 10 Values of hyper parameter of the complex model you've chosen.
* For the linear model: 5 values of ridge (l2) regularization - [0.001, 0.01, 0.1, 1, 10, 100, 1000]

### Analysing the expermient results

## Tips and Hints

For buliding the experiment:

* Start with one dataframe holding all the data for both datastes with different noise level. Use the `make_<dataset_name>_dataframe()` functions below, and add two columns, dataset_name and noise_level, before appending the new dataset to the rest of the datasets. Use `df = pd.DataFrame()` to start with an empty dataframe and using a loop, add data to it using `df = df.append(<the needed df here>)`. Verify that you have 10k samples for each dataset type and noise level by a proper `.value_counts()`. You can modify the 
* When you'll need an N samples data with a specific noise level, use `query()` and `head(n)` to get the needed dataset. 
* Use sklearn `train_test_split()` method to split the data with `test_size` and `random_state` parameters set correctly to ensure you are always splitting the data the same why for a given fold `k`. Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) if needed. 
* You can also not create your own data splitter, and instead use `model_selection.cross_validate()` from sklearn. You'll need to ask for the train erros as well as the test errors, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).
* Use prints in proper location to ensure the progress of the experiment. 

**If you get stuck, and need refernce, scroll to the end of the notebook to see more hints!**

## Moons dataset

In [5]:
from sklearn.datasets import make_moons


In [6]:
moons_df = make_moons_dataframe(n_samples=1000, noise_level=0.1)
moons_df.head()

Unnamed: 0,x,y,label
0,0.522253,0.887367,A
1,-0.797071,0.897803,A
2,-0.523298,0.861566,A
3,0.874433,0.510847,A
4,-0.900683,0.261604,A


In [7]:
@widgets.interact
def plot_noisy_moons(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    moons_df = make_moons_dataframe(n_samples=1000, noise_level=noise_level)
    return px.scatter(moons_df, x='x', y='y', color = 'label')

interactive(children=(FloatSlider(value=0.0, description='noise_level', max=0.5, step=0.05), Output()), _dom_c…

## Circles Dataset

In [8]:
circles_df = make_circles_dataframe(n_samples=500, noise_level=0)
moons_df.head()

Unnamed: 0,x,y,label
0,0.522253,0.887367,A
1,-0.797071,0.897803,A
2,-0.523298,0.861566,A
3,0.874433,0.510847,A
4,-0.900683,0.261604,A


In [11]:
@widgets.interact
def plot_noisy_circles(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    df = make_circles_dataframe(1000, noise_level)
    return px.scatter(df, x='x', y='y', color = 'label')

interactive(children=(FloatSlider(value=0.0, description='noise_level', max=0.5, step=0.05), Output()), _dom_c…

## Appendix

### More hints!

If you'll build the datasets dataframe correctly, you'll have **one** dataframe that has dataset_name and noise_level colmuns, as well as the regular x,y,label colmns. To unsure you've appended everything correctly, groupby the proper colmuns and look at the size:

In [10]:
# Use proper groupby statement to ensure the datasets dataframe contains data as expected. You should see the following result:

Your 

You experiment code should look something like that:

In [12]:
def make_modeling_exercise_dataframe():
    df = pd.DataFrame()
    noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
    for x in noise_levels:
        df_circles = make_circles_dataframe(n_samples=10000, noise_level=x) 
        df_circles['dataset_name'] = 'circles'
        df_circles['noise_level'] =  x
        df = pd.concat([df, df_circles])    
        df_moons = make_moons_dataframe(n_samples=10000, noise_level=x)
        df_moons['dataset_name'] = 'moons'
        df_moons['noise_level'] =  x
        df = pd.concat([df, df_moons]) 
    return df
        
df = make_modeling_exercise_dataframe()    

In [13]:
df.groupby(['dataset_name', 'noise_level']).size()


dataset_name  noise_level
circles       0.0            10000
              0.1            10000
              0.2            10000
              0.3            10000
              0.4            10000
              0.5            10000
moons         0.0            10000
              0.1            10000
              0.2            10000
              0.3            10000
              0.4            10000
              0.5            10000
dtype: int64

I have struggled to much on the dataset generation part, therefore i have copied from another student's work

In [14]:
from sklearn.datasets import make_circles
from sklearn.datasets import make_moons

In [15]:
def make_circles_dataframe(n_samples, noise_level):
    points, label = make_circles(n_samples=n_samples, noise=noise_level)
    circles_df = pd.DataFrame(points, columns=['x','y'])
    circles_df['label'] = label
    circles_df.label = circles_df.label.map({0:'A', 1:'B'})
    return circles_df

In [16]:
def make_moons_dataframe(n_samples, noise_level):
    points, label = make_moons(n_samples=n_samples, noise=noise_level)
    moons_df = pd.DataFrame(points, columns=['x','y'])
    moons_df['label'] = label
    moons_df.label = moons_df.label.map({0:'A', 1:'B'})
    return moons_df


In [17]:
fulldataset = pd.DataFrame()
N = [5,10,50,100,1000,10000]
Noise = [x/10 for x in range(0,6,1)]
for i in N:
    for j in Noise:
        circles_df = make_circles_dataframe(n_samples=i, noise_level=j)
        noise_df = make_moons_dataframe(n_samples=i, noise_level=j)
        fulldataset = fulldataset.append([circles_df,noise_df])
fulldataset

Unnamed: 0,x,y,label
0,1.000000,0.000000e+00,A
1,-0.400000,6.928203e-01,B
2,0.800000,0.000000e+00,B
3,-1.000000,1.224647e-16,A
4,-0.400000,-6.928203e-01,B
...,...,...,...
9995,2.278680,-6.989045e-01,B
9996,-1.756340,-2.119362e-01,A
9997,0.695706,1.376782e+00,A
9998,-0.166062,1.097600e+00,B


In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [27]:
dataset_type = ['Moons','Circles']
fulldataset = pd.DataFrame()
N = [5,10,50,100,1000,10000]
class_type = ['log_reg','KNN']
KNN = [x for x in range(1,20,2)]
Noise = [x/10 for x in range(0,6,1)]
L2params = [0.01, 0.1, 1, 10, 100]
results = pd.DataFrame()
for i in N:
    for j in Noise:
        for ctype in dataset_type:
            if ctype == 'Moons':
                workdf = make_circles_dataframe(n_samples=i, noise_level=j)
            elif ctype == 'Circles':
                workdf = make_moons_dataframe(n_samples=i, noise_level=j)
            x_train,x_test,y_train,y_test= train_test_split(workdf[['x','y']] , workdf['label'] , test_size=0.2 , random_state=42)
            for cltype in class_type:
                if(cltype == 'log_reg'):
                    for pen in L2params:
                        logregfit = LogisticRegression(penalty = 'l2' , C = pen)
                        logregfit.fit(x_train,y_train)
                        y_pred = logregfit.predict(x_test)
                        train_acc = accuracy_score(logregfit.predict(x_train),y_train)
                        test_acc = accuracy_score(y_test,y_pred)
                        locres = pd.DataFrame({'Dataset' : [ctype] , 'Samples' : i , 'Noise' : j , 'Hyper_Parameter' : False , 'Regularization_Parameter' : pen , 'Class' : ['Logistic Regression'] , 'Train_Accuracy' : [train_acc] , 'Test_Accuracy' : [test_acc] })
                        results = results.append(locres)
                if(cltype == 'KNN'):
                    for k in KNN:
                        if(len(x_train) < k):
                            continue
                        knnfit = KNeighborsClassifier(n_neighbors=k)
                        knnfit.fit(x_train,y_train)
                        y_pred = knnfit.predict(x_test)
                        train_acc = accuracy_score(knnfit.predict(x_train),y_train)
                        test_acc = accuracy_score(y_test,y_pred)
                        locres = pd.DataFrame({'Dataset' : [ctype] , 'Samples' : i , 'Noise' : j , 'Hyper_Parameter' : k , 'Regularization_Parameter' : False , 'Class' : ['KNN'] , 'Train_Accuracy' : [train_acc] , 'Test_Accuracy' : [test_acc] })
                        results = results.append(locres)                

Working on circles


NameError: name 'datasets' is not defined

In [20]:
results['Accuracy_Difference'] = results['Train_Accuracy'] - results['Test_Accuracy'] 
results

Unnamed: 0,Dataset,Samples,Noise,Hyper_Parameter,Regularization_Parameter,Class,Train_Accuracy,Test_Accuracy,Accuracy_Difference
0,Moons,5,0.0,False,0.01,Logistic Regression,0.750000,0.0000,0.750000
0,Moons,5,0.0,False,0.10,Logistic Regression,0.750000,0.0000,0.750000
0,Moons,5,0.0,False,1.00,Logistic Regression,0.750000,0.0000,0.750000
0,Moons,5,0.0,False,10.00,Logistic Regression,1.000000,0.0000,1.000000
0,Moons,5,0.0,False,100.00,Logistic Regression,1.000000,0.0000,1.000000
...,...,...,...,...,...,...,...,...,...
0,Circles,10000,0.5,11,0.00,KNN,0.833375,0.8135,0.019875
0,Circles,10000,0.5,13,0.00,KNN,0.832500,0.8145,0.018000
0,Circles,10000,0.5,15,0.00,KNN,0.832500,0.8185,0.014000
0,Circles,10000,0.5,17,0.00,KNN,0.831125,0.8160,0.015125


In [21]:
results.to_csv('results.csv' , index=True)

In [22]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Question 1 - Manual Classification

In [23]:
moons = results.query("Dataset == 'Moons'")
px.scatter(moons,x='Test_Accuracy',y='Accuracy_Difference',color='Class' , size='Samples')

We can see that KNN is not at dense as LOGIT, therefore i think it has las chance to be an overfitting model

# Q2

In [24]:
circles = results.query("Dataset == 'Circles'")
px.scatter(circles,x='Test_Accuracy',y='Accuracy_Difference',color='Class', size='Samples')

In [22]:
moons.groupby(['Samples','Class'])['Accuracy_Difference'].mean()


Samples  Class              
5        KNN                    0.458333
         Logistic Regression    0.683333
10       KNN                    0.510417
         Logistic Regression    0.420833
50       KNN                    0.327917
         Logistic Regression    0.243333
100      KNN                    0.220417
         Logistic Regression    0.208333
1000     KNN                    0.097146
         Logistic Regression    0.044083
10000    KNN                    0.103369
         Logistic Regression    0.006933
Name: Accuracy_Difference, dtype: float64

In [23]:
circles.groupby(['Samples','Class'])['Accuracy_Difference'].mean()


Samples  Class              
5        KNN                    0.666667
         Logistic Regression    0.391667
10       KNN                    0.270833
         Logistic Regression    0.225000
50       KNN                    0.102917
         Logistic Regression    0.131667
100      KNN                    0.049167
         Logistic Regression    0.040417
1000     KNN                    0.014854
         Logistic Regression   -0.002542
10000    KNN                    0.026019
         Logistic Regression    0.000437
Name: Accuracy_Difference, dtype: float64

It seems that KNN values are more stable than logistic regression but that is due to small sampeles, on a large dataset i will choose the regression

# Q3

In [24]:
moons.query("Class == 'Logistic Regression'").groupby('Regularization_Parameter')['Test_Accuracy'].mean()


Regularization_Parameter
0.01      0.296569
0.10      0.295750
1.00      0.323222
10.00     0.324569
100.00    0.324569
Name: Test_Accuracy, dtype: float64

In [25]:
moons.query("Class == 'Logistic Regression'").groupby('Regularization_Parameter')['Accuracy_Difference'].mean()


Regularization_Parameter
0.01      0.277681
0.10      0.287052
1.00      0.254674
10.00     0.260358
100.00    0.259278
Name: Accuracy_Difference, dtype: float64

In [26]:
circles.query("Class == 'Logistic Regression'").groupby('Regularization_Parameter')['Test_Accuracy'].mean()


Regularization_Parameter
0.01      0.639778
0.10      0.684000
1.00      0.694208
10.00     0.715222
100.00    0.716611
Name: Test_Accuracy, dtype: float64

In [27]:
circles.query("Class == 'Logistic Regression'").groupby('Regularization_Parameter')['Accuracy_Difference'].mean()


Regularization_Parameter
0.01      0.139417
0.10      0.121069
1.00      0.138267
10.00     0.125458
100.00    0.131326
Name: Accuracy_Difference, dtype: float64

# Q4

In [28]:
plotdata = results.query("Noise == 0.3 & Dataset == 'Moons' & Class == 'KNN' & Hyper_Parameter == 3")
plotdata


fig = make_subplots(rows=1, cols=3 , subplot_titles=("N vs Train Accuracy", "N vs Test Accuracy", "N vs Accuracy Difference"))

# Add traces
fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Train_Accuracy'],
                    mode='markers',
                    name='Train') ,  row=1, col=1)

fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Test_Accuracy'],
                    mode='markers',
                    name='Test') ,  row=1, col=2)

fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Accuracy_Difference'],
                    mode='markers',
                    name='Difference') ,  row=1, col=3)

fig.update_xaxes(type="log", range=[0,5])
fig.update_yaxes(range=[0,1.2])

In [29]:
plotdata = results.query("Noise == 0.3 & Dataset == 'Circles' & Class == 'KNN' & Hyper_Parameter == 3")
plotdata


fig = make_subplots(rows=1, cols=3 , subplot_titles=("N vs Train Accuracy", "N vs Test Accuracy", "N vs Accuracy Difference"))

# Add traces
fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Train_Accuracy'],
                    mode='markers',
                    name='Train') ,  row=1, col=1)

fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Test_Accuracy'],
                    mode='markers',
                    name='Test') ,  row=1, col=2)

fig.add_trace(go.Scatter(x=plotdata['Samples'], y= plotdata['Accuracy_Difference'],
                    mode='markers',
                    name='Difference') ,  row=1, col=3)

fig.update_xaxes(type="log", range=[0,5])
fig.update_yaxes(range=[0,1.2])

# Q5

In [30]:
plotdata = results.query("Noise == 0.3 & Dataset == 'Moons' & Class == 'KNN' & Samples == 10000")
plotdata


fig = make_subplots(rows=1, cols=3 , subplot_titles=("Hyper Parameter vs Train Accuracy", "Hyper Parameter vs Test Accuracy", "Hyper Parameter vs Accuracy Difference"))

# Add traces
fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Train_Accuracy'],
                    mode='markers',
                    name='Train') ,  row=1, col=1)

fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Test_Accuracy'],
                    mode='markers',
                    name='Test') ,  row=1, col=2)

fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Accuracy_Difference'],
                    mode='markers',
                    name='Difference') ,  row=1, col=3)

fig.update_xaxes(range=[0,20])
fig.update_yaxes(range=[-0.1,1.2])

In [31]:
plotdata = results.query("Noise == 0.3 & Dataset == 'Circles' & Class == 'KNN' & Samples == 10000")
plotdata


fig = make_subplots(rows=1, cols=3 , subplot_titles=("Hyper Parameter vs Train Accuracy", "Hyper Parameter vs Test Accuracy", "Hyper Parameter vs Accuracy Difference"))

# Add traces
fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Train_Accuracy'],
                    mode='markers',
                    name='Train') ,  row=1, col=1)

fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Test_Accuracy'],
                    mode='markers',
                    name='Test') ,  row=1, col=2)

fig.add_trace(go.Scatter(x=plotdata['Hyper_Parameter'], y= plotdata['Accuracy_Difference'],
                    mode='markers',
                    name='Difference') ,  row=1, col=3)

fig.update_xaxes(range=[0,20])
fig.update_yaxes(range=[-0.1,1.2])

# Q6

In [32]:
circles.groupby(['Noise','Samples'])['Test_Accuracy'].mean()


Noise  Samples
0.0    5          0.000000
       10         0.500000
       50         0.733333
       100        0.920000
       1000       0.965333
       10000      0.964000
0.1    5          0.857143
       10         0.888889
       50         0.746667
       100        0.866667
       1000       0.953667
       10000      0.959267
0.2    5          0.857143
       10         0.000000
       50         0.846667
       100        0.943333
       1000       0.937667
       10000      0.930867
0.3    5          0.285714
       10         0.055556
       50         0.646667
       100        0.883333
       1000       0.872000
       10000      0.884433
0.4    5          0.000000
       10         0.888889
       50         0.786667
       100        0.636667
       1000       0.866000
       10000      0.839000
0.5    5          0.000000
       10         0.888889
       50         0.846667
       100        0.913333
       1000       0.805333
       10000      0.797433
Name: Test_Ac

In [33]:
moons.groupby(['Noise','Samples'])['Test_Accuracy'].mean()


Noise  Samples
0.0    5          0.000000
       10         0.111111
       50         0.293333
       100        0.563333
       1000       0.825333
       10000      0.828567
0.1    5          0.142857
       10         0.000000
       50         0.280000
       100        0.270000
       1000       0.677000
       10000      0.704533
0.2    5          0.142857
       10         0.055556
       50         0.373333
       100        0.440000
       1000       0.603333
       10000      0.589967
0.3    5          0.000000
       10         0.000000
       50         0.306667
       100        0.433333
       1000       0.534667
       10000      0.546833
0.4    5          0.142857
       10         0.000000
       50         0.233333
       100        0.456667
       1000       0.540333
       10000      0.541467
0.5    5          0.571429
       10         0.833333
       50         0.466667
       100        0.473333
       1000       0.547333
       10000      0.512433
Name: Test_Ac

# Q7

In [34]:
circles.query("Class == 'Logistic Regression'").groupby(['Regularization_Parameter'])['Test_Accuracy'].mean()


Regularization_Parameter
0.01      0.639778
0.10      0.684000
1.00      0.694208
10.00     0.715222
100.00    0.716611
Name: Test_Accuracy, dtype: float64

In [35]:
circles.query("Class == 'Logistic Regression' & Regularization_Parameter == 1.00").groupby(['Samples','Noise'])['Test_Accuracy'].mean()


Samples  Noise
5        0.0      0.0000
         0.1      1.0000
         0.2      1.0000
         0.3      0.0000
         0.4      0.0000
         0.5      0.0000
10       0.0      0.5000
         0.1      1.0000
         0.2      0.0000
         0.3      0.0000
         0.4      1.0000
         0.5      1.0000
50       0.0      0.6000
         0.1      0.6000
         0.2      0.9000
         0.3      0.6000
         0.4      0.7000
         0.5      0.9000
100      0.0      0.9000
         0.1      0.6500
         0.2      0.9000
         0.3      0.8500
         0.4      0.7000
         0.5      0.9500
1000     0.0      0.9050
         0.1      0.8700
         0.2      0.8750
         0.3      0.8150
         0.4      0.8450
         0.5      0.7950
10000    0.0      0.8985
         0.1      0.8835
         0.2      0.8600
         0.3      0.8520
         0.4      0.8405
         0.5      0.8020
Name: Test_Accuracy, dtype: float64

In [36]:
circles.query("Class == 'KNN'").groupby(['Hyper_Parameter'])['Test_Accuracy'].mean()


Hyper_Parameter
1     0.664208
3     0.738236
5     0.831800
7     0.834167
9     0.899688
11    0.888833
13    0.874083
15    0.871812
17    0.872792
19    0.864229
Name: Test_Accuracy, dtype: float64

In [37]:
circles.query("Class == 'KNN' & Hyper_Parameter == 5").groupby(['Samples','Noise'])['Test_Accuracy'].mean()


Samples  Noise
10       0.0      0.5000
         0.1      1.0000
         0.2      0.0000
         0.3      0.0000
         0.4      1.0000
         0.5      1.0000
50       0.0      1.0000
         0.1      0.9000
         0.2      0.9000
         0.3      0.6000
         0.4      0.8000
         0.5      0.8000
100      0.0      1.0000
         0.1      1.0000
         0.2      1.0000
         0.3      0.9000
         0.4      0.6000
         0.5      0.9000
1000     0.0      1.0000
         0.1      1.0000
         0.2      0.9700
         0.3      0.9150
         0.4      0.8700
         0.5      0.8000
10000    0.0      1.0000
         0.1      0.9990
         0.2      0.9670
         0.3      0.9005
         0.4      0.8360
         0.5      0.7965
Name: Test_Accuracy, dtype: float64

In [38]:
moons.query("Class == 'Logistic Regression'").groupby(['Regularization_Parameter'])['Test_Accuracy'].mean()


Regularization_Parameter
0.01      0.296569
0.10      0.295750
1.00      0.323222
10.00     0.324569
100.00    0.324569
Name: Test_Accuracy, dtype: float64

In [39]:
moons.query("Class == 'Logistic Regression' & Regularization_Parameter == 100.00").groupby(['Samples','Noise'])['Test_Accuracy'].mean()


Samples  Noise
5        0.0      0.0000
         0.1      0.0000
         0.2      0.0000
         0.3      0.0000
         0.4      0.0000
         0.5      1.0000
10       0.0      0.0000
         0.1      0.0000
         0.2      0.0000
         0.3      0.0000
         0.4      0.0000
         0.5      1.0000
50       0.0      0.3000
         0.1      0.2000
         0.2      0.3000
         0.3      0.2000
         0.4      0.2000
         0.5      0.4000
100      0.0      0.4000
         0.1      0.1500
         0.2      0.4500
         0.3      0.3000
         0.4      0.4000
         0.5      0.5500
1000     0.0      0.4800
         0.1      0.4700
         0.2      0.4450
         0.3      0.5250
         0.4      0.4700
         0.5      0.5250
10000    0.0      0.4865
         0.1      0.4925
         0.2      0.4355
         0.3      0.5025
         0.4      0.5175
         0.5      0.4850
Name: Test_Accuracy, dtype: float64

In [40]:
moons.query("Class == 'KNN'").groupby(['Hyper_Parameter'])['Test_Accuracy'].mean()


Hyper_Parameter
1     0.394000
3     0.549181
5     0.467233
7     0.476067
9     0.527875
11    0.531208
13    0.540687
15    0.559562
17    0.543896
19    0.548646
Name: Test_Accuracy, dtype: float64

In [41]:
circles.query("Class == 'KNN' & Hyper_Parameter == 19").groupby(['Samples','Noise'])['Test_Accuracy'].mean()


Samples  Noise
50       0.0      0.6000
         0.1      0.6000
         0.2      0.9000
         0.3      0.7000
         0.4      0.7000
         0.5      0.8000
100      0.0      0.9500
         0.1      0.8500
         0.2      0.9500
         0.3      0.9000
         0.4      0.7000
         0.5      0.9500
1000     0.0      1.0000
         0.1      1.0000
         0.2      0.9800
         0.3      0.9100
         0.4      0.9000
         0.5      0.8150
10000    0.0      1.0000
         0.1      0.9995
         0.2      0.9690
         0.3      0.9075
         0.4      0.8515
         0.5      0.8090
Name: Test_Accuracy, dtype: float64