# <font color='#28B463'>RESUBSTITUTION</font>

<br>

## <font color='#28B463'>Introduction</font>

<br>

The evaluation method known as resubstitution consists of fitting a model to a training set and then evaluating its predictions on that same set. We will see that resubstitution, although it is the simplest evaluation technique, yields an optimistically biased estimate of performance: a model that overfits is scored on the very data it was fitted to.
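
The definition can be written compactly. The notation below is introduced here for illustration only: $D = \{(x_i, y_i)\}_{i=1}^{n}$ is the training set and $\hat{f}_{D}$ is the model fitted on $D$. The resubstitution error is then the misclassification rate of $\hat{f}_{D}$ on $D$ itself:

$$
\widehat{\mathrm{err}}_{\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{f}_{D}(x_i) \neq y_i\right]
$$

For the iris run further down, $n = 150$, so the reported error of about 0.0467 corresponds to 7 misclassified flowers out of 150.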


In [1]:
# SETUP : importing

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")

#import sklearn.linear_model as lm
import sklearn.neighbors as nbr
import sklearn.metrics as mtr

import utilcompute as uc
from pprint import pprint


In [2]:
# SETUP : reading in the dataset

iris = load_iris()   # load once and reuse for both data and target
data = np.column_stack( (iris.data, iris.target) )
df = pd.DataFrame(data)
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']

#print('df.shape[0] : ', df.shape[0])


In [3]:
#df.describe()

In [4]:
# DATA PREPROCESSING : deleting features

to_delete = []
cols = [c for c in df.columns.values.tolist() if (c not in to_delete)]
df = df[cols]

#print('columns : ', df.columns.values.tolist())

In [5]:
target = 'species'
if (isinstance(target, list)):
    features = [c for c in df.columns.values.tolist() if (c not in target)]
else:
    features = [c for c in df.columns.values.tolist() if (c != target)]

#print('features : ', features)
#print('target   : ', target)

In [6]:
# DATA PREPROCESSING : features standardization

vif_dict = uc.compute_vif(df = df, features = features)
print('df : ')
print()
pprint(vif_dict)

print()

df_std = uc.standardize(df = df, included = features, excluded = target)

vif_dict = uc.compute_vif(df = df_std, features = features)
print('df_std : ')
print()
pprint(vif_dict)


df : 

{'petal length': 173.96896536339727,
 'petal width': 55.48868864572551,
 'sepal length': 264.7457109493044,
 'sepal width': 97.111605833803296}

df_std : 

{'petal length': 31.397291650719751,
 'petal width': 16.141563956997683,
 'sepal length': 7.1031134428332869,
 'sepal width': 2.0990386257420881}
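
The `uc.compute_vif` helper is not shown in this notebook, so the following is only a sketch of the textbook definition of the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The function name `vif` and the synthetic data are assumptions made for illustration. Because each auxiliary regression here includes an intercept, this version is unaffected by standardization, so it need not reproduce the raw-data figures printed above.

```python
import numpy as np

def vif(X):
    """Textbook VIF: regress each column on the others (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example (not the iris data): the third column is nearly
# the sum of the first two, so all three columns are strongly collinear.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.1 * rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

vif_raw = vif(X_demo)
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
vif_std = vif(X_demo_std)

print("raw          :", vif_raw)
print("standardized :", vif_std)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, which is the role `vif_threshold = 5` plays in the next cell.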


In [7]:
# DATA PREPROCESSING : vif subset selection [reduces multicollinearity]

VIF = False

if (VIF):
    selected_features = uc.vif_best_subset_selection(
        vif_threshold = 5, 
        df = df_std, 
        features = features, 
        level = len(features), 
        debug = False
    )
    t = uc.concatenate(selected_features, target)   # keep only the selected features plus the target
    df_std = df_std[t]
    
    vif_dict = uc.compute_vif(df = df_std, features = selected_features)
    pprint(vif_dict)
else:
    selected_features = features


In [8]:
# DATA PREPROCESSING : final setup

df = df_std
features = selected_features

print(df_std.columns.values)

['petal length' 'petal width' 'sepal length' 'sepal width' 'species']


In [9]:
# GLOBAL PARAMETERS

s = 1

print('seed : ', s)

seed :  1


In [10]:
np.random.seed(s)
df_shuffled = df.reindex(np.random.permutation(df.index))    
       
#model = lm.LogisticRegression()
model = nbr.KNeighborsClassifier(n_neighbors = 5)
model.fit(df_shuffled[features], df_shuffled[target])
y_pred = model.predict(df_shuffled[features])

#metrics = uc.compute_classification_metrics(y = df_shuffled[target], y_pred = y_pred)
acc = mtr.accuracy_score(y_true = df_shuffled[target], y_pred = y_pred, normalize = True, sample_weight = None)
    

In [11]:
print()
#print('err : ', 1 - metrics['ACCURACY'])
print('err : ', 1 - acc)


err :  0.0466666666667


## <font color='#28B463'>Considerations</font>

<br>
<b>From the resubstitution error alone, we cannot tell whether the model has simply memorized the training data or whether it can actually generalize to new, unseen data</b>.
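
To make the memorization point concrete, here is a minimal sketch using a hand-written nearest-neighbour classifier in NumPy. It is not the `KNeighborsClassifier` run above: the helper `knn_predict` and the synthetic data are assumptions made for illustration. The labels are drawn independently of the features, so there is nothing to learn, yet with k = 1 every training point is its own nearest neighbour and the resubstitution error is zero.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)

# Features and labels are independent: the labels are pure noise.
X = rng.normal(size=(60, 2))
y = rng.integers(0, 2, size=60)

# Resubstitution: predict on the same points the model was fitted to.
resub_err = np.mean(knn_predict(X, y, X, k=1) != y)

# Fresh points from the same process, which the model has never seen.
X_new = rng.normal(size=(200, 2))
y_new = rng.integers(0, 2, size=200)
fresh_err = np.mean(knn_predict(X, y, X_new, k=1) != y_new)

print("resubstitution error :", resub_err)
print("error on fresh data  :", fresh_err)
```

The resubstitution error is exactly zero here, while the error on fresh data sits near chance level, so a low resubstitution error by itself says little about generalization.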

<font color='#28B463'><b>Note</b></font> : see Hold-Out


## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square">
    <li>
        Sebastian Raschka - Model evaluation, model selection, and algorithm selection in machine learning - Part I <br>
        https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
    </li>
</ul>
