# HOLD-OUT

<br>

## Introduction

<br>
Before we dive deeper into the details of the hold-out evaluation method, let’s take a look at a visual summary of the whole process:


<img src="images/hold-out-summary.png" alt="hold-out" width="60%" height="60%">

## Data Splitting

<br>
First, we <b>randomly split our data into two subsets: a training set and a test set</b>. Setting test data aside is a workaround for the imperfections of a non-ideal world, such as limited data and resources, and the inability to collect more data from the generating distribution.

<br>
<b>The test set represents new, unseen data to our learning algorithm; it’s important that we only touch the test set once to make sure we don’t introduce any bias</b> when we estimate the generalization accuracy. A frequent choice is to assign 2/3 of the data to the training set and 1/3 to the test set; other common train/test ratios are 60/40, 70/30, 80/20, or even 90/10.
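As a minimal sketch of the split described above, the following uses scikit-learn's `train_test_split` on the Iris data. The 2/3 - 1/3 ratio and the fixed `random_state` are illustrative choices, not requirements of the method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the 150-sample Iris data as a feature matrix X and a label vector y.
X, y = load_iris(return_X_y=True)

# Randomly hold out 1/3 of the samples as the test set (2/3 for training).
# random_state is fixed only to make the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, shuffle=True, random_state=1
)

print("train size:", len(X_train))
print("test size: ", len(X_test))
```

Changing `test_size` to 0.4, 0.3, 0.2 or 0.1 gives the other ratios mentioned above.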


<img src="images/hold-out-1.png" alt="hold-out" width="60%" height="60%">

## Model Setup

<br>
After having set the test samples aside, we will <b>choose a learning algorithm</b> that we think could be appropriate for the given problem. 

<br>
<b>Hyper-parameters</b> (also called meta-parameters) <b>are the parameters of the learning algorithm itself; in contrast to the actual model parameters, the learning algorithm does not learn the hyper-parameters from the training data</b>, so we have to specify these values manually.

Since hyper-parameters cannot be learned during model fitting, we need some sort of "external loop" to optimize them separately. We will see that the hold-out method is ill-suited for this task, so, at least for the moment, we have to start with some fixed values (we could use our intuition, or the default parameters if we are using an existing library).

Once the learning algorithm has been configured into a model, we can <b>fit the latter on the training set</b>.
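The difference between a hyper-parameter and what is learned from the data can be sketched with a k-nearest-neighbors classifier. The value `n_neighbors=5` is just a fixed starting value chosen by hand, and the 80/20 split is an assumption made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# n_neighbors is a hyper-parameter: we set it by hand before fitting.
model = KNeighborsClassifier(n_neighbors=5)
print("before fit:", model.get_params()["n_neighbors"])

# fit() only uses the training set; it does not change the hyper-parameter.
model.fit(X_train, y_train)
print("after fit: ", model.get_params()["n_neighbors"])
```

The test set is not involved at any point in this step.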

<img src="images/hold-out-2.png" alt="hold-out" width="60%" height="60%">

## Performance Evaluation

<br>
Since our model hasn’t seen the test set before, it should give us a fairly unbiased estimate of its performance on new data, that is, of its ability to generalize. We use the model to predict the class labels for the test set; <b>the predicted labels (of the test set) will be compared with the correct ones in order to estimate the generalization accuracy of our model</b>.
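To make the comparison concrete, here is a small sketch with made-up labels (the two arrays are hypothetical, not results from this notebook). Accuracy is simply the fraction of test samples whose predicted label matches the correct one, and the error is its complement.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for a 10-sample test set.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 0, 1, 2, 1, 2, 2, 1, 2, 0])

# Accuracy is the fraction of test samples predicted correctly.
acc_manual = np.mean(y_true == y_pred)

# accuracy_score computes the same quantity.
acc_sklearn = accuracy_score(y_true, y_pred)

print("accuracy:", acc_manual)
print("error:   ", 1 - acc_manual)
```

Here 8 of the 10 predictions are correct, so the accuracy is 0.8.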

<br>
Assuming that the algorithm could learn a better model from more data, the test set we withheld from fitting (so that we could obtain a less optimistic estimate of the generalization performance) represents valuable data which the algorithm hasn't seen yet. <b>If our model has not reached its capacity, our performance estimate will be pessimistically biased</b>.


<img src="images/hold-out-3.png" alt="hold-out" width="60%" height="60%">

## Final Model

<br>
Now that we have an estimate of how well our model performs on unseen data, <b>there is no reason for withholding the test set from the algorithm any longer</b>.

Since we assume our samples to be IID, there is no reason to assume the model would perform worse after feeding it all the available data. <b>Generally, the model will have a better generalization performance if the algorithm uses more data</b>, given that it hasn’t reached its capacity yet.
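A minimal sketch of this last step, assuming the same kind of classifier and split used elsewhere in this notebook: we keep the hold-out estimate, then refit an unfitted copy of the model (obtained with scikit-learn's `clone`) on all the available data.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit on the training set and estimate the generalization accuracy once.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
holdout_acc = model.score(X_test, y_test)
print("hold-out accuracy estimate:", holdout_acc)

# Final model: same hyper-parameters, refitted on training + test data.
final_model = clone(model).fit(X, y)
```

Note that the hold-out estimate was computed for the model fitted on the training set only; we report it as a (possibly pessimistic) estimate for the final model.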

<img src="images/hold-out-4.png" alt="hold-out" width="60%" height="60%">

In [1]:
# SETUP : importing

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")

#import sklearn.linear_model as lm
import sklearn.neighbors as nbr
import sklearn.metrics as mtr

import utilcompute as uc
from pprint import pprint


In [2]:
# SETUP : reading in the datasets

iris = load_iris()  # load the dataset once and reuse it
data = np.column_stack( (iris.data, iris.target) )
df = pd.DataFrame(data)
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']

#print('df.shape[0] : ', df.shape[0])


In [3]:
#df.describe()

In [4]:
# DATA PREPROCESSING : deleting features

to_delete = []
cols = [c for c in df.columns.values.tolist() if (c not in to_delete)]
df = df[cols]

#print('columns : ', df.columns.values.tolist())

In [5]:
target = 'species'
if (isinstance(target, list)):
    features = [c for c in df.columns.values.tolist() if (c not in target)]
else:
    features = [c for c in df.columns.values.tolist() if (c != target)]

#print('features : ', features)
#print('target   : ', target)

In [6]:
# DATA PREPROCESSING : features standardization

vif_dict = uc.compute_vif(df = df, features = features)
print('df : ')
print()
pprint(vif_dict)

print()

df_std = uc.standardize(df = df, included = features, excluded = target)

vif_dict = uc.compute_vif(df = df_std, features = features)
print('df_std : ')
print()
pprint(vif_dict)


df : 

{'petal length': 173.96896536339727,
 'petal width': 55.48868864572551,
 'sepal length': 264.7457109493044,
 'sepal width': 97.111605833803296}

df_std : 

{'petal length': 31.397291650719751,
 'petal width': 16.141563956997683,
 'sepal length': 7.1031134428332869,
 'sepal width': 2.0990386257420881}


In [7]:
# DATA PREPROCESSING : vif subset selection [reduces multicollinearity]

VIF = False

if (VIF):
    selected_features = uc.vif_best_subset_selection(
        vif_threshold = 5, 
        df = df_std, 
        features = features, 
        level = len(features), 
        debug = False
    )
    t = uc.concatenate(features, target)
    df_std = df_std[t]
    
    vif_dict = uc.compute_vif(df = df_std, features = selected_features)
    pprint(vif_dict)
else:
    selected_features = features


In [8]:
# DATA PREPROCESSING : final setup

df = df_std
features = selected_features

print(df_std.columns.values)

['petal length' 'petal width' 'sepal length' 'sepal width' 'species']


In [9]:
# GLOBAL PARAMETERS 

train_perc = 0.8
delimiter = int(len(df) * train_perc)
s = 1

print('train set size   : ', delimiter)
print('test  set size   : ', (len(df) - delimiter))
print()
print('seed : ', s)


train set size   :  120
test  set size   :  30

seed :  1


In [10]:
np.random.seed(s)
df_shuffled = df.reindex(np.random.permutation(df.index))    
    
train = df_shuffled[:delimiter]
test = df_shuffled[delimiter:]
    
#    print('df_shuffled indices : {0} ... {1}'.format(df_shuffled.index.values[:3], df_shuffled.index.values[-3:]))     
#    print('train set/fold size : ', len(train))
#    print('test  set/fold size : ', len(test))
#    print()
#    print('train : {0} - {1}'.format(train.index.values.tolist()[:3], train.index.values[-3:]))
#    print('test  : {0} - {1}'.format(test.index.values.tolist()[:3], test.index.values[-3:]))
#    print()
    
#model = lm.LogisticRegression()
model = nbr.KNeighborsClassifier(n_neighbors = 5)
model.fit(train[features], train[target])

y_pred_train = model.predict(train[features])
y_pred_test = model.predict(test[features])

#metrics_train = uc.compute_classification_metrics(y = train[target], y_pred = y_pred_train)
#metrics_test = uc.compute_classification_metrics(y = test[target], y_pred = y_pred_test)

acc_train = mtr.accuracy_score(y_true = train[target], y_pred = y_pred_train, normalize = True, sample_weight = None)
acc_test = mtr.accuracy_score(y_true = test[target], y_pred = y_pred_test, normalize = True, sample_weight = None)

#score_train = 1 - metrics_train['ACCURACY'] 
#score_test = 1 - metrics_test['ACCURACY'] 

score_train = 1 - acc_train
score_test = 1 - acc_test
    

In [11]:
print()
print('train | err : ', score_train)
print('test  | err : ', score_test)


train | err :  0.0333333333333
test  | err :  0.0666666666667


## Considerations

<br>
<b>The expected prediction error estimated through resubstitution is not a reliable approximation of the actual value</b>,
since it introduces a very <b>optimistic bias due to overfitting</b>.

<br>
<b>Hold-out evaluation is a better alternative</b> to resubstitution evaluation.
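The optimistic bias of resubstitution can be sketched with a 1-nearest-neighbor classifier, chosen here only because it memorizes the training set. The split ratio and the seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# With k = 1 every training point is its own nearest neighbor
# (or is tied with an identical duplicate), so the training set is memorized.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Resubstitution: evaluate on the same data used for fitting.
resub_error = 1 - model.score(X_train, y_train)

# Hold-out: evaluate on data the model has never seen.
holdout_error = 1 - model.score(X_test, y_test)

print("resubstitution error:", resub_error)
print("hold-out error:      ", holdout_error)
```

A resubstitution error of zero says nothing about how the model behaves on new data; only the hold-out error is an estimate of that.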

<br>
<b>Further questions/issues</b> :

<br>
<ul style="list-style-type:square">
    <li>
        the distribution of data points which happen to fall in the training or test set; in other words, how the split is
        performed
    </li>
    <br>
    <li>
        some data points may never appear in the training (or test) set
    </li>
    <br>
    <li>
        the size we choose for the test set 
    </li>
</ul>

<br>
An advanced version of the hold-out method uses <b>'stratification'</b> to make sure the target variable is represented consistently in both subsets (in the case of classification, each class appears with approximately equal proportions in the training and test sets).
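A stratified split can be sketched with the `stratify` argument of scikit-learn's `train_test_split`. The 80/20 ratio and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for a split that preserves the class proportions of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Iris has 50 samples per class, so both subsets stay balanced.
print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Without `stratify`, a purely random split of a small dataset can leave one class over- or under-represented in the test set.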

<br>
<b>Note</b> : see Repeated Hold-Out

In [13]:
#     - the choice of the size for the test set has a lot of influence :
#       withholding a large portion of the data as a test set may lead to pessimistically biased estimates;
#       reducing the size of the test set may decrease this pessimistic bias, 
#       but the variance of our performance estimates will most likely increase. 

#     - the more we reduce the size of the test set, the closer we get to the resubstitution method :
#       the estimate becomes progressively more optimistic, but its variance grows;
#       the more we increase it (up to a point), the more pessimistic the estimate becomes,
#       but its variance shrinks

#     - we should find a size for the test set that allows us to have reasonable values 
#       for both the estimate of the expected prediction error and its variance, 
#       without decreasing the size of the training set too much.

#     - choosing a good test size will not protect us from the effects
#       of how the data points happen to be distributed between training and test set;
#       the data in the test set (and therefore also in the training set) should change across iterations
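
The influence of the test set size noted above can be explored with a rough sketch like the following. It repeats the random split with different seeds for a few test sizes and reports the mean and standard deviation of the hold-out error. The test sizes, the number of repetitions and the classifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = {}
for test_size in (0.1, 0.3, 0.5):
    errors = []
    # Repeat the hold-out split with a different random seed each time.
    for seed in range(50):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        errors.append(1 - model.score(X_test, y_test))
    # Mean and spread of the error estimate across the repeated splits.
    results[test_size] = (np.mean(errors), np.std(errors))

for test_size, (mean_err, std_err) in results.items():
    print(f"test_size={test_size:.1f}  mean error={mean_err:.3f}  std={std_err:.3f}")
```

The spread across seeds shows how much a single hold-out estimate depends on which points happen to fall in the test set.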

## References

<br>
<ul style="list-style-type:square">
    <li>
        Sebastian Raschka - Model evaluation, model selection, and algorithm selection in machine learning - Part I <br>
        https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
    </li>
</ul>
