# <font color='#28B463'>STRATIFIED HOLD-OUT</font>

<br>

## <font color='#28B463'>Introduction</font>

<br>
A plain random split can produce training and test sets that are not representative of the full dataset, and the problem gets worse when the original dataset has a high class imbalance. In the worst case, the test set may not contain a single instance of a minority class. The common remedy is to split the dataset in a stratified fashion.

<br>
Stratification simply means that we randomly split the dataset so that each class appears in the resulting subsets in approximately the same proportion as in the full dataset.
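
To make the effect concrete, here is a minimal sketch on a toy label vector (95 samples of class 0 and 5 of class 1; the data is purely illustrative and not part of this notebook). It compares a plain random split with a stratified one using the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (illustrative only): 95 samples of class 0, 5 of class 1
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Plain random split: the number of minority samples in the test set depends on the seed
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=1)

# Stratified split: class proportions are preserved as closely as the subset sizes allow
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print('plain      | minority in test  : ', int((y_test_plain == 1).sum()))
print('stratified | minority in train : ', int((y_train_strat == 1).sum()))
print('stratified | minority in test  : ', int((y_test_strat == 1).sum()))
```

With the stratified split, the 5% minority share carries over to both subsets (4 of 80 training samples and 1 of 20 test samples), whereas the plain split gives no such guarantee.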


In [1]:
# SETUP : importing

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")

from sklearn.model_selection import train_test_split
#import sklearn.linear_model as lm
import sklearn.neighbors as nbr
import sklearn.metrics as mtr

import utilcompute as uc
from pprint import pprint


In [2]:
# SETUP : reading in the datasets

iris = load_iris()  # load the dataset once and reuse it
data = np.column_stack( (iris.data, iris.target) )
df = pd.DataFrame(data)
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']

#print('df.shape[0] : ', df.shape[0])


In [3]:
#df.describe()

In [4]:
# DATA PREPROCESSING : deleting features

to_delete = []
cols = [c for c in df.columns.values.tolist() if (c not in to_delete)]
df = df[cols]

#print('columns : ', df.columns.values.tolist())

In [5]:
target = 'species'
if (isinstance(target, list)):
    features = [c for c in df.columns.values.tolist() if (c not in target)]
else:
    features = [c for c in df.columns.values.tolist() if (c != target)]

#print('features : ', features)
#print('target   : ', target)

In [6]:
# DATA PREPROCESSING : features standardization

vif_dict = uc.compute_vif(df = df, features = features)
print('df : ')
print()
pprint(vif_dict)

print()

df_std = uc.standardize(df = df, included = features, excluded = target)

vif_dict = uc.compute_vif(df = df_std, features = features)
print('df_std : ')
print()
pprint(vif_dict)


df : 

{'petal length': 173.96896536339727,
 'petal width': 55.48868864572551,
 'sepal length': 264.7457109493044,
 'sepal width': 97.111605833803296}

df_std : 

{'petal length': 31.397291650719751,
 'petal width': 16.141563956997683,
 'sepal length': 7.1031134428332869,
 'sepal width': 2.0990386257420881}
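
The helper module `utilcompute` is not shown in this notebook, so the following is only a sketch of what `uc.standardize` and `uc.compute_vif` might do. It assumes z-score standardization and a VIF computed by regressing each feature on the others without an intercept; that assumption is consistent with the sharp drop in VIF after standardization seen above, but it is not confirmed by the source. The function names `standardize` and `vif_no_intercept` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
X = load_iris().data

def standardize(X):
    # z-score: subtract the column mean and divide by the column standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

def vif_no_intercept(X):
    # Hypothetical stand-in for uc.compute_vif (assumption, not the author's code).
    # VIF_j = 1 / (1 - R2_j), with R2_j the uncentered R^2 obtained by regressing
    # column j on the remaining columns without an intercept term.
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / (y @ y)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

vif_raw = vif_no_intercept(X)
vif_std = vif_no_intercept(standardize(X))

print('raw          : ', dict(zip(feature_names, vif_raw.round(2))))
print('standardized : ', dict(zip(feature_names, vif_std.round(2))))
```

Under this assumption the raw values are inflated by the non-zero feature means. After centering, the no-intercept regression coincides with the usual one, so the standardized values match the diagonal of the inverse correlation matrix.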


In [7]:
# DATA PREPROCESSING : vif subset selection [reduces multicollinearity]

VIF = False

if (VIF):
    selected_features = uc.vif_best_subset_selection(
        vif_threshold = 5, 
        df = df_std, 
        features = features, 
        level = len(features), 
        debug = False
    )
    # keep only the selected features (plus the target) in the standardized frame
    t = uc.concatenate(selected_features, target)
    df_std = df_std[t]
    
    vif_dict = uc.compute_vif(df = df_std, features = selected_features)
    pprint(vif_dict)
else:
    selected_features = features


In [8]:
# DATA PREPROCESSING : final setup

df = df_std
features = selected_features

print(df_std.columns.values)

['petal length' 'petal width' 'sepal length' 'sepal width' 'species']


In [9]:
# GLOBAL PARAMETERS 

train_perc = 0.8
delimiter = int(len(df) * train_perc)
random_state = 1

print('train set size   : ', delimiter)
print('test  set size   : ', (len(df) - delimiter))
print()
print('random_state : ', random_state)


train set size   :  120
test  set size   :  30

random_state :  1


In [10]:
# [TEST]

SHOW_EXAMPLE = False

if (SHOW_EXAMPLE):
    
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], 
        train_size = train_perc, 
        random_state = random_state, 
        stratify = df[target]
    )

    print('random state : ', random_state)
    print()
    print('X_train : {0} ... {1}'.format(X_train.index.values[:8], X_train.index.values[-8:]))
    print('y_train : {0} ... {1}'.format(y_train.index.values[:8], y_train.index.values[-8:]))
    print(y_train.value_counts(normalize = True, sort = True))

    print()
    print('X_test : {0} ... {1}'.format(X_test.index.values[:8], X_test.index.values[-8:]))
    print('y_test : {0} ... {1}'.format(y_test.index.values[:8], y_test.index.values[-8:]))
    print(y_test.value_counts(normalize = True, sort = True))
    print()
    print()

    fig, axs = plt.subplots(nrows = 1, ncols = 2 , figsize=(20, 4))  
    sns.countplot(y = y_train.values, ax = axs[0])
    sns.countplot(y = y_test.values, ax = axs[1])
    plt.show()

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    df[features], 
    df[target], 
    train_size = train_perc, 
    random_state = random_state, 
    stratify = df[target]
)

#model = lm.LogisticRegression()
model = nbr.KNeighborsClassifier(n_neighbors = 5)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

#metrics_train = uc.compute_classification_metrics(y = y_train, y_pred = y_pred_train)
#metrics_test = uc.compute_classification_metrics(y = y_test, y_pred = y_pred_test)

acc_train = mtr.accuracy_score(y_true = y_train, y_pred = y_pred_train, normalize = True, sample_weight = None)
acc_test = mtr.accuracy_score(y_true = y_test, y_pred = y_pred_test, normalize = True, sample_weight = None)

#score_train = 1 - metrics_train['ACCURACY'] 
#score_test = 1 - metrics_test['ACCURACY'] 

score_train = 1 - acc_train
score_test = 1 - acc_test
    

In [12]:
print()
print('train | err : ', score_train)
print('test  | err : ', score_test)



train | err :  0.0333333333333
test  | err :  0.0333333333333


## <font color='#28B463'>Considerations</font>

<br>
With regard to the distribution of data points in the training and test sets, <b>stratification ensures that each class of the target variable is represented consistently</b> (in approximately the same proportion) <b>in both subsets</b>.

<br>
<b>For classification problems, stratified hold-out validation is thus generally preferable to plain hold-out validation</b>, especially when the classes are imbalanced.

<br>
<font color='#28B463'><b>Further questions/issues</b></font> :

<br>
<ul style="list-style-type:square">
    <li>
        some data points may never appear in the training (or test) set (a limitation inherited from hold-out validation)
    </li>
</ul>

<br>
<font color='#28B463'><b>Note</b></font> : see Repeated Stratified Hold-Out
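
As a rough preview of the repeated variant mentioned in the note, the sketch below repeats the stratified 80/20 split several times with `StratifiedShuffleSplit` and averages the test error. The number of repetitions (20) is an arbitrary choice, and the raw (non-standardized) iris features are used for brevity, so the numbers are not directly comparable with the single split above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Raw iris features (assumption: standardization is skipped to keep the sketch short)
X, y = load_iris(return_X_y=True)

# 20 independent stratified 80/20 splits; the repetition count is arbitrary
splitter = StratifiedShuffleSplit(n_splits=20, train_size=0.8, test_size=0.2, random_state=1)

test_errors = []
for train_idx, test_idx in splitter.split(X, y):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    # error = 1 - accuracy, as in the single-split cells above
    test_errors.append(1 - model.score(X[test_idx], y[test_idx]))

test_errors = np.array(test_errors)
print('test | err mean : ', test_errors.mean())
print('test | err std  : ', test_errors.std())
```

Because each repetition draws a different test set, most data points end up being used for both training and testing across repetitions, and the spread of the errors gives a sense of how much a single split depends on the random seed.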


## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square">
    <li>
        Sebastian Raschka - Model evaluation, model selection, and algorithm selection in machine learning - Part I <br>
        https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
    </li>
    <br>
    <li>
        Sebastian Raschka - Model evaluation, model selection, and algorithm selection in machine learning - Part II <br>
        https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
    </li>
</ul>
