# Introduction

This notebook experiments with different train-test split methods from scikit-learn: `ShuffleSplit`, `KFold`, and `StratifiedKFold`.

In [13]:
import numpy as np

from sklearn.model_selection import ShuffleSplit, KFold, StratifiedKFold

# Define Data

In [9]:
# Define dataset and labels
x = np.arange(1, 11, 1)
# Labels are drawn at random without a seed, so they change on every run
y = np.random.randint(0, 2, 10)

# Train-Test Split

In [7]:
# Define a function for showing splits
def show_splits(splitter, x, y):
    """Print the train/test features and labels of every split."""

    # Perform the split; each iteration yields train and test index arrays
    for train, test in splitter.split(x, y):

        print("Train Features: " + str(x[train]))
        print("Train Labels: " + str(y[train]))
        print("Test Features: " + str(x[test]))
        print("Test Labels: " + str(y[test]))
        print("\n")

## Shuffle Split

The dataset is reshuffled before each split is drawn. Because the splits are drawn independently of one another, the test sets are not guaranteed to be disjoint: the same sample can appear in several test sets.
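One way to see the overlap is to count how often each sample ends up in a test set. This is a minimal sketch, not part of the original notebook; `samples` and `test_counts` are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical stand-in for the notebook's x
samples = np.arange(1, 11)

splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Count how many times each sample index lands in a test set
test_counts = np.zeros(len(samples), dtype=int)
for _, test_idx in splitter.split(samples):
    test_counts[test_idx] += 1

print("Times each sample was used for testing:", test_counts)
```

With 5 splits of 2 test samples each, the counts always sum to 10, but nothing forces every count to be exactly 1: some samples may be tested several times and others never.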

In [10]:
# Define the Splitter
shuffle_splitter = ShuffleSplit(n_splits=5,
                                test_size=.2,
                                random_state=0)

show_splits(shuffle_splitter, x, y)

Train Features: [ 5 10  2  7  8  4  1  6]
Train Labels: [0 1 0 0 1 0 0 1]
Test Features: [3 9]
Test Labels: [0 1]


Train Features: [ 2  3 10  9  1  7  8  5]
Train Labels: [0 0 1 1 0 0 1 0]
Test Features: [4 6]
Test Labels: [0 1]


Train Features: [ 9  5  6  2  1  7 10  8]
Train Labels: [1 0 1 0 0 0 1 1]
Test Features: [3 4]
Test Labels: [0 0]


Train Features: [10  3  8  6  9  1  4  5]
Train Labels: [1 0 1 1 1 0 0 0]
Test Features: [7 2]
Test Labels: [0 0]


Train Features: [ 8  5  2  1  7  9 10  4]
Train Labels: [1 0 0 0 0 1 1 0]
Test Features: [6 3]
Test Labels: [1 0]




This approach handles imbalanced classes poorly: because the split ignores the labels, a test set can end up containing samples from only one class (a 100%-0% proportion).
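The effect can be checked by measuring the share of the minority class in each test set. The sketch below assumes a hypothetical 80%/20% label vector (`labels`) rather than the random `y` used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

features = np.arange(1, 11)
# Hypothetical imbalanced labels: eight samples of class 0, two of class 1
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Fraction of class-1 samples in each test set
positive_fractions = []
for _, test_idx in splitter.split(features, labels):
    positive_fractions.append(float(labels[test_idx].mean()))

print("Share of class 1 per test set:", positive_fractions)
```

Each test set holds two samples, so the share can only be 0, 0.5, or 1. Any value other than the dataset-wide 0.2 means the test set misrepresents the class balance.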

## KFold

Splits the dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set.
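The "used once" property can be verified directly: concatenating the test indices of all folds should reproduce every sample index exactly once, in order. A minimal sketch, with `samples` as a hypothetical stand-in for `x`:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(1, 11)

kf = KFold(n_splits=5)  # shuffle defaults to False, so folds are consecutive

# Collect the test indices of every fold
test_folds = [test_idx for _, test_idx in kf.split(samples)]
all_test = np.concatenate(test_folds)

print("Test indices per fold:", [fold.tolist() for fold in test_folds])
print("Every index tested exactly once:",
      np.array_equal(all_test, np.arange(len(samples))))
```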

In [12]:
# Define the Splitter
kfold = KFold(n_splits=5)

show_splits(kfold, x, y)

Train Features: [ 3  4  5  6  7  8  9 10]
Train Labels: [0 0 0 1 0 1 1 1]
Test Features: [1 2]
Test Labels: [0 0]


Train Features: [ 1  2  5  6  7  8  9 10]
Train Labels: [0 0 0 1 0 1 1 1]
Test Features: [3 4]
Test Labels: [0 0]


Train Features: [ 1  2  3  4  7  8  9 10]
Train Labels: [0 0 0 0 0 1 1 1]
Test Features: [5 6]
Test Labels: [0 1]


Train Features: [ 1  2  3  4  5  6  9 10]
Train Labels: [0 0 0 0 0 1 1 1]
Test Features: [7 8]
Test Labels: [0 1]


Train Features: [1 2 3 4 5 6 7 8]
Train Labels: [0 0 0 0 0 1 0 1]
Test Features: [ 9 10]
Test Labels: [1 1]




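## StratifiedKFold

A variation of KFold that returns stratified folds: each fold preserves, as closely as possible, the class proportions of the full dataset.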

In [14]:
# Define the Splitter
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=False)

show_splits(stratified_kfold, x, y)

Train Features: [ 3  4  5  6  7  8  9 10]
Train Labels: [0 0 0 1 0 1 1 1]
Test Features: [1 2]
Test Labels: [0 0]


Train Features: [ 1  2  4  5  7  8  9 10]
Train Labels: [0 0 0 0 0 1 1 1]
Test Features: [3 6]
Test Labels: [0 1]


Train Features: [ 1  2  3  5  6  7  9 10]
Train Labels: [0 0 0 0 1 0 1 1]
Test Features: [4 8]
Test Labels: [0 1]


Train Features: [ 1  2  3  4  6  7  8 10]
Train Labels: [0 0 0 0 1 0 1 1]
Test Features: [5 9]
Test Labels: [0 1]


Train Features: [1 2 3 4 5 6 8 9]
Train Labels: [0 0 0 0 0 1 1 1]
Test Features: [ 7 10]
Test Labels: [0 1]
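One way to check the stratification is to count the labels in each test fold. The sketch below uses hypothetical, perfectly balanced labels (`labels`) so that the expected per-fold counts are easy to reason about; it does not reuse the random `y` from this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

features = np.arange(1, 11)
# Hypothetical balanced labels: five samples per class
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=5)

# Number of samples of each class in every test fold
fold_class_counts = [
    np.bincount(labels[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(features, labels)
]

print("Class counts per test fold:", fold_class_counts)
```

With five samples per class and five folds, every test fold should receive one sample of each class. When the class sizes do not divide evenly by the number of folds, as with the labels printed above, the folds can only approximate the overall proportions.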




