# Bootstrap

scikit-learn no longer supports bootstrapping as a cross-validation scheme. Its cross-validation utilities offer other resampling methods, however, and a common one, ShuffleSplit, is covered below.
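For reference, a bootstrap round can still be written by hand. The sketch below is not a scikit-learn API. It is a minimal illustration that assumes only NumPy and uses a hypothetical helper name, `bootstrap_indices`: training indices are drawn with replacement, and the samples that were never drawn (the out-of-bag samples) serve as the test set.

```python
import numpy as np

def bootstrap_indices(n_samples, n_rounds, seed=0):
    """Yield (train, oob) index arrays for each bootstrap round.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    all_indices = np.arange(n_samples)
    for _ in range(n_rounds):
        # Draw n_samples indices with replacement: duplicates are expected.
        train = rng.integers(0, n_samples, size=n_samples)
        # Samples that were never drawn form the out-of-bag test set.
        oob = np.setdiff1d(all_indices, train)
        yield train, oob

rounds = list(bootstrap_indices(n_samples=20, n_rounds=3))
for train, oob in rounds:
    print("Training indices:", train, "Out-of-bag indices:", oob)
```

Unlike the splitters discussed below, the training set here contains repeated samples, and the size of the test set varies from round to round.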

## The data

First, we generate a small synthetic dataset again:

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

X, y = make_classification(n_samples=20, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2,
                           n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)


# You already know about training and test splits:
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

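## The ShuffleSplit class

Now let's look at that class. ShuffleSplit performs cross-validation by drawing a fresh random training/test split in every iteration, rather than partitioning the data into fixed folds. Within a single split the training and test indices never overlap, but because the splits are drawn independently, the same instance can land in the test set of several iterations: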

In [2]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate

metrics = ['accuracy']
ss = ShuffleSplit(n_splits=10, test_size=0.3)

# Printing the indices (the split here is drawn over all 20 samples in X):
for train_index, test_index in ss.split(X):
    print("Training indices:", train_index, "Test indices:", test_index)

Training indices: [ 3  0 16  9 11 14 18 19 15 17 12  8  2  6] Test indices: [ 5  7  1 10 13  4]
Training indices: [ 7 17 14 12  1  5  4  8  2  3 16  6 15 18] Test indices: [19 13  9 11  0 10]
Training indices: [ 0  5 15  7  2 12  6 17  3 19  8  1  4 16] Test indices: [13 10 11  9 14 18]
Training indices: [13  0 10  7  6  1  5  8 14 19 18 15 17  3] Test indices: [ 2 12 16  4 11  9]
Training indices: [16  2 11 18 17  6  3  5  7  9 15  4 12 14] Test indices: [ 0  8 19 13 10  1]
Training indices: [ 2  4 17  5 19  9 15  3  8 10  0 16 18 11] Test indices: [ 7 14 12  1 13  6]
Training indices: [18 12 11  5 19  9 10  7  3 14  2 13 17  0] Test indices: [ 4  8 15  6 16  1]
Training indices: [12 13 10 18  9 17  0 11  4  1  5  6 15  7] Test indices: [ 8 14 16  2  3 19]
Training indices: [14 13 17 12  9  7 18  6 15  8 11  1  4 10] Test indices: [19  0  5  2  3 16]
Training indices: [ 1 11 18 16  6 13 19 15 10  9  8  5 14  7] Test indices: [ 2  0  4 12 17  3]


In [3]:
print("\nMetrics: ")
outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics,
                          cv=ss, return_train_score=False)
for metric in outcomes.keys():
    print(f"{metric} value: {outcomes[metric]}")


Metrics: 
fit_time value: [0.00080991 0.00051403 0.000561   0.00050068 0.00039959 0.00039387
 0.00045919 0.00040078 0.00061893 0.00070691]
score_time value: [0.00032234 0.00037289 0.00026035 0.00029945 0.00031805 0.00036788
 0.00031376 0.00030112 0.00036454 0.00032401]
test_accuracy value: [1.  0.8 0.8 1.  0.6 0.4 0.6 0.6 0.8 0.4]
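Per-split scores like these are usually condensed into a mean and a standard deviation. The following is a minimal sketch of that step, not part of the original notebook. The larger sample size, the `random_state` values and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic data; the sizes are chosen for illustration only.
X_demo, y_demo = make_classification(n_samples=100, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     n_classes=2, n_clusters_per_class=1,
                                     random_state=14)

# A fixed random_state makes the ten random splits reproducible.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
results = cross_validate(LogisticRegression(solver='liblinear'),
                         X_demo, y_demo, scoring=['accuracy'], cv=cv)

# With a list of scorers, the test scores are stored under 'test_<name>'.
scores = results['test_accuracy']
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f} "
      f"over {len(scores)} splits")
```

With test sets as small as the ones above, the spread across splits is large, so the mean alone says little.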


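Looking back at the printed indices, notice how some samples recur in the test sets of different splits: sample 13, for instance, appears in five of the ten test sets. Note also that the scores were computed on X_train (14 samples) rather than on X, so each test set there holds only 5 samples and the accuracies come in steps of 0.2.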
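The recurrence can be made explicit by counting how often each sample is selected for testing. This is a small sketch under stated assumptions: the placeholder feature matrix `X_demo` and the fixed `random_state` are not in the original and are used only to keep the example self-contained and reproducible.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 20
# Placeholder features; only the number of rows matters for splitting.
X_demo = np.zeros((n_samples, 1))

# random_state is an assumption added here for reproducibility.
ss_demo = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

test_counts = np.zeros(n_samples, dtype=int)
for train_index, test_index in ss_demo.split(X_demo):
    # Within a single split, training and test indices are disjoint.
    assert len(np.intersect1d(train_index, test_index)) == 0
    test_counts[test_index] += 1

print("Times each sample was used for testing:", test_counts)
```

Ten splits with 6 test samples each give 60 test slots for only 20 samples, so repeats across splits are unavoidable. In k-fold cross-validation, by contrast, every sample is tested exactly once.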

## Stratified shuffling

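A stratified version exists as well. StratifiedShuffleSplit draws random splits in the same way, but preserves the class proportions of y in every training and test set, which matters for an imbalanced dataset such as this one (class weights of 0.7 and 0.3):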

In [4]:
from sklearn.model_selection import StratifiedShuffleSplit

metrics = ['accuracy']
ss = StratifiedShuffleSplit(n_splits=10, test_size=0.3)

outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics,
                          cv=ss, return_train_score=False)

for metric in outcomes.keys():
    print(f"{metric} value: {outcomes[metric]}")

fit_time value: [0.00055051 0.00058889 0.00043964 0.00036931 0.00030208 0.00030017
 0.00029993 0.00032496 0.00030494 0.00030375]
score_time value: [0.00025678 0.0002501  0.00034833 0.00021386 0.00019121 0.00018954
 0.00019073 0.00019264 0.00018954 0.00018835]
test_accuracy value: [0.8 0.6 0.8 0.8 1.  0.8 0.6 0.6 0.8 0.8]
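To see what stratification changes, one can compare how many minority-class samples end up in each test set under the two splitters. The sketch below uses hypothetical labels with an exact 70/30 ratio, a hypothetical helper `positives_per_test_set`, and a fixed `random_state`. None of these come from the original notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical labels with a 70/30 class ratio: 14 zeros and 6 ones.
y_demo = np.array([0] * 14 + [1] * 6)
# Placeholder features; only the number of rows matters for splitting.
X_demo = np.zeros((len(y_demo), 1))

def positives_per_test_set(splitter):
    """Count the class-1 samples in each test set the splitter produces."""
    return [int(y_demo[test_index].sum())
            for _, test_index in splitter.split(X_demo, y_demo)]

plain_counts = positives_per_test_set(
    ShuffleSplit(n_splits=10, test_size=0.3, random_state=0))
strat_counts = positives_per_test_set(
    StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0))

print("ShuffleSplit:          ", plain_counts)
print("StratifiedShuffleSplit:", strat_counts)
```

With 6 positives among 20 samples and test sets of 6, the stratified splitter should place 2 positives in every test set, while the counts from the plain splitter are free to vary from split to split.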
