## Sound Statistical Evaluation Methods

#### In supervised learning we want a sound method to determine how well our classifier is working. There are three standard methods used to make this determination.

### Data Splitting
#### If you have a lot of labeled data, say more than 10,000 observations, a simple train/test split is usually sufficient. Common splits are 80/20 or 90/10 (training/testing).
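
#### A minimal sketch of an 80/20 split, assuming scikit-learn's `train_test_split`. The synthetic data set from `make_classification`, the variable names, and the `random_state` value are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled data set (assumption: 1,000 rows, 5 features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=0)

# Hold out 20% of the observations for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("Training observations:", len(X_train))
print("Test observations:", len(X_test))
```

#### Setting `test_size=0.1` instead would give the 90/10 split mentioned above.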

### Cross-Validation
#### When you have less data, say fewer than 10,000 observations, 10-fold cross-validation is an accepted method for evaluating the performance of your classifier. In 10-fold cross-validation we split our data into 10 equal folds. In each iteration we hold out one fold for testing and train with the other 9 folds, so we are training with 90% of our data and testing with 10%. We repeat this 10 times, holding out a different fold each time, and obtain 10 evaluation scores, which we average to determine the performance of our classifier.
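
#### The procedure above can be sketched with scikit-learn's `cross_val_score`, which trains and scores one model per fold. The synthetic data and the choice of `LogisticRegression` as the classifier are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: 500 rows, 5 features).
X_cv, y_cv = make_classification(n_samples=500, n_features=5, random_state=0)

# Ten shuffled folds: each fold serves as the test set exactly once.
kf_cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# One accuracy score per held-out fold.
cv_scores = cross_val_score(clf, X_cv, y_cv, cv=kf_cv)

print("Fold scores:", cv_scores)
print("Mean accuracy:", np.mean(cv_scores))
```

#### The mean of the 10 fold scores is the single number reported as the classifier's performance.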

### Bootstrapping
#### When we are dealing with small data sets, say fewer than 1,000 observations, bootstrapping can be used to obtain an accurate picture of our classifier's performance characteristics. The benefit of bootstrapping is that the training set keeps the same number of observations as the original data set. We randomly sample with replacement from our original data set until we have as many observations as we started with. For example, if we have 100 observations, we draw 100 observations with replacement from our original data set. This training set is referred to as our "in-the-bag" (ITB) training set. The cases in our original data set that were never selected are then used for testing; this set is referred to as our "out-of-bag" (OOB) test set. We typically repeat this process at least 2000 times and average the results to determine the final performance score. Perhaps surprisingly, it can be shown mathematically that the OOB test set will contain ~37% of our original data set on average: the probability that a given observation is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. Consequently, about 63% of the original observations appear in the ITB set, and the remaining ~37% of its entries are duplicates.
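
#### A minimal sketch of bootstrap evaluation with an actual classifier, under stated assumptions: the data is synthetic, `LogisticRegression` is an arbitrary stand-in model, and only 200 iterations are run to keep the example fast (the text above suggests at least 2000).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: 200 rows, 5 features).
X_bs, y_bs = make_classification(n_samples=200, n_features=5, random_state=0)
n_obs = len(X_bs)
rng = np.random.default_rng(0)

n_boot = 200  # illustrative; use 2000 or more in practice
oob_scores = []
oob_fractions = []
for _ in range(n_boot):
    # In-the-bag (ITB) indices: n_obs draws with replacement.
    itb_idx = rng.integers(0, n_obs, size=n_obs)
    # Out-of-bag (OOB) indices: observations that were never drawn.
    oob_idx = np.setdiff1d(np.arange(n_obs), itb_idx)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_bs[itb_idx], y_bs[itb_idx])
    oob_scores.append(model.score(X_bs[oob_idx], y_bs[oob_idx]))
    oob_fractions.append(len(oob_idx) / n_obs)

print("Mean OOB accuracy:", np.mean(oob_scores))
print("Mean OOB fraction:", np.mean(oob_fractions))
```

#### The mean OOB fraction should land close to the ~37% figure discussed above, and the mean OOB accuracy is the bootstrap estimate of the classifier's performance.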




### Environment Setup

In [1]:
import numpy as np
from sklearn.model_selection import KFold

### 10-fold Cross-Validation

#### Note that each test set is a unique set of observations from our original data set.

In [2]:

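# Placeholder data: only the number of observations (100) matters for the split.
X = [0 for i in range(100)]
kf = KFold(n_splits=10, shuffle=True)
# kf.split yields arrays of row indices, not the observations themselves.
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print("Fold - " + str(fold))
    print("Training Set: " + str(train_idx))
    print("Test Set " + str(test_idx))
    print()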

Fold - 1
Training Set: [ 0  1  2  3  4  5  6  7  9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25
 26 27 29 30 31 32 33 34 36 37 38 39 41 42 43 44 45 46 47 48 49 50 51 52
 53 54 55 56 57 58 59 60 61 62 64 65 66 67 68 70 71 72 73 74 75 76 77 78
 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 95 98 99]
Test Set [ 8 14 28 35 40 63 69 94 96 97]

Fold - 2
Training Set: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 21 22 23 25
 27 28 29 30 31 32 33 35 36 37 39 40 41 44 46 47 48 49 50 51 52 54 55 56
 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
Test Set [20 24 26 34 38 42 43 45 53 80]

Fold - 3
Training Set: [ 0  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 47 48 50 52 53 54
 55 57 59 60 61 62 63 64 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
Test Set [ 1  2 13 39 

### Bootstrapping

#### Note that each training set is the same size as our original data set and that our test set is a set of distinct cases from our original data set.

In [3]:
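X = np.arange(100)
n = len(X)
n_iterations = 100
oob_total = 0  # running total of OOB set sizes
for i in range(1, n_iterations + 1):
    print("Iteration - " + str(i))
    # Draw n indices with replacement: the in-the-bag (ITB) sample.
    ITB = np.random.choice(n, n, replace=True)
    X_ITB = X[ITB]
    print("Training Set (%d): %s" % (len(X_ITB), str(X_ITB)))
    # Observations that were never drawn form the out-of-bag (OOB) set.
    X_OOB = np.delete(X, np.unique(ITB), 0)
    print("Test Set (%d): %s" % (len(X_OOB), str(X_OOB)))
    print()
    oob_total += len(X_OOB)
print()
print("Average OOB:" + str(oob_total / n_iterations))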

Iteration - 1
Training Set (100): [19 15 83  1 22 17 54 97 84 41 74 77 47 70 81 10 18 40 48 88 60 65 38 83
 37 17 24 49 58 47  9 57 99 40  8  5 47  9 26 31 99 82 38 23  5 70  1 98
 79 11  3 30 42 22 68 54  5 63 37 35 71  7  4 61 17 51 99  5 26 43 50  6
 28 19 70  3 67 40 64 88 77 11 28 82 59 71  4 21 91 40 93 65 38 13 71 37
 26 15 93 32]
Test Set (38): [ 0  2 12 14 16 20 25 27 29 33 34 36 39 44 45 46 52 53 55 56 62 66 69 72
 73 75 76 78 80 85 86 87 89 90 92 94 95 96]

Iteration - 2
Training Set (100): [ 7 12 77 68 78 56  8 84 46 61 22 77  1 19 38 79 29 48 30 92  8 50 75 71
 71 45 68 76 62 16 26 69 69 72 58 33 20  2 98 89 24  4 31 46 85 59 32 73
 97 72 99 86 66  6 11 62 76 73 68 68 88  5 56 75 68 74 45 89 19 95 22 98
 46 53  4 39 71 26 80 38 29 77 27 54 82 24 81 45 95 17 56 63 98 34 98 69
 63 27 25  5]
Test Set (37): [ 0  3  9 10 13 14 15 18 21 23 28 35 36 37 40 41 42 43 44 47 49 51 52 55
 57 60 64 65 67 70 83 87 90 91 93 94 96]

Iteration - 3
Training Set (100): [25  2 62 16 56 12 45 6