In [1]:
# - You need to know how well your algorithms perform on unseen data. 

# - The best way to evaluate the performance of an algorithm would be to make predictions for new data 
# to which you already know the answers. 

# - The second-best way is to use resampling methods, statistical techniques that let you 
# estimate how well your algorithm will perform on new data.

In [2]:
# 9.1 Evaluate Machine Learning Algorithms

In [3]:
# - We can't train a machine learning algorithm on a dataset and then evaluate it with predictions 
# on that same dataset: an over-fitted model would look far better than it really is.

# - We must evaluate the machine learning algorithms on data that has not been used to train the algorithms.

# - The evaluation is an estimate that we can use to talk about how well we think the algorithm may actually 
# do in practice. 

# - It is not a guarantee of performance. 

# - Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire 
# training dataset and get it ready for operational use.
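
# - A minimal sketch of the points above, not part of the original notebook. It assumes synthetic 
# data from make_classification and an unpruned DecisionTreeClassifier, both chosen only because 
# they make the gap between training accuracy and held-out accuracy easy to see. The names 
# X_demo, y_demo and final_model are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset loaded below).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=7)

# An unpruned tree can memorize the rows it was trained on.
tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)  # scored on rows the model has already seen
test_acc = tree.score(X_te, y_te)   # scored on held-out rows
print('train accuracy: {:.2%}, test accuracy: {:.2%}'.format(train_acc, test_acc))

# Once performance has been estimated, refit on all available data for operational use.
final_model = DecisionTreeClassifier(random_state=7).fit(X_demo, y_demo)
```

# - Only the held-out figure says anything about unseen data; the training figure mostly 
# measures how well the model memorized its inputs.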

In [4]:
from pandas import read_csv

In [5]:
import numpy

In [6]:
import sys

In [7]:
def print_data(_data):
    return numpy.savetxt(sys.stdout, _data[:5,:], '%5.3f')

In [8]:
_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'

In [9]:
_col_names = ['preg','plas','pres','skin','test','mass','pedi','age','class']

In [10]:
_dataframe = read_csv(_uri, names=_col_names)

In [11]:
_array = _dataframe.values

In [12]:
print_data(_array)

6.000 148.000 72.000 35.000 0.000 33.600 0.627 50.000 1.000
1.000 85.000 66.000 29.000 0.000 26.600 0.351 31.000 0.000
8.000 183.000 64.000 0.000 0.000 23.300 0.672 32.000 1.000
1.000 89.000 66.000 23.000 94.000 28.100 0.167 21.000 0.000
0.000 137.000 40.000 35.000 168.000 43.100 2.288 33.000 1.000


In [13]:
_X = _array[:,0:8]

In [14]:
print_data(_X)

6.000 148.000 72.000 35.000 0.000 33.600 0.627 50.000
1.000 85.000 66.000 29.000 0.000 26.600 0.351 31.000
8.000 183.000 64.000 0.000 0.000 23.300 0.672 32.000
1.000 89.000 66.000 23.000 94.000 28.100 0.167 21.000
0.000 137.000 40.000 35.000 168.000 43.100 2.288 33.000


In [15]:
_Y = _array[:,8:]

In [16]:
print_data(_Y)

1.000
0.000
1.000
0.000
1.000


In [17]:
_Y = numpy.ravel(_Y)

In [18]:
print(_Y[:5])

[ 1.  0.  1.  0.  1.]


In [19]:
# 9.2 Split into Train and Test Sets

In [20]:
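# - The simplest evaluation method is to split the dataset in two: train the algorithm on the first part, 
# make predictions on the second part, and compare those predictions with the known answers.

# - The size of the split can depend on the size and specifics of your dataset, although it is common
# to use 67% of the data for training and the remaining 33% for testing.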

# - This algorithm evaluation technique is very fast. 

# - It is ideal for large datasets (millions of records) where there is strong evidence that both splits 
# of the data are representative of the underlying problem. Because of the speed, it is useful to use this 
# approach when the algorithm you are investigating is slow to train. 

# - A downside of this technique is that it can have a high variance. This means that differences in the 
# training and test dataset can result in meaningful differences in the estimate of accuracy.
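
# - The variance mentioned above can be made visible by repeating the split with different seeds. 
# This is a rough sketch, assuming synthetic data from make_classification so that it runs without 
# a download; the spread it prints will not match what the diabetes data would give.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption made so the sketch runs offline).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

split_scores = []
for seed in range(5):
    # Same data and same model; only the random partition changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(['{:.2%}'.format(s) for s in split_scores])
print('spread: {:.2%}'.format(max(split_scores) - min(split_scores)))
```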

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
_test_size = 0.33

In [24]:
_seed = 7

In [25]:
_X_train, _X_test, _Y_train, _Y_test = train_test_split(_X, _Y, test_size=_test_size, random_state=_seed)

In [26]:
_model = LogisticRegression()

In [27]:
_model.fit(_X_train, _Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
_score = _model.score(_X_test, _Y_test)

In [29]:
_score

0.75590551181102361

In [30]:
print('accuracy: {:.2%}'.format(_score))

accuracy: 75.59%


In [31]:
# 9.3 K-fold Cross Validation

In [32]:
# - Cross-validation is an approach that you can use to estimate the performance of a machine learning 
# algorithm with less variance than a single train-test set split. 

# - It works by splitting the dataset into k parts (e.g. k = 5 or k = 10). Each part of the data is called 
# a fold. The algorithm is trained on k - 1 folds with one held back and tested on the held back fold. 
# This is repeated so that each fold of the dataset is given a chance to be the held back test set. 
# After running cross-validation you end up with k different performance scores that you can summarize 
# using a mean and a standard deviation.

# - The result is a more reliable estimate of the performance of the algorithm on new data. 
# It is more accurate because the algorithm is trained and evaluated multiple times on different data.

# - For modest sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common.
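
# - The fold mechanics described above can be sketched in plain Python. The helper k_fold_indices 
# is a hypothetical name introduced here for illustration. It takes contiguous, unshuffled folds 
# and is only meant to show how the indices are partitioned, not to replace KFold.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # the held-back fold
        train = indices[:start] + indices[start + size:]      # the other k - 1 folds
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train, test in folds:
    print('train:', train, 'test:', test)
```

# - Every index lands in exactly one test fold, so each observation is used for testing once 
# and for training k - 1 times.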

In [33]:
from sklearn.model_selection import KFold

In [34]:
from sklearn.model_selection import cross_val_score

In [35]:
from sklearn.linear_model import LogisticRegression

In [36]:
_num_folds = 10

In [37]:
_seed = 7

In [38]:
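# random_state only takes effect when shuffle=True, and recent scikit-learn versions raise an error
# if it is set without shuffling, so it is omitted here and the folds are taken in order.
_kfold = KFold(n_splits=_num_folds)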

In [39]:
_model = LogisticRegression()

In [40]:
_score = cross_val_score(_model, _X, _Y, cv=_kfold)

In [41]:
print('accuracy: {:.2%}, {:.2%}'.format(_score.mean(), _score.std()))

accuracy: 76.95%, 4.84%


In [42]:
# 9.4 Leave One Out Cross Validation

In [43]:
# - You can configure cross-validation so that the size of the fold is 1 (k is set to the number 
# of observations in your dataset). 

# - This variation of cross-validation is called leave-one-out cross-validation. 

# - The result is a large number of performance measures that can be summarized in an effort to 
# give a more reasonable estimate of the accuracy of your model on unseen data. 

# - A downside is that it can be a computationally more expensive procedure than k-fold cross-validation.
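
# - A quick way to see the cost difference is to count how many times the model has to be fitted. 
# The sketch below assumes a placeholder array of 768 rows; the row count is an assumption used 
# for illustration and X_demo is not the notebook's _X.

```python
import numpy
from sklearn.model_selection import KFold, LeaveOneOut

# Placeholder feature matrix; only its number of rows matters here.
X_demo = numpy.zeros((768, 8))

# get_n_splits reports how many train/test iterations, and so model fits, a splitter yields.
kfold_fits = KFold(n_splits=10).get_n_splits(X_demo)
loo_fits = LeaveOneOut().get_n_splits(X_demo)
print('10-fold fits: {}, leave-one-out fits: {}'.format(kfold_fits, loo_fits))
```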

In [44]:
from sklearn.model_selection import LeaveOneOut

In [45]:
_loo = LeaveOneOut()

In [46]:
_model = LogisticRegression()

In [47]:
_score = cross_val_score(_model, _X, _Y, cv=_loo)

In [48]:
print('accuracy: {:.2%}, {:.2%}'.format(_score.mean(), _score.std()))

accuracy: 76.82%, 42.20%


In [49]:
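# - The standard deviation is far larger than in the k-fold cross-validation results above. Each 
# leave-one-out fold tests a single observation, so every score is either 0 or 1. The spread reflects 
# that all-or-nothing scoring rather than an unstable accuracy estimate.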
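
# - The leave-one-out figures can be checked by hand. The sketch below assumes 768 records with 
# 590 classified correctly, counts that are not stated in the notebook but reproduce the reported 
# mean. For scores that are all 0 or 1, the standard deviation is sqrt(p * (1 - p)).

```python
import math
import statistics

# Assumption: 768 records, 590 of them classified correctly.
n_records, n_correct = 768, 590
loo_scores = [1.0] * n_correct + [0.0] * (n_records - n_correct)

p = statistics.mean(loo_scores)
std = statistics.pstdev(loo_scores)  # population standard deviation, numpy's default too
print('accuracy: {:.2%}, {:.2%}'.format(p, std))
print('sqrt(p * (1 - p)): {:.2%}'.format(math.sqrt(p * (1 - p))))
```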

In [50]:
# 9.5 Repeated Random Test-Train Splits

In [51]:
# - Another variation on k-fold cross-validation is to create a random split of the data like 
# the train/test split described above, but repeat the process of splitting and evaluation of 
# the algorithm multiple times, like cross-validation. 

# - This has the speed of using a train/test split and the reduction in variance in the estimated 
# performance of k-fold cross-validation. 

# - You can also repeat the process many more times as needed to improve the accuracy of the estimate. 

# - A downside is that repetitions may include much of the same data in the train or the test 
# split from run to run, introducing redundancy into the evaluation.
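
# - The redundancy can be illustrated without any model. This plain-Python sketch draws repeated 
# random test sets and counts how often rows are reused. The sizes and seed are arbitrary choices, 
# and the test-set size is truncated with int(), a simplification that may differ from how 
# ShuffleSplit rounds.

```python
import random
from collections import Counter

n_samples, n_repeats, test_fraction = 30, 10, 0.33
n_test = int(n_samples * test_fraction)  # 9 test rows per repetition
rng = random.Random(7)

test_counts = Counter()
for _ in range(n_repeats):
    # Each repetition draws a fresh test set, independently of earlier ones.
    test_counts.update(rng.sample(range(n_samples), n_test))

reused = sum(1 for count in test_counts.values() if count > 1)
never_tested = n_samples - len(test_counts)
print('rows tested more than once:', reused)
print('rows never tested:', never_tested)
```

# - Unlike k-fold cross-validation, nothing guarantees that every row is tested exactly once: 
# some rows are tested repeatedly and others may never be tested at all.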

In [52]:
from sklearn.model_selection import ShuffleSplit

In [53]:
_n_splits = 10

In [54]:
_test_size = 0.33

In [55]:
_seed = 7

In [56]:
_kfold = ShuffleSplit(n_splits=_n_splits, test_size=_test_size, random_state=_seed)

In [57]:
_model = LogisticRegression()

In [58]:
_score = cross_val_score(_model, _X, _Y, cv=_kfold)

In [59]:
print('accuracy: {:.2%}, {:.2%}'.format(_score.mean(), _score.std()))

accuracy: 76.50%, 1.70%


In [60]:
# 9.6 What Techniques to Use When

In [61]:
# - Generally k-fold cross-validation is the gold standard for evaluating the performance of a 
# machine learning algorithm on unseen data with k set to 3, 5, or 10.

# - Using a train/test split is good for speed when using a slow algorithm and produces performance 
# estimates with lower bias when using large datasets.
  
# - Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates 
# when trying to balance variance in the estimated performance, model training speed and dataset size.

# - The best advice is to experiment and find a technique for your problem that is fast and produces 
# reasonable estimates of performance that you can use to make decisions. 

# - If in doubt, use 10-fold cross-validation.