###  Python Basics Tutorial

#### Evaluate Performance of Machine Learning Algorithms with Resampling Basics Tutorial

####  Machine Learning Mastery with Python
####  Jason Brownlee

### In this recipe:

- Create Training and Test sets: good for speed when using large data sets
- k-fold CV: 'gold standard'
- LOOCV: good compromise when balancing model performance variance and dataset size
- Repeated Random Test/Train Splits: good compromise when balancing model performance variance and dataset size

In [10]:
path = 'D:\\OneDrive - QJA\\My Files\\DataScience\\DataSets'
filename = 'pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 
         'mass', 'pedi', 'age', 'class']


### Split into Train and Test Sets

- check that a random split does not leave the train and test sets with noticeably different distributions (for example, different class proportions)
    - if it does, use another split method

In [9]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

dataframe = read_csv(path + '\\' + filename, 
                    names = names)

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

test_size = 0.33
seed = 123

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                   test_size = test_size,
                                                   random_state = seed)

model = LogisticRegression(solver = 'liblinear')
model.fit(X_train, Y_train)

# score() returns the mean accuracy on the held-out test set
accuracy = model.score(X_test, Y_test)

print('Accuracy: %.3f%%' % (accuracy * 100.0))

Accuracy: 79.528%
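
One alternative split method, when class proportions differ between train and test, is a stratified split. The sketch below is only an illustration of that idea: it uses synthetic, imbalanced labels (not the Pima data), and the names `X_demo`, `Y_demo`, `X_tr`, `X_te`, `Y_tr`, and `Y_te` are hypothetical. It relies on the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration only (not the Pima data):
# 90 observations of class 0 and 10 of class 1
rng = np.random.RandomState(123)
X_demo = rng.normal(size=(100, 3))
Y_demo = np.array([0] * 90 + [1] * 10)

# stratify=Y_demo asks for (nearly) equal class proportions in both parts
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo,
                                          test_size=0.3,
                                          random_state=123,
                                          stratify=Y_demo)

# Because the labels are 0/1, the mean is the share of class 1
print('Share of class 1 in train: %.3f' % Y_tr.mean())
print('Share of class 1 in test:  %.3f' % Y_te.mean())
```

Whether stratification is needed for a given dataset depends on how imbalanced the classes are and how small the test set is.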


### K-Fold Cross Validation

- data is split into k folds
- algorithm is trained on k-1 folds, with 1 fold held back as the test set; this is repeated so each fold is the test set once
- can be more reliable than the train/test split method
- k must be chosen so that each test fold is an appropriate sample size

In [14]:
# 10-fold cross validation

#from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

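# specify kfold parameters
# random_state is omitted: it has no effect unless shuffle = True,
#   and newer scikit-learn versions raise an error for that combination
kfold = KFold(n_splits = 10)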
# create object to represent LR algo
model = LogisticRegression(solver = 'liblinear')

# run CV LR model
results = cross_val_score(model, X, Y, cv = kfold)

print('Accuracy: %.3f%%, Stnd Dev: (%.3f%%)' % (results.mean()*100.0,
                                                results.std()*100.0))

Accuracy: 76.951%, Stnd Dev: (4.841%)
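
To make the fold mechanics described at the top of this section concrete, here is a small sketch on toy data (20 rows, 5 folds; not the Pima data). The names `X_demo`, `kfold_demo`, and `test_counts` are hypothetical. It only shows which rows are held out in each fold and does not train a model.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 observations, for illustration only
X_demo = np.arange(20).reshape(-1, 1)

kfold_demo = KFold(n_splits=5)
test_counts = np.zeros(len(X_demo), dtype=int)

# split() yields one (train indices, test indices) pair per fold
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_demo), start=1):
    test_counts[test_idx] += 1
    print('Fold %d: train on %d rows, test on rows %s'
          % (fold, len(train_idx), test_idx))

# Every row should have been held out exactly once
print('Times each row was held out:', test_counts)
```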


### Leave One Out Cross Validation

- k is set to number of observations in data set and 1 observation is left out each time
- results in many measurements that are summarized
- computationally expensive

In [18]:
#from pandas import read_csv
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

loocv = LeaveOneOut()
model = LogisticRegression(solver = 'liblinear')

results = cross_val_score(model, X, Y, cv = loocv)

print('Accuracy: %.3f%%, Stnd Dev: %.3f%%' % (results.mean()*100.0,
                                             results.std()*100.0))

# note: each LOOCV score is either 0 or 1 (a single test observation),
#   so the stdev across folds is much larger than with k-fold cv above

Accuracy: 76.953%, Stnd Dev: 42.113%
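
The large standard deviation follows from the scores being 0 or 1: for such scores with mean p, the standard deviation is sqrt(p * (1 - p)). The sketch below checks this arithmetic under an assumption, namely that the dataset has 768 rows and 591 of them were classified correctly (591 / 768 matches the reported 76.953%). The name `loo_scores` is hypothetical and the array is constructed by hand, not produced by the model.

```python
import numpy as np

# Assumption: 768 rows, of which 591 are classified correctly;
# these counts are inferred from the reported accuracy, not taken from the run
loo_scores = np.array([1.0] * 591 + [0.0] * 177)

p = loo_scores.mean()
print('Mean accuracy:      %.3f%%' % (p * 100.0))
print('Std of 0/1 scores:  %.3f%%' % (loo_scores.std() * 100.0))
print('sqrt(p * (1 - p)):  %.3f%%' % (np.sqrt(p * (1 - p)) * 100.0))
```

So the LOOCV standard deviation mostly reflects the accuracy itself rather than instability of the model, and it is not directly comparable to the per-fold standard deviation from k-fold.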


### Repeated Random Test-Train Splits

- variation of k-fold
- creates a random train/test split, repeated multiple times
- con: different runs may reuse many of the same observations in train or test, which adds redundancy to the evaluation

In [21]:
#from pandas import read_csv
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

n_splits = 10
test_size = .33
seed = 7

# create object to represent random repeat specs
shuffle_split = ShuffleSplit(n_splits = n_splits,
                             test_size = test_size,
                             random_state = seed)

model = LogisticRegression(solver = 'liblinear')

results = cross_val_score(model, X, Y, cv = shuffle_split)

print('Accuracy: %.3f%%, Stnd Dev: %.3f%%' % (results.mean()*100.0,
                                               results.std()*100.0))

# stdev is lower here than in the k-fold cv run above (1.698% vs 4.841%)

Accuracy: 76.496%, Stnd Dev: 1.698%
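
The redundancy mentioned in the list above can be seen by counting how often each row lands in a test set. This is a sketch on toy data (100 rows; not the Pima data), with the hypothetical names `X_demo`, `splitter`, and `held_out`. It reuses the same `n_splits`, `test_size`, and seed values as the cell above.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 100 observations, for illustration only
X_demo = np.arange(100).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
held_out = np.zeros(len(X_demo), dtype=int)

# Count how many of the 10 random test sets each row appears in
for train_idx, test_idx in splitter.split(X_demo):
    held_out[test_idx] += 1

print('Rows never held out:          %d' % (held_out == 0).sum())
print('Rows held out more than once: %d' % (held_out > 1).sum())
```

Unlike k-fold, nothing guarantees that each row is tested exactly once, so some rows may be tested several times and others possibly not at all.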
