### Evaluating performance of ML algorithms

There are two ways:

1. Make predictions on unseen data for which you already know the answer.
2. Use statistical resampling methods to estimate performance on new data.


#### Evaluate Algorithms

Evaluating on the same data that was used to train the algorithm hides overfitting: the model can get a perfect score on the training set and still perform very poorly on new data.
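A minimal sketch of this effect, assuming a synthetic dataset from `make_classification` and an unpruned `DecisionTreeClassifier` (neither appears in these notes; both are picked only because they make the gap easy to see):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

# An unpruned decision tree can memorize the training rows
model = DecisionTreeClassifier(random_state=7)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on rows the model has seen
test_acc = model.score(X_test, y_test)     # scored on held-out rows
print("Train accuracy: %.3f" % train_acc)
print("Test accuracy:  %.3f" % test_acc)
```

The training score says nothing about new data; only the held-out score is a usable estimate.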

Evaluation estimates how well the algorithm will do in practice; it is not a guarantee. Once performance has been estimated, retrain the algorithm on the whole dataset and prepare it for use.
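The estimate-then-retrain workflow could look roughly like this. It is a sketch on synthetic data; `max_iter=1000` and the 5-fold setting are assumptions chosen to keep the example quick and free of warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# Step 1: estimate performance with a resampling method
estimate = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Estimated accuracy: %.3f (+/- %.3f)" % (estimate.mean(), estimate.std()))

# Step 2: fit the final model on every available row
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X, y)
```

The models fitted during step 1 are discarded; they exist only to produce the estimate.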

Techniques to split the dataset and create useful estimates:

- Train and Test Sets
- k-fold Cross-Validation
- Leave One Out Cross-Validation
- Repeated Random Test-Train Splits


#### Split into Train and Test Sets

Break your data into training and testing datasets.

Train the algorithm on the first part of the original dataset, then use the second part to make predictions and compare them against the expected results. A common split is 67% training and 33% testing, but the right ratio depends on the size and specifics of each dataset.

This is fast, and ideal for large datasets where both splits represent the problem accurately. The estimate can have high variance: a different split into training and test sets can produce a noticeably different accuracy.

For example, evaluating the accuracy of a Logistic Regression model:


In [2]:
from pandas import read_csv

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# fraction of the data held out for testing
test_size = 0.33
# fixed seed makes the random split reproducible, which matters when comparing results
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy %.3f%%" % (result * 100.0))



Accuracy 75.591%
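To see the variance of a single split, one option is to repeat the split with different seeds and compare the scores. The sketch below uses synthetic data rather than the Pima dataset so that it runs offline; the dataset parameters and `max_iter=1000` are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset with some label noise (assumption)
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)

scores = []
for seed in range(10):
    # Same data and model each time; only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print("Lowest accuracy:  %.3f" % min(scores))
print("Highest accuracy: %.3f" % max(scores))
```

The spread between the lowest and highest score shows how much a single train/test estimate depends on which rows happen to land in the test set.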


#### k-fold Cross-Validation

This approach has less variance than a single train/test split (above). The dataset is broken into k parts, or folds. The algorithm is trained on k-1 folds and evaluated on the one held back; this is repeated until each fold has been held back once. You end up with k performance scores, summarized using a mean and a standard deviation.

The result is a more reliable estimate of performance on new data. k is chosen so that each fold is large enough to be a reasonable sample of the problem, while still allowing enough repetitions of the train-and-evaluate cycle to give a fair estimate on unseen data.

For modest datasets of thousands to tens of thousands of records, k values of 3, 5, and 10 are common.

In [8]:
from pandas import read_csv

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

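num_folds = 10
# no shuffling: folds are taken in row order, so no random_state is needed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
kfold = KFold(n_splits=num_folds)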
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Accuracy 76.951% (4.841%)
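`cross_val_score` hides the loop described above. Written out by hand, the procedure might look like the following sketch, which assumes synthetic data, `k = 5`, and shuffled folds (all illustrative choices, not taken from the cell above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for a real dataset (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=7)

scores = []
held_out = []  # records which rows were used for testing
for train_idx, test_idx in kfold.split(X):
    # Train on k-1 folds, evaluate on the fold that was held back
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    held_out.extend(test_idx)

scores = np.array(scores)
print("Accuracy %.3f%% (%.3f%%)" % (scores.mean() * 100.0, scores.std() * 100.0))
```

Every row lands in the held-out fold exactly once, so each observation contributes to the estimate without ever being scored by a model that was trained on it.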


#### Leave One Out Cross-Validation

#### Repeated Random Test-Train Splits