# 5. Evaluating Model Performance

### Objectives
* Learn how to estimate a model's performance on unseen data
* Understand why evaluating on the training set is wrong
* Know what generalization performance is
* Learn how to create a training/testing split

### Resources
* [Four-part series][1] on model evaluation by Sebastian Raschka
* Scikit-Learn user guide on [cross validation][2]

## Introduction
In the previous notebooks, we evaluated model performance by calling each estimator's `score` method, which returned the accuracy. All of the scoring we've done so far is wrong. Evaluating a model on the same data it was trained on does not give a true measure of how well it is likely to perform in the future.

This is akin to taking a test in school where the professor hands you the questions and answers beforehand. Our score gives us very little information about how well we would do on questions we've never encountered before.

## Goal: Estimate Accuracy on Future Unseen Data
One of the major goals of machine learning is to have a good idea of how well the model will perform on future unseen data. When the model is released into the wild, how will it perform? This is sometimes referred to as **generalization performance** (or [generalization error][3]). In other words, it is a measurement of how well the model generalizes to data that it has not seen before. Several ideas have been developed to estimate this generalization performance.

# First Idea: Create a "holdout" test set that is not used during training
A simple idea is to partition the original dataset into two distinct datasets, one to be used during training, and another to be withheld for evaluating performance. This holdout dataset is also referred to as the **test** dataset.

### Use the `train_test_split` helper function
The helper function `train_test_split` splits the data into training and test sets. It accepts both the input, `X`, and the labels, `y`, and splits each one into train and test sets, returning four arrays in total. Typical splits place more data in the training set than in the test set. By default, a 75/25 split is used, but we can change this with the `test_size` parameter. Below, we choose two columns as our input data and make a 70/30 split.

[1]: https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
[2]: https://scikit-learn.org/stable/modules/cross_validation.html
[3]: https://en.wikipedia.org/wiki/Generalization_error

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

heart = pd.read_csv('../data/heart.csv')
heart.head()

Unnamed: 0,age,sex,chest_pain,rest_bp,chol,fbs,rest_ecg,max_hr,exang,old_peak,slope,ca,thal,disease
0,63,Male,typical,145,233,1,left ventricular hypertrophy,150,0,2.3,3,0.0,fixed,0
1,67,Male,asymptomatic,160,286,0,left ventricular hypertrophy,108,1,1.5,2,3.0,normal,1
2,67,Male,asymptomatic,120,229,0,left ventricular hypertrophy,129,1,2.6,2,2.0,reversable,1
3,37,Male,nonanginal,130,250,0,normal,187,0,3.5,3,0.0,normal,0
4,41,Female,nontypical,130,204,0,left ventricular hypertrophy,172,0,1.4,1,0.0,normal,0


In [2]:
X = heart[['max_hr', 'rest_bp']].values
y = heart['disease'].values

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=.3)

In [5]:
X_train.shape

(212, 2)

In [6]:
y_train.shape

(212,)

In [7]:
X_test.shape

(91, 2)

In [8]:
y_test.shape

(91,)
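
The shapes above follow from simple arithmetic on the 303 rows. The sketch below reproduces them, assuming the rounding rule scikit-learn applies when only a fractional `test_size` is given (the test count is rounded up and the remainder goes to training). The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (assumed rounding rule)."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test

print(split_sizes(303, 0.3))  # (212, 91), matching the shapes above
```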

## Fit model just on the training data
To get an accurate measure of future performance, the model must never be exposed to the data in the test set. Below, we build a decision tree classifier and train on just the training data.

In [9]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

In [10]:
dtc.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Evaluating (incorrectly) on the training data
Before we evaluate on the test data, let's repeat our mistakes from the previous notebooks and score ourselves on data that the model has already seen.

In [11]:
dtc.score(X_train, y_train)

0.9622641509433962

## Evaluate on the test data
Because the model was never exposed to the test data during training, we can safely use that data to assess its performance.

In [12]:
dtc.score(X_test, y_test)

0.5274725274725275

### Model performance is worse than the baseline
The results from the test performance indicate we are slightly worse off than where we started. We have built a truly atrocious model unable to beat the baseline.
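
The baseline was defined in an earlier notebook. The sketch below assumes it means the accuracy obtained by always predicting the most frequent class. The function name and the toy labels are hypothetical and are not taken from the heart dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels, not the heart dataset: 54 negatives and 46 positives
toy_labels = [0] * 54 + [1] * 46
print(majority_baseline_accuracy(toy_labels))  # 0.54
```

A model is only useful if its test accuracy is clearly above this number.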

### Overfitting
When a model performs well on the training data but poorly on the test data, we say that it is **overfitting**. This is similar to memorizing all the answers to previous practice exams and then failing during the actual final exam. Overfitting will be discussed in detail later.
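
One rough way to spot overfitting is to compare training and test accuracy directly. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption made so the example runs on its own, not the heart dataset), with label noise added so that a fully grown tree memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset); flip_y adds label noise
X_demo, y_demo = make_classification(n_samples=300, n_features=5, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# An unconstrained tree keeps splitting until it fits the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
print(f"gap: {train_acc - test_acc:.2f}")
```

A large gap between the two numbers is the symptom described above.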

# Next Idea: Cross Validation

One issue with the first idea above is that there is only a single test set on which to evaluate our results. If this particular test set happens, by chance, to be a poor representation of the data, it might not accurately measure our generalization performance.

**Cross Validation** is a set of ideas that improves upon evaluation based on a single test set. Instead, we make many repeated train/test splits of our data and record a score for each split. In this manner, we calculate multiple performance scores, giving us more feedback on what to expect from unseen data.
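
To see how much a single holdout score can depend on the particular split, the sketch below repeats the train/test split ten times with different seeds. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart dataset), so the exact numbers carry no meaning beyond illustrating the spread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

split_scores = []
for seed in range(10):
    # Each seed produces a different random 70/30 split of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(f"lowest: {min(split_scores):.3f}, highest: {max(split_scores):.3f}")
```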

## Many flavors of cross validation
There are multiple strategies of splitting the data that fall under the umbrella of cross validation. By far the most common form of cross validation is **K-fold cross validation**. In this method, the data is split into k distinct partitions, where k is usually between 5 and 10. One of these k partitions is used as the test set, while the other k-1 partitions are used for training. 

After the first round of testing, a different partition is used for testing and the other k-1 partitions are again used for training. The model is refit to the new training data and evaluated on the new test data. After all k partitions have been used for testing, the procedure ends.

The result is a total of k scores, which can then be averaged to yield a better overall performance metric. Take a look at the image below to see how the data would be split when doing a 5-fold cross validation.

![][4]

## K-Fold Cross Validation in Scikit-Learn

The **`cross_val_score`** function automates the entire cross validation procedure in a single line. Pass it an estimator, the data, and the number of folds to use with the `cv` parameter. The estimator does NOT need to be trained ahead of time. Below, we use 5 folds (splits), which places roughly 80% of the data in the training set during each round and 20% in the test set.

[4]: images/kf.png

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

logr = LogisticRegression()

In [14]:
scores = cross_val_score(logr, X, y, cv=5)
scores

array([0.60655738, 0.67213115, 0.80327869, 0.73770492, 0.61016949])
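
To make the k-fold procedure concrete, here is a hand-written sketch of the loop: partition the row indices into k folds, then refit and score once per fold. It uses synthetic stand-in data and a plain shuffled partition, both of which are assumptions. It does not reproduce scikit-learn's exact fold assignment, so its scores will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

k = 5
# Shuffle the row indices once, then cut them into k partitions
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(X_demo)), k)

manual_scores = []
for i in range(k):
    test_idx = folds[i]  # one partition is held out for testing
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LogisticRegression()  # a fresh model is fit in every round
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.round(manual_scores, 3))
print(f"mean accuracy: {np.mean(manual_scores):.3f}")
```

Every row is used for testing exactly once and for training k-1 times.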

### An array of scores is returned
An array of k scores is returned. We can take the mean and standard deviation to summarize our results. Notice there is quite a large amount of variation in the scores, which range from about 60% to over 80%. This is likely due to the small number of samples in each test set. Since there are only 303 total observations, there are only about 60 in each test set. A larger dataset would likely yield smaller variance.

In [16]:
scores.mean()

0.685968324534593

In [17]:
scores.std()

0.07573826739436063

## No model is returned
When using the `cross_val_score` function for k-fold cross validation, k models are trained, but none of those fitted models are returned. Only the scores are returned.
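
If the fitted per-fold models are needed for inspection, one option is the related `cross_validate` function with `return_estimator=True`. The sketch below uses synthetic stand-in data (an assumption, not the heart dataset) and shows an alternative to `cross_val_score`, not something that function does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

results = cross_validate(LogisticRegression(), X_demo, y_demo, cv=5, return_estimator=True)
print(results["test_score"])      # one score per fold
print(len(results["estimator"]))  # one fitted model per fold
```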

## Cross validation evaluates performance. Use all the data to build the final model
Cross validation is a procedure for evaluating model performance. It does not return a fitted model. To build the final model for future use, train it on all the data.

In [18]:
# fit with all of the data
logr.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

## Reporting our generalization performance estimate
It's important to summarize the information produced by cross validation. We would say that our logistic regression instance, `logr`, trained on the `max_hr` and `rest_bp` variables, can be expected to achieve about 69% accuracy with a standard deviation of 7.6 percentage points across folds.
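
A small sketch of how such a summary line can be produced from the fold scores. The five values are copied from the first `cross_val_score` output above. The wording of the summary string is only one possible format.

```python
from statistics import fmean, pstdev

# Fold scores copied from the first cross_val_score call above
fold_scores = [0.60655738, 0.67213115, 0.80327869, 0.73770492, 0.61016949]

mean_acc = fmean(fold_scores)
std_acc = pstdev(fold_scores)  # population std, the same convention as NumPy's default .std()

summary = f"accuracy: {mean_acc:.0%} +/- {std_acc:.1%} (5-fold cross validation)"
print(summary)  # accuracy: 69% +/- 7.6% (5-fold cross validation)
```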

# Other flavors of cross validation
Scikit-learn offers several other [cross validation strategies][5], which it implements as [splitter classes][6]. Follow the second link to see all of the possibilities.

### Splitter classes are not estimators
Most of the Scikit-Learn API can be divided into estimators and helper functions. The splitter classes are a special case that doesn't fall into either of those categories. These objects need to be instantiated, but they are not machine learning models. Instead, they are used to do cross validation.

## The `KFold` splitter
When we first called `cross_val_score`, we passed the `cv` parameter an integer, which determined the number of splits. Instead of an integer, we can pass it an instance of one of the splitter classes. By default, when given an integer and a classifier, `cross_val_score` does a stratified k-fold cross validation, which is discussed below. We can instead specify the type of cross validation exactly with the `KFold` splitter class. Here, we import it and instantiate it, ready to do 5 splits.

[5]: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators
[6]: https://scikit-learn.org/stable/modules/classes.html#splitter-classes

In [23]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)

### No cross validation has happened at this point
`kf` is a simple object and is simply waiting to be used during cross validation, such as with `cross_val_score`. Let's pass it to the `cv` parameter.
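
What a splitter actually does can be seen by calling its `split` method directly, which yields pairs of training and test row indices. The sketch below uses a hypothetical 10-row toy array and leaves shuffling off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy rows, not the heart data
kf_toy = KFold(n_splits=5)            # no shuffling, so folds follow row order

# Each iteration yields the row positions for training and for testing
for train_idx, test_idx in kf_toy.split(X_toy):
    print("train:", train_idx, "test:", test_idx)

print(hasattr(kf_toy, "fit"))  # False: a splitter cannot be fit like an estimator
```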

In [24]:
logr = LogisticRegression()

In [25]:
scores = cross_val_score(logr, X, y, cv=kf)
scores

array([0.67213115, 0.67213115, 0.72131148, 0.73333333, 0.68333333])

In [26]:
scores.mean()

0.6964480874316941

In [27]:
scores.std()

0.025819888974716113
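
Each of these five scores is computed on only about 60 observations. The sketch below works out the exact test-set sizes, assuming the rule documented for `KFold` (the first `n_samples % n_splits` folds receive one extra sample). The helper name `kfold_test_sizes` is hypothetical.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Test-set sizes when n_samples rows are divided into n_splits folds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample
    return [base + 1 if i < extra else base for i in range(n_splits)]

print(kfold_test_sizes(303, 5))  # [61, 61, 61, 60, 60]
```

With roughly 60 observations per fold, a single misclassified observation moves that fold's accuracy by about 1.7 percentage points.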

## Stratified K-Fold cross validation
Stratified k-fold cross validation will make k partitions as our k-fold did above, except that it will preserve the distribution of classes in each fold. For instance, in this dataset 46% of people have heart disease. Stratified k-fold will ensure that each fold has this same percentage of people with heart disease. Stratification is more important for heavily imbalanced datasets - those which have low-frequency classes.

Let's import and use it here.

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

In [None]:
cross_val_score(logr, X, y, cv=skf)
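
To see the stratification itself rather than its effect on scores, the sketch below compares the fraction of positive labels in each test fold under `KFold` and `StratifiedKFold`. The labels are a hypothetical imbalanced toy set (10% positive), not the heart data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (10% positive), shuffled into a random order
rng = np.random.default_rng(0)
y_toy = rng.permutation(np.array([0] * 90 + [1] * 10))
X_toy = np.zeros((100, 1))  # features are irrelevant to how the folds are built

def positive_rate_per_fold(splitter):
    """Fraction of positive labels in each test fold produced by the splitter."""
    return [float(y_toy[test_idx].mean()) for _, test_idx in splitter.split(X_toy, y_toy)]

print("KFold:          ", positive_rate_per_fold(KFold(n_splits=5)))
print("StratifiedKFold:", positive_rate_per_fold(StratifiedKFold(n_splits=5)))
```

The stratified folds each contain the same 10% share of positives, while the plain k-fold shares depend on where the positives happen to fall.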

## This will be the exact same as the original
When given an integer and a classifier, the default cross validation strategy is stratified k-fold, so this produces the same results as the very first execution of this function with `cv=5`. We verify this again below.

In [None]:
cross_val_score(logr, X, y, cv=5)
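
The equivalence can also be checked programmatically instead of by eye. The sketch below does so on synthetic stand-in data (an assumption, not the heart dataset) by comparing the two score arrays element by element.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression()

scores_int = cross_val_score(model, X_demo, y_demo, cv=5)
scores_skf = cross_val_score(model, X_demo, y_demo, cv=StratifiedKFold(n_splits=5))

# Both calls should use identical, unshuffled stratified folds
print(np.allclose(scores_int, scores_skf))
```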

## Shuffling the data
By default, the k-fold splitters divide the dataset in the order in which it was received, without shuffling. If the dataset is ordered in some way, this might affect the results. A particular segment of the data might be missing entirely from a training set, leaving the model unable to learn from it.

To handle this issue, use the `shuffle` parameter available in many of the splitter classes.

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True)
cross_val_score(logr, X, y, cv=skf)
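
The ordering problem is easiest to see with plain `KFold` on labels that are sorted by class. The sketch below uses hypothetical toy labels (50 zeros followed by 50 ones) and prints the fraction of positives in each test fold, first without and then with shuffling. The `random_state` value is arbitrary and only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels sorted by class: 50 zeros followed by 50 ones
y_sorted = np.array([0] * 50 + [1] * 50)
X_toy = np.zeros((100, 1))  # features are irrelevant to how the folds are built

def positive_rate_per_fold(splitter):
    """Fraction of positive labels in each test fold produced by the splitter."""
    return [float(y_sorted[test_idx].mean()) for _, test_idx in splitter.split(X_toy)]

print(positive_rate_per_fold(KFold(n_splits=5)))  # [0.0, 0.0, 0.5, 1.0, 1.0]
print(positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=0)))
```

Without shuffling, four of the five test folds contain a single class. With shuffling, every fold contains a mix of both.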

# Exercise

Practice using the `ShuffleSplit`, `RepeatedKFold`, and `LeavePOut` splitters. Read the documentation and find out how they work.