# Lecture 4: Cross Validation
__MATH 3480__ - Dr. Michael Olson

Topics:
* Cross Validation
    * Leave-one-out cross validation
    * k-fold cross validation
* Performance Measures

Reading:
* Geron, pp. 31, 73-80, 88-97

In Exploratory Data Analysis, we need to follow these steps:
1. Obtain and Clean the Data
2. Wrangle the Data
3. Look at statistical calculations
4. Graph the data 
5. Draw conclusions and make hypotheses from (3) and (4), looking for relationships that we might use

|              | Quantitative Data | Categorical Data |
| :----------- | :---------------- | :--------------- |
| Calculations | Mean, Mode<br>Five-number summary<br>Distributions (count, standard deviation/variance) | Probabilities<br>Expected Values<br>Probability/Binomial/etc. Distributions |
| Graphs       | Histogram/KDE (kernel density estimator)<br>Boxplot/Violinplot<br>Scatterplot<br>Timeseries<br>Heatmap | Barplot<br>Pie Chart<br>Venn Diagram<br>Tree Diagram |


Before modeling, we have to pre-process the data. Pre-processing has a few steps, some of which we have already seen:

1. Taking care of missing data
2. Encoding categorical data
3. Feature scaling
4. Splitting the data (cross validation)

# Cross Validation

In [None]:
# Preprocessing data according to last lecture

import numpy as np
import pandas as pd

exercise = pd.read_csv('Data/exercise.csv')
X = exercise.drop(['Date','Weight Lost'], axis=1).values
y = np.array(exercise['Weight Lost'])

# OrdinalEncoder won't accept NaN values. Change them to 'None'
# This fits the data, since there was no activity on those days
# pd.isna is a more robust missing-value test than `x is np.nan`
X[:,3] = ['None' if pd.isna(x) else x for x in X[:,3]]

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# When putting in the columns in each imputer/encoder, indicate the column
# of the original matrix
  # [0]: Calories - fill missing values
  # [1]: Exercise Type - One-hot encoding
  # [3]: Quality of Exercise - Ordinal encoding

ct = ColumnTransformer(transformers=[
      ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean'), [0]),  # This is placed first in X
      ('onehot', OneHotEncoder(), [1]),                                         # This is placed second in X
      ('oe', OrdinalEncoder(categories=[['None','Low','Medium','High']]), [3])  # This is placed third in X
    ], remainder='passthrough')                     # Remaining columns placed in order after the last encoder



X = np.array(ct.fit_transform(X))

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=22)

## k-fold cross validation

A single train/test split gives only one estimate of model performance. We can improve on it with __k-fold cross validation__ (here the number of folds is written $n$):
1. Divide the data into $n$ groups (called *folds*)
2. Train and evaluate the model $n$ times
    * Each run sets aside one fold for validation and uses the other $n-1$ folds for training
    * Since it is run $n$ times, each fold will be set aside exactly once
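The steps above can be written out by hand with scikit-learn's `KFold` splitter. This is a minimal sketch, not the lecture's own code: the synthetic `X_demo`/`y_demo` data and the choice of a decision tree regressor are assumptions made only so the block runs on its own.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (an assumption, not the exercise dataset)
rng = np.random.default_rng(22)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1, size=30)

n = 5  # number of folds
kf = KFold(n_splits=n, shuffle=True, random_state=22)

fold_mse = []                                       # one validation error per fold
times_validated = np.zeros(len(X_demo), dtype=int)  # how often each row is held out

for train_idx, val_idx in kf.split(X_demo):
    # Train on the other n-1 folds, evaluate on the fold set aside
    model = DecisionTreeRegressor(random_state=22)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], predictions))
    times_validated[val_idx] += 1

print("MSE per fold:", fold_mse)
print("Times each row was held out:", times_validated)
```

Every entry of `times_validated` comes out as 1, which is the point of step 2: each datapoint is used for validation exactly once and for training in the other runs.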

In [None]:
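n = 15

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumption: tree_reg is not defined in this lecture; a decision tree regressor stands in here
tree_reg = DecisionTreeRegressor(random_state=22)
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=n)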
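The array returned by `cross_val_score` holds one score per fold. With the `"neg_mean_squared_error"` scorer the values are negative, because scikit-learn scorers follow a "higher is better" convention. A rough sketch of turning them into a root mean squared error (RMSE) follows; the synthetic data and the name `demo_reg` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the exercise dataset)
rng = np.random.default_rng(22)
X_demo = rng.uniform(0, 10, size=(60, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1, size=60)

demo_reg = DecisionTreeRegressor(random_state=22)
neg_mse = cross_val_score(demo_reg, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign to recover the MSE, then take the square root
rmse = np.sqrt(-neg_mse)

print("RMSE per fold:", rmse)
print("Mean RMSE:", rmse.mean())
print("Std of RMSE:", rmse.std())
```

The mean summarizes typical performance, and the standard deviation gives a sense of how much that estimate varies from fold to fold.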

## Leave-one-out cross validation

Leave-one-out cross validation (LOOCV) is an extreme version of k-fold cross validation in which every fold holds a single datapoint. Instead of dividing the data into $n$ groups, we set aside one datapoint at a time: the model is trained on all the data in $X$ except that datapoint, then evaluated on it. This is repeated for every datapoint, `len(X)` times in total.

In [None]:
from sklearn.model_selection import cross_val_score
# cv=len(X) makes one fold per datapoint
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=len(X))
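scikit-learn also provides a dedicated `LeaveOneOut` splitter that can be passed as `cv`. The sketch below uses synthetic data and a decision tree regressor, both assumptions for illustration, and shows that the number of splits equals the number of datapoints.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the exercise dataset)
rng = np.random.default_rng(22)
X_demo = rng.uniform(0, 10, size=(20, 1))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 1, size=20)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X_demo)  # one split per datapoint

loo_scores = cross_val_score(DecisionTreeRegressor(random_state=22),
                             X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=loo)

print("Number of splits:", n_splits)
print("Number of scores:", len(loo_scores))
print("Mean squared error over all held-out points:", -loo_scores.mean())
```

Because the model is refit once per datapoint, LOOCV becomes expensive as the dataset grows, which is one reason k-fold with a modest number of folds is often used instead.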

# Performance Measures

## Categorical measures

Ultimately, either your model worked or it didn't. We can break this evaluation into multiple key metrics:
* Accuracy - Fraction of correct predictions
  $$ accuracy = \frac{\#~correct}{\#~of~predictions}$$
  * Good choice for balanced classes (equal numbers of different options)
  * Not a good choice for unbalanced classes
    * Example: a model that labels all 100 pictures as dogs scores 99% accuracy if 99 of them are dogs, even though it never identifies the one picture that isn't
* Recall
  * Ability of a model to find all relevant cases
$$recall = \frac{\#~true~positives}{\#~true~positives + \#~false~negatives}$$
* Precision
  * Ability to identify only the relevant cases
$$precision = \frac{\#~true~positives}{\#~true~positives + \#~false~positives}$$
* F1-Score
  * Harmonic mean of precision and recall
$$F_1 = 2\cdot\frac{precision\cdot recall}{precision + recall}$$
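The unbalanced-class warning above can be checked numerically. The sketch below assumes, for illustration only, that the one non-dog picture is a cat and that "cat" is treated as the positive class. It applies the accuracy and recall formulas directly.

```python
import numpy as np

# 99 dogs and 1 cat; the model predicts "dog" every time
true = np.array(["dog"] * 99 + ["cat"])
predicted = np.array(["dog"] * 100)

accuracy = (true == predicted).mean()

# Counts with "cat" as the positive class
tp = ((predicted == "cat") & (true == "cat")).sum()  # true positives
fn = ((predicted == "dog") & (true == "cat")).sum()  # false negatives
fp = ((predicted == "cat") & (true == "dog")).sum()  # false positives

recall_cat = tp / (tp + fn)

print("Accuracy:", accuracy)
print("Recall for 'cat':", recall_cat)
print("Number of 'cat' predictions (TP + FP):", tp + fp)
```

Accuracy is high, yet recall for the rare class is zero. Precision for "cat" is undefined here, because the model never predicts "cat" and the denominator TP + FP is zero.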

__Confusion Matrix__

| Total Population | Prediction Positive | Prediction Negative |
| ---: | :---: | :---: |
| Condition Positive | True Positive (TP) | False Negative (FN)<br>(Type II error) |
| Condition Negative | False Positive (FP)<br>(Type I error) | True Negative (TN) |

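Other quantities can be derived from the confusion matrix, such as Prevalence, True Positive Rate (TPR), and False Positive Rate (FPR):
* https://en.wikipedia.org/wiki/Confusion_matrix

Which metric to use depends on the situation.
* The situation determines whether it is more important to reduce the false positives or the false negatives
  * If false negatives are costlier, favor recall; if false positives are costlier, favor precision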
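For a two-class problem, the four cells can be read directly from scikit-learn's `confusion_matrix`. Note that scikit-learn puts the true labels in rows and the predictions in columns, with labels in sorted order, so with labels `[0, 1]` the negative class comes first. That is the reverse of the table above. The labels below are made-up values for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)   # also called the True Positive Rate (TPR)
fpr = fp / (fp + tn)      # False Positive Rate

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Precision:", precision)
print("Recall (TPR):", recall)
print("FPR:", fpr)
```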

In [None]:
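import numpy as np

# Each function accepts lists or arrays of labels.
# `category` is the class treated as "positive".

def accuracy(true, predicted):
    true, predicted = np.asarray(true), np.asarray(predicted)
    total_correct = (true == predicted).sum()
    return total_correct/len(true)

def precision(true, predicted, category):
    true, predicted = np.asarray(true), np.asarray(predicted)
    total_correct = ((true == predicted) & (predicted == category)).sum()  # TP
    total_predicted = (predicted == category).sum()                       # TP + FP
    return total_correct/total_predicted

def recall(true, predicted, category):
    true, predicted = np.asarray(true), np.asarray(predicted)
    total_correct = ((true == predicted) & (true == category)).sum()  # TP
    total_actual = (true == category).sum()                          # TP + FN
    return total_correct/total_actual

def f1score(true, predicted, category):
    p = precision(true, predicted, category)
    r = recall(true, predicted, category)
    return 2*p*r/(p+r)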

*scikit-learn* has these metrics built in.

In [3]:
y_test = ['A','A','A','B','B','B','C','C','C']
y_predicted = ['A','A','B','B','B','B','C','C','C']

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predicted))

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))

[[2 1 0]
 [0 3 0]
 [0 0 3]]
              precision    recall  f1-score   support

           A       1.00      0.67      0.80         3
           B       0.75      1.00      0.86         3
           C       1.00      1.00      1.00         3

    accuracy                           0.89         9
   macro avg       0.92      0.89      0.89         9
weighted avg       0.92      0.89      0.89         9



## Quantitative Measures