# MODEL EVALUATION

## REVIEW OF THE LAST CLASS

- Goal was to predict the species of an **unknown iris**
- Made predictions using KNN models with **different values of K**
- Need a way to choose the **"best" model**: the one that "generalizes" to "out-of-sample" data

**SOLUTION:** Create a procedure that **estimates** how well a model is likely to perform on out-of-sample data and use that to choose between models.

## EVALUATION PROCEDURE #1: TRAIN AND TEST ON THE ENTIRE DATASET

1. Train the model on the **entire dataset**.
2. Test the model on the **same dataset**, and evaluate how well we did by comparing the **predicted** response values with the **true** response values.

In [None]:
# read the iris data into a DataFrame
import pandas as pd
import numpy as np

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)

In [None]:
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]

In [None]:
# store response vector in "y"
y = iris.species_num

### KNN (K=50)

In [None]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model
knn = KNeighborsClassifier(n_neighbors=50)

# train the model on the entire dataset
knn.fit(X, y)

# predict the response values for the observations in X ("test the model")
knn.predict(X)

In [None]:
# store the predicted response values
y_pred = knn.predict(X)

## EVALUATION METRIC

To evaluate a model, we also need an **evaluation metric:**

- Numeric calculation used to **quantify** the performance of a model
- Appropriate metric depends on the **goals** of your problem

Most common choices for classification problems:

- **Classification accuracy**: percentage of correct predictions (reward function)
- **Classification error**: percentage of incorrect predictions (loss function)

In this case, we'll use classification accuracy.
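
As a quick illustration of how the two metrics relate, the sketch below computes both by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values chosen for illustration, not results from the iris model.

```python
import numpy as np

# hypothetical true and predicted class labels (illustrative values only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_hat = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# accuracy: fraction of predictions that match the true labels
accuracy = np.mean(y_true == y_hat)

# error: fraction of predictions that do not match the true labels
error = np.mean(y_true != y_hat)

print(accuracy, error)
```

Because every prediction is either correct or incorrect, the two values always sum to 1, so choosing one metric over the other does not change which model looks best.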

In [None]:
# compute classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))

This is known as **training accuracy** because we are testing the model on the same data we used to train the model.

### KNN (K=1)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

Does that mean that K=1 is the best value for K?

### PROBLEM WITH TRAINING AND TESTING ON THE SAME DATA

- Goal is to estimate likely performance of a model on **out-of-sample data**
- But, maximizing training accuracy rewards **overly complex models** that won't necessarily generalize
- Unnecessarily complex models **overfit** the training data:
    - Will do well when tested using the in-sample data
    - May do poorly on out-of-sample data
    - Learns the "noise" in the data rather than the "signal"
    - From Quora: [What is an intuitive explanation of overfitting?](http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting/answer/Jessica-Su)

**Thus, training accuracy is not a good estimate of out-of-sample accuracy.**
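
One way to see the problem is to train a K=1 model on data that contains no signal at all. The sketch below is an illustration under stated assumptions: the features and labels are random synthetic values (not the iris data), and the names `X_noise`, `y_noise`, `X_new`, and `y_new` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

rng = np.random.default_rng(0)

# synthetic data: labels are drawn independently of the features,
# so there is no signal to learn, only noise
X_noise = rng.normal(size=(100, 4))
y_noise = rng.integers(0, 3, size=100)

# fresh data drawn the same way plays the role of out-of-sample data
X_new = rng.normal(size=(100, 4))
y_new = rng.integers(0, 3, size=100)

knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(X_noise, y_noise)

# with K=1, each training point is its own nearest neighbor
train_acc = metrics.accuracy_score(y_noise, knn_1.predict(X_noise))
new_acc = metrics.accuracy_score(y_new, knn_1.predict(X_new))
print(train_acc, new_acc)
```

The training accuracy is perfect because the model has memorized the training points, while the accuracy on the fresh data should sit near chance level for three classes.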

## EVALUATION PROCEDURE #2: TRAIN/TEST SPLIT

#### STEPS

1. Split the dataset into two pieces: a **training set** and a **testing set**.
2. Train the model on the **training set**.
3. Test the model on the **testing set**, and evaluate how well we did.

#### WHAT DOES THIS ACCOMPLISH?

- Model can be trained and tested on **different data** (we treat testing data like out-of-sample data).
- Response values are **known** for the testing set, and thus **predictions can be evaluated**.
- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance.
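
The splitting step can be sketched by hand to show what it involves. This is a minimal sketch on a made-up 10-row dataset, not how scikit-learn implements it; the names `X_demo`, `y_demo`, and the `_demo` suffixes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical dataset: 10 observations, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# shuffle the row indices, then cut them into two disjoint groups
shuffled = rng.permutation(len(X_demo))
n_test = int(0.4 * len(X_demo))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# index features and response with the same indices so rows stay aligned
X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)
```

The key properties are that no observation appears in both sets and that each row of the feature matrix stays paired with its response value.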

### UNDERSTANDING THE `train_test_split` FUNCTION

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

In [None]:
# before splitting
print(X.shape)

# after splitting
print(X_train.shape)
print(X_test.shape)

In [None]:
# before splitting
print(y.shape)

# after splitting
print(y_train.shape)
print(y_test.shape)

In [None]:
X_train

In [None]:
y_train

![train_test_split](images/train_test_split.png)

### UNDERSTANDING THE `random_state` PARAMETER

In [None]:
# WITHOUT a random_state parameter
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# print the first row of each object (changes on every run)
print(X_train[:1])
print(X_test[:1])
print(y_train[:1])
print(y_test[:1])

In [None]:
# WITH a random_state parameter
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# print the first row of each object (identical on every run)
print(X_train[:1])
print(X_test[:1])
print(y_train[:1])
print(y_test[:1])
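
The effect of `random_state` can also be checked programmatically instead of by eye. The sketch below uses a small made-up dataset (`X_demo` and `y_demo` are hypothetical names) and compares the pieces returned by two calls that share the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 10 observations with 1 feature each
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)

# two calls with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.4, random_state=4)
split_b = train_test_split(X_demo, y_demo, test_size=0.4, random_state=4)

# compare each returned piece (X_train, X_test, y_train, y_test) pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```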

## USING THE TRAIN/TEST SPLIT PROCEDURE

### START WITH K=1

In [None]:
# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

In [None]:
# STEP 2: train the model on the training set (using K=1)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

In [None]:
# STEP 3: test the model on the testing set, and check the accuracy
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

### REPEAT FOR K=50

In [None]:
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

![Bias-variance tradeoff](images/bias_variance.png)

### SEARCH FOR THE "BEST" VALUE OF K

In [None]:
# calculate TRAINING ERROR and TESTING ERROR for K=1 through 50
k_range = range(1, 51)
training_error = []
testing_error = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # training error (fit and evaluate on the entire dataset, as in procedure #1)
    knn.fit(X, y)
    y_pred = knn.predict(X)
    training_error.append(1 - metrics.accuracy_score(y, y_pred))
    
    # testing error (fit on the training set, evaluate on the testing set)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    testing_error.append(1 - metrics.accuracy_score(y_test, y_pred))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
# plot the relationship between K (HIGH TO LOW) and TESTING ERROR
plt.plot(k_range, testing_error)
plt.gca().invert_xaxis()
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Error')

### WHAT CAN WE CONCLUDE?

- A value of K around 11 is likely the **best value for K** when using KNN on the iris dataset.
- When given the measurements of an **unknown iris**, we estimate that we would be able to correctly predict its species 98% of the time.
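
Rather than reading the best K off the plot, the lowest testing error can be located in code. The sketch below is self-contained and makes one assumption worth flagging: it loads the copy of iris bundled with scikit-learn via `load_iris`, which may differ slightly from the UCI file used above. The variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# bundled copy of iris (assumption: close to, but not guaranteed identical to, the UCI file)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.4, random_state=4)

# testing error for K=1 through 50
ks = list(range(1, 51))
errors = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    errors.append(1 - metrics.accuracy_score(y_te, model.predict(X_te)))

# several values of K can tie for the lowest testing error, so collect all of them
lowest = min(errors)
best_ks = [k for k, e in zip(ks, errors) if e == lowest]
print(best_ks, lowest)
```

When several values tie, picking one from the middle of the tied range is a common heuristic, since it sits furthest from the values of K where the error starts to rise.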

### TRAINING ERROR VS. TESTING ERROR

In [None]:
# create a DataFrame of K, training error, and testing error
df = pd.DataFrame({'K': k_range, 'train':training_error, 'test':testing_error}).set_index('K').sort_index(ascending=False)
df

In [None]:
# plot the relationship between K (HIGH TO LOW) and both TRAINING ERROR and TESTING ERROR
df.plot(figsize=(10,5))

### ROUGHLY SPEAKING:

- **Training error** decreases as model complexity increases (lower value of K)
- **Testing error** is minimized at the optimum model complexity

## MAKING PREDICTIONS ON OUT-OF-SAMPLE DATA

#### Given the measurements of a (truly) unknown iris, how do we predict its species?

In [None]:
# instantiate the model with the best known parameters
knn = KNeighborsClassifier(n_neighbors=11)

# re-train the model with X and y (not X_train and y_train) - why?
knn.fit(X, y)

# make a prediction for an out-of-sample observation
# (a one-row DataFrame with the training column names avoids a feature-name warning)
flower_dimensions = pd.DataFrame([[3, 5, 4, 2]], columns=feature_cols)
knn.predict(flower_dimensions)
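
The model returns a number because the species were encoded as numbers before training. A small sketch of decoding a prediction back into a species name follows; the `predicted` array is a hypothetical stand-in for the output of `predict`, and `num_to_species` is a helper name introduced here for illustration.

```python
import numpy as np

# same mapping used earlier to encode species, inverted to decode predictions
species_to_num = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
num_to_species = {num: name for name, num in species_to_num.items()}

# hypothetical model output, in the array form returned by predict()
predicted = np.array([1])

# convert NumPy integers to plain ints before the dictionary lookup
names = [num_to_species[int(p)] for p in predicted]
print(names)
```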

## DISADVANTAGES OF TRAIN/TEST SPLIT

#### What would happen if the `train_test_split` function had split the data differently? Would we get exactly the same results as before?

In [None]:
# try different values for random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=78978)
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
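
Instead of changing `random_state` by hand, the same experiment can be repeated over several seeds in a loop. This is a rough sketch: the 20 seeds are arbitrary, the variable names are hypothetical, and the data comes from scikit-learn's bundled `load_iris` copy, which may differ slightly from the UCI file.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X_iris, y_iris = load_iris(return_X_y=True)

# repeat the same train/test split procedure with several arbitrary seeds
accuracies = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.4, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=11)
    model.fit(X_tr, y_tr)
    accuracies.append(metrics.accuracy_score(y_te, model.predict(X_te)))

# the gap between the lowest and highest score shows how much the estimate moves
print(min(accuracies), max(accuracies))
```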

- Testing accuracy is a **high-variance estimate** of out-of-sample accuracy
- **K-fold cross-validation** overcomes this limitation and provides more reliable estimates
- But, train/test split is still useful because of its **flexibility and speed**
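
For reference, K-fold cross-validation is available in scikit-learn through `cross_val_score`. The sketch below is one possible setup, not a prescribed one: the choice of 10 folds and K=11 are assumptions made for illustration, and the data is the bundled `load_iris` copy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold serves exactly once as the testing set
model = KNeighborsClassifier(n_neighbors=11)
scores = cross_val_score(model, X_iris, y_iris, cv=10, scoring='accuracy')

# averaging over the folds gives a steadier estimate than any single split
print(scores.mean(), scores.std())
```

The cost is that the model is trained once per fold, which is why a single train/test split remains the faster option.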