# Lesson 6 Continued: Assessing Your Models

Today:
1. Assessing your models
    + Accuracy
    + Other ways ("metrics") to measure goodness of models
2. Improving your models
   + Incorporating more features
   + k-Nearest Neighbor Classifiers
   + Cross Validation

## 1. Measuring "Goodness" of Classifiers

**Example:**

Consider the cancer dataset. What we have done so far:
1. Split the dataset into training and test datasets.
2. Using the training dataset:
    - Visualize data
    - Identify patterns
    - Create a model
3. Using the test dataset:
    - Use the model to make a prediction about the test dataset
4. Assess the model
    - How good/bad were the predictions made on the test dataset?
  
      
In our linear regression discussion:
- Given a proposed line, compute the MSE. Roughly speaking: Smaller MSE = more accurate model.
- Among multiple possible lines, choose one with smallest MSE.

### 1.1 Metrics for measuring goodness of models

- Minimize MSE (mean squared error):
$$\text{MSE} = \frac{\text{Sum of (error)}^2}{\text{Total number of predictions}}$$


- Maximize accuracy:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
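
As a rough illustration of the two formulas above, the sketch below computes both on a small made-up set of labels. The arrays `actual` and `predicted` are hypothetical and are not taken from the cancer dataset.

```python
import numpy as np

# hypothetical actual and predicted labels (made up for illustration)
actual = np.array([1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0])

# MSE: sum of squared errors divided by the number of predictions
mse = np.sum((predicted - actual) ** 2) / len(actual)

# accuracy: number of correct predictions divided by the number of predictions
accuracy = np.sum(predicted == actual) / len(actual)

print(mse)       # 0.4
print(accuracy)  # 0.6
```

When the labels are only 0 and 1, each squared error is either 0 or 1, so the MSE is simply the fraction of incorrect predictions, and accuracy = 1 - MSE.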

### Example: The Cancer Dataset

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
cancerdata = pd.read_csv('../../../shared/datasets/cancer.csv')
cancerdata.shape

In [None]:
# ---------------
# this part simply puts together the pieces we have done previously into one giant code cell

# 1. THE DATASET
#  split into training and test datasets:

from sklearn.model_selection import train_test_split

X = cancerdata[['Marginal Adhesion', 'Clump Thickness']]
Y = cancerdata['Class']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.5, random_state = 11)

# 2. THE CLASSIFIER 
# Encoding a simple classifier
#   (this was an example from lesson08)

def predict_tumor_class( x , y ):
    # x = marginal_adhesion
    # y = clump thickness
    
    if (x < 4 and y < 7):
        class_predicted = 0
    else:
        class_predicted = 1
    
    return( class_predicted )


# 3. PREDICT THE CLASS OF EACH ROW OF THE TEST DATASET, USING A FOR LOOP

num_rows_test = len(y_test)
print(num_rows_test)


# empty array
y_predicted = np.empty( num_rows_test )

# empty data frame

predictions = pd.DataFrame( np.empty( (num_rows_test, 2) ) )
predictions.rename( columns = {0:'actual', 1:'predicted'}, inplace = True)

for row in np.arange(0, num_rows_test):
    # use the TEST rows (X_test), so that each prediction lines up with y_test
    predictions.iloc[row, 1] = predict_tumor_class( X_test.iloc[row, 0], X_test.iloc[row, 1] )
    predictions.iloc[row, 0] = y_test.iloc[row]

predictions.head()


In [None]:
# 4. ASSESSMENT
# Next, check how good our predictions are, by comparing to the actual class

# count how many predictions are incorrect and how many are correct
#    add a new column called "error"
#    if actual class is equal to predicted class, error is 0; else, error is 1

predictions['error'] = (predictions['predicted'] - predictions['actual']) ** 2

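# for 0/1 labels, the mean of the "error" column is the fraction of incorrect predictions
# (so accuracy = 1 - this value)
print(np.mean(predictions['error']))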

### Concept Check

**Test Data:**
- 20000 images of handwritten characters (A-Z)
- 800 of them are images of the letter “A” (label = 1)
- the remaining 19200 are images of other characters (label = 0)

**Classification Task:** Identify which characters are the letter “A”. 

**Suppose we have a “Lazy” Classifier:**

Whatever the image is, predict “0” (i.e., the image is not of the letter “A”).

What is the accuracy of this classifier if we use it to predict the labels of the test data?

A. 0 

B. $\dfrac{800 }{20,000}$

C. $\dfrac{800}{19,200}$

D. $\dfrac{19200}{20,000}$

E. None of the above

Answer: 

**Follow-Up Group Discussion:** Can you come up with an example test dataset for which the accuracy of this classifier’s predictions is very high (e.g., 100%)? Very low?

Answer: 

### Key Takeaways

- Even a bad classifier could have a high accuracy if we’re “lucky” with the test dataset that we have.
- A good classifier should perform relatively well given **any test dataset**.
- Sometimes we cannot rely on just one metric (e.g., accuracy) for evaluating the goodness of a classifier.

- **For example:**
    - The accuracy of our “lazy classifier” on the given test dataset is high (96%).
    - Accuracy captures “percentage of correct predictions out of all predictions that are made”.
    - Our “lazy classifier” has a high accuracy because most of the images in the test data happen to be not “A”.
    - Our “lazy classifier” fails to correctly identify any image that is an “A”.
    - Are there other metrics that capture the above failure of our “lazy classifier”?
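
To make the takeaway above concrete, here is a minimal sketch that rebuilds the label counts from the concept check (800 images of “A”, 19200 others) and applies the “lazy classifier”. The variable names are assumptions chosen for this illustration.

```python
import numpy as np

# labels matching the concept check: 800 images of "A" (1), 19200 others (0)
actual = np.concatenate([np.ones(800, dtype=int), np.zeros(19200, dtype=int)])

# the "lazy" classifier predicts 0 for every image
predicted = np.zeros(len(actual), dtype=int)

# fraction of predictions that match the actual label
accuracy = np.mean(predicted == actual)

# number of images of "A" that the classifier correctly identified
num_A_found = np.sum((predicted == 1) & (actual == 1))

print(accuracy)     # 0.96
print(num_A_found)  # 0
```

The accuracy is high even though not a single “A” is found, which is the failure that accuracy alone does not reveal.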

### 1.2 Other ways ("metrics") to measure goodness of models

- **True Positive**: the number of data points where 
    - True label = 1 (+)
    - Predicted label = 1 (+)
- **False Positive**: the number of data points where
    - True label = 0 (-)
    - Predicted label = 1 (+)
- **True Negative**: the number of data points where 
    - True label = 0 (-)
    - Predicted label = 0 (-)
- **False Negative**: the number of data points where
    - True label = 1 (+)
    - Predicted label = 0 (-)

- True = prediction is correct
- False = prediction is incorrect
- Positive = “predict 1 (+)”
- Negative = “predict 0 (-)”

**Example:**

Predict if a patient has a disease or not.

(1 = has disease, 0 = does not have disease)

- A true positive: patient has disease, test says they do
- A false positive: patient does not have disease, test says they do
- A false negative: patient has disease, test says they don’t
- A true negative : patient does not have disease, test says they don’t
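
The four counts defined above can be sketched in code as follows. The `actual` and `predicted` arrays are made-up labels for eight hypothetical patients, not real data.

```python
import numpy as np

# hypothetical labels: 1 = has disease, 0 = does not have disease
actual    = np.array([1, 1, 1, 0, 0, 0, 0, 1])
predicted = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((actual == 1) & (predicted == 1)))  # has disease, test says they do
fp = int(np.sum((actual == 0) & (predicted == 1)))  # no disease, test says they do
tn = int(np.sum((actual == 0) & (predicted == 0)))  # no disease, test says they don't
fn = int(np.sum((actual == 1) & (predicted == 0)))  # has disease, test says they don't

print(tp, fp, tn, fn)  # 3 1 3 1
```

The four counts always add up to the total number of predictions (here, 8).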

#### How to calculate the metrics

<img src='images/labels.png' width=600>

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$

**Precision** = proportion of correct predictions out of all positive predictions

$$\text{Precision} = \frac{\text{TP }}{\text{TP + FP}}$$

**True Positive Rate** = proportion of correct predictions out of all data points whose actual label is 1

$$\text{True Positive Rate} = \frac{\text{TP }}{\text{TP + FN}}$$

**True Negative Rate** = proportion of correct predictions out of all data points whose actual label is 0

$$\text{True Negative Rate} = \frac{\text{TN }}{\text{FP + TN}}$$
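
A minimal sketch of the four formulas above, written as one helper function. The function name `classification_metrics` and the counts passed to it are hypothetical. Returning `None` when a denominator is zero is a choice made for this sketch (for example, precision is undefined when a classifier never predicts 1).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, true positive rate, and true negative rate."""

    def ratio(numerator, denominator):
        # the metric is undefined when the denominator is zero
        return numerator / denominator if denominator > 0 else None

    return {
        'accuracy': ratio(tp + tn, tp + tn + fp + fn),
        'precision': ratio(tp, tp + fp),
        'true_positive_rate': ratio(tp, tp + fn),
        'true_negative_rate': ratio(tn, fp + tn),
    }

# hypothetical counts, made up for illustration
metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)
print(metrics)
```

With these counts, the accuracy is 70/100, the precision is 40/50, the true positive rate is 40/60, and the true negative rate is 30/40, so the four metrics can give quite different pictures of the same classifier.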

**Example: The Cancer Dataset**

In [None]:
# true positive rate
# tp / (tp + fn)
tp = 0
fn = 0
for row in np.arange(0, num_rows_test):
    if predictions.iloc[row, 0] == 1:
        if predictions.iloc[row, 1] == 1:
            tp = tp + 1
        else:
            fn = fn + 1

print(tp/(tp+fn))

### Concept Check

**Test Data:**
- 20000 images of handwritten characters (A-Z)
- 800 of them are images of the letter “A” (label = 1)
- the remaining 19200 are images of other characters (label = 0)

**Classification Task:** Identify which characters are the letter “A”. 

**Suppose we have a “Lazy” Classifier:**

Whatever the image is, predict “0” (i.e., the image is not of the letter “A”).

What is the **true positive rate** of this classifier if we use it to predict the labels of the test data?

A. 0 

B. $\dfrac{800 }{20,000}$

C. $\dfrac{800}{19,200}$

D. $\dfrac{19200}{20,000}$

E. None of the above

Answer: 

## 2. Improving our models

- Tweak the current model
	- Incorporate more variables
	- Adjust cutoffs  
- Consider a different type of model.  
	- The Nearest Neighbor Classifier (a.k.a. the k-nearest neighbor classifier)
	- There are a lot of classification models out there.

### Example: The Cancer Dataset

**Example: Encoding a simple classifier (version 2)**

<table>
    <tr>
        <td><img src="images/lec20-knn-illustration2_wline2.jpg" width="600"></td>
        <td><img src="images/dec_tree1b.jpg" width="600"></td>
    </tr>
</table>  

We chose this classifier based on our training dataset, but we don't know if it fits our test dataset. We may want to adjust the cut-offs. 

For instance, maybe we want `Marginal Adhesion` to be less than 9 instead of 4.
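
One possible way to make the cut-offs easy to adjust is to turn them into function arguments. The sketch below is an assumption about how this could be coded: the name `predict_tumor_class_v2` and the parameter names are hypothetical, and the default cut-offs (4 and 7) are the ones used in the earlier simple classifier.

```python
def predict_tumor_class_v2(marginal_adhesion, clump_thickness,
                           adhesion_cutoff=4, thickness_cutoff=7):
    # predict class 0 only when both features fall below their cut-offs
    if marginal_adhesion < adhesion_cutoff and clump_thickness < thickness_cutoff:
        return 0
    return 1

# same data point, two different cut-offs for Marginal Adhesion
print(predict_tumor_class_v2(5, 3))                     # 1
print(predict_tumor_class_v2(5, 3, adhesion_cutoff=9))  # 0
```

Each choice of cut-offs is effectively a different model, so each one can be assessed with the metrics from Section 1.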

**Example: Encoding a more complex classifier**

<table>
    <tr>
        <td><img src="images/illustration_ct_uocs.png" width="600"></td>
        <td><img src="images/Dec_tree3.jpg" width="600"></td>
    </tr>
</table>  

Or we may also want to consider the `Uniformity of Cell Size` in addition to `Marginal Adhesion` and `Clump Thickness`, which adds another layer to the decision tree above. 

### 2.2 Example of a New Classifier

<img src='images/model_tree.png' width=800>

Given a scatterplot depicting the relationship between 2 variables, which class is a new data point likely to belong to?

<img src='images/cluster1.png' width=500>


<sup> image source: https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html</sup>

### The k Nearest Neighbor Classifier

**Example of a 3-Nearest Neighbor classifier**

<table>
    <tr>
        <td><img src='images/lec21-knn-fig3.png' width=400></td>
        <td>
            <p><b>Idea:</b></p>
<p>Given an unlabeled point, find the 3 labeled points closest to it (most similar to it).</p>
<p>Prediction: The “majority-vote winner” of the labels of the 3 nearest points.</p>
<p>This is the 3-Nearest Neighbor classifier!</p>
        </td>
    </tr>
</table>

**Example of a 5-Nearest Neighbor classifier**

<table>
    <tr>
        <td><img src='images/lec21-knn-fig4.png' width=400></td>
        <td>
            <p><b>Idea:</b></p>
<p>Given an unlabeled point, find the 5 labeled points closest to it (most similar to it).</p>
<p>Prediction: The “majority-vote winner” of the labels of the 5 nearest points.</p>
<p>This is the 5-Nearest Neighbor classifier!</p>
        </td>
    </tr>
</table>

**The k-Nearest Neighbor Classifier, more generally**

- Choose the number of nearest neighbors to be considered (e.g., k = 5).
- For each new (unlabeled) point, find its k nearest neighbors.
- Find the labels of these k nearest neighbors.
- Find which value appears most frequently, among these k labels.
- We predict the new point’s label to be this value.
- Typically, to reduce the chance of a tied vote,
    - if there is an even number of possible labels (classes), we want k to be odd, and
    - if there is an odd number of possible labels (classes), we want k to be even.
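
The steps above can be sketched from scratch as follows. This is only an illustration under stated assumptions: nearness is measured with straight-line distance, the function name `knn_predict` is hypothetical, the data points are made up, and a tied vote is resolved in favor of whichever tied label was counted first.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    # distance from the new point to every labeled point
    distances = np.sqrt(np.sum((train_points - new_point) ** 2, axis=1))
    # positions of the k smallest distances (the k nearest neighbors)
    nearest = np.argsort(distances)[:k]
    # labels of the k nearest neighbors, and the most frequent one among them
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# hypothetical labeled points: one group near (1, 1), another near (8, 8)
train_points = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_points, train_labels, np.array([2, 2]), k=3))  # 0
print(knn_predict(train_points, train_labels, np.array([7, 9]), k=3))  # 1
```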

#### Measuring "Nearness"

**For example:** Suppose we want to know how close the points (1, 1) and (5, 4) are to each other. 

<img src='images/lec21-dist-fig2.png' width=400>

We will measure the distance between the two points. The distance between any two points $(x_1, y_1), (x_2, y_2)$ is

$$d = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$
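
As a quick check of the formula, the sketch below computes the distance between the two points from the example, (1, 1) and (5, 4). The function name `distance` is chosen for this illustration.

```python
import numpy as np

def distance(point1, point2):
    # straight-line distance between two points in the plane
    x1, y1 = point1
    x2, y2 = point2
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

print(distance((1, 1), (5, 4)))  # 5.0
```

Here the differences are 4 and 3, so the distance is $\sqrt{16 + 9} = 5$.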

**Example:**

**Training Data:**

| Size | Uniformity | Class | Distance to new point |
| --- | --- | ---- | ---- |
| 6 | 3| 0 | $\sqrt{(6-4)^2+(3-5)^2}=\sqrt{8}$
| 3 | 2| 0 | $\sqrt{(3-4)^2+(2-5)^2}=\sqrt{10}$
| 2 | 8| 1 | $\sqrt{(2-4)^2+(8-5)^2}=\sqrt{13}$
| 10 | 1| 1 | $\sqrt{(10-4)^2+(1-5)^2}=\sqrt{52}$
| 3| 10| 1 | $\sqrt{(3-4)^2+(10-5)^2}=\sqrt{26}$

**New data point:**

| Size | Uniformity | Class | Predicted Class |
| --- | --- | ---- | ---- |
| 4 | 5|  | ?

Suppose k=3: 
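
One way to check the answer is to let pandas compute the distances and pick out the three nearest rows. The sketch below copies the training data and the new data point from the tables above; the column name `distance` and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# the training data from the table above
training = pd.DataFrame({
    'Size':       [6, 3, 2, 10, 3],
    'Uniformity': [3, 2, 8, 1, 10],
    'Class':      [0, 0, 1, 1, 1],
})

# the new data point
new_size, new_uniformity = 4, 5

# distance from each training row to the new point
training['distance'] = np.sqrt((training['Size'] - new_size) ** 2
                               + (training['Uniformity'] - new_uniformity) ** 2)

# the 3 nearest neighbors and the most frequent class among them
nearest = training.sort_values('distance').head(3)
predicted_class = nearest['Class'].mode()[0]

print(nearest)
print(predicted_class)  # 0
```

Two of the three nearest neighbors have class 0, so the majority vote predicts class 0 for the new point.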

**Similarly if we were to add another variable:**

**Training Data:**

| Var3 | Size | Uniformity | Class | Distance to new point |
| --- | --- | --- | ---- | ---- |
| 4 | 6 | 3 | 0 | $\sqrt{(4-3)^2+(6-4)^2+(3-5)^2}=3$ |

**New data point:**

| Var3 | Size | Uniformity | Class | Predicted Class |
| --- | --- | ---- | ---- | --- |
| 3 | 4 | 5 |  | ? |
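
The same distance calculation works for any number of variables. As a small sketch, `np.linalg.norm` computes the square root of the sum of squared differences for the three-variable row above; the variable names are chosen for this illustration.

```python
import numpy as np

# values listed in the order: Var3, Size, Uniformity
training_point = np.array([4, 6, 3])
new_point = np.array([3, 4, 5])

# square root of the sum of squared differences
d = np.linalg.norm(training_point - new_point)
print(d)  # 3.0
```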

#### kNN using python/scikit learn

    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier(n_neighbors= NUMBER )
    model.fit(X_train, y_train)

    y_predicted = model.predict(X_test)

    accuracy = model.score(X_test, y_test)

where:
- `X_train`: training data (only the feature/attribute columns)
- `X_test`: test data (only the feature/attribute columns)
- `y_train`: a list containing the labels of the training data
- `NUMBER`: the number of nearest neighbors to be considered
- output of `model.predict(X_test)`: list of predicted labels of the test data

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors= 3 )
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

accuracy = model.score(X_test, y_test)

print(type(y_predicted))
print(y_predicted)
print(np.array(y_test))

In [None]:
# knn default score is accuracy
accuracy

### 2.3 Cross Validation

#### Model Selection

At this point, we have a lot of different models to choose from, including 
- several different decision tree classifiers (+ any tweaks) and
- k-Nearest neighbors, for different values of k.

Question: How do we choose which model to use to make actual predictions?

Idea: Pick a model that performs the **best** when tested on the **test dataset**.

Issue: “Overfitting”
- Choosing a model because it performs extremely well on a particular test dataset, even though the model is not “generalizable” and does not perform well when given other test datasets.

#### Overfitting Issue

We might end up choosing a model that performs best only on that particular test dataset.

**Example: The “Lazy Classifier” Example**
- A test dataset that happens to have 96% “0”
- A lazy classifier that always predicts “0”
- The accuracy of the lazy classifier on this dataset would be 96%.
    - But this classifier might perform horribly if given other test datasets.
- Suppose we have other classifiers whose accuracies on this test dataset are 95%, 91%, 93%, etc., and they perform similarly when given other test datasets.
- If we choose the classifier that performs best on this particular test dataset, then we might end up choosing the “lazy” classifier.

#### Model Selection: Avoiding Overfitting

- Initial Idea:
    - Divide the dataset into two: training and test
    - Train each model using the same training dataset
    - Pick a model that performs the **best** when tested on the **same test dataset**

- Updated Idea (#1):
    - (Shuffle the rows of the dataset)
    - Divide the data into M parts
    - The first part serves as test dataset; the rest as training. Fit the model using this train-test split.
    - The second part serves as test dataset; the rest as training. Fit the model using this train-test split.
    - etc.
    - Compute the average accuracy across the M train-test splits
    - Pick the model that has the highest average test accuracy

This method is an example of what’s called “**Cross Validation**”, a model evaluation and selection method that avoids overfitting.
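
The steps of Updated Idea (#1) can be sketched by hand as follows. This is an illustration only, under several assumptions: the data are randomly generated (two made-up clusters), the labels are 0/1, the helper names `knn_accuracy` and `cross_validated_accuracy` are hypothetical, and a small hand-written k-nearest-neighbor scorer stands in for whatever models are being compared.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, k):
    # accuracy of a k-nearest-neighbor classifier, assuming 0/1 labels
    correct = 0
    for point, label in zip(X_test, y_test):
        distances = np.sqrt(np.sum((X_train - point) ** 2, axis=1))
        nearest_labels = y_train[np.argsort(distances)[:k]]
        # majority vote for 0/1 labels (a tie, possible for even k, goes to 0)
        predicted = int(np.mean(nearest_labels) > 0.5)
        correct += int(predicted == label)
    return correct / len(y_test)

def cross_validated_accuracy(X, y, k, num_parts, seed=0):
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(y))           # shuffle the rows
    parts = np.array_split(shuffled, num_parts)  # divide the data into M parts
    scores = []
    for i in range(num_parts):
        # part i is the test dataset; all other parts form the training dataset
        test_idx = parts[i]
        train_idx = np.concatenate([parts[j] for j in range(num_parts) if j != i])
        scores.append(knn_accuracy(X[train_idx], y[train_idx],
                                   X[test_idx], y[test_idx], k))
    return np.mean(scores)                       # average accuracy over the M splits

# hypothetical data: two clusters of 30 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(6, 1, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

# compare several models (different k) using the same 5 parts
for k in [1, 3, 5]:
    print(k, cross_validated_accuracy(X, y, k, num_parts=5))
```

Because the same seed is used for every value of k, all models are scored on the same splits, which keeps the comparison fair.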

- Updated Idea (# 2):
    - (Shuffle the rows of the dataset)
    - Divide the data into M parts
        - (for example, if we are selecting among 4 models, divide into 4 parts)
    - The first part serves as test dataset for the first model,
    - The second part serves as test dataset for the second model, etc.
    - Pick a model that performs the best on the test data assigned to it.

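This method is a simplified variation on what’s called “**N-Fold Cross Validation**”, a model evaluation and selection method that helps avoid overfitting. In the standard version (Updated Idea #1, and the figure below), every model is tested on every part, and each model's scores are averaged.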

<img src='images/K-fold_cross_validation_EN.svg' width=800>

<sup>image source: By Gufosowa - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=82298768</sup>

**Example:**

Suppose we have 600 rows of data in our cancer dataset.

We are choosing between four models:

1. 1-nearest neighbor,
2. 3-nearest neighbor,
3. 5-nearest neighbor, and
4. 7-nearest neighbor

We will split the 600 rows into four equal parts. 

- Test data for Model 1: Rows 1-150
- Test data for Model 2: Rows 151-300
- Test data for Model 3: Rows 301-450
- Test data for Model 4: Rows 451-600
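
The split above can be sketched with `np.array_split`. The sketch assumes the 600 rows have already been shuffled and simply numbers them 1 to 600; the variable names are chosen for this illustration.

```python
import numpy as np

row_numbers = np.arange(1, 601)         # rows 1, 2, ..., 600
parts = np.array_split(row_numbers, 4)  # four equal parts of 150 rows each

for model_number, part in enumerate(parts, start=1):
    print(f"Test data for Model {model_number}: Rows {part[0]}-{part[-1]}")
```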

#### Cross Validation using python/scikit-learn + kNN

    from sklearn.model_selection import cross_validate
    
    model = KNeighborsClassifier(n_neighbors=K)
    
    cv_results = cross_validate( model, X, Y, cv=NUM )
    
    cv_results[ 'test_score' ]

where:
- `K`: the number of the neighbors to consider in kNN
- `X`: a data frame (only the feature/attribute columns)
- `Y`: a list containing the data labels
- `NUM`: the number of folds/divisions of the dataset to use

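`cross_validate()` performs K-fold cross-validation, where the dataset is split into K (roughly) equal groups. Here the number of groups is `NUM`, not to be confused with the `K` used above for the number of neighbors.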

### Example: The Cancer Dataset

In [None]:
# use a for loop
knnscores_df = pd.DataFrame( np.empty((50, 2)) )
knnscores_df.rename(columns = {0:'k', 1:'accuracy'}, inplace = True )
row = 0

for k in np.arange(1, 51) :
    model = KNeighborsClassifier(n_neighbors= k )
    model.fit(X_train, y_train)
    
    knnscores_df.iloc[row, 0] = k
    knnscores_df.iloc[row, 1] = model.score(X_test, y_test)
    
    row = row + 1

In [None]:
knnscores_df.sort_values('accuracy', ascending = False)

In [None]:
sns.lineplot(data=knnscores_df, x='k', y='accuracy')

In [None]:
from sklearn.model_selection import cross_validate

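# n_neighbors=3 is only an example value (an assumption); replace 3 with the k you want to evaluate
model = KNeighborsClassifier(n_neighbors=3)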

cv_results = cross_validate( model, X, Y, cv=5 )
np.mean(cv_results[ 'test_score' ])

In [None]:
# use a for loop
knnscores_df2 = pd.DataFrame( np.empty((50, 2)) )
knnscores_df2.rename(columns = {0:'k', 1:'accuracy'}, inplace = True )
row = 0

for k in np.arange(1, 51) :
    model = KNeighborsClassifier(n_neighbors=k)

    cv_results = cross_validate( model, X, Y, cv=10 )
    
    knnscores_df2.iloc[row, 0] = k
    knnscores_df2.iloc[row, 1] = np.mean(cv_results[ 'test_score' ])
    
    row = row + 1

In [None]:
sns.lineplot(data=knnscores_df2, x='k', y='accuracy')