<a href="https://colab.research.google.com/github/alimoorreza/CS167-sp25-notes/blob/main/Day08_Evaluation_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day09
## Metrics and Testing

#### CS167: Machine Learning, Spring 2025


📜 [Syllabus](https://analytics.drake.edu/~reza/teaching/cs167_sp25/cs167_syllabus_sp25.pdf)

In [1]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import pdb

# How do we know if our model is a 'good' model?

We want to know how good our models are at making predictions... how can we test it?

Examples:
- What k-value should we use in the kNN algorithm?
- What is the effect on accuracy if I normalize the data?
- Should I use a weighted kNN algorithm or a normal kNN?

## Evaluation of Machine Learning Algorithms:

We want to know how good our model is at making predictions. How can we test it?

__Option 1:__ Deploy the model in a live setting and see how it does on new examples.

__Option 2:__ Run each of our training examples through the model and see how many it gets correct.

__Option 3:__ Cross-Validation - set aside some of your training examples to be used for testing.
- Don't use the testing examples when you train the model; train only on the examples that are left over. Why?
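
One way to see why Option 2 is misleading is the minimal sketch below. It assumes a hypothetical six-point toy dataset and a hand-written 1-nearest-neighbor helper (`predict_1nn` is an illustrative name, not something defined in these notes). Every training point is its own nearest neighbor, so the model scores perfectly on the data it was trained on, which says nothing about how it handles new examples.

```python
import numpy as np

# Hypothetical toy data: 6 points with 2 features each.
# The last two points are close together but have different labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.1],
                    [3.2, 2.9], [2.0, 2.0], [2.1, 2.1]])
y_train = np.array(["a", "a", "b", "b", "a", "b"])

def predict_1nn(X, y, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    distances = np.sqrt(((X - x) ** 2).sum(axis=1))
    return y[np.argmin(distances)]

# "Test" the model on the very examples it was trained on.
train_predictions = np.array([predict_1nn(X_train, y_train, x) for x in X_train])
train_accuracy = (train_predictions == y_train).mean()
print(train_accuracy)  # 1.0: each point is at distance 0 from itself
```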

## Cross-Validation

Don't train the model on the testing data!

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_cross_validation.png" width=400/>
</div>

## **Pandas sample method**

> Setting `frac` to a number less than 1.0 returns that fraction of the rows in random order. For example, `frac=0.90` takes 90% of the iris samples; the cell below uses `frac=0.15`.

In [9]:
import pandas as pd
import numpy as np
iris = pd.read_csv('/content/drive/MyDrive/cs167_sp25/datasets/irisData.csv')

# shuffle the iris data by "sampling" a fraction of it in random order
shuffled_data = iris.sample(frac=0.15, random_state=41)
shuffled_data

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
119,6.0,2.2,5.0,1.5,Iris-virginica
128,6.4,2.8,5.6,2.1,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
91,6.1,3.0,4.6,1.4,Iris-versicolor
112,6.8,3.0,5.5,2.1,Iris-virginica
71,6.1,2.8,4.0,1.3,Iris-versicolor
123,6.3,2.7,4.9,1.8,Iris-virginica
85,6.0,3.4,4.5,1.6,Iris-versicolor
147,6.5,3.0,5.2,2.0,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica


> Setting the fraction to 1.0 will cause it to fully shuffle the samples in random order.

In [10]:
#shuffle the iris "sampling" the full set in random order
shuffled_data = iris.sample(frac=1.0, random_state=41)
shuffled_data

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
119,6.0,2.2,5.0,1.5,Iris-virginica
128,6.4,2.8,5.6,2.1,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
91,6.1,3.0,4.6,1.4,Iris-versicolor
112,6.8,3.0,5.5,2.1,Iris-virginica
...,...,...,...,...,...
26,5.0,3.4,1.6,0.4,Iris-setosa
89,5.5,2.5,4.0,1.3,Iris-versicolor
65,6.7,3.1,4.4,1.4,Iris-versicolor
80,5.5,2.4,3.8,1.1,Iris-versicolor


## Cross-Validation Code:

A good rule of thumb is to train the model on 80% of the given data examples (training set) and test it on the remaining 20% (testing set).

Splitting datasets into training and testing sets with a Pandas DataFrame:

In [3]:
import pandas as pd
import numpy as np
iris = pd.read_csv('/content/drive/MyDrive/cs167_sp25/datasets/irisData.csv')

#shuffle the iris "sampling" the full set in random order
shuffled_data = iris.sample(frac=1, random_state=41)

# set up training and testing set
number_of_test_samples = 20
test_data = shuffled_data.iloc[0:number_of_test_samples] #test on the first 20 rows of shuffled
train_data = shuffled_data.iloc[number_of_test_samples:] #train on the rest
train_data.shape

(130, 5)
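
The cell above holds out a fixed count of 20 rows. A minimal sketch of an 80/20 split computes the test size from a fraction instead. It assumes a hypothetical stand-in DataFrame (`data`) rather than the iris file, and the names `test_fraction`, `train_split`, and `test_split` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the iris DataFrame: 150 rows, one feature, one label.
data = pd.DataFrame({"feature": range(150), "label": ["x", "y", "z"] * 50})

test_fraction = 0.20  # assumed 80/20 rule of thumb

# Shuffle all rows, then derive the number of test rows from the fraction.
shuffled = data.sample(frac=1.0, random_state=41)
number_of_test_samples = round(len(shuffled) * test_fraction)  # 30 of 150

test_split = shuffled.iloc[:number_of_test_samples]   # first 20% of the shuffled rows
train_split = shuffled.iloc[number_of_test_samples:]  # remaining 80%
print(train_split.shape, test_split.shape)  # (120, 2) (30, 2)
```

Because both pieces are slices of the same shuffled DataFrame, no row can end up in both the training and the testing set.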

In [4]:
# Notice the labels in the leftmost column of the first five samples in the 'train_data' split.
# They will remain the same for the same random_state=41.
# If you change the random_state to 3 (or any other value, such as 100), the samples will be shuffled into a different order.
train_data.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
81,5.5,2.4,3.7,1.0,Iris-versicolor
120,6.9,3.2,5.7,2.3,Iris-virginica
43,5.0,3.5,1.6,0.6,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
64,5.6,2.9,3.6,1.3,Iris-versicolor


In [5]:
# Notice the labels in the leftmost column of the first five samples in the 'test_data' split.
# These are the first 20 shuffled samples, which were held out from 'train_data'.
test_data.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
119,6.0,2.2,5.0,1.5,Iris-virginica
128,6.4,2.8,5.6,2.1,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
91,6.1,3.0,4.6,1.4,Iris-versicolor
112,6.8,3.0,5.5,2.1,Iris-virginica


## Cross-Validation Metrics:
When doing cross-validation, how do we tell how well our model performed?

How can we measure it?

- depends on the task and what we want to know.

### Classification metrics are different than regression metrics

## Classification Metrics: `Accuracy`

__Accuracy__: The fraction of test examples your model predicted correctly

*Example*: 17 out of 20 = 0.85 accuracy
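
As a minimal sketch, accuracy can be computed by counting matches between two lists. The labels below are hypothetical and chosen only for illustration.

```python
# Hypothetical actual labels and model predictions for 5 test examples.
actual      = ["cat", "dog", "dog", "cat", "dog"]
predictions = ["cat", "dog", "cat", "cat", "dog"]

# Count the positions where the prediction matches the actual label.
correct = sum(a == p for a, p in zip(actual, predictions))
accuracy = correct / len(actual)
print(accuracy)  # 4 out of 5 correct -> 0.8
```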

### Issues with accuracy:
- Suppose that a blood test for cancer has 99% accuracy
    - *can we safely assume this is a really good test?*
        -  If the dataset is *unbalanced*, accuracy is not a reliable metric for the real performance of a classifier because it will yield misleading results.
        - __Example__: Most people don’t have cancer.

    - Beware of what your metrics don't tell you.

- What about __false negatives__ and __false positives__?
    - __false negative__: a test result which incorrectly indicates that a particular condition or attribute is absent
    - __false positive__: a test result which incorrectly indicates that a particular condition or attribute is present
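
The unbalanced case can be sketched with made-up numbers. Assume a hypothetical group of 1000 patients in which 10 have cancer, and a useless "test" that always answers healthy. The counts are assumptions for illustration, not real data.

```python
# Hypothetical, unbalanced data: 10 patients with cancer, 990 healthy.
actual = ["cancer"] * 10 + ["healthy"] * 990

# A useless classifier that predicts "healthy" for everyone.
predictions = ["healthy"] * len(actual)

pairs = list(zip(actual, predictions))
accuracy = sum(a == p for a, p in pairs) / len(pairs)

# False negative: actually cancer, predicted healthy.
false_negatives = sum(a == "cancer" and p == "healthy" for a, p in pairs)
# False positive: actually healthy, predicted cancer.
false_positives = sum(a == "healthy" and p == "cancer" for a, p in pairs)

print(accuracy)         # 0.99
print(false_negatives)  # 10: every cancer case is missed
print(false_positives)  # 0
```

The accuracy looks excellent, yet the classifier never detects a single cancer case, which is exactly what the false-negative count reveals.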

## Classification Metrics: `Confusion Matrix`

__Confusion matrix__: A specific table layout that allows the visualization of the performance of an algorithm. Each row represents the instances in an actual class, while each column represents the instances in a predicted class.
- It makes it easy to see where your model is confusing the predicted and actual results
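
A minimal sketch of building such a table uses `pd.crosstab`, with rows for the actual class and columns for the predicted class. The six labels below are hypothetical, not results from the iris experiments in these notes.

```python
import pandas as pd

# Hypothetical actual and predicted species for 6 test examples.
actual = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="actual",
)
predicted = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"],
    name="predicted",
)

# Rows = actual class, columns = predicted class.
matrix = pd.crosstab(actual, predicted)
print(matrix)

# Correct predictions sit on the diagonal (same actual and predicted label).
correct = sum(matrix.loc[label, label] for label in matrix.index)
print(correct / len(actual))  # 4 of 6 predictions are on the diagonal
```

The off-diagonal cells show where the model mixes up versicolor and virginica. scikit-learn also provides `confusion_matrix` in `sklearn.metrics`, which returns the same counts as an array.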


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_iris_confusionmatrix.png" width=400/>
</div>


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_pretty_confusionmatrix.png" width=500/>
</div>

## Confusion Matrix Exercise:

Given the following confusion matrix:
- How many false positives?
- How many false negatives?
- What is the accuracy?


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_confusionmatrix_ex.png" width=200/>
</div>


# Classification v Regression:

What's the difference?

The output variable in __regression__ is numerical (or continuous).

The output variable in __classification__ is categorical (or discrete).

### Is accuracy a good metric for regression? Why or why not?

## Regression Metrics: `Mean Absolute Error (MAE)`

__Mean Absolute Error (MAE)__: the average absolute difference between the actual and predicted target values.

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_mae.png" width=500 height=100/>
</div>


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_mae_calc.png" width=500/>
</div>
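
As a minimal sketch, MAE can be computed with NumPy. The target values below are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical actual and predicted target values.
actual = np.array([3, 5, 8, 10])
predictions = np.array([2, 5, 10, 14])

# Absolute errors are 1, 0, 2, 4; their mean is 7 / 4.
mae = np.mean(np.abs(actual - predictions))
print(mae)  # 1.75
```

scikit-learn offers the same calculation as `mean_absolute_error` in `sklearn.metrics`.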


## Regression Metrics: `Mean Squared Error (MSE)`

__Mean Squared Error__: The average squared difference between the actual and predicted targets.

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_mse.png" width=500 height=100/>
</div>


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_mse_calc.png" width=500/>
</div>
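
As a minimal sketch, MSE can be computed with NumPy. The target values below are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical actual and predicted target values.
actual = np.array([3, 5, 8, 10])
predictions = np.array([2, 5, 10, 14])

# Squared errors are 1, 0, 4, 16; their mean is 21 / 4.
mse = np.mean((actual - predictions) ** 2)
print(mse)  # 5.25
```

scikit-learn offers the same calculation as `mean_squared_error` in `sklearn.metrics`.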


## MAE v MSE:

What effect does the squaring have on the error measurements?

Can you think of any scenarios where it might be better to use `MAE` over `MSE` or vice versa?
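
One way to explore the effect of squaring is to compare two hypothetical sets of errors with the same total error: one spread evenly, the other concentrated in a single large miss. This is a sketch with made-up numbers, and the helper names `mae` and `mse` are illustrative.

```python
def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

# Hypothetical prediction errors (actual minus predicted) for two models.
even_errors = [1, 1, 1, 1]     # every prediction is off by 1
outlier_errors = [0, 0, 0, 4]  # three perfect predictions, one large miss

print(mae(even_errors), mae(outlier_errors))  # 1.0 1.0
print(mse(even_errors), mse(outlier_errors))  # 1.0 4.0
```

MAE treats the two models as equally good, while MSE penalizes the single large miss four times as heavily.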

## Regression Metrics: $R^2$

Consider this naive prediction method: always predict the average target value

Do you think this is a good predictor algorithm?

No.

So we should be able to beat it; if we can't, we're in trouble. However, we can use this as a point of comparison.
- An $R^2$ value of 0 means that you have done no better than the naive strategy of predicting the average

In [None]:
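from sklearn.metrics import r2_score
predictions = [12, 15.2, 21, 29]
actual = [14, 16, 19, 21]
# r2_score expects the true values first, then the predictions: r2_score(y_true, y_pred)
r2 = r2_score(actual, predictions)
# a negative value means these predictions do worse than always predicting the mean of actual
print(round(r2, 4))

-1.5048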


## Interpreting $R^2$

Things you should know:
- Usually $R^2$ values fall between 0 and 1
- 1 means you perfectly fit the data
- 0 means you've done no better than average
- Negative numbers mean that the naive model that predicts the average is actually a better predictor than yours. Your model is really bad.
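
These three cases can be checked by hand with the usual definition $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $y_i$ are the actual values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the actual values. The sketch below reuses the `actual` and `predictions` lists from the cell above; the helper name `r_squared` is illustrative and not part of scikit-learn.

```python
import numpy as np

def r_squared(actual, predictions):
    """R^2 = 1 - (squared error of the model) / (squared error of the mean baseline)."""
    actual = np.asarray(actual, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = ((actual - predictions) ** 2).sum()    # model's squared error
    ss_tot = ((actual - actual.mean()) ** 2).sum()  # naive baseline's squared error
    return 1 - ss_res / ss_tot

actual = [14, 16, 19, 21]
mean_baseline = [np.mean(actual)] * len(actual)  # always predict the average, 17.5

print(r_squared(actual, actual))               # 1.0: a perfect fit
print(r_squared(actual, mean_baseline))        # 0.0: no better than the average
print(r_squared(actual, [12, 15.2, 21, 29]))   # about -1.5: worse than the average
```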