<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Regression Metrics

_Authors:_ Tim Book, Matt Brems, Riley Dallas, Noelle Brown, Dan Wilhelm


> All models are wrong, but some are useful.
>
> -- <cite>George EP Box</cite>

Today, we'll cover metrics in more depth. **A metric allows us to state precisely how well our model is performing.**

As a real-life analogy, think of a metric as an exam score: it gauges how well a student's answers match the correct ones. Business examples include KPIs for hiring or employee retention, which an organization has to meet and which then feed into downstream incentives such as corporate bonuses.

> We've already evaluated our model with some metrics -- the Sum of Squared Errors (SSE), Mean of Squared Errors (MSE) and $R^2$

In a mathematical context, you can consider a *metric* to be a ***distance function*** -- it measures how *far away our predictions are from the actual* targets.


### What is a metric?

A metric is a _distance function_! It is used to give us a **numerical measure of our model's performance**.

Metrics have a _mathematical_ definition. Given some set $X$ (just a collection of elements), a **distance function** $d$ maps any two elements $x, y \in X$ (where x and y are elements of set X) to a real number. The distance function $d$ must satisfy some axioms as captured below:

_(I find it easier to interpret these, by looking at them in relation to a metric we already evaluated in our modeling, like $MSE$)_
1. $d(x, y) \geq 0$.  --> $MSE$ is never negative because it is a mean of squared terms
2. $d(x, y) = 0$ if and only if $x = y$.  --> $MSE = 0$ exactly when y_true = y_pred
3. $d(x, y) = d(y, x)$. --> refer to the intuition deck
4. $d(x, y) \leq d(x, z) + d(z, y)$, for any $z \in X$. --> refer to the intuition deck

In this class, $x, y, z$ are usually ***vectors of real numbers***. However, we will see examples later in the class where they can be other things, such as sets of words in an email.
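As a quick sanity check, the four axioms can be tested numerically for the Euclidean distance (introduced later in this lesson). This is only a sketch: the sample points and the helper names `euclidean` and `check_axioms` are made up for illustration, and passing on a handful of points is evidence, not a proof.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def check_axioms(points, d, tol=1e-12):
    """Return True if d satisfies the four metric axioms on every triple of points."""
    for x, y, z in product(points, repeat=3):
        d_xy = d(x, y)
        if d_xy < 0:                                        # axiom 1: non-negative
            return False
        if (d_xy == 0) != (x == y):                         # axiom 2: zero iff identical
            return False
        if not math.isclose(d_xy, d(y, x), abs_tol=tol):    # axiom 3: symmetric
            return False
        if d_xy > d(x, z) + d(z, y) + tol:                  # axiom 4: triangle inequality
            return False
    return True


# Hypothetical sample points (the repeated point exercises axiom 2)
points = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.5), (3.0, 4.0)]
print(check_axioms(points, euclidean))
```

Any candidate distance function with the same two-argument signature can be passed as `d` to see which axiom, if any, it breaks on the sample points.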

---

### What is a loss function?

A **loss function** is often a *metric*. It is used along with an _optimization algorithm_ to determine a **"good" set of parameter values**, for example when finding the line of best fit in a linear regression model.

For gradient-based optimization, the loss function needs to be _differentiable_. This refers to [mathematical differentiation](https://en.wikipedia.org/wiki/Derivative): the derivative of a function measures how sensitive its output value is to a change in its input value, and differentiation is the process of finding that derivative.

> Be careful! In machine learning, you will see the loss function referred to variously as **cost function**, **scoring function**, **evaluation function**, **objective function**, and **entropy function**. These all refer to the same thing -- they ***measure how far our predictions are from the actual values***.


<details><summary>QUESTION: The Sum of Squared Errors (SSE) is a metric. Assuming the predictive model is differentiable, can we use it as a loss function? --> refer to intuition deck
    
$$SSE = \sum_{i=1}^N{(\hat{y}(x_i) - y_i)^2}.$$</summary>

Yes! The sum of squared terms is differentiable, so it would be a good choice for optimization. In fact, SSE is a very common loss function in regression problems -- even for neural networks.

$$SSE = \sum{(y_{true} - y_{pred})^2}$$

$$SSE = \sum{(y_{true} - (x_1 a_1 + x_0 a_0))^2}$$

$$SSE = \sum{(y_{true} - x_1 a_1 - x_0 a_0)^2}$$

$$\frac{\partial{SSE}}{\partial{a_1}} = \sum{2(y_{true} - x_1 a_1 - x_0 a_0)(0 -x_1 -0)}$$

$$\frac{\partial{SSE}}{\partial{a_1}} = \sum{-2x_1(y_{true} - x_1 a_1 - x_0 a_0)}$$
</details>
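The partial derivative derived above can be checked numerically. The sketch below compares the analytic formula for $\partial SSE / \partial a_1$ against a central finite difference. The toy data, the parameter values and the function names are assumptions chosen only for illustration.

```python
# Hypothetical toy data: x0 is a column of ones (the intercept term), x1 is the feature
x1 = [1.0, 2.0, 3.0, 4.0]
x0 = [1.0, 1.0, 1.0, 1.0]
y_true = [2.1, 3.9, 6.2, 7.8]


def sse(a1, a0):
    """Sum of squared errors for the model y_pred = x1 * a1 + x0 * a0."""
    return sum((y - (u * a1 + v * a0)) ** 2 for y, u, v in zip(y_true, x1, x0))


def dsse_da1(a1, a0):
    """Analytic partial derivative of SSE with respect to a1 (formula above)."""
    return sum(-2 * u * (y - u * a1 - v * a0) for y, u, v in zip(y_true, x1, x0))


a1, a0 = 1.5, 0.5   # arbitrary point at which to evaluate the derivative
h = 1e-6            # small step for the finite-difference approximation

numeric = (sse(a1 + h, a0) - sse(a1 - h, a0)) / (2 * h)
analytic = dsse_da1(a1, a0)
print(analytic, numeric)  # the two values should agree to several decimals
```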

---

## Example -- Distance between emails as a metric

**Metrics are anything that measures distance**. So, they _do not have to be differentiable!_

For example, suppose we want to know how similar two emails (email_1, email_2) are. Supposing each email is a non-empty _set_ of words (i.e. there are no duplicates), we propose the following metric:

$$d(\text{email}_1, \text{email}_2) = 1 - \frac{\text{# words in common}}{\text{# total words}}$$

This is effectively a "distance" between two sets of words! It is called the [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). 

- A Jaccard _coefficient_ measures **similarity** between finite sample sets, and is defined as the _size of the intersection between the sets_ divided by the _size of the union_ of the sample sets.

$$\text{Jaccard coeff} = \frac{\text{# words in common}}{\text{# total words}}$$

- A Jaccard _distance_ measures **dissimilarity (or, what is different)** between sample sets, and is complementary to the Jaccard coefficient. It is obtained by **subtracting the Jaccard coefficient from 1** 

    - $\text{email}_1 = \{\text{the, meeting, is, today}\}$.
    - $\text{email}_2 = \{\text{the, prince, will, see, you}\}$.
        - The only word the two sets share is "the", and together they contain 8 distinct words.
    - Then: $d(\text{email}_1, \text{email}_2) = 1 - \frac{\text{# words in common}}{\text{# total words}} = 1 - \frac{1}{8} = \frac{7}{8}$, i.e. they are far apart.


<details><summary>QUESTION: The Jaccard distance is a metric. Can we also use it as a loss function?</summary>

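No -- at least not easily! It is a function, but its inputs are sets and its value changes in discrete jumps as words are added or removed, so it is not differentiable. Wherever it stays constant, the derivative is 0 (recap from our intuition deck), which gives an optimizer no direction to move in.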
    
</details>

In [1]:
# Can you calculate the Jaccard distance using Python?

# What is the Jaccard distance between user1 and user2, based on their brand preferences?
user1 = {'nike', 'adidas', 'starbucks'}
user2 = {'nike', 'adidas'}

# -- Hint: refer to Jaccard coefficient definition above
# -- Hint: Use the set intersection(), union() and len() (- this is for size) functions!

1 - len(user1.intersection(user2)) / len(user1.union(user2))

0.33333333333333337
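The same calculation can be wrapped in a small reusable function and applied to the two emails from the example above. This is a minimal sketch: the function name `jaccard_distance` and the choice to reject two empty sets are our own assumptions.

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 - |intersection| / |union| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # The lesson assumes non-empty sets; the distance is undefined otherwise
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1 - len(a & b) / len(union)


email_1 = {"the", "meeting", "is", "today"}
email_2 = {"the", "prince", "will", "see", "you"}

print(jaccard_distance(email_1, email_2))  # 0.875, i.e. 7/8 as computed by hand above
```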

---

## Example -- Euclidean distance is a metric

Let's take a look at a classic distance function -- the **Euclidean distance**. Don't be scared by the word _Euclidean_ -- this just indicates the geometry of the world around us!  Euclid wrote the math textbook "The Elements," often cited as the second-most widely printed book after the Bible.

> We will use a different notation from grade-school math. First, $\mathbb{R}^2$ refers to the set of all _pairs_ of real numbers, i.e. two-dimensional points, or tuples. So, $x \in \mathbb{R}^2$ refers to a tuple $(x_1, x_2)$ and $y \in \mathbb{R}^2$ refers to a tuple $(y_1, y_2)$. --> refer to intuition slide

_Note: this is NOT the coefficient of determination $R^2$_.

> We write them in this way because each tuple will have millions of components later in this class. We can then easily write e.g. $(x_1, x_2, x_3, \cdots, x_{1000})$! We would quickly run out of letters if we use the grade-school notation $(x, y, z)$.

Let's start with 2-D points. Suppose we have two-dimensional points $x, y \in \mathbb{R}^2$. The [Pythagorean Theorem](https://en.wikipedia.org/wiki/Pythagorean_theorem) relates the lengths of the three sides of a right triangle: $a^2 + b^2 = c^2$, or $c = \sqrt{a^2 + b^2}$. Taking the two legs to be $a = x_1 - y_1$ and $b = x_2 - y_2$, the hypotenuse $c$ is the Euclidean distance between $x$ and $y$:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}.$$

By applying the Pythagorean Theorem once more, we discover a similar function for **three**-dimensional points $x, y \in \mathbb{R}^3$:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}.$$

Seeing a pattern and applying the concept we had picked up about **summation**, we then define the Euclidean distance for **N-dimensional points** $x, y \in \mathbb{R}^N$:

$$d(x, y) = \sqrt{\sum_{i=1}^N{(x_i - y_i)^2}}.$$

<details><summary>QUESTION: What is the relationship between $d, x, y$ and Sum of Squared Errors?</summary>
    
- Let $x$ be a vector of predicted values.
- Let $y$ be a vector of actual values.
- Then, $d^2(x, y)$ is the SSE.
$$d(x, y) := \sqrt{SSE}.$$
    
So, the SSE is the squared distance between the two vectors!
</details>
<details><summary>QUESTION: What is the relationship between $d, x, y$ and the population variance?</summary>

Population variance (denoted by $\sigma^2$) tells us how spread out the data points in a population (like a vector of numbers) are. Mathematically, it is the average squared distance of each element from the overall mean:
$$\sigma^2 = \frac{\sum_{i=1}^N{(x_i-\mu)^2}}{N}$$
- Let $x$ be any vector of real numbers, where $N = len(x)$. 
- Let $y$ be the mean of $x$ (denoted by $\mu$) repeated $N$ times.
- Then, $d^2(x, y)/N$ is the variance of $x$!
$$\sigma^2 = \frac{d^2(x, y)}{N}$$
</details>
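Both relationships from the questions above can be confirmed with the standard library. This is a sketch under assumptions: the predicted and actual values are hypothetical, and `math.dist` and `statistics.fmean` require Python 3.8 or later.

```python
import math
import statistics

# 1) SSE is the squared Euclidean distance between predictions and actuals
predicted = [2.5, 0.0, 2.0, 8.0]   # hypothetical predictions
actual = [3.0, -0.5, 2.0, 7.0]     # hypothetical true values

sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
d = math.dist(predicted, actual)   # Euclidean distance between the two vectors
print(math.isclose(d ** 2, sse))

# 2) Population variance is the squared distance to the "mean vector", divided by N
x = [4, 18, 9, 2, 7]
mu = statistics.fmean(x)           # mean of x
y = [mu] * len(x)                  # the mean repeated N times

variance_from_distance = math.dist(x, y) ** 2 / len(x)
print(math.isclose(variance_from_distance, statistics.pvariance(x)))
```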

In [2]:
import numpy as np

from math import sqrt # alternative to (variable)**0.5

In [3]:
# Can you calculate the distance between x and y using Python?
x = (4, 18, 9, 2, 7)
y = (3, 3, 8, 1, 8)

In [4]:
# zip pairs up the elements of x and y, stopping when the shorter input is exhausted
# each pair (xi, yi) contributes one (xi - yi)**2 term to the distance formula above
list(zip(x,y))

[(4, 3), (18, 3), (9, 8), (2, 1), (7, 8)]

In [5]:
# calculating euclidean distance 
sqrt(sum((xi - yi)**2 for xi, yi in zip(x, y)))

15.132745950421556

In [6]:
# same solution, written with a list comprehension
# steps:
    #1: loop - for xi, yi in zip(x, y)
    #2: operation - calculate difference x - y per tuple and square
    #3: operation - sum all outputs from #2
    #4: operation - calculate sqrt of output from #3
sqrt(sum([(xi - yi)**2 for xi, yi in zip(x, y)]))

15.132745950421556

In [7]:
# Can you calculate the distance between x and y using numpy?
x = np.array([4, 18, 9, 2, 7])
y = np.array([3, 3, 8, 1, 8])

In [8]:
print(x, y, x-y)

[ 4 18  9  2  7] [3 3 8 1 8] [ 1 15  1  1 -1]


In [9]:
# calculating euclidean distance using numpy
np.sqrt(((x - y)**2).sum())

15.132745950421556
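NumPy also provides this calculation directly: `np.linalg.norm` returns the Euclidean (L2) norm of a vector by default, so applying it to the difference `x - y` gives the Euclidean distance. A short sketch using the same vectors as above:

```python
import numpy as np

x = np.array([4, 18, 9, 2, 7])
y = np.array([3, 3, 8, 1, 8])

# The default norm of a 1-D array is the L2 norm: sqrt of the sum of squares
distance = np.linalg.norm(x - y)
print(distance)

# Same value as the manual formula used in the previous cell
print(np.isclose(distance, np.sqrt(((x - y) ** 2).sum())))
```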

# Why do metrics matter?
---

Nearly all models require use of a **metric to assess our performance**.
- For example: linear regression, logistic regression, and neural networks.

Many algorithms work by finding the "nearest" points. _(finding **distance** between points)_
- For example: k-Nearest Neighbors, clustering, and recommendation systems.


# How are loss functions used?
---

Over the next few weeks, we will use the same process for nearly every (parametric) model _(a parametric model is one that can be written as an equation with a fixed set of parameters)_. We will choose an equation, a loss function, and an optimization algorithm.

For example, for multiple linear regression:

- Model: $\hat{y} = \beta_0 + \beta_1x_1 + \cdots + \beta_Nx_N$
- Loss Function: $SSE = \sum_{i=1}^n{(\hat{y}_i - y_i)^2}$, summed over the $n$ observations
- Optimization Algorithm: Linear Algebra _(direct solution, used under the hood when we use sklearn, instead of fitting several lines through our data points)_

For the same model, we could have made other (less common) choices:

- Model: $\hat{y} = \beta_0 + \beta_1x_1 + \cdots + \beta_Nx_N$
- Loss Function: $E = SSE + \alpha\sum_{j=1}^N{\beta_j^2}$ (SSE with regularization)
- Optimization Algorithm: Gradient Descent (iteratively using the derivative of the loss function to decide in which direction to change the parameters, until a best fit is reached)

This same framework is even used with neural networks _(right now, neural networks are the basis of nearly everything in AI - more on these topics later in the course)_:
- Model: $\hat{y} = \sigma(\sigma(\frac{1}{1 + e^{-(x_1+\cdots+x_N)}} + \cdots) + \cdots)$, for a non-linear function $\sigma$.
- Loss Function: $E = SSE + \alpha\sum_{j=1}^N{\beta_j^2}$ (SSE with regularization)
- Optimization Algorithm: Gradient Descent
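To make the "model + loss function + optimization algorithm" recipe concrete, here is a rough sketch that fits the same linear model two ways: with a direct linear-algebra solution and with gradient descent. Everything here is an assumption for illustration: the data is synthetic, the learning rate and iteration count are arbitrary, there is no regularization term, and the gradient is taken of SSE divided by $n$ (which has the same minimizer as SSE but keeps the step size stable).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 2 features, known coefficients plus a little noise
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])           # intercept, beta_1, beta_2
X_design = np.column_stack([np.ones(n), X])      # prepend a column of ones for the intercept
y = X_design @ true_beta + rng.normal(scale=0.1, size=n)

# Optimization algorithm 1: direct least-squares solution (linear algebra)
beta_direct, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Optimization algorithm 2: gradient descent on SSE / n
beta_gd = np.zeros(3)
learning_rate = 0.1
for _ in range(2000):
    residuals = X_design @ beta_gd - y           # y_hat - y
    gradient = 2 * X_design.T @ residuals / n    # derivative of SSE / n w.r.t. beta
    beta_gd -= learning_rate * gradient          # step downhill

print(beta_direct)
print(beta_gd)
```

Both approaches minimize the same loss, so the two coefficient vectors should agree closely. The direct solution gets there in one step, while gradient descent gets there iteratively.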

In [10]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# modeling imports - linear regression
from sklearn.linear_model import LinearRegression # needs scipy 1.6.3 
from sklearn import metrics

## Load the Data
---

Today's dataset (`Advertising.csv`) is from the [ISLR](https://faculty.marshall.usc.edu/gareth-james/ISL/data.html) website.

Drop `"Unnamed: 0"` once you've loaded the CSV into a DataFrame.

In [11]:
# bring in advertising data
ads = pd.read_csv('../datasets/Advertising.csv')
ads.head()

| | Unnamed: 0 | TV | radio | newspaper | sales |
|---|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [12]:
# drop Unnamed: 0 column
ads = ads.drop(columns='Unnamed: 0')
ads.columns

Index(['TV', 'radio', 'newspaper', 'sales'], dtype='object')

## Data Cleaning
---

Run the following checks in the cells provided:

- Are there any null values (`NaN`)?
- Are there any corrupted columns?



In [13]:
# Check for nulls
ads.isnull().sum()

TV           0
radio        0
newspaper    0
sales        0
dtype: int64

In [14]:
# Check datatypes for any corrupted columns
ads.dtypes

TV           float64
radio        float64
newspaper    float64
sales        float64
dtype: object

## Create our feature matrix (`X`) and target vector (`y`)
---

The following columns will be our features:

- `'TV'`
- `'radio'`
- `'newspaper'`

The `sales` column is our label (or target): the column we're trying to predict.

In the cell below, create your `X` and `y` variables.

In [15]:
# create X and y
FEATURES = ['TV', 'radio', 'newspaper']

X = ads[FEATURES]
y = ads['sales'].values

# Verify dimensions
print('X: ', X.shape)
print('y: ', y.shape)

X:  (200, 3)
y:  (200,)


In [16]:
X.head()

| | TV | radio | newspaper |
|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 |
| 1 | 44.5 | 39.3 | 45.1 |
| 2 | 17.2 | 45.9 | 69.3 |
| 3 | 151.5 | 41.3 | 58.5 |
| 4 | 180.8 | 10.8 | 58.4 |


In [17]:
y[:5]

array([22.1, 10.4,  9.3, 18.5, 12.9])

In [18]:
# one possible alternate approach when you have too many features to manually specify
response = ['sales']
features = ads.columns[~ads.columns.isin(response)] # tilde reverses .isin()
features
# X = ads[features] yields the same output as above

Index(['TV', 'radio', 'newspaper'], dtype='object')

## Linear Regression model
---

In the cell below, create an instance of `LinearRegression`.

In [19]:
# instantiate
lr = LinearRegression()

## Fitting the model
---

The `.fit()` method is how our model will learn the coefficients for each of the features (`'TV'`, `'radio'` and `'newspaper'`).

Once it's fit, you can see the bias (aka intercept) and coefficients by running:

```python
lr.coef_
lr.intercept_
```

In [20]:
# fit
lr.fit(X, y)

LinearRegression()

In [21]:
# coefficients
lr.coef_

array([ 0.04576465,  0.18853002, -0.00103749])

In [22]:
# y intercept
lr.intercept_

2.9388893694594103

## Interpreting Slopes

In [23]:
pd.Series(lr.coef_, index=FEATURES)

TV           0.045765
radio        0.188530
newspaper   -0.001037
dtype: float64

## Making predictions
---

To get predictions from our model, all we have to do is run `.predict(X_to_predict)`. This returns a NumPy array of predictions, one for each row in `X_to_predict`, which can be compared against the actual `y` values to gauge model effectiveness. Normally you'll use this method for making predictions on unseen data, but today we'll be evaluating on the same data that was fed into the model during `.fit()`.

In [24]:
# get predictions from our X data
predictions = lr.predict(X)

y[:5], predictions[:5]

(array([22.1, 10.4,  9.3, 18.5, 12.9]),
 array([20.52397441, 12.33785482, 12.30767078, 17.59782951, 13.18867186]))
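A prediction is just the intercept plus each coefficient times its feature value. The sketch below reproduces the first prediction by hand and shows what a slope means. It assumes the intercept and (rounded) coefficients printed in the outputs above, so it matches `.predict()` only to a few decimal places. The dictionary layout is our own choice.

```python
# Values copied from the outputs above (coefficients are rounded as displayed)
intercept = 2.9388893694594103
coefficients = {"TV": 0.04576465, "radio": 0.18853002, "newspaper": -0.00103749}

# First row of X
first_row = {"TV": 230.1, "radio": 37.8, "newspaper": 69.2}


def predict_by_hand(row):
    """Linear model: intercept + sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


prediction = predict_by_hand(first_row)
print(prediction)  # close to the first value returned by lr.predict(X)

# Interpreting a slope: raise TV by one unit, hold radio and newspaper fixed
bumped_row = dict(first_row, TV=first_row["TV"] + 1)
change = predict_by_hand(bumped_row) - prediction
print(change)      # approximately the TV coefficient
```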

---

## Back to Metrics!

Now that we've computed some predictions, let's understand the rationale behind why Mean Squared Error is so commonly used as a metric and loss function.

## Sum Squared Error (SSE)
---

$$SSE = \sum_{i=1}^n(y_i - \hat{y}_i)^2$$

SSE forms the basis for several metrics and loss functions that we'll cover next:

- Mean squared error (MSE)
- Root mean squared error (RMSE)
- $R^2$

> **NOTE:** Depending on what corner of the internet you're in, the SSE goes by many different names. Some common names are the **residual sum of squares (RSS)** and the **sum of squared residuals (SSR)**. However, some sources use SSR for the regression (explained) sum of squares instead, and the total sum of squares is a different quantity altogether. Be careful!

<details><summary>QUESTION: Can you think of a reason why NOT to use this as a metric?</summary>

Its value depends on the number of data points! So we cannot use it to reliably compare between two datasets, e.g. our training and test sets, where the training set usually has many more data points than the test set (we'll cover this soon).
    
</details>
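A tiny example makes the point: repeating the same data ten times multiplies the SSE by ten, while the MSE is unchanged. The numbers and helper names below are hypothetical and only for illustration.

```python
def sse(y_true, y_pred):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error: SSE divided by the number of data points."""
    return sse(y_true, y_pred) / len(y_true)


# Hypothetical small dataset: every prediction is off by exactly 1
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 6.0, 6.0]

# The same data repeated 10 times: same quality of fit, 10x the rows
y_true_big = y_true * 10
y_pred_big = y_pred * 10

print(sse(y_true, y_pred), sse(y_true_big, y_pred_big))  # 3.0 30.0
print(mse(y_true, y_pred), mse(y_true_big, y_pred_big))  # 1.0 1.0
```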

In [25]:
# Calculate SSE by hand
sse = ((y - predictions)**2).sum()
sse

556.8252629021872

## Mean Squared Error (MSE)
---

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Goal: Get $MSE$ as close to 0 as possible. _(as covered in the previous lesson: with errors, lower is ALWAYS better)_

Pros: 
- Very common; part of other calculations.
- Represents average distance squared from the predicted value.
- Punishes outliers severely.
- Coincides directly with the metric used to fit OLS model.

Cons: 
- Can be heavily affected by outliers. _(mostly from outlier data in y_true)_
- Not in the original units of $Y$. _(we squared all residuals to neutralize negative deltas between y_true vs y_pred, so we are dealing with values in a squared scale)_
- Depends on scale of $Y$. (e.g. housing prices vs. GPA)
- Uninterpretable to humans

In [26]:
# MSE by hand
((y - predictions)**2).sum() / len(y)

2.784126314510936

In [27]:
# MSE using sklearn
metrics.mean_squared_error(y, predictions)

2.784126314510936

## Root Mean Squared Error (RMSE)
---

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

$$RMSE = \sqrt{MSE}$$

Goal: Get $RMSE$ as close to 0 as possible.

Pros: 
- Pretty common.
- Represents (approximately) average distance from the predicted value.
    - Looks similar to standard deviation.
- In the original units of $Y$. _(sqrt(MSE), scales down the amplification that happens from all the residual squaring in MSE)_

Cons: 
- Can be heavily affected by outliers. _(mostly from outlier data in y_true)_
- Depends on scale of $Y$. (e.g. housing prices vs. GPA)
- Only a _little_ interpretable.

In [28]:
# RMSE
np.sqrt(metrics.mean_squared_error(y, predictions))

1.6685701407225697

## Mean Absolute Error (MAE)
---

$$MAE = \frac{1}{n}\sum_{i=1}^n(|y_i-\hat{y}_i|)$$

Goal: Get MAE as close to 0 as possible. _(still an error)_

Pros: 
- Represents mean distance from the predicted value.
- In the original units of $Y$. 
- Is not heavily affected by outliers.

Cons: 
- Depends on scale of $Y$. (e.g. housing prices vs. GPA)
- Punishes all errors with same "severity".
- Harder to use as a loss function with gradient-based methods, since the absolute value is not differentiable where a residual is exactly 0.
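To see how differently these metrics react to outliers, the sketch below scores two sets of predictions: one where every error is 1, and one where a single prediction is off by 20. The data and helper names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the original units of y."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error, in the original units of y."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 11.0, 15.0, 15.0, 19.0]           # every error has size 1
y_pred_outlier = [11.0, 11.0, 15.0, 15.0, 38.0]   # last error has size 20

for name, preds in [("no outlier", y_pred), ("one outlier", y_pred_outlier)]:
    print(name, mse(y_true, preds), rmse(y_true, preds), mae(y_true, preds))
```

With no outlier all three metrics equal 1. With one large error, MSE and RMSE grow much faster than MAE, because squaring weights the big residual far more heavily.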

In [29]:
# Calculate MAE by hand
np.abs(y - predictions).mean()

1.2520112296870685

In [30]:
# Via sklearn.metrics
metrics.mean_absolute_error(y, predictions)

1.2520112296870685

## Coefficient of Determination, $R^2$
---

$$R^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - \frac{\text{Model Error}}{\text{Total Variance}} = 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}$$

recap: $\bar{y}$ = mean(y) associated with the null model

Goal: Get $R^2$ as close to 1 as possible.

> - score = 0: Model explains none of the variability of the response data around its mean.
> - score = 1: Model explains all the variability of the response data around its mean.

Pros:
- Easy interpretation. "An $R^2$ value of 0.8 means that 80% of the variability in _y_ is explained by the _x_-variables in our model."
- Common metric.
- Does not depend on the scale of $Y$.
- Works with more than just _linear_ regression.

Cons:
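- As you add more variables, $R^2$ will never decrease (with linear regression), even if the new variables are useless. [Adjusted $R^2$](https://www.statisticshowto.com/adjusted-r2/) corrects for this.
- **The "percent of variability explained" interpretation ONLY holds for linear regression!**
- Outside of linear regression scored on its own training data, negative values are possible: a model that predicts worse than the mean gets $R^2 < 0$.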
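Two of these points can be illustrated in a few lines: $R^2$ goes negative when predictions are worse than simply predicting the mean, and adjusted $R^2$ applies a penalty for the number of features. This is a sketch with hypothetical toy data. The adjusted $R^2$ formula used is the standard $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, and the last call plugs in the $R^2$ value, row count and feature count from this notebook.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (model sum of squares) / (null-model sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_null = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_model / ss_null


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
null_preds = [3.0] * 5                     # always predict the mean
bad_preds = [5.0, 4.0, 3.0, 2.0, 1.0]      # predictions that go the wrong way

print(r2_by_hand(y_true, null_preds))  # 0.0: no better than the mean
print(r2_by_hand(y_true, bad_preds))   # -3.0: worse than the mean

# R^2 reported by this notebook's model, with 200 rows and 3 features
print(adjusted_r2(0.8972106381789522, n_samples=200, n_features=3))
```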

> Are low $R^2$ scores bad?
>
> I'm glad you asked! Not everything in regression is about getting the best predictions. In some fields, such as human behavior, you would expect to achieve scores much lower than 50%! For inference, perhaps 0.3 is enough to measure an effect! Yes, there is more to machine learning than prediction. Inference can be the goal as well!

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)

In [31]:
# Null prediction for null model
null_prediction = y.mean()

In [32]:
# Null residuals
null_residuals = y - null_prediction

In [33]:
# Null sum of squares
null_ss = (null_residuals ** 2).sum()

In [34]:
# R2 by hand
1 - sse / null_ss

0.8972106381789522

In [35]:
# R2 from sklearn
metrics.r2_score(y, predictions)

0.8972106381789522

In [36]:
# R2 from model
lr.score(X, y)

0.8972106381789522