# Dealing with unusual points

This notebook shows how to calculate [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for
the points we use in a least squares regression.

Cook's distance estimates how much each data point influences the regression.
Points with a large distance are important for determining the model's parameters, so we should check them more closely for
validity (they can be outliers!). Cook's distance is calculated using the projection matrix
$\textbf{H}$ (see page 49 in our textbook for a definition of this matrix).

The data file [outliers.csv](./outliers.csv) contains some x-values and 4 corresponding sets of y-values labeled
"y1", "y2", "y3", and "y4". There is nothing special about the x-values, except for one point which is
far away from the others (this point is "unusual" compared to the other x-values). Further:

* y1: These are points from the equation y1 = 1 + 2x with some random noise.
* y2: These points are similar to y1, but one y value (approximately at x=-3) has been multiplied by 4.
* y3: These points are similar to y1, but one y value (approximately at x=0) has been
  multiplied by 4.
* y4: These points are similar to y1, but one y value (the one corresponding to the unusual x-value) has
  been multiplied by 4.
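
If the CSV file is not at hand, data with the same structure can be generated. The following is a minimal sketch and not the contents of `outliers.csv`: the number of points, the noise level, the range of the x-values, the position of the unusual x-value, and the name `make_example_data` are all assumptions made for illustration.

```python
import numpy as np


def make_example_data(m=40, seed=0):
    """Return hypothetical stand-in data (not the contents of outliers.csv)."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(-5.0, 5.0, size=m))
    x[-1] = 15.0  # one x-value far away from the others (assumed position)
    y1 = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=m)
    # Index of the single y-value that is multiplied by 4 in each set:
    targets = {
        "y2": int(np.argmin(np.abs(x[:-1] + 3.0))),  # closest to x = -3
        "y3": int(np.argmin(np.abs(x[:-1]))),  # closest to x = 0
        "y4": m - 1,  # the unusual x-value
    }
    data_demo = {"x": x, "y1": y1}
    for name, idx in targets.items():
        y = y1.copy()
        y[idx] *= 4.0
        data_demo[name] = y
    return data_demo, targets


data_demo, targets = make_example_data()
for name, idx in targets.items():
    print(name, "changed at x =", round(float(data_demo["x"][idx]), 2))
```

Numbers obtained from this stand-in data will differ from those discussed in the rest of the notebook.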
  
  
We begin by loading this data, making some plots, and creating least squares models.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import scale
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_theme(style="ticks", context="notebook", palette="muted")
%matplotlib notebook

In [None]:
# We load the data and plot them:
data = pd.read_csv("outliers.csv")
X = data["x"].to_numpy().reshape(-1, 1)
y1 = data["y1"].to_numpy()
y2 = data["y2"].to_numpy()
y3 = data["y3"].to_numpy()
y4 = data["y4"].to_numpy()

all_y = [y1, y2, y3, y4]

fig, axes = plt.subplots(
    constrained_layout=True,
    sharex=True,
    sharey=True,
    ncols=2,
    nrows=2,
    figsize=(8, 8),
)
axes = axes.flatten()
axes[0].set_ylabel("y")
axes[2].set_ylabel("y")
for i, axi in enumerate(axes):
    axi.scatter(X, all_y[i])
    axi.set(xlabel="x", title=f"Data set: {i+1}")
sns.despine(fig=fig)

## Creating some linear models

We will now create least squares models for the 4 data sets to see how the unusual points
influence the models we get.

In [None]:
models = []
predicted = []
r2_scores = []

for y in all_y:
    new_model = LinearRegression(fit_intercept=True)
    new_model.fit(X, y)
    y_hat = new_model.predict(X)
    models.append(new_model)
    predicted.append(y_hat)
    r2_scores.append(r2_score(y, y_hat))

In [None]:
def model_equation(model):
    """Return a string with the parameters for a linear model."""
    return f"y = {model.intercept_:.3g} + {model.coef_[0]:.3g} * x"

In [None]:
# Plot the predicted values for the different data sets
fig, axes = plt.subplots(
    constrained_layout=True,
    sharex=True,
    sharey=True,
    ncols=2,
    nrows=2,
    figsize=(8, 8),
)
axes = axes.flatten()
axes[0].set_ylabel("y")
axes[2].set_ylabel("y")

for i, axi in enumerate(axes):
    axi.scatter(X, all_y[i])
    axi.plot(
        X,
        predicted[i],
        label=f"R² = {r2_scores[i]:.3g}\n{model_equation(models[i])}",
        color="k",
    )
    axi.legend()
    axi.set(xlabel="x", title=f"Data set: {i+1}")
sns.despine(fig=fig)

Here, we see that the influence of the unusual points differs between the data sets.
Essentially, there are two contributions: how unusual the $y$-value is and how unusual the $x$-value is.
We will now look for the unusual data points.

## Finding unusual $x$-values

We will first look for points with unusual $x$-values. The $x$-values are
common to all 4 data sets, so we only need to do this once (we use data set number 1 for the plots).

To do this,
we will calculate the so-called **leverage score**, which is given by
the diagonal elements of the projection matrix. The leverage scores will help us locate
unusual $x$-values. The motivation is as follows:

If we know the projection matrix, $\textbf{H}$, then we can get the y-values
estimated by the model ($\hat{\textbf{y}}$) directly from the measured y-values ($\textbf{y}$):

\begin{equation}
\hat{\textbf{y}} = \textbf{H} \textbf{y}
\end{equation}

Let us say that we have $m$ $y$-values and that the elements of the matrix $\textbf{H}$ are $h_{ij}$.
If we write out the estimation of $\hat{y}_i$ we get:

\begin{equation}
\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \ldots + h_{ii} y_i + \ldots + h_{im} y_m
\end{equation}

and we see here that $h_{ii} \times y_i$ gives the contribution of point $y_i$ to the estimation of point
$\hat{y}_i$. One can show the following properties for $h_{ii}$:

* the sum of $h_{ii}$ equals the number of coefficients in the linear model
* $0 \leq h_{ii} \leq 1$
* $h_{ii}$ measures the distance between $x_i$ and the mean of all $x$-values
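
These properties can be checked numerically. The following is a minimal sketch under two assumptions: the x-values are made up, and the design matrix includes a column of ones for the intercept (so there are $k = 2$ coefficients). The names `x_demo`, `A_demo`, `H_demo`, and `h_demo` are illustrative and not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(-5.0, 5.0, size=30)
x_demo[0] = 15.0  # one x-value far away from the others
y_demo = 1.0 + 2.0 * x_demo + rng.normal(size=x_demo.size)

# Design matrix with a column of ones for the intercept (k = 2):
A_demo = np.column_stack([np.ones_like(x_demo), x_demo])
H_demo = A_demo @ np.linalg.pinv(A_demo)
h_demo = np.diagonal(H_demo)

# y_hat = H y gives the same values as a least squares fit:
coef, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)
print("H y equals fitted values:", np.allclose(H_demo @ y_demo, A_demo @ coef))

# The sum of h_ii equals the number of coefficients:
print("Sum of h_ii:", h_demo.sum())

# All h_ii are between 0 and 1 (up to rounding errors):
print("Smallest and largest h_ii:", h_demo.min(), h_demo.max())

# For a straight line with intercept, h_ii grows with the squared
# distance between x_i and the mean of the x-values:
centered = x_demo - x_demo.mean()
h_closed = 1.0 / x_demo.size + centered**2 / np.sum(centered**2)
print("Matches closed form:", np.allclose(h_demo, h_closed))
```

Without the column of ones, the diagonal elements sum to 1 instead of 2, and they measure the distance from $x = 0$ rather than from the mean.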

So, if $h_{ii}$ is "large", this means that observation no. $i$ is very important for predicting $\hat{y}_i$.
Let us see what these $h_{ii}$ elements look like for our current $\textbf{X}$:

In [None]:
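# Calculate the projection matrix and its diagonal (the leverage scores).
# X holds only the x-values (no column of ones for the intercept),
# so the diagonal elements sum to 1 here.
H = X @ np.linalg.pinv(X)
h = np.diagonal(H)
# Check the sum:
print("Sum of diagonal elements (should be 1):", h.sum())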
print("Smallest diagonal element:", h.min())
print("Largest diagonal element:", h.max())
print("Second largest diagonal element:", np.sort(h)[-2])

Here, we see that the largest diagonal element is 4 times larger than the second largest, so this can potentially
be an unusual point. Different programs use different rules of thumb to flag "large" $h_{ii}$'s. Two common
choices are:

* $h_{ii} > 3 \times \langle h \rangle = 3 \times \frac{k}{m}$
* $h_{ii} > 2 \times \langle h \rangle = 2 \times \frac{k}{m}$

where $\langle h \rangle$ denotes the average of the diagonal elements, $k$ is the number of
coefficients, and $m$ is the number of observations. Let us plot all $h_{ii}$'s and add the thresholds above:

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
pos = range(len(h))
ax.bar(pos, h)
ax.set(xlabel="Observation number", ylabel="Leverage score")
threshold_3 = 3.0 * h.mean()
threshold_2 = 2.0 * h.mean()
ax.axhline(y=threshold_3, ls=":", color="k", label="Threshold (3 × mean)")
ax.axhline(y=threshold_2, ls="--", color="k", label="Threshold (2 × mean)")
ax.legend()
sns.despine(fig=fig)
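
The two rules of thumb can also be applied directly to the leverage scores, without a plot. The following is a small sketch: the helper `flag_high_leverage` is a hypothetical name, and the scores in `h_example` are made-up numbers chosen so that the two thresholds give different answers.

```python
import numpy as np


def flag_high_leverage(h_values, factor=3.0):
    """Return the indices where the leverage exceeds factor * mean leverage."""
    h_values = np.asarray(h_values, dtype=float)
    return np.flatnonzero(h_values > factor * h_values.mean())


# Made-up leverage scores with mean 0.1:
h_example = np.array([0.04] * 8 + [0.28, 0.4])

print(flag_high_leverage(h_example, factor=3.0))  # prints [9]
print(flag_high_leverage(h_example, factor=2.0))  # prints [8 9]
```

The stricter threshold flags only the last point, while the looser one also flags the point with a score of 0.28.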

In general, it can be difficult to use the thresholds above. Another approach is to look for points in the
figure above that seem to be significantly different from the others. Here, it seems to be point no. 31.
Let us label this in the original data set:

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
ax.set(xlabel="x", ylabel="y", title="Data set 1")
ax.scatter(X, y1, label="All points")  # Draw all points

idx = np.where(h > threshold_3)[0]  # Select points with h > threshold_3
# You can swap the threshold_3 to threshold_2 to see if there is any difference

ax.scatter(X[idx], y1[idx], alpha=0.5, label="Unusual point(s)!", s=100)
ax.legend()
sns.despine(fig=fig)

This is sort of what we expect: one of the $x$-values is significantly different from the other points, and
now we have found which one it is. We should check this point in more detail to see if there is anything unusual about it and if we should keep it when training our model.

## Finding unusual $y$-values

Next, we will look for points that are important for the calculation of regression parameters. We do this by
calculating the [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance).
For each observation ($i$), we calculate the Cook's distance $D_i$ by:

\begin{equation}
D_i = \frac{(y_i - \hat{y}_i)^2}{k s^2} \left( \frac{h_{ii}}{(1 - h_{ii})^2} \right)
\end{equation}

where $s^2 = \frac{\sum_{j=1}^m (y_j - \hat{y}_j)^2}{m - k}$ is the mean squared error. We could get
the same number for observation number $i$ by training a new least squares model on a data set with observation number $i$
removed, and comparing the predictions of the new model with those of the old. This is a lot of work to do for all data points, so we
prefer to use the simple formula above.
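
To see that the formula and the refitting approach agree, we can compare them on a small example. This is a sketch under stated assumptions: the data are made up, the design matrix `A_demo` includes a column of ones for the intercept (so $k = 2$), and both function names are hypothetical. The refitting version sums the squared differences between the predictions of the full model and the model trained without observation $i$, and divides by $k s^2$.

```python
import numpy as np


def cooks_distance_closed_form(A, y):
    """Cook's distance from residuals and leverage scores."""
    m, k = A.shape
    h = np.diagonal(A @ np.linalg.pinv(A))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    s2 = residual @ residual / (m - k)
    return residual**2 / (k * s2) * h / (1.0 - h) ** 2


def cooks_distance_refit(A, y):
    """Cook's distance by refitting with each observation left out."""
    m, k = A.shape
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    s2 = np.sum((y - y_hat) ** 2) / (m - k)
    dist = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i  # all observations except number i
        coef_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        dist[i] = np.sum((y_hat - A @ coef_i) ** 2) / (k * s2)
    return dist


rng = np.random.default_rng(2)
x_demo = rng.uniform(-5.0, 5.0, size=25)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(size=x_demo.size)
y_demo[3] *= 4.0  # distort one y-value
A_demo = np.column_stack([np.ones_like(x_demo), x_demo])

closed = cooks_distance_closed_form(A_demo, y_demo)
refit = cooks_distance_refit(A_demo, y_demo)
print("Formula and refitting agree:", np.allclose(closed, refit))
```

The refitting version solves one least squares problem per observation, while the formula needs only the original fit.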

Again, the Cook's distance is just a number, and we need some way of determining if a distance is "large".
That is, we need some way of saying that a point influences the parameters a lot. A rule-of-thumb is to use the
value $4/m$ as a cut-off. Different programs might use different threshold values, so we can also
trust ourselves and look for distances that "look" unusual. Let us see it in action:

In [None]:
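# Calculate Cook's distance.
# Note: X holds only the x-values (no column of ones), so rank, dof, and k
# below do not count the intercept that LinearRegression also fits.
rank = np.linalg.matrix_rank(X)
dof = X.shape[0] - rank
k = X.shape[1]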

cook = []

for i, model in enumerate(models):
    y_hat = predicted[i]
    y = all_y[i]
    residual = y - y_hat
    mse = np.dot(residual, residual) / dof  # this is s^2
    # Calculate all distances:
    dist = (residual**2 / (k * mse)) * (h / (1 - h) ** 2)
    cook.append(dist)

In [None]:
fig, axes = plt.subplots(
    constrained_layout=True, ncols=2, nrows=2, figsize=(8, 8), sharex=True
)
axes = axes.flatten()
axes[0].set_ylabel("Cook's distance")
axes[2].set_ylabel("Cook's distance")
axes[2].set_xlabel("Observation number")
axes[3].set_xlabel("Observation number")

for i, axi in enumerate(axes):
    dist = cook[i]
    pos = range(len(dist))
    axi.bar(pos, dist)
    threshold_cook = 4.0 / len(dist)
    axi.axhline(y=threshold_cook, color="k", ls=":", label="Threshold")
    axi.set(title=f"Data set {i+1}")
    if i == 0:
        axi.legend()
sns.despine(fig=fig)

From the figure above, we conclude that data sets 1 and 3 do not contain any very influential points, while
data sets 2 and 4 do! Let us label these points:

In [None]:
# Plot the predicted values and mark the influential points in each data set
fig, axes = plt.subplots(
    constrained_layout=True,
    sharex=True,
    sharey=True,
    ncols=2,
    nrows=2,
    figsize=(8, 8),
)
axes = axes.flatten()
axes[0].set_ylabel("y")
axes[2].set_ylabel("y")

for i, axi in enumerate(axes):
    axi.scatter(X, all_y[i])
    axi.plot(
        X,
        predicted[i],
        label=f"R² = {r2_scores[i]:.3g}\n{model_equation(models[i])}",
        color="k",
    )
    axi.set(xlabel="x", title=f"Data set: {i+1}")
    dist = cook[i]
    threshold_cook = 4.0 / len(dist)
    idx = np.where(dist > threshold_cook)[0]
    if len(idx) > 0:
        axi.scatter(
            X[idx],
            all_y[i][idx],
            alpha=0.5,
            label="Influential point(s)!",
            s=100,
        )
    axi.legend()
sns.despine(fig=fig)

In the figure above we have marked the points that influence the parameters of the linear model a lot.
These are points we should investigate further. If they are outliers, we can try to delete them, and remake
the model. Let us try this for data set number 4:

In [None]:
# Plot data set 4 together with the model fitted to all points.
# We index with -1 (the last data set) instead of relying on the loop
# variable i left over from the previous cell.
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(X, y4)
ax.plot(
    X,
    predicted[-1],
    label=f"All points:\nR² = {r2_scores[-1]:.3g}\n{model_equation(models[-1])}",
    lw=2,
)
dist = cook[-1]
threshold_cook = 4.0 / len(dist)
idx = np.where(dist > threshold_cook)[0]
if len(idx) > 0:
    ax.scatter(
        X[idx], y4[idx], alpha=0.5, label="Influential point(s)!", s=100
    )

# Remove the unusual points, and make a new model:
X_removed = np.delete(X, idx).reshape(-1, 1)
y_removed = np.delete(y4, idx)

new_model = LinearRegression(fit_intercept=True)
new_model.fit(X_removed, y_removed)
y_hat = new_model.predict(X)
text = (
    "Without influential point(s):\n"
    f"R² = {r2_score(y_removed, new_model.predict(X_removed)):.3g}\n"
    f"{model_equation(new_model)}"
)
ax.plot(X, y_hat, ls=":", label=text, lw=3)
ax.set(xlabel="x", ylabel="y", title="Data set: 4")
ax.legend(labelspacing=1.2)
sns.despine(fig=fig)

**Note**: The Python package [yellowbrick](https://www.scikit-yb.org/en/latest/) can calculate and display
Cook's distances for us, see [this example](https://www.scikit-yb.org/en/latest/api/regressor/influence.html) for more information.

You can also experiment with methods that are robust to outliers, for instance
[RANSAC](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html),
[Theil-Sen](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html), or
[Huber regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor). [Here is a short comparison](https://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors)
of these options.
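
As a starting point for such experiments, the sketch below fits the three robust estimators and ordinary least squares to made-up data with one distorted point at an unusual x-value. The data (true line $y = 1 + 2x$, noise level, position of the unusual point) and the names ending in `_demo` are assumptions made for illustration; the estimators are used with their default settings apart from `random_state`.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

rng = np.random.default_rng(3)
x_demo = np.sort(rng.uniform(-5.0, 5.0, size=40))
x_demo[-1] = 15.0  # one x-value far away from the others
y_demo = 1.0 + 2.0 * x_demo + rng.normal(scale=0.5, size=x_demo.size)
y_demo[-1] *= 4.0  # distort the y-value at the unusual x-value
X_demo = x_demo.reshape(-1, 1)

estimators = {
    "Least squares": LinearRegression(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Theil-Sen": TheilSenRegressor(random_state=0),
    "Huber": HuberRegressor(),
}

slopes = {}
for name, estimator in estimators.items():
    estimator.fit(X_demo, y_demo)
    # RANSAC stores the final model in the estimator_ attribute;
    # the other estimators expose coef_ directly.
    fitted = getattr(estimator, "estimator_", estimator)
    slopes[name] = float(fitted.coef_[0])
    print(f"{name}: slope = {slopes[name]:.3g} (true slope: 2)")
```

How much each estimator is affected depends on the data and on the settings, so it is worth trying this on the data sets in `outliers.csv` as well.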