# Heteroscedasticity

[Weighted least squares](https://en.wikipedia.org/wiki/Weighted_least_squares) is not part
of TKJ4175, but a question came up in Lecture 4 about how to choose
the weights if we want to use the method. Here is one example of how this can be done: we first fit a model with
ordinary least squares, and then we use the residuals from that fit to set the weights for a weighted least squares fit.

In some cases, heteroscedasticity can be "fixed" by a suitable data transformation. Two examples of this are shown further down, followed by a link to an article with more information on what
such transformations can do to our model.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns


sns.set_theme(style="ticks", context="notebook", palette="muted")
%matplotlib notebook

The file [noise.csv](./noise.csv) contains a set of x and y values where the noise is heteroscedastic:

In [None]:
data = pd.read_csv("noise.csv")
x = data["x"].to_numpy()
y = data["y"].to_numpy()
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(x, y)
ax.set(xlabel="x", ylabel="y", title='Raw data: "noise.csv"')
sns.despine(fig=fig)

## Standard least squares regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression(fit_intercept=True)
X = x.reshape(-1, 1)
model.fit(X, y)

In [None]:
def plot_results(X, y, model, weights=None):
    fig, (ax1, ax2) = plt.subplots(
        constrained_layout=True, ncols=2, figsize=(8, 4)
    )
    ax1.scatter(X, y)
    y_hat = model.predict(X)
    r2 = model.score(X, y, sample_weight=weights)
    if weights is not None:
        r2_2 = r2_score(y, y_hat)
        text = f"ŷ = {model.intercept_:.3g} + {model.coef_[0]:.3g}*x\nR² (weighted) = {r2:.3g}\nR² (non-weighted) = {r2_2:.3g}"
    else:
        text = f"ŷ = {model.intercept_:.3g} + {model.coef_[0]:.3g}*x\nR² = {r2:.3g}"
    ax1.plot(X, y_hat, color="k", label=text)
    ax1.set(xlabel="x", ylabel="y")
    ax1.legend()

    residual = y - y_hat
    ax2.scatter(y_hat, residual, label="Residuals (non-weighted)")
    if weights is not None:
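        # The weighted fit minimizes sum(w * r**2), so the residuals are
        # scaled by sqrt(w) to put them on the scale the fit actually sees.
        ax2.scatter(
            y_hat,
            residual * np.sqrt(weights),
            label="Residuals (weighted)",
            marker="s",
        )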
        ax2.legend()
    ax2.set(xlabel="ŷ", ylabel="Residuals (y - ŷ)", title="Residuals")
    ax2.axhline(y=0, ls=":", color="k")
    sns.despine(fig=fig)
    return residual

In [None]:
residuals = plot_results(X, y, model)

## Weighted least squares

Let us try weighted least squares. Here, we set the weights inversely proportional to the absolute value of the residuals we got
from the ordinary least squares fit, so that points far from the fitted line count less:

In [None]:
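# Inverse absolute residuals: always positive, small for points far from the OLS line
weights = 1.0 / np.abs(residuals)
# or, alternatively:
# weights = 1.0 / residuals**2
# Normalize to unit Euclidean length (a common scale factor does not change the fit)
weights = weights / np.sqrt(np.dot(weights, weights))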
model2 = LinearRegression(fit_intercept=True)
model2.fit(X, y, sample_weight=weights)
_ = plot_results(X, y, model2, weights=weights)

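**Note:** The weighted R² looks a lot better for the weighted model. This is expected, since the points that had the largest residuals in the ordinary least squares fit are given the smallest weights. If we calculate R² without weights, it will be similar to R² for the ordinary least squares model we made first.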
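The effect of `sample_weight` can be checked by hand. The sketch below is not part of the lecture material: it uses synthetic data as a stand-in for `noise.csv` and hypothetical weights `w` (the inverse of an assumed noise variance). It shows that a weighted fit gives the same coefficients as an ordinary least squares fit on rows scaled by the square root of the weights, and that the weighted R² is one minus the weighted residual sum of squares divided by the weighted total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for noise.csv (an assumption): the noise grows with x.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)
X = x.reshape(-1, 1)

# Hypothetical weights for illustration: inverse of the assumed noise variance.
w = 1.0 / (0.5 * x) ** 2

wls = LinearRegression(fit_intercept=True).fit(X, y, sample_weight=w)

# 1) The same fit as ordinary least squares on rows scaled by sqrt(w).
#    The column of ones (the intercept) must be scaled as well.
A = np.column_stack([np.ones_like(x), x])
sqrt_w = np.sqrt(w)
beta, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)

# 2) Weighted R² by hand, using the weighted mean of y as the baseline.
y_hat = wls.predict(X)
ss_res = np.sum(w * (y - y_hat) ** 2)
ss_tot = np.sum(w * (y - np.average(y, weights=w)) ** 2)
r2_weighted = 1.0 - ss_res / ss_tot

print("scaled OLS:", beta)
print("sklearn   :", wls.intercept_, wls.coef_[0])
print("R² by hand:", r2_weighted)
print("R² sklearn:", wls.score(X, y, sample_weight=w))
```

Since the fit minimizes the sum of `w * residual**2`, points with small weights contribute little to both the fit and the weighted R².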

## Data transformations

Sometimes, heteroscedasticity can be "removed" by transforming the y-values. For instance, we can take the
square root of the y-values (note that the y-values are first shifted so that they are all positive). This
is a so-called [variance-stabilizing transformation](https://en.wikipedia.org/wiki/Variance-stabilizing_transformation):

In [None]:
y_new = np.sqrt(y - y.min() + 1)
model3 = LinearRegression(fit_intercept=True)
model3.fit(X, y_new)
_ = plot_results(X, y_new, model3)
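To see why a square root can stabilize the variance, here is a small sketch with simulated Poisson counts. This is an assumed example and not the data in `noise.csv`: for Poisson data the variance equals the mean, so the spread grows with the level of the signal, while the variance of the square-rooted values stays close to 1/4 regardless of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
means = [10, 100, 1000]

# Simulated Poisson counts (assumed example): variance is equal to the mean.
samples = {m: rng.poisson(lam=m, size=100_000) for m in means}

var_raw = {m: s.var() for m, s in samples.items()}
var_sqrt = {m: np.sqrt(s).var() for m, s in samples.items()}

for m in means:
    print(f"mean={m:5d}  var(y)={var_raw[m]:8.1f}  var(sqrt(y))={var_sqrt[m]:.3f}")
```

Whether the square root is the right transformation for a given data set depends on how the noise grows with the signal, so it is worth checking the residual plot after transforming.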

Another option is to log-transform the (x and) y values. Here, too, we shift the x- and y-values so
that they are all greater than zero.

In [None]:
y_new2 = np.log(y - y.min() + 1)
x_new2 = np.log(x - x.min() + 1)
X_new2 = x_new2.reshape(-1, 1)
model4 = LinearRegression(fit_intercept=True)
model4.fit(X_new2, y_new2)
_ = plot_results(X_new2, y_new2, model4)

**If you are interested:** You can read more about what this transformation is doing to the data in this article: [Regression analysis of log-transformed data: Statistical bias and its correction](https://setac.onlinelibrary.wiley.com/doi/10.1002/etc.5620120618).
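One well-known effect of fitting on a log scale is that simply exponentiating a prediction gives the median of y rather than its mean when the noise is normal on the log scale. The sketch below illustrates this with simulated data. It is a general illustration under assumed values (`sigma`, `mu`), without the shift used above, and it is not a summary of the article.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5  # assumed standard deviation of the noise on the log scale
mu = 10.0    # assumed value of exp(log-scale prediction)

# Model on the log scale: log(y) = log(mu) + eps, with eps ~ N(0, sigma²).
eps = rng.normal(scale=sigma, size=200_000)
y = mu * np.exp(eps)

naive = mu                               # back-transform without correction
corrected = mu * np.exp(sigma**2 / 2)    # mean of a log-normal variable

print("mean of y :", y.mean())
print("naive     :", naive)
print("corrected :", corrected)
```

The naive back-transform underestimates the mean of y here, and the gap grows with the noise level on the log scale.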