# Least squares for a more difficult case
Here, we will try to predict the age of [abalone](https://en.wikipedia.org/wiki/Abalone) from physical measurements. The data is taken from the
[UCI Machine Learning Repository](https://doi.org/10.24432/C55C7W), and to quote that page, 

> The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

The data set contains 4177 samples, with the following variables recorded for each:


| name           | description                           | units   |
|:---------------|:--------------------------------------|:--------|
| Sex            | (M)ale, (F)emale, and (I)nfant        |         |
| Length         | Longest shell measurement             | mm      |
| Diameter       | Perpendicular to length               | mm      |
| Height         | With meat in shell                    | mm      |
| Whole_weight   | Whole abalone                         | grams   |
| Shucked_weight | Weight of meat                        | grams   |
| Viscera_weight | Gut weight (after bleeding)           | grams   |
| Shell_weight   | After being dried                     | grams   |
| Rings          | +1.5 gives the age in years           |         |

We will now attempt to predict the age using these variables.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="muted")

In [None]:
data = pd.read_csv("abalone.csv")
data

In [None]:
data["age"] = data["rings"] + 1.5
data

In [None]:
data.describe()

## Initial exploration

Before we start making models, we must have a look at our raw data. We are going to check for missing values, patterns or anomalies.

### Missing values?

Missing values are like "holes" in our data, and many methods cannot be applied when data is missing.

One way to check if there are missing values is to ask pandas to check if
some columns contain one or more [Not a number (NaN)](https://en.wikipedia.org/wiki/NaN):

In [None]:
data.isnull().values.any()

We can also ask pandas to write out how many NaNs there are in each column:

In [None]:
data.isnull().sum()

We are lucky! There are no missing numbers and we do not have to deal with the potential problems this may cause. How to deal with missing numbers will be a topic for a later lecture in the course.
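
For comparison, here is a minimal sketch of what the two checks above report when holes are present. The toy table below is hypothetical and not part of the abalone data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the abalone set) with two missing entries.
toy = pd.DataFrame(
    {
        "length": [0.45, np.nan, 0.53],
        "height": [0.095, 0.090, np.nan],
        "rings": [15, 7, 9],
    }
)

# True if at least one value anywhere in the table is NaN.
has_missing = bool(toy.isnull().values.any())
# Number of NaNs in each column.
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column)
```

Here the first check is true, and the per-column count shows one hole in `length`, one in `height`, and none in `rings`.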

### Interesting distributions?

Before we model, we should look at distributions of the variables.

In [None]:
sns.displot(data, x="age", kind="kde", hue="sex")
# test with hue and kind and y

### Scatter plot matrix

In [None]:
grid = sns.pairplot(data, diag_kind="kde", hue="sex")

### Correlations

The scatter plot matrix can be difficult to read when there are many variables. We can condense the plots into single numbers by
calculating correlations between pairs of variables. Here, we use the
[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). This
is a number between -1 and 1 that quantifies the linear correlation between a pair of variables. Here is a picture
from Wikipedia that shows different situations:

![Pearson correlation coefficient - picture](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Correlation_coefficient.png/600px-Correlation_coefficient.png)
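
To make the number less of a black box, the sketch below computes the Pearson coefficient directly from its definition (covariance divided by the product of the standard deviations) and compares it with NumPy's `np.corrcoef`. The helper name `pearson` and the synthetic data are assumptions made for illustration only:

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation computed from its definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))


rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_linear = 2.0 * x + 1.0            # perfect linear relation
y_noisy = x + rng.normal(size=200)  # linear relation plus noise

r_linear = pearson(x, y_linear)
r_noisy = pearson(x, y_noisy)
r_numpy = np.corrcoef(x, y_noisy)[0, 1]

print(f"perfect line: r = {r_linear:.3f}")
print(f"noisy line:   r = {r_noisy:.3f} (NumPy: {r_numpy:.3f})")
```

A perfect straight line gives a coefficient of 1 regardless of its slope, while added noise pulls the value towards 0.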

In [None]:
xvariables = [
    "length",
    "diameter",
    "height",
    "whole weight",
    "shucked weight",
    "viscera weight",
    "shell weight",
]
yvariables = ["age"]

variables = xvariables + yvariables
corr = data[variables].corr()
corr.style.background_gradient(cmap="vlag")

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
sns.heatmap(
    corr,
    cmap="vlag",
    vmin=-1,
    vmax=1,
    annot=True,
    ax=ax,
);

## Model 1: Least squares using all variables

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
xvariables = [
    "length",
    "diameter",
    "height",
    "whole weight",
    "shucked weight",
    "viscera weight",
    "shell weight",
]
y = data["age"]
X = data[xvariables]
model1 = LinearRegression(fit_intercept=True)
model1.fit(X, y)
y_hat = model1.predict(X)

In [None]:
def score_model(model, X, y_true):
    """Calculate some metrics for a model and plot predicted values and residuals."""
    y_predict = model.predict(X)
    fig, (ax1, ax2) = plt.subplots(
        constrained_layout=True, ncols=2, figsize=(6, 3), sharex=True
    )
    r2 = r2_score(y_true, y_predict)

    try:
        coefficients = model.coef_
    except AttributeError:
        # A Pipeline has no coef_; read it from the regression step instead.
        reg = model.named_steps["regression"]
        coefficients = reg.coef_
    n = len(X)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - len(coefficients) - 1)

    mse = mean_squared_error(y_true, y_predict)
    ax1.scatter(y_predict, y_true)
    ax1.set_title(
        f"R² = {r2:.3f}, R²(adj) = {r2_adj:.3f},\nMSE = {mse:.3f}", loc="left"
    )
    ax1.set(xlabel="ŷ", ylabel="y")
    ax2.scatter(y_predict, y_true - y_predict)
    ax2.axhline(y=0, ls=":", color="k")
    ax2.set(xlabel="ŷ", ylabel="(y - ŷ)")
    ax2.set_title("Residuals", loc="left")

In [None]:
score_model(model1, X, y)
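
The adjusted R² shown in the plot title corrects R² for the number of fitted coefficients. As computed in `score_model` above, with $n$ samples and $p$ coefficients (the intercept is not counted), it reads

\begin{equation}
R^2_{\text{adj}} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}.
\end{equation}

With $n = 4177$ samples and $p = 7$ variables, the factor $(n - 1)/(n - p - 1) = 4176/4169 \approx 1.0017$, so the correction is tiny for this model. It becomes noticeable only when the number of coefficients is no longer small compared to the number of samples.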

In [None]:
def show_coefficients(model, variables=None, add_label=True):
    """Display coefficients for a linear model."""
    figi, axi = plt.subplots(constrained_layout=True)
    try:
        coefficients = model.coef_
    except AttributeError:
        # A Pipeline has no coef_; read it from the regression step instead.
        reg = model.named_steps["regression"]
        coefficients = reg.coef_
        # Attempt to generate variable names:
        poly = model.named_steps["polynomial"]
        variables = poly.get_feature_names_out(input_features=variables)

    pos = list(range(len(variables)))
    bars = axi.bar(pos, coefficients)
    if add_label:
        axi.bar_label(bars, fmt="{:.2f}")
    axi.axhline(y=0, ls=":", color="k")
    axi.set_xticks(pos)
    axi.set_xticklabels(variables, rotation=90);

In [None]:
show_coefficients(model1, variables=xvariables)

### Model 1.1: Does it help changing variables?

In [None]:
MY_SELECTION = ["length"]
y = data["age"]
X = data[MY_SELECTION]

model11 = LinearRegression(fit_intercept=True)
model11.fit(X, y)

show_coefficients(model11, variables=MY_SELECTION)
score_model(model11, X, y)

### Model 1.2: Does it help focusing on infants?

In [None]:
data2 = data[data["sex"] == "I"]

y = data2["age"]
X = data2[xvariables]

model12 = LinearRegression(fit_intercept=True)
model12.fit(X, y)
score_model(model12, X, y)

## Model 2: Adding higher order terms
The first linear models are not too impressive. We shall now try to add higher order terms and interactions.
Interactions are terms of the type (as an example) "length × diameter".

In [None]:
data_modified = data.copy()
data_modified["length * diameter"] = data["length"] * data["diameter"]

xvariables = [
    "length",
    "diameter",
    "height",
    "whole weight",
    "shucked weight",
    "viscera weight",
    "shell weight",
    "length * diameter",
]


X = data_modified[xvariables]
y = data_modified["age"]

model2 = LinearRegression(fit_intercept=True)
model2.fit(X, y)
show_coefficients(model2, variables=xvariables)
score_model(model2, X, y)

One way to add many higher order terms is to use [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from sklearn:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

In [None]:
# Add all second order terms and interactions

xvariables = [
    "length",
    "diameter",
    "height",
    "whole weight",
    "shucked weight",
    "viscera weight",
    "shell weight",
]

X = data[xvariables]
y = data["age"]

polynomial = PolynomialFeatures(degree=2, include_bias=False)
steps = [
    ("polynomial", polynomial),
    ("regression", LinearRegression(fit_intercept=True)),
]
model2 = Pipeline(steps=steps)
model2.fit(X, y)
score_model(model2, X, y)
# show_coefficients(model2, variables=xvariables, add_label=False)

## Checking the performance by using a training and test set
We have certainly added many variables now, but the R² value did not improve that much. When adding variables,
we might overfit our model. One way to check for this is to use a strategy with training and test sets. The main
idea is: we make our model on one part of the data (the training set), and test it on another (the test set).
The test set is not used when creating the model!
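
As a small illustration of why this matters, the sketch below fits a straight line and a degree-10 polynomial to synthetic data (not the abalone set) that truly follows a line plus noise. All names and numbers are assumptions chosen for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a straight line plus noise.
rng = np.random.default_rng(3)
x_demo = rng.uniform(-1.0, 1.0, size=(60, 1))
y_demo = 2.0 * x_demo[:, 0] + 0.3 * rng.normal(size=60)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.333, random_state=0
)

results = {}
for degree in (1, 10):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(x_tr, y_tr)  # only the training set is used for fitting
    r2_tr = r2_score(y_tr, model.predict(x_tr))
    r2_te = r2_score(y_te, model.predict(x_te))
    results[degree] = (r2_tr, r2_te)
    print(f"degree {degree:2d}: R²(train) = {r2_tr:.3f}, R²(test) = {r2_te:.3f}")
```

The more flexible model can never do worse on the training set, so the training R² alone cannot reveal overfitting. A clear gap between the training and test scores is the warning sign to look for.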

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

In [None]:
xvariables = [
    "length",
    "diameter",
    "height",
    "whole weight",
    "shucked weight",
    "viscera weight",
    "shell weight",
]


X = scale(data[xvariables])
y = scale(data["age"])

# Note: For scaling, we should fit the scaler to the training set
# and then apply it to the test set. The code above is a
# simplification.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333)
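
The note in the cell above can be made concrete. Here is a minimal sketch, on synthetic data, of splitting first and then fitting scikit-learn's `StandardScaler` on the training part only. The variable names are hypothetical and chosen so they do not clash with the notebook's own:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the abalone measurements.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300)

# Split first ...
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.333, random_state=0
)

# ... then learn the mean and standard deviation from the training set only
# and reuse them for the test set.
x_scaler = StandardScaler().fit(Xd_train)
Xd_train_s = x_scaler.transform(Xd_train)
Xd_test_s = x_scaler.transform(Xd_test)

# StandardScaler expects 2-D input, so y is reshaped to a single column.
y_scaler = StandardScaler().fit(yd_train.reshape(-1, 1))
yd_train_s = y_scaler.transform(yd_train.reshape(-1, 1)).ravel()
yd_test_s = y_scaler.transform(yd_test.reshape(-1, 1)).ravel()

print(Xd_train_s.mean(axis=0).round(3), Xd_test_s.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, which is expected: no information from the test set has leaked into the preprocessing.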

In [None]:
def score_train_test(model, X_train, X_test, y_train, y_test):
    """Do some scoring for models made with a test and training set."""
    y_train_predict = model.predict(X_train)
    y_test_predict = model.predict(X_test)
    r2_train = r2_score(y_train, y_train_predict)
    r2_test = r2_score(y_test, y_test_predict)
    mse_train = mean_squared_error(y_train, y_train_predict)
    mse_test = mean_squared_error(y_test, y_test_predict)
    fig, axes = plt.subplots(
        ncols=2, nrows=2, constrained_layout=True, sharex=True
    )

    axes[0, 0].scatter(y_train_predict, y_train)
    axes[0, 0].set_title(
        f"Training: R² = {r2_train:.3g}, MSE = {mse_train:.3g}"
    )

    axes[0, 1].scatter(y_test_predict, y_test)
    axes[0, 1].set_title(f"Test: R² = {r2_test:.3g}, MSE = {mse_test:.3g}")

    axes[0, 0].set(xlabel="ŷ", ylabel="y")
    axes[0, 1].set(xlabel="ŷ", ylabel="y")

    axes[1, 0].scatter(y_train_predict, y_train - y_train_predict)
    axes[1, 1].scatter(y_test_predict, y_test - y_test_predict)

    axes[1, 0].set(xlabel="ŷ", ylabel="y-ŷ")
    axes[1, 1].set(xlabel="ŷ", ylabel="y-ŷ")

In [None]:
model1 = LinearRegression(fit_intercept=False)
model1.fit(X_train, y_train)
score_train_test(model1, X_train, X_test, y_train, y_test)

In [None]:
steps = [
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("leastsquares", LinearRegression(fit_intercept=False)),
]
model2 = Pipeline(steps=steps)
model2.fit(X_train, y_train)
score_train_test(model2, X_train, X_test, y_train, y_test)

### Can alternative methods help us?

It can be a lot of work to compare different models and try different selections of variables. Let us
try an alternative, the [least absolute shrinkage and selection operator (LASSO)](https://en.wikipedia.org/wiki/Lasso_(statistics)).
This one modifies the error we minimize. In least squares we minimize the
squared errors,

\begin{equation}
J = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2,
\end{equation}

where $\hat{y}_i = b_0 + b_1 x_{i1} + \ldots = b_0 + \sum_{j=1}^m b_j x_{ij}$ is the prediction for sample $i$,
while in LASSO we minimize

\begin{equation}
J = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^m | b_j | .
\end{equation}

In practice, the extra term penalizes large coefficients, and the minimization can now find solutions where some of the $b_j$ are exactly zero,
meaning that the corresponding variables are not important for the model.
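
The selection effect is easiest to see on data where we know the answer. The sketch below uses synthetic data (an assumption for illustration, not the abalone set) in which only the first two of six variables influence $y$, and compares ordinary least squares with LASSO. Note that scikit-learn's `Lasso` uses a prefactor of $1/(2N)$ rather than $1/N$ on the squared-error term, so its `alpha` is not numerically identical to the $\alpha$ in the equation above:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 6))
# Only the first two columns influence y; the other four are irrelevant.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X_syn, y_syn)
lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)

print("least squares:", np.round(ols.coef_, 3))
print("LASSO:        ", np.round(lasso.coef_, 3))
print("coefficients set to zero by LASSO:", int(np.sum(lasso.coef_ == 0.0)))
```

Least squares assigns small but non-zero coefficients to the irrelevant variables, while LASSO sets them to exactly zero. The price is that the remaining coefficients are shrunk slightly towards zero.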

In [None]:
poly = PolynomialFeatures(degree=2, include_bias=False)

data_p = poly.fit_transform(data[xvariables])


data_poly = pd.DataFrame(
    data_p,
    columns=poly.get_feature_names_out(),
)


X = scale(data[xvariables])
y = scale(data["age"])

print(X.shape)

# Note: For scaling, we should fit the scaler to the training set
# and then apply it to the test set. The code above is a
# simplification.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333)

In [None]:
from sklearn.linear_model import Lasso

model3 = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
model3.fit(X_train, y_train)
score_train_test(model3, X_train, X_test, y_train, y_test)

In [None]:
show_coefficients(model3, variables=xvariables, add_label=False)

### Concluding remarks
OK, we do not have super impressive results. Maybe we should try something completely different?

What we have done with the training and test set is completely general. If we try other supervised
learning methods, we can still calculate $R^2$, the mean squared error, and use the training/testing strategy.
Here are some tests for three extra methods:

In [None]:
from sklearn.svm import SVR  # Support vector regression

model5 = SVR()
model5.fit(X_train, y_train)
score_train_test(model5, X_train, X_test, y_train, y_test)

In [None]:
from sklearn.neural_network import MLPRegressor  # A multi-layer Perceptron

model7 = MLPRegressor(
    max_iter=10000,
)
model7.fit(X_train, y_train)
score_train_test(model7, X_train, X_test, y_train, y_test)

In [None]:
from catboost import CatBoostRegressor

model8 = CatBoostRegressor()
model8.fit(X_train, y_train)
score_train_test(model8, X_train, X_test, y_train, y_test)