# Least squares for a more difficult case
Here, we will try to predict the compressive strength of concrete. 
The data is taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) and parts of it were used in
[this article](https://doi.org/10.1016/S0008-8846(98)00165-3).

The data set contains 1030 samples where the strength has been measured as a function of
the amounts of several components:

* *Cement*
* *Blast Furnace Slag*
* *Fly Ash*
* *Water*
* *Superplasticizer*
* *Coarse Aggregate*
* *Fine Aggregate*

and the *Age* measured in days. The UCI Machine Learning Repository page states that the strength is
a "highly nonlinear function of age and ingredients", but we will see how well linear models can do in this case.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_theme(style="ticks", context="notebook", palette="muted")
%matplotlib notebook

In [None]:
data = pd.read_csv("concrete.csv")

In [None]:
data.describe()

## Initial exploration - Scatter Plot Matrix & Heatmap

In [None]:
grid = sns.pairplot(data, kind="reg")

In [None]:
corr = data.corr()
corr.style.background_gradient(cmap="Blues")

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
sns.heatmap(corr, cmap="PiYG", vmin=-1, vmax=1, annot=True, ax=ax);

## Model 1: Least squares using all variables

In [None]:
from sklearn.preprocessing import scale

# Prepare the data: scale both y and X to zero mean and unit variance.
y = scale(data["Strength"].to_numpy())
variables = [i for i in data.columns if i != "Strength"]
X = scale(data[variables].to_numpy())
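
As a minimal sketch of what `scale` does, and of why the models below can be fitted with `fit_intercept=False`, the following uses synthetic stand-in data (the names `X_raw`, `y_raw`, `X_demo` and `y_demo` are assumptions for this illustration, not part of the concrete data set). After scaling, every column has mean 0 and standard deviation 1, and a fitted intercept would be numerically zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import scale

# Synthetic stand-in data (not the concrete data set).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(100, 2))
y_raw = 3.0 * X_raw[:, 0] - 0.1 * X_raw[:, 1] + rng.normal(size=100)

X_demo = scale(X_raw)
y_demo = scale(y_raw)

# After scaling, each column has mean 0 and standard deviation 1.
print("column means:", X_demo.mean(axis=0))
print("column standard deviations:", X_demo.std(axis=0))

# With centered X and y, the fitted intercept is numerically zero,
# so fitting without an intercept gives the same coefficients.
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)
without_intercept = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)
print("intercept:", with_intercept.intercept_)
print("coefficients:", with_intercept.coef_, without_intercept.coef_)
```

Note that scaling the full data set before splitting it means the test set influences the scaling; that is a simplification in this notebook.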

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
model1 = LinearRegression(fit_intercept=False)
model1.fit(X, y)
y_hat = model1.predict(X)
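
To make explicit what the least squares fit computes, here is a small sketch on synthetic data (the names `X_demo`, `y_demo`, `b_lstsq` and `b_normal` are assumptions for this illustration). It compares `LinearRegression(fit_intercept=False)` with NumPy's least squares solver and with a direct solution of the normal equations $X^\top X b = X^\top y$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the concrete data set).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

# Least squares solution from NumPy: minimizes ||y - X b||².
b_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# The same solution from the normal equations, (XᵀX) b = Xᵀy.
b_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print("scikit-learn:    ", model.coef_)
print("np.linalg.lstsq: ", b_lstsq)
print("normal equations:", b_normal)
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one, but a dedicated least squares solver is usually the more robust choice.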

In [None]:
def score_model(model, X, y_true):
    """Calculate some metrics for a model and plot predicted values and residuals."""
    y_predict = model.predict(X)
    fig, (ax1, ax2) = plt.subplots(
        constrained_layout=True, ncols=2, figsize=(8, 4), sharex=True
    )
    r2 = r2_score(y_true, y_predict)
    mse = mean_squared_error(y_true, y_predict)
    ax1.scatter(y_predict, y_true)
    ax1.set_title(f"R² = {r2:.3g}, MSE = {mse:.3g}")
    ax1.set(xlabel="ŷ", ylabel="y")
    ax2.scatter(y_predict, y_true - y_predict)
    ax2.axhline(y=0, ls=":", color="k")
    ax2.set(xlabel="ŷ", ylabel="y - ŷ")
    sns.despine(fig=fig)
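
The two metrics shown in the plot title can also be computed by hand. The sketch below uses a few made-up numbers (not taken from the concrete data) and checks the manual formulas, $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \mathrm{SS}_\mathrm{res} / \mathrm{SS}_\mathrm{tot}$, against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example values (not taken from the concrete data).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

residuals = y_true - y_pred
mse_manual = np.mean(residuals**2)

ss_res = np.sum(residuals**2)  # Residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # Total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R²: ", r2_manual, r2_score(y_true, y_pred))
```

Because `y` was scaled to unit variance, MSE and 1 - R² coincide when both are computed on the full data set.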

In [None]:
def show_coefficients(model, variables=None):
    """Display coefficients for a linear model."""
    fig, axi = plt.subplots(constrained_layout=True)
    try:
        coefficients = model.coef_
    except AttributeError:  # A Pipeline has no coef_; look inside its steps.
        reg = model.named_steps["regression"]
        coefficients = reg.coef_
        # Attempt to generate variable names:
        poly = model.named_steps["polynomial"]
        variables = poly.get_feature_names_out(input_features=variables)

    pos = list(range(len(variables)))
    axi.bar(pos, coefficients)
    axi.axhline(y=0, ls=":", color="k")
    axi.set_xticks(pos)
    axi.set_xticklabels(variables, rotation=90)
    sns.despine(fig=fig)

In [None]:
score_model(model1, X, y)

In [None]:
show_coefficients(model1, variables=variables)

## Model 2: Adding higher order terms
The first linear model is not too impressive. We shall now try to add higher order terms and interactions.
Interactions are products of two different variables, for example "age × water".

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

In [None]:
steps = [
    (
        "polynomial",
        PolynomialFeatures(degree=2, include_bias=False),
    ),  # Add all second order terms and interactions
    ("regression", LinearRegression(fit_intercept=False)),
]
model2 = Pipeline(steps=steps)
model2.fit(X, y)
score_model(model2, X, y)
show_coefficients(model2, variables=variables)

## Checking the performance by using a training and test set
We have certainly added many variables now, but the R² value did not improve that much. When adding variables,
we might overfit our model. One way to check for this is to use training and test sets. The main
idea is: we fit our model on one part of the data (the training set) and test it on another (the test set).
The test set is not used when creating the model!
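
The reason a separate test set is needed can be sketched with synthetic stand-in data of the same shape as the concrete data (the names `X_demo`, `y_demo`, `linear` and `quadratic` are assumptions for this illustration, and `random_state=0` is added here only to make the split reproducible). For ordinary least squares, the training R² cannot decrease when more terms are added, so only the test R² can reveal whether the extra terms actually help.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with the same shape as the concrete data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1030, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.5, size=1030)

# random_state makes the split reproducible (an addition for this sketch).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.333, random_state=0
)
print("Training set:", X_tr.shape, "Test set:", X_te.shape)

linear = LinearRegression().fit(X_tr, y_tr)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(X_tr, y_tr)

# Compare the training and test scores of the two nested models.
for name, model in [("linear", linear), ("quadratic", quadratic)]:
    r2_tr = r2_score(y_tr, model.predict(X_tr))
    r2_te = r2_score(y_te, model.predict(X_te))
    print(f"{name}: training R² = {r2_tr:.3f}, test R² = {r2_te:.3f}")
```

A training R² that is clearly higher than the test R² is a typical sign of overfitting.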

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333)

In [None]:
def score_train_test(model, X_train, X_test, y_train, y_test):
    """Do some scoring for models made with a test and training set."""
    y_train_predict = model.predict(X_train)
    y_test_predict = model.predict(X_test)
    r2_train = r2_score(y_train, y_train_predict)
    r2_test = r2_score(y_test, y_test_predict)
    mse_train = mean_squared_error(y_train, y_train_predict)
    mse_test = mean_squared_error(y_test, y_test_predict)
    fig, axes = plt.subplots(
        ncols=2, nrows=2, constrained_layout=True, sharex=True
    )

    axes[0, 0].scatter(y_train_predict, y_train)
    axes[0, 0].set_title(
        f"Training: R² = {r2_train:.3g}, MSE = {mse_train:.3g}"
    )

    axes[0, 1].scatter(y_test_predict, y_test)
    axes[0, 1].set_title(f"Test: R² = {r2_test:.3g}, MSE = {mse_test:.3g}")

    axes[0, 0].set(xlabel="ŷ", ylabel="y")
    axes[0, 1].set(xlabel="ŷ", ylabel="y")

    axes[1, 0].scatter(y_train_predict, y_train - y_train_predict)
    axes[1, 1].scatter(y_test_predict, y_test - y_test_predict)

    axes[1, 0].set(xlabel="ŷ", ylabel="y-ŷ")
    axes[1, 1].set(xlabel="ŷ", ylabel="y-ŷ")
    sns.despine(fig=fig)

In [None]:
model1 = LinearRegression(fit_intercept=False)
model1.fit(X_train, y_train)
score_train_test(model1, X_train, X_test, y_train, y_test)

In [None]:
steps = [
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("regression", LinearRegression(fit_intercept=False)),
]
model2 = Pipeline(steps=steps)
model2.fit(X_train, y_train)
score_train_test(model2, X_train, X_test, y_train, y_test)

## Model 3: Can LASSO help us?
Let us try another method to see if all the variables we have added are needed!

In [None]:
from sklearn.linear_model import Lasso

steps = [
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("regression", Lasso(alpha=0.04, fit_intercept=False)),
]
model3 = Pipeline(steps=steps)
model3.fit(X_train, y_train)
score_train_test(model3, X_train, X_test, y_train, y_test)

In [None]:
show_coefficients(model3, variables=variables)

Inspired by the results above, we try another least squares model, but with fewer variables:

In [None]:
data2 = data[
    [
        "Age",
        "Cement",
        "Slag",
    ]
].copy()  # Make a selection of variables here!
data2["Age²"] = data["Age"] ** 2  # Maybe the Age² should be used?
data2

In [None]:
X2 = scale(data2.to_numpy())

In [None]:
model4 = LinearRegression(fit_intercept=False)
model4.fit(X2, y)

In [None]:
score_model(model4, X2, y)

In [None]:
show_coefficients(model4, variables=data2.columns)

## Concluding remarks
OK, we do not have super impressive results. Maybe we should try something completely different?

What we have done with the training and test set is completely general. If we try other supervised
learning methods, we can still calculate $R^2$, the mean squared error, and use the training/testing strategy.
Here are some tests for three extra methods:

In [None]:
from sklearn.svm import SVR  # Support vector regression

model5 = SVR()
model5.fit(X_train, y_train)
score_train_test(model5, X_train, X_test, y_train, y_test)

In [None]:
from sklearn.tree import DecisionTreeRegressor  # A decision tree

model6 = DecisionTreeRegressor(max_depth=8)
model6.fit(X_train, y_train)
score_train_test(model6, X_train, X_test, y_train, y_test)

In [None]:
from sklearn.neural_network import MLPRegressor  # A multi-layer Perceptron

model7 = MLPRegressor(max_iter=1000)
model7.fit(X_train, y_train)
score_train_test(model7, X_train, X_test, y_train, y_test)
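
Since the training/test strategy is the same for every method, the comparison can also be written as a loop. This is a minimal sketch on synthetic, mildly nonlinear stand-in data (the data, the dictionary `candidates` and the `results` layout are assumptions for this illustration). It relies only on each model having `fit` and `predict` methods.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic, mildly nonlinear stand-in data (not the concrete data set).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
y_demo = (
    X_demo[:, 0] + np.sin(2.0 * X_demo[:, 1]) + rng.normal(scale=0.2, size=300)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.333, random_state=0
)

candidates = {
    "least squares": LinearRegression(),
    "support vector regression": SVR(),
    "decision tree": DecisionTreeRegressor(max_depth=8, random_state=0),
}

# Fit every model on the training set and score it on both sets.
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "r2_train": r2_score(y_tr, model.predict(X_tr)),
        "r2_test": r2_score(y_te, model.predict(X_te)),
        "mse_train": mean_squared_error(y_tr, model.predict(X_tr)),
        "mse_test": mean_squared_error(y_te, model.predict(X_te)),
    }
    print(name, {key: round(value, 3) for key, value in results[name].items()})
```

Comparing the training and test columns side by side makes it easier to spot methods that fit the training data well but generalize poorly.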