# Partial least squares regression "by hand" for 1D $\mathbf{y}$ (PLS1)

Here, we will find the latent variables in partial least squares (PLS) regression by hand. As an
example, we will consider the solubility data again, but we will use:

- the number of hydrogen atoms and carbon atoms as the X-variables

- the sum of hydrogen atoms and carbon atoms as the y-variable.

In this example, y consists of a single variable. PLS performed when y is a single variable is also
referred to as PLS1.

## Loading the data

In [None]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.linalg import svd
from numpy.linalg import norm


%matplotlib notebook
sns.set_theme(style="ticks", context="notebook", palette="muted")

In [None]:
data = pd.read_csv("solubility_descriptors.csv.zip")
data

In [None]:
data["nC + nH"] = data["nC"] + data["nH"]

xvars = [
    "nC",
    "nH",
]
yvars = ["nC + nH"]

scaler_x, scaler_y = StandardScaler(), StandardScaler()

y = scaler_y.fit_transform(data[yvars].to_numpy())
X = scaler_x.fit_transform(data[xvars].to_numpy())

### Plotting X- and Y-data

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], y)
ax.set(xlabel=xvars[0], ylabel=xvars[1], zlabel=yvars[0])

## The PLS method

In PLS, we are looking for scores $\mathbf{t}$ (for X) so that the covariance with $\mathbf{y}$
is maximised. The main idea is that the scores should explain the variance in X and the
covariance between X and y.

For centred data, the covariance is proportional to $\mathbf{t}^\top \mathbf{y}$ (the sample covariance is $\mathbf{t}^\top \mathbf{y} / (n - 1)$ for $n$ samples), so we can maximise this dot product instead.
We shall see later how to find the direction in X that maximises the covariance, but let us assume for now that we have found it. In PLS, this direction is referred to as the *weights* $\mathbf{w}$ (for X).
We can use these weights to calculate the scores:


\begin{equation}
\mathbf{t} = \mathbf{X} \mathbf{w}
\end{equation}
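
As a quick numerical check of the statements above, the sketch below compares $\mathbf{t}^\top \mathbf{y} / (n - 1)$ with the sample covariance from `np.cov`. It uses synthetic data (an assumption for illustration, not the solubility set) and an arbitrary unit-length weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the solubility set
n = 50
X = rng.normal(size=(n, 2))
y = X @ np.array([[1.0], [2.0]]) + 0.1 * rng.normal(size=(n, 1))

# Centre the columns (StandardScaler also scales; scaling is omitted here).
X = X - X.mean(axis=0)
y = y - y.mean(axis=0)

w = np.array([[0.6], [0.8]])  # an arbitrary unit-length weight vector
t = X @ w  # scores, shape (n, 1)

dot = (t.T @ y).item()
sample_cov = np.cov(t[:, 0], y[:, 0])[0, 1]  # np.cov uses the 1 / (n - 1) factor
print(dot / (n - 1), sample_cov)
```

The two printed numbers agree because the scores of centred X are themselves centred.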

Assume that we want to predict $\mathbf{y}$ from some measured $\mathbf{X}$.
Ideally, the scores would be perfectly correlated with $\mathbf{y}$, so
that $\mathbf{y} = g \mathbf{t}$ where $g$ is some number.
In general, this is not the case. Instead, we approximate $\mathbf{y}$ from $\mathbf{t}$ using
this relation. The least squares estimate of $g$ is

\begin{equation}
g = \frac{\mathbf{t}^\top \mathbf{y}}{\mathbf{t}^\top \mathbf{t}},
\end{equation}


so that the y approximated from the X-scores is $\hat{\mathbf{y}} = g \mathbf{t}$.
We can rewrite this as
\begin{equation}
\hat{\mathbf{y}} = g \mathbf{t} = g \mathbf{X} \mathbf{w},
\end{equation}


and comparing with a linear equation of the form $\hat{\mathbf{y}} = \mathbf{X} \mathbf{b}_\text{PLS}$ with regression
coefficients $\mathbf{b}_\text{PLS}$, we see that

\begin{equation}
\mathbf{b}_\text{PLS} = g \mathbf{w} .
\end{equation}
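
The relations above can be checked numerically. The following is a minimal sketch on synthetic data (an assumption, chosen only for illustration): it computes $g$ from the formula, compares it with the solution from `np.linalg.lstsq`, and confirms that $\mathbf{X} \mathbf{b}_\text{PLS}$ reproduces $g \mathbf{t}$.

```python
import numpy as np

rng = np.random.default_rng(1)  # synthetic data for illustration
X = rng.normal(size=(40, 2))
y = X @ np.array([[0.5], [1.5]]) + 0.2 * rng.normal(size=(40, 1))
X -= X.mean(axis=0)
y -= y.mean(axis=0)

w = np.array([[0.0], [1.0]])  # weights chosen by hand
t = X @ w  # scores

# Least squares estimate of g from the closed-form expression:
g = (t.T @ y).item() / (t.T @ t).item()
# The same estimate from a generic least squares solver:
g_lstsq = np.linalg.lstsq(t, y, rcond=None)[0].item()

b_pls = g * w  # regression coefficients
y_hat = X @ b_pls  # identical to g * t
print(g, g_lstsq)
```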

## A short example

We set $\mathbf{w} = (0.0, 1.0)^\top$. The product
$\mathbf{X} \mathbf{w}$ then just picks out
the second column of $\mathbf{X}$, so a plot of
$\mathbf{t}$ vs $\mathbf{y}$ shows how the sum of carbon and hydrogen atoms is correlated with the number of hydrogen atoms.

In [None]:
def make_plot(X, y, w):
    """Plot X and t vs. y (calculated using w)"""
    fig, axes = plt.subplots(
        constrained_layout=True,
        ncols=2,
        figsize=(8, 4),
        sharex=True,
        sharey=True,
    )
    axes[0].scatter(X[:, 0], X[:, 1])
    axes[0].set(xlabel=xvars[0], ylabel=xvars[1], title="X")

    vecx = w.flatten()

    axes[0].quiver(
        0,
        0,
        vecx[0],
        vecx[1],
        color="black",
        angles="xy",
        scale_units="xy",
        scale=0.2,
        width=0.015,
    )

    t = X @ (w / norm(w))

    cov = t.T @ y

    axes[1].scatter(t[:, 0], y[:, 0])
    axes[1].set(
        xlabel="X-scores (t)", ylabel="y", title=f"Covariance: {cov[0][0]:.4g}"
    )
    sns.despine(fig=fig)

In [None]:
# Weights by hand:
w = np.array([0.0, 1.0])
w = w.reshape(2, -1)  # Make it a column vector
print(f"w.T = {w.T}, shape of w: {w.shape}")

make_plot(X, y, w)

The X-data is projected onto the black vector in the left-hand plot above, giving the X-scores. The right-hand plot shows the X-scores plotted against y. Let us calculate the regression coefficients:

In [None]:
def get_regression_coefficients(X, y, w):
    """Calculates the regression coefficients given w."""
    t = X @ w
    # Find g (approximate y from t):
    g = t.T @ y / (t.T @ t)
    # Find b:
    B = g * w
    return B

In [None]:
B = get_regression_coefficients(X, y, w)
print(B)

In the regression coefficients above, we see that the first row is just zero. Effectively this means that we
are predicting y using only the number of hydrogens:

In [None]:
y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(y, y_hat, label=f"R² = {r2_score(y, y_hat):.3f}")
ax.legend()
ax.set(xlabel="y", ylabel="ŷ")
sns.despine(fig=fig)

## Maximizing the covariance

Up to a constant factor, the covariance between $\mathbf{t}$ and $\mathbf{y}$ is


\begin{equation}
\mathbf{t}^\top \mathbf{y} = (\mathbf{X} \mathbf{w})^\top (\mathbf{y})
= \mathbf{w}^\top \mathbf{X}^\top \mathbf{y}
\end{equation}

We set $\mathbf{v} = \mathbf{X}^\top \mathbf{y}$. Which vector $\mathbf{w}$ of a given length will
maximize the dot product $\mathbf{w}^\top \mathbf{v}$? For a fixed length, the dot product is largest when $\mathbf{w}$ is parallel
to $\mathbf{v}$. So we can set $\mathbf{w} = \lambda \mathbf{v} =
\lambda \mathbf{X}^\top \mathbf{y}$ for some positive number $\lambda$. If we simultaneously normalize $\mathbf{w}$ to unit length,
then we don't need to find $\lambda$ (it drops out in the normalization).

In [None]:
w = X.T @ y
w /= norm(w)
make_plot(X, y, w)

In [None]:
B = get_regression_coefficients(X, y, w)
print(B)
y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(y, y_hat, label=f"R² = {r2_score(y, y_hat):.3f}")
ax.set(xlabel="y", ylabel="ŷ")
ax.legend()
sns.despine(fig=fig)

## Comparing with `PLSRegression` from `sklearn`

In [None]:
model = PLSRegression(scale=False, n_components=1)
model.fit(X, y)
make_plot(X, y, model.x_weights_)

In [None]:
# The ratio is 1 for every coefficient if the two results agree:
print(B / model.coef_.T)

## Finding the next latent variables

So far, we have found one latent variable. The strategy to find the next PLS components is
essentially to repeat the process, but we first "subtract" the latent variable we just found
from $\mathbf{X}$ and $\mathbf{y}$. This is referred to as deflation.
Deflation also ensures that the next latent variable
we find will have scores orthogonal to the previous one. A consequence of this is
that the weights and scores we find in the next steps of the method do not
directly operate on $\mathbf{X}$, but on the deflated versions of it.
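
A single deflation step can be sketched as follows. This is a minimal illustration on synthetic data (an assumption), showing that the scores of the second component are orthogonal to those of the first.

```python
import numpy as np

rng = np.random.default_rng(3)  # synthetic data for illustration
X = rng.normal(size=(50, 3))
y = X @ np.array([[1.0], [0.5], [-1.0]]) + 0.1 * rng.normal(size=(50, 1))
X -= X.mean(axis=0)
y -= y.mean(axis=0)

# First component: weights, scores, loadings, and inner coefficient.
w1 = X.T @ y
w1 /= np.linalg.norm(w1)
t1 = X @ w1
p1 = X.T @ t1 / (t1.T @ t1)
g1 = (t1.T @ y) / (t1.T @ t1)

# Deflation: remove the part of X and y described by t1.
X1 = X - t1 @ p1.T
y1 = y - g1 * t1

# Second component, computed from the deflated data.
w2 = X1.T @ y1
w2 /= np.linalg.norm(w2)
t2 = X1 @ w2

overlap = (t1.T @ t2).item()
print(f"t1 . t2 = {overlap:.2e}")
```

The printed dot product is zero up to rounding error, since `t1.T @ X1` vanishes by construction.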

It would be nice to have weights that we could apply directly
to $\mathbf{X}$ (and not the deflated version).
PLS methods will typically calculate these as well, and in `sklearn`, they
are called *rotations*. We will use these later for the interpretation of correlations between
variables.
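
As a sketch of how such rotations can be obtained, the example below runs two PLS1 components on synthetic data (an assumption) and forms $\mathbf{R} = \mathbf{W} (\mathbf{P}^\top \mathbf{W})^{-1}$, where $\mathbf{W}$ holds the weights and $\mathbf{P}$ the loadings. It then checks that $\mathbf{X} \mathbf{R}$ reproduces the scores, even though the scores were computed from deflated data.

```python
import numpy as np

rng = np.random.default_rng(4)  # synthetic data for illustration
X = rng.normal(size=(50, 3))
y = X @ np.array([[1.0], [0.5], [-1.0]]) + 0.1 * rng.normal(size=(50, 1))
X -= X.mean(axis=0)
y -= y.mean(axis=0)

Xk, yk = X.copy(), y.copy()
W, P, T = [], [], []
for _ in range(2):
    w = Xk.T @ yk
    w /= np.linalg.norm(w)
    t = Xk @ w  # scores from the (deflated) data
    p = Xk.T @ t / (t.T @ t)  # loadings
    yk = yk - t * (t.T @ yk) / (t.T @ t)  # deflate y
    Xk = Xk - t @ p.T  # deflate X
    W.append(w)
    P.append(p)
    T.append(t)
W, P, T = np.hstack(W), np.hstack(P), np.hstack(T)

# Rotations: weights that act on the original X.
R = W @ np.linalg.inv(P.T @ W)
print(np.allclose(X @ R, T))
```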

Finally, here is an example of an implementation of the PLS regression method:

In [None]:
def run_pls1(X, y, n_components=2):
    """PLS1 regression given X and y."""
    Xk, yk = np.copy(X), np.copy(y)

    W = np.zeros((Xk.shape[1], n_components))
    P = np.zeros((Xk.shape[1], n_components))
    T = np.zeros((Xk.shape[0], n_components))
    G = np.zeros((1, n_components))

    for i in range(n_components):
        wxi = Xk.T @ yk
        wxi /= norm(wxi)
        # For convention, make largest value in wxi positive:
        idx = np.argmax(abs(wxi))
        sign = np.sign(wxi[idx])
        wxi *= sign
        # Store X-weights
        W[:, i] = wxi.flatten()

        # Calculate scores:
        t = Xk @ wxi
        # Regress yk on the scores (least squares):
        g = t.T @ yk / (t.T @ t)
        # Store the inner regression coefficient:
        G[:, i] = g
        # Subtract the predicted y:
        yk = yk - g * t
        # Calculate loadings to remove the part of
        # X we have described with t:
        p = (Xk.T @ t) / (t.T @ t)
        Xk = Xk - t @ p.T
        # Store X-loadings and X-scores:
        P[:, i] = p.flatten()
        T[:, i] = t.flatten()

    R = W @ np.linalg.pinv(P.T @ W)  # Rotations, can be used for T = X @ R
    B = R @ G.T  # Regression coefficients, for Ŷ = X @ B
    return W, P, T, R, B

In [None]:
# We test the method above by comparing with sklearn:
xvars = ["MolLogP", "nAtom", "Polar Surface Area", "MW"]
yvars = ["Molecular Weight"]
scaler_x, scaler_y = StandardScaler(), StandardScaler()

y = scaler_y.fit_transform(data[yvars].to_numpy())
X = scaler_x.fit_transform(data[xvars].to_numpy())

model = PLSRegression(scale=False, n_components=4)
model.fit(X, y)
W, P, T, R, B = run_pls1(X, y, n_components=model.n_components)
print(np.allclose(R, model.x_rotations_))
print(np.allclose(W, model.x_weights_))
print(np.allclose(P, model.x_loadings_))
print(np.allclose(T, model.x_scores_))
print(np.allclose(B, model.coef_.T))