# Partial least squares regression "by hand"

Here, we will find the latent variables in partial least squares (PLS) regression by hand. As an
example, we will consider the solubility data again but in a reduced form:

* We will only consider alcohols

* We use only logP and the number of atoms in the molecule as our X

* We use the solubility and the molecular weight as our Y

From the regression we did on the complete data set, we know that logP is important for predicting the
solubility and we expect the molecular weight to correlate with the number of atoms.

## Loading the data

The data is in the file [solubility_alc.csv](./solubility_alc.csv):

In [None]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.linalg import svd
from numpy.linalg import norm


%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="muted")

In [None]:
data = pd.read_csv("solubility_alc.csv")
data

In [None]:
xvars = ["MolLogP", "nAtom"]
yvars = ["solubility (mol/L)", "Molecular Weight"]

scaler_x, scaler_y = StandardScaler(), StandardScaler()

Y = scaler_y.fit_transform(data[yvars].to_numpy())
X = scaler_x.fit_transform(data[xvars].to_numpy())

### Plotting X- and Y-data

Before we start, let us first plot the X- and Y-data:

In [None]:
fig, axes = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))
axes[0].scatter(X[:, 0], X[:, 1])
axes[0].set(xlabel=xvars[0], ylabel=xvars[1])
axes[0].set_title("X-data", loc="left")

axes[1].scatter(Y[:, 0], Y[:, 1])
axes[1].set(xlabel=yvars[0], ylabel=yvars[1])
axes[1].set_title("Y-data", loc="left")

sns.despine(fig=fig)

In [None]:
fig, axes = plt.subplots(
    constrained_layout=True, ncols=2, nrows=2, figsize=(8, 8)
)
for i in (0, 1):
    xi = X[:, i]
    for j in (0, 1):
        ax = axes[i, j]
        yj = Y[:, j]
        model = LinearRegression(fit_intercept=False)
        model.fit(xi.reshape(-1, 1), yj)
        y_hat = model.predict(xi.reshape(-1, 1))

        ax.scatter(xi, yj)
        ax.plot(
            xi,
            y_hat,
            label=f"R² = {r2_score(yj, y_hat):.3f}",
            color="red",
            lw=3,
        )
        ax.set(xlabel=xvars[i], ylabel=yvars[j])
        ax.set_title(f"Predicting {yvars[j]} from {xvars[i]}", loc="left")
        ax.legend()

sns.despine(fig=fig)

## The PLS method

In PLS, we are looking for scores $\mathbf{t}$ (for X) and $\mathbf{u}$ (for Y) such that the covariance
between them is maximised. The main idea is that the scores should explain the variance in X and Y and the
covariance between X and Y.

For centred data, the covariance is proportional to $\mathbf{t}^\top \mathbf{u}$ (the dot product between
the scores). We shall see later how we can find the directions in X and Y that maximise the covariance,
but let us assume for now that we have found them. In PLS, these directions
are referred to as the *weights* $\mathbf{w}_x$ (for X) and $\mathbf{w}_y$ (for Y).
We can use these *weights* to calculate the scores:


\begin{align}
\mathbf{t} &= \mathbf{X} \mathbf{w}_x, \\
\mathbf{u} &= \mathbf{Y} \mathbf{w}_y .
\end{align}

Assume that we will predict $\mathbf{Y}$ from some measured $\mathbf{X}$. Then, the relations
above do not seem too helpful since they do not directly relate $\mathbf{Y}$ and $\mathbf{X}$. So we need
to connect the scores.

Ideally, we would have a perfect correlation between the scores so
that $\mathbf{u} = g \mathbf{t}$ where $g$ is some number.
But in general, this is not the case. Instead, we will approximate $\mathbf{u}$ from $\mathbf{t}$ using
this relation. The least squares approximation for $g$ is,

\begin{equation}
g = \frac{\mathbf{t}^\top \mathbf{u}}{\mathbf{t}^\top \mathbf{t}},
\end{equation}


so that the Y-scores approximated from the X-scores are $\hat{\mathbf{u}} = g \mathbf{t}$. Note the following:

- When we train the model, we supply both $\mathbf{X}$ and $\mathbf{Y}$.
  These are used to find $\mathbf{w}_x$ and $\mathbf{w}_y$ and the scores, which
  are used to find $g$, connecting the scores.
- We supply only $\mathbf{X}$ when we use the model. We can then calculate the X-scores using $\mathbf{w}_x$. We then calculate the Y-scores (via $g$ calculated in training) from these new X-scores.
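
The least squares estimate of $g$ can be checked numerically. The sketch below uses made-up scores (not the solubility data) and compares the closed-form expression with a generic least squares solver; the variable names and the synthetic numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores: u is roughly 2 * t plus a little noise.
t = rng.normal(size=(50, 1))
u = 2.0 * t + 0.1 * rng.normal(size=(50, 1))

# Closed-form least squares estimate, g = t^T u / (t^T t).
g = (t.T @ u).item() / (t.T @ t).item()

# The same number from a generic least squares solver (no intercept).
g_lstsq = np.linalg.lstsq(t, u, rcond=None)[0].item()

# Y-scores approximated from the X-scores.
u_hat = g * t

print(f"g = {g:.4f}, lstsq = {g_lstsq:.4f}")
```

The residual $\mathbf{u} - \hat{\mathbf{u}}$ is orthogonal to $\mathbf{t}$, which is the defining property of the least squares fit.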


So we now have a strategy for estimating the Y-scores $\hat{\mathbf{u}}$ from the X-scores. We now only
need a way of calculating Y from the estimated Y-scores. If you remember principal component analysis (PCA),
we used the *loadings* for this. In PCA we write $\mathbf{T} = \mathbf{X} \mathbf{P}$ where $\mathbf{T}$ are
the scores and $\mathbf{P}$ are the loadings. Also, in PCA, we could invert this relation to
$\mathbf{X} = \mathbf{T} \mathbf{P}^\top$. In PLS, this is no longer valid for the weights, and PLS introduces
separate *loadings* to do this,


\begin{equation}
\mathbf{Y} = \hat{\mathbf{u}} \mathbf{q}^\top .
\end{equation}


The equation above tells us that if we know $\mathbf{q}$, then we can estimate $\mathbf{Y}$ from the
(estimated) Y-scores.
If we left-multiply this equation with $\hat{\mathbf{u}}^\top$ we get,


\begin{equation}
 \mathbf{q}^\top = \frac{\hat{\mathbf{u}}^\top \mathbf{Y}}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}}} \implies
 \mathbf{q} = \frac{\mathbf{Y}^\top \hat{\mathbf{u}}}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}}} .
\end{equation}


Using the estimated Y-scores ($\hat{\mathbf{u}} = g \mathbf{t}$) we get,


\begin{equation}
 \mathbf{q} = \frac{\mathbf{Y}^\top \hat{\mathbf{u}}}{\hat{\mathbf{u}}^\top \hat{\mathbf{u}}} =
 \frac{g \mathbf{Y}^\top \mathbf{t}}{g^2 \mathbf{t}^\top \mathbf{t}} \implies g  \mathbf{q} =
 \frac{\mathbf{Y}^\top \mathbf{t}}{\mathbf{t}^\top \mathbf{t}} .
\end{equation}

We now have everything in place to predict $\mathbf{Y}$ from $\mathbf{X}$. If we use $\mathbf{t} = \mathbf{X} \mathbf{w}_x$, then we can write,


\begin{equation}
\mathbf{Y} = \hat{\mathbf{u}} \mathbf{q}^\top = g \mathbf{t} \mathbf{q}^\top =
\mathbf{X} g \mathbf{w}_x \mathbf{q}^\top .
\end{equation}


Comparing this
to a linear equation of the form $\mathbf{Y} = \mathbf{X} \mathbf{B}_\text{PLS}$ with regression
coefficients $\mathbf{B}_\text{PLS}$, we see that

\begin{equation}
\mathbf{B}_\text{PLS} = g \mathbf{w}_x \mathbf{q}^\top .
\end{equation}
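
As a sanity check of the derivation, the sketch below builds $g$, $\mathbf{q}$ and $\mathbf{B}_\text{PLS}$ for made-up data and hand-picked weights (both are assumptions, not the solubility data) and confirms that predicting via the scores and via the coefficients gives the same result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up, column-centred data: n = 30 samples, 2 X- and 2 Y-variables.
X = rng.normal(size=(30, 2))
Y = X @ np.array([[1.0, 0.5], [-0.3, 2.0]]) + 0.1 * rng.normal(size=(30, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Arbitrary unit-length weights, chosen by hand for illustration.
wx = np.array([[1.0], [1.0]]) / np.sqrt(2.0)
wy = np.array([[0.0], [1.0]])

t = X @ wx  # X-scores, shape (n, 1)
u = Y @ wy  # Y-scores, shape (n, 1)

# Inner relation u ~ g * t.
g = (t.T @ u).item() / (t.T @ t).item()

# Loadings from g * q = Y^T t / (t^T t).
q = (Y.T @ t) / (t.T @ t).item() / g

# Regression coefficients B = g * wx * q^T, shape (2, 2).
B = g * wx @ q.T

Y_hat_scores = g * t @ q.T  # prediction via the scores
Y_hat_coeff = X @ B         # prediction via the coefficients
print(np.allclose(Y_hat_scores, Y_hat_coeff))
```

Since $\mathbf{B}_\text{PLS}$ is an outer product of two vectors, it has rank one: with a single latent variable, both Y-variables are predicted from the same combination of the X-variables.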


What we will do now is to create some $\mathbf{w}_x$ and $\mathbf{w}_y$ vectors by hand and check how
well the resulting scores covary (by calculating the covariance and by plotting the scores $\mathbf{t}$ vs. $\mathbf{u}$).
Both $\mathbf{X}$ and $\mathbf{Y}$ have
dimensions $n \times 2$ (we have $n$ samples and $2$ variables). So if $\mathbf{w}_x$ and $\mathbf{w}_y$
have dimensions $2 \times 1$ (that is, they are *column* vectors), then the scores will
have a dimension of $(n \times 2) \times (2 \times 1) = n \times 1$.
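
The shapes can be checked with a few lines of NumPy. This is a minimal sketch with made-up numbers; it also shows why it matters that the weights are stored as a column vector rather than as a one-dimensional array.

```python
import numpy as np

# Made-up data with n = 5 samples and 2 variables.
X = np.arange(10.0).reshape(5, 2)

# Weights as a column vector, shape (2, 1).
wx_col = np.array([1.0, 1.0]).reshape(2, 1) / np.sqrt(2.0)
t_col = X @ wx_col

# The same weights as a one-dimensional array, shape (2,).
wx_flat = wx_col.flatten()
t_flat = X @ wx_flat

print(X.shape, wx_col.shape, t_col.shape)  # scores are an (n, 1) column
print(wx_flat.shape, t_flat.shape)         # scores are one-dimensional
```

With the column vector, the scores keep a second axis, so expressions such as `t[:, 0]` and `t.T @ t` behave like the matrix formulas above.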

## A short example

We set $\mathbf{w}_x = (0.0, 1.0)^\top$ and $\mathbf{w}_y = (0.0, 1.0)^\top$. Then
the product $\mathbf{X} \mathbf{w}_x$ will just pick out
the second column of $\mathbf{X}$ (just the number of atoms) and $\mathbf{Y} \mathbf{w}_y$ will
pick out the second column of $\mathbf{Y}$ (just the molecular weight). Then a plot of
$\mathbf{t}$ vs $\mathbf{u}$ will show how the number of atoms is correlated with the
molecular weight.

In [None]:
def make_plot(X, Y, wx, wy):
    """Plot X, Y and t vs. u (calculated using wx and wy)"""
    fig, axes = plt.subplots(
        constrained_layout=True,
        ncols=3,
        figsize=(9, 3),
        sharex=True,
        sharey=True,
    )
    axes[0].scatter(X[:, 0], X[:, 1])
    axes[0].set(xlabel=xvars[0], ylabel=xvars[1], title="X")

    axes[1].scatter(Y[:, 0], Y[:, 1])
    axes[1].set(xlabel=yvars[0], ylabel=yvars[1], title="Y")

    vecx = wx.flatten()
    vecy = wy.flatten()

    vecx /= norm(vecx)
    vecy /= norm(vecy)

    axes[0].quiver(
        0,
        0,
        vecx[0],
        vecx[1],
        color="black",
        angles="xy",
        scale_units="xy",
        scale=0.2,
        width=0.015,
    )

    axes[1].quiver(
        0,
        0,
        vecy[0],
        vecy[1],
        color="red",
        angles="xy",
        scale_units="xy",
        scale=0.2,
        width=0.015,
    )

    t = X @ (wx / norm(wx))
    u = Y @ (wy / norm(wy))
    # Technical detail: we normalise both wx and wy here to get similar results from different methods

    cov = t.T @ u

    axes[2].scatter(t[:, 0], u[:, 0])
    axes[2].set(
        xlabel="X-scores (t)",
        ylabel="Y-scores (u)",
        title=f"Covariance: {cov[0][0]:.2f}",
    )
    sns.despine(fig=fig)

In [None]:
# Weights by hand:
wx = np.array([0.0, 1.0])
wx = wx.reshape(2, -1)  # Make it a column vector

wy = np.array([0.0, 1.0])
wy = wy.reshape(2, -1)  # Make it a column vector

print(f"wx.T = {wx.T}, shape of wx: {wx.shape}")
print(f"wy.T = {wy.T}, shape of wy: {wy.shape}")
make_plot(X, Y, wx, wy)

In the plot above, the X-data is projected onto the black vector, giving the X-scores. Similarly, the Y-data is projected onto the red vector, giving the Y-scores.
The rightmost plot shows the scores plotted against each other.

Here, the X-scores predict the Y-scores well. But our aim is to predict the whole $\mathbf{Y}$! If we convert the Y-scores to $\mathbf{Y}$, we would probably
predict the molecular weight quite well but fail at predicting the solubility. Let us check what the
regression coefficients are in this case:

In [None]:
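def get_regression_coefficients(X, Y, wx, wy):
    """Calculate B = g * wx * q^T from the weights wx and wy."""
    t = X @ wx  # X-scores
    u = Y @ wy  # Y-scores
    # Find g (approximate u from t):
    g = t.T @ u / (t.T @ t)
    # Find q (from g * q = Y^T t / (t^T t)):
    q = (Y.T @ t / (t.T @ t)) / g
    # Find B (the outer product of wx and q, scaled by g):
    B = g * wx * q.T
    return B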

In [None]:
B = get_regression_coefficients(X, Y, wx, wy)
print(B)

The regression coefficients above show that the first row is just zero. Effectively this means that we
are predicting *both* Y-variables using only the number of atoms. From one of the first figures in this
notebook, we expect this to give us an $R^2$ around 0.48. Let us check this:

In [None]:
Y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(
    Y[:, 0], Y_hat[:, 0], label=f"R² = {r2_score(Y[:, 0], Y_hat[:, 0]):.3f}"
)
ax.scatter(
    Y[:, 1], Y_hat[:, 1], label=f"R² = {r2_score(Y[:, 1], Y_hat[:, 1]):.3f}"
)
ax.legend()
sns.despine(fig=fig)

*Below, you will find the same code for creating $\mathbf{w}_x$ and $\mathbf{w}_y$ as above. Can you find
two other vectors that give a larger covariance?*

In [None]:
wx = np.array([0.0, 1.0])
wx = wx.reshape(2, -1)  # Make it a column vector
wx /= norm(wx)

wy = np.array([0.0, 1.0])
wy = wy.reshape(2, -1)  # Make it a column vector
wy /= norm(wy)

B = get_regression_coefficients(X, Y, wx, wy)
print(B)
make_plot(X, Y, wx, wy)
Y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(
    Y[:, 0], Y_hat[:, 0], label=f"R² = {r2_score(Y[:, 0], Y_hat[:, 0]):.3f}"
)
ax.scatter(
    Y[:, 1], Y_hat[:, 1], label=f"R² = {r2_score(Y[:, 1], Y_hat[:, 1]):.3f}"
)
ax.legend()
sns.despine(fig=fig)

## Maximizing the covariance

The covariance between $\mathbf{t}$ and $\mathbf{u}$ is


\begin{equation}
\mathbf{t}^\top \mathbf{u} = (\mathbf{X} \mathbf{w}_x)^\top (\mathbf{Y} \mathbf{w}_y)
= \mathbf{w}_x^\top \mathbf{X}^\top \mathbf{Y} \mathbf{w}_y
\end{equation}

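Maximizing this covariance (with $\mathbf{w}_x$ and $\mathbf{w}_y$ constrained to unit length) turns out to be the
same as finding the [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition)
of $\mathbf{X}^\top \mathbf{Y}$: the weights are the left and right singular vectors that belong to the
largest singular value. Let us also try this: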

In [None]:
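U, _, Vt = np.linalg.svd(X.T @ Y)  # Find the singular value decomposition.
wx = U[:, 0].reshape(2, -1)  # First left singular vector (X-weights)
wy = Vt[0, :].reshape(2, -1)  # First right singular vector (Y-weights)
# Flipping the sign of both vectors leaves the covariance unchanged:
make_plot(X, Y, -wx, -wy)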

In [None]:
B = get_regression_coefficients(X, Y, wx, wy)
print(B)
Y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(
    Y[:, 0], Y_hat[:, 0], label=f"R² = {r2_score(Y[:, 0], Y_hat[:, 0]):.3f}"
)
ax.scatter(
    Y[:, 1], Y_hat[:, 1], label=f"R² = {r2_score(Y[:, 1], Y_hat[:, 1]):.3f}"
)
ax.legend()
sns.despine(fig=fig)
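
The claim that the singular vectors maximise the covariance can be checked by brute force. The sketch below uses made-up data (an assumption, not the solubility data), scans all pairs of unit vectors on a one-degree grid, and compares the best covariance found with the one from the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up, column-centred data.
X = rng.normal(size=(40, 2))
Y = X @ np.array([[1.0, 0.2], [0.4, -1.5]]) + 0.2 * rng.normal(size=(40, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

M = X.T @ Y
U, s, Vt = np.linalg.svd(M)
wx_svd, wy_svd = U[:, 0], Vt[0, :]

# Covariance t^T u = wx^T X^T Y wy for the SVD weights.
cov_svd = wx_svd @ M @ wy_svd

# Brute force: unit vectors on a one-degree grid, all pairs at once.
angles = np.linspace(0.0, 2.0 * np.pi, 361)
W = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (361, 2)
cov_grid = (W @ M @ W.T).max()

print(f"SVD: {cov_svd:.4f}")
print(f"largest singular value: {s[0]:.4f}")
print(f"best on grid: {cov_grid:.4f}")
```

The covariance for the SVD weights equals the largest singular value, and no pair of unit vectors on the grid exceeds it.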

## Comparing with `PLSRegression` from `sklearn`

In [None]:
model = PLSRegression(scale=False, n_components=1)
model.fit(X, Y)
make_plot(X, Y, model.x_weights_, model.y_weights_)

In [None]:
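# Ratio between our coefficients and those from sklearn (expected to be close to 1).
# The transpose assumes coef_ has shape (n_targets, n_features), as in recent sklearn versions.
print(B / model.coef_.T)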
Y_hat = model.predict(X)
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(
    Y[:, 0], Y_hat[:, 0], label=f"R² = {r2_score(Y[:, 0], Y_hat[:, 0]):.3f}"
)
ax.scatter(
    Y[:, 1], Y_hat[:, 1], label=f"R² = {r2_score(Y[:, 1], Y_hat[:, 1]):.3f}"
)
ax.legend()
sns.despine(fig=fig)

## Finding the next latent variables

So far, we have found one latent variable. The strategy to find the next PLS components is
to essentially repeat the process, but we first "subtract" the latent variable we just found
from $\mathbf{X}^\top \mathbf{Y}$. This is referred to as deflation.
In addition, we make sure that the next latent variable
we find will have scores orthogonal to the previous one. A consequence of this is
that loadings and scores we find in the next steps of the method do not
directly operate on $\mathbf{X}$
and $\mathbf{Y}$, but on the deflated versions of these matrices.
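
A minimal sketch of one deflation step is given below. It assumes one common deflation scheme (subtracting from both matrices the part explained by the X-scores), which may differ in detail from what `sklearn` does internally; the data are made up and the helper `one_component` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up, column-centred data: 3 X-variables and 2 Y-variables.
X = rng.normal(size=(40, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.2 * rng.normal(size=(40, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)


def one_component(Xk, Yk):
    """Return X-weights, X-scores and X-loadings for one latent variable."""
    U, _, _ = np.linalg.svd(Xk.T @ Yk)
    w = U[:, [0]]              # X-weights, shape (3, 1)
    t = Xk @ w                 # X-scores, shape (n, 1)
    p = Xk.T @ t / (t.T @ t)   # X-loadings, shape (3, 1)
    return w, t, p


# First latent variable, found from the original matrices.
w1, t1, p1 = one_component(X, Y)

# Deflation: remove what the first X-scores explain in X and in Y.
X1 = X - t1 @ p1.T
Y1 = Y - t1 @ (Y.T @ t1 / (t1.T @ t1)).T

# Second latent variable, found from the deflated matrices.
w2, t2, p2 = one_component(X1, Y1)

print(f"t1 . t2 = {(t1.T @ t2).item():.2e}")
```

The second weight vector is applied to the deflated `X1`, not to `X`, and the resulting scores are orthogonal to the first ones up to rounding errors.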

It would be nice to have loadings that we could apply directly 
to $\mathbf{X}$ and $\mathbf{Y}$ (and not the deflated version).
PLS methods will typically calculate these as well, and in `sklearn`, they
are called *rotations*. We will use these later for the interpretation of correlations between
variables.

For the curious: In the paper
[SIMPLS: An alternative approach to partial least squares regression](https://doi.org/10.1016/0169-7439(93)85002-X)
pseudocode for a PLS algorithm is given. This algorithm is different from the algorithm `sklearn` uses
by default, and it shows how one can calculate the variance explained by the PLS components.