# Partial least squares regression "by hand"

Here, we will find the latent variables in partial least squares (PLS) regression by hand. As an
example, we will consider the solubility data again but in reduced form:

* We will only consider alcohols

* We use only the number of carbon atoms (nC) and the number of hydrogen atoms (nH) as our X

* We use the sum of the standardized nC and nH as our Y

Since Y is constructed directly from the two X-variables, we expect both of them to be strongly
correlated with it, which makes it easy to check each step of the method by eye.

## Loading the data

The data is in the file [solubility_alc.csv](./solubility_alc.csv):

In [None]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.linalg import svd
from numpy.linalg import norm



%matplotlib notebook
sns.set_theme(style="ticks", context="notebook", palette="muted")

In [None]:
data = pd.read_csv("solubility_descriptors.csv.zip")
data

In [None]:
from sklearn.preprocessing import scale

# Construct the response from the two X-variables: the sum of the
# standardized number of carbon atoms and hydrogen atoms.
data["nC + nH"] = scale(data["nC"]) + scale(data["nH"])

xvars = ["nC", "nH"]
yvars = ["nC + nH"]

# Standardize X and y (zero mean and unit variance for each column):
scaler_x, scaler_y = StandardScaler(), StandardScaler()
X = scaler_x.fit_transform(data[xvars].to_numpy())
y = scaler_y.fit_transform(data[yvars].to_numpy())

### Plotting X- and Y-data

Before we start, let us first plot the X- and Y-data:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], y[:, 0])  # y is a column vector, so pick its single column
ax.set(xlabel=xvars[0], ylabel=xvars[1], zlabel=yvars[0])

## The PLS method

In PLS, we are looking for scores $\mathbf{t}$ (for X) whose covariance with $\mathbf{y}$
is maximized. The main idea is that the scores should explain both the variance in X and the
covariance between X and y.

Since X and y are centered, the covariance is proportional to $\mathbf{t}^\top \mathbf{y}$
(the sample covariance is $\mathbf{t}^\top \mathbf{y} / (n - 1)$ for $n$ samples).
We shall see later how we can find the direction in X that maximizes this covariance,
but let us assume for now that we have found it. In PLS, this direction
is referred to as the *weights* $\mathbf{w}_x$ (for X).
We can use these *weights* to calculate the scores:


\begin{equation}
\mathbf{t} = \mathbf{X} \mathbf{w}_x
\end{equation}
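
As a quick numerical check of the two statements above, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the weight vector `w_demo` are made up for illustration and are not the solubility data. The sketch computes the scores and compares $\mathbf{t}^\top \mathbf{y}$ with the sample covariance reported by NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical data, centered column-wise (as StandardScaler would do):
X_demo = rng.normal(size=(n, 2))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([[1.0], [0.5]]) + 0.1 * rng.normal(size=(n, 1))
y_demo -= y_demo.mean(axis=0)

# An arbitrary unit-length weight vector (as a column vector):
w_demo = np.array([[0.6], [0.8]])

# Scores: t = X w
t_demo = X_demo @ w_demo  # shape (n, 1)

# Compare t^T y with (n - 1) times the sample covariance:
dot_product = (t_demo.T @ y_demo).item()
sample_cov = np.cov(t_demo[:, 0], y_demo[:, 0])[0, 1]
print(dot_product, (n - 1) * sample_cov)
```

The two printed numbers should agree up to floating-point rounding, because the scores of centered data are themselves centered.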

Assume now that we are going to predict $\mathbf{y}$ from some measured $\mathbf{X}$.
Ideally, the scores would be perfectly correlated with $\mathbf{y}$ so
that $\mathbf{y} = g \mathbf{t}$ where $g$ is some number.
In general, this is not the case. Instead, we approximate $\mathbf{y}$ from $\mathbf{t}$ using
this relation. The least squares estimate of $g$ is,

\begin{equation}
g = \frac{\mathbf{t}^\top \mathbf{y}}{\mathbf{t}^\top \mathbf{t}},
\end{equation}


so that the y approximated from the X-scores is $\hat{\mathbf{y}} = g \mathbf{t}$.
We can rewrite this as,
\begin{equation}
\hat{\mathbf{y}} = g \mathbf{t} = g \mathbf{X} \mathbf{w}_x,
\end{equation}


and comparing with a linear equation of the form $\hat{\mathbf{y}} = \mathbf{X} \mathbf{b}_\text{PLS}$ with regression
coefficients $\mathbf{b}_\text{PLS}$ we see that

\begin{equation}
\mathbf{b}_\text{PLS} = g \mathbf{w}_x .
\end{equation}
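
The relations for $g$ and $\mathbf{b}_\text{PLS}$ can be checked numerically. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `w_demo` are hypothetical and only serve as an illustration). It compares the closed-form $g$ with the slope from a generic least squares solver, and checks that $\mathbf{X} \mathbf{b}_\text{PLS}$ gives the same prediction as $g \mathbf{t}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centered data:
X_demo = rng.normal(size=(40, 2))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([[0.7], [0.3]]) + 0.05 * rng.normal(size=(40, 1))
y_demo -= y_demo.mean(axis=0)

# Some unit-length weights and the corresponding scores:
w_demo = np.array([[1.0], [1.0]]) / np.sqrt(2.0)
t_demo = X_demo @ w_demo

# Closed-form least squares slope, g = (t^T y) / (t^T t):
g_demo = (t_demo.T @ y_demo).item() / (t_demo.T @ t_demo).item()

# The same slope from a generic least squares solver (no intercept):
g_lstsq = np.linalg.lstsq(t_demo, y_demo, rcond=None)[0].item()

# Regression coefficients that act directly on X:
b_demo = g_demo * w_demo

print(g_demo, g_lstsq)
print(np.allclose(X_demo @ b_demo, g_demo * t_demo))
```

No intercept is needed here because both the scores and y are centered.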

## A short example

If we set $\mathbf{w}_x = (0.0, 1.0)^\top$, then
the product $\mathbf{X} \mathbf{w}_x$ just picks out
the second column of $\mathbf{X}$. A plot of
$\mathbf{t}$ vs. $\mathbf{y}$ then shows how the number of hydrogens is correlated with the (scaled) sum of carbons and hydrogens.

In [None]:
def make_plot(X, y, wx):
    """Plot X and t vs. y (calculated using wx)"""
    fig, axes = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4), sharex=True, sharey=True)
    axes[0].scatter(X[:, 0], X[:, 1])
    axes[0].set(xlabel=xvars[0], ylabel=xvars[1], title='X')

    vecx = wx.flatten()

    # Draw the direction of the weights in the X-plot:
    axes[0].quiver(0, 0, vecx[0], vecx[1], color='black',
                   angles='xy', scale_units='xy', scale=0.2, width=0.015)

    # Technical detail: we normalize wx here to get similar results from different methods
    t = X @ (wx / norm(wx))
    
    cov = t.T @ y

    axes[1].scatter(t[:, 0], y[:, 0])
    axes[1].set(
        xlabel='X-scores (t)',
        ylabel='y',
        title=f'Covariance: {cov[0][0]:.4g}')
    sns.despine(fig=fig)

In [None]:
# Weights by hand:
wx = np.array([0.0, 1.0])
wx = wx.reshape(2, -1)  # Make it a column vector
print(f"wx.T = {wx.T}, shape of w: {wx.shape}")

make_plot(X, y, wx)

In the plot above, the X-data is projected onto the black vector and this gives the X-scores.
The rightmost plot shows the X-scores plotted against y.

In [None]:
def get_regression_coefficients(X, y, wx):
    """Calculate the regression coefficients b = g * wx."""
    # Calculate the scores:
    t = X @ wx
    # Find g (approximate y from t):
    g = t.T @ y / (t.T @ t)
    # Find b:
    B = g * wx
    return B

In [None]:
B = get_regression_coefficients(X, y, wx)
print(B)

In the regression coefficients above, we see that the first row is just zero. Effectively this means that we
are predicting y-variables using only the number of hydrogens:

In [None]:
y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(y, y_hat, label=f"R² = {r2_score(y, y_hat):.3f}")
ax.legend()

## Maximizing the covariance

The covariance between $\mathbf{t}$ and $\mathbf{y}$ is proportional to


\begin{equation}
\mathbf{t}^\top \mathbf{y} = (\mathbf{X} \mathbf{w}_x)^\top (\mathbf{y})
= \mathbf{w}_x^\top \mathbf{X}^\top \mathbf{y}
\end{equation}

We set $\mathbf{v} = \mathbf{X}^\top \mathbf{y}$. Which vector $\mathbf{w}_x$ will
maximize the dot product $\mathbf{w}_x^\top \mathbf{v}$? For a fixed length of $\mathbf{w}_x$, the dot product
is largest when $\mathbf{w}_x$ is parallel to $\mathbf{v}$. So we can set: $\mathbf{w}_x = \lambda \mathbf{v} =
\lambda \mathbf{X}^\top \mathbf{y}$ for some positive number $\lambda$. If we at the same time normalize $\mathbf{w}_x$ to unit length,
then we don't need to find $\lambda$.
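
The argument above is the Cauchy-Schwarz inequality, $\mathbf{w}_x^\top \mathbf{v} \leq \lVert \mathbf{w}_x \rVert \, \lVert \mathbf{v} \rVert$, with equality when the two vectors are parallel. As a small illustration, the sketch below scans unit vectors on the circle and checks that none of them gives a larger dot product than the normalized $\mathbf{v}$. The data and the names `X_demo`, `y_demo`, `v_demo`, and `w_best` are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical centered data:
X_demo = rng.normal(size=(60, 2))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([[1.0], [-0.5]]) + 0.1 * rng.normal(size=(60, 1))
y_demo -= y_demo.mean(axis=0)

# The candidate for the best direction: v = X^T y, normalized
v_demo = X_demo.T @ y_demo  # shape (2, 1)
w_best = v_demo / np.linalg.norm(v_demo)
best_value = (w_best.T @ v_demo).item()  # this equals the length of v

# Scan unit vectors on the circle and evaluate w^T v for each of them:
angles = np.linspace(0.0, 2.0 * np.pi, 721)
candidates = np.stack([np.cos(angles), np.sin(angles)])  # shape (2, 721)
values = (candidates.T @ v_demo)[:, 0]

print(best_value, values.max())
```

The largest value found in the scan should be at most `best_value`, and close to it when one of the scanned angles is near the direction of $\mathbf{v}$.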

In [None]:
wx = X.T @ y
wx /= norm(wx)
make_plot(X, y, wx)

In [None]:
B = get_regression_coefficients(X, y, wx)
print(B)
y_hat = X @ B
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(y, y_hat, label=f"R² = {r2_score(y, y_hat):.3f}")
ax.legend()

## Comparing with `PLSRegression` from `sklearn`

In [None]:
model = PLSRegression(scale=False, n_components=2)
model.fit(X, y)
make_plot(X, y, model.x_weights_[:, 0].reshape(2, -1))
make_plot(X, y, model.x_weights_[:, 1].reshape(2, -1))
print(model.coef_)

In [None]:
#print(B / model.coef_)
y_hat = model.predict(X)
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(y, y_hat, label=f"R² = {r2_score(y, y_hat):.3f}")
ax.legend()

## Finding the next latent variables

So far, we have found one latent variable. The strategy to find the next PLS components is
to essentially repeat the process, but we first "subtract" the latent variable we just found
from $\mathbf{X}$ and $\mathbf{y}$ (and thereby from $\mathbf{X}^\top \mathbf{y}$). This is referred to as deflation.
Deflation also ensures that the next latent variable
we find will have scores orthogonal to the previous one. A consequence of this is
that the weights and scores we find in the next steps of the method do not
directly operate on $\mathbf{X}$
and $\mathbf{Y}$, but on the deflated versions of these matrices.
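
To make deflation concrete, here is a minimal sketch on synthetic data (all names ending in `_demo` and `_deflated` are hypothetical and not part of the solubility example). It finds the first component, deflates X and y, finds the second component from the deflated matrices, and checks that the two score vectors are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical centered data:
X_demo = rng.normal(size=(30, 2))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([[1.0], [0.2]]) + 0.3 * rng.normal(size=(30, 1))
y_demo -= y_demo.mean(axis=0)

# First component: weights and scores
w1 = X_demo.T @ y_demo
w1 /= np.linalg.norm(w1)
t1 = X_demo @ w1

# Deflate X: remove the part of X that is described by t1
p1 = (X_demo.T @ t1) / (t1.T @ t1)
X_deflated = X_demo - t1 @ p1.T

# Deflate y: remove the part of y that is predicted from t1
g1 = (t1.T @ y_demo) / (t1.T @ t1)
y_deflated = y_demo - g1 * t1

# Second component, found from the deflated matrices:
w2 = X_deflated.T @ y_deflated
w2 /= np.linalg.norm(w2)
t2 = X_deflated @ w2

# The scores should be orthogonal (zero up to rounding):
print((t1.T @ t2).item())
```

The orthogonality follows from the deflation of X alone: by construction, $\mathbf{t}_1^\top$ times the deflated X is zero, so any scores calculated from the deflated X are orthogonal to $\mathbf{t}_1$.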

It would be nice to have weights that we could apply directly
to $\mathbf{X}$ and $\mathbf{Y}$ (and not the deflated version).
PLS methods will typically calculate these as well, and in `sklearn`, they
are called *rotations*. We will use these later for interpretation of correlations between
variables. Here is an example:

In [None]:
# Find the first direction:
wx1 = X.T @ y
wx1 /= norm(wx1)
print(wx1 / model.x_weights_[:, 0].reshape(-1, 1))

# Calculate scores:
t1 = X @ wx1
# Calculate the predicted y:
g = t1.T @ y / (t1.T @ t1)
y_hat = g * t1
# Subtract the predicted y:
y2 = y - y_hat
# Calculate loadings to remove the part of
# X we have described with t:
p = (X.T @ t1) / (t1.T @ t1)
X2 = X - t1 @ p.T

# Find next direction:
wx2 = X2.T @ y2
wx2 /= norm(wx2)
print(wx2 / model.x_weights_[:, 1].reshape(-1, 1))
make_plot(X, y, wx1)
make_plot(X2, y2, wx2)
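# Convert the next direction to rotations (these act directly on X):
r2 = wx2 - wx1 @ p.T @ wx2
# Compare with the rotations from sklearn (the ratio should be close to 1 or -1):
print(r2 / model.x_rotations_[:, 1].reshape(-1, 1))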
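
The conversion to rotations can also be written in matrix form. If the weights are collected as columns in $\mathbf{W}$ and the loadings as columns in $\mathbf{P}$, a standard closed form is $\mathbf{R} = \mathbf{W} (\mathbf{P}^\top \mathbf{W})^{-1}$. The sketch below is an illustration on synthetic data (all names are hypothetical, and it does not use `sklearn`). It checks that this closed form reproduces the by-hand expression for the second rotation, and that $\mathbf{X} \mathbf{R}$ gives the scores without any deflation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical centered data:
X_demo = rng.normal(size=(30, 2))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([[1.0], [0.2]]) + 0.3 * rng.normal(size=(30, 1))
y_demo -= y_demo.mean(axis=0)

weights, loadings, scores = [], [], []
X_k, y_k = X_demo.copy(), y_demo.copy()
for _ in range(2):
    # Weights and scores for the current (deflated) matrices:
    w = X_k.T @ y_k
    w /= np.linalg.norm(w)
    t = X_k @ w
    # Loadings for X and the least squares slope for y:
    p = (X_k.T @ t) / (t.T @ t)
    g = (t.T @ y_k) / (t.T @ t)
    # Deflate before finding the next component:
    X_k = X_k - t @ p.T
    y_k = y_k - g * t
    weights.append(w)
    loadings.append(p)
    scores.append(t)

W = np.hstack(weights)   # shape (2, 2)
P = np.hstack(loadings)  # shape (2, 2)
T = np.hstack(scores)    # shape (30, 2)

# Rotations in closed form, and the second rotation by hand:
R = W @ np.linalg.inv(P.T @ W)
r2_by_hand = W[:, [1]] - W[:, [0]] @ P[:, [0]].T @ W[:, [1]]

print(np.allclose(X_demo @ R, T))          # scores directly from X
print(np.allclose(R[:, [1]], r2_by_hand))  # closed form vs. by hand
```

The first column of `R` is just the first weight vector, since no deflation has taken place when the first component is found.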