# Partial least squares regression "by hand"

Here, we will find the latent variables in partial least squares (PLS) regression by hand. As an
example we will consider the solubility data again but in reduced form:

* We will only consider alcohols

* We use only logP and the number of atoms in the molecule as our X

* We use the solubility and the molecular weight as our Y

From the regression we did on the full data set, we know that logP is important for predicting the
solubility, and we also expect that the molecular weight is correlated with the number of atoms.

## Loading the data

The data we will use can be found in the file [solubility_alc.csv](./solubility_alc.csv):

In [None]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.linalg import svd
from numpy.linalg import norm


%matplotlib notebook
sns.set_theme(style="ticks", context="notebook", palette="muted")

In [None]:
data = pd.read_csv("solubility_alc.csv")
data

In [None]:
xvars = ["MolLogP", "nAtom"]
yvars = ["measured log solubility in mols per litre", "Molecular Weight"]

scaler_x, scaler_y = StandardScaler(), StandardScaler()

Y = scaler_y.fit_transform(data[yvars].to_numpy())
X = scaler_x.fit_transform(data[xvars].to_numpy())
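
As a quick sanity check of what the scaling step does, the sketch below standardizes a small synthetic array with plain NumPy. It assumes that `StandardScaler` centers each column and divides by the population standard deviation (`ddof=0`); the array `raw` and the name `scaled` are made up for illustration and are not part of the solubility data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two raw columns on very different scales (illustration only).
raw = rng.normal(loc=[2.0, 20.0], scale=[1.5, 8.0], size=(50, 2))

# Column-wise standardization: subtract the mean, divide by the standard deviation.
# np.std uses ddof=0 by default.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print("column means:", scaled.mean(axis=0))
print("column stds: ", scaled.std(axis=0))
```

After scaling, every column has zero mean and unit variance, so neither variable dominates the dot products we compute later just because of its units.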

### Plotting X- and Y-data

Before we start, let us first plot the X- and Y-data:

In [None]:
fig, axes = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))
axes[0].scatter(X[:, 0], X[:, 1])
axes[0].set(xlabel=xvars[0], ylabel=xvars[1])
axes[0].set_title("X-data", loc="left")

axes[1].scatter(Y[:, 0], Y[:, 1])
axes[1].set(xlabel=yvars[0], ylabel=yvars[1])
axes[1].set_title("Y-data", loc="left")

sns.despine(fig=fig)

## Finding a latent variable for PLS


In PLS, we are looking for scores $\mathbf{t}$ (for X) and $\mathbf{u}$ (for Y) such that the covariance
between them is maximized. The main idea is that the scores should explain the variance in X, the variance in Y, and the
covariance between X and Y.

Since our data is centered, the covariance is proportional to $\mathbf{t}^\top \mathbf{u}$ (the dot product between
the scores). Further, the scores are calculated using the *weights* $\mathbf{w}$ (for X) and $\mathbf{q}$ (for Y):


\begin{align}
\mathbf{t} &= \mathbf{X} \mathbf{w} \\
\mathbf{u} &= \mathbf{Y} \mathbf{q}
\end{align}


Note that we refer to $\mathbf{w}$ and $\mathbf{q}$ as *weights*. In PCA, we can invert the relation
giving the scores by multiplying with the transpose of the *loadings*.
In PLS, this is no longer valid for the weights, and PLS introduces
separate *loadings* to do this. For instance, the loadings $\mathbf{p}$ for X are defined so that we can write $\mathbf{X} \approx \mathbf{t} \mathbf{p}^\top$.
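
To make the difference between weights and loadings concrete, here is a minimal sketch on synthetic data. It assumes the common PLS definition of the X-loadings as the least-squares regression of the columns of $\mathbf{X}$ on the scores, $\mathbf{p} = \mathbf{X}^\top \mathbf{t} / (\mathbf{t}^\top \mathbf{t})$; this formula is not stated above, and all names ending in `_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40

# Synthetic, centered X with two correlated columns (illustration only).
x1 = rng.normal(size=n)
X_demo = np.column_stack([x1, 0.8 * x1 + 0.3 * rng.normal(size=n)])
X_demo -= X_demo.mean(axis=0)

# An arbitrary unit-length weight vector and the scores it gives.
w_demo = np.array([[1.0], [1.0]]) / np.sqrt(2.0)
t_demo = X_demo @ w_demo  # shape (n, 1)

# Assumed definition of the loadings: regress each column of X on t.
p_demo = X_demo.T @ t_demo / (t_demo.T @ t_demo)  # shape (2, 1)

# Reconstruction error when going back to X with the weights vs. the loadings.
err_weights = np.linalg.norm(X_demo - t_demo @ w_demo.T)
err_loadings = np.linalg.norm(X_demo - t_demo @ p_demo.T)
print(f"error using w: {err_weights:.3f}, error using p: {err_loadings:.3f}")
```

For a given score vector, the loadings give the best rank-one reconstruction of $\mathbf{X}$ in the least-squares sense, so the error with $\mathbf{p}$ is never larger than the error with $\mathbf{w}$.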

What we will do now is create some $\mathbf{w}$ and $\mathbf{q}$ vectors by hand and check how large
$\mathbf{t}^\top \mathbf{u}$ is (by calculating it and by plotting the scores $\mathbf{t}$ vs. $\mathbf{u}$).
Both $\mathbf{X}$ and $\mathbf{Y}$ have
dimensions $n \times 2$ (we have $n$ samples and $2$ variables). So if $\mathbf{w}$ and $\mathbf{q}$
have dimensions $2 \times 1$ (that is, they are *column* vectors), then the scores will
have dimensions $(n \times 2) \times (2 \times 1) = n \times 1$.
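
The shape bookkeeping and the link between the dot product and the covariance can be checked with a few lines of NumPy. This is only a sketch on synthetic, centered data; the arrays and the weight vectors below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# Synthetic X and Y with two columns each, centered column-wise (illustration only).
X_demo = rng.normal(size=(n, 2))
Y_demo = X_demo @ np.array([[0.5, 1.0], [1.0, -0.2]]) + 0.1 * rng.normal(size=(n, 2))
X_demo -= X_demo.mean(axis=0)
Y_demo -= Y_demo.mean(axis=0)

# Arbitrary unit-length column vectors.
w_demo = np.array([0.6, 0.8]).reshape(2, 1)
q_demo = np.array([0.8, -0.6]).reshape(2, 1)

t_demo = X_demo @ w_demo  # (n x 2) @ (2 x 1) -> (n x 1)
u_demo = Y_demo @ q_demo
print(t_demo.shape, u_demo.shape)  # (30, 1) (30, 1)

# For centered scores, the dot product is n times the (biased) covariance.
dot = (t_demo.T @ u_demo).item()
cov = np.cov(t_demo[:, 0], u_demo[:, 0], bias=True)[0, 1]
print(f"t.T @ u = {dot:.4f}, n * cov(t, u) = {n * cov:.4f}")
```

Because the factor $n$ is the same for every choice of $\mathbf{w}$ and $\mathbf{q}$, maximizing $\mathbf{t}^\top \mathbf{u}$ and maximizing the covariance pick out the same vectors.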

## A short example

We set $\mathbf{w} = (0.0, 1.0)^\top$ and $\mathbf{q} = (0.0, 1.0)^\top$. Then
the product $\mathbf{X} \mathbf{w}$ will just pick out
the second column of $\mathbf{X}$ (just the number of atoms) and $\mathbf{Y} \mathbf{q}$ will
pick out the second column of $\mathbf{Y}$ (just the molecular weight). Then a plot of
$\mathbf{t}$ vs. $\mathbf{u}$ will show how the number of atoms is correlated with the
molecular weight.

In [None]:
def make_plot(X, Y, w, q):
    """Plot X, Y and t vs. u (calculated using w and q)"""
    fig, axes = plt.subplots(constrained_layout=True, ncols=3, figsize=(9, 3))
    axes[0].scatter(X[:, 0], X[:, 1])
    axes[0].set(xlabel=xvars[0], ylabel=xvars[1], title='X')

    axes[1].scatter(Y[:, 0], Y[:, 1])
    axes[1].set(xlabel=yvars[0], ylabel=yvars[1], title='Y')

    axes[0].quiver(0, 0, w[0][0], w[1][0], color='black',
                   angles='xy', scale_units='xy', scale=0.25, width=0.015)
    
    axes[1].quiver(0, 0, q[0][0], q[1][0], color='red',
                   angles='xy', scale_units='xy', scale=0.25, width=0.015)
    
    # Technical detail: we normalize both w and q here to get similar results from different methods
    t = X @ (w / norm(w))
    u = Y @ (q / norm(q))

    # For centered data, t.T @ u is proportional to the covariance between the scores
    covariance = t.T @ u

    axes[2].scatter(t[:, 0], u[:, 0])
    axes[2].set(
        xlabel='X-scores (t)',
        ylabel='Y-scores (u)',
        title=f'Covariance (t.T @ u): {covariance[0][0]:.2f}')
    sns.despine(fig=fig)

In [None]:
# Weights by hand:
w = np.array([0.0, 1.0])
w = w / norm(w)  # Normalize to unit vector
w = w.reshape(2, -1)  # Make it a column vector
    
q = np.array([0.0, 1.0])
q = q / norm(q)
q = q.reshape(2, -1)  # Make it a column vector

print(f"w.T = {w.T}, shape of w: {w.shape}")
print(f"q.T = {q.T}, shape of q: {q.shape}")
make_plot(X, Y, w, q)

In the plot above, the X-data is projected onto the black vector and this gives the X-scores (t). In the
same way, the Y-data is projected onto the red vector and this gives the Y-scores (u).
The rightmost plot shows the scores plotted against each other.

Here, we see that we can use the X-scores to predict the Y-scores, and we would probably make a good prediction.
But our aim is to predict the whole $\mathbf{Y}$! If we convert the Y-scores back to $\mathbf{Y}$, we would probably
predict the molecular weight quite well, but fail at predicting the solubility.

Below, you will find the same code for creating $\mathbf{w}$ and $\mathbf{q}$ as above. Can you find
two other vectors that give a larger value of $\mathbf{t}^\top \mathbf{u}$ (the number shown in the title of the rightmost plot)?

In [None]:
w = np.array([0.0, 1.0])
w = w / norm(w)  # Normalize to unit vector
w = w.reshape(2, -1)  # Make it a column vector
    
q = np.array([0.0, 1.0])
q = q / norm(q)
q = q.reshape(2, -1)  # Make it a column vector

make_plot(X, Y, w, q)
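
If trying vectors by hand gets tedious, one possible shortcut is a brute-force search. Since $\mathbf{w}$ and $\mathbf{q}$ are unit vectors in two dimensions, each can be described by a single angle. The sketch below scans a grid of angles and keeps the pair with the largest $\mathbf{t}^\top \mathbf{u}$. The helper `grid_search_weights` and the synthetic data are hypothetical and only meant as an illustration of the idea.

```python
import numpy as np


def grid_search_weights(X, Y, step_deg=1.0):
    """Scan unit vectors w and q (two-dimensional) and return the pair maximizing (Xw)^T (Yq)."""
    M = X.T @ Y  # 2 x 2 matrix, so that t.T @ u = w.T @ M @ q
    # Flipping the sign of both w and q does not change t.T @ u,
    # so w only needs to cover half a circle.
    angles_w = np.deg2rad(np.arange(0.0, 180.0, step_deg))
    angles_q = np.deg2rad(np.arange(0.0, 360.0, step_deg))
    W = np.column_stack([np.cos(angles_w), np.sin(angles_w)])  # one candidate w per row
    Q = np.column_stack([np.cos(angles_q), np.sin(angles_q)])  # one candidate q per row
    values = W @ M @ Q.T  # values[i, j] = w_i.T @ M @ q_j
    i, j = np.unravel_index(np.argmax(values), values.shape)
    return W[i].reshape(2, 1), Q[j].reshape(2, 1), values[i, j]


# Synthetic, centered data (illustration only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 2))
Y_demo = X_demo @ np.array([[1.0, 0.3], [-0.5, 0.8]]) + 0.2 * rng.normal(size=(60, 2))
X_demo -= X_demo.mean(axis=0)
Y_demo -= Y_demo.mean(axis=0)

w_best, q_best, value_best = grid_search_weights(X_demo, Y_demo)
print(f"w.T = {w_best.T}, q.T = {q_best.T}, t.T @ u = {value_best:.2f}")
```

With the notebook's own data, calling `grid_search_weights(X, Y)` and passing the returned vectors to `make_plot` should give a useful reference point for the hand-made vectors.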

## Making use of `PLSRegression` from `sklearn`

In [None]:
model = PLSRegression(scale=False, n_components=1)
model.fit(X, Y)
make_plot(X, Y, model.x_weights_, model.y_weights_)

## Calculating the weights

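One can show that maximizing the covariance over unit-length weights,


\begin{equation}
\max_{\lVert \mathbf{w} \rVert = \lVert \mathbf{q} \rVert = 1} \mathbf{t}^\top \mathbf{u}, \qquad
\mathbf{t}^\top \mathbf{u} = (\mathbf{X} \mathbf{w})^\top (\mathbf{Y} \mathbf{q})
= \mathbf{w}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{q}
\end{equation}


is the same as finding the
[singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition)
of $\mathbf{X}^\top \mathbf{Y}$: the maximizing $\mathbf{w}$ and $\mathbf{q}$ are the first left and right
singular vectors. Let us also try this: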

In [None]:
U, _, Vt = np.linalg.svd(X.T @ Y)  # Find the singular value decomposition.

w = U[:, 0].reshape(2, -1)  # First left singular vector (column of U)
q = Vt[0, :].reshape(2, -1)  # First right singular vector (row of V^T)

In [None]:
# Singular vectors are only defined up to a joint sign flip: -w and -q give the same t.T @ u
make_plot(X, Y, -w, -q)
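
The claims about the singular value decomposition can be checked numerically. The sketch below uses synthetic, centered data (not the solubility data) to confirm that the first pair of singular vectors gives a $\mathbf{t}^\top \mathbf{u}$ equal to the largest singular value, that flipping the sign of both vectors changes nothing, and that random unit vectors do not do better. All names ending in `_demo`, `_svd`, and `_rand` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic, centered data (illustration only).
X_demo = rng.normal(size=(50, 2))
Y_demo = X_demo @ np.array([[0.7, -0.4], [0.2, 1.1]]) + 0.3 * rng.normal(size=(50, 2))
X_demo -= X_demo.mean(axis=0)
Y_demo -= Y_demo.mean(axis=0)

M = X_demo.T @ Y_demo
U, s, Vt = np.linalg.svd(M)
w_svd = U[:, 0].reshape(2, 1)   # first left singular vector
q_svd = Vt[0, :].reshape(2, 1)  # first right singular vector

# t.T @ u for the singular vectors, and for the same vectors with flipped signs.
cov_svd = ((X_demo @ w_svd).T @ (Y_demo @ q_svd)).item()
cov_flipped = ((X_demo @ (-w_svd)).T @ (Y_demo @ (-q_svd))).item()

# t.T @ u for 1000 random pairs of unit vectors, for comparison.
W_rand = rng.normal(size=(1000, 2))
W_rand /= np.linalg.norm(W_rand, axis=1, keepdims=True)
Q_rand = rng.normal(size=(1000, 2))
Q_rand /= np.linalg.norm(Q_rand, axis=1, keepdims=True)
cov_rand = np.einsum("ij,jk,ik->i", W_rand, M, Q_rand)  # w_i.T @ M @ q_i for each pair

print(f"largest singular value:  {s[0]:.4f}")
print(f"t.T @ u (SVD vectors):   {cov_svd:.4f}")
print(f"t.T @ u (flipped signs): {cov_flipped:.4f}")
print(f"best of random pairs:    {cov_rand.max():.4f}")
```

The sign ambiguity is why two implementations can return weight vectors that point in opposite directions while describing the same latent variable.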