PLSR works less well when the features are uncorrelated, or when they are correlated only in isolated pairs (feature 1 only with feature 2, feature 3 only with feature 4, and so on).
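One rough way to check which of these situations a dataset is in is to look at the off-diagonal entries of the feature correlation matrix. The sketch below is illustrative only: the latent-factor construction of `X_corr`, the helper name `mean_abs_offdiag_corr`, and the fixed seed are assumptions made for the example, not part of the original notebook.

```python
import numpy as np

# Assumption: a fixed seed so the comparison is repeatable.
rng = np.random.default_rng(0)
n, p = 1000, 20

# Independent features, generated the same way as in the cells below.
X_indep = rng.normal(size=(n, p))

# Hypothetical correlated features: every column mixes the same
# 3 latent factors, plus some column-specific noise.
latent = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, p))
X_corr = latent @ loadings + 0.5 * rng.normal(size=(n, p))


def mean_abs_offdiag_corr(X):
    """Average absolute correlation between distinct pairs of columns."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).mean()


indep_score = mean_abs_offdiag_corr(X_indep)
corr_score = mean_abs_offdiag_corr(X_corr)
print('Mean |correlation|, independent features:', indep_score)
print('Mean |correlation|, latent-factor features:', corr_score)
```

A value near zero suggests the features share little structure for PLSR components to summarize. A larger value only tells you that correlation exists, not whether it is spread across many features or confined to pairs, so it is worth looking at the full matrix as well.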

The trick to successful PLSR is selecting the right number of components to keep. Fit new partial least squares regressions with different numbers of components, then see how each change affects how well the model reproduces the Y values compared with the regular linear regression.

Since this data is randomly generated, you can also play with it by changing how $y$ is computed, then observing how different relationships between $y$ and $X$ play out in PLSR.

In [1]:
import math
import warnings

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
from sklearn.cross_decomposition import PLSRegression

%matplotlib inline
sns.set_style('white')

# Suppress a harmless SciPy warning about the internal gelsd driver.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [4]:
# Number of datapoints
n = 1000

# Number of features
p = 20

# Create random, normally distributed data for the features
X = np.random.normal(size=(n, p))

# Create a normally distributed outcome related to the first two features, plus noise
Y = X[:, 0] + 2 * X[:, 1] + np.random.normal(size=n) + 5

# Fit a linear model w/ all features
regr = linear_model.LinearRegression()
regr.fit(X, Y)

y_pred = regr.predict(X)
print('R-squared regression:', regr.score(X, Y))

# Fit a PLSR model that keeps only 2 components
pls1 = PLSRegression(n_components=2)
pls1.fit(X, Y)

y_pls_pred = pls1.predict(X)
print('R-squared PLSR:', pls1.score(X, Y))

R-squared regression: 0.8250277978803279
R-squared PLSR: 0.8249167479474937


In [None]:
# Note: find another way to compute y, then see how the new y interacts with X