<h1>PLS Example</h1>

Author: Nathan A. Mahynski

Date: 2023/08/23

Description: This is an example of using PLS to build a regression model, following the procedure outlined in ["Detection of Outliers in Projection-Based Modeling," Rodionova and Pomerantsev, Analytical Chemistry 92 (2020) 2656-2664.](https://doi.org/10.1021/acs.analchem.9b04611)

Figure 1 from this paper illustrates the workflow:

![](https://raw.githubusercontent.com/mahynski/pychemauth/main/docs/jupyter/gallery/pls_example_fig1.png)

In [None]:
using_colab = 'google.colab' in str(get_ipython())
if using_colab:
    !git clone https://github.com/mahynski/pychemauth.git --depth 1
    !cd pychemauth; pip3 install .; cd ..

import pychemauth

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [None]:
import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd

In [None]:
%watermark -t -m -v --iversions

Load the Data
---

In [None]:
# Load the example training data from the repository's tests/ directory
if using_colab:
    loc = 'https://raw.githubusercontent.com/mahynski/pychemauth/main/tests/data/pls_train.csv'
else:
    loc = '../tests/data/pls_train.csv'
df = pd.read_csv(loc)

In [None]:
df.head()

In [None]:
raw_x = np.array(df.values[:,3:], dtype=float) # Extract features
raw_y = np.array(df['Water'].values, dtype=float) # Take the water content as the target

Model the Data with PLS
---

In [None]:
from pychemauth.regressor.pls import PLS

<h3>Training</h3>

In [None]:
model = PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)

In [None]:
_ = model.fit(raw_x, raw_y)

In [None]:
_ = model.visualize(raw_x, raw_y)

In [None]:
# We can predict the water content with model.predict(raw_x)
model.predict(raw_x)

In [None]:
# We can see what X data (if any) is extreme or is an outlier
extremes_mask, outliers_mask = model.check_x_outliers(raw_x)

In [None]:
# We could extract that data as follows:
extremes = raw_x[extremes_mask]
outliers = raw_x[outliers_mask]

In [None]:
# Number of X outliers, for example?
np.sum(outliers_mask)

In [None]:
# Number of X extremes, for example?
np.sum(extremes_mask)
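The cells above index `raw_x` with the returned masks and sum them to count flagged points. A minimal NumPy sketch of those operations follows. It assumes, as that indexing implies, that each mask is a boolean array with one entry per sample. The toy array and the mask values are made up for illustration and do not come from the model.

```python
import numpy as np

# Toy data: 4 samples (rows), 3 features (columns)
x = np.arange(12, dtype=float).reshape(4, 3)

# Hypothetical mask: True marks a flagged sample
mask = np.array([False, True, False, True])

n_flagged = int(np.sum(mask))        # True counts as 1, so this is the number flagged
flagged_rows = x[mask]               # rows where the mask is True
kept_rows = x[~mask]                 # "~" inverts the mask: rows that are not flagged
flagged_idx = np.flatnonzero(mask)   # integer positions of the flagged samples
```

`np.flatnonzero` is convenient when the flagged rows need to be traced back to the original DataFrame.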

In [None]:
# We can see what XY data (if any) is extreme or is an outlier
extremes_mask, outliers_mask = model.check_xy_outliers(raw_x, raw_y)

In [None]:
# Number of XY outliers, for example?
np.sum(outliers_mask)

In [None]:
# Number of XY extremes, for example?
np.sum(extremes_mask)

<h3>Testing</h3>

In [None]:
if using_colab:
    loc = 'https://raw.githubusercontent.com/mahynski/pychemauth/main/tests/data/pls_test.csv'
else:
    loc = '../tests/data/pls_test.csv'
df = pd.read_csv(loc, header=None)

raw_x_t = np.array(df.values[:,3:], dtype=float)
raw_y_t = np.array(df.values[:,2], dtype=float)

In [None]:
_ = model.visualize(raw_x_t, raw_y_t)

In [None]:
extremes_mask, outliers_mask = model.check_x_outliers(raw_x_t)

In [None]:
# Number of X extremes, for example?
np.sum(extremes_mask)

In [None]:
# Number of X outliers, for example?
np.sum(outliers_mask)

Optimizing the Model
---

In [None]:
# Here I've used an imblearn pipeline, but you can also use scikit-learn's pipeline if you don't want to 
# do any class balancing.

pipeline = imblearn.pipeline.Pipeline(steps=[
    # Insert other preprocessing steps here...
    # ("smote", ScaledSMOTEENN(random_state=1)), # For example, class balancing
    ("pls", PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)),
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    # 'smote__k_enn':[1, 2, 3],
    # 'smote__k_smote':[1, 3, 3],
    # 'smote__kind_sel_enn':['all', 'mode'],
    'pls__n_components':np.arange(1, 10),
    # 'pls__alpha':[0.07, 0.05, 0.03, 0.01],
    # 'pls__scale_x': [True, False]
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(raw_x, raw_y)

In [None]:
# The best parameters found can be accessed like this:
gs.best_params_

In [None]:
gs.best_score_ # The best mean CV score found (the default metric is R^2, the coefficient of determination)
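For reference, the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. A minimal NumPy sketch is given below. The helper name `r_squared` and the toy values are ours, not part of pychemauth.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of targets about their mean
    return 1.0 - ss_res / ss_tot


y_true = np.array([10.0, 11.0, 12.0, 13.0])

# A perfect prediction scores 1; always predicting the mean scores 0
perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))
```

A model that does worse than predicting the mean gives a negative value, so $R^2$ is bounded above by 1 but not below by 0.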

In [None]:
# You can see detailed CV results here
gs.cv_results_

In [None]:
# For a 1D optimization you can easily visualize where the best value is
plt.errorbar(gs.cv_results_['param_pls__n_components'].data, 
             gs.cv_results_['mean_test_score'], 
             yerr=gs.cv_results_['std_test_score'])
plt.xlabel('n_components')
plt.ylabel(r'$R^2$')

plt.axvline(gs.best_params_['pls__n_components'], color='red')

In [None]:
# With refit=True (the default), the best model is refit on the full training set, so gs can be used directly on the test set.
gs.score(raw_x_t, raw_y_t)

In [None]:
plt.plot(raw_y_t, gs.predict(raw_x_t), 'o')
plt.plot(np.linspace(10,14,100), np.linspace(10,14,100), 'k-')
plt.xlim(10,14)
plt.ylim(10,14)
_ = plt.axis('equal')
plt.xlabel('Actual Water Content')
plt.ylabel('Predicted Water Content')

Outlier Detection
---

Steps 1 and 2 of the workflow shown at the beginning of this document were handled by the cross-validation in the previous section. Now we can turn to optimizing the training set by removing outliers.

In [None]:
optimal_model = PLS(n_components=8, alpha=0.05, gamma=0.01, scale_x=True)
_ = optimal_model.fit(raw_x, raw_y)

<h3>Step 3</h3>

In [None]:
_ = optimal_model.visualize(raw_x, raw_y)

In [None]:
extremes, outliers = optimal_model.check_xy_outliers(raw_x, raw_y)

In [None]:
np.sum(outliers) # Indeed, we have 1 outlier

In [None]:
raw_x[outliers, :]

In [None]:
raw_y[outliers]

<h3>Step 4</h3>

In [None]:
# Select data that is NOT an outlier (regular and extreme points)
new_x = raw_x[~outliers]
new_y = raw_y[~outliers]

In [None]:
# After retraining the model we see there are no outliers - a "clean" training set
_ = optimal_model.fit(new_x, new_y)
_ = optimal_model.visualize(new_x, new_y)

<h3>Step 5</h3>

In [None]:
# _ = optimal_model.visualize(raw_x[outliers], raw_y[outliers])

The removed points are still flagged as outliers by the retrained model, so the model is "stable" and Step 6 is not required. This code always uses robust statistical methods to estimate internal parameters, so Step 7 is not performed.
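Steps 3 to 5 amount to a remove-and-refit loop followed by a check on the removed points. The sketch below shows that control flow on toy data. It is a simplified illustration only: `ToyLineModel`, its median/MAD residual rule, the threshold of 5, and `clean_training_set` are all hypothetical stand-ins, and they reproduce neither the statistics of the paper nor pychemauth's implementation.

```python
import numpy as np


class ToyLineModel:
    """Hypothetical stand-in for a model with fit() and an outlier check."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold  # illustrative cutoff in robust z-score units

    def fit(self, x, y):
        # Least-squares straight line through the training points
        self.slope_, self.intercept_ = np.polyfit(x, y, deg=1)
        resid = y - self._predict(x)
        # Robust location and scale of the training residuals (median and MAD)
        self.center_ = np.median(resid)
        self.scale_ = 1.4826 * np.median(np.abs(resid - self.center_)) + 1e-12
        return self

    def _predict(self, x):
        return self.slope_ * x + self.intercept_

    def check_outliers(self, x, y):
        # Boolean mask: True where the residual is far from the training center
        z = np.abs(y - self._predict(x) - self.center_) / self.scale_
        return z > self.threshold


def clean_training_set(x, y, max_iter=10):
    """Refit and drop flagged points until none are flagged (or max_iter is hit)."""
    keep = np.ones(len(y), dtype=bool)
    model = ToyLineModel()
    for _ in range(max_iter):
        model.fit(x[keep], y[keep])
        flagged = model.check_outliers(x[keep], y[keep])
        if not flagged.any():
            break
        # Map the mask over the kept subset back to positions in the full data
        keep[np.flatnonzero(keep)[flagged]] = False
    return model, ~keep


# Toy data: a noisy line with one gross error injected at index 7
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + 0.1 * np.sin(x)
y[7] += 50.0

model, removed = clean_training_set(x, y)

# Analogue of Step 5: are the removed points still flagged by the final model?
still_flagged = model.check_outliers(x[removed], y[removed])
```

Storing the residual center and scale at fit time is what makes the final check meaningful: the removed points are judged against the clean training set rather than against themselves.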

<h3>Automatic Loop</h3>

This iteration can be handled automatically by setting the `sft` argument.

In [None]:
optimal_model_2 = PLS(n_components=8, alpha=0.05, gamma=0.01, scale_x=True, 
                    sft=True
                   )
_ = optimal_model_2.fit(raw_x, raw_y)

In [None]:
optimal_model_2.sft_history # This is the point we removed manually