<a href="https://colab.research.google.com/github/mahynski/chemometric-carpentry/blob/main/notebooks/3_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
❓ ***Objective***: This notebook will introduce some pre-processing steps common in chemometric pipelines.  

🔁 ***Remember***: You can always revisit this notebook for reference. Ideas and best practices will be reinforced in later notebooks, so don't worry about remembering everything the first time you see something new.

🧑 Author: Nathan A. Mahynski

📆 Date: May 9, 2024

---

<img src="https://pychemauth.readthedocs.io/en/latest/_images/pipeline.png" height=200 align="right"/>

Pre-processing consists of every step in the pipeline except the last one, which is the model.

The PyChemAuth [preprocessing subpackage](https://pychemauth.readthedocs.io/en/latest/pychemauth.preprocessing.html#pychemauth-preprocessing-package) contains a number of different pre-processing steps useful in chemometric applications.

sklearn has a nice explanation of its preprocessing tools [here](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing). [Dimensionality reduction](https://scikit-learn.org/stable/modules/decomposition.html#decompositions) can also be treated as a preprocessing step that we include in the modeling pipeline (as shown).
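
As a minimal sketch of this idea, the cell below chains a scaler and a dimensionality reduction step in front of a model using sklearn's `Pipeline`. The step names, the choice of `LogisticRegression` as the final model, and the train/test split are illustrative assumptions, not choices made elsewhere in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumed for this sketch only)
X_demo, y_demo = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Every step except the last is pre-processing; the last step is the model
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),      # pre-processing: autoscale each feature
        ("reduce", PCA(n_components=2)),  # pre-processing: dimensionality reduction
        ("model", LogisticRegression()),  # final step: the model
    ]
)

# Fitting the pipeline fits each pre-processing step on the training data only
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Because the scaler and PCA live inside the pipeline, their parameters are learned from the training data alone and then reused, unchanged, on the test data.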

In [None]:
%pip install git+https://github.com/mahynski/pychemauth@main

In [4]:
import sklearn
import pychemauth
import pandas as pd

import watermark
%load_ext watermark
%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.58+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

pandas    : 1.5.3
watermark : 2.4.3
pychemauth: 0.0.0b4
sklearn   : 1.3.0



# Scaling and Centering

In many instances, features of the data being used for modeling are recorded in different units, or have very different scales.  Consider the classic wine dataset available from sklearn.

In [5]:
from sklearn.datasets import load_wine
data = load_wine(as_frame=True)

# print(data['DESCR'])
X = data['data']

In [6]:
X

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [23]:
X.describe()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


The term "autoscaling" refers to the subtraction of a column's mean and division by its sample standard deviation.

$$
X_{\rm auto} = \frac{X - \bar{X}}{s_X}
$$

This puts all features on the same scale and centers each distribution around 0. Models involving PCA generally require the matrix to be centered (even if it is not scaled); in fact, sklearn's [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) centers the input automatically, although it does not scale it.
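
A quick way to see the automatic centering is to fit PCA on the raw data and on manually centered data and compare the results. This is a small sketch; the variable names are illustrative, and absolute values are compared so that a possible sign flip of a component does not matter.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X_raw = load_wine(as_frame=True)["data"]
X_centered = X_raw - X_raw.mean()  # subtract each column's mean by hand

pca_raw = PCA(n_components=2).fit(X_raw)
pca_centered = PCA(n_components=2).fit(X_centered)

# Centering by hand changes nothing, because PCA already centers internally
same_variance = np.allclose(
    pca_raw.explained_variance_, pca_centered.explained_variance_
)
same_axes = np.allclose(
    np.abs(pca_raw.components_), np.abs(pca_centered.components_)
)
print(same_variance, same_axes)

# Without scaling, the feature with the largest variance dominates the first component
dominant = X_raw.columns[np.argmax(np.abs(pca_raw.components_[0]))]
print(dominant)
```

The last line illustrates why scaling is usually still worth doing explicitly: on unscaled data the first component mostly tracks the feature with the largest numerical spread.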

In [15]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler(with_mean=True, with_std=True)
X_auto = pd.DataFrame(s.fit_transform(X), columns=X.columns)

In [16]:
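# mean is ~0; StandardScaler divides by the population std (ddof=0), while
# describe() reports the sample std (ddof=1), so "std" shows 1.002821 rather than 1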
X_auto.describe()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,-8.382808e-16,-1.197544e-16,-8.370333e-16,-3.991813e-17,-3.991813e-17,0.0,-3.991813e-16,3.592632e-16,-1.197544e-16,2.4948830000000002e-17,1.995907e-16,3.19345e-16,-1.596725e-16
std,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821
min,-2.434235,-1.432983,-3.679162,-2.671018,-2.088255,-2.107246,-1.695971,-1.868234,-2.069034,-1.634288,-2.094732,-1.895054,-1.493188
25%,-0.7882448,-0.6587486,-0.5721225,-0.6891372,-0.8244151,-0.885468,-0.8275393,-0.7401412,-0.5972835,-0.7951025,-0.7675624,-0.9522483,-0.7846378
50%,0.06099988,-0.423112,-0.02382132,0.001518295,-0.1222817,0.09596,0.1061497,-0.1760948,-0.06289785,-0.1592246,0.03312687,0.2377348,-0.2337204
75%,0.8361286,0.6697929,0.6981085,0.6020883,0.5096384,0.808997,0.8490851,0.6095413,0.6291754,0.493956,0.7131644,0.7885875,0.7582494
max,2.259772,3.109192,3.156325,3.154511,4.371372,2.539515,3.062832,2.402403,3.485073,3.435432,3.301694,1.960915,2.971473


In [18]:
test = (X['alcohol'] - X['alcohol'].mean()) / X['alcohol'].std(ddof=0)
test.describe()

count    1.780000e+02
mean    -8.382808e-16
std      1.002821e+00
min     -2.434235e+00
25%     -7.882448e-01
50%      6.099988e-02
75%      8.361286e-01
max      2.259772e+00
Name: alcohol, dtype: float64
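
The `std` of 1.002821 in the output above comes from a difference in conventions rather than from the scaling itself: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The sketch below, with illustrative variable names, checks that the two differ by a factor of $\sqrt{n/(n-1)}$.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

alcohol = load_wine(as_frame=True)["data"]["alcohol"]
n = len(alcohol)  # 178 samples

# Autoscale the single column with sklearn
scaled = StandardScaler().fit_transform(alcohol.to_frame()).ravel()

std_population = scaled.std(ddof=0)      # convention used by StandardScaler
std_sample = scaled.std(ddof=1)          # convention used by DataFrame.describe()
expected_ratio = np.sqrt(n / (n - 1))    # factor relating the two conventions

print(std_population, std_sample, expected_ratio)
```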

However, this sort of scaling can be strongly influenced by outliers, so a more robust method is to use the median and interquartile range instead.

$$
X_{\rm rob} = \frac{X - {\rm med}(X)}{{\rm iqr}(X)}
$$
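
To illustrate the sensitivity to outliers, the sketch below fits both scalers on synthetic data with and without a single gross outlier and compares the fitted scale factors. The data are randomly generated for this illustration only and are not part of the wine dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic single-feature data (an assumption for this sketch)
rng = np.random.default_rng(0)
clean = rng.normal(loc=10.0, scale=1.0, size=(100, 1))

contaminated = clean.copy()
contaminated[0, 0] = 1000.0  # replace one point with a gross outlier

# Ratio of the fitted scale factor with vs. without the outlier
std_ratio = (
    StandardScaler().fit(contaminated).scale_[0]
    / StandardScaler().fit(clean).scale_[0]
)
iqr_ratio = (
    RobustScaler().fit(contaminated).scale_[0]
    / RobustScaler().fit(clean).scale_[0]
)

print(f"standard deviation changed by a factor of {std_ratio:.1f}")
print(f"interquartile range changed by a factor of {iqr_ratio:.2f}")
```

A single bad point inflates the standard deviation dramatically, which compresses all of the other points toward zero after autoscaling, whereas the interquartile range barely moves.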


In [26]:
from sklearn.preprocessing import RobustScaler

s = RobustScaler(with_centering=True, with_scaling=True)
X_rob = pd.DataFrame(s.fit_transform(X), columns=X.columns)

In [28]:
# median (the 50% row) is now 0, up to floating-point error
X_rob.describe()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,-0.037553,0.3184786,0.018754,-0.001176,0.091662,-0.056631,-0.06331158,0.130471,0.05128411,0.1235201,-0.022372,-0.136564,0.151482
std,0.617359,0.7548284,0.789479,0.776643,0.75171,0.591821,0.598119,0.743005,0.8176555,0.7779483,0.677249,0.576057,0.649964
min,-1.536122,-0.7601351,-2.877698,-2.069767,-1.473684,-1.300236,-1.07485,-1.253731,-1.635714,-1.144295,-1.437037,-1.225152,-0.816305
25%,-0.522814,-0.1773649,-0.431655,-0.534884,-0.526316,-0.579196,-0.5568862,-0.41791,-0.4357143,-0.4932886,-0.540741,-0.68357,-0.357069
50%,0.0,-7.502679000000001e-17,0.0,0.0,0.0,0.0,1.327063e-16,0.0,-1.587272e-16,1.491862e-16,0.0,0.0,0.0
75%,0.477186,0.8226351,0.568345,0.465116,0.473684,0.420804,0.4431138,0.58209,0.5642857,0.5067114,0.459259,0.31643,0.642931
max,1.353612,2.658784,2.503597,2.44186,3.368421,1.44208,1.763473,1.910448,2.892857,2.788591,2.207407,0.989858,2.077399
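
As with autoscaling, the result can be checked by hand for a single column. The sketch below assumes sklearn's default quantile range of the 25th to 75th percentile and uses illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import RobustScaler

alcohol = load_wine(as_frame=True)["data"]["alcohol"]

# Median and interquartile range computed directly
q25, q50, q75 = np.percentile(alcohol, [25, 50, 75])
manual = (alcohol - q50) / (q75 - q25)

# Same column scaled by sklearn with its default settings
from_sklearn = RobustScaler().fit_transform(alcohol.to_frame()).ravel()

matches = np.allclose(manual, from_sklearn)
print(matches, manual.max())
```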


There are other normalizers available in [sklearn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) as well.  However, in chemometric applications we often encounter two differences:

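1. Some more rigorous practitioners use the [corrected sample standard deviation](https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation) when autoscaling, which divides by $n-1$ instead of $n$ (reducing the degrees of freedom by one); sklearn's StandardScaler divides by $n$.
2. Pareto scaling divides by the square root of the standard deviation instead, which leaves "some" scale information behind: features with larger variance still have larger variance after scaling, but the differences are compressed.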
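
In symbols, for a column with $n$ samples, these two modifications can be written as follows. This is a sketch of the usual textbook definitions in the notation used above, where $s_X^{\rm corr}$ is a label introduced here for the corrected standard deviation; it is not taken from the PyChemAuth source.

$$
s_X^{\rm corr} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}, \qquad X_{\rm pareto} = \frac{X - \bar{X}}{\sqrt{s_X^{\rm corr}}}
$$

After Pareto scaling, a column's standard deviation is $\sqrt{s_X^{\rm corr}}$ rather than 1.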

PyChemAuth provides a [CorrectedScaler](https://pychemauth.readthedocs.io/en/latest/pychemauth.preprocessing.html#pychemauth.preprocessing.scaling.CorrectedScaler) and a [RobustScaler](https://pychemauth.readthedocs.io/en/latest/pychemauth.preprocessing.html#pychemauth.preprocessing.scaling.RobustScaler), which are analogs of sklearn's StandardScaler and RobustScaler but have `biased` and `pareto` options to perform the above modifications.

In [1]:
from pychemauth.preprocessing.scaling import CorrectedScaler, RobustScaler
?CorrectedScaler

In [10]:
s = CorrectedScaler(with_mean=True, with_std=True, pareto=True, biased=False)
X_cor = pd.DataFrame(s.fit_transform(X), columns=X.columns)

In [11]:
# mean is 0, but with pareto=True each column's std is now the square root
# of its original (corrected) std rather than 1
X_cor.describe()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,-7.584445e-16,-7.983626e-17,-4.241301e-16,-1.596725e-16,0.0,-7.983626e-17,-3.592632e-16,1.097749e-16,-1.197544e-16,6.985673e-17,9.979533e-17,2.794269e-16,-6.386901e-16
std,0.9010142,1.056951,0.5237786,1.827447,3.779217,0.7911075,0.9994292,0.3527794,0.756544,1.522592,0.4780916,0.8426093,17.74563
min,-2.187111,-1.510333,-1.921646,-4.867415,-7.869771,-1.662369,-1.690234,-0.6572206,-1.560912,-2.481354,-0.9986568,-1.592298,-26.42302
25%,-0.7082219,-0.6943066,-0.2988226,-1.255819,-3.10688,-0.6985301,-0.8247404,-0.2603721,-0.4506002,-1.207211,-0.3659329,-0.8001163,-13.88473
50%,0.05480715,-0.4459508,-0.012442,0.002766799,-0.460829,0.07570102,0.1057907,-0.06194786,-0.04745114,-0.2417522,0.01579313,0.199754,-4.135849
75%,0.7512446,0.7059471,0.3646257,1.09719,1.920616,0.6382036,0.8462134,0.2144288,0.4746599,0.7499778,0.3399988,0.662602,13.41777
max,2.030359,3.277021,1.648565,5.748486,16.473894,2.003378,3.052473,0.8451345,2.629194,5.216047,1.574072,1.647637,52.58233
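
The `std` row above can be reproduced by hand. The sketch below reimplements Pareto scaling with the corrected standard deviation using pandas and NumPy only; it does not call PyChemAuth, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

X_wine = load_wine(as_frame=True)["data"]

# Corrected (n - 1) sample standard deviation of each column
s_corrected = X_wine.std(ddof=1)

# Pareto scaling: center, then divide by the square root of the standard deviation
X_pareto = (X_wine - X_wine.mean()) / np.sqrt(s_corrected)

# Each column's std is now sqrt(original std), so some scale information remains
pareto_std = X_pareto.std(ddof=1)
print(pareto_std.round(4))
```

Columns that started with a large spread, such as `proline`, still have the largest spread after Pareto scaling, but the gap between features is much smaller than in the raw data.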


# Filtering

In [2]:
from pychemauth.preprocessing.filter import MSC, SNV, SavGol

# Missing Values and Imputation

# Class Balancing (SMOTE)

# Feature Selection