# 🔬 Chemometric Analysis Notebook
## FBMFOR — Food Fraud Analysis
**MSc Food Technology and Quality Assurance** | University of Reading

---

This notebook provides a complete chemometric analysis pipeline for spectroscopic data (NMR, FTIR) used in food fraud detection. You will work through:

1. **Data upload and exploration** — load your spectral data and metadata
2. **Preprocessing** — baseline correction, normalisation, and derivatives
3. **Principal Component Analysis (PCA)** — unsupervised exploration
4. **Hierarchical Cluster Analysis (HCA)** — sample grouping
5. **PLS-DA** — supervised classification
6. **OPLS-DA** — orthogonal signal correction for enhanced discrimination

Each section includes explanatory text describing the *what* and *why* of each step. You are encouraged to modify parameters and observe the effects.

> **Data format:** CSV files with samples as rows and spectral variables (wavenumbers/chemical shifts) as columns. The first column must be named `Sample_ID` and contain the sample IDs; a separate metadata CSV maps each `Sample_ID` to a class label in a `Class` column.
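
As a rough illustration of the two-file layout described above, the sketch below builds two tiny in-memory CSVs and joins them on `Sample_ID`. The sample IDs, intensities, and the `demo_*` names are made up for this example and are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical miniature files; all values are invented for illustration only.
spectra_csv = io.StringIO(
    "Sample_ID,4000.0,3998.0,3996.0\n"
    "S01,0.012,0.015,0.011\n"
    "S02,0.020,0.018,0.022\n"
)
metadata_csv = io.StringIO(
    "Sample_ID,Class\n"
    "S01,EVOO\n"
    "S02,Adulterated\n"
)

demo_spectra = pd.read_csv(spectra_csv)
demo_metadata = pd.read_csv(metadata_csv)

# validate="one_to_one" raises an error if a Sample_ID is duplicated in either table
demo_merged = demo_spectra.merge(
    demo_metadata, on="Sample_ID", how="left", validate="one_to_one"
)
print(demo_merged)
```

Column headers are read as strings, which is why the notebook later converts them to floats to build the x-axis.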

## 1. Setup and Installation

Run this cell to install and import all required packages. On Google Colab, most are pre-installed; only `plotly` may need updating.

In [None]:
# ============================================================
# INSTALL & IMPORT
# ============================================================
# Uncomment the next line if running on Colab and plotly is outdated
# !pip install --upgrade plotly

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from scipy import signal, sparse
import scipy.sparse.linalg  # makes sparse.linalg.spsolve available on older SciPy versions
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import warnings
warnings.filterwarnings("ignore")

print("✅ All packages imported successfully.")

## 2. Demo Dataset: Simulated Olive Oil FTIR

This section generates a **realistic simulated FTIR dataset** for olive oil authentication. The dataset mimics mid-infrared spectra (4000–400 cm⁻¹) for three classes:

- **Extra Virgin Olive Oil (EVOO)** — authentic
- **Refined Olive Oil (ROO)** — lower grade, sometimes mislabelled
- **Adulterated** — EVOO blended with hazelnut or sunflower oil

Key spectral differences are introduced in regions known to be diagnostic:
- ~1745 cm⁻¹ (carbonyl C=O stretch — ester linkage, differs with fatty acid profile)
- ~2925 and ~2854 cm⁻¹ (C–H stretching — related to chain length and saturation)
- ~1160 cm⁻¹ (C–O stretch — differs with triglyceride composition)
- ~3005 cm⁻¹ (=C–H stretch — indicator of unsaturation)

> **Skip this section** if you want to upload your own data — proceed to Section 3.

In [None]:
# ============================================================
# GENERATE SIMULATED OLIVE OIL FTIR DATA
# ============================================================
np.random.seed(42)

# Wavenumber axis (4000 to 400 cm-1, typical FTIR range)
wavenumbers = np.linspace(4000, 400, 1800)

def gaussian_peak(x, centre, width, height):
    """Generate a Gaussian absorption peak."""
    return height * np.exp(-0.5 * ((x - centre) / width) ** 2)

def generate_spectrum(class_type, sample_idx):
    """Generate a simulated FTIR spectrum for a given oil class."""
    spectrum = np.zeros_like(wavenumbers)
    
    # Common peaks (all olive oils)
    spectrum += gaussian_peak(wavenumbers, 2925, 30, 0.8)   # C-H asymmetric stretch
    spectrum += gaussian_peak(wavenumbers, 2854, 25, 0.6)   # C-H symmetric stretch
    spectrum += gaussian_peak(wavenumbers, 1745, 20, 0.9)   # C=O ester stretch
    spectrum += gaussian_peak(wavenumbers, 1465, 15, 0.35)  # C-H bending
    spectrum += gaussian_peak(wavenumbers, 1160, 25, 0.5)   # C-O stretch
    spectrum += gaussian_peak(wavenumbers, 720, 12, 0.25)   # C-H rocking
    
    # Class-specific differences
    if class_type == "EVOO":
        spectrum += gaussian_peak(wavenumbers, 3005, 15, 0.30)  # Strong =C-H (high unsaturation)
        spectrum += gaussian_peak(wavenumbers, 1650, 12, 0.08)  # Minor polyphenol-related
    elif class_type == "ROO":
        spectrum += gaussian_peak(wavenumbers, 3005, 15, 0.18)  # Weaker =C-H
        spectrum += gaussian_peak(wavenumbers, 1745, 20, 0.05)  # Slightly different ester
    elif class_type == "Adulterated":
        spectrum += gaussian_peak(wavenumbers, 3005, 15, 0.12)  # Much weaker =C-H
        spectrum += gaussian_peak(wavenumbers, 2925, 30, 0.10)  # Extra C-H (sunflower)
        spectrum += gaussian_peak(wavenumbers, 1160, 25, 0.08)  # Different C-O profile
    
    # Add realistic noise + slight baseline drift
    noise = np.random.normal(0, 0.012, len(wavenumbers))
    baseline_drift = 0.02 * np.sin(wavenumbers / 800) + 0.01 * (wavenumbers / 4000)
    spectrum += noise + baseline_drift
    
    return spectrum

# Generate samples: 15 EVOO, 12 ROO, 10 Adulterated
classes = ["EVOO"] * 15 + ["ROO"] * 12 + ["Adulterated"] * 10
sample_ids = [f"{c}_{i+1:02d}" for i, c in enumerate(classes)]

spectra = np.array([generate_spectrum(c, i) for i, c in enumerate(classes)])

# Create DataFrames
spectral_columns = [f"{w:.1f}" for w in wavenumbers]
df_spectra = pd.DataFrame(spectra, columns=spectral_columns)
df_spectra.insert(0, "Sample_ID", sample_ids)

df_metadata = pd.DataFrame({"Sample_ID": sample_ids, "Class": classes})

print(f"Generated {len(df_spectra)} spectra with {len(wavenumbers)} wavenumber points.")
print(f"Classes: {dict(zip(*np.unique(classes, return_counts=True)))}")
print(f"\nSpectral data shape: {df_spectra.shape}")
print(f"Wavenumber range: {wavenumbers[-1]:.0f} – {wavenumbers[0]:.0f} cm⁻¹")

## 3. Data Upload (Your Own Data)

Use this section to upload your own spectral data. You need **two CSV files**:

1. **Spectral data** — rows = samples, columns = spectral variables (wavenumbers or chemical shifts). First column should be `Sample_ID`.
2. **Metadata** — at minimum two columns: `Sample_ID` and `Class`.

> **If using the demo dataset**, skip this cell — the data is already loaded above.

In [None]:
# ============================================================
# UPLOAD YOUR OWN DATA (skip if using demo dataset)
# ============================================================
# Uncomment and run this cell to upload your own files

# from google.colab import files
#
# print("Upload your SPECTRAL DATA CSV:")
# uploaded = files.upload()
# spectral_filename = list(uploaded.keys())[0]
# df_spectra = pd.read_csv(spectral_filename)
#
# print("\nUpload your METADATA CSV:")
# uploaded = files.upload()
# metadata_filename = list(uploaded.keys())[0]
# df_metadata = pd.read_csv(metadata_filename)
#
# print(f"\nSpectral data: {df_spectra.shape[0]} samples, {df_spectra.shape[1]-1} variables")
# print(f"Metadata: {df_metadata.shape[0]} samples, columns: {list(df_metadata.columns)}")
# print(f"Classes found: {df_metadata['Class'].value_counts().to_dict()}")

## 4. Data Exploration

Before any analysis, it is essential to **visually inspect your raw spectra**. Look for:
- Obvious outliers or failed measurements
- Baseline drift (common in FTIR)
- Differences in overall intensity between samples
- Key absorption bands that differ between groups

In [None]:
# ============================================================
# PREPARE DATA MATRICES
# ============================================================
# Extract numeric spectral matrix (X) and merge class labels
X_raw = df_spectra.drop(columns=["Sample_ID"]).values.astype(float)
sample_ids = df_spectra["Sample_ID"].values
variable_names = df_spectra.drop(columns=["Sample_ID"]).columns.values

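# Try to interpret variable names as numeric (wavenumbers / chemical shifts).
# The axis label is a heuristic: a descending axis is assumed to be wavenumbers,
# so set x_label manually if your NMR data are stored from high to low ppm.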
try:
    x_axis = np.array([float(v) for v in variable_names])
    x_label = "Wavenumber (cm⁻¹)" if x_axis[0] > x_axis[-1] else "Chemical shift (ppm)"
except ValueError:
    x_axis = np.arange(X_raw.shape[1])
    x_label = "Variable index"

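# Merge class labels (a left join keeps the row order of the spectral matrix)
merged = df_spectra[["Sample_ID"]].merge(
    df_metadata, on="Sample_ID", how="left", validate="many_to_one"
)
if merged["Class"].isna().any():
    missing = merged.loc[merged["Class"].isna(), "Sample_ID"].tolist()
    raise ValueError(f"No class label found in the metadata for: {missing}")
class_labels = merged["Class"].values
unique_classes = np.unique(class_labels)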

print(f"Spectral matrix X: {X_raw.shape[0]} samples × {X_raw.shape[1]} variables")
print(f"Classes: {list(unique_classes)}")
print(f"X-axis: {x_label}, range {x_axis.min():.1f} – {x_axis.max():.1f}")

In [None]:
# ============================================================
# INTERACTIVE RAW SPECTRA PLOT
# ============================================================
fig = go.Figure()
colours = px.colors.qualitative.Set2

for i, cls in enumerate(unique_classes):
    mask = class_labels == cls
    for j, idx in enumerate(np.where(mask)[0]):
        fig.add_trace(go.Scatter(
            x=x_axis, y=X_raw[idx],
            name=cls if j == 0 else None,
            legendgroup=cls,
            showlegend=(j == 0),
            line=dict(color=colours[i % len(colours)], width=0.8),
            hovertext=sample_ids[idx],
            hoverinfo="text+y",
            opacity=0.7
        ))

# Reverse x-axis for FTIR (high to low wavenumber convention)
if x_axis[0] > x_axis[-1]:
    fig.update_xaxes(autorange="reversed")

fig.update_layout(
    title="Raw Spectra by Class",
    xaxis_title=x_label,
    yaxis_title="Absorbance / Intensity",
    template="plotly_white",
    height=500,
    hovermode="closest"
)
fig.show()

In [None]:
# ============================================================
# SUMMARY STATISTICS & MISSING VALUES
# ============================================================
print("=== Data Quality Check ===")
print(f"Missing values: {np.isnan(X_raw).sum()}")
print(f"Infinite values: {np.isinf(X_raw).sum()}")
print(f"\nIntensity range: {X_raw.min():.4f} to {X_raw.max():.4f}")
print(f"Mean intensity:  {X_raw.mean():.4f} ± {X_raw.std():.4f}")

# Per-class means
print("\n=== Per-Class Mean Intensity ===")
for cls in unique_classes:
    mask = class_labels == cls
    print(f"  {cls}: {X_raw[mask].mean():.4f} ± {X_raw[mask].std():.4f} (n={mask.sum()})")

## 5. Spectral Preprocessing

Raw spectra typically contain artefacts that must be removed before multivariate analysis:

- **Baseline drift** — gradual shift in the spectrum baseline due to scattering or instrument effects. Corrected using asymmetric least squares (ALS) smoothing or polynomial fitting.
- **Intensity variation** — differences in overall signal intensity between samples due to pathlength or concentration differences. Corrected by normalisation (SNV, min-max, or area normalisation).
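- **Noise and band overlap** — Savitzky-Golay filtering fits a local polynomial within a moving window, which smooths high-frequency noise; taking the derivative of that fit sharpens overlapping bands. A first derivative removes constant baseline offsets and a second derivative also removes linear slopes, but derivatives amplify noise, so choose the window length with care.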

> **Key principle:** Always visualise the effect of each preprocessing step. Over-processing can destroy genuine chemical information.
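
The invariances mentioned in the list above can be checked numerically. The following is a minimal sketch on a random stand-in "spectrum" (not real FTIR data); `snv_row` is a hypothetical single-spectrum helper written for this example, and the window length of 15 and polynomial order of 2 simply mirror the notebook defaults.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.arange(200)
base = rng.random(200)  # arbitrary stand-in for a spectrum

def snv_row(y):
    """SNV for a single spectrum: centre, then scale by its own standard deviation."""
    return (y - y.mean()) / y.std()

# 1) SNV is unchanged by a positive gain and a constant offset
scaled_and_shifted = 1.7 * base + 0.3
snv_invariant = np.allclose(snv_row(base), snv_row(scaled_and_shifted))

# 2) A first derivative removes a constant offset
d1 = savgol_filter(base, 15, 2, deriv=1)
d1_offset = savgol_filter(base + 0.3, 15, 2, deriv=1)
offset_removed = np.allclose(d1, d1_offset)

# 3) A second derivative also removes a linear slope
d2 = savgol_filter(base, 15, 2, deriv=2)
d2_sloped = savgol_filter(base + 0.3 + 0.01 * x, 15, 2, deriv=2)
slope_removed = np.allclose(d2, d2_sloped)

print(snv_invariant, offset_removed, slope_removed)
```

Real baselines are rarely exactly constant or linear, so these checks show what each step can remove in principle rather than what it will achieve on measured spectra.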

In [None]:
# ============================================================
# PREPROCESSING FUNCTIONS
# ============================================================

def baseline_als(y, lam=1e6, p=0.01, niter=10):
    """
    Asymmetric Least Squares baseline correction (Eilers & Boelens, 2005).
    
    Parameters:
        y     : spectrum (1D array)
        lam   : smoothness parameter (larger = smoother baseline). Try 1e4 to 1e8.
        p     : asymmetry parameter (smaller = baseline hugs lower envelope). Try 0.001 to 0.05.
        niter : number of iterations
    """
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = sparse.linalg.spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z


def snv(X):
    """
    Standard Normal Variate (SNV) normalisation.
    Each spectrum is centred and scaled by its own mean and std.
    Removes scatter effects (multiplicative and additive).
    """
    X_snv = np.zeros_like(X)
    for i in range(X.shape[0]):
        X_snv[i] = (X[i] - np.mean(X[i])) / np.std(X[i])
    return X_snv


def area_normalise(X):
    """Normalise each spectrum to unit area (total intensity = 1)."""
    return X / np.abs(X).sum(axis=1, keepdims=True)


def minmax_normalise(X):
    """Scale each spectrum to [0, 1] range."""
    mins = X.min(axis=1, keepdims=True)
    maxs = X.max(axis=1, keepdims=True)
    return (X - mins) / (maxs - mins)


def savgol_derivative(X, window_length=15, polyorder=2, deriv=1):
    """
    Savitzky-Golay derivative.
    
    Parameters:
        window_length : must be odd, larger = more smoothing
        polyorder     : polynomial order (2 or 3 typical)
        deriv         : derivative order (1 = first, 2 = second)
    """
    return signal.savgol_filter(X, window_length, polyorder, deriv=deriv, axis=1)


print("✅ Preprocessing functions defined.")

### 5.1 Apply Preprocessing

Modify the settings below to experiment with different preprocessing combinations. A typical workflow for FTIR data is: **baseline correction → SNV → (optional) 1st derivative**.

For NMR data, you might skip baseline correction and use: **area normalisation → (optional) binning**.
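
The notebook does not define a binning step, so the sketch below shows one possible approach: equal-width bucketing by summation. The function name `bin_spectra` and the `bin_size` parameter are assumptions made for this example; other schemes (for instance, fixed ppm widths or adaptive binning) are equally valid.

```python
import numpy as np

def bin_spectra(X, axis_values, bin_size=10):
    """Sum adjacent points into fixed-size bins; returns (X_binned, bin_centres)."""
    n_bins = X.shape[1] // bin_size
    usable = n_bins * bin_size  # trailing points that do not fill a whole bin are dropped
    X_binned = X[:, :usable].reshape(X.shape[0], n_bins, bin_size).sum(axis=2)
    centres = np.asarray(axis_values)[:usable].reshape(n_bins, bin_size).mean(axis=1)
    return X_binned, centres

# Toy input: 2 "spectra" with 12 points each on a descending ppm axis
X_demo = np.arange(24, dtype=float).reshape(2, 12)
ppm_demo = np.linspace(10, 0, 12)

Xb, centres = bin_spectra(X_demo, ppm_demo, bin_size=4)
print(Xb.shape)
print(centres)
```

Summing (rather than averaging) preserves total intensity within each bin, which fits naturally with area normalisation.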

In [None]:
# ============================================================
# PREPROCESSING SETTINGS — MODIFY THESE
# ============================================================

# Step 1: Baseline correction
APPLY_BASELINE = True           # True / False
BASELINE_LAMBDA = 1e6           # Smoothness: try 1e4 to 1e8
BASELINE_P = 0.01               # Asymmetry: try 0.001 to 0.05

# Step 2: Normalisation
NORMALISATION = "snv"            # Options: "snv", "area", "minmax", "none"

# Step 3: Savitzky-Golay derivative
APPLY_DERIVATIVE = False         # True / False
SG_WINDOW = 15                   # Window length (must be odd)
SG_POLYORDER = 2                 # Polynomial order
SG_DERIV = 1                     # Derivative order: 1 or 2

# ============================================================
# APPLY PREPROCESSING PIPELINE
# ============================================================
X_processed = X_raw.copy()

# Step 1: Baseline correction
if APPLY_BASELINE:
    print("Applying baseline correction (ALS)...")
    for i in range(X_processed.shape[0]):
        bl = baseline_als(X_processed[i], lam=BASELINE_LAMBDA, p=BASELINE_P)
        X_processed[i] = X_processed[i] - bl

# Step 2: Normalisation
if NORMALISATION == "snv":
    print("Applying SNV normalisation...")
    X_processed = snv(X_processed)
elif NORMALISATION == "area":
    print("Applying area normalisation...")
    X_processed = area_normalise(X_processed)
elif NORMALISATION == "minmax":
    print("Applying min-max normalisation...")
    X_processed = minmax_normalise(X_processed)
else:
    print("No normalisation applied.")

# Step 3: Derivative
if APPLY_DERIVATIVE:
    deriv_suffix = "st" if SG_DERIV == 1 else "nd"
    print(f"Applying Savitzky-Golay {SG_DERIV}{deriv_suffix} derivative...")
    X_processed = savgol_derivative(X_processed, SG_WINDOW, SG_POLYORDER, SG_DERIV)

print(f"\n✅ Preprocessing complete. Matrix shape: {X_processed.shape}")

In [None]:
# ============================================================
# VISUALISE PREPROCESSING EFFECT
# ============================================================
fig = make_subplots(rows=2, cols=1, subplot_titles=("Before preprocessing", "After preprocessing"),
                    shared_xaxes=True, vertical_spacing=0.08)

for i, cls in enumerate(unique_classes):
    mask = class_labels == cls
    for j, idx in enumerate(np.where(mask)[0]):
        fig.add_trace(go.Scatter(
            x=x_axis, y=X_raw[idx], name=cls if j == 0 else None,
            legendgroup=cls, showlegend=(j == 0),
            line=dict(color=colours[i % len(colours)], width=0.6), opacity=0.6
        ), row=1, col=1)
        fig.add_trace(go.Scatter(
            x=x_axis, y=X_processed[idx], name=cls if j == 0 else None,
            legendgroup=cls, showlegend=False,
            line=dict(color=colours[i % len(colours)], width=0.6), opacity=0.6
        ), row=2, col=1)

if x_axis[0] > x_axis[-1]:
    fig.update_xaxes(autorange="reversed")

fig.update_layout(title="Preprocessing Comparison", template="plotly_white", height=700)
fig.update_xaxes(title_text=x_label, row=2, col=1)
fig.update_yaxes(title_text="Absorbance", row=1, col=1)
fig.update_yaxes(title_text="Preprocessed", row=2, col=1)
fig.show()

## 6. Principal Component Analysis (PCA)

PCA is an **unsupervised** method that reduces the dimensionality of spectral data by finding directions (principal components) of maximum variance. It is typically the first step in chemometric analysis because it:

- Reveals natural groupings (or lack thereof) in the data
- Identifies outliers
- Shows which spectral regions contribute most to variation (via loadings)
- Helps determine how many latent variables are needed for supervised models

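> **Important:** PCA does not use class labels, so any separation in the scores plot comes from variance in the spectra themselves. That variance often reflects chemical differences, but it can also arise from batch, instrument, or preprocessing effects, so check the loadings before interpreting a grouping.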
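
To make the link between scores, loadings, and explained variance concrete, here is a small sketch that computes PCA from a singular value decomposition of mean-centred data and compares it with scikit-learn. The random matrix and the `*_demo` / `svd_*` names are placeholders for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(12, 50))        # 12 samples, 50 variables (random stand-in)
X_centred = X_demo - X_demo.mean(axis=0)  # mean-centre each variable

# SVD: X = U S Vt, so scores = U * S and loadings = rows of Vt
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
svd_scores = U * S
svd_loadings = Vt
svd_explained = S**2 / np.sum(S**2) * 100  # percentage of variance per component

pca_demo = PCA(n_components=3).fit(X_centred)
sk_scores = pca_demo.transform(X_centred)

# Component signs are arbitrary, so compare absolute values
same_scores = np.allclose(np.abs(sk_scores), np.abs(svd_scores[:, :3]))
same_variance = np.allclose(pca_demo.explained_variance_ratio_ * 100, svd_explained[:3])
print(same_scores, same_variance)
```

Because the sign of each component is arbitrary, a scores plot can appear mirrored between software packages without any change in interpretation.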

In [None]:
# ============================================================
# PCA — FIT AND EXPLORE
# ============================================================
N_COMPONENTS = 10  # Number of PCs to compute (adjust if needed)

# Mean-centre the data (standard for PCA on spectral data)
scaler = StandardScaler(with_std=False)  # Mean-centre only, no scaling
X_mc = scaler.fit_transform(X_processed)

# Fit PCA
pca = PCA(n_components=min(N_COMPONENTS, X_mc.shape[0], X_mc.shape[1]))
scores = pca.fit_transform(X_mc)
loadings = pca.components_
explained_var = pca.explained_variance_ratio_ * 100

print(f"PCA fitted with {pca.n_components_} components.")
print(f"\nExplained variance per PC:")
for i, ev in enumerate(explained_var):
    cumulative = explained_var[:i+1].sum()
    bar = "█" * int(ev)
    print(f"  PC{i+1:2d}: {ev:5.1f}% {bar}  (cumulative: {cumulative:.1f}%)")

In [None]:
# ============================================================
# SCREE PLOT
# ============================================================
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Bar(
    x=[f"PC{i+1}" for i in range(len(explained_var))],
    y=explained_var,
    name="Individual",
    marker_color="steelblue"
), secondary_y=False)

fig.add_trace(go.Scatter(
    x=[f"PC{i+1}" for i in range(len(explained_var))],
    y=np.cumsum(explained_var),
    name="Cumulative",
    mode="lines+markers",
    marker=dict(color="firebrick"),
    line=dict(color="firebrick")
), secondary_y=True)

fig.update_layout(title="Scree Plot", template="plotly_white", height=400)
fig.update_yaxes(title_text="Explained variance (%)", secondary_y=False)
fig.update_yaxes(title_text="Cumulative (%)", secondary_y=True, range=[0, 105])
fig.show()

In [None]:
# ============================================================
# PCA SCORES PLOT (interactive with 95% confidence ellipses)
# ============================================================
PC_X = 1  # Which PC on x-axis (1-indexed)
PC_Y = 2  # Which PC on y-axis

scores_df = pd.DataFrame({
    "Sample_ID": sample_ids,
    "Class": class_labels,
    f"PC{PC_X}": scores[:, PC_X - 1],
    f"PC{PC_Y}": scores[:, PC_Y - 1]
})

fig = px.scatter(
    scores_df, x=f"PC{PC_X}", y=f"PC{PC_Y}",
    color="Class", hover_name="Sample_ID",
    title=f"PCA Scores Plot: PC{PC_X} vs PC{PC_Y}",
    labels={
        f"PC{PC_X}": f"PC{PC_X} ({explained_var[PC_X-1]:.1f}%)",
        f"PC{PC_Y}": f"PC{PC_Y} ({explained_var[PC_Y-1]:.1f}%)"
    },
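    color_discrete_sequence=px.colors.qualitative.Set2,
    category_orders={"Class": list(unique_classes)}  # same class-to-colour mapping as the spectra plots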
)

# Add 95% confidence ellipses
for cls in unique_classes:
    mask = class_labels == cls
    pc_x = scores[mask, PC_X - 1]
    pc_y = scores[mask, PC_Y - 1]
    
    if len(pc_x) < 3:
        continue
    
    # Calculate ellipse parameters
    cov = np.cov(pc_x, pc_y)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    angle = np.degrees(np.arctan2(eigenvectors[1, 1], eigenvectors[0, 1]))
    chi2_val = 5.991  # 95% confidence for 2 DOF
    
    # Generate ellipse points
    theta = np.linspace(0, 2 * np.pi, 100)
    a = np.sqrt(eigenvalues[1] * chi2_val)
    b = np.sqrt(eigenvalues[0] * chi2_val)
    ellipse_x = a * np.cos(theta)
    ellipse_y = b * np.sin(theta)
    
    # Rotate
    cos_a, sin_a = np.cos(np.radians(angle)), np.sin(np.radians(angle))
    rot_x = cos_a * ellipse_x - sin_a * ellipse_y + pc_x.mean()
    rot_y = sin_a * ellipse_x + cos_a * ellipse_y + pc_y.mean()
    
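    # Reuse the colour Plotly assigned to this class so each ellipse matches its points
    cls_colour = next((t.marker.color for t in fig.data if t.name == str(cls)), None)

    fig.add_trace(go.Scatter(
        x=rot_x, y=rot_y, mode="lines",
        line=dict(dash="dash", width=1, color=cls_colour),
        showlegend=False, hoverinfo="skip"
    ))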

fig.update_layout(template="plotly_white", height=550, width=700)
fig.show()

In [None]:
# ============================================================
# PCA LOADINGS PLOT
# ============================================================
LOADING_PC = 1  # Which PC loadings to plot (1-indexed)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=x_axis, y=loadings[LOADING_PC - 1],
    mode="lines", name=f"PC{LOADING_PC} loadings",
    line=dict(color="steelblue")
))

fig.add_hline(y=0, line_dash="dash", line_color="grey")

if x_axis[0] > x_axis[-1]:
    fig.update_xaxes(autorange="reversed")

fig.update_layout(
    title=f"PC{LOADING_PC} Loadings ({explained_var[LOADING_PC-1]:.1f}% variance)",
    xaxis_title=x_label,
    yaxis_title="Loading",
    template="plotly_white",
    height=400
)
fig.show()

print(f"\nTop contributing variables for PC{LOADING_PC}:")
abs_loadings = np.abs(loadings[LOADING_PC - 1])
top_idx = np.argsort(abs_loadings)[::-1][:10]
for idx in top_idx:
    print(f"  {variable_names[idx]:>10s}: {loadings[LOADING_PC-1][idx]:+.4f}")

### 6.1 Hotelling's T² — Outlier Detection

Hotelling's T² statistic identifies samples that are unusually far from the centre of the PCA model. Samples exceeding the 95% or 99% confidence limit may be outliers that distort subsequent analysis.

> **Action:** If outliers are identified, investigate them. They may represent measurement errors, contamination, or genuinely unusual samples. Remove them only with justification.
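
As a rough sketch of what the statistic measures: because PCA scores are uncorrelated, T² for a sample reduces to the sum of its squared scores, each divided by the variance of that component. The example below checks this against the inverse-covariance form on random stand-in data; the `*_demo` names are assumptions for this illustration, and the limit uses the same F-distribution expression as the notebook cell that follows.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 40))  # random stand-in for a spectral matrix
X_demo[0] += 3.0                    # shift one sample so it sits away from the rest

n_demo, a_demo = X_demo.shape[0], 3
pca_demo = PCA(n_components=a_demo)
scores_demo = pca_demo.fit_transform(X_demo)  # PCA mean-centres internally

# Diagonal form: sum over components of (score squared / component variance)
T2_diag = np.sum(scores_demo**2 / pca_demo.explained_variance_, axis=1)

# General form with the inverse covariance matrix of the scores
cov_inv_demo = np.linalg.inv(np.cov(scores_demo.T))
T2_full = np.einsum("ij,jk,ik->i", scores_demo, cov_inv_demo, scores_demo)

# 95% limit from the F-distribution
limit_95 = (a_demo * (n_demo - 1) * (n_demo + 1)) / (n_demo * (n_demo - a_demo)) \
    * f_dist.ppf(0.95, a_demo, n_demo - a_demo)

print("Forms agree:", np.allclose(T2_diag, T2_full))
print("Samples above the 95% limit:", np.where(T2_diag > limit_95)[0])
```

T² only measures distance within the model plane; a sample that is poorly described by the retained PCs can still have a small T².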

In [None]:
# ============================================================
# HOTELLING'S T² OUTLIER DETECTION
# ============================================================
from scipy.stats import f as f_dist

N_PC_HOTELLING = 3  # Number of PCs to use for T² calculation

scores_subset = scores[:, :N_PC_HOTELLING]
cov_inv = np.linalg.inv(np.cov(scores_subset.T))
T2 = np.array([s @ cov_inv @ s.T for s in scores_subset])

# F-distribution critical values for 95% and 99% confidence
n = X_mc.shape[0]
p = N_PC_HOTELLING
T2_95 = (p * (n - 1) * (n + 1)) / (n * (n - p)) * f_dist.ppf(0.95, p, n - p)
T2_99 = (p * (n - 1) * (n + 1)) / (n * (n - p)) * f_dist.ppf(0.99, p, n - p)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(range(n)), y=T2,
    mode="markers+text", text=sample_ids,
    textposition="top center", textfont=dict(size=8),
    marker=dict(
        color=[colours[list(unique_classes).index(c) % len(colours)] for c in class_labels],
        size=8
    ),
    hovertext=sample_ids, name="Samples"
))

fig.add_hline(y=T2_95, line_dash="dash", line_color="orange",
              annotation_text="95% limit", annotation_position="top right")
fig.add_hline(y=T2_99, line_dash="dash", line_color="red",
              annotation_text="99% limit", annotation_position="top right")

fig.update_layout(
    title=f"Hotelling's T² (based on {N_PC_HOTELLING} PCs)",
    xaxis_title="Sample index",
    yaxis_title="T²",
    template="plotly_white",
    height=400
)
fig.show()

# Report outliers
outliers_95 = sample_ids[T2 > T2_95]
outliers_99 = sample_ids[T2 > T2_99]
print(f"Samples exceeding 95% limit: {list(outliers_95) if len(outliers_95) > 0 else 'None'}")
print(f"Samples exceeding 99% limit: {list(outliers_99) if len(outliers_99) > 0 else 'None'}")

## 7. Hierarchical Cluster Analysis (HCA)

HCA is another **unsupervised** method that groups samples based on their similarity. Unlike PCA, it produces a **dendrogram** — a tree-like diagram showing how samples cluster together. This is useful for:

- Confirming groupings seen in PCA
- Identifying sub-groups within a class
- Assessing how well different classes separate

**Key parameters:**
- **Distance metric** — how similarity is measured (Euclidean, correlation, cosine)
- **Linkage method** — how clusters are merged (Ward's is generally recommended for spectral data)
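
A dendrogram can also be cut into a fixed number of clusters, which makes it easy to compare cluster membership with the known classes. The sketch below uses `fcluster` on two well-separated synthetic groups; the data and the choice of two clusters are assumptions for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 20))  # 5 samples centred on 0
group_b = rng.normal(loc=1.0, scale=0.1, size=(5, 20))  # 5 samples centred on 1
X_demo = np.vstack([group_a, group_b])

# Ward linkage requires Euclidean distances
Z_demo = linkage(X_demo, method="ward", metric="euclidean")

# Cut the tree so that at most 2 clusters remain
cluster_ids = fcluster(Z_demo, t=2, criterion="maxclust")
print(cluster_ids)
```

With real spectra, a cross-tabulation of `cluster_ids` against the class labels (for example with `pd.crosstab`) gives a quick numerical view of how well the unsupervised clusters match the classes.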

In [None]:
# ============================================================
# HIERARCHICAL CLUSTER ANALYSIS
# ============================================================
DISTANCE_METRIC = "euclidean"   # Options: "euclidean", "correlation", "cosine"
LINKAGE_METHOD = "ward"          # Options: "ward", "complete", "average", "single"
# Note: Ward's method requires Euclidean distance

# Compute linkage on preprocessed data
Z = linkage(X_processed, method=LINKAGE_METHOD, metric=DISTANCE_METRIC)

# Create colour map for classes
class_colour_map = {cls: colours[i % len(colours)] for i, cls in enumerate(unique_classes)}

# Plot dendrogram
fig, ax = plt.subplots(figsize=(14, 5))
dend = dendrogram(
    Z, labels=sample_ids, leaf_rotation=90, leaf_font_size=8,
    ax=ax, color_threshold=0
)

# Colour leaf labels by class
xlbls = ax.get_xmajorticklabels()
for lbl in xlbls:
    sample = lbl.get_text()
    idx = list(sample_ids).index(sample)
    lbl.set_color(class_colour_map[class_labels[idx]])

ax.set_title(f"HCA Dendrogram ({LINKAGE_METHOD} linkage, {DISTANCE_METRIC} distance)")
ax.set_ylabel("Distance")

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=class_colour_map[c], label=c) for c in unique_classes]
ax.legend(handles=legend_elements, loc="upper right")

plt.tight_layout()
plt.show()

## 8. Partial Least Squares Discriminant Analysis (PLS-DA)

PLS-DA is a **supervised** method that finds latent variables maximising the covariance between the spectral data (X) and the class labels (Y). Unlike PCA, it uses class information to find the most discriminating directions in spectral space.

**Key considerations:**
- The number of latent variables (LVs) must be optimised by cross-validation to avoid overfitting
- **VIP scores** (Variable Importance in Projection) identify which spectral regions drive classification
- Always validate with cross-validation — never rely on calibration performance alone

> **Overfitting warning:** PLS-DA can easily overfit, especially with many more variables than samples (the typical case for spectral data). Cross-validation is essential.

In [None]:
# ============================================================
# PLS-DA: ENCODE CLASSES AND SELECT NUMBER OF LVs
# ============================================================
MAX_LV = 10  # Maximum number of LVs to test

# Encode class labels as dummy matrix (one-hot for multiclass)
le = LabelEncoder()
y_encoded = le.fit_transform(class_labels)

# For multiclass: create dummy Y matrix
n_classes = len(unique_classes)
Y_dummy = np.zeros((len(y_encoded), n_classes))
for i, y in enumerate(y_encoded):
    Y_dummy[i, y] = 1

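# Cross-validation to find optimal number of LVs
# StratifiedKFold cannot stratify on a one-hot matrix, so the folds are built
# from the integer labels and passed to cross_val_predict as explicit splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_splits = list(cv.split(X_processed, y_encoded))
cv_scores = []

for n_lv in range(1, MAX_LV + 1):
    pls = PLSRegression(n_components=n_lv, scale=True)
    y_pred_cv = cross_val_predict(pls, X_processed, Y_dummy, cv=cv_splits)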
    y_pred_class = le.inverse_transform(np.argmax(y_pred_cv, axis=1))
    acc = accuracy_score(class_labels, y_pred_class)
    cv_scores.append(acc)

optimal_lv = np.argmax(cv_scores) + 1

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(range(1, MAX_LV + 1)), y=cv_scores,
    mode="lines+markers", marker=dict(color="steelblue")
))
fig.add_vline(x=optimal_lv, line_dash="dash", line_color="red",
              annotation_text=f"Optimal: {optimal_lv} LVs")

fig.update_layout(
    title="PLS-DA: Cross-Validation Accuracy vs Number of LVs",
    xaxis_title="Number of Latent Variables",
    yaxis_title="CV Accuracy",
    template="plotly_white",
    height=400,
    yaxis=dict(range=[0, 1.05])
)
fig.show()

print(f"\nOptimal number of LVs: {optimal_lv} (CV accuracy: {cv_scores[optimal_lv-1]:.3f})")

In [None]:
# ============================================================
# PLS-DA: FIT FINAL MODEL AND EVALUATE
# ============================================================
N_LV = optimal_lv  # Use optimal, or override manually

pls_final = PLSRegression(n_components=N_LV, scale=True)
pls_final.fit(X_processed, Y_dummy)

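# Cross-validated predictions (folds are stratified on the integer labels,
# because StratifiedKFold rejects a one-hot Y matrix)
y_pred_cv = cross_val_predict(pls_final, X_processed, Y_dummy,
                              cv=list(cv.split(X_processed, y_encoded)))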
y_pred_class = le.inverse_transform(np.argmax(y_pred_cv, axis=1))

# Confusion matrix
cm = confusion_matrix(class_labels, y_pred_class, labels=le.classes_)
fig = px.imshow(
    cm, x=le.classes_, y=le.classes_,
    labels=dict(x="Predicted", y="True", color="Count"),
    text_auto=True, color_continuous_scale="Blues",
    title=f"PLS-DA Confusion Matrix ({N_LV} LVs, 5-fold CV)"
)
fig.update_layout(height=400, width=500)
fig.show()

# Classification report
print("\nClassification Report (cross-validated):")
print(classification_report(class_labels, y_pred_class))

In [None]:
# ============================================================
# PLS-DA SCORES PLOT
# ============================================================
pls_scores = pls_final.x_scores_

LV_X = 1  # LV for x-axis (1-indexed)
LV_Y = 2  # LV for y-axis

scores_df = pd.DataFrame({
    "Sample_ID": sample_ids,
    "Class": class_labels,
    f"LV{LV_X}": pls_scores[:, LV_X - 1],
    f"LV{LV_Y}": pls_scores[:, LV_Y - 1] if N_LV >= 2 else np.zeros(len(sample_ids))
})

fig = px.scatter(
    scores_df, x=f"LV{LV_X}", y=f"LV{LV_Y}",
    color="Class", hover_name="Sample_ID",
    title=f"PLS-DA Scores Plot: LV{LV_X} vs LV{LV_Y}",
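    color_discrete_sequence=px.colors.qualitative.Set2,
    category_orders={"Class": list(unique_classes)}  # same class-to-colour mapping as the spectra plots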
)
fig.update_layout(template="plotly_white", height=550, width=700)
fig.show()

In [None]:
# ============================================================
# VIP SCORES (Variable Importance in Projection)
# ============================================================
def calculate_vip(pls_model, X, Y):
    """
    Calculate VIP scores for a fitted PLS model.
    Variables with VIP > 1.0 are considered important.
    VIP > 0.8 is sometimes used as a softer threshold.
    """
    T = pls_model.x_scores_        # X scores
    W = pls_model.x_weights_       # X weights
    Q = pls_model.y_loadings_      # Y loadings
    
    p_vars = X.shape[1]  # number of variables
    h = T.shape[1]       # number of components
    
    # Sum of squares of Y explained by each component
    s = np.diag(T.T @ T @ Q.T @ Q)
    s_total = s.sum()
    
    # VIP calculation (vectorised): normalise each weight vector to unit length,
    # then weight the squared entries by the Y variance explained per component
    W_norm = W / np.linalg.norm(W, axis=0)
    vip = np.sqrt(p_vars * (W_norm ** 2 @ s) / s_total)
    
    return vip

vip_scores = calculate_vip(pls_final, X_processed, Y_dummy)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=x_axis, y=vip_scores,
    mode="lines", line=dict(color="steelblue"),
    hovertext=[f"{v}: VIP={vip:.2f}" for v, vip in zip(variable_names, vip_scores)]
))

fig.add_hline(y=1.0, line_dash="dash", line_color="red",
              annotation_text="VIP = 1.0 (important)", annotation_position="top right")
fig.add_hline(y=0.8, line_dash="dot", line_color="orange",
              annotation_text="VIP = 0.8", annotation_position="top right")

if x_axis[0] > x_axis[-1]:
    fig.update_xaxes(autorange="reversed")

fig.update_layout(
    title="VIP Scores — Spectral Regions Driving Classification",
    xaxis_title=x_label,
    yaxis_title="VIP Score",
    template="plotly_white",
    height=400
)
fig.show()

# Top VIP variables (ranked by VIP; some may fall below the 1.0 threshold)
print("\nTop 10 spectral variables ranked by VIP score:")
top_vip_idx = np.argsort(vip_scores)[::-1][:10]
for idx in top_vip_idx:
    flag = "" if vip_scores[idx] > 1.0 else "  (below 1.0)"
    print(f"  {variable_names[idx]:>10s}: VIP = {vip_scores[idx]:.3f}{flag}")

## 9. Orthogonal PLS-DA (OPLS-DA)

OPLS-DA (Trygg & Wold, 2002) is an extension of PLS-DA that separates the variation in X into:

- **Predictive** variation — correlated with the class labels (Y)
- **Orthogonal** variation — structured variation unrelated to classes (e.g., batch effects, instrument drift)

This separation has two major advantages:
1. The **predictive scores plot** often shows cleaner class separation than PLS-DA
2. The **S-plot** (covariance vs correlation) identifies reliable biomarkers for class discrimination

> **Note:** OPLS-DA is most effective for **two-class** comparisons. For multiclass problems, consider running pairwise OPLS-DA models.
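
To make the predictive/orthogonal split concrete, here is a minimal sketch of a single orthogonal filtering step in plain NumPy. The synthetic data (a class signal plus a class-independent "batch" effect) and all variable names are assumptions for illustration only. It shows two properties of the filter: the orthogonal scores carry no covariance with y, and removing them leaves the covariance between X and y unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 100
y = np.repeat([0.0, 1.0], n // 2)

# Assumed synthetic data: class signal in variables 0-9 plus a
# class-independent "batch" effect in variables 5-29.
batch = rng.normal(size=n)
X = rng.normal(scale=0.5, size=(n, p))
X[:, :10] += 2.0 * y[:, None]
X[:, 5:30] += 1.5 * batch[:, None]

# Mean-centre X and y
Xc = X - X.mean(axis=0)
yc = (y - y.mean()).reshape(-1, 1)

w = Xc.T @ yc
w /= np.linalg.norm(w)               # predictive weight (unit length)
t = Xc @ w                           # predictive scores
p_load = Xc.T @ t / (t.T @ t)        # loadings
w_o = p_load - (w.T @ p_load) * w    # part of the loadings orthogonal to w
w_o /= np.linalg.norm(w_o)
t_o = Xc @ w_o                       # orthogonal scores
p_o = Xc.T @ t_o / (t_o.T @ t_o)     # orthogonal loadings
X_filtered = Xc - t_o @ p_o.T        # X with one orthogonal component removed

ortho_y_cov = (t_o.T @ yc).item()
print(f"t_ortho . y = {ortho_y_cov:.2e}")
print("X'y unchanged by filtering:", np.allclose(X_filtered.T @ yc, Xc.T @ yc))
```

The filter therefore discards structured variation in X without touching the part that covaries with the class labels, which is why the predictive scores plot tends to look cleaner afterwards.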

In [None]:
# ============================================================
# OPLS-DA IMPLEMENTATION
# ============================================================

class OPLSDA:
    """
    Orthogonal PLS-DA (Trygg & Wold, 2002).
    Separates X into predictive and orthogonal components.
    
    For two-class problems: use 1 predictive + n_ortho orthogonal components.
    """
    
    def __init__(self, n_ortho=1):
        self.n_ortho = n_ortho
        self.t_ortho_ = None      # Orthogonal scores
        self.p_ortho_ = None      # Orthogonal loadings
        self.w_ortho_ = None      # Orthogonal weights
        self.t_pred_ = None       # Predictive scores
        self.p_pred_ = None       # Predictive loadings
        self.w_pred_ = None       # Predictive weights
        self.pls_ = None          # Underlying PLS model
        self.X_filtered_ = None   # X with orthogonal variation removed
    
    def fit(self, X, y):
        """
        Fit OPLS-DA model.
        X: preprocessed spectral matrix (n_samples, n_variables)
        y: numeric 1D array coding the two classes (e.g. 0.0 and 1.0);
           string labels must be encoded before calling fit
        """
        # Mean-centre
        self.X_mean_ = X.mean(axis=0)
        self.y_mean_ = y.mean()
        Xc = X - self.X_mean_
        yc = (y - self.y_mean_).reshape(-1, 1)
        
        # Store for orthogonal correction
        self.t_ortho_ = []
        self.p_ortho_ = []
        self.w_ortho_ = []
        
        X_filtered = Xc.copy()
        
        for i in range(self.n_ortho):
            # PLS weight vector (first PLS component)
            w = (X_filtered.T @ yc) / (yc.T @ yc)
            w = w / np.linalg.norm(w)
            
            # PLS scores
            t = X_filtered @ w
            
            # PLS loadings
            p = (X_filtered.T @ t) / (t.T @ t)
            
            # Orthogonal weight
            w_ortho = p - (w.T @ p) / (w.T @ w) * w
            w_ortho = w_ortho / np.linalg.norm(w_ortho)
            
            # Orthogonal scores and loadings
            t_ortho = X_filtered @ w_ortho
            p_ortho = (X_filtered.T @ t_ortho) / (t_ortho.T @ t_ortho)
            
            # Remove orthogonal component
            X_filtered = X_filtered - t_ortho @ p_ortho.T
            
            self.t_ortho_.append(t_ortho.ravel())
            self.p_ortho_.append(p_ortho.ravel())
            self.w_ortho_.append(w_ortho.ravel())
        
        self.t_ortho_ = np.array(self.t_ortho_).T
        self.p_ortho_ = np.array(self.p_ortho_).T
        self.w_ortho_ = np.array(self.w_ortho_).T
        
        # Fit PLS on filtered data
        self.pls_ = PLSRegression(n_components=1, scale=False)
        self.pls_.fit(X_filtered, yc)
        
        self.t_pred_ = self.pls_.x_scores_.ravel()
        self.p_pred_ = self.pls_.x_loadings_.ravel()
        self.w_pred_ = self.pls_.x_weights_.ravel()
        self.X_filtered_ = X_filtered
        
        # Calculate R²Y (goodness of fit) for the predictive component;
        # Q² would require cross-validation and is not computed here
        y_pred = self.pls_.predict(X_filtered)
        ss_res = np.sum((yc - y_pred) ** 2)
        ss_tot = np.sum(yc ** 2)
        self.R2Y_ = 1 - ss_res / ss_tot
        
        return self
    
    def transform(self, X):
        """Apply orthogonal correction to new data."""
        Xc = X - self.X_mean_
        for i in range(self.n_ortho):
            t_ortho = Xc @ self.w_ortho_[:, i:i+1]
            Xc = Xc - t_ortho @ self.p_ortho_[:, i:i+1].T
        return Xc
    
    def predict(self, X):
        """
        Predict the continuous y value for new data.
        For 0/1-coded classes, threshold the result (e.g. at 0.5) to assign a class.
        """
        X_filt = self.transform(X)
        return self.pls_.predict(X_filt).ravel() + self.y_mean_
    
    def s_plot_data(self):
        """
        Generate S-plot data: covariance (p) vs correlation (pcorr).
        Variables in the upper-right or lower-left corners of the S-plot
        are reliable biomarkers.
        """
        p = self.p_pred_  # Covariance (loadings)
        
        # Correlation between each variable and the predictive score
        t = self.t_pred_
        X_filt = self.X_filtered_
        pcorr = np.array([np.corrcoef(X_filt[:, j], t)[0, 1] for j in range(X_filt.shape[1])])
        
        return p, pcorr


print("✅ OPLS-DA class defined.")

### 9.1 Select Classes and Fit OPLS-DA

OPLS-DA works best for **pairwise** comparisons. Select two classes to compare below. If you have more than two classes, you can repeat this analysis for each pair.
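
One way to enumerate every pair is `itertools.combinations`. The sketch below uses hypothetical labels (`demo_labels`, including a made-up "Refined" class) in place of the notebook's `class_labels`, and only builds the subset mask and 0/1 coding for each pair. Fitting the model for each pair would follow the same steps as the cell below.

```python
from itertools import combinations

import numpy as np

# Hypothetical labels standing in for the notebook's class_labels array
demo_labels = np.array(["EVOO", "EVOO", "Refined", "Refined",
                        "Adulterated", "Adulterated"])
demo_classes = np.unique(demo_labels)

# Every unordered pair of classes: 3 classes give 3 pairs
pairs = list(combinations(demo_classes, 2))

for class_a, class_b in pairs:
    mask = np.isin(demo_labels, [class_a, class_b])
    # 0 = class_a, 1 = class_b, matching the coding used for y_ab
    y_pair = (demo_labels[mask] == class_b).astype(float)
    print(f"{class_a} vs {class_b}: {int(mask.sum())} samples, "
          f"{int(y_pair.sum())} coded as 1")
```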

In [None]:
# ============================================================
# OPLS-DA: SELECT CLASSES AND FIT
# ============================================================
CLASS_A = unique_classes[0]  # e.g., "EVOO"
CLASS_B = unique_classes[2]  # e.g., "Adulterated"; index 2 needs at least three classes, so use 1 for a two-class dataset
N_ORTHO = 1                  # Number of orthogonal components

print(f"Comparing: {CLASS_A} vs {CLASS_B}")

# Subset data
mask_ab = np.isin(class_labels, [CLASS_A, CLASS_B])
X_ab = X_processed[mask_ab]
y_ab = (class_labels[mask_ab] == CLASS_B).astype(float)  # 0 = CLASS_A, 1 = CLASS_B
ids_ab = sample_ids[mask_ab]
labels_ab = class_labels[mask_ab]

print(f"Samples: {(y_ab==0).sum()} x {CLASS_A}, {(y_ab==1).sum()} x {CLASS_B}")

# Fit OPLS-DA
opls = OPLSDA(n_ortho=N_ORTHO)
opls.fit(X_ab, y_ab)

print(f"\nR²Y = {opls.R2Y_:.3f}")
print(f"Predictive score range: {opls.t_pred_.min():.3f} to {opls.t_pred_.max():.3f}")

In [None]:
# ============================================================
# OPLS-DA SCORES PLOT: PREDICTIVE vs ORTHOGONAL
# ============================================================
opls_df = pd.DataFrame({
    "Sample_ID": ids_ab,
    "Class": labels_ab,
    "t_predictive": opls.t_pred_,
    "t_orthogonal": opls.t_ortho_[:, 0]
})

fig = px.scatter(
    opls_df, x="t_predictive", y="t_orthogonal",
    color="Class", hover_name="Sample_ID",
    title=f"OPLS-DA Scores: {CLASS_A} vs {CLASS_B} (R²Y = {opls.R2Y_:.3f})",
    labels={"t_predictive": "Predictive score (t[1])", "t_orthogonal": "Orthogonal score (t_ortho[1])"},
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig.add_vline(x=0, line_dash="dash", line_color="grey", opacity=0.5)
fig.update_layout(template="plotly_white", height=550, width=700)
fig.show()

In [None]:
# ============================================================
# S-PLOT: IDENTIFY RELIABLE BIOMARKERS
# ============================================================
p_cov, p_corr = opls.s_plot_data()

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=p_cov, y=p_corr,
    mode="markers",
    marker=dict(
        size=5, color=np.abs(p_corr),
        colorscale="RdYlBu_r", showscale=True,
        colorbar=dict(title="|Correlation|")
    ),
    hovertext=[str(v) for v in variable_names],  # no hard-coded unit: variables may be wavenumbers or chemical shifts
    hoverinfo="text"
))

# Highlight significant variables (|pcorr| > 0.5 AND |p| in top percentile)
p_thresh = np.percentile(np.abs(p_cov), 90)
sig_mask = (np.abs(p_corr) > 0.5) & (np.abs(p_cov) > p_thresh)

if sig_mask.any():
    fig.add_trace(go.Scatter(
        x=p_cov[sig_mask], y=p_corr[sig_mask],
        mode="markers+text",
        text=[f"{v}" for v in variable_names[sig_mask]],
        textposition="top center", textfont=dict(size=8),
        marker=dict(size=10, color="red", symbol="diamond"),
        name="Significant", showlegend=True
    ))

fig.add_hline(y=0, line_dash="dash", line_color="grey", opacity=0.3)
fig.add_vline(x=0, line_dash="dash", line_color="grey", opacity=0.3)

fig.update_layout(
    title=f"S-Plot: {CLASS_A} vs {CLASS_B}",
    xaxis_title="Covariance p(cov)",
    yaxis_title="Correlation p(corr)",
    template="plotly_white",
    height=550, width=700
)
fig.show()

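# The sign of the predictive score is arbitrary, so orient it using y_ab
# (1 = CLASS_B): if t_pred correlates positively with y_ab, positive p(cov)
# means the variable is higher in CLASS_B; otherwise the direction flips.
score_sign = np.sign(np.corrcoef(opls.t_pred_, y_ab)[0, 1])

print("Variables in S-plot corners (reliable biomarkers):")
if sig_mask.any():
    for idx in np.where(sig_mask)[0]:
        direction = "higher in " + (CLASS_B if score_sign * p_cov[idx] > 0 else CLASS_A)
        print(f"  {variable_names[idx]:>10s}: p(cov)={p_cov[idx]:+.4f}, p(corr)={p_corr[idx]:+.3f} ({direction})")
else:
    print("  No variables met both thresholds. Try adjusting the criteria above.")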
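
The two S-plot axes can also be computed directly from their definitions: the covariance and the correlation between each variable and the predictive score. With mean-centred data, the loading vector used above is proportional to this covariance, so the shape of the plot is the same. The sketch below is a vectorised version on assumed synthetic data. `X_demo` and `t_demo` are illustrative stand-ins for the filtered matrix and the predictive scores.

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(25, 8))   # assumed stand-in for the filtered X
# Assumed stand-in for the predictive scores, driven mostly by variable 0
t_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.3, size=25)

Xc = X_demo - X_demo.mean(axis=0)
tc = t_demo - t_demo.mean()
n_samples = X_demo.shape[0]

# Covariance of every variable with the score, in one matrix product
p_cov_demo = Xc.T @ tc / (n_samples - 1)

# Correlation = covariance scaled by both standard deviations
p_corr_demo = p_cov_demo / (Xc.std(axis=0, ddof=1) * tc.std(ddof=1))

# Cross-check against the column-by-column np.corrcoef approach
reference = np.array([np.corrcoef(X_demo[:, j], t_demo)[0, 1]
                      for j in range(X_demo.shape[1])])
print("matches np.corrcoef:", np.allclose(p_corr_demo, reference))
```

The vectorised form avoids one `np.corrcoef` call per variable, which matters for spectra with thousands of points.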

### 9.2 Permutation Testing (Model Validation)

Permutation testing is a widely used check that an OPLS-DA model is not simply fitting noise. The procedure:

1. Randomly shuffle the class labels
2. Fit a new OPLS-DA model on the shuffled data
3. Repeat many times (e.g., 100–200 permutations)
4. Compare the real model's R²Y with the distribution of permuted R²Y values

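If the real model's R²Y exceeds nearly all of the permuted values, the class separation is unlikely to have arisen by chance. Because R²Y measures fit to the training data only, a cross-validated statistic such as Q² gives a stricter check against overfitting.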

> **Note:** This can take a minute to run depending on the number of permutations.

In [None]:
# ============================================================
# PERMUTATION TEST FOR OPLS-DA
# ============================================================
N_PERMUTATIONS = 100  # Increase to 200 for publication-quality

print(f"Running {N_PERMUTATIONS} permutations...")

perm_R2Y = []
perm_corr = []  # Correlation between permuted and original y

rng = np.random.default_rng(42)  # fixed seed so the permutation results are reproducible

for i in range(N_PERMUTATIONS):
    # Shuffle y labels
    y_perm = rng.permutation(y_ab)
    corr_with_original = np.abs(np.corrcoef(y_ab, y_perm)[0, 1])
    
    # Fit OPLS-DA on permuted data
    opls_perm = OPLSDA(n_ortho=N_ORTHO)
    opls_perm.fit(X_ab, y_perm)
    
    perm_R2Y.append(opls_perm.R2Y_)
    perm_corr.append(corr_with_original)
    
    if (i + 1) % 20 == 0:
        print(f"  {i + 1}/{N_PERMUTATIONS} done...")

perm_R2Y = np.array(perm_R2Y)
perm_corr = np.array(perm_corr)

# Plot permutation results
fig = go.Figure()

# Permuted models
fig.add_trace(go.Scatter(
    x=perm_corr, y=perm_R2Y,
    mode="markers", name="Permuted",
    marker=dict(color="lightgrey", size=6, line=dict(color="grey", width=0.5))
))

# Real model (correlation = 1.0)
fig.add_trace(go.Scatter(
    x=[1.0], y=[opls.R2Y_],
    mode="markers", name="Real model",
    marker=dict(color="red", size=14, symbol="star")
))

# Regression line through permuted points
fit_coef = np.polyfit(perm_corr, perm_R2Y, 1)
x_line = np.linspace(0, 1, 50)
fig.add_trace(go.Scatter(
    x=x_line, y=np.polyval(fit_coef, x_line),
    mode="lines", name="Trend", line=dict(dash="dash", color="grey")
))

fig.update_layout(
    title=f"Permutation Test ({N_PERMUTATIONS} permutations)",
    xaxis_title="Correlation with original y",
    yaxis_title="R²Y",
    template="plotly_white",
    height=500, width=600
)
fig.show()

# Statistical test: empirical p-value. The real model is counted as one of
# the permutations (+1 in numerator and denominator) so p is never exactly 0.
p_value = ((perm_R2Y >= opls.R2Y_).sum() + 1) / (N_PERMUTATIONS + 1)
print(f"\nReal model R²Y: {opls.R2Y_:.3f}")
print(f"Permuted R²Y: {perm_R2Y.mean():.3f} ± {perm_R2Y.std():.3f}")
print(f"Permutation p-value: {p_value:.3f}")
if p_value < 0.05:
    print("✅ Model is statistically significant (p < 0.05). The separation is unlikely to be due to chance.")
else:
    print("⚠️ Model is NOT significant. Consider overfitting or insufficient class differences.")

## 10. Export Results

Download key results as CSV files for your report.

In [None]:
# ============================================================
# EXPORT KEY RESULTS
# ============================================================

# PCA scores (first 5 components, or fewer if the model has fewer)
n_pc_export = min(5, scores.shape[1])
pc_cols = [f"PC{i+1}" for i in range(n_pc_export)]
pca_export = pd.DataFrame(scores[:, :n_pc_export], columns=pc_cols)
pca_export.insert(0, "Sample_ID", sample_ids)
pca_export.insert(1, "Class", class_labels)
pca_export.to_csv("pca_scores.csv", index=False)

# PCA loadings (same components as the scores)
loadings_export = pd.DataFrame(loadings[:n_pc_export].T, columns=pc_cols)
loadings_export.insert(0, "Variable", variable_names)
loadings_export.to_csv("pca_loadings.csv", index=False)

# VIP scores
vip_export = pd.DataFrame({"Variable": variable_names, "VIP": vip_scores})
vip_export.to_csv("vip_scores.csv", index=False)

print("✅ Files saved:")
print("  - pca_scores.csv")
print("  - pca_loadings.csv")
print("  - vip_scores.csv")

# Download in Colab
try:
    from google.colab import files
    files.download("pca_scores.csv")
    files.download("pca_loadings.csv")
    files.download("vip_scores.csv")
except ImportError:
    print("\n(Not running in Colab — files saved to current directory)")

## References and Further Reading

Jolliffe, I.T. (2002). *Principal Component Analysis*, 2nd ed. Springer.

Barker, M. & Rayens, W. (2003). Partial least squares for discrimination. *Journal of Chemometrics*, 17(3), 166–173.

Trygg, J. & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). *Journal of Chemometrics*, 16(3), 119–128.

Wiklund, S. et al. (2008). Visualization of GC/TOF-MS-based metabolomics data for identification of biochemically interesting compounds using OPLS class models. *Analytical Chemistry*, 80(1), 115–122.

Chong, I.G. & Jun, C.H. (2005). Performance of some variable selection methods when multicollinearity is present. *Chemometrics and Intelligent Laboratory Systems*, 78(1–2), 103–112.

Eilers, P.H.C. & Boelens, H.F.M. (2005). Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre report.

Barnes, R.J., Dhanoa, M.S. & Lister, S.J. (1989). Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. *Applied Spectroscopy*, 43(5), 772–777.

---
*Notebook prepared for FBMFOR — Food Fraud Analysis, University of Reading*