# Exercise set 7: Classification & Signal processing

There are two main goals for this exercise:

1. To develop optimised classifiers (e.g., a decision tree) using cross-validation, and gain experience with the assessment of classifiers.
2. To gain practical experience with signal processing techniques used for preprocessing, for instance, Near-Infrared (NIR) spectra. Preprocessing is important for improving the signal-to-noise ratio, correcting for scattering effects (e.g., variations in the light path caused by particle size), and enhancing spectral features, which can lead to more reliable analysis and more robust predictive models.

**Learning Objectives:**

After completing this exercise set, you will be able to:

- Develop optimised classifiers and assess them.
- Preprocess spectra by normalisation, multiplicative scatter correction, or taking a derivative.

**To get the exercise approved, complete the following problems:**

- [7.1(a)](#7.1(a)), [7.1(b)](#7.1(b)) and [7.1(c)](#7.1(c)): To show that you can create an optimised classifier.
- [7.2(a)](#7.2(a)) and at least one of [7.2(b)](#7.2(b)), [7.2(c)](#7.2(c)) or [7.2(d)](#7.2(d)): To show that you can apply preprocessing to NIR spectra.

**Files required for this exercise:**
* [Exercise 7.1](#Exercise-7.1-Developing-optimised-classifiers): [bace-small.csv](bace-small.csv)
* [Exercise 7.2](#Exercise-7.2-Preprocessing-NIR-spectra): [nir.csv](nir.csv)

Please ensure that these files are saved in the same directory as this notebook.

## Exercise 7.1 Developing optimised classifiers

Here we consider a version of the **BACE** dataset from [MoleculeNet](https://moleculenet.org) (a site containing benchmark data for molecular machine learning).

This data set contains 1513 molecules that have been labelled by their binding affinity to BACE-1 (1 for active, 0 for inactive). Active binders of BACE-1 could potentially be used as treatments for Alzheimer’s Disease.

The version we consider contains 9 of the roughly 590 features in the original data:
* `MW`: The total mass (molecular weight) of the molecule.
* `AlogP`: The partition coefficient. Measures the lipophilicity (how much the molecule prefers oil over water).
* `HBA`: Hydrogen Bond Acceptors, the number of atoms that can receive a hydrogen bond.
* `HBD`: Hydrogen Bond Donors, the number of groups (such as O-H or N-H) that can donate a hydrogen atom to form a hydrogen bond.
* `PSA`: Polar Surface Area, the total surface area contributed by polar atoms.
* `RB`: Rotatable Bonds, a count of single bonds that can rotate freely.
* `HeavyAtomCount`: The total number of atoms in the molecule, excluding hydrogen.
* `ChiralCenterCount`: The number of chiral centres in the molecule.
* `RingCount`: The number of cyclic structures (like benzene rings) in the molecule.

The data can be loaded as follows:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

data = pd.read_csv("bace-small.csv")
skip = ("mol", "Class")
y = data["Class"]  # Classification of samples.
class_names = ["Inactive", "Active"]
features = [i for i in data.columns if i not in skip]
print("Features:", features)
X = data[features].to_numpy()
data.head()

### 7.1(a)

**Task: In the following task, you will develop a classifier to predict whether a small molecule is an active (positive) or inactive (negative) binder. Which error type (false positive or false negative) should be minimised?**

#### Your answer to question 7.1(a): Will you minimise false positives or negatives?
*Double click here*

### 7.1(b)

**Task: Prepare the data by splitting it into train/test sets and standardise the features. Use the code provided below and note the order of operations.**


**Hint:**

1. Use the code provided in the cell below to create the training and test sets. We use something called **stratification** here. This makes sure that the train/test split maintains the same proportion of classes as the original dataset.

In [None]:
# To create the training set use:
from sklearn.model_selection import train_test_split

X_train0, X_test0, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.33, random_state=2026
)

2. Preprocess the data using the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train0)

X_train = scaler.transform(X_train0)
X_test = scaler.transform(X_test0)

#### Your answer to question 7.1(b): Can you give a reason why we fit the `StandardScaler` to the training data and not to the full data set?
*Double click here*

### 7.1(c)

**Task: Create a decision tree classifier to classify molecules. Optimise the tree depth using cross-validation on a training set. Report the optimal maximum depth of the resulting tree.**

With reference to the previous problem:

* If you prioritised minimising false positives, use the `precision` as your optimisation metric.
* If you prioritised minimising false negatives, use the `recall` as your optimisation metric.
* If you opted for a balanced approach, use the `balanced_accuracy` as your optimisation metric.
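
To connect these three metrics to the two error types, here is a minimal sketch that computes them by hand from the confusion matrix counts. The arrays `y_true` and `y_pred` are made up for illustration and are not taken from the BACE data.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels, for illustration only (1 = active, 0 = inactive).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() gives the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # Lowered by false positives.
recall = tp / (tp + fn)  # Lowered by false negatives.
specificity = tn / (tn + fp)  # Recall of the negative class.
balanced_accuracy = (recall + specificity) / 2

# The manual values agree with the scikit-learn functions:
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(balanced_accuracy, balanced_accuracy_score(y_true, y_pred))

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

Precision only involves false positives and recall only involves false negatives, which is why the choice of metric follows from your answer to [7.1(a)](#7.1(a)).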


**Hint:**
1. The optimisation of the decision tree can be done as follows (assuming that you have already split into the training and test sets):


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Set up a grid search:
parameters = {"max_depth": range(1, 10)}
grid_t = GridSearchCV(
    DecisionTreeClassifier(),
    parameters,
    scoring="accuracy",  # Swap this with the metric you prefer
    refit=True,
)
# Run the grid search:
grid_t.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_tree = grid_t.best_estimator_
print("Best tree:", best_tree)
print("Best score", grid_t.best_score_)
print("Best parameters", grid_t.best_params_)
print("Actual depth of the tree:", best_tree.get_depth())
# Note: The max_depth parameter is just the maximum depth,
# the resulting tree can be shorter if adding more levels
# does not improve the classification.

#### Your answer to question 7.1(c): What depth did you get for your tree?
*Double click here*

### 7.1(d)

**Task: Visualise your decision tree and use this to describe how the classification is made.**

**Hint:** The decision tree can be visualised using [plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) or [export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html):

1. Easiest: Using [plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html):

```python
from sklearn import tree


tree.plot_tree(
    best_tree,  # The tree to plot
    filled=True,  # Add colour to the boxes.
    feature_names=features,  # Get name for features.
    class_names=class_names,  # Get the name of the different classes.
)
```

2. Looks nicer: Using [export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html):

```python
from sklearn.tree import export_graphviz  # To create the tree.
import graphviz  # To turn the tree into a graph, you may need to install this (pip install graphviz).
from IPython.display import display  # To show the graph.

dot_data = export_graphviz(
    best_tree,  # The tree to plot.
    out_file=None,  # Do not write to file.
    feature_names=features,  # Get name for features.
    class_names=class_names,  # Get the name of the different classes.
    rounded=True,  # Show the boxes in the tree with rounded corners.
    filled=True,  # Add colour to the boxes.
)
display(graphviz.Source(dot_data))  # Show the tree in a notebook.
```
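
If neither plot works in your environment, the decision rules can also be printed as plain text with `export_text` from scikit-learn. The sketch below is only an illustration: it uses synthetic stand-in data and hypothetical feature names (`f0` to `f3`) rather than the BACE data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (not the BACE set), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)
demo_features = ["f0", "f1", "f2", "f3"]  # Hypothetical feature names.

# Fit a small tree so that the printed rules stay short:
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Turn the fitted tree into indented text, one line per split or leaf:
rules = export_text(demo_tree, feature_names=demo_features)
print(rules)
```

For your own tree, you would pass `best_tree` and `features` instead of the stand-in objects.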

In [None]:
# Your code here

#### Your answer to question 7.1(d): What features is the best decision tree using?
*Double click here*

### 7.1(e)

**Task: Create a k-nearest neighbours classifier to classify the molecules. Optimise the number of neighbours using cross-validation on a training set. Report the optimal number of neighbours.**

**Hint:**

1. The optimisation of the k-nearest neighbours classifier can be done as follows (assuming that you have already split into the training and test sets):

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Set up a grid search:
parameters = {"n_neighbors": range(1, 20)}
grid_knn = GridSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="accuracy",  # Swap this with the metric you prefer
)
# Run the grid search:
grid_knn.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_knn = grid_knn.best_estimator_
print("Best knn:", best_knn)
print("Best score", grid_knn.best_score_)
print("Best parameters", grid_knn.best_params_)

#### Your answer to question 7.1(e): What was the optimal number of neighbours?
*Double click here*

### 7.1(f)

**Task: Create a random forest classifier to classify molecules. Optimise the number of trees and levels using cross-validation on a training set. Report the optimal number of trees and levels.**

**Hint:**

1. The optimisation of the random forest classifier can be done similarly to what you did in [7.1(c)](#7.1(c)) and [7.1(e)](#7.1(e)). You just have to make use of the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and optimise the parameters `n_estimators` and `max_depth`:
```python
from sklearn.ensemble import RandomForestClassifier

# Set up a grid search:
parameters = {
    "n_estimators": [10, 50, 100, 200, 500],  # the number of trees
    "max_depth": range(1, 11),  # the maximum depth
}
grid = GridSearchCV(
    RandomForestClassifier(),
    parameters,
    scoring="accuracy",  # Swap this with the metric you prefer
    verbose=2,  # Print out text to show the progress of the fitting
)

# ... rest of the optimisation code ...
```

In [None]:
# Your code here

#### Your answer to question 7.1(f): What was the optimal number of estimators and tree depth?
*Double click here*

### 7.1(g)

**Task: Compare the three optimised classifiers you have made by applying them to the test set and obtaining the corresponding confusion matrices. Also compute the [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), and the [balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) for the test set. Which classifier performs best?**



**Hint:** The metrics can be computed as follows:
```python
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import (
    recall_score,
    precision_score,
    balanced_accuracy_score,
)

y_hat = best_tree.predict(X_test)
recall_tree = recall_score(y_test, y_hat)
precision_tree = precision_score(y_test, y_hat)
bac_tree = balanced_accuracy_score(y_test, y_hat)
print(f"Recall: {recall_tree:.3f}")
print(f"Precision: {precision_tree:.3f}")
print(f"Balanced accuracy: {bac_tree:.3f}")

ConfusionMatrixDisplay.from_estimator(
    best_tree,
    X_test,
    y_test,
    colorbar=True,
)
```

In [None]:
# Your code here

#### Your answer to question 7.1(g): Which classifier performs best?
*Double click here*

### 7.1(h)

Explore if you can create an even better classifier using a gradient boosting method (e.g., [XGBoost](https://xgboost.readthedocs.io)) or a foundation model (e.g., [TabPFN](https://github.com/PriorLabs/TabPFN); note that this requires some extra steps for the installation).
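
As one possible starting point, here is a minimal sketch using the `GradientBoostingClassifier` that ships with scikit-learn, swapped in for XGBoost because it needs no extra installation. The data are synthetic stand-ins and the small parameter grid is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with 9 features (not the BACE set).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.33, random_state=0
)

# A deliberately small grid (an assumption) to keep the run time short:
parameters = {
    "n_estimators": [50, 100],  # The number of boosting stages.
    "learning_rate": [0.05, 0.1],  # The contribution of each stage.
}
grid_gb = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    parameters,
    scoring="balanced_accuracy",  # Swap this with the metric you prefer
)
grid_gb.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out test set:
y_hat = grid_gb.best_estimator_.predict(X_te)
bac_gb = balanced_accuracy_score(y_te, y_hat)
print("Best parameters", grid_gb.best_params_)
print(f"Balanced accuracy (test): {bac_gb:.3f}")
```

XGBoost's `XGBClassifier` follows the scikit-learn estimator interface, so it should be possible to use it in `GridSearchCV` in the same way, although its parameter names and defaults differ.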

In [None]:
# Your code here

#### Your answer to question 7.1(h): Do you get better performance?
*Double click here*

## Exercise 7.2 Preprocessing NIR spectra

We will analyse NIR spectra from two distinct Ethiopian [sorghum](https://en.wikipedia.org/wiki/Sorghum) cultivars to determine if they can be differentiated. Specifically, we will examine how different preprocessing techniques impact the outcome of a principal component analysis (PCA) applied to the spectra. 

**Note:**

1. The dataset used in this exercise is derived from [Kosmowski and Worku](https://doi.org/10.1371/journal.pone.0193620), who used a miniaturised NIR spectrometer to identify Ethiopian crop cultivars. To simplify the analysis, we focus on measurements from only two of the ten sorghum cultivars studied in the original work.

2. This exercise will mainly ask you to run and observe results from already implemented code.

### 7.2(a)

The following code performs these steps:

1. Loads the NIR spectra from the data file [nir.csv](./nir.csv).
2. Extracts wavelengths, spectra, and cultivar names.
3. Defines colours for plotting cultivars.
4. Creates a function to plot spectra by cultivar.
5. Creates a function to run a PCA on provided spectra and plot the scores of the first two principal components.
6. Initialises a figure for results.
7. Plots the original spectra and the PCA results.

**Task: Execute the code and observe the generated plot. In the PCA scores plot, are there any noticeable groupings that suggest cultivar separation?**

In [None]:
# Load the needed libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

# Load the raw data:
data = pd.read_csv("nir.csv")
data.head()

In [None]:
# Extract information from the data
variables = [i for i in data.columns if i != "Cultivator"]
# Wavelengths as numbers:
wavelengths = np.array([float(i) for i in variables])
print(f"Number of wavelengths: {len(wavelengths)}")
# All spectra as a data matrix:
all_spectra = data[variables].to_numpy()
print(f"Size of data matrix: {all_spectra.shape}")
# Names of the two cultivars (stored in the "Cultivator" column):
cultivators = data["Cultivator"].unique()
print(f"Cultivators: {cultivators}")

In [None]:
# Define a colour mapping for the two cultivars:
colors = sns.color_palette("colorblind", n_colors=len(cultivators))
color_mapping = {key: colori for key, colori in zip(cultivators, colors)}
# Show the two colors
colors

In [None]:
def plot_spectra(data, X, wavelengths, color_mapping, axi, legend=False):
    """
    Plots NIR spectra from the given data matrix X, colour-coded by cultivar.

    Args:
        data (pandas.DataFrame): DataFrame containing cultivar information.
        X (numpy.ndarray): Matrix of NIR spectra, where each row is a spectrum.
        wavelengths (numpy.ndarray): Array of corresponding wavelengths for the spectra.
        color_mapping (dict): Dictionary mapping cultivar names to colours.
        axi (matplotlib.axes.Axes): Matplotlib Axes object for plotting.
        legend (bool, optional): Whether to include a legend. Defaults to False.

    Returns:
        None (plots directly to the provided Axes object).
    """
    # Initialise empty lists to store legend handles and labels:
    handles, labels = [], []
    for cultivator in color_mapping.keys():
        # Filter spectra belonging to the current cultivar
        spectra_cult = X[data["Cultivator"] == cultivator]
        color = color_mapping[cultivator]
        for spectrum in spectra_cult:
            # Plot each spectrum with the assigned color
            (linei,) = axi.plot(wavelengths, spectrum, color=color)
        # Append the line handle and cultivar label for the legend
        handles.append(linei)
        labels.append(cultivator)
    if legend:
        # Add a legend to the plot if 'legend' is True
        axi.legend(handles, labels, title="Cultivator:")

In [None]:
def run_pca_plot_scores(data, X, color_mapping, axi):
    """
    Performs Principal Component Analysis (PCA) on the input spectra and plots the scores (colour-coded).

    Args:
        data (pandas.DataFrame): DataFrame containing cultivar information.
        X (numpy.ndarray): Matrix of NIR spectra, where each row is a spectrum.
        color_mapping (dict): Dictionary mapping cultivar names to colours.
        axi (matplotlib.axes.Axes): Matplotlib Axes object for plotting.

    Returns:
        None (plots directly to the provided Axes object).
    """
    pca = PCA(n_components=2)  # Initialize PCA with 2 components
    scores = pca.fit_transform(X)  # Perform PCA and get the scores
    sns.scatterplot(
        data=data,
        x=scores[:, 0],
        y=scores[:, 1],
        hue="Cultivator",
        palette=color_mapping,
        legend=False,
        ax=axi,
    )
    # Calculate explained variance ratios
    perc = pca.explained_variance_ratio_ * 100
    # Set axis labels with explained variance percentages
    axi.set_xlabel(f"Scores PC1 ({perc[0]:.2f}%)")
    axi.set_ylabel(f"Scores PC2 ({perc[1]:.2f}%)")

In [None]:
figure1, axes1 = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))

plot_spectra(
    data,
    all_spectra,
    wavelengths,
    color_mapping,
    axes1[0],
    legend=True,
)
run_pca_plot_scores(data, all_spectra, color_mapping, axes1[1])

axes1[0].set_xlabel("Wavelength (nm)")
axes1[0].set_ylabel("Absorbance")
axes1[0].set_title("Original spectra", loc="left")
axes1[1].set_title("PCA, Original spectra", loc="left")
sns.despine(fig=figure1)

#### Your answer to question 7.2(a): Is there a clear cultivar separation in the scores plot?
*Double click here*

### 7.2(b)

**Task: Observe the impact of normalisation on the spectra and PCA results. In the PCA scores plot, are there any noticeable groupings that suggest cultivar separation?**

**Hint:**
1. Apply one of the provided normalisations to scale the spectra, for instance
```python
normed = normalise_spectra(all_spectra)
```
2. Plot the normalised spectra and the corresponding PCA results side-by-side. For instance,
```python
figure2, axes2 = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))
plot_spectra(data, normed, wavelengths, color_mapping, axes2[0], legend=True)
run_pca_plot_scores(data, normed, color_mapping, axes2[1])
```
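
To see why row-wise normalisation can help, consider a toy case in which the same underlying spectrum is measured three times with different offsets and scale factors. The sketch below uses made-up data (`base` and `toy_spectra` are hypothetical) and the standard normal variate formula, (x - mean) / std per row.

```python
import numpy as np

# Hypothetical underlying spectrum and three distorted "measurements":
base = np.sin(np.linspace(0, 3, 50))
toy_spectra = np.vstack(
    [
        base,  # Undistorted.
        2.0 * base + 0.5,  # Scaled up and shifted up.
        0.5 * base - 0.2,  # Scaled down and shifted down.
    ]
)

# Standard normal variate: centre and scale each row separately.
row_mean = toy_spectra.mean(axis=1, keepdims=True)
row_std = toy_spectra.std(axis=1, keepdims=True)
toy_snv = (toy_spectra - row_mean) / row_std

# After the correction, all three rows fall on the same curve:
same_01 = np.allclose(toy_snv[0], toy_snv[1])
same_02 = np.allclose(toy_snv[0], toy_snv[2])
print(same_01, same_02)
```

Real spectra also differ in shape, so the rows will not coincide exactly, but purely additive and (positive) multiplicative differences between spectra are removed in this way.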

In [None]:
from sklearn.preprocessing import Normalizer


def normalise_spectra(spectra):
    """Normalise each spectrum (row) to the range [-1, 1]."""
    s_min = spectra.min(axis=1, keepdims=True)
    s_max = spectra.max(axis=1, keepdims=True)
    return 2 * (spectra - s_min) / (s_max - s_min) - 1


def vector_norm(spectra):
    """Scale each row to unit length (L2 norm of 1)."""
    scaler = Normalizer(norm="l2")
    return scaler.fit_transform(spectra)


def snv(spectra):
    """Normalise by standardising each row: (x - mean) / std, per row."""
    return (spectra - np.mean(spectra, axis=1, keepdims=True)) / np.std(
        spectra, axis=1, keepdims=True
    )


def max_peak_normalisation(spectra):
    """Normalise spectra so that the max for each spectrum is at 1."""
    return spectra / np.max(spectra, axis=1, keepdims=True)

In [None]:
# Your code here

#### Your answer to question 7.2(b): Is there a clear cultivar separation in the scores plot?
*Double click here*

### 7.2(c)

**Task: Observe the impact of multiplicative scatter correction (MSC) on the spectra and PCA results. In the PCA scores plot, are there any noticeable groupings that suggest cultivar separation?**

**Hint:**
1. Apply the provided MSC function to correct the spectra, for instance,
```python
corrected = multiplicative_scatter_correction(all_spectra)
```
2. Plot the corrected spectra and the corresponding PCA results side-by-side. For instance,
```python
figure3, axes3 = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))
plot_spectra(data, corrected, wavelengths, color_mapping, axes3[0], legend=True)
run_pca_plot_scores(data, corrected, color_mapping, axes3[1])
```
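
The idea behind MSC can be illustrated on a toy case where each spectrum is an exact linear function of a common shape. This is a minimal sketch with made-up data (`base`, `slopes` and `offsets` are hypothetical), following the same steps as the provided function: regress each spectrum on the mean spectrum and then undo the fitted slope and intercept.

```python
import numpy as np

# Hypothetical absorption band and three scatter-distorted copies of it:
base = np.exp(-np.linspace(-2, 2, 60) ** 2)
slopes = np.array([0.8, 1.0, 1.3])
offsets = np.array([0.10, 0.00, -0.05])
toy_spectra = slopes[:, None] * base + offsets[:, None]

# The mean spectrum is used as the reference:
reference = toy_spectra.mean(axis=0)

toy_msc = np.empty_like(toy_spectra)
for i, spectrum in enumerate(toy_spectra):
    # Fit spectrum = slope * reference + intercept:
    slope, intercept = np.polyfit(reference, spectrum, 1)
    # Undo the fitted scatter effect:
    toy_msc[i] = (spectrum - intercept) / slope

# Every corrected spectrum now coincides with the reference:
all_equal = np.allclose(toy_msc, reference)
print(all_equal)
```

In real data the relation to the mean spectrum is only approximately linear, so what remains after the correction is the part of each spectrum that the linear fit cannot explain.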

In [None]:
def multiplicative_scatter_correction(spectra):
    """
    Applies Multiplicative Scatter Correction (MSC) to the input spectra.

    MSC is a preprocessing technique used to reduce the effects of scatter in spectral data.
    It corrects for variations in path length and particle size, which can affect the
    baseline and slope of the spectra.

    Args:
        spectra (numpy.ndarray): Matrix of spectra, where each row is a spectrum.

    Returns:
        numpy.ndarray: MSC-corrected spectra matrix.
    """

    mean = np.mean(spectra, axis=0)  # Calculate the mean spectrum
    msc_spectra = np.zeros_like(
        spectra
    )  # Initialise an array to store MSC-corrected spectra
    for i, spectrum in enumerate(spectra):
        # Fit a linear regression model to each spectrum against the mean spectrum
        param = np.polyfit(mean, spectrum, 1)
        # Apply the MSC correction: (spectrum - intercept) / slope
        msc_spectra[i] = (spectrum - param[1]) / (param[0])
    return msc_spectra

In [None]:
# Your code here

#### Your answer to question 7.2(c): Is there a clear cultivar separation in the scores plot?
*Double click here*

### 7.2(d)

**Task: Investigate the impact of applying a second derivative transformation on the spectra and PCA results. In the PCA scores plot, are there any noticeable groupings that suggest cultivar separation?**

**Hint:**
1. Use the provided code to calculate the second derivative of the original spectra, for instance,

```python
dspectra = derivative(wavelengths, all_spectra, deriv=2)
```
2. Plot the resulting second derivative spectra and the corresponding PCA results side-by-side. For instance,
```python
figure4, axes4 = plt.subplots(constrained_layout=True, ncols=2, figsize=(8, 4))
plot_spectra(data, dspectra, wavelengths, color_mapping, axes4[0], legend=True)
run_pca_plot_scores(data, dspectra, color_mapping, axes4[1])
```

**Note:** The derivative is computed using the [Savitzky-Golay filter](https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter). This method smooths the data by fitting a polynomial to a moving window of points and then calculating the derivative of that fitted polynomial. The method, as implemented here, assumes evenly spaced data points. It may produce inaccurate results if your wavelengths are unevenly spaced. In such cases, alternative methods like B-spline derivatives or other interpolation-based approaches might be more suitable.
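
As a quick sanity check of this behaviour, the filter can be applied to a curve with a known second derivative. The sketch below uses a made-up cubic on an evenly spaced grid. It relies on SciPy's default edge handling (`mode="interp"`), which differs from the `mode="nearest"` used in the function provided for this exercise.

```python
import numpy as np
from scipy.signal import savgol_filter

# Evenly spaced grid and a cubic with a known second derivative:
x = np.linspace(0.0, 10.0, 201)
y = x**3 - 2.0 * x**2  # Second derivative: 6x - 4.
dx = x[1] - x[0]  # Grid spacing, needed to scale the derivative.

# Fit a cubic in each 21-point window and differentiate it twice:
d2y = savgol_filter(y, window_length=21, polyorder=3, deriv=2, delta=dx)

# A cubic is reproduced exactly by a third-order local fit:
matches = np.allclose(d2y, 6.0 * x - 4.0, atol=1e-6)
print(matches)
```

For noisy data the local fit no longer matches every point, and the choice of `window_length` and `polyorder` then controls how much smoothing is applied before differentiating.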

In [None]:
from scipy.signal import savgol_filter


def derivative(wavelengths, spectra, window_length=21, polyorder=3, deriv=2):
    """
    Calculates the derivative of the input spectra using the Savitzky-Golay filter.

    This function applies the Savitzky-Golay filter to smooth and differentiate the
    input spectra. The filter is used to reduce noise and enhance spectral features.

    Args:
        wavelengths (numpy.ndarray): Array of wavelengths corresponding to the spectra.
        spectra (numpy.ndarray): Matrix of spectra, where each row is a spectrum.
        window_length (int): The length of the filter window (must be odd).
        polyorder (int): The order of the polynomial used to fit the samples.
        deriv (int, optional): The order of the derivative to compute. Defaults to 2 (second derivative).

    Returns:
        numpy.ndarray: Matrix of derivative spectra.
    """
    # Apply Savitzky-Golay filter to calculate the derivative:
    delta_w = wavelengths[1] - wavelengths[0]
    dspectra = savgol_filter(
        spectra,
        window_length,
        polyorder,
        deriv=deriv,
        delta=delta_w,  # Wavelength spacing
        mode="nearest",  # Extrapolation mode at the edges
        axis=1,  # Process each row
    )
    return dspectra

In [None]:
# Your code here

#### Your answer to question 7.2(d): Is there a clear cultivar separation in the scores plot?
*Double click here*

### 7.2(e)

**Task: Explain how the Savitzky-Golay filter uses polynomial fitting to smooth data and compute derivatives.**

**Hint:** See page 149 in our textbook.

#### Your answer to question 7.2(e): Your explanation for Savitzky-Golay filtering?

*Double click here*

### 7.2(f)

**Task: The figure below displays the results of the preprocessing steps from exercise [7.2(a)](#7.2(a)) to [7.2(d)](#7.2(d)). Based on these results, which preprocessing method appears most promising for building a classifier?**

![Preprocessing NIR results](results7.2.png)

#### Your answer to question 7.2(f): Which preprocessing step appears most promising?
*Double click here*

### 7.2(g)

**Task: Assuming that the largest variation in the raw data is due to scattering effects, we could assume that PCA should pick up on this in the first (and perhaps second) principal component. If you go back to [7.2(a)](#7.2(a)), will the scores plot look more promising if you use principal components 2 and 3?**
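
One way to get at the later components is to ask PCA for three components and pick the relevant columns of the score matrix. This is a minimal sketch on random stand-in data (`X_demo` is hypothetical and not the NIR spectra), and it only shows the indexing, not a finished plot.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (not the NIR spectra), for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 10))

pca = PCA(n_components=3)  # Keep three components instead of two.
scores = pca.fit_transform(X_demo)

# Columns are zero-indexed: column 1 is PC2 and column 2 is PC3.
pc2 = scores[:, 1]
pc3 = scores[:, 2]

perc = pca.explained_variance_ratio_ * 100
print(f"PC2: {perc[1]:.2f}%, PC3: {perc[2]:.2f}%")
```

The two arrays can then be plotted against each other in the same way as in `run_pca_plot_scores`.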

In [None]:
# Your code here

#### Your answer to question 7.2(g): Will using other principal components improve the separation in the scores plot?
*Double click here*