# Exercise set 14

> As you near the end of TKJ4175, it's time to test your newly acquired skills! In this final exercise, you will analyze NMR spectra and identify unknown oils using the knowledge you have gained in this course.

The data file [Data/nmr_oil.csv](./Data/nmr_oil.csv) contains ¹H NMR spectra measured for 
six edible oils: sesame, olive, peanut, sunflower, canola, and corn. We have five spectra for each oil, and each spectrum is recorded at 1100 chemical shifts. We also have three spectra of unknown oils in the data file [Data/nmr_unknown_oil.csv](./Data/nmr_unknown_oil.csv). 

Here's the challenge: we have a limited amount of information on the unknown samples. They could be any of the six known oils we have measured, but the three unknown oils may be of the same kind, or they can all be different. Your task is to decipher their identities.

**Use your chemometrics skills and identify the three oils!**

## Plotting example spectra

To get you started, here is some code to plot example spectra:

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline
sns.set_context("notebook")

In [None]:
data = pd.read_csv("Data/nmr_oil.csv")
data_unknown = pd.read_csv("Data/nmr_unknown_oil.csv")
data.head()
# The column oil contains the oil type, and the other
# columns contain the intensity at the shift value given
# by the column name.

In [None]:
fig, axes = plt.subplots(
    constrained_layout=True, nrows=6, sharex=True, figsize=(9, 12)
)
# ppm values are:
ppms = np.array([float(i.split("ppm")[0]) for i in data.columns if "ppm" in i])
# Loop over oil types and plot one example of each:
for i, oil_type in enumerate(data["oil"].unique()):
    intensity = data[data["oil"] == oil_type].to_numpy()[0, 1:]
    # Note: The selection [0, 1:] above picks the first of
    # the five spectra for the selected oil type, and then
    # skips the first column (index 0), since this is
    # the oil column.
    axes[i].plot(ppms, intensity)
    axes[i].set(ylabel="Intensity")
    axes[i].set_title(f"Oil: {oil_type}", loc="left")
axes[-1].invert_xaxis()
axes[-1].set_xlabel("ppm")
sns.despine(fig=fig)

In [None]:
fig, axes = plt.subplots(constrained_layout=True, figsize=(9, 3))
# ppm values are:
ppms = np.array(
    [float(i.split("ppm")[0]) for i in data_unknown.columns if "ppm" in i]
)
# Show all the unknowns
spectra = data_unknown.to_numpy()[:, 1:]
for i, intensity in enumerate(spectra):
    axes.plot(ppms, intensity, label=f"Unknown oil {i+1}")
axes.set(ylabel="Intensity")
axes.invert_xaxis()
axes.set_xlabel("ppm")
axes.legend(loc="upper left")
sns.despine(fig=fig)

## 0. Preprocessing

We first take care of the different overall intensities by normalizing every spectrum to unit Euclidean length:

In [None]:
variables = [i for i in data.columns if "ppm" in i]
X_known = data[variables].to_numpy()
X_unknown = data_unknown[variables].to_numpy()

In [None]:
# We scale each spectrum to unit Euclidean (L2) norm, so that its squared intensities sum to one:
for i, row in enumerate(X_known):
    X_known[i, :] /= np.sqrt(np.dot(row, row))

for i, row in enumerate(X_unknown):
    X_unknown[i, :] /= np.sqrt(np.dot(row, row))
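
The two loops above divide each spectrum by its Euclidean (L2) norm. A vectorized sketch of the same operation is shown below. The helper name `normalize_rows` and the small demo array are illustrative assumptions and not part of the exercise data:

```python
import numpy as np


def normalize_rows(X):
    """Return a copy of X where every row has unit Euclidean (L2) norm."""
    X = np.asarray(X, dtype=float)
    # keepdims=True keeps the norms as a column so they broadcast over rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms


# Small made-up example: two "spectra" with three intensities each.
demo = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])
demo_normalized = normalize_rows(demo)
print(np.linalg.norm(demo_normalized, axis=1))
```

With the arrays from this notebook, the call would be `X_known = normalize_rows(X_known)`. Unlike the loops, this returns a new array instead of modifying the input in place.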

## 1. PCA

Let us use PCA to find out what the unknowns are:

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
scores = pca.fit_transform(X_known)
scores_unknown = pca.transform(X_unknown)

In [None]:
# Create some colors for plotting:
oils = data["oil"].unique()
colors = sns.color_palette("hls", len(oils))
color_map = {key: colors[i] for i, key in enumerate(oils)}

In [None]:
def plot_scores(component1=0, component2=1):
    fig, ax = plt.subplots(constrained_layout=True)
    for oil in oils:
        ax.scatter(
            scores[data["oil"] == oil, component1],
            scores[data["oil"] == oil, component2],
            color=color_map[oil],
            label=f"{oil}",
        )

    for i, unknown in enumerate(scores_unknown):
        ax.scatter(
            unknown[component1],
            unknown[component2],
            marker="X",
            label=f"Unknown {i+1}",
            edgecolor="k",
        )
    ax.legend()
    var1 = pca.explained_variance_ratio_[component1]
    var2 = pca.explained_variance_ratio_[component2]
    ax.set(
        xlabel=f"Scores, component {component1+1} ({var1*100:.2f}%)",
        ylabel=f"Scores, component {component2+1} ({var2*100:.2f}%)",
    )
    sns.despine(fig=fig)


plot_scores(component1=0, component2=1)
plot_scores(component1=0, component2=2)

Based on the plots above, it looks like the first two samples are corn, while the last
one is peanut.
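
The visual assignment can be backed up with a number. Here is a minimal sketch, assuming it is reasonable to compare each unknown with the mean score (centroid) of every oil. The function `nearest_centroid` and the synthetic demo data are hypothetical, and the Euclidean distance in score space is only one possible choice:

```python
import numpy as np


def nearest_centroid(scores_known, labels, scores_new):
    """Assign each row in scores_new to the class with the closest centroid."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One centroid (mean score vector) per class:
    centroids = np.array(
        [scores_known[labels == c].mean(axis=0) for c in classes]
    )
    # Distance from every new sample to every centroid,
    # with shape (n_new, n_classes):
    diff = scores_new[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.linalg.norm(diff, axis=2)
    return classes[distances.argmin(axis=1)], distances


# Synthetic demo: two well-separated classes in a 2D score space.
rng = np.random.default_rng(0)
demo_scores = np.vstack(
    [rng.normal(0.0, 0.1, (5, 2)), rng.normal(5.0, 0.1, (5, 2))]
)
demo_labels = np.array(["a"] * 5 + ["b"] * 5)
demo_new = np.array([[0.1, -0.1], [4.9, 5.2]])
predicted, distances = nearest_centroid(demo_scores, demo_labels, demo_new)
print(predicted)
```

For the oils, the call would look like `nearest_centroid(scores, data["oil"].to_numpy(), scores_unknown)`. The returned distances also show how clear-cut each assignment is.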

Let us investigate the loadings to see if we can figure out some shifts that differentiate
between the oils:

In [None]:
# Plot loadings:
load1 = pca.components_[0, :]
load2 = pca.components_[1, :]

fig, ax = plt.subplots()
ax.axhline(y=0, ls=":", color="black", lw=1)
ax.axvline(x=0, ls=":", color="black", lw=1)
ax.set_xlim(-0.4, 0.4)
ax.set_ylim(-0.4, 0.4)
ax.set_aspect("equal")
ax.scatter(load1, load2)

# Show text for the "biggest" loadings:
distance = load1 * load1 + load2 * load2
idx = np.argsort(distance)[-10:]
for i in idx:
    ax.text(load1[i], load2[i], variables[i], fontsize="small")
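
The text labels in the plot can overlap, so it may help to also list the same variables. A sketch, where `top_loadings` and the demo values are made up for illustration:

```python
import numpy as np


def top_loadings(load1, load2, names, n=10):
    """Return the n variables farthest from the origin in the loading plot."""
    # Squared distance from the origin, as used for the labels in the plot:
    distance = load1**2 + load2**2
    # Indices of the n largest distances, largest first:
    idx = np.argsort(distance)[::-1][:n]
    return [(names[i], float(load1[i]), float(load2[i])) for i in idx]


# Made-up loadings for four variables:
demo_names = ["1.20ppm", "1.30ppm", "2.00ppm", "5.30ppm"]
demo_load1 = np.array([0.1, 0.8, 0.0, -0.3])
demo_load2 = np.array([0.0, 0.1, 0.05, 0.4])
for name, l1, l2 in top_loadings(demo_load1, demo_load2, demo_names, n=2):
    print(f"{name}: {l1:+.2f}, {l2:+.2f}")
```

In this notebook, the call would be `top_loadings(load1, load2, variables)`.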

It seems that shifts in the range 1.2 to 1.4 ppm are important for the distinction. Let us plot the spectra in this region:

In [None]:
fig, axes = plt.subplots(
    constrained_layout=True, ncols=3, figsize=(12, 3), sharex=True
)

# Plot all known spectra in each panel, colored by oil type:
lines, labels = [], []
for i, row in enumerate(X_known):
    oil_name = data["oil"][i]
    (linei,) = axes[0].plot(ppms, row, color=color_map[oil_name])
    (linei,) = axes[1].plot(ppms, row, color=color_map[oil_name])
    (linei,) = axes[2].plot(ppms, row, color=color_map[oil_name])
    if oil_name not in labels:
        labels.append(oil_name)
        lines.append(linei)


for i, unknown in enumerate(X_unknown):
    # axes[i].plot(ppms, unknown, color="k", alpha=0.5, lw=3)
    scat = axes[i].scatter(ppms, unknown, color="k", alpha=0.5, zorder=3)
    axes[i].set_title(f"Unknown {i+1}", loc="left")
    if i == 0:
        labels.append("unknown")
        lines.append(scat)

axes[0].legend(lines, labels, fontsize="small")
axes[0].set_xlim(1.2, 1.4)
axes[0].set(xlabel="ppm", ylabel="Intensity")
axes[1].set(xlabel="ppm")
axes[2].set(xlabel="ppm")
sns.despine(fig=fig)

The plot above agrees with the scores: unknowns 1 and 2 look a lot like corn oil, while unknown 3 is similar to peanut.

## 2. Classification with Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
# Since this is a classification problem, we need to create a "y" value:
encoder = LabelEncoder().fit(data["oil"])
y = encoder.transform(data["oil"])

In [None]:
# Split into training and test sets, although we have few samples (and too many variables, to be honest...)
X_train, X_test, y_train, y_test = train_test_split(
    X_known, y, test_size=0.2, stratify=y
)

In [None]:
bayes = GaussianNB()
bayes.fit(X_train, y_train)
y_pred = bayes.predict(X_test)

In [None]:
# Use Bayes to get probabilities for each class
y_unknown = bayes.predict_proba(X_unknown)
y_unknown

Here, the probabilities are zero for all classes except one of them.
So the classifier is pretty sure here. Let us output the names of the classes
with the largest probabilities:

In [None]:
y_unknown_class = y_unknown.argmax(axis=1)
unknown_class = encoder.inverse_transform(y_unknown_class)
print("The unknown oils are:")
unknown_class

The conclusion here is the same as for the PCA: two are from corn and the last one is from peanut. Here, we should consider reducing the number of variables and using the scores from PCA (the first few components) as input to simplify the model.
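
A minimal sketch of that suggestion, assuming a small fixed number of principal components (three here, an arbitrary choice) and stratified cross-validation to make better use of the few samples. The synthetic data only stands in for the spectra; `make_blobs` and the parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 30 samples, 6 classes, 50 variables.
X_demo, y_demo = make_blobs(
    n_samples=30, centers=6, n_features=50, cluster_std=0.5, random_state=0
)

# PCA is fitted inside each fold, so the held-out samples never
# influence the scores used for training.
model = make_pipeline(PCA(n_components=3), GaussianNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracy = cross_val_score(model, X_demo, y_demo, cv=cv)
print(f"Mean accuracy: {accuracy.mean():.2f}")

# Fit on all samples before predicting new ones:
model.fit(X_demo, y_demo)
```

For the oils, the same pattern would be `cross_val_score(model, X_known, y, cv=cv)`, followed by `model.fit(X_known, y)` and `encoder.inverse_transform(model.predict(X_unknown))` to get the predicted oil names.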