# Forensic discrimination of lipsticks

![Illustration of method](method.png)


The [raw data](https://doi.org/10.25917/5bee60501fdf0) contains spectroscopic measurements of red- and nude-shaded lipsticks. Here, we will build a model that identifies the lipstick brand from infrared spectroscopy (ATR-FTIR) measurements. The authors who collected the data built similar models in their [publication](https://doi.org/10.1016/j.forsciint.2019.02.044). One motivation behind their work was to create a "non-destructive characterisation of lipsticks for forensic purposes".

# 1. Inspecting the raw data

The raw data can be downloaded from the repository linked above with DOI: [10.25917/5bee60501fdf0](https://doi.org/10.25917/5bee60501fdf0). Assuming that the data has been downloaded, we will focus on the files with ATR-FTIR data. We will use:

* `ATR-FTIR - Calibration.xlsx` for creating the model(s).
* `ATR-FTIR - Validation.xlsx` for testing the model(s).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib notebook

sns.set_context("notebook")

In [None]:
data1 = pd.read_excel("ATR-FTIR - Calibration.xlsx")

In [None]:
data1.head()

In [None]:
data1["Brand"].unique()

In [None]:
data1["Product"].unique()

For plotting, we first extract the wavenumber and then the intensities:

In [None]:
wavenumber = [
    i for i in data1.columns if i not in ("Sample", "Product", "Brand")
]
spectra = data1[wavenumber].to_numpy()

Let us check how many spectra we have and how many wavenumbers:

In [None]:
spectra.shape

And we plot each individual spectrum (here the color is just used to distinguish them):

In [None]:
colors = sns.color_palette("hls", len(spectra))
fig, ax = plt.subplots(constrained_layout=True)
for i, (row, colori) in enumerate(zip(spectra, colors)):
    ax.plot(wavenumber, row, color=colori, label=f"Sample {i}", lw=2)
ax.set(xlabel="Wavenumber / cm$^{-1}$", ylabel="Intensity / a.u.")
sns.despine(ax=ax)

# 2. Simplification of the data by dimensionality reduction

Before building any models, we reduce the number of variables to see what we can learn about the structure of the data. The spectra can differ in overall intensity, so we first normalize each spectrum to unit Euclidean length:

In [None]:
X = data1[wavenumber].to_numpy()
norms = np.linalg.norm(X, axis=1)

X_normed = X / norms[:, np.newaxis]
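
To see what this normalization does, here is a minimal sketch on made-up numbers (the array `X_demo` is hypothetical and not taken from the lipstick data). Two rows that differ only by an overall scale factor become identical after each row is divided by its own norm:

```python
import numpy as np

# Made-up "spectra": the second row is the first one scaled by a factor of 3.
X_demo = np.array(
    [
        [1.0, 2.0, 2.0],
        [3.0, 6.0, 6.0],
        [0.0, 3.0, 4.0],
    ]
)

# Euclidean (L2) norm of each row, i.e. of each spectrum.
norms_demo = np.linalg.norm(X_demo, axis=1)

# Divide each row by its own norm; np.newaxis makes the shapes broadcast.
X_demo_normed = X_demo / norms_demo[:, np.newaxis]

print(norms_demo)  # [3. 9. 5.]
print(np.linalg.norm(X_demo_normed, axis=1))  # every row now has length 1
```

After normalization, only the shape of a spectrum matters, not its overall intensity.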

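We then use principal component analysis (PCA) to reduce the number of variables. Setting `n_components=0.95` keeps as many components as are needed to explain 95% of the variance: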

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
scores = pca.fit_transform(X_normed)
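
The effect of passing a fraction as `n_components` can be checked on synthetic data. The sketch below is an illustration only: the data are random numbers generated from three hidden factors, and names such as `pca_demo` are hypothetical. It shows that the number of retained components is chosen by scikit-learn and that the retained components together explain at least 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 50 samples and 20 correlated variables, generated from
# 3 hidden factors plus a small amount of noise.
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(50, 20))

# A float between 0 and 1 is read as the fraction of variance to keep.
pca_demo = PCA(n_components=0.95)
scores_demo = pca_demo.fit_transform(X_demo)

# Cumulative fraction of variance explained by the retained components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("Components kept:", pca_demo.n_components_)
print("Cumulative explained variance:", cumulative)
```

The attribute `n_components_` (with a trailing underscore) holds the number of components that were actually kept.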

In [None]:
target_key = "Brand"

targets = data1[target_key].unique()

colors = sns.color_palette("hls", len(targets))
color_map = {key: colors[i] for i, key in enumerate(targets)}

In [None]:
def plot_scores(component1=0, component2=1):
    fig, ax = plt.subplots(constrained_layout=True)
    for target in targets:
        xpos = scores[data1[target_key] == target, component1]
        ypos = scores[data1[target_key] == target, component2]
        ax.scatter(
            xpos,
            ypos,
            color=color_map[target],
            label=f"{target}",
            s=90,
        )
    ax.legend(fontsize="x-small", ncols=2)
    var1 = pca.explained_variance_ratio_[component1]
    var2 = pca.explained_variance_ratio_[component2]
    ax.set(
        xlabel=f"Scores, component {component1+1} ({var1*100:.2f}%)",
        ylabel=f"Scores, component {component2+1} ({var2*100:.2f}%)",
    )
    sns.despine(fig=fig)
    return fig, ax


fig, _ = plot_scores(component1=0, component2=1)

**Note**: We can interpret the two new axes above in terms of the original variables (the wavenumbers). This is a topic for later in the course.

Assuming that we do not know which brands the samples in `ATR-FTIR - Validation.xlsx` belong to, we can make some guesses by comparing where these samples fall in the reduced-dimensionality space:

In [None]:
data2 = pd.read_excel("ATR-FTIR - Validation.xlsx")
data2.head()

In [None]:
X_val = data2[wavenumber].to_numpy()
norms = np.linalg.norm(X_val, axis=1)

X_normed_val = X_val / norms[:, np.newaxis]

In [None]:
colors = sns.color_palette("hls", len(spectra))
fig, ax = plt.subplots(constrained_layout=True)
for row, colori in zip(spectra, colors):
    ax.plot(wavenumber, row, color=colori)

ax.plot(wavenumber, X_val[0, :], color="k")
ax.set(xlabel="Wavenumber / cm$^{-1}$", ylabel="Intensity / a.u.")
sns.despine(ax=ax)

In [None]:
scores_val = pca.transform(X_normed_val)

In [None]:
fig, ax = plot_scores(component1=0, component2=1)

ax.scatter(
    scores_val[:, 0],
    scores_val[:, 1],
    marker="s",
    facecolor="none",
    edgecolor="k",
    label="Unknown",
)

ax.legend(fontsize="x-small", ncols=2)
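
The visual comparison above can also be done numerically. One simple option, which is not used elsewhere in this notebook, is to assign each unknown sample to the brand whose average score (centroid) is closest. The sketch below uses made-up two-dimensional scores and hypothetical brand names, so it only illustrates the idea:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(1)

# Made-up 2D "scores" for two hypothetical brands that are well separated.
scores_a = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(10, 2))
scores_b = rng.normal(loc=[2.0, 0.0], scale=0.3, size=(10, 2))
scores_known = np.vstack([scores_a, scores_b])
labels_known = np.array(["Brand A"] * 10 + ["Brand B"] * 10)

# Two "unknown" samples to compare against the known groups.
scores_unknown = np.array([[-1.8, 0.2], [2.1, -0.1]])

# Each class is represented by its centroid; a new sample gets the label
# of the nearest centroid.
centroid_model = NearestCentroid().fit(scores_known, labels_known)
guesses = centroid_model.predict(scores_unknown)

print(guesses)
```

With real spectra the groups overlap more than in this toy example, which is one reason to train proper classification models.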

# 3. Classification models

A classification model is a supervised model: it is trained on samples with known labels and predicts the class (here, the brand) of new samples. We will create several classification models in this example and compare them.

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder().fit(data1[target_key])
y = encoder.transform(data1[target_key])
y
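
As a small illustration of what `LabelEncoder` does, the sketch below encodes a few hypothetical brand names (not the real ones in the data set) as integers and converts them back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical brand names, for illustration only.
brands_demo = np.array(["Brand C", "Brand A", "Brand B", "Brand A"])

encoder_demo = LabelEncoder().fit(brands_demo)

# Classes are sorted, so "Brand A" -> 0, "Brand B" -> 1, "Brand C" -> 2.
encoded = encoder_demo.transform(brands_demo)

# inverse_transform maps the integers back to the original names.
decoded = encoder_demo.inverse_transform(encoded)

print(encoder_demo.classes_)
print(encoded)  # [2 0 1 0]
print(decoded)
```

The same `inverse_transform` call can be used later to turn predicted integers back into brand names.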

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_normed, y, test_size=0.2, stratify=y
)
print(X_train.shape, X_test.shape)
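
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in the training and test sets. A minimal sketch with made-up labels (the arrays and the fixed `random_state` are assumptions for this illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 10 samples of class 0 and 20 each of classes 1 and 2.
y_demo = np.array([0] * 10 + [1] * 20 + [2] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Count how many samples of each class ended up in each part.
print("Train:", np.bincount(y_tr))  # [ 8 16 16]
print("Test: ", np.bincount(y_te))  # [2 4 4]
```

Without stratification, a small class could by chance be missing from the test set entirely.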

In [None]:
reduce = PCA(n_components=10).fit(X_train)
X_train_pca = reduce.transform(X_train)
X_test_pca = reduce.transform(X_test)

In [None]:
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def score_model(model, X, y_true, y_pred=None):
    """Return accuracy and macro-averaged F1, recall, and precision."""
    if y_pred is None:
        y_pred = model.predict(X)

    score = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro"),
    }
    return score


def add_scores(add_to, model, X, y_true):
    """Score the model and append the results to the dictionary add_to."""
    name = model.__class__.__name__

    scores = score_model(model, X, y_true)

    for key, val in scores.items():
        add_to[key].append(val)
    add_to["model"].append(name)


keep_scores_test = {
    "model": [],
    "accuracy": [],
    "f1": [],
    "recall": [],
    "precision": [],
}

all_models = []
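
The scores above use `average="macro"`, which computes the metric for each class separately and then takes the unweighted mean over the classes. A small worked example with made-up labels shows how this differs from plain accuracy:

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: four samples of class 0 and two of class 1.
# One sample of class 1 is wrongly predicted as class 0.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 0]

# Accuracy: 5 of 6 predictions are correct.
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Recall per class: 4/4 for class 0 and 1/2 for class 1, mean = 0.75.
recall_macro = recall_score(y_true_demo, y_pred_demo, average="macro")

# Precision per class: 4/5 for class 0 and 1/1 for class 1, mean = 0.9.
precision_macro = precision_score(y_true_demo, y_pred_demo, average="macro")

# F1 per class: 8/9 for class 0 and 2/3 for class 1, mean = 7/9.
f1_macro = f1_score(y_true_demo, y_pred_demo, average="macro")

print(accuracy, recall_macro, precision_macro, f1_macro)
```

Because every class counts equally in a macro average, mistakes on a small class lower the score more than they lower the accuracy.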

## 3.1 Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

bayes = GaussianNB()

bayes.fit(X_train_pca, y_train)

all_models.append(bayes)

add_scores(keep_scores_test, bayes, X_test_pca, y_test)
pd.DataFrame(keep_scores_test)

## 3.2 Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

parameters = {"n_estimators": [10, 50, 100], "max_depth": [None, 3, 6, 9, 100]}

grid = GridSearchCV(
    RandomForestClassifier(), parameters, cv=2, scoring="accuracy", refit=True
)

grid.fit(X_train_pca, y_train)

forest = grid.best_estimator_

all_models.append(forest)


add_scores(keep_scores_test, forest, X_test_pca, y_test)
pd.DataFrame(keep_scores_test)
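
After a grid search it can be useful to look at which parameter combination was selected and how the others scored. The sketch below does this on synthetic data from `make_classification` with a smaller, hypothetical parameter grid; it is an illustration of the `GridSearchCV` attributes, not a rerun of the search on the lipstick data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=60,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# A reduced grid: 2 x 2 = 4 parameter combinations.
parameters_demo = {"n_estimators": [10, 50], "max_depth": [None, 3]}

grid_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    parameters_demo,
    cv=2,
    scoring="accuracy",
    refit=True,
)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination with its mean cross-validated accuracy.
results_demo = pd.DataFrame(grid_demo.cv_results_)[
    ["param_n_estimators", "param_max_depth", "mean_test_score"]
]

print(grid_demo.best_params_)
print(results_demo)
```

With `refit=True`, `best_estimator_` is the model with the best parameters, retrained on all the data passed to `fit`.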

## 3.3 Support Vector Machine

In [None]:
from sklearn.svm import SVC

parameters = {"C": [0.001, 0.1, 0.5, 1.0], "kernel": ["poly", "rbf"]}


grid = GridSearchCV(
    SVC(probability=True), parameters, cv=2, scoring="accuracy", refit=True
)

grid.fit(X_train_pca, y_train)

support = grid.best_estimator_
print(support)

all_models.append(support)

add_scores(keep_scores_test, support, X_test_pca, y_test)
pd.DataFrame(keep_scores_test)

## 3.4 CatBoost

In [None]:
from catboost import CatBoostClassifier

cat = CatBoostClassifier(verbose=0)
cat.fit(X_train_pca, y_train)

all_models.append(cat)

add_scores(keep_scores_test, cat, X_test_pca, y_test)

In [None]:
pd.DataFrame(keep_scores_test)

## 3.5 k-nearest neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

parameters = {
    "n_neighbors": range(1, 11),
}


grid = GridSearchCV(
    KNeighborsClassifier(), parameters, cv=2, scoring="accuracy", refit=True
)

grid.fit(X_train_pca, y_train)

knn = grid.best_estimator_
all_models.append(knn)

add_scores(keep_scores_test, knn, X_test_pca, y_test)
pd.DataFrame(keep_scores_test)

# 4. Try the models on validation data

We have not used the data in `ATR-FTIR - Validation.xlsx` to create the models, so we can use it to check how well they perform on unseen samples. We now score each model's brand predictions on this data.

In [None]:
y_val = encoder.transform(data2["Brand"])

X_val_pca = reduce.transform(X_normed_val)

In [None]:
scores_validation = {
    "model": [],
    "accuracy": [],
    "f1": [],
    "recall": [],
    "precision": [],
}


for model in all_models:
    # add_scores calls model.predict itself, so no separate prediction is needed.
    add_scores(scores_validation, model, X_val_pca, y_val)

results = pd.DataFrame(scores_validation)
results.sort_values(by="accuracy")
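
The table above summarizes each model with a few numbers. To see the predicted brands themselves, the integer predictions can be mapped back to names with `inverse_transform` and placed next to the true labels. The sketch below shows the pattern on made-up data with hypothetical brand names and a simple k-nearest neighbors model:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(2)

# Made-up training data: two well-separated groups in two dimensions.
X_known = np.vstack(
    [
        rng.normal(loc=-3.0, scale=0.5, size=(8, 2)),
        rng.normal(loc=3.0, scale=0.5, size=(8, 2)),
    ]
)
brands_known = ["Brand A"] * 8 + ["Brand B"] * 8

encoder_demo = LabelEncoder().fit(brands_known)
model_demo = KNeighborsClassifier(n_neighbors=3).fit(
    X_known, encoder_demo.transform(brands_known)
)

# Made-up "validation" samples with known brands.
X_new = np.array([[-2.5, -3.2], [3.1, 2.7]])
brands_new = ["Brand A", "Brand B"]

# Predict integer labels and convert them back to brand names.
predicted = encoder_demo.inverse_transform(model_demo.predict(X_new))

comparison = pd.DataFrame({"true": brands_new, "predicted": predicted})
comparison["correct"] = comparison["true"] == comparison["predicted"]
print(comparison)
```

Applied to the notebook's own variables, the same pattern would use `encoder`, one of the models in `all_models`, `X_val_pca`, and `data2["Brand"]`.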