# Baseline Classifier

In this notebook, I create two baseline classifiers: a random one, and a rule-based one that predicts from where each sample's peak falls, using the ranges I found in 02_DataExploration.ipynb.

The peaks for each analyte are at:
- Copper - [660, 720]
- Cadmium - [530, 580]
- Lead - [580, 620]
- Seawater - N/A (it's a flat baseline)
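
The ranges above can be turned into a simple rule. The sketch below is only an illustration, not the `MaxPeakClassifier` from `scripts.baseline`. It assumes a sample is a mapping from x-axis position to measured value, that a sample whose maximum stays under 2.5 counts as flat seawater, that the shared boundary at 580 goes to lead, and that a peak outside every range falls back to seawater. The name `classify_by_peak` is hypothetical.

```python
# Illustrative only: the real MaxPeakClassifier lives in scripts.baseline
# and may work differently.
PEAK_RANGES = {
    "cadmium": (530, 580),
    "lead": (580, 620),
    "copper": (660, 720),
}
FLAT_THRESHOLD = 2.5  # assumed cut-off below which a sample counts as flat


def classify_by_peak(sample, flat_threshold=FLAT_THRESHOLD):
    """Return a label from the position of the largest value in `sample`.

    `sample` maps an x-axis position to the measured value at that position.
    """
    peak_position = max(sample, key=sample.get)
    if sample[peak_position] < flat_threshold:
        return "seawater"  # no clear peak, treat as a flat baseline
    for label, (low, high) in PEAK_RANGES.items():
        # Half-open ranges (assumption) so that 580 belongs to lead only.
        if low <= peak_position < high:
            return label
    return "seawater"  # assumed fallback: peak outside every known range


example = {540: 1.0, 600: 7.5, 700: 2.0}
print(classify_by_peak(example))  # lead
```

How the shared 580 boundary and out-of-range peaks are handled is a design choice, and the real classifier may resolve them differently.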

## Random Classifier

Let's see how the model will perform if it guesses the classes 0-3 at random.
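
As a reference point, a guesser that picks uniformly among four classes should be right about 1 time in 4, whatever the class balance is. The sketch below checks this on made-up labels. It is not the `RandomClassifier` from `scripts.baseline`, and the function name `random_guess_accuracy` is hypothetical.

```python
import random


def random_guess_accuracy(labels, n_classes=4, seed=0):
    """Accuracy of guessing uniformly among `n_classes` integer labels."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(n_classes) == label for label in labels)
    return hits / len(labels)


# Hypothetical, deliberately imbalanced labels: the expected accuracy of a
# uniform guesser is still 1 / n_classes.
labels = [0] * 70_000 + [1] * 20_000 + [2] * 5_000 + [3] * 5_000
print(round(random_guess_accuracy(labels), 2))
```

With only a few dozen samples per fold, individual scores can sit well away from 0.25 just by chance.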

In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold
from functools import partial

from scripts.data_processing import get_class_label_to_int_mapping
from scripts.baseline import RandomClassifier, MaxPeakClassifier, \
                             print_scores, print_kfold_scores

CLASS_LABEL_TO_INT_MAPPING = get_class_label_to_int_mapping()

sns.set()

DATA_DIR = Path('data/')

# Read in dataframes - good to keep them separate to make plotting easier
cadmium = pd.read_csv(DATA_DIR / 'cadmium.csv', index_col=0)
copper = pd.read_csv(DATA_DIR / 'copper.csv', index_col=0)
lead = pd.read_csv(DATA_DIR / 'lead.csv', index_col=0)
seawater = pd.read_csv(DATA_DIR / 'seawater.csv', index_col=0)

analytes = pd.read_csv(DATA_DIR / 'all_data.csv', index_col=0)

X = analytes.drop('label', axis=1)
y = analytes.loc[:, 'label']

In [4]:
random_clf = RandomClassifier()
print('RANDOM CLASSIFIER SCORES')
print_scores(random_clf, X, y)
print()

max_peak_clf = MaxPeakClassifier()
print('MAX PEAK CLASSIFIER SCORES')
print_scores(max_peak_clf, X, y)

RANDOM CLASSIFIER SCORES
accuracy - 0.2571
f1_micro - 0.2571
precision_micro - 0.2571
recall_micro - 0.2571

MAX PEAK CLASSIFIER SCORES
accuracy - 0.8629
f1_micro - 0.8629
precision_micro - 0.8629
recall_micro - 0.8629


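The scores for accuracy, F1, precision and recall are identical, which looked odd to me at first. It is expected: every sample has exactly one label and the scores are micro averaged, so each wrong prediction counts as one false positive and one false negative, and micro precision, recall and F1 all reduce to accuracy. Macro or per-class scores would show differences between the classes. I still want to confirm this with Waylon.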
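
One way to check why micro-averaged scores coincide with accuracy in a single-label problem is to count the errors by hand: a wrong prediction is a false positive for the predicted class and a false negative for the true class, so the total number of false positives equals the total number of false negatives. The sketch below does this in plain Python on made-up labels. The function name `micro_scores` is hypothetical and nothing here calls scikit-learn.

```python
def micro_scores(y_true, y_pred):
    """Accuracy plus micro-averaged precision, recall and F1 (single-label data)."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)  # predicted c, but it was another class
            fn += (t == c and p != c)  # was c, but another class was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f1


# Made-up labels, purely for illustration (5 of 8 predictions are correct).
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 0, 1, 1, 3, 3]
print(micro_scores(y_true, y_pred))  # (0.625, 0.625, 0.625, 0.625)
```

The sketch does not guard against the case where no prediction is correct, in which case the F1 formula divides by zero.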

Let's also run a KFold and see if it makes a difference. I know the comparison is a bit unfair, since I used all of the data to build the rules. Should I go back and change that?

In [8]:
print('RANDOM CLASSIFIER KFOLD SCORES')
print_kfold_scores(random_clf, X, y, 5)

RANDOM CLASSIFIER KFOLD SCORES
FOLD 1
accuracy - 0.1714
f1_micro - 0.1714
precision_micro - 0.1714
recall_micro - 0.1714

FOLD 2
accuracy - 0.3143
f1_micro - 0.3143
precision_micro - 0.3143
recall_micro - 0.3143

FOLD 3
accuracy - 0.1714
f1_micro - 0.1714
precision_micro - 0.1714
recall_micro - 0.1714

FOLD 4
accuracy - 0.3429
f1_micro - 0.3429
precision_micro - 0.3429
recall_micro - 0.3429

FOLD 5
accuracy - 0.3429
f1_micro - 0.3429
precision_micro - 0.3429
recall_micro - 0.3429



In [7]:
print('MAX PEAK CLASSIFIER KFOLD SCORES')
print_kfold_scores(max_peak_clf, X, y, 5)

MAX PEAK CLASSIFIER KFOLD SCORES
FOLD 1
accuracy - 0.6286
f1_micro - 0.6286
precision_micro - 0.6286
recall_micro - 0.6286

FOLD 2
accuracy - 0.8286
f1_micro - 0.8286
precision_micro - 0.8286
recall_micro - 0.8286

FOLD 3
accuracy - 0.9143
f1_micro - 0.9143
precision_micro - 0.9143
recall_micro - 0.9143

FOLD 4
accuracy - 0.9714
f1_micro - 0.9714
precision_micro - 0.9714
recall_micro - 0.9714

FOLD 5
accuracy - 0.9714
f1_micro - 0.9714
precision_micro - 0.9714
recall_micro - 0.9714



Random classifier ranges from 0.1714 - 0.3429 (this changes from run to run)
Max peak classifier ranges from 0.6286 - 0.9714

In [20]:
def calc_average_kfold_accuracy(model, n_splits=5):
    """Mean accuracy of `model` over the test folds of an unshuffled KFold."""
    kfold = KFold(n_splits=n_splits)
    accuracies = []
    for _, test_idx in kfold.split(X):
        # test_idx holds positions, so index both X and y with .iloc
        X_test, y_test = X.iloc[test_idx, :], y.iloc[test_idx]
        acc = accuracy_score(y_test, model.predict(X_test))
        accuracies.append(acc)
    return np.mean(accuracies)

In [21]:
calc_average_kfold_accuracy(random_clf)

0.26285714285714284

In [22]:
calc_average_kfold_accuracy(max_peak_clf)

0.8628571428571428

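For the `MaxPeakClassifier`, the average KFold accuracy (0.8629) is the same as its accuracy on the whole dataset. This makes sense: the model is deterministic and nothing is fitted per fold, so as long as the folds are the same size, averaging the fold accuracies is the same as scoring every sample once. The `RandomClassifier` lands close (0.2629 vs 0.2571) but not exactly, because its guesses change on every call.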
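
The link between the fold average and the whole-dataset score can be checked with a little arithmetic. In the sketch below, the fold size of 35 is an assumption that happens to reproduce the printed max peak fold accuracies (for example 22/35 ≈ 0.6286); it is not read from the data. The second set of folds is entirely hypothetical and shows that the two numbers can differ once fold sizes are unequal. Both function names are made up for this example.

```python
def mean_fold_accuracy(folds):
    """Unweighted mean of per-fold accuracies; `folds` holds (n_correct, n_samples)."""
    return sum(correct / size for correct, size in folds) / len(folds)


def pooled_accuracy(folds):
    """Accuracy over all folds pooled together."""
    return sum(c for c, _ in folds) / sum(n for _, n in folds)


# Assumed counts: 35 samples per fold reproduces the printed max peak scores.
equal_folds = [(22, 35), (29, 35), (32, 35), (34, 35), (34, 35)]
print(round(mean_fold_accuracy(equal_folds), 4),
      round(pooled_accuracy(equal_folds), 4))  # 0.8629 0.8629

# Hypothetical unequal folds: the two numbers no longer have to agree.
unequal_folds = [(9, 10), (30, 60)]
print(round(mean_fold_accuracy(unequal_folds), 4),
      round(pooled_accuracy(unequal_folds), 4))  # 0.7 0.5571
```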

I did not split the dataset into train/val/test sets before creating these baseline models. For the `RandomClassifier` this is not an issue, since the prediction does not depend on the sample. For the `MaxPeakClassifier`, however, it will have caused some data leakage: I used the whole population to create these rules and, if I had used only a training set, I would not necessarily have come up with the same rules. I decided to do this for brevity and to get a better understanding of the dataset as a whole. So the 86.29% accuracy is probably an overestimate of the model's power. But it is a decent baseline to compare everything else against. 86% is very strong for a rule-based approach, and the most an ML model can add is 13.71 percentage points (roughly a 16% relative improvement).
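
The headroom above the baseline can be stated in two ways: in percentage points, or relative to the baseline. A quick check of the arithmetic, using the 0.8629 accuracy reported above (rounding the baseline to 0.86 instead gives roughly 16.3% relative):

```python
baseline_accuracy = 0.8629  # MaxPeakClassifier accuracy on the full dataset

# Absolute headroom, in percentage points, up to a perfect score.
headroom_points = (1.0 - baseline_accuracy) * 100

# Relative headroom: how much larger a perfect score is than the baseline.
headroom_relative = (1.0 - baseline_accuracy) / baseline_accuracy * 100

print(f"{headroom_points:.2f} percentage points")      # 13.71 percentage points
print(f"{headroom_relative:.1f}% relative improvement")  # 15.9% relative improvement
```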

Yes, there could be some data leakage here, but really only in the threshold we would choose for seawater, and it is clear that the vast majority (90%+) of our seawater samples stay below 2.5. Moreover, the peak ranges were chosen to be wide enough to have room for all samples. So there is probably a bit of data leakage, but for brevity we will work with it and treat 86% as the minimum that our ML models need to achieve.

Also, the train and test sets should come from a similar population. There is no randomness in the `MaxPeakClassifier`, yet KFold shows that different subsets of the data give different results (0.6286 to 0.9714 across the folds).