# Training school "Data and Models" INRAE 2024

## Exercise 2: Introduction to ML - Classification
We are now going to see a few short examples of classification, to give you a global idea of the task. Unlike regression tasks, where the objective is to predict a continuous value for the target feature, in classification the objective is to assign each sample a class label. A class label is a discrete value with an associated meaning (e.g. Good/Medium/Bad, Healthy/Unhealthy, Cat/Dog/Horse).

In [None]:
# don't worry about this code here, totally normal stuff and absolutely not a horrible hack
# to make things work in Google colaboratory, just run it and don't think about it
!pip uninstall scikit-learn --yes
!pip uninstall imblearn --yes
!pip install scikit-learn==1.2.2
!pip install imblearn

In [None]:
# we need (again) to install the openml library to download the data set we are going to work on
!pip install openml

# load and download the dataset
import openml

dataset = openml.datasets.get_dataset(188)
print(dataset)

Let's now take a look at the data.

The dataset 188, "eucalyptus", describes the expected utility of a seed lot of eucalyptus, taking into account information such as location, altitude, initial height of the seedlings, etc. The class here is a set of discrete values, ranging from a terrible outcome (all seedlings died) to a great outcome, as evaluated by human experts.

We can even check how many samples are available for each class, something that will be useful for later.

In [None]:
# get the data into a dataframe object, which makes it easier to manipulate datasets
df, *_ = dataset.get_data()
#print(df)

# print the different values of the target variable: as you will see, there are
# just five values, qualitative assessment of the final quality of the eucalyptus
# 'none' means that all seedlings died
target_feature = dataset.default_target_attribute
other_features = [c for c in df.columns if c != target_feature]
class_labels = df[target_feature].sort_values(ascending=False).unique() # all this stuff is just to get them in order from best to worst
print("Unique values of the class:", class_labels) # print the unique values

# before starting with the learning process, we need to tackle the categorical variables
# and convert them to numerical values
categorical_columns = df[other_features].select_dtypes(include=['category', 'object', 'string'])
print("Categorical columns found:", categorical_columns.columns)
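for c in categorical_columns :
  # assign the result back to the column instead of using inplace=True on df[c],
  # which in recent versions of pandas may act on a temporary copy and leave df unchanged
  df[c] = df[c].replace({category : index for index, category in enumerate(df[c].astype('category').cat.categories)})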

# also remove all rows with missing values
df.dropna(inplace=True)

# this block of code just prints how many samples are available for each class label;
# in general, we would like to have a balanced data set, with an equal number of
# samples for each class label
for class_label in class_labels :
  df_class_label = df[df[target_feature] == class_label]
  print("For class label \"%s\", we have %d samples (%.2f%% of the total)" %
        (class_label, df_class_label.shape[0], df_class_label.shape[0] * 100/df.shape[0]))

# get just the data without the column headers, as numerical matrices; for the target feature,
# we have to replace the strings ('best', 'good', ...) with integers (0, 1, 2, ...); finally our data is ready
X = df[other_features].values

import numpy as np
dictionary = {class_label : i for i, class_label in enumerate(class_labels)} # creates a dictionary to go from label name to an integer
y = np.array([dictionary[y_i] for y_i in df[target_feature].values])

It is interesting to notice that the data set is not perfectly balanced. One of the classes has over 30% of all samples, while the least represented class label has around 15% of the samples.

Just as was the case for regression, we are going to evaluate the performance of our classification algorithm using cross-validation. This time, we are going to use several different metrics, to compare the information that we are able to obtain from each.

Also, for classification problems we would like to have each of the _k_ splits in the _k-fold cross-validation_ to 'look like' the original data: we would not like to have a split that contains only samples from one class, for example. To attain this objective, we are going to use a cross-validation variant called _stratified cross-validation_, that will try to keep the proportion of samples of each class in each fold as close as possible to the original proportion in the whole dataset. So, for example, if the original dataset has 30% samples of class A, 40% samples of class B, and 30% samples of class C, the stratified cross-validation will attempt to keep the 30-40-30 proportion in each fold.
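To make the idea concrete, here is a minimal sketch on made-up labels that follow the 30-40-30 example above. The names `X_toy` and `y_toy` are hypothetical and have nothing to do with the eucalyptus data; the sketch only uses scikit-learn's `StratifiedKFold` to show how the class proportions look in the test part of each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels (assumption, for illustration only): 30 samples of class 0,
# 40 samples of class 1, 30 samples of class 2
y_toy = np.array([0] * 30 + [1] * 40 + [2] * 30)
X_toy = np.zeros((y_toy.shape[0], 1))  # the feature values do not matter for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_proportions = []
for fold, (train_index, test_index) in enumerate(skf.split(X_toy, y_toy)):
    # count how many samples of each class ended up in the test part of this fold
    fold_counts = np.bincount(y_toy[test_index], minlength=3)
    fold_proportions.append(fold_counts / fold_counts.sum())
    print("Fold %d: test samples per class = %s" % (fold, fold_counts))
```

With these toy numbers the class sizes are exact multiples of the number of folds, so every fold can reproduce the original proportions exactly; on real data the proportions are only approximately preserved.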

Stratified cross-validation is already built into the _cross_validate_ and _cross_val_predict_ functions we used in the previous exercise: when _cv_ is an integer, the model is a classifier, and the values of the target _y_ are class labels, the stratified variant is used automatically.
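If you are curious about which splitter gets selected, the sketch below uses scikit-learn's public helper `check_cv`, which, to the best of our knowledge, applies the same selection rule as the cross-validation functions. The targets `y_labels` and `y_values` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import check_cv

# hypothetical targets: discrete class labels versus continuous values
y_labels = np.array([0, 1, 2] * 10)
y_values = np.linspace(0.0, 1.0, num=30)

# an integer cv combined with a classifier and class labels selects the stratified variant
cv_for_classifier = check_cv(5, y_labels, classifier=True)
# an integer cv combined with a regressor selects the plain k-fold variant
cv_for_regressor = check_cv(5, y_values, classifier=False)

print("Splitter chosen for classification:", type(cv_for_classifier).__name__)
print("Splitter chosen for regression:", type(cv_for_regressor).__name__)
```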

In [None]:
# create a classifier, we are going to pick Random Forest again, because it's
# fast to train and it is able to natively manage integer variables without further preprocessing
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# since we are going to use several metrics, it's faster to just obtain the predictions
# of the test set for each fold, and then compute the metrics afterwards
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(classifier, X, y, cv=5)

# let's see if the final result was good, using different metrics
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
print("Accuracy: %.4f" % accuracy_score(y, y_pred))
print("F1 score: %.4f" % f1_score(y, y_pred, average='weighted'))
print("Matthews correlation coefficient: %.4f" % matthews_corrcoef(y, y_pred))

Another useful visualization for the results of classification problems is the _confusion matrix_, a plot showing the number of samples that were correctly and incorrectly classified, and more specifically _where_ the mistakes are (for example: when the algorithm misclassifies a sample belonging to class A, does it tend to attribute it to class B or to class C?).
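Before plotting the matrix for the real data, it can help to read one on a tiny made-up example. In the sketch below, `y_true_toy` and `y_pred_toy` are hypothetical labels; each row of the matrix corresponds to a true class, each column to a predicted class, and the diagonal contains the correctly classified samples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions for 10 samples and 3 classes
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

# rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# the sum of the diagonal divided by the total number of samples is the accuracy
toy_accuracy = np.trace(cm_toy) / cm_toy.sum()
print("Accuracy computed from the matrix: %.2f" % toy_accuracy)
```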

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# create a figure, and display the confusion matrix
fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred,
    display_labels=class_labels,
    ax=ax,
)

Now, the classification performance is not great. However, from the confusion matrix, we can easily observe that most of the mistakes happen between adjacent classes, especially among the best two ('best' and 'good') and worst two ('low' and 'none'). We can use the option 'normalize' to have a more precise idea of the ratio of samples from a class that end up with the correct label or another.
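As a small illustration of what the normalization does, the sketch below computes the same confusion matrix twice on made-up labels (the arrays are hypothetical), once as raw counts and once with `normalize='true'`, which divides each row by the number of samples of that true class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels: 8 samples of class 0 and 2 samples of class 1
y_true_small = np.array([0] * 8 + [1] * 2)
y_pred_small = np.array([0] * 6 + [1] * 2 + [0] * 1 + [1] * 1)

cm_counts = confusion_matrix(y_true_small, y_pred_small)
cm_ratios = confusion_matrix(y_true_small, y_pred_small, normalize='true')

print("Raw counts:")
print(cm_counts)
print("Normalized over the true classes (each row sums to 1):")
print(cm_ratios)
```

Reading the normalized rows makes classes of different sizes directly comparable, since each row shows how the samples of one true class are distributed over the predicted labels.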

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred,
    display_labels=class_labels,
    normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for Random Forest")

Let's now check if we can do better! Random Forest is nice, but we will try two other popular classifiers: "eXtreme Gradient Boosting", or XGBoost; and "Category Boosting", or CatBoost. Both are gradient-boosting methods that are widely used by practitioners, especially on tabular data.

In [None]:
!pip install catboost

In [None]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

classifier_xgboost = XGBClassifier(verbosity=0)
classifier_catboost = CatBoostClassifier(verbose=0)

print("Running the cross-validation (this might take some time)...")
y_pred_xgboost = cross_val_predict(classifier_xgboost, X, y, cv=5)
y_pred_catboost = cross_val_predict(classifier_catboost, X, y, cv=5)

print("F1 score for XGBoost: %.4f" % f1_score(y, y_pred_xgboost, average='weighted'))
print("F1 score for CatBoost: %.4f" % f1_score(y, y_pred_catboost, average='weighted'))

Let's take a look at the confusion matrix of the better-performing of the two new algorithms.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred_xgboost,
    display_labels=class_labels,
    normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for XGBoost")

The results are better, but not by a lot. This is a common trend: problems that are difficult for one ML algorithm are often difficult for most algorithms designed to deal with the same type of problem (classification, regression, ...). Let's now try to see what happens with a simpler classification problem.

Dataset 24, "mushroom", asks us to classify mushrooms as poisonous or edible. As you will see, it's pretty balanced, with around 4,000 samples for each class.

In [None]:
# load another data set
dataset = openml.datasets.get_dataset(24)

# obtain the data set as a dataframe
df, *_ = dataset.get_data()

# preprocessing to obtain just matrices of values
target_feature = dataset.default_target_attribute
other_features = [c for c in df.columns if c != target_feature]

# convert categorical features to numerical values
categorical_columns = df[other_features].select_dtypes(include=['category', 'object', 'string'])
print("Categorical columns found:", categorical_columns.columns)
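for c in categorical_columns :
  # assign the result back to the column instead of using inplace=True on df[c],
  # which in recent versions of pandas may act on a temporary copy and leave df unchanged
  df[c] = df[c].replace({category : index for index, category in enumerate(df[c].astype('category').cat.categories)})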

class_labels = df[target_feature].unique()
dictionary = {class_label : i for i, class_label in enumerate(class_labels)} # creates a dictionary to go from label name to an integer
for class_label in class_labels :
  print("For class label \"%s\", there are %d samples" % (class_label, df[df[target_feature] == class_label].shape[0]))

# finally, obtain matrices of numerical values
X = df[other_features].values
y = np.array([dictionary[y_i] for y_i in df[target_feature].values])

print("Training classifiers, this might take some time...")
classifier_rf = RandomForestClassifier()
y_pred_rf = cross_val_predict(classifier_rf, X, y, cv=5)

classifier_xgboost = XGBClassifier(verbosity=0)
y_pred_xgboost = cross_val_predict(classifier_xgboost, X, y, cv=5)

print("\nAccuracy for Random Forest: %.4f" % accuracy_score(y, y_pred_rf))
print("F1 score for Random Forest: %.4f" % f1_score(y, y_pred_rf))
print("\nAccuracy for XGBoost: %.4f" % accuracy_score(y, y_pred_xgboost))
print("F1 score for XGBoost: %.4f" % f1_score(y, y_pred_xgboost))

This time, there is little difference between the two algorithms, and the values of accuracy and F1 are extremely close, as the two class labels have almost the same number of samples. Let's take a look at the confusion matrix for Random Forest. This time there are only two classes.
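The link between class balance and the agreement of the two metrics can be checked on a made-up example. The sketch below assumes a perfectly balanced toy problem in which the classifier makes the same number of mistakes in each class; the arrays are hypothetical and are not taken from the mushroom data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# hypothetical balanced problem: 50 samples of class 0 and 50 samples of class 1
y_true_balanced = np.array([0] * 50 + [1] * 50)
# hypothetical predictions with 5 mistakes in each class
y_pred_balanced = np.array([0] * 45 + [1] * 5 + [0] * 5 + [1] * 45)

toy_balanced_accuracy = accuracy_score(y_true_balanced, y_pred_balanced)
toy_balanced_f1 = f1_score(y_true_balanced, y_pred_balanced)

print("Accuracy on the toy problem: %.4f" % toy_balanced_accuracy)
print("F1 score on the toy problem: %.4f" % toy_balanced_f1)
```

In this symmetric setting, precision and recall are both equal to the fraction of correctly classified samples, so F1 and accuracy coincide.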

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred_rf,
    display_labels=class_labels,
    #normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for Random Forest")

This problem was clearly simpler, and all algorithms performed well. Now, let's try with a heavily imbalanced dataset. We will take it from imbalanced-learn (imported as `imblearn`), a library specifically designed to explore and deal with imbalanced problems.

In [None]:
# import the imbalanced dataset
from imblearn.datasets import fetch_datasets
ecoli = fetch_datasets()['ecoli']

# how imbalanced is this, exactly? let's count
import numpy as np
class_labels, counts = np.unique(ecoli["target"], return_counts=True)
for i in range(0, class_labels.shape[0]) :
  print("For class label \"%d\", found %d samples" % (class_labels[i], counts[i]))

# preprocessing to get our nice matrices of numbers
dictionary = {-1 : 0, 1 : 1}

X = ecoli["data"]
y = np.array([dictionary[l] for l in ecoli["target"]])

# a first classification in cross-validation using Random Forest with default settings,
# so that we have a baseline to compare against later
classifier_rf = RandomForestClassifier()
y_pred_rf = cross_val_predict(classifier_rf, X, y, cv=5)

print("Accuracy for Random Forest: %.4f" % accuracy_score(y, y_pred_rf))
print("F1 score for Random Forest: %.4f" % f1_score(y, y_pred_rf))

This time we can observe a huge difference between accuracy and F1! Before we take a look at the confusion matrix, try to guess: what do you expect to see here?

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred_rf,
    display_labels=class_labels,
    #normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for Random Forest")

The algorithm tends to put most samples into class "-1", which is the more numerous (301 samples vs 35). Let's try to improve Random Forest's behavior by using an option that takes into account the class imbalance during training, assigning a different weight to samples from each class. The weights are used to change the function that Random Forest internally optimizes, so that misclassifying a sample with a higher weight is considered much worse than misclassifying a sample with a lower weight. The weights are automatically computed to be inversely proportional to class frequencies, with less frequent classes given more importance (trying to compensate for the fact that they have fewer samples).
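To get a feeling for the size of these weights, the sketch below uses scikit-learn's `compute_class_weight` helper on a made-up label vector with the same class sizes (301 and 35). According to the scikit-learn documentation, the "balanced" heuristic is `n_samples / (n_classes * np.bincount(y))`; the array `y_imbalanced` is hypothetical, and during cross-validation the weights would be computed on each training split, so the actual values will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical labels with the same class sizes as the data set: 301 vs 35
y_imbalanced = np.array([0] * 301 + [1] * 35)
toy_classes = np.unique(y_imbalanced)

# weights computed by scikit-learn with the "balanced" heuristic
balanced_weights = compute_class_weight(class_weight="balanced", classes=toy_classes, y=y_imbalanced)

# the same heuristic written out by hand
manual_weights = y_imbalanced.shape[0] / (toy_classes.shape[0] * np.bincount(y_imbalanced))

for class_label, weight in zip(toy_classes, balanced_weights):
    print("Class %d gets weight %.3f" % (class_label, weight))
```

With these numbers, a mistake on a sample of the rare class counts roughly 8.6 times as much as a mistake on a sample of the frequent class.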

In the plot for the confusion matrix below, you can comment/uncomment the line
```
#normalize='true',
```
to display the values in the matrix as ratios rather than number of samples.

In [None]:
classifier_rf_balanced = RandomForestClassifier(class_weight="balanced")
y_pred_rf_balanced = cross_val_predict(classifier_rf_balanced, X, y, cv=5)

print("Accuracy for Random Forest (balanced class weights): %.4f" % accuracy_score(y, y_pred_rf_balanced))
print("F1 score for Random Forest (balanced class weights): %.4f" % f1_score(y, y_pred_rf_balanced))

fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred_rf_balanced,
    display_labels=class_labels,
    #normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for Random Forest (balanced class weights)")

The results are only marginally better! What about XGBoost? XGBoost has a similar option to take into account class imbalance, called 'scale_pos_weight'. The details of how it works are not very important: it's a different way of giving more importance to samples from the less frequent class label.
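For reference, a value commonly suggested for 'scale_pos_weight' in the XGBoost documentation is the number of negative samples divided by the number of positive samples. The sketch below computes that ratio on a made-up label vector with the same class sizes as this data set; `y_binary` is hypothetical, and class 1 is assumed to be the rare, positive class.

```python
import numpy as np

# hypothetical binary labels: 301 negative samples (0) and 35 positive samples (1)
y_binary = np.array([0] * 301 + [1] * 35)

n_negative = np.sum(y_binary == 0)
n_positive = np.sum(y_binary == 1)

# ratio of negative to positive samples, usable as a value for scale_pos_weight
imbalance_ratio = n_negative / n_positive
print("Suggested scale_pos_weight: %.2f" % imbalance_ratio)
```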

In [None]:
classifier_xgboost_balanced = XGBClassifier(scale_pos_weight=counts[0]/counts[1], verbosity=0)
y_pred_xgboost_balanced = cross_val_predict(classifier_xgboost_balanced, X, y, cv=5)

print("Accuracy for XGBoost: %.4f" % accuracy_score(y, y_pred_xgboost_balanced))
print("F1 score for XGBoost: %.4f" % f1_score(y, y_pred_xgboost_balanced))

fig = plt.figure()
ax = fig.add_subplot(111)

cmd = ConfusionMatrixDisplay.from_predictions(
    y_true=y,
    y_pred=y_pred_xgboost_balanced,
    display_labels=class_labels,
    #normalize='true',
    ax=ax,
)

ax.set_title("Confusion matrix for XGBoost")

A bit better, but it's still not great: if we have a sample of class '1', there is only about a 60% probability that it will be correctly classified. This imbalanced data set is a difficult classification problem. In practical cases, it would be best to try to collect data with balanced class labels.