# **Lab 3.1: Machine Learning (Classification)**

<hr>

## **1. Introduction**
Once we have analyzed, cleaned, and visualized our dataset, we can move on to the learning phase. Before we begin, it is necessary to identify the type of problem we are facing in order to select the most suitable methods or models.

The following diagram illustrates the most common types of problems in the field of machine learning:

<center><img src="ML_Diagram.png" alt="diagram" width="1000"/></center>

As you already know, problems that require **machine learning** to be solved are those where we do not know the *formula* that allows us to transform the input into the output. These problems are mainly divided into two types: **supervised** and **unsupervised**.

In this course, we will focus on supervised learning problems, where we aim to predict either one or more classes (**classification**) or one or more numerical values (**regression**).

Remember that in order to solve supervised learning problems, we always need **labeled data**, meaning data where we already know the expected output or correct label for a given input. These labeled examples will be used by the model to try to learn that *unknown formula* during training.

We will begin by exploring **classification problems**, their characteristics, evaluation metrics, techniques, and how they are used to make predictions.

### **Objective**
In this practice, you will learn how to solve classification problems using different models and how to evaluate their performance.

<hr>

## **2. Problem Definition**

To begin, we will attempt to create a model capable of solving a **binary classification** problem.

In this case, we need to **create a model that, given the time (in seconds) of the 3 sectors of an *Aston Martin* driver, predicts whether the time was set by *Alonso* or not (*Stroll*)**.

We will reload our data and generate the necessary dataset to solve the problem.

In [None]:
import pandas as pd
data = pd.read_pickle('https://raw.githubusercontent.com/AIC-Uniovi/Sistemas-Inteligentes/refs/heads/main/datasets/f1_23_monaco.pkl')

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Create the variable <code>data_aston</code> with a DataFrame that contains only the data of that team and the columns "Sector1Time", "Sector2Time", "Sector3Time", and "Driver". Transform the sector columns from timedelta to seconds using <code>.dt.total_seconds()</code>.
</div>

In [None]:
# Your code here

We will modify the dataset to transform it into a binary classification problem. As you may recall, in this type of problem, the model predicts **a single value** that indicates the probability between zero ($0\%$) and one ($100\%$) that the given input belongs to the **positive class**.

In our case, the **positive class** will be *"Alonso"*. Therefore, if the model predicts a $1$, it indicates that the given sector times belong to Alonso with $100\%$ probability.

If the model's prediction is below $0.5$, the model considers it more likely that the lap was not set by Alonso, so we assign it to Stroll.
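
As a minimal sketch of this decision rule, the snippet below turns a few made-up probabilities into class labels. The values in `probabilities` are invented for illustration, and assigning exactly $0.5$ to the positive class is an assumption of this sketch.

```python
import numpy as np

# Hypothetical model outputs: probability that each lap belongs to Alonso
probabilities = np.array([0.92, 0.10, 0.50, 0.49, 0.73])

# Threshold at 0.5: 1 means "Alonso", 0 means "Stroll"
# (assumption: a probability of exactly 0.5 is assigned to the positive class)
predicted_class = (probabilities >= 0.5).astype(int)

print(predicted_class)
```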

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Create the column <code>Class</code> within the DataFrame <code>data_aston</code> so that it equals zero whenever the driver is not Alonso and 1 otherwise.
</div>

In [None]:
# Your code here

<hr>

## **3. Baseline**

Once we have a labeled dataset, where each entry (Sector1Time, Sector2Time, and Sector3Time) is associated with an expected output (Class), we can proceed to create a model to solve the binary classification problem.

As explained in theory class, there are simple models that can perform reasonably well without the need for more complex solutions. These models are called *baselines* and serve as a reference or lower bound. If a much more complex model fails to beat a baseline, something is going wrong.

In classification problems, there are three main baselines:

* **Random:** Predicts a 0 or 1 randomly without considering the input variables.
* **Zero-R:** Predicts the majority class, that is, it analyzes the "Class" column of the dataset and, if the most frequent class is 1, it always predicts 1 for any input. As you can see, it doesn't use the input variables at all.
* **One-R:** This model selects **one** input variable, the one that offers the best classification possible by itself. It is designed for categorical input variables.

<div class="alert alert-block alert-warning">
    <strong>NOTE:</strong> We will use <i>Random</i> and <i>Zero-R</i> as baselines for our problem. <i>One-R</i> is not directly applicable, as our input variables are numeric and would first need to be discretized.
</div>

### **Random Baseline**
We will create a model that generates random predictions (0 or 1) regardless of the input variables.

The implementation is very simple: we just need to generate a random list of zeros and ones. Then, we will count how many times the model has made the correct prediction.

In [None]:
import random

# Set a seed so that the same random values are generated every time
seed = 2533
random.seed(seed)

# Create a list with as many values (0 or 1) as rows in our dataset
random_values = [random.choice([0, 1]) for _ in range(len(data_aston))]

<div class="alert alert-block alert-info">
    <b>Exercise:</b> How many times has the model been correct (in percentage)? To calculate this, you need to count how many times the random model predicted a 1 and it was actually a 1, and how many times it predicted a 0 and it was actually a 0.
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-success">
    The value you just obtained is one of the most popular <strong>classification</strong> <strong>metrics</strong>, called <strong>accuracy</strong>. It is simply the number of correct predictions made by the model divided by the total number of predictions.
</div>
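
To make the definition concrete, here is a small sketch that computes accuracy by hand. The two lists are invented toy values and are unrelated to the lap data.

```python
# Toy labels, made up for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# A prediction is correct when it matches the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy = correct predictions / total predictions
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.0%}")
```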

### **Zero-R**
Next, we will obtain the result for the *Zero-R* baseline.

<div class="alert alert-block alert-info">
    <b>Exercise:</b> What is the majority class of the dataset? Considering this, what will be the <strong>accuracy</strong> of <i>Zero-R</i>?
</div>

In [None]:
# Your code here

## **4. Model Evaluation**

At this point, we **already have two models** capable of solving our binary classification problem: the **Random** model and the **Zero-R** model. Additionally, we know that these models have an accuracy on **resubstitution** of approximately $35\%$ and $60\%$, respectively.

<div class="alert alert-block alert-warning">
    <strong>Resubstitution:</strong> The process by which a model is <u>trained and evaluated using the same dataset</u>. This procedure can provide an initial estimate of model performance, though <strong>it does not reflect its generalization ability</strong>.
</div>

It is important to remember that when we create a machine learning model, we aim for it to not only perform well with known data but **to be able to make predictions on new, unseen data**.

With this in mind, we can see that the evaluation we just conducted on our models is not entirely appropriate. What we've measured is how well they predict examples they already know, but not their ability to make predictions on future data, which is what really matters.

### **Strategies**
To address this issue, rather than evaluating the model on the same data it learned from, we will evaluate it on a part of the dataset that we will have previously separated (test set).

This subset is usually created by randomly selecting a percentage (typically $20\%$) of examples from the original dataset and will be used solely to evaluate the model's performance, not for training. By doing this, we *"simulate"* future unseen cases: if the model has good accuracy on this set, we have reasonable evidence that it will also perform well on new data.

This strategy is called **Simple Validation** (also known as *hold-out* validation), but there are many other strategies:

- **Cross-validation**: Involves splitting the data into several subsets (*folds*), training the model on some of them, and evaluating it on the remaining ones, repeating the process several times. This helps provide a more stable estimate of model performance.
- **Validation with validation set**: A third subset, distinct from the test set, is used to adjust the model's hyperparameters and prevent overfitting. This is very typical when using neural networks.
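
As an illustration of cross-validation, the sketch below uses `cross_val_score` from *scikit-learn* with 5 folds. The data comes from `make_classification` and is purely synthetic, and the choice of `LogisticRegression` as the model is only an assumption to have something to evaluate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary dataset with 3 numeric inputs (not the lap data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# 5-fold cross-validation: train on 4 folds, evaluate on the remaining one, 5 times
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate of performance
```

Averaging over folds makes the estimate less dependent on one particular random split, at the cost of training the model several times.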

To get a more realistic evaluation of our models' performance, we will apply **simple validation** and recheck their accuracy, but this time on the test set.

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Split the <code>data_aston</code> dataset into training and test sets (80% and 20%) and store the subsets in <code>data_aston_train</code> and <code>data_aston_test</code>. Use the <code><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">train_test_split()</a></code> function from the <i>scikit-learn</i> library, which you need to install in your conda environment first. 
    <hr>
    <strong>Set the function's seed so that it always performs the same split (random_state=2533)</strong>
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-info">
    <b>Exercise:</b> What is the accuracy of the Random model now, over the test set? Remember to set the seed again.
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-info">
    <b>Exercise:</b> What is the accuracy of the Zero-R model now, over the test set? 
    <hr>
    Remember that you need to recalculate the majority class using the new training set (this is the only data your model should know) and use it to evaluate the model on the test set.
</div>

In [None]:
# Your code here

As you have seen, the models have obtained values very close to $50\%$ accuracy on the test set, which is far from ideal.

<div class="alert alert-block alert-warning">
    <strong>Note:</strong> In a balanced binary classification problem, the least useful model has an accuracy of about 50%, not 0%. A model with 0% accuracy gets every prediction wrong, so simply inverting its predictions would yield a perfect classifier.
</div>
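
The following sketch illustrates this idea with invented labels: a model that is always wrong becomes always right once its predictions are flipped.

```python
import numpy as np

# Toy labels, made up for illustration
y_true = np.array([1, 0, 1, 0, 1, 1])

# A hypothetical model that gets every single example wrong
y_pred = 1 - y_true
acc_original = (y_pred == y_true).mean()

# Inverting its predictions (0 <-> 1) turns every mistake into a hit
y_pred_inverted = 1 - y_pred
acc_inverted = (y_pred_inverted == y_true).mean()

print(acc_original, acc_inverted)
```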

<hr>

## **5. Scikit-Learn**

The `scikit-learn` library we used earlier contains a wide range of models and tools that will be useful for solving machine learning problems.

For example, the baselines we just implemented are already incorporated in the `DummyClassifier()` class.

This class has the `strategy` parameter, which allows us to select the desired baseline and can take, among others, the following values:
- **uniform**: Equivalent to our *Random* baseline.
- **most_frequent**: Equivalent to *Zero-R* baseline.

In the following code, we can see how to use this class:

In [None]:
from sklearn.dummy import DummyClassifier

# Create the models
baseline_random = DummyClassifier(strategy = 'uniform', random_state = seed)
baseline_zeror = DummyClassifier(strategy = 'most_frequent')

Once the models are created, the next step is to train them, and for that, we need to provide the training data using the `fit()` method, specifically the $X$ and $Y$.

<div class="alert alert-block alert-warning">
    <strong>Note:</strong> In supervised learning, we refer to X as the independent variables or <i>inputs</i> and Y as the dependent variables or <i>outputs</i>. 
</div>

As you know, the goal of the model is to *learn* the relationship between $X$ and $Y$ in order to predict $Y$ from new, unknown $X$ values.

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Create the variables X and Y to train the models. Remember that we have split the dataset into two parts.
</div>

In [None]:
# Your code here

# Train the models
baseline_random.fit(X, Y)
baseline_zeror.fit(X, Y)

Once the model is trained, we can make predictions to evaluate its performance with data that was not seen during training. In `scikit-learn`, this is done using the `predict()` method.

<div class="alert alert-block alert-warning">
    <strong>Note:</strong> This method only takes X (not Y) and returns the predicted Y values.
</div>

In [None]:
X_test = data_aston_test[['Sector1Time', 'Sector2Time', 'Sector3Time']]
Y_test = data_aston_test[['Class']]

# Make predictions
pred_random = baseline_random.predict(X_test)
pred_zeror = baseline_zeror.predict(X_test)

# Print results
print(pred_random)
print(pred_zeror)

Another advantage of this library is that it includes many metrics within the `metrics` package, so we can easily obtain the accuracy.

In [None]:
from sklearn import metrics

print(metrics.accuracy_score(Y_test, pred_random))
print(metrics.accuracy_score(Y_test, pred_zeror))

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Obtain the confusion matrices for the models using <a href="https://scikit-learn.org/stable/api/sklearn.metrics.html">the method implemented in the library</a>. Also, obtain the <i>Precision, Recall, and F1</i> scores using the methods or formulas.
</div>

<div style="width:800px;background:white;padding:10px">
    <img src="https://i.imgur.com/7WwY9bZ.jpeg" style="margin-bottom:10px" />
</div>
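
As a reference, the sketch below derives the three scores from the cells of a confusion matrix and compares them with the library functions. The labels are toy values invented for this example, and the formulas in the comments are the standard definitions.

```python
from sklearn import metrics

# Toy labels, made up for illustration (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary problems, ravel() returns the cells in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(precision, metrics.precision_score(y_true, y_pred))
print(recall, metrics.recall_score(y_true, y_pred))
print(f1, metrics.f1_score(y_true, y_pred))
```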

In [None]:
# Your code here

<hr>

## **6. Other Models**

As you have seen in the theory class, in addition to the baselines we have used, there are many other models for solving classification problems. Some of them are:

* **Logistic Regression:** A "linear" model that uses a logistic function (sigmoid) to predict the probability of belonging to a class.
* **K-Nearest Neighbors:** A classifier that assigns an instance to the most frequent class among its k nearest neighbors.
* **Decision Trees:** A classification model that creates a tree allowing decisions based on the data features (inputs).
* **SVM:** A classification algorithm that seeks the optimal hyperplane to maximize the margin between classes.
* **Neural Networks:** We will explore them in detail in the upcoming topics.

Just like with the baselines, `scikit-learn` provides us with most of these models already implemented.

<div class="alert alert-block alert-warning">
    <strong>Note:</strong> Before using other models, it is necessary to carry out a phase of <u>data preprocessing</u> that we had pending: <strong>normalization or standardization</strong> of the data.
</div>

As you may remember, this phase ensures that all the <u>inputs</u> to our model (sector times in our case) are in the same range (normalization) or have the same mean and standard deviation (standardization, typically mean 0 and standard deviation 1) to facilitate learning.

We haven't done this so far because the previous baselines did not use any input information; one predicted randomly, and the other predicted the most frequent class.
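
To show what standardization computes, here is a hand-written sketch with NumPy on invented numbers. The arrays `toy_train` and `toy_test` are hypothetical, and the key point is that the mean and standard deviation come from the training data only.

```python
import numpy as np

# Toy inputs, made up for illustration (rows = examples, columns = features)
toy_train = np.array([[70.0, 30.0], [72.0, 34.0], [74.0, 38.0]])
toy_test = np.array([[73.0, 36.0]])

# Statistics are computed on the training data only
train_mean = toy_train.mean(axis=0)
train_sd = toy_train.std(axis=0)

# z = (x - mean) / std, reusing the training statistics for the test data
toy_train_std = (toy_train - train_mean) / train_sd
toy_test_std = (toy_test - train_mean) / train_sd

print(toy_train_std.mean(axis=0), toy_train_std.std(axis=0))
print(toy_test_std)
```

After the transformation the training columns have mean 0 and standard deviation 1, while the test data generally will not, because it was scaled with statistics it did not contribute to.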

In this case, we will perform standardization (normalization would also be valid). Let's first check the mean and standard deviation of the inputs.

In [None]:
# Mean and standard deviation of the inputs in our model
X.describe().loc[["mean", "std"]]

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"><code>StandardScaler()</code></a> from <i>sklearn</i> to standardize the training and test X data. Store the new data in <code>X_std</code> and <code>X_test_std</code>. Remember that the test data is unknown to us, so it should not influence the calculation of the mean and standard deviation.
</div>

In [None]:
from sklearn.preprocessing import StandardScaler
standardizer = StandardScaler()

# Your code here

# Verify with: 
print(X_std.mean(axis = 0))
print(X_std.std(axis = 0))

### **6.1. Logistic Regression**

We are now ready to train and evaluate new models <u>aiming to improve the baselines' metrics</u>. We will start with **Logistic Regression**.

To use it, we simply need to create an object of the `LogisticRegression()` class, train it with `fit()`, obtain its predictions with `predict()`, and evaluate them with a metric such as accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create, train, and evaluate the model
model_log = LogisticRegression()
model_log.fit(X_std, Y.squeeze())  # The squeeze removes unnecessary dimensions
pred_log = model_log.predict(X_test_std)
print(metrics.accuracy_score(Y_test, pred_log))

As you can see, this model achieves better results in terms of accuracy, but not much better.
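
Since logistic regression predicts probabilities, it can be useful to look at them directly. The sketch below does so with `predict_proba()` on synthetic data from `make_classification`; the variable names are hypothetical and the data is unrelated to the laps.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset with 3 numeric inputs (not the lap data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# One row per example, one column per class (ordered as in toy_model.classes_)
proba = toy_model.predict_proba(X_toy[:5])

print(toy_model.classes_)
print(proba)                        # each row sums to 1
print(toy_model.predict(X_toy[:5])) # the class with the higher probability
```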

### **6.2. K-Nearest Neighbors**

The next model we can try is the `KNeighborsClassifier()`, which is used exactly the same way as the others.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create, train, and evaluate the model
model_knn = KNeighborsClassifier(n_neighbors = 3)
model_knn.fit(X_std, Y.squeeze())
pred_knn = model_knn.predict(X_test_std)
print(metrics.accuracy_score(Y_test, pred_knn))

The results in terms of accuracy with this model are very close to 100%, making it the best among those evaluated.
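
To see what happens inside this model, the following is a simplified sketch of the k-nearest neighbors idea written with NumPy. The function `knn_predict` and the toy points are hypothetical, it uses Euclidean distance, and it is not how `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Most frequent class among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two small, well-separated groups of toy points
X_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_points = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_points, y_points, np.array([0.15, 0.15]))
pred_b = knn_predict(X_points, y_points, np.array([0.95, 1.0]))
print(pred_a, pred_b)
```

Because the prediction depends on distances, inputs measured on very different scales would dominate the vote, which is one reason the data is standardized first.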

### **6.3. Decision Trees**

Next, we will analyze the performance of a decision tree in solving this task. The class within the library is called `DecisionTreeClassifier()` and it has numerous hyperparameters, with one of the most notable being `max_depth`, which allows setting the maximum depth of the tree and thus controlling **overfitting**.
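
As a rough illustration of how `max_depth` relates to overfitting, the sketch below compares a shallow tree with an unrestricted one on noisy synthetic data. The dataset, the 10% label noise (`flip_y`) and the variable names are assumptions of this example, not part of the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that memorizing it does not pay off
X_toy, y_toy = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

results = {}
for depth in (2, None):  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])  # (train accuracy, test accuracy)
```

A large gap between training and test accuracy for the unrestricted tree would be a sign that it has memorized the training examples.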

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create, train, and evaluate the model
model_tree = DecisionTreeClassifier(random_state = seed, max_depth = 2)
model_tree.fit(X_std, Y.squeeze())
pred_tree = model_tree.predict(X_test_std)
print(metrics.accuracy_score(Y_test, pred_tree))

The performance is similar to the previous model, but this one has the advantage of allowing us to see its internal workings, i.e., the tree it has created. To do this, we will need to install a new library (note that `pydotplus` also relies on the Graphviz software being installed on your system).

In [None]:
! pip install pydotplus

In [None]:
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

# Export the resulting tree to DOT format
dot_data = export_graphviz(
    decision_tree = model_tree,
    feature_names = list(X.columns),  # plain list of column names
    class_names = ['Stroll', 'Alonso'],  # order matches classes 0 and 1
    filled = True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
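
If installing extra software is not an option, a text version of the tree may be enough. The sketch below uses `export_text` from `sklearn.tree`, which needs no additional dependencies. The four toy rows are invented, and only the column names are taken from our dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny invented dataset with the same three input names as our problem
X_small = np.array([[70.0, 30.0, 20.0], [71.0, 31.0, 21.0],
                    [75.0, 35.0, 25.0], [76.0, 36.0, 26.0]])
y_small = np.array([1, 1, 0, 0])

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_small, y_small)

# Print the learned rules as indented text
rules = export_text(
    small_tree, feature_names=['Sector1Time', 'Sector2Time', 'Sector3Time']
)
print(rules)
```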

### **6.4. Support Vector Machines (SVM)**

The last model we are going to try is the SVM, implemented in the `SVC()` class. We will start with a **linear kernel**.

In [None]:
from sklearn.svm import SVC

# Create an SVM with a linear kernel, train, and evaluate
model_svm = SVC(kernel = 'linear')
model_svm.fit(X_std, Y.squeeze())
pred_svm = model_svm.predict(X_test_std)
print(metrics.accuracy_score(Y_test, pred_svm))

The accuracy is not very high, so it's possible that the classes are not **linearly separable**.

To improve performance, we will now try a **polynomial kernel** of degree 2.

In [None]:
# Create an SVM with a polynomial kernel of degree 2 and independent term 1. Train and evaluate
model_svm_p = SVC(kernel = 'poly', degree = 2, coef0 = 1)
model_svm_p.fit(X_std, Y.squeeze())
pred_svm_p = model_svm_p.predict(X_test_std)
print(metrics.accuracy_score(Y_test, pred_svm_p))

As you can see, the non-linearity of this model allows it to solve the problem with an accuracy identical to that of the best models.
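
The effect of the kernel is easier to see on data that is clearly not linearly separable. The sketch below compares both kernels on `make_circles`, a synthetic dataset of two concentric rings that is unrelated to the lap data. The parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes
X_toy, y_toy = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# score() returns the accuracy on the given data
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
poly_acc = SVC(kernel="poly", degree=2, coef0=1).fit(X_tr, y_tr).score(X_te, y_te)

print("Linear kernel:", linear_acc)
print("Polynomial kernel (degree 2):", poly_acc)
```

A degree-2 polynomial can represent a circular boundary, which is why it is expected to do much better than the linear kernel here.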

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Complete the function <code>train_and_eval()</code> for the remaining models, including both versions of the SVM. It uses the function <code>train_and_eval_model()</code> implemented below.
</div>

In [None]:
def train_and_eval_model(model_name, model, X_std, Y, X_test_std, Y_test):

    # Train the model
    model.fit(X_std, Y.squeeze())

    # Predictions
    Y_train_pred = model.predict(X_std)
    Y_test_pred = model.predict(X_test_std)

    # Calculate metrics for training data
    tr_accuracy = metrics.accuracy_score(Y, Y_train_pred)
    tr_precision = metrics.precision_score(Y, Y_train_pred, zero_division = 0)
    tr_recall = metrics.recall_score(Y, Y_train_pred)
    tr_f1 = metrics.f1_score(Y, Y_train_pred)
    
    # Calculate metrics for test data
    tst_accuracy = metrics.accuracy_score(Y_test, Y_test_pred)
    tst_precision = metrics.precision_score(Y_test, Y_test_pred, zero_division = 0)
    tst_recall = metrics.recall_score(Y_test, Y_test_pred)
    tst_f1 = metrics.f1_score(Y_test, Y_test_pred)
    
    return (model_name, tr_accuracy, tr_precision, tr_recall, tr_f1, tst_accuracy, tst_precision, tst_recall, tst_f1)

In [None]:
def train_and_eval(X_std, Y, X_test_std, Y_test):

    # Create a list to store the results of each model
    all_results = []
    
    # Random baseline
    baseline_aleatorio = DummyClassifier(strategy = 'uniform', random_state = seed)
    model_results = train_and_eval_model('Random', baseline_aleatorio, X_std, Y, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Your code here

    # Print the resulting dataframe
    multi_index = pd.MultiIndex.from_tuples([ ('Model', 'Name'), ('Train', 'Accuracy'), ('Train', 'Precision'), ('Train', 'Recall'), ('Train', 'F1'), ('Test', 'Accuracy'), ('Test', 'Precision'), ('Test', 'Recall'), ('Test', 'F1') ])    
    all_results = pd.DataFrame(all_results, columns = multi_index)
    display(all_results)

train_and_eval(X_std, Y, X_test_std, Y_test)

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Which model do you think is the best? Why?
</div>

In [None]:
# Your response here

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Create a model capable of predicting, based on the 3 sectors of a driver, whether the tire used for the lap is for wet conditions ("INTERMEDIATE" or "WET") or not. Perform all necessary preprocessing and try all the classification models we have seen so far using the previous function.
</div>

In [None]:
# Your code here