# Data Loading, Clustering, and Machine Learning Predictions

## Class Summary

In this class, we will explore a publicly available dataset and apply various techniques to analyze and model the data. The session will be structured as follows:

1. **Data Loading and Inspection**: We will begin by loading the dataset and performing an initial inspection to understand its structure, features, and any potential issues that may need to be addressed before analysis.
  
2. **Clustering**: We will then dive into clustering techniques, focusing on how to effectively perform clustering on selected data features. This step is instrumental for data characterization and analysis, as it allows us to group similar data together and identify patterns in the dataset.

3. **Machine Learning Models**: Finally, we will apply various Machine Learning (ML) models to the data, focusing on both regression and classification tasks. We will experiment with different algorithms and evaluate their performance.

Throughout the class, we will highlight best practices for data analysis, clustering, and predictive modeling, with a focus on how to handle data effectively for such applications.

## Preliminaries

### Import Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### PM100 Dataset

PM100 is a dataset available at https://zenodo.org/records/10127767. It contains data on more than **230K jobs executed on the Marconi100 supercomputer**. A job is a **user-submitted request** to execute a specific computational task on an HPC system. These jobs are managed by a scheduler, which allocates the necessary resources (such as CPU cores, memory, and time) based on availability and predefined policies. This ensures efficient and fair utilization of the HPC infrastructure.

Each entry contains information on a job execution (i.e., a computational task executed on a supercomputer), such as **power consumption, duration, user**, etc.

In [None]:
# Download dataset
!wget "https://zenodo.org/records/10127767/files/job_table.parquet?download=1" -O job_table.parquet

## Understanding Parquet Files and Loading Them with Pandas

### What is a Parquet File?

A **Parquet file** is a columnar storage file format optimized for efficiency in big data processing. It is designed to handle large datasets efficiently and is commonly used in data analytics and machine learning workflows.

#### Key Features of Parquet:
- **Columnar Storage:** Data is stored column-wise rather than row-wise, improving compression and query performance.
- **Efficient Compression:** Parquet uses efficient encoding and compression techniques (e.g., Snappy, Gzip, Brotli) to reduce storage size.
- **Schema Evolution:** It supports schema evolution, allowing for flexible data modifications.
- **Optimized for Analytics:** Queries that select specific columns can run faster because only the required columns are read from disk.

### How to Load a Parquet File with Pandas

Pandas provides built-in support for reading and writing Parquet files through the `pyarrow` or `fastparquet` libraries.

#### 1. Installing Dependencies
To work with Parquet files in Pandas, install one of the required backends:

```
pip install pandas pyarrow  # or pip install pandas fastparquet
```

#### 2. Pandas API

```
import pandas as pd
df = pd.read_parquet(PATH_TO_PARQUET_FILE)
```

In [None]:
# Load Dataset in a DataFrame format
df = pd.read_parquet('job_table.parquet')
df

In [None]:
# Display info on the data columns
df.info()

In [None]:
# Generate statistics on the data columns
df.describe()

## Data Pre-processing

Here, we will perform some data pre-processing on the original data, which is instrumental for the following steps.

### Job Power Consumption

Each job entry contains information on the power consumption of the job execution. The feature is a time series of the power consumption, sampled every 20 seconds.
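
As a minimal sketch of what the 20-second sampling implies, the snippet below estimates the monitored duration and the energy of a single job from its power trace. The trace values are made up, and treating them as watts is an assumption, since the unit is not stated here.

```python
import numpy as np

SAMPLING_PERIOD_S = 20  # sampling interval of the power time series

# Hypothetical power trace of one job (values assumed to be in watts)
power_trace = np.array([400.0, 420.0, 410.0, 390.0, 380.0])

# Each sample covers one sampling period
monitored_seconds = len(power_trace) * SAMPLING_PERIOD_S

# Rectangle-rule approximation of the energy: sum of power * dt
energy_joules = power_trace.sum() * SAMPLING_PERIOD_S

print(monitored_seconds)
print(energy_joules)
```

The same arithmetic could be applied to each element of `df["node_power_consumption"]` with `apply`, under the same assumption on the unit.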

In [None]:
df["node_power_consumption"]

We have to check that none of the time series is empty, and drop the jobs for which it is.

In [None]:
# Sanity check
df_empty = df[df["node_power_consumption"].apply(lambda pc: len(pc) == 0)]
print(df_empty)
if len(df_empty):
  df = df[df["node_power_consumption"].apply(lambda pc: len(pc) != 0)]

#### Visualization

In [None]:
# Plot 5 random jobs' power consumption
df_plot = df.sample(n = 5)
df_plot["x"] = df_plot["node_power_consumption"].apply(lambda x: list(range(len(x))))
for i in range(len(df_plot)):
    plt.plot(df_plot["x"].iloc[i], df_plot["node_power_consumption"].iloc[i], label = f"Job {i}")
plt.legend()
plt.xlabel("Steps")
plt.ylabel("Power Consumption")
plt.title("Power Consumption of 5 Random Jobs")
plt.show()

#### Normalization

For our purposes, we only need the average power consumption of each job.

In [None]:
# Creation of an additional feature
df["average_power_consumption"] = df["node_power_consumption"].apply(lambda x: np.mean(x))
df["average_power_consumption"]

In [None]:
sns.histplot(data = df["average_power_consumption"].values, bins = 50, kde = False)
plt.title("Average Power Consumption Distribution")
plt.ylabel("# of Jobs")
plt.xlabel("Average Power Consumption")
# plt.yscale("log")
plt.show()

Since the values span a **very large range,** we can **normalize the average power consumption** to bring the data within a manageable scale. Additionally, since jobs run on a varying number of nodes, we can better characterize each job by its **power consumption on a single node**. Therefore, we **divide the average power consumption by the number of nodes allocated to each job**, obtaining a more consistent and comparable measure across jobs.


In [None]:
# Normalization
df["norm_average_power_consumption"] = df["average_power_consumption"] / df["num_nodes_alloc"]
df["norm_average_power_consumption"]

In [None]:
sns.histplot(data = df["norm_average_power_consumption"].values, bins = 50, kde = False)
plt.title("Average Power Consumption (Per Node) Distribution")
plt.ylabel("# of Jobs")
plt.xlabel("Average Power Consumption Per Node")
plt.yscale("log")
plt.show()

## Data Clustering

### Job Characterization

To analyze and characterize the jobs based on their per-node power consumption, we can divide them into categories through clustering techniques. For our experiments we will use two algorithms: **K-Means** and **DBSCAN**.

#### DBSCAN

**DBSCAN** is a popular **density-based** clustering algorithm used in machine learning. Unlike algorithms like **K-Means**, DBSCAN does not require the number of clusters to be specified in advance. Instead, it forms clusters based on the density of points in a region, marking regions of low density as **outliers** or **noise**.

### Key Concepts:
- **Core Points**: Points that have at least a minimum number of neighboring points (`min_samples`) within a given radius (`eps`).
- **Border Points**: Points that are within the `eps` radius of a core point but do not have enough neighbors themselves to be considered core points.
- **Noise Points**: Points that are neither core points nor within the `eps` radius of any core point, and therefore do not belong to any cluster.

DBSCAN is effective for finding clusters of arbitrary shapes, making it a versatile choice for various datasets.
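
The three kinds of points can be illustrated with a small sketch on toy 1-D data. The values of `eps` and `min_samples` are picked by hand for this example and are not the ones used on the PM100 data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 1-D data: two dense groups, one straggler next to the first group,
# and one isolated point
values = np.array([1.0, 1.2, 1.4, 1.8, 5.0, 5.1, 5.2, 9.0]).reshape(-1, 1)

# Hand-picked parameters for this toy example
model = DBSCAN(eps=0.5, min_samples=3).fit(values)

# Boolean mask marking the core samples found by DBSCAN
core_mask = np.zeros(len(values), dtype=bool)
core_mask[model.core_sample_indices_] = True

kinds = []
for label, is_core in zip(model.labels_, core_mask):
    if label == -1:
        kinds.append("noise")    # not assigned to any cluster
    elif is_core:
        kinds.append("core")     # enough neighbors within eps
    else:
        kinds.append("border")   # in a cluster, but not dense enough itself

for value, label, kind in zip(values.ravel(), model.labels_, kinds):
    print(f"{value:4.1f}  cluster={label:2d}  {kind}")
```

Here the point at 1.8 has only one other point within `eps`, so it is not a core point, but it lies within `eps` of the core point at 1.4 and therefore joins that cluster as a border point. The point at 9.0 has no neighbors and is labeled `-1` (noise).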

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
dbscan = DBSCAN()
df["cluster_db"] = dbscan.fit_predict(df["norm_average_power_consumption"].values.reshape(-1, 1))

In [None]:
sns.histplot(data = df, x = "norm_average_power_consumption", bins = 50, kde = False, hue = "cluster_db")
plt.title("Average Power Consumption (Per Node) Distribution")
plt.ylabel("# of Jobs")
plt.xlabel("Average Power Consumption Per Node")
plt.yscale("log")
plt.show()

In [None]:
sns.histplot(data = df, x = "cluster_db")
plt.xlabel("Cluster")
plt.ylabel("# of Jobs")
plt.title("DBSCAN Clustering")
plt.yscale("log")
plt.show()

#### K-Means

**K-Means** is one of the most widely used **partitioning-based** clustering algorithms. It aims to divide a dataset into a predefined number of clusters (K) by assigning each data point to the nearest cluster center (centroid). The algorithm iteratively refines the centroids to minimize the **within-cluster sum of squared distances** (inertia).

### Key Concepts:
- **Centroids**: The center points of each cluster, computed as the mean of the data points assigned to that cluster.
- **Clusters**: Groups of data points that are closer to the same centroid than to any other centroid.
- **Iterations**: The algorithm repeatedly updates the centroids by moving them to the average position of the points in the cluster until convergence (no significant changes in centroid positions).

K-Means is effective when clusters are roughly spherical and evenly sized, but it requires the number of clusters (K) to be specified beforehand, which can be a limitation.

### Steps:
1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids based on the mean of the points in each cluster.
4. Repeat the assignment and centroid update steps until the centroids no longer change significantly.
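
The four steps above can be sketched in plain NumPy. This is an illustrative implementation, not the one used by scikit-learn; the function name `kmeans` and its `init` argument are hypothetical, and `init` is only there so that the toy run is reproducible.

```python
import numpy as np

def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids (unless given)
    if init is None:
        centroids = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)

    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

points = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 10.0, 10.1, 9.9]).reshape(-1, 1)
centroids, labels = kmeans(points, k=3, init=[[0.0], [4.0], [12.0]])
print(centroids.ravel())
print(labels)
```

With random initialization (`init=None`) the result depends on the seed, which is one reason library implementations usually run several initializations and keep the best one.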


##### Elbow Method

To find the optimal number of clusters (i.e., $k$) for K-Means, we can rely on the Elbow Method. This method analyzes the inertia (the sum of squared distances from each point to its nearest cluster center) to determine the $k$ after which the inertia starts decreasing at a slower rate.
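
To make the quantity concrete, here is a small sketch that computes the inertia by hand for fixed centroids on toy data. The helper `inertia` is hypothetical and only mirrors the definition given above.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()

points = np.array([1.0, 2.0, 9.0, 10.0]).reshape(-1, 1)

one_centroid = np.array([[5.5]])          # k = 1: the global mean
two_centroids = np.array([[1.5], [9.5]])  # k = 2: one centroid per group

print(inertia(points, one_centroid))   # 65.0
print(inertia(points, two_centroids))  # 1.0
```

Going from one to two centroids removes almost all of the inertia, while any further centroid can remove at most the remaining 1.0. That flattening is what the elbow corresponds to.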

In [None]:
from sklearn.cluster import KMeans
!pip3 install kneed
from kneed import KneeLocator

In [None]:
def elbow_method(data, clustering_method, k_min, k_max):
  # Store the sum of squared distances (inertia)
  inertia = []

  k_values = list(range(k_min, k_max + 1))

  # Compute K-Means clustering for each value of K
  for k in k_values:
    kmeans = clustering_method(n_clusters=k, random_state=42)
    kmeans.fit(data)
    inertia.append(kmeans.inertia_)  # Store inertia (sum of squared distances)

  kneedle = KneeLocator(k_values, inertia, curve="convex", direction="decreasing")

  # Plot the Elbow curve
  plt.figure(figsize=(8, 5))
  plt.plot(k_values, inertia, marker="o", linestyle="--")

  # Draw a vertical line at the optimal k
  # (KneeLocator.knee is None when no knee is detected)
  optimal_k = kneedle.knee
  if optimal_k is not None:
    plt.axvline(x=optimal_k, color='red', linestyle='--', label="Optimal k")
    plt.legend()
  plt.xlabel("Number of Clusters (K)")
  plt.ylabel("Inertia (Sum of Squared Distances)")
  plt.title("Elbow Method for Optimal K")
  plt.xticks(k_values)

  # Show the plot
  plt.show()
  # Return optimal k
  return optimal_k

In [None]:
opt_k = elbow_method(data = df["norm_average_power_consumption"].values.reshape(-1, 1), clustering_method = KMeans, k_min = 1, k_max = 10)

In [None]:
# Generate the clusters
kmeans = KMeans(n_clusters=opt_k, random_state=42)
df["cluster"] = kmeans.fit_predict(df["norm_average_power_consumption"].values.reshape(-1, 1))

Let's now see how K-Means splits the data.

In [None]:
sns.histplot(data = df, x = "norm_average_power_consumption", bins = 50, kde = False, hue = "cluster")
plt.title("Average Power Consumption (Per Node) Distribution")
plt.ylabel("# of Jobs")
plt.xlabel("Average Power Consumption Per Node")
plt.yscale("log")
plt.show()

In [None]:
df["cluster"] = df.cluster.astype(str)
sns.histplot(data = df, x = "cluster")
plt.xlabel("Cluster")
plt.ylabel("# of Jobs")
plt.title("K-Means Clustering")
plt.show()

#### Considerations

With **DBSCAN** we obtain too many clusters for the labels to be clear and meaningful. Conversely, the labels produced by **K-Means** are easier to interpret: they categorize the jobs into three groups ("low," "medium," and "high" power-consuming jobs), providing a clearer and more intuitive division of the data.


## ML Predictive Modelling

In this section we will see how to use several **ML models** to make predictions on the data.
We will showcase both **regression and classification tasks**. For regression, we will predict the power consumption values in **"norm_average_power_consumption"**; for classification, we will predict the **"cluster"** label created through **K-Means**.

### Data Split and Preparation for Modelling

To perform prediction tasks, we need to prepare the data for the models. This includes defining the target values, selecting the input features, and splitting the data into training and test sets.

#### Target Values

In [None]:
y_regression = df["norm_average_power_consumption"].values
y_classification = df["cluster"].values

#### Input Features

### Selecting Input Features for a Tabular Prediction Task

Selecting the right input features is crucial for building an effective machine learning model for a tabular prediction task. The goal is to choose features that are most informative and relevant to the prediction target, while minimizing noise and redundancy.

### Steps for Feature Selection:

1. **Domain Knowledge**:
   - Leverage knowledge on the domain to select features that are likely to be important for the target variable.

2. **Correlation Analysis**:
   - Use correlation matrices to identify and remove highly correlated features. Features that are highly correlated with each other can introduce multicollinearity, leading to less stable models.

3. **Feature Importance**:
   - Use models like **Random Forest** or **Gradient Boosting** to calculate feature importance scores. These models can help identify which features contribute the most to the prediction.

4. **Statistical Methods**:
   - Perform statistical tests (e.g., chi-squared test for categorical features, ANOVA for continuous features) to assess the relationship between each feature and the target variable.

5. **Dimensionality Reduction**:
   - Techniques like **Principal Component Analysis (PCA)** can reduce the feature space, especially when dealing with high-dimensional datasets, while retaining most of the variance. **t-SNE** is a related technique, used mainly to visualize high-dimensional data in two or three dimensions.

6. **Feature Engineering**:
   - Consider creating new features from existing ones. For example, combining categorical features or extracting new insights from continuous variables can improve model performance.

7. **Remove Redundant Features**:
   - Drop irrelevant or redundant features that do not add predictive value to the model or may increase overfitting.

### Considerations:
- Always keep track of how changes to feature selection impact model performance through cross-validation or separate validation datasets.
- Be mindful of the risk of **overfitting** when using too many features, especially when dealing with limited data.
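
The correlation-analysis step can be sketched as follows on a toy DataFrame. The helper `drop_highly_correlated` and the 0.9 threshold are illustrative assumptions, not part of the analysis performed on the PM100 data.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Return the columns to drop so that no kept pair exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exact multiple of "a"
    "c": [5, 1, 4, 2, 3],   # only weakly related to "a"
})

to_drop = drop_highly_correlated(toy)
print(to_drop)
print(toy.drop(columns=to_drop).columns.tolist())
```

For each highly correlated pair, the column that appears later is the one dropped. Which member of a pair to keep is a modeling choice that domain knowledge can inform.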



In [None]:
df.info()

In [None]:
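# Encode the candidate features as integer category codes
eligible_features = ["group_id", "job_state", "num_nodes_alloc", "num_cores_alloc", "num_tasks", "partition", "qos", "run_time", "shared", "threads_per_core", "time_limit", "num_gpus_alloc", "mem_alloc", "user_id"]

for feat in eligible_features:
  df[feat] = df[feat].astype("category").cat.codes

targets = ["cluster", "norm_average_power_consumption"]

# Calculate the correlation matrix. "cluster" was cast to str for plotting,
# so it is converted back to int on a copy to keep corr() fully numeric
corr_input = df[eligible_features + targets].copy()
corr_input["cluster"] = corr_input["cluster"].astype(int)
corr_matrix = corr_input.corr()

# Display the correlation matrix
print(corr_matrix[targets])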


In [None]:
# Plot the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix[targets], annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title("Correlation Heatmap")
plt.show()

In [None]:
# We pick the features that are most correlated with the targets
feature_set = ["job_state", "num_cores_alloc", "partition", "qos", "run_time", "shared", "time_limit"]

#### Data Normalization

Data normalization is a crucial preprocessing step in machine learning, especially when using algorithms that are sensitive to the scale of the input features, such as **K-Nearest Neighbors**, **Support Vector Machines**, and **Neural Networks**. Normalization transforms the features of the dataset into a standard scale, ensuring that no feature dominates over others due to differences in magnitude or units. Common methods include **Min-Max Scaling**, which scales the data to a range between 0 and 1, and **Standardization**, which rescales the data to have a mean of 0 and a standard deviation of 1. Properly normalized data can significantly improve the performance and convergence of machine learning models, helping them to make more accurate predictions.

For this example, we will be using **Min-Max** scaling.
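
As a minimal sketch of the two formulas mentioned above, the snippet below applies Min-Max scaling and standardization by hand to a toy array (the values are made up).

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-Max scaling: map the minimum to 0 and the maximum to 1
values_minmax = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation
values_standard = (values - values.mean()) / values.std()

print(values_minmax)
print(values_standard.mean(), values_standard.std())
```

`MinMaxScaler` applies the first formula to each feature column independently.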




In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
scaler = MinMaxScaler()

# Create the vector of the input features
x = scaler.fit_transform(df[feature_set].values)

In [None]:
x

#### Data Splitting

We split the data into **training and test sets**, with a **70/30** proportion.

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.3

# Subsampling if the computation is too heavy on the resources
subsampling_ratio = 1 # set to 1 if not needed
subsampling_elem = int(len(df) * subsampling_ratio)

# Split the data into training and testing sets (70% training, 30% testing)
x_train_reg, x_test_reg, y_train_reg, y_test_reg = train_test_split(x[:subsampling_elem], y_regression[:subsampling_elem], test_size=test_size, random_state=42)
x_train_clas, x_test_clas, y_train_clas, y_test_clas = train_test_split(x[:subsampling_elem], y_classification[:subsampling_elem], test_size=test_size, random_state=42)
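
The cells above scale the full dataset before splitting it. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training split only, so that the test set does not influence the transformation. All names in this snippet (`features`, `raw_train`, `scaler_sketch`, and so on) are hypothetical and chosen so as not to overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and a regression target
rng = np.random.default_rng(42)
features = rng.uniform(0, 100, size=(200, 3))
target = features.sum(axis=1)

# Split first, on the raw (unscaled) features
raw_train, raw_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler_sketch = MinMaxScaler().fit(raw_train)
scaled_train = scaler_sketch.transform(raw_train)
scaled_test = scaler_sketch.transform(raw_test)

print(scaled_train.min(axis=0), scaled_train.max(axis=0))
print(scaled_test.min(axis=0), scaled_test.max(axis=0))
```

With this approach the training features span exactly [0, 1], while test values may fall slightly outside that range. This is expected, since the test data are treated as unseen.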

### Modeling

We demonstrate how to perform regression and classification tasks with different ML models:

- **Linear Regression (LR)**: Linear regression is a simple statistical model that establishes a relationship between input features and a continuous target variable. It assumes that the relationship between the features and the target is linear, and it aims to minimize the sum of squared errors between the predicted and actual values.
  
- **Decision Tree (DT)**: A decision tree is a hierarchical model that splits the dataset into subsets based on feature values, forming a tree-like structure. Each internal node tests a feature, and each leaf node holds a predicted output. It is easy to interpret but can suffer from overfitting.

- **Random Forest (RF)**: A random forest is an ensemble method that creates multiple decision trees by bootstrapping data samples and using random subsets of features. The final prediction is obtained by averaging the outputs of all trees (regression) or by voting among them (classification), reducing overfitting and improving accuracy compared to a single decision tree.

- **Support Vector Machine (SVM)**: SVM is a supervised learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates data points of different classes with the largest margin. For regression (SVR), it fits a function that keeps as many data points as possible within a tolerance band around the prediction, penalizing only the deviations that exceed it.

- **K-Nearest Neighbors (KNN)**: KNN is a simple, non-parametric algorithm. It computes the distance between data points and, for classification, assigns a point the majority label of its K closest neighbors; for regression, it averages their target values. This makes it highly intuitive but computationally expensive for large datasets.
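
As an illustration of the KNN description above, here is a minimal from-scratch sketch of majority-vote classification on toy 1-D data. The function `knn_predict` is hypothetical. The notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(train_y[nearest].tolist())   # count the labels among them
    return votes.most_common(1)[0][0]

train_x = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([2.5])))  # 0
print(knn_predict(train_x, train_y, np.array([9.0])))  # 1
```

Every prediction computes the distance to all training points, which is why the method becomes expensive on large datasets unless an index structure is used.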


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, classification_report, f1_score, accuracy_score, mean_absolute_percentage_error, ConfusionMatrixDisplay, confusion_matrix

#### Regression Task

In [None]:
# Initialize the models
models_reg = {
    'Linear Regression': LinearRegression(n_jobs = -1),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor(n_jobs = -1),
    # 'Support Vector Regressor': SVR(),
    'K-Neighbors Regressor': KNeighborsRegressor(n_jobs = -1)
}

# Train, predict, and evaluate each model
results = []

for name, model in models_reg.items():
  # Fit the model
  model.fit(x_train_reg, y_train_reg)

  # Make predictions
  y_pred_reg = model.predict(x_test_reg)

  # Calculate evaluation metrics
  mae = mean_absolute_error(y_test_reg, y_pred_reg)
  mse = mean_squared_error(y_test_reg, y_pred_reg)
  mape = mean_absolute_percentage_error(y_test_reg, y_pred_reg)
  r2 = r2_score(y_test_reg, y_pred_reg)

  # Store the results
  results.append({
      'Model': name,
      'MAE': mae,
      'MAPE': mape,
      'MSE': mse,
      'R2': r2
  })

# Create a DataFrame to display the results
results_df_reg = pd.DataFrame.from_records(results)
print(results_df_reg)
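
To clarify what the four metrics measure, the following sketch computes them by hand on three made-up predictions. The formulas follow the usual definitions. Note that MAPE is computed here as a fraction rather than a percentage, which is also how scikit-learn's `mean_absolute_percentage_error` reports it.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
mape = np.mean(np.abs(errors) / np.abs(y_true))  # fraction, not percent
# R2: 1 minus the ratio of residual to total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f} MSE={mse:.3f} MAPE={mape:.3f} R2={r2:.4f}")
```

MAE and MSE are expressed in the units of the target (and its square), while MAPE and R2 are unitless, which makes them easier to compare across targets with different scales.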

#### Visualization of the results

We plot the results to make the models easier to compare.

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(40, 8))
axes = axes.flatten()
metrics = ["MAE", "MAPE", "MSE", "R2"]

for i in range(len(metrics)):
  ax = axes[i]
  metric = metrics[i]
  sns.barplot(data = results_df_reg, x = "Model", y = metric, ax = ax)
  ax.tick_params(axis="x", labelrotation=45)

plt.show()

#### Classification Task

In [None]:
# Initialize the models
models_clas = {
    'Decision Tree Classifier': DecisionTreeClassifier(),
    'Random Forest Classifier': RandomForestClassifier(n_jobs = -1),
    # 'Support Vector Classifier': SVC(),
    'K-Neighbors Classifier': KNeighborsClassifier(n_jobs = -1)
}

# Train, predict, and evaluate each model
results = []

for name, model in models_clas.items():
  # Fit the model
  model.fit(x_train_clas, y_train_clas)

  # Make predictions
  y_pred_clas = model.predict(x_test_clas)

  # Calculate evaluation metrics
  f1 = f1_score(y_test_clas, y_pred_clas, average = "macro")
  accuracy = accuracy_score(y_test_clas, y_pred_clas)
  cr = classification_report(y_test_clas, y_pred_clas)

  print(f"Classification Report of {name}:\n{cr}")

  # Store the results
  results.append({
      'Model': name,
      'F1': f1,
      'Accuracy': accuracy
  })

# Create a DataFrame to display the results
results_df_clas = pd.DataFrame.from_records(results)
print(results_df_clas)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
axes = axes.flatten()
metrics = ["F1", "Accuracy"]

for i in range(len(metrics)):
  ax = axes[i]
  metric = metrics[i]
  sns.barplot(data = results_df_clas, x = "Model", y = metric, ax = ax)
  ax.tick_params(axis="x", labelrotation=45)

plt.show()

#### Error Analysis for Regression and Classification Tasks

When evaluating the performance of the models, it is also important to understand how the metrics are reflected in the actual prediction errors.
For the regression task, we will look at residual plots and at the distribution of the errors.
For the classification task, we will look at the confusion matrices and at the correct and incorrect predictions per class.

##### Residual plot

In [None]:
# Residuals in absolute units (true - predicted)
fig, axes = plt.subplots(1, len(models_reg), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_reg.items():
  # Make predictions
  y_pred_model = model.predict(x_test_reg)
  # Calculate residuals
  residuals = y_test_reg - y_pred_model
  ax = axes[m_idx]
  sns.scatterplot(x=y_pred_model, y=residuals, ax = ax)
  ax.axhline(0, color='red', linestyle='--')
  ax.set_title(f"Residual Plot for {name}")
  ax.set_xlabel("Predicted values")
  ax.set_ylabel("Residuals")
  m_idx+=1

plt.show()
plt.clf()

When regression errors are shown in absolute terms, it is not always easy for the reader to judge how large they are. One way to make the performance easier to interpret is to express the errors as percentages of the true values.

In [None]:
# Percentage residual
fig, axes = plt.subplots(1, len(models_reg), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_reg.items():
  # Make predictions
  y_pred_model = model.predict(x_test_reg)
  # Calculate the absolute errors as a percentage of the true values
  residuals = (np.abs(y_test_reg - y_pred_model)/y_test_reg) * 100
  ax = axes[m_idx]
  sns.scatterplot(x=y_pred_model, y=residuals, ax = ax)
  ax.set_title(f"Percentage Residual Plot for {name}")
  ax.set_xlabel("Predicted values")
  ax.set_ylabel("Residuals (in %)")
  ax.set_ylim(0, 100)
  m_idx+=1

plt.show()
plt.clf()

##### Error distribution

The scatterplot is a good way of visualizing the error; however, it does not make the distribution of the errors easy to read. A good way to inspect the distribution is to plot a histogram of the residuals.

In [None]:

fig, axes = plt.subplots(1, len(models_reg), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_reg.items():
  # Make predictions
  y_pred_model = model.predict(x_test_reg)
  # Calculate residuals
  residuals = y_test_reg - y_pred_model
  ax = axes[m_idx]
  sns.histplot(residuals, bins=30, ax = ax)
  ax.set_title(f"Distribution of Residuals for {name}")
  ax.set_xlabel("Residual")
  ax.set_ylabel("Frequency")
  m_idx+=1

plt.show()
plt.clf()

As before, we also visualize the error as a percentage.

In [None]:
fig, axes = plt.subplots(1, len(models_reg), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_reg.items():
  # Make predictions
  y_pred_model = model.predict(x_test_reg)
  # Calculate the absolute errors as a percentage of the true values
  residuals = (np.abs(y_test_reg - y_pred_model)/y_test_reg) * 100
  ax = axes[m_idx]
  sns.histplot(residuals, bins=50, ax = ax)
  ax.set_title(f"Distribution of Percentage Residuals for {name}")
  ax.set_xlabel("Residual (in %)")
  ax.set_ylabel("Frequency")
  ax.set_yscale("log")
  m_idx+=1

plt.show()
plt.clf()

For the classification task, we can visualize the confusion matrix and the error per class.
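
Before plotting, it may help to see how a confusion matrix relates to the per-class metrics. The sketch below builds one by hand from six made-up labels and derives precision, recall, F1, and accuracy from it. It follows the usual convention of true labels on the rows and predicted labels on the columns.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
n_classes = 3

# Confusion matrix: rows are true labels, columns are predicted labels
cm_toy = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm_toy[t, p] += 1

tp = np.diag(cm_toy)                 # correctly classified samples per class
precision = tp / cm_toy.sum(axis=0)  # correct / all predicted as that class
recall = tp / cm_toy.sum(axis=1)     # correct / all truly in that class
f1_per_class = 2 * precision * recall / (precision + recall)

print(cm_toy)
print(f1_per_class)
print("macro F1:", f1_per_class.mean())
print("accuracy:", tp.sum() / cm_toy.sum())
```

The macro F1 is the unweighted mean of the per-class F1 scores, so a poorly predicted small class lowers it as much as a poorly predicted large one. This sketch assumes every class is predicted at least once; otherwise the precision of the missing class involves a division by zero.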

In [None]:
fig, axes = plt.subplots(1, len(models_clas), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_clas.items():
  ax = axes[m_idx]
  # Make predictions
  y_pred_model = model.predict(x_test_clas)
  # Calculate confusion matrix
  cm = confusion_matrix(y_test_clas, y_pred_model)
  disp = ConfusionMatrixDisplay(confusion_matrix=cm)
  disp.plot(cmap='Blues', ax = ax)
  ax.set_title(f"Confusion Matrix {name}")
  m_idx+=1

plt.show()
plt.clf()

Another way of analyzing the performance per class is to count the correct and incorrect predictions for each true label.

In [None]:
# --- Per-Class Error Analysis ---
fig, axes = plt.subplots(1, len(models_clas), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_clas.items():
  ax = axes[m_idx]
  # Make predictions
  y_pred_model = model.predict(x_test_clas)
  errors = pd.DataFrame({
    "True Label": y_test_clas,
    "Predicted Label": y_pred_model
  })
  errors["Correct"] = errors["True Label"] == errors["Predicted Label"]
  # Count correct and incorrect predictions for each true label
  sns.countplot(data=errors, x="True Label", hue="Correct", ax = ax)
  ax.set_title(f"Correct vs Incorrect Predictions per Class, with {name}")
  m_idx+=1

plt.show()
plt.clf()

By setting the `stat` parameter of `sns.countplot` to "percent", we can draw the same plot with percentage values.

In [None]:
fig, axes = plt.subplots(1, len(models_clas), figsize=(40, 8))
axes = axes.flatten()
m_idx = 0
for name, model in models_clas.items():
  ax = axes[m_idx]
  # Make predictions
  y_pred_model = model.predict(x_test_clas)
  errors = pd.DataFrame({
    "True Label": y_test_clas,
    "Predicted Label": y_pred_model
  })
  errors["Correct"] = errors["True Label"] == errors["Predicted Label"]
  # Same count plot as above, with bar heights expressed as percentages
  sns.countplot(data=errors, x="True Label", hue="Correct", ax = ax, stat="percent")
  ax.set_title(f"Correct vs Incorrect Predictions per Class, with {name}")
  m_idx+=1

plt.show()
plt.clf()