# Exercises on data clustering and local PCA

## Exercise Extra:

The goal of this exercise is to explore K-Means and VQPCA clustering of a combustion dataset.

***

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

%matplotlib inline

In this exercise, we use a dataset that describes the combustion of hydrogen in air.

Below, we load the dataset, $\mathbf{X}$, composed of 9 variables (columns):

$$
\begin{gather}
\mathbf{X} =
\begin{bmatrix}
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
T & Y_{H} & Y_{H_2} & Y_{O} & Y_{OH} & Y_{H_2O} & Y_{O_2} & Y_{HO_2} & Y_{H_2O_2} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\end{bmatrix}
\end{gather}
$$

The first variable in the dataset is temperature, $T$, and the remaining variables are mass fractions, $Y$, of different chemical species.

The dataset has 13,650 observations (rows).

In [None]:
url_X = (r'https://raw.githubusercontent.com/burn-research/data-driven-engineering-course2025/main/Clustering/H2-air-X.csv')
url_X_names = (r'https://raw.githubusercontent.com/burn-research/data-driven-engineering-course2025/main/Clustering/H2-air-X-names.csv')
url_X_mf = (r'https://raw.githubusercontent.com/burn-research/data-driven-engineering-course2025/main/Clustering/H2-air-mixture-fraction.csv')
url_X_hrr = (r'https://raw.githubusercontent.com/burn-research/data-driven-engineering-course2025/main/Clustering/H2-air-heat-release-rate.csv')

X = pd.read_csv(url_X, sep = ',', header=None).to_numpy()
# Drop the last three columns so that only the 9 variables listed above remain
X = X[:,0:-3]

In [None]:
X.shape

We also load names for all of the variables in $\mathbf{X}$.

In [None]:
X_names = pd.read_csv(url_X_names, sep = ',', header=None).to_numpy().ravel()
X_names = X_names[0:-3]

In [None]:
X_names

We also load two additional quantities that will be helpful.

The first one is the *mixture fraction*, which represents the local stoichiometry of the flame at every observation in the dataset:

In [None]:
mixture_fraction = pd.read_csv(url_X_mf, sep = ',', header=None).to_numpy()

In [None]:
mixture_fraction.shape

The second one is the *heat release rate*, which measures the amount of heat released by the combustion process at every observation in the dataset:

In [None]:
heat_release_rate = pd.read_csv(url_X_hrr, sep = ',', header=None).to_numpy()

In [None]:
heat_release_rate.shape

***

## Clustering the dataset with K-Means

We are going to find clusters in the dataset using the K-Means clustering technique. The documentation of the scikit-learn K-Means implementation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

First, we need to preprocess the dataset. Center and scale the dataset $\mathbf{X}$ using Auto (standard) scaling in a similar way to what we've done in the previous exercise:
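One possible way to do this is sketched below. The helper name `auto_scale` is hypothetical, and the guard for constant columns is an added assumption rather than part of the exercise statement.

```python
import numpy as np

def auto_scale(X):
    """Center each column by its mean and scale it by its standard deviation."""
    centers = X.mean(axis=0)
    scales = X.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero
    scales = np.where(scales == 0, 1.0, scales)
    return (X - centers) / scales, centers, scales
```

In this notebook, the call would look like `X_cs, centers, scales = auto_scale(X)`, where `X_cs` is a hypothetical name for the preprocessed dataset. Every column of `X_cs` should then have a mean close to zero and a standard deviation close to one.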

Perform clustering of the centered and scaled dataset $\mathbf{X}$ into 4 clusters:
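A minimal sketch of this step with scikit-learn's `KMeans` follows. The helper name `kmeans_labels` and the choices `n_init=10` and `random_state=0` are assumptions made for reproducibility, not requirements of the exercise.

```python
from sklearn.cluster import KMeans

def kmeans_labels(X_scaled, n_clusters=4, random_state=0):
    """Return one integer cluster label per row of X_scaled."""
    # n_init=10 repeats the algorithm from 10 initializations and keeps the best run
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(X_scaled)
```

Assuming the centered and scaled dataset is stored in a hypothetical variable `X_cs`, the labels could be obtained with `idx_kmeans = kmeans_labels(X_cs)`.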

Visualize the result of clustering in the mixture fraction and temperature space:
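The plotting step could be sketched with a small helper like the one below. The function name `scatter_colored`, the marker size, and the default colormap are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatter_colored(x, y, color_values, color_label, cmap='viridis'):
    """Scatter x versus y, coloring every point by color_values."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # np.ravel flattens (n, 1) column arrays into the 1D shape scatter expects
    points = ax.scatter(np.ravel(x), np.ravel(y), c=np.ravel(color_values), s=2, cmap=cmap)
    ax.set_xlabel('Mixture fraction')
    ax.set_ylabel('Temperature')
    fig.colorbar(points, ax=ax, label=color_label)
    return fig, ax
```

With temperature stored in the first column of `X`, a call such as `scatter_colored(mixture_fraction, X[:,0], idx_kmeans, 'Cluster index')` would show the clusters, where `idx_kmeans` is a hypothetical name for the vector of K-Means labels. The same helper can color the points by any other per-observation quantity.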

Next, we are going to create another clustering solution for comparison. This time, we will run K-Means clustering on the heat release rate variable only.

First, let's visualize what the heat release rate (HRR) variable looks like in the mixture fraction and temperature space.
- To do:
    - Create a scatter plot of mixture fraction versus temperature, and color it with heat release rate.
    - Use the colormap called `'inferno'` for a nicer visualization (in `plt.scatter`, add `cmap='inferno'`).

From the plot above, you can see that there is one localized region where the heat release rate is the highest. Outside of that region, it is zero or close to zero, meaning that combustion is not occurring.

Now, perform K-Means clustering of the heat release rate into 4 clusters. Note that since we now cluster based on a single variable (one vector), it doesn't matter whether we scale the vector or not.
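The remark about scaling can be checked numerically. The sketch below uses a synthetic single variable as a stand-in (it is not the heat release rate data), clusters it with and without auto scaling, and compares the two partitions with scikit-learn's `adjusted_rand_score`. The helper name `labels_1d` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def labels_1d(values, n_clusters=4, random_state=0):
    """Cluster a single variable; KMeans expects a 2D array of shape (n_obs, 1)."""
    column = np.asarray(values, dtype=float).reshape(-1, 1)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(column)

# Synthetic stand-in for a single variable (NOT the heat release rate data)
rng = np.random.default_rng(0)
v = np.concatenate([rng.normal(m, 0.1, 100) for m in (0.0, 5.0, 10.0, 20.0)])

raw = labels_1d(v)
scaled = labels_1d((v - v.mean()) / v.std())

# A score of 1.0 means the two partitions are identical up to label renaming
agreement = adjusted_rand_score(raw, scaled)
print(agreement)
```

Centering shifts all points by the same amount and scaling multiplies all distances by the same positive factor, so the ordering of distances to the centroids is unchanged. For the exercise itself, `labels_1d(heat_release_rate)` would be one way to obtain the labels.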

Visualize the result of clustering in the mixture fraction and temperature space:

What do you observe? Can you relate the clusters to different values of the heat release rate variable from the plot you generated earlier?
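To support the visual comparison with numbers, one option is to summarize the heat release rate inside each cluster. The helper below is a hypothetical sketch using only NumPy.

```python
import numpy as np

def per_cluster_summary(values, idx):
    """Map each cluster label to (min, mean, max, size) of values in that cluster."""
    values = np.ravel(values)
    idx = np.ravel(idx)
    summary = {}
    for label in np.unique(idx):
        members = values[idx == label]
        summary[int(label)] = (members.min(), members.mean(), members.max(), members.size)
    return summary
```

Calling `per_cluster_summary(heat_release_rate, idx_hrr)`, with `idx_hrr` a hypothetical name for the labels obtained from the heat release rate, would show which range of values each cluster covers.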

***

## Clustering the dataset with VQPCA

Below, we are also going to find a clustering solution with VQPCA and compare it with the K-Means clustering obtained before.

We are going to use the VQPCA implementation from the [OpenMORe package](https://github.com/burn-research/OpenMORe).

We import the OpenMORe package and fill in the settings. There you can set, for instance, how the dataset should be centered and scaled, how many clusters, $k$, you want to create, and how many eigenvectors (PCs), $q$, should be used to reconstruct the observations in each cluster at each iteration.

In [None]:
! git clone https://github.com/burn-research/OpenMORe.git

import sys

sys.path.insert(0,'/content/OpenMORe')
import OpenMORe.clustering as clustering

In [None]:
clustering_settings = {
    #centering and scaling options
    "center"                    : True,
    "centering_method"          : "mean",
    "scale"                     : True,
    "scaling_method"            : "auto",

    #set the initialization method (random, observations, kmeans, pkcia, uniform)
    "initialization_method"     : "uniform",

    #set the number of clusters and PCs in each cluster
    "number_of_clusters"        : 4,
    "number_of_eigenvectors"    : 2,

    #enable additional options:
    "correction_factor"         : "off",    # --> enable eventual corrective coefficients for the LPCA algorithm:
                                            #     'off', 'c_range', 'uncorrelation', 'local_variance', 'phc_multi', 'local_skewness' are available

    "classify"                  : False,    # --> call the method to classify a new matrix Y on the basis of the lpca clustering
    "write_on_txt"              : False,     # --> write the idx vector containing the label for each observation
    "evaluate_clustering"       : False,     # --> enable the calculation of indices to evaluate the goodness of the clustering

    #improve the clustering solution via kNN
    "kNN_post"                  : False,     # activate the kNN algorithm once the convergence is achieved
    "neighbors_number"          : 2,       # set the number of neighbors that has to be taken into account
}
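To make the roles of `number_of_clusters` ($k$) and `number_of_eigenvectors` ($q$) more concrete, here is a minimal NumPy sketch of the VQPCA idea. It is an illustration written for this notebook, not the OpenMORe implementation: the function name `vqpca_sketch` is hypothetical, the input is assumed to be already centered and scaled, and "uniform" initialization is taken here to mean splitting the observations, in order, into $k$ equal chunks.

```python
import numpy as np

def vqpca_sketch(X, k=4, q=2, max_iter=100):
    """Assign each row of X to the cluster whose local q-dimensional PCA reconstructs it best."""
    n_obs = X.shape[0]
    # Assumed uniform initialization: consecutive chunks of observations
    idx = (np.arange(n_obs) * k) // n_obs
    for _ in range(max_iter):
        errors = np.full((n_obs, k), np.inf)
        for j in range(k):
            members = X[idx == j]
            if members.shape[0] <= q:
                continue  # too few points to define q local PCs; cluster stays unused
            centroid = members.mean(axis=0)
            # Rows of Vt are the local principal directions, sorted by variance
            _, _, Vt = np.linalg.svd(members - centroid, full_matrices=False)
            A = Vt[:q].T
            centered = X - centroid
            # Squared error after projecting onto the q local PCs and back
            residual = centered - centered @ A @ A.T
            errors[:, j] = np.sum(residual**2, axis=1)
        new_idx = np.argmin(errors, axis=1)
        if np.array_equal(new_idx, idx):
            break  # assignments stopped changing
        idx = new_idx
    return idx
```

Unlike K-Means, which assigns an observation to the nearest centroid, this procedure assigns it to the cluster with the smallest local PCA reconstruction error. Increasing $q$ lets each cluster capture more of its local variance, which changes where the cluster boundaries fall.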

Perform VQPCA clustering of the centered and scaled dataset $\mathbf{X}$ into 4 clusters:

Visualize the result of clustering in the mixture fraction and temperature space:

Go back to the `clustering_settings` dictionary and play with changing the number of clusters and the number of eigenvectors.

Are the differences in clustering solution subtle or significant?