# Oil analysis using k-means clustering 
The essentials of k-means clustering are best illustrated visually, using a fictitious oil analysis with only two measurements, $x$ and $y$. The scatter plot below shows several paired measurements ($x$,$y$) that are not entirely independent.

![image](figures/Oilanalysis_kmeans01.png).

Let's model this dependency with k=2 clusters rather than with a linear equation. Because nothing is known in advance about the centroids of these two clusters, their initial positions are chosen randomly.

![image](figures/Oilanalysis_kmeans02.png).

Each measurement ($x$,$y$) is now nearest to one of these two randomly chosen centroids.

![image](figures/Oilanalysis_kmeans03.png).

The measurements are clustered accordingly: each measurement joins the cluster of its nearest centroid, and the center of gravity of each cluster becomes the new centroid.

![image](figures/Oilanalysis_kmeans04.png).

Again, each measurement ($x$,$y$) is nearest to one of these two new centroids, so some measurements may change cluster.

![image](figures/Oilanalysis_kmeans05.png).

Again, the center of gravity of each newly formed cluster becomes its new centroid. When this procedure is repeated several times, the clusters stabilise.

![image](figures/Oilanalysis_kmeans06.png).
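The procedure pictured above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of the demo script: the function names (`nearest`, `assign`, `update`, `kmeans`), the `seed` parameter and the six ($x$,$y$) pairs are all made up for this example, and the initial centroids are assumed to be drawn at random from the measurements themselves.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def assign(points, centroids):
    """Assignment step: label each measurement with its nearest centroid."""
    return [nearest(p, centroids) for p in points]

def update(points, labels, centroids):
    """Update step: move each centroid to the center of gravity of its cluster."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:              # empty cluster: keep the old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)    # random initial centroids (assumption)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:         # the clusters have stabilised
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Made-up (x, y) pairs, not taken from the oil-analysis data.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The loop stops as soon as an assignment step reproduces the previous labels, which is the stabilisation described above.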

### K-means clustering is sensitive to the choice of k
K-means clustering requires the number of clusters k to be chosen in advance. An intuition about the expected number of clusters may exist, e.g. the data is known to comprise two grades of oil. But this intuition is often lacking. One may then resort to an optimisation criterion that balances the distances within the clusters against the number of clusters.

![image](figures/Oilanalysis_kmeans07.png).

An elbow plot visualises this optimisation criterion. For k=1, the criterion trivially equals the total scatter in the measurements ($x$,$y$). When k equals the sample size, each measurement ($x$,$y$) trivially forms its own cluster and the spread within each cluster becomes zero. The aim is therefore to choose the smallest number of clusters that still explains most of the scatter in the measurements ($x$,$y$), which is where the curve bends (the "elbow").
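As a rough sketch of how the values behind an elbow plot could be computed, the code below runs a compact k-means for every k from 1 up to the sample size and records the within-cluster sum of squares. The helper names (`sq_dist`, `kmeans_wcss`), the ten restarts per k and the eight data points are assumptions made for this illustration; the demo script may compute its criterion differently.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Within-cluster sum of squares after one k-means run with k clusters."""
    centroids = random.Random(seed).sample(points, k)   # random start from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:                                  # skip empty clusters
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))

# Made-up measurements forming two loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]

# For each k, keep the best of ten random starts.
curve = {k: min(kmeans_wcss(points, k, seed=s) for s in range(10))
         for k in range(1, len(points) + 1)}

for k, value in curve.items():
    print(k, round(value, 4))
```

The first value is the total scatter around the overall mean and the last value is zero, matching the two trivial ends of the elbow plot described above.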

### K-means clustering is sensitive to scaling and transformations
K-means clustering, and hence the elbow plot, is sensitive to the scaling of the measurements. The scatter plots in the picture below represent the *same* bivariate data ($x$,$y$); only the scale of $x$ differs. However, in the right-hand plot, the distance to the centroid along $x$ dominates the distance along $y$. The influence of $y$ may thus become negligible merely through a change of units (millimeters instead of meters).

![image](figures/Oilanalysis_kmeans10.png).
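A small calculation makes the effect of units concrete. The sketch below splits the squared Euclidean distance between one point and one centroid into the share contributed by each coordinate; the function name `squared_distance_shares` and all numbers are hypothetical.

```python
def squared_distance_shares(point, centroid):
    """Fraction of the squared Euclidean distance contributed by each coordinate."""
    terms = [(p - c) ** 2 for p, c in zip(point, centroid)]
    total = sum(terms)
    return [t / total for t in terms]

# Same point and centroid; only the unit of x changes (made-up values).
shares_m = squared_distance_shares((0.003, 2.0), (0.001, 1.0))   # x in meters
shares_mm = squared_distance_shares((3.0, 2.0), (1.0, 1.0))      # x in millimeters

print("x in m :", shares_m)
print("x in mm:", shares_mm)
```

With $x$ in meters nearly all of the distance comes from $y$; with $x$ in millimeters most of it comes from $x$, although the underlying measurement is unchanged.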

In the raw oil-analysis data, the ranges of the measurements differ considerably. A naive k-means clustering would effectively ignore the measurements with a small range. The demo script therefore uses a standard scaler that transforms each measurement $x$ into a standard score $z$ by:

$$z = \frac{x-\bar{x}}{s}$$

Here, $\bar{x}$ is the mean and $s$ is the standard deviation of the measurements $x$. The result of the k-means clustering is sensitive to this arbitrary transformation of the measurements. 
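The transformation can be sketched with the standard library alone. This is not the scaler used in the demo script: the function name `standardise` and the data are made up, and $s$ is assumed here to be the sample standard deviation (`statistics.stdev`); a library scaler may use the population standard deviation instead.

```python
import statistics

def standardise(values):
    """Return the standard scores z = (x - mean) / s of a list of measurements."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)    # assumption: sample standard deviation
    return [(x - mean) / s for x in values]

length_m = [0.012, 0.015, 0.011, 0.019, 0.014]    # hypothetical values in meters
length_mm = [1000 * x for x in length_m]          # the same values in millimeters

z_m = standardise(length_m)
z_mm = standardise(length_mm)

print([round(z, 3) for z in z_m])
print([round(z, 3) for z in z_mm])
```

Both lists of standard scores agree up to rounding error, so the unit of measurement no longer affects the distances used by k-means. The choice of this particular transformation remains a modelling decision.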

### K-means clustering is sensitive to irrelevant data
K-means clustering is only useful when dependencies in the data are expected: if all measurements were known to be independent, a search for clusters among them would serve little purpose. So, k-means clustering should typically confirm expectations. For example, some measurement $x$ may be expected to take either a high or a low value, depending on a condition that is either "green" or "ocher". A k-means clustering with k=2 on this measurement $x$ alone then perfectly identifies whether the condition was "green" or "ocher":

![image](figures/Oilanalysis_kmeans08.png).

However, if the data set is extended to ($x$,$y$), where $y$ is independent of $x$, a k-means clustering with k=2 on the extended data set no longer perfectly identifies whether the condition was "green" or "ocher". The picture below extends the data from the previous picture with fair coin flips $y$ that are either 0 or 1, and the k-means clustering no longer identifies the condition.

![image](figures/Oilanalysis_kmeans09.png).
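One way to see why this happens is to compare the k-means objective (the within-cluster sum of squares) for two candidate partitions of the extended data: one split by the condition and one split by the coin flip. The sketch below uses made-up numbers, not the data behind the pictures, and the function name `wcss` is an assumption.

```python
def wcss(clusters):
    """k-means objective: summed squared distance of each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        centre = [sum(c) / len(members) for c in zip(*members)]
        total += sum((v - m) ** 2 for p in members for v, m in zip(p, centre))
    return total

# (x, coin flip) pairs; x is low for "green" and high for "ocher" (made-up values).
green = [(0.40, 0), (0.45, 1), (0.42, 0), (0.48, 1)]
ocher = [(0.60, 0), (0.55, 1), (0.58, 1), (0.52, 0)]
data = green + ocher

by_condition = [green, ocher]
by_coin = [[p for p in data if p[1] == 0], [p for p in data if p[1] == 1]]

# The same split by condition, evaluated on x alone.
x_only = [[(x,) for x, _ in green], [(x,) for x, _ in ocher]]

print("x alone, split by condition:", wcss(x_only))
print("(x, y), split by condition :", wcss(by_condition))
print("(x, y), split by coin flip :", wcss(by_coin))
```

On $x$ alone the split by condition leaves very little scatter. Once the coin flip is added, the split by coin flip has a far smaller objective (about 0.04 against about 2.01), so a procedure that minimises this objective favours the irrelevant variable.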

This implies that a k-means clustering on all the measurements of the oil analysis (as is done in the demo script) is quite naive if specific expected dependencies are of interest. For example, it may be expected that:
- contamination by solid particles influences the ISO4406 and LNF measurements
- contamination by liquids influences the fuel, water and Na (antifreeze) measurements
- degradation of additives influences the TAN, TBN, P, Mn, Pb measurements 
- loss of tribological properties influences the viscosity and the flash point measurements

These expected dependencies may then become easier to identify by excluding the measurements that are known to be independent of them.
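Restricting the clustering to one group of measurements could look like the sketch below. The grouping follows the list above, but the column names, the group keys, the `select` helper and the two records are hypothetical; the real data set may label its measurements differently.

```python
# Hypothetical column names and group keys, following the list above.
GROUPS = {
    "solid_particles": ["ISO4406", "LNF"],
    "liquids": ["fuel", "water", "Na"],
    "additive_degradation": ["TAN", "TBN", "P", "Mn", "Pb"],
    "tribological_loss": ["viscosity", "flash_point"],
}

def select(records, group):
    """Keep only the measurements expected to depend on one mechanism."""
    return [[record[name] for name in GROUPS[group]] for record in records]

# Two made-up oil samples with only a few of the measurements filled in.
records = [
    {"fuel": 0.5, "water": 0.02, "Na": 12.0, "viscosity": 98.0, "flash_point": 210.0},
    {"fuel": 2.1, "water": 0.30, "Na": 45.0, "viscosity": 81.0, "flash_point": 185.0},
]

liquids_only = select(records, "liquids")
print(liquids_only)
```

The resulting rows can then be scaled and clustered on their own, so that unrelated measurements cannot dominate the distances.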

### K-means clustering is sensitive to the initial choice of the centroids
The initial positions of the centroids are usually chosen randomly, but this choice may affect the composition of the clusters. It is therefore wise to run the k-means clustering several times to investigate reproducibility. The clustering that explains most of the scatter, that is, the one with the smallest scatter within the clusters, is usually the preferred solution.
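The sketch below illustrates this sensitivity on four made-up points at the corners of a wide rectangle. To keep the outcome reproducible, the two sets of initial centroids are picked by hand rather than at random; in practice the restarts would use random starts. The names `sq_dist`, `lloyd` and `starts` are assumptions for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (labels, WCSS)."""
    centroids = list(centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, wcss

# Corners of a 4-by-1 rectangle (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two hand-picked starts that each converge immediately, to different clusterings.
starts = {
    "left/right": [(0.0, 0.5), (4.0, 0.5)],
    "bottom/top": [(2.0, 0.0), (2.0, 1.0)],
}

results = {name: lloyd(points, init) for name, init in starts.items()}
best = min(results, key=lambda name: results[name][1])

for name, (labels, wcss) in results.items():
    print(name, labels, wcss)
print("preferred:", best)
```

Both runs are stable, yet they leave very different amounts of scatter within the clusters. Keeping the run with the smallest value is one simple way to choose among several restarts.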

# [Click here to see the k_means script](https://nbviewer.jupyter.org/github/chrisrijsdijk/RAMS/blob/master/notebook/Oilanalysis_kmeans.ipynb?flush_cache=true)