# Oil analysis using k-means clustering 
The essentials of k-means clustering are best illustrated visually with a fictitious oil analysis of only two measurements, x and y. The scatter plot below shows several paired measurements (x,y) that are not entirely independent.

![image](figures/Oilanalysis_kmeans01.png).

Let's model this dependency with k=2 clusters (rather than with a linear equation). Since the centroids of these two clusters are not known in advance, they are chosen randomly.

![image](figures/Oilanalysis_kmeans02.png).
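A minimal sketch of such a random start, in plain Python. It assumes the centroids are drawn from the measurements themselves, which is one common choice and not necessarily the one used for the figure. The function name `init_centroids` and the data are hypothetical.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct measurements at random to serve as the initial centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Hypothetical paired measurements (x, y), made up for illustration only.
points = [(1.0, 1.2), (1.5, 1.8), (1.1, 0.9), (4.0, 4.2), (4.5, 3.9), (5.0, 4.8)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```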

Each measurement (x,y) is now assigned to whichever of the two randomly chosen centroids is nearest.

![image](figures/Oilanalysis_kmeans03.png).
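The assignment step can be sketched as follows, assuming Euclidean distance. The function name `assign_labels`, the measurements and the starting centroids are hypothetical.

```python
import math

def assign_labels(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical measurements and starting centroids, made up for illustration only.
points = [(1.0, 1.2), (1.5, 1.8), (1.1, 0.9), (4.0, 4.2), (4.5, 3.9), (5.0, 4.8)]
centroids = [(1.0, 1.2), (4.0, 4.2)]
labels = assign_labels(points, centroids)
print(labels)
```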

This assignment clusters the measurements, and the center of gravity (the mean) of each cluster becomes its new centroid.

![image](figures/Oilanalysis_kmeans04.png).
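The update step is a per-cluster mean. The sketch below assumes that every cluster has at least one member. The function name `update_centroids` and the data are hypothetical.

```python
def update_centroids(points, labels, k):
    """Return the mean (center of gravity) of the points carrying each label."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        # Assumption: no cluster is empty; an empty cluster would need special handling.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

# Hypothetical measurements with a given cluster label each.
points = [(1.0, 1.0), (2.0, 3.0), (5.0, 5.0), (7.0, 3.0)]
labels = [0, 0, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)
```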

Again, each measurement (x,y) is assigned to the nearest of these new centroids.

![image](figures/Oilanalysis_kmeans05.png).

Again, the center of gravity of each newly formed cluster becomes its new centroid. When this procedure is repeated several times, the clusters stabilise.

![image](figures/Oilanalysis_kmeans06.png).
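Putting the two steps in a loop gives a basic k-means. This is a plain-Python sketch under stated assumptions (Euclidean distance, starting centroids drawn from the data, stop when the labels no longer change), not the implementation used in the demo script. The function name `kmeans` and the data are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # the clusters have stabilised
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster happens to be empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical measurements, made up for illustration only.
points = [(1.0, 1.2), (1.5, 1.8), (1.1, 0.9), (4.0, 4.2), (4.5, 3.9), (5.0, 4.8)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Which cluster receives label 0 and which receives label 1 depends on the random start, so only the grouping itself is meaningful.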

K-means clustering only assigns a cluster label to each measurement; the choice of the number of clusters is left to the analyst, as is the explanation for these clusters (wear, different grades, ...). In the absence of knowledge of the data, the analyst may resort to an optimisation criterion that balances the distances within the clusters against the number of clusters. An elbow plot visualises this criterion. For k=1, the plot trivially shows the total scatter in the measurements (x,y). When k equals the sample size, each measurement (x,y) is its own cluster and the spread within every cluster is zero. The criterion should therefore keep the number of clusters small while still explaining most of the scatter in the measurements (x,y).

![image](figures/Oilanalysis_kmeans07.png).
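The numbers behind an elbow plot can be sketched as follows. The sketch assumes the criterion is the within-cluster sum of squared distances (often called inertia) and takes the best of a few random starts for every k; the demo script may use a different criterion. The function name `kmeans_inertia` and the data are hypothetical.

```python
import math
import random

def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run a basic k-means once and return the within-cluster sum of squared distances."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical measurements, made up for illustration only.
points = [(1.0, 1.2), (1.5, 1.8), (1.1, 0.9), (4.0, 4.2), (4.5, 3.9), (5.0, 4.8)]

# For every k from 1 up to the sample size, keep the best of ten random starts.
inertias = {
    k: min(kmeans_inertia(points, k, seed=s) for s in range(10))
    for k in range(1, len(points) + 1)
}
for k, value in inertias.items():
    print(k, round(value, 3))
```

Plotting these values against k gives the elbow plot: the value at k=1 is the total scatter, and the value at k equal to the sample size is zero.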

However, knowledge of the dependencies in the data is often available. For example, suppose it is known that a measurement (x) takes either a high or a low value depending on a condition that is either "green" or "ocher". A k-means clustering with k=2 on this measurement (x) then perfectly identifies whether the condition was "green" or "ocher":

![image](figures/Oilanalysis_kmeans08.png).

However, if the data set is extended to (x,y), where (y) is independent of (x), a k-means clustering with k=2 on the extended data set no longer perfectly identifies whether the condition was "green" or "ocher". The picture below extends the data from the previous picture with fair coin flips (y) that are either 0 or 1, and the k-means clustering no longer identifies the condition.

![image](figures/Oilanalysis_kmeans09.png).
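The effect can be reproduced on a small made-up data set. The sketch below assumes that the gap between the low and high values of x is small compared with the gap between the coin-flip values 0 and 1; all numbers and function names are hypothetical. To keep the outcome reproducible, it tries every pair of measurements as starting centroids and keeps the result with the lowest within-cluster sum of squares, which stands in for repeated random starts.

```python
import math
from itertools import combinations

def kmeans_from(points, start):
    """Basic k-means from the given starting centroids; returns (inertia, labels)."""
    centroids = list(start)
    k = len(centroids)
    labels = None
    for _ in range(100):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return inertia, labels

def best_two_clusters(points):
    """Try every pair of points as starting centroids and keep the lowest inertia."""
    return min(kmeans_from(points, pair) for pair in combinations(points, 2))[1]

def agreement(labels, truth):
    """Share of samples whose cluster matches the condition, ignoring label order."""
    matches = sum(l == t for l, t in zip(labels, truth))
    return max(matches, len(labels) - matches) / len(labels)

# Hypothetical data: condition 0 = "green" (low x), 1 = "ocher" (high x).
condition = [0, 1, 0, 1, 0, 1, 0, 1]
x = [0.10, 0.30, 0.12, 0.28, 0.09, 0.31, 0.11, 0.29]
flips = [0, 0, 1, 1, 1, 0, 0, 1]  # made-up coin flips, independent of the condition

agreement_x = agreement(best_two_clusters([(v,) for v in x]), condition)
agreement_xy = agreement(best_two_clusters(list(zip(x, flips))), condition)
print(agreement_x, agreement_xy)
```

On this toy data the clustering on x alone recovers the condition for every sample, whereas the clustering on (x,y) splits the samples along the coin flip and does no better than chance.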

This implies that a k-means clustering on all the measurements of the oil analysis (as is done in the demo script) is quite naive if the dependencies among these measurements are known. For example, it may be known that:
- contamination by solid particles is expected to affect the ISO4406 and LNF measurements
- contamination by liquids is expected to affect the fuel, water and Na (antifreeze) measurements
- degradation of additives is expected to affect the TAN, TBN, P, Mn, Pb measurements 
- loss of tribological properties is indicated by the Viscosity and the flash point measurements

Each of these causes of oil degradation may then become easier to identify by k-means clustering if the measurements that are known to be independent of that cause are censored (left out).
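Censoring amounts to selecting a subset of measurements before clustering. In the sketch below, the grouping follows the list above, while the dictionary keys, the column names, the function name `censor` and the sample values are hypothetical and need not match the demo script.

```python
# Grouping taken from the list above; the keys and column names are hypothetical labels.
MEASUREMENTS_BY_CAUSE = {
    "solid_particles": ["ISO4406", "LNF"],
    "liquids": ["fuel", "water", "Na"],
    "additive_degradation": ["TAN", "TBN", "P", "Mn", "Pb"],
    "tribological_loss": ["viscosity", "flash_point"],
}

def censor(sample, cause):
    """Keep only the measurements assumed to depend on the given cause."""
    return {name: sample[name] for name in MEASUREMENTS_BY_CAUSE[cause] if name in sample}

# One made-up oil sample; the numbers are illustrative, not real measurements.
sample = {"ISO4406": 18.0, "LNF": 3.1, "fuel": 0.2, "water": 0.05,
          "Na": 4.0, "TAN": 1.1, "TBN": 7.9, "viscosity": 98.0}
liquid_features = censor(sample, "liquids")
print(liquid_features)
```

The censored samples, one per oil analysis, would then be passed to the k-means clustering in place of the full set of measurements.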

In conclusion, if knowledge about the dependencies in a data set is lacking, an elbow plot may help to find clusters that can hopefully be explained afterwards. If knowledge about the dependencies in a data set is available, it is wise to use it to leave out the measurements that are independent of the cause of interest.

# [Click here to see the k_means script](https://nbviewer.jupyter.org/github/chrisrijsdijk/RAMS/blob/master/notebook/Oilanalysis_kmeans.ipynb)