 <img src="uva_seal.png"> 

## MLlib Clustering

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: September 29, 2024

---  


### SOURCES  
- Learning Spark, Chapter 11: Machine Learning with MLlib  

- https://spark.apache.org/docs/latest/ml-clustering.html

- [Cluster cohesion](https://towardsdatascience.com/explain-ml-in-a-simple-way-k-means-clustering-e925d019743b)

- [Silhouette score](https://en.wikipedia.org/wiki/Silhouette_(clustering))

- [Silhouette toy example](https://medium.com/@MrBam44/how-to-evaluate-the-performance-of-clustering-algorithms-3ba29cad8c03)

### OBJECTIVES
Introduction to some of the major clustering techniques in MLlib using the DataFrame API

### CONCEPTS

- Unsupervised learning
- K-means
- Silhouette Score
- Mixture of Gaussians

---

**Unsupervised Learning**  
In this task, labels are unknown and the analyst wishes to segment the observations into groups of high similarity, where similarity is defined in terms of the feature space.

Common use cases are:
- Data exploration to discover the properties of similar observations  
- Outlier detection; outliers will generally form their own group (e.g., singletons)  

**K-Means**  
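This is the most popular clustering algorithm, with widespread use in industry. It is relatively simple, has a single main parameter (the number of clusters $K$), and always converges on a solution, though possibly a local optimum rather than the global minimum of the within-cluster sum of squares.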

The following clustering models are supported in `spark.ml`, the DataFrame-based API of MLlib:

- K-means
- Gaussian mixture
- Power iteration clustering (PIC)
- Latent Dirichlet allocation (LDA)
- Bisecting k-means

**<center>K-Means Specs</center>**

| Item   | Description |
| -------- | ----------- |
| Supervised/Unsupervised | Unsupervised |
| Initialization | Random Assignment |
| Assumptions | Euclidean Distance |
| Preprocessing | Scaling |
| Parameters | $K:$ number of clusters |
| Metrics | Inertia |
| Strengths | One parameter, relatively simple |
| Weaknesses | 1. May not find global optimum <br> 2. Can't handle non-quantitative data (e.g., categorical)<br> 3. Assumes spherical cluster shape|

**K-Means Sample 2D Visualization ($K=3$)**

<img src="k_means_before_after.png">

| K-Means Sample Workflow | 
| -------- | 
| 1. feature selection | 
| 2. feature standardization | 
| 3. run algo for sequence of $K$ |  
| 4. examine results and remediate outliers <br> <span style="color:red">loop on 3-4 as needed</span>| 
| 5. select $K^*$, extract cluster assignments | 
| 6. enrich with domain knowledge | 
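
Step 2 of the workflow matters because k-means relies on Euclidean distance, so a feature with a large numeric range can dominate the clustering. Below is a minimal pure-Python sketch of z-score standardization. The toy values and the helper name `standardize` are made up for illustration; in Spark this step would typically be handled by a feature transformer such as `StandardScaler`.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: subtract the column mean, divide by the population std dev."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information for clustering; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical features on very different scales
income = [30000.0, 50000.0, 70000.0]
age = [25.0, 35.0, 45.0]

income_z = standardize(income)
age_z = standardize(age)

# After standardization both features have mean 0 and unit spread,
# so neither one dominates a Euclidean distance.
print(income_z)
print(age_z)
```
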

**K-Means Metric**

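To measure the quality of a clustering, we use the *within-cluster sum of squares (WSS)*, also called inertia.  

For each cluster, we compute the sum of squared distances between each point and the cluster centroid.  
Then we add these sums across all clusters. WSS measures the internal cohesion of the clusters; lower values indicate tighter clusters.  

$$ \texttt{Within Sum of Squares} \,\, (WSS) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is its centroid. Since the total sum of squares is the sum of the within and between sums of squares, the fraction of variation that remains inside clusters is

$$ \frac{\texttt{Within Sum of Squares}}{\texttt{Total Sum of Squares}} = 1 - \frac{\texttt{Between Sum of Squares}}{\texttt{Total Sum of Squares}} $$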

This diagram illustrates:

<img src="./cohesion.png" width=400>
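
The cohesion measure can be computed directly from its verbal definition: sum the squared distance from each point to its cluster centroid, then add across clusters. The following is a small pure-Python sketch on made-up 2D points; the helper names (`squared_distance`, `centroid`, `wss`) are hypothetical and are not part of MLlib.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def wss(clusters):
    """Within-cluster sum of squares for a list of clusters (each a list of points)."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_distance(p, center) for p in points)
    return total

# Toy data (made up): two tight clusters
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],      # centroid (0, 1)
    [(10.0, 10.0), (12.0, 10.0)],  # centroid (11, 10)
]

total_wss = wss(clusters)
print(total_wss)  # each point is 1 unit from its centroid, so WSS = 4.0
```
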

**K-Means: Selecting $K^*$**

Below is a scree plot of WSS versus the number of clusters.  
One method for selecting $K^*$ is to identify the elbow in the scree plot. Beyond the elbow, adding more clusters reduces WSS only marginally, because the extra clusters generally come from splitting well-formed clusters apart.

<img src='scree_plot_k_means2.png' width=400>
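
To see the elbow numerically without Spark, here is a sketch on made-up 1D data with three obvious groups. For simplicity it finds the lowest achievable WSS for each $K$ by brute force over contiguous splits of the sorted values (in one dimension, optimal clusters are contiguous). This exhaustive search is an illustrative stand-in, not how k-means or MLlib works, and the helper names are hypothetical.

```python
from itertools import combinations

def sse(values):
    """Sum of squared deviations of the values from their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_wss(sorted_values, k):
    """Lowest WSS over all ways to cut the sorted values into k contiguous groups."""
    n = len(sorted_values)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(sorted_values[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

# Toy data (made up): three well-separated groups
data = sorted([1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0])

wss_by_k = {k: best_wss(data, k) for k in range(1, 5)}
for k, value in wss_by_k.items():
    print(k, value)
```

WSS drops sharply up to $K=3$ (606, 156, 6) and then barely moves (4.5 at $K=4$), because the fourth cluster can only be created by splitting a well-formed group. In PySpark, the analogous quantity for a fitted k-means model should be available as `model.summary.trainingCost`; verify this against the API documentation for your Spark version.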

**K-Means Implementation**

`MLlib` contains an implementation of k-means that supports both random initialization and `k-means||`.  
`k-means||` is a parallelized variant of k-means++ that provides a better initialization in distributed environments.  

Please see k-means extension [deck](https://github.com/UVADS/distributed_computing/blob/main/05_clustering_and_dim_reduction/content_clustering/k_means_extensions.ppt) for details.

The `initMode` parameter accepts `'random'` or `'k-means||'`; `'k-means||'` is the default initialization method.


**Methods:**  
Train the model with `fit()`  
Access the cluster centers with `clusterCenters()`, which returns a list of arrays  
Call `transform()` on a DataFrame of feature vectors to append each observation's assigned cluster; this is the cluster with the closest center.
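
The assignment rule itself is simple: a point belongs to the cluster whose center is closest. Here is a minimal pure-Python sketch of that rule, using made-up centers and points; `assign_cluster` is a hypothetical helper and not an MLlib function.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_cluster(point, centers):
    """Return the index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

# Hypothetical cluster centers, standing in for a fitted model's centers
centers = [(0.0, 0.0), (10.0, 10.0)]

# New observations to assign
new_points = [(1.0, 2.0), (9.0, 7.0)]

labels = [assign_cluster(p, centers) for p in new_points]
print(labels)  # [0, 1]
```
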

**K-Means Example**

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

# Loads data
df = spark.read.csv("kmeans_data.txt", header=True, inferSchema=True)
df.show()

#### Assemble the features

In [None]:
feats = ['f1', 'f2', 'f3']
# Combine the feature columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=feats, outputCol="features")
dataset = assembler.transform(df)
dataset.show(truncate=False)

Notice that the features in the first observation are stored in sparse format (since all values are zero).

#### Train a k-means model with k=2

In [None]:
kmeans = KMeans().setK(2).setSeed(314).setMaxIter(10)
model = kmeans.fit(dataset)

#### Make Predictions

Note that the `transform()` method does the prediction: it appends a `prediction` column holding each observation's cluster index.

In [None]:
predictions = model.transform(dataset)
predictions.show()

#### Evaluate the Model

In [None]:
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

print("Cluster Centers: ")
centers = model.clusterCenters()
print(centers)

**Notice that the cluster centers are intuitive: each one sits at the mean of the observations assigned to it.**

#### Silhouette Score

The silhouette score measures how well each observation fits its assigned cluster compared with the nearest other cluster.  

It falls in the range [-1, 1], where
- a value close to 1 means that points sit close to their own cluster and far from other clusters (this is a good clustering)
- a value near 0 indicates overlapping clusters  
- a negative value indicates observations that may be assigned to the wrong cluster

Here is a toy example computing the silhouette score for a single point:

<img src="./silhouette.png" width=300>

Here is the algorithm to compute the silhouette score: 

1) For each observation (point), compute A and B with these definitions:

   A : the mean distance from the point to the other points in its own cluster (called the *mean intra-cluster distance*).
 
   B : the mean distance from the point to the points in the nearest other cluster (called the *mean nearest-cluster distance*). The `min()` function in the illustration above selects, among the other clusters, the one with the smallest mean distance.

   For well-assigned points, quantity *A* should be much smaller than quantity *B*.

2) Compute for each point *i* the quantity:  $s(i) = (B - A) / \max(A, B)$

3) Average s(i) over all points to arrive at the silhouette score.

You can read more about the metric [here](https://en.wikipedia.org/wiki/Silhouette_(clustering)).
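
The three steps can be sketched in pure Python as follows. This is an illustrative implementation on made-up 1D points, not the MLlib code. Two assumptions to note: it uses plain Euclidean distance (whereas the `ClusteringEvaluator` example above reports a squared Euclidean variant), and it scores a point alone in its cluster as 0. The function name `silhouette_score` is chosen here for illustration.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention assumed here: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # A: mean distance to the other points in the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # B: smallest mean distance to the points of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (made up): two well-separated clusters on a line
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]

score = silhouette_score(points, labels)
print(round(score, 4))
```

For the point at 0, A = 2 and B = (10 + 12) / 2 = 11, so s = 9/11; for the point at 2, A = 2 and B = 9, so s = 7/9. By symmetry the overall mean is 79/99, close to 1, as expected for well-separated clusters.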

### Gaussian Mixture Model

The *Gaussian Mixture Model* represents the data as a weighted combination of underlying Gaussian distributions, each with its own mixing weight (probability).  
The *expectation-maximization algorithm* is used in `spark.ml` to estimate the parameters.  
Each cluster has its own mean vector and covariance matrix.  
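
Unlike k-means, a mixture model gives each point a soft assignment: the posterior probability that it came from each component. The sketch below computes these probabilities (the quantity evaluated in the expectation step) for a 1D mixture of two Gaussians. The weights, means, and standard deviations are made-up values, the helper names are hypothetical, and this is not the Spark implementation.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2 * pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each component given x."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture parameters: two equally weighted components
weights = [0.5, 0.5]
means = [0.0, 10.0]
stds = [1.0, 1.0]

near_first = responsibilities(1.0, weights, means, stds)
midpoint = responsibilities(5.0, weights, means, stds)

print(near_first)  # almost all probability on the first component
print(midpoint)    # [0.5, 0.5]: the point is equally far from both means
```

In PySpark, calling `transform()` on a fitted `GaussianMixtureModel` adds a `probability` column with these per-component probabilities alongside the hard `prediction`.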

**Fit Mixture of Two Gaussians**

In [None]:
from pyspark.ml.clustering import GaussianMixture

# reuse data from K-Means example above

gmm = GaussianMixture().setK(2).setSeed(314)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
print("component mean vectors")
model.gaussiansDF.select("mean").show(truncate=False)

print("component covariance matrices")
model.gaussiansDF.select("cov").show(truncate=False)

Notice the mean vectors are very close to the k-means centroids.  

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) **K-Means: try different initialization mode**  
i. Copy the k-means example into the cell below  
ii. Change the initialization mode and observe whether the results change. Hint: use `setInitMode()`

2) **K-Means: Try different K**  
i. Copy the k-means example into the cell below  
ii. Rerun k-means with different values of k and observe the results

3) We considered a scree plot above, where the within sum of squared errors is measured for various values of k.  
Is it possible to reduce the within sum of squares to zero, and if so, how? Is this a good idea?

Answer: By setting k = n (the number of observations), each observation will be placed in its own cluster. This will drive the within sum of squares to zero. However, this doesn't provide useful information, as no clustering is taking place.
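
The answer can be checked against the definition of the within sum of squares. A short derivation, assuming the $n$ observations $x_1, \dots, x_n$ are distinct and writing $\mu_k$ for the centroid of cluster $C_k$ (notation introduced here for illustration): with $k = n$, each cluster is a single point, $C_k = \{x_k\}$, so its centroid is the point itself, $\mu_k = x_k$, and

$$ WSS = \sum_{k=1}^{n} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 = \sum_{k=1}^{n} \lVert x_k - x_k \rVert^2 = 0 $$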