## Package ml-clustering (org.apache.spark.ml.clustering)

This notebook presents the clustering library of Spark ML (Scala).

**Note:** The emphasis is on the overall approach rather than on finding the best model.

In [2]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel,
                                       BisectingKMeans, BisectingKMeansModel,
                                       GaussianMixture, GaussianMixtureModel
                                      }
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Load the data
val fromageDF = spark.read.option("header",  true)
                   .option("inferSchema",  true)
                   .option("delimiter", "\t").csv("../data/fromage.txt")

fromageDF.show(3)

val features = Array("calories", "sodium", "calcium", "lipides", "retinol",
                     "folates", "proteines", "cholesterol", "magnesium")

val featuresName = "features" 
val assembler = new VectorAssembler()
  .setInputCols(features)
  .setOutputCol(featuresName)

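// Silhouette evaluator; by default it reads the "features" and "prediction"
// columns and uses the squared Euclidean distance.
val evaluatorSilhouette = new ClusteringEvaluator()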

+-----------+--------+------+-------+-------+-------+-------+---------+-----------+---------+
|   Fromages|calories|sodium|calcium|lipides|retinol|folates|proteines|cholesterol|magnesium|
+-----------+--------+------+-------+-------+-------+-------+---------+-----------+---------+
|CarredelEst|     314| 353.5|   72.6|   26.3|   51.6|   30.3|     21.0|         70|       20|
|    Babybel|     314| 238.0|  209.8|   25.1|   63.7|    6.4|     22.6|         70|       27|
|   Beaufort|     401| 112.0|  259.4|   33.3|   54.9|    1.2|     26.6|        120|       41|
+-----------+--------+------+-------+-------+-------+-------+---------+-----------+---------+
only showing top 3 rows



fromageDF: org.apache.spark.sql.DataFrame = [Fromages: string, calories: int ... 8 more fields]
features: Array[String] = Array(calories, sodium, calcium, lipides, retinol, folates, proteines, cholesterol, magnesium)
featuresName: String = features
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_9c85ca4e2d24, handleInvalid=error, numInputCols=9
evaluatorSilhouette: org.apache.spark.ml.evaluation.ClusteringEvaluator = ClusteringEvaluator: uid=...


## 1. K-means

The goal of clustering is to minimize the within-cluster inertia for a fixed number of clusters k. The principle is as follows:

* Fix the number of clusters k
* Initialize the cluster centers
* Move observations from one cluster to another to obtain a better partition
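
As a reminder of the criterion mentioned above, the within-cluster inertia can be written as follows. The notation ($C_j$, $\mu_j$, $x_i$, $I_W$) is chosen here for illustration and is not taken from the Spark documentation; some texts also divide the sum by the number of observations.

```latex
I_W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Here $C_j$ is the set of observations assigned to cluster $j$ and $\mu_j$ is its center.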

Algorithms of this type often differ in:

* how the cluster centers are initialized
* how the cluster centers are updated at each iteration
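
The steps above can be sketched in plain Scala, without Spark. This is a minimal illustration of the classic Lloyd iteration, assuming the initial centers are supplied by the caller; the names `LloydSketch`, `lloyd`, `closest` and `mean` are hypothetical and this is not Spark's implementation.

```scala
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p (first one wins on ties).
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Alternate assignment and center update until the centers stop moving
  // or maxIter is reached. A center that attracts no point is left unchanged.
  def lloyd(points: Seq[Point], init: Seq[Point], maxIter: Int = 20): Seq[Point] = {
    var centers = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => closest(p, centers))
      val updated = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
      changed = updated != centers
      centers = updated
      iter += 1
    }
    centers
  }
}
```

Calling `LloydSketch.lloyd(points, init)` returns the final centers. The two points of variation listed above correspond to the `init` argument and to the `mean` update; Spark's `KMeans` uses the `k-means||` initialization by default instead of hand-picked centers.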

In [3]:
val k = 4

val kmeans = new KMeans().setK(k).setSeed(123456)
   .setFeaturesCol(featuresName)

val pipeline = new Pipeline().setStages(Array(assembler, kmeans))

val model = pipeline.fit(fromageDF)

val predictions = model.transform(fromageDF)

val silhouette = evaluatorSilhouette.evaluate(predictions)

println(f"Silhouette with squared euclidean distance = $silhouette%.3f at k=$k")

// Show the cluster centers of the fitted k-means model.
val kmeanModel = model.stages.last.asInstanceOf[KMeansModel]
println("Cluster Centers: ")

kmeanModel.clusterCenters.foreach(println)

Silhouette with squared euclidean distance = 0,513 at k=4
Cluster Centers: 
[363.875,146.125,257.02500000000003,29.05,63.6,3.8625,26.562500000000004,96.25,38.875]
[101.75,44.75,133.75,6.275,55.15,16.475,7.200000000000001,18.25,11.25]
[286.0,191.33333333333331,79.73333333333332,24.0,101.39999999999999,29.46666666666667,17.03333333333333,70.0,21.666666666666664]
[323.2142857142857,297.8928571428571,182.56428571428572,26.507142857142856,66.12142857142857,13.721428571428572,20.892857142857142,79.28571428571428,25.785714285714285]


k: Int = 4
kmeans: org.apache.spark.ml.clustering.KMeans = kmeans_bc0b552aedab
pipeline: org.apache.spark.ml.Pipeline = pipeline_db418a854394
model: org.apache.spark.ml.PipelineModel = pipeline_db418a854394
predictions: org.apache.spark.sql.DataFrame = [Fromages: string, calories: int ... 10 more fields]
silhouette: Double = 0.5126405368943555
kmeanModel: org.apache.spark.ml.clustering.KMeansModel = KMeansModel: uid=kmeans_bc0b552aedab, k=4, distanceMeasure=euclidean, numFeatures=9
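
To make the reported score easier to read, here is a small plain-Scala sketch of the standard silhouette coefficient with squared Euclidean distance. It follows the textbook definition and is not the distributed computation performed by `ClusteringEvaluator`; the names `SilhouetteSketch` and `silhouette` are hypothetical.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // labeled: (point, cluster id) pairs; at least two clusters are required.
  def silhouette(labeled: Seq[(Point, Int)]): Double = {
    val clusters: Map[Int, Seq[Point]] =
      labeled.groupBy(_._2).map { case (c, ps) => c -> ps.map(_._1) }
    require(clusters.size >= 2, "silhouette needs at least two clusters")

    val scores = labeled.map { case (p, c) =>
      val own = clusters(c)
      if (own.size == 1) 0.0 // usual convention for singleton clusters
      else {
        // a: mean distance to the other members of the point's own cluster
        // (the distance from p to itself is 0, so it does not affect the sum)
        val a = own.map(sqDist(p, _)).sum / (own.size - 1)
        // b: smallest mean distance to the members of any other cluster
        val b = (clusters - c).values.map(ps => ps.map(sqDist(p, _)).sum / ps.size).min
        val denom = math.max(a, b)
        if (denom == 0.0) 0.0 else (b - a) / denom
      }
    }
    // Average of the per-point scores, each of which lies in [-1, 1].
    scores.sum / scores.size
  }
}
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit closer to another cluster than to their own.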


## 2. Bisecting k-means

**Bisecting k-means** is a hybrid approach between hierarchical clustering and k-means. The principle is as follows:
* 1. Choose a cluster
* 2. Split the cluster into two sub-clusters using k-means (the bisecting step)
* 3. Repeat step 2, the bisecting step, to find the best split
* 4. Repeat steps 1, 2 and 3 until the stopping criterion is reached (here, the requested number of clusters k)
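
The procedure can be sketched in plain Scala as follows. This is a simplified textbook variant under stated assumptions: the cluster to split is the one with the largest within-cluster sum of squares, and each bisecting step is run once with a deterministic seeding instead of several trials. Spark's `BisectingKMeans` differs in its details, and the names `BisectingSketch`, `bisect` and `cluster` are hypothetical.

```scala
object BisectingSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Coordinate-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / col.size).toVector

  // Within-cluster sum of squared distances to the centroid.
  def sse(ps: Seq[Point]): Double = {
    val m = mean(ps)
    ps.map(sqDist(_, m)).sum
  }

  // Bisecting step: 2-means on one cluster, seeded with the point farthest
  // from the centroid and the point farthest from that first seed.
  def bisect(ps: Seq[Point], maxIter: Int = 20): (Seq[Point], Seq[Point]) = {
    val m = mean(ps)
    val seed1 = ps.maxBy(sqDist(_, m))
    val seed2 = ps.maxBy(sqDist(_, seed1))
    def assign(c1: Point, c2: Point) =
      ps.partition(p => sqDist(p, c1) <= sqDist(p, c2))
    var parts = assign(seed1, seed2)
    var changed = true
    var iter = 0
    while (changed && iter < maxIter && parts._1.nonEmpty && parts._2.nonEmpty) {
      val next = assign(mean(parts._1), mean(parts._2))
      changed = next != parts
      parts = next
      iter += 1
    }
    parts
  }

  // Split the cluster with the largest SSE until k clusters exist,
  // or stop early when the chosen cluster cannot be split any further.
  def cluster(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = Seq(points)
    var done = false
    while (clusters.size < k && !done) {
      val target = clusters.maxBy(sse)
      val (left, right) = bisect(target)
      if (left.isEmpty || right.isEmpty) done = true
      else clusters = clusters.filterNot(_ eq target) ++ Seq(left, right)
    }
    clusters
  }
}
```

`BisectingSketch.cluster(points, k)` returns the groups of points rather than their centers; applying `mean` to each group gives the equivalent of `clusterCenters`.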

In [4]:
val k = 4

val bkm = new BisectingKMeans().setK(k).setSeed(123456)
   .setFeaturesCol(featuresName)

val pipeline = new Pipeline().setStages(Array(assembler, bkm))

val model = pipeline.fit(fromageDF)

val predictions = model.transform(fromageDF)

val silhouette = evaluatorSilhouette.evaluate(predictions)

println(f"Silhouette with squared euclidean distance = $silhouette%.3f at k=$k")

// Show the cluster centers of the fitted bisecting k-means model.
val bkmModel = model.stages.last.asInstanceOf[BisectingKMeansModel]
println("Cluster Centers: ")

bkmModel.clusterCenters.foreach(println)

Silhouette with squared euclidean distance = 0,499 at k=4
Cluster Centers: 
[112.33333333333333,29.333333333333332,106.43333333333334,7.233333333333334,59.23333333333333,21.0,8.233333333333334,20.0,10.333333333333332]
[138.0,125.5,144.25,10.95,96.7,16.95,7.6,31.5,15.0]
[364.22222222222223,158.33333333333331,257.8,29.02222222222222,61.955555555555556,4.066666666666666,26.166666666666668,95.55555555555556,37.888888888888886]
[320.6666666666667,288.56666666666666,163.88666666666668,26.386666666666667,68.70666666666668,16.25333333333333,20.633333333333333,78.66666666666667,25.333333333333332]


k: Int = 4
bkm: org.apache.spark.ml.clustering.BisectingKMeans = bisecting-kmeans_81e71c3d6183
pipeline: org.apache.spark.ml.Pipeline = pipeline_76b474d876fb
model: org.apache.spark.ml.PipelineModel = pipeline_76b474d876fb
predictions: org.apache.spark.sql.DataFrame = [Fromages: string, calories: int ... 10 more fields]
silhouette: Double = 0.4985007380025358
bkmModel: org.apache.spark.ml.clustering.BisectingKMeansModel = BisectingKMeansModel: uid=bisecting-kmeans_81e71c3d6183, k=4, distanceMeasure=euclidean, numFeatures=9


## 3. Gaussian mixture model

The GMM algorithm looks for the mixture of multivariate Gaussian distributions that best fits the shape of the data.

The log-likelihood is maximized with the EM (expectation-maximization) algorithm.
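
For reference, the quantities involved can be written as follows. This is the standard textbook formulation of EM for a Gaussian mixture; the symbols ($\pi_j$, $\mu_j$, $\Sigma_j$, $\gamma_{ij}$, $N_j$) are notation chosen here, not taken from the Spark documentation.

```latex
% Mixture density and log-likelihood for observations x_1, ..., x_n
p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad \sum_{j=1}^{k} \pi_j = 1

\ell(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)

% E step: responsibility of component j for observation x_i
\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                   {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}

% M step: update the parameters from the responsibilities
N_j = \sum_{i=1}^{n} \gamma_{ij}, \qquad
\pi_j = \frac{N_j}{n}, \qquad
\mu_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma_{ij} \, x_i, \qquad
\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^{\top}
```

In the code of this section, $\pi_j$ corresponds to `gmmModel.weights(j)`, $\mu_j$ to `gmmModel.gaussians(j).mean` and $\Sigma_j$ to `gmmModel.gaussians(j).cov`.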

In [5]:
val k = 4

val gmm = new GaussianMixture().setK(k).setSeed(123456)
   .setFeaturesCol(featuresName)

val pipeline = new Pipeline().setStages(Array(assembler, gmm))

val model = pipeline.fit(fromageDF)

val predictions = model.transform(fromageDF)

val silhouette = evaluatorSilhouette.evaluate(predictions)

println(f"Silhouette with squared euclidean distance = $silhouette%.3f at k=$k")

val gmmModel = model.stages.last.asInstanceOf[GaussianMixtureModel]

// Print the parameters (weight, mean, covariance) of each mixture component
for (i <- 0 until gmmModel.getK) {
  println(s"Gaussien $i:\nweight=${gmmModel.weights(i)}\n" +
      s"mu=${gmmModel.gaussians(i).mean}\nsigma=\n${gmmModel.gaussians(i).cov}\n")
}

Silhouette with squared euclidean distance = 0,056 at k=4
Gaussien 0:
weight=0.4972488239933205
mu=[323.3885419261037,249.99106437969996,189.78936892677967,26.19233365621317,64.85417062236904,9.376383115261243,21.339066622482484,80.13139463947476,27.024582485889074]
sigma=
2809.249047261747   1005.9614885561348   739.1925649447539    ... (9 total)
1005.9614885561348  6653.510795957366    -549.7649863678694   ...
739.1925649447539   -549.7649863678694   3495.8884292571397   ...
237.75869947736922  122.55107271569388   28.812380154832635   ...
-88.13771981384471  201.84776816429016   -275.78701213885006  ...
-93.11499503606204  76.69577605652565    -271.64709530134564  ...
173.00423529070073  -38.03440152088525   113.04660020556331   ...
834.1777001223265   264.32612259691683   148.3184912458894    ...
229.07847130460618  -114.07287351890987  363.44120248619595   ...

Gaussien 1:
weight=0.12532016894888748
mu=[296.09151352949164,217.08975291558093,180.21800646266468,24.088677159666766,76

k: Int = 4
gmm: org.apache.spark.ml.clustering.GaussianMixture = GaussianMixture_d6f426357d69
pipeline: org.apache.spark.ml.Pipeline = pipeline_0b770bbc4b74
model: org.apache.spark.ml.PipelineModel = pipeline_0b770bbc4b74
predictions: org.apache.spark.sql.DataFrame = [Fromages: string, calories: int ... 11 more fields]
silhouette: Double = 0.056150879368063754
gmmModel: org.apache.spark.ml.clustering.GaussianMixtureModel = GaussianMixtureModel: uid=GaussianMixture_d6f426357d69, k=4, numFeatures=9


Sources:   
[Spark documentation](https://spark.apache.org/docs/3.0.0/ml-clustering.html)   
[WikiStat](https://github.com/wikistat/)  