# K-Means

**Warning**: Contrary to what was said earlier, K-NN is a supervised classification algorithm. It relies on the "K" nearest neighbors to make a prediction.

K-Means, by contrast, is unsupervised: on its own, it finds the _K_ centroids that minimize the (squared) distances to the points of their respective clusters.
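To make the idea of "finding the centroids" concrete, here is a minimal pure-Python sketch of the classic Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is only an illustration and not Spark's implementation. The function names and the choice of initial centroids (the first and last points) are assumptions made for this example, and the six points are the ones from the sample dataset used in this notebook.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centroids, max_iter=20):
    """Illustrative Lloyd iteration; returns (centroids, assignments)."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # nothing moved, so the iteration has converged
        centroids = new_centroids
    return centroids, assignments


# The six 3-dimensional points of the sample dataset.
points = [[v, v, v] for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]

# Assumption: initialise with the first and last points (K = 2).
centroids, assignments = lloyd_kmeans(points, [points[0], points[-1]])
print(centroids)
print(assignments)
```

With this initialisation the first three points end up in one cluster and the last three in the other, with centroids close to 0.1 and 9.1 in every coordinate.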

In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('K-Means') \
    .getOrCreate()
sc = spark.sparkContext

In [3]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Loads data.
dataset = spark.read.format("libsvm").load("/data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.9997530305375207
Cluster Centers: 
[9.1 9.1 9.1]
[0.1 0.1 0.1]
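The silhouette value above can be cross-checked by hand. The sketch below is a plain-Python illustration, not the `ClusteringEvaluator` code: for each point it computes `a`, the mean squared Euclidean distance to the other points of its own cluster, and `b`, the smallest mean squared distance to any other cluster, then averages `(b - a) / max(a, b)`. The helper names and the hard-coded labels (first three points in one cluster, last three in the other) are assumptions for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using squared Euclidean distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        own = [
            squared_distance(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == label and j != i
        ]
        if not own:
            # Convention assumed here: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # Mean distance to each other cluster; keep the closest one.
        b = min(
            sum(squared_distance(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in set(labels)
            if c != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [[v, v, v] for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
labels = [0, 0, 0, 1, 1, 1]  # assumed cluster assignment

score = mean_silhouette(points, labels)
print(score)
```

On these six points the result should agree with the printed score to many decimal places (about 0.99975), which reflects two tight clusters that are far apart.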


In [7]:
??KMeans

Init signature:
KMeans(
    featuresCol='features',
    predictionCol='prediction',
    k=2,
    initMode='k-means||',
    initSteps=2,
    tol=0.0001,
    maxIter=20,
    seed=None,
    distanceMeasure='euclidean',
    weightCol=None,
)
Source:
class

In [6]:
dataset.show(20,50)

+-----+-------------------------+
|label|                 features|
+-----+-------------------------+
|  0.0|                (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0.1,0.1])|
|  2.0|(3,[0,1,2],[0.2,0.2,0.2])|
|  3.0|(3,[0,1,2],[9.0,9.0,9.0])|
|  4.0|(3,[0,1,2],[9.1,9.1,9.1])|
|  5.0|(3,[0,1,2],[9.2,9.2,9.2])|
+-----+-------------------------+
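The `features` column is printed in sparse vector notation, `(size, [indices], [values])`: only the non-zero entries are listed, so `(3,[],[])` is the all-zero vector of length 3. The small helper below is a hypothetical illustration written for this note (it is not part of PySpark) that expands such a triple into a dense list. In PySpark itself, `SparseVector.toArray()` serves the same purpose.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Three of the rows shown above, written as plain Python tuples.
rows = [
    (3, [], []),
    (3, [0, 1, 2], [0.1, 0.1, 0.1]),
    (3, [0, 1, 2], [9.0, 9.0, 9.0]),
]

dense_rows = [sparse_to_dense(*row) for row in rows]
for dense in dense_rows:
    print(dense)
```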

