# **Clustering**
Spark MLlib provides a (limited) set of clustering algorithms:
- K-means
- Bisecting k-means
- Gaussian Mixture Model (GMM)

Each clustering algorithm has its own parameters. However, all the provided algorithms identify a set of groups of objects (clusters) and assign each input object to exactly one cluster. All the clustering algorithms available in Spark work only with numerical data.

The input of the MLlib clustering algorithms is a DataFrame containing a column called features of type Vector. The clustering algorithm clusters the input records by considering only the content of features.
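
As a rough illustration of what the features column holds, the plain-Python sketch below packs the numeric attributes of each record into a single list of floats, which is conceptually what the VectorAssembler used later in this notebook does. It does not use Spark; the sample rows, the `label` field, and the `assemble` helper are hypothetical and exist only for this example.

```python
# Hypothetical raw records: three numeric attributes plus a non-numeric one.
rows = [
    {"attr1": 0.5, "attr2": 0.9, "attr3": 1.0, "label": "a"},
    {"attr1": 0.6, "attr2": 0.6, "attr3": 0.7, "label": "b"},
]

# Only numeric columns can be part of the feature vector.
feature_columns = ["attr1", "attr2", "attr3"]


def assemble(row, columns):
    """Pack the selected numeric columns of one record into one vector."""
    return [float(row[c]) for c in columns]


# Each record keeps its original fields and gains a "features" entry.
with_features = [
    {**row, "features": assemble(row, feature_columns)} for row in rows
]

for row in with_features:
    print(row["features"])
```

A clustering algorithm would look only at the `features` entry of each record; the `label` field is carried along but never used.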

## **Clustering with MLlib**
1. Create a DataFrame with the features column
2. Define the clustering pipeline and run the fit() method on the input data to infer the clustering model (e.g., the centroids of the k-means algorithm)
    - This step returns a clustering model
3. Invoke the transform() method of the inferred clustering model on the input data to assign each input record to a cluster
    - This step returns a new DataFrame with the new column “prediction” in which the cluster identifier is stored for each input record
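
The fit/transform split described in the steps above can be sketched in plain Python. This is a simplified illustration of k-means (Lloyd's algorithm), not MLlib's implementation: the `fit` and `transform` functions, the naive "first k points" initialization, and the sample records are all assumptions made for the example.

```python
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def fit(points, k, iterations=10):
    """Infer the model, i.e. the list of k centroids."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # New centroid = mean of its group; keep the old one if the group is empty.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def transform(points, centroids):
    """Assign each record to one cluster: returns (point, prediction) pairs."""
    return [(p, nearest(p, centroids)) for p in points]


# Hypothetical records forming two well-separated groups.
records = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (8.8, 9.2)]

model = fit(records, k=2)             # step 2: infer the model
clustered = transform(records, model)  # step 3: add a prediction per record

for point, prediction in clustered:
    print(point, prediction)
```

As in MLlib, the model returned by `fit` is a separate object from the data, so the same centroids could be reused to assign cluster identifiers to records that were not part of the training input.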
    
## **K-Means**

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# input and output folders
inputData = "ex_datakmeans/dataClusteering.csv"
outputPath = "clusterskmeans/"

# Create a DataFrame from dataClusteering.csv
# Training data in raw format
inputDataDF = spark.read.load(inputData,
                              format="csv", header=True,
                              inferSchema=True)

In [None]:
# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"],
                            outputCol="features")

**Creation of the k-means estimator**

In [None]:
# Create a k-means object.
# KMeans is an Estimator that is used to
# infer a k-means clustering model from the input data
km = KMeans()
# Set the value of k ( = number of clusters)
km.setK(2)

In [None]:
# Define the pipeline
pipeline = Pipeline().setStages([assembler, km])

# Execute the pipeline
kmeansModel = pipeline.fit(inputDataDF)

In [None]:
# Now the clustering model can be applied to the input data
# to assign each record to a cluster (i.e., assign a cluster id).
# The returned DataFrame has the following schema (attributes):
# - original attributes attr1, attr2, attr3
# - features: vector (values of the attributes)
# - prediction: integer (the predicted cluster id)

clusteredDataDF = kmeansModel.transform(inputDataDF)

In [None]:
# Select only the original columns and the cluster id (prediction),
# renaming prediction to clusterID
clusteredData = clusteredDataDF\
    .select("attr1", "attr2", "attr3", "prediction")\
    .withColumnRenamed("prediction", "clusterID")

# Save the result in an HDFS output folder
clusteredData.write.csv(outputPath, header=True)