# Clustering Analysis with KMeans: Seeds Dataset
## About the Data

This seeds dataset comes from the University of California, Irvine (UCI) Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds

`Abstract: Measurements of geometrical properties of kernels belonging to three different varieties of wheat. A soft X-ray technique and GRAINS package were used to construct all seven, real-valued attributes.`

In essence, the experiment images wheat kernels and records their geometric features.

We will cluster the data with k=3, one cluster for each of the three wheat varieties:
- Kama
- Rosa
- Canadian

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('clustering').getOrCreate()

In [3]:
dataset = spark.read.csv('datasets/seeds_dataset.csv', inferSchema=True, header=True)

In [5]:
dataset.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [6]:
dataset.head(1)

[Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)]

In [7]:
dataset.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

## Feature Selection

In [20]:
from pyspark.ml.feature import VectorAssembler

In [21]:
assembler = VectorAssembler(inputCols=dataset.columns, outputCol='features')

In [29]:
final_data = assembler.transform(dataset)

In [30]:
final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)
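
Conceptually, the assembler concatenates the chosen columns of each row, in order, into a single vector. The sketch below mimics that for the first row of the dataset using plain Python rather than Spark; the `assemble` helper is a hypothetical stand-in for illustration, not part of the pyspark API.

```python
# Conceptual sketch of what VectorAssembler does for a single row.
# Plain Python stands in for Spark here; nothing below calls pyspark.

def assemble(row, input_cols):
    """Concatenate the named columns of `row`, in order, into one list."""
    return [float(row[col]) for col in input_cols]

# First row of the dataset, as returned by dataset.head(1) earlier.
row = {
    'area': 15.26, 'perimeter': 14.84, 'compactness': 0.871,
    'length_of_kernel': 5.763, 'width_of_kernel': 3.312,
    'asymmetry_coefficient': 2.221, 'length_of_groove': 5.22,
}
input_cols = list(row.keys())  # same order as dataset.columns

features = assemble(row, input_cols)
print(features)
```

Because the dataset has no label column, passing every column as an input is reasonable here.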



## Feature Standardization

We will scale the data using `StandardScaler`, so that each feature has zero mean and unit standard deviation.

In [14]:
from pyspark.ml.feature import StandardScaler

In [31]:
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures', withMean=True)

In [32]:
# Fit the scaler: compute each feature's mean and standard deviation
scaler_model = scaler.fit(final_data)

In [33]:
# Standardize the data using the fitted mean and standard deviation
final_data = scaler_model.transform(final_data)

In [34]:
final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [35]:
final_data.head(1)

[Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22, features=DenseVector([15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]), scaledFeatures=DenseVector([0.1418, 0.2149, 0.0001, 0.3035, 0.1414, -0.9838, -0.3827]))]
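
To make the scaling step concrete, here is a minimal plain-Python sketch of standardizing one column. It assumes the sample standard deviation (dividing by n - 1), which is what Spark's `StandardScaler` documents using; the `standardize` helper and the toy values are made up for illustration.

```python
import statistics

def standardize(column, with_mean=True):
    """Return z-scores for one feature column.

    Divides by the sample standard deviation (n - 1 in the denominator).
    When with_mean is False, values are scaled but not centered.
    """
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample standard deviation
    shift = mean if with_mean else 0.0
    return [(x - shift) / std for x in column]

toy_column = [2.0, 4.0, 6.0]  # made-up values, not from the seeds data
scaled = standardize(toy_column)
scaled_only = standardize(toy_column, with_mean=False)
print(scaled)
print(scaled_only)
```

Centering matters for reading the results: with `withMean=True`, a scaled value near 0 means the kernel is close to the dataset average for that feature.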

## Train the Model

In [36]:
from pyspark.ml.clustering import KMeans

In [37]:
kmeans = KMeans(k=3, featuresCol='scaledFeatures')

In [38]:
model = kmeans.fit(final_data)
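
For intuition about what fitting does, the following is a minimal sketch of the classic Lloyd iteration behind k-means: assign each point to its nearest center, then move each center to the mean of its points. It is not Spark's implementation. In particular, the fixed starting centers and the toy 2-D points are assumptions made for illustration, whereas Spark chooses initial centers itself.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def lloyd(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # Mean of the member points, one dimension at a time.
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(list(old))  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    return centers, [nearest(p, centers) for p in points]

# Toy data: three well-separated pairs of points (made up).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]]
initial_centers = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]

final_centers, labels = lloyd(points, initial_centers)
print(final_centers)
print(labels)
```

Since the result depends on the starting centers, different runs can number the clusters differently; the cluster ids carry no meaning beyond grouping.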

## Evaluate the Model

In [39]:
from pyspark.ml.evaluation import ClusteringEvaluator

In [40]:
model_preds = model.transform(final_data)

In [44]:
model_preds.printSchema()
model_preds.select('scaledFeatures', 'prediction').show()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)
 |-- prediction: integer (nullable = false)

+--------------------+----------+
|      scaledFeatures|prediction|
+--------------------+----------+
|[0.14175903742014...|         0|
|[0.01116135575161...|         0|
|[-0.1916087289442...|         0|
|[-0.3462638782885...|         0|
|[0.44419577391567...|         0|
|[-0.1606776990753...|         0|
|[-0.0541374850826...|         0|
|[-0.2534707886819...|         0|
|[0.61259804764614...|         2|
|[0.54729920681188...|         0|
|[0.14175903742014...|         0|
|[-0.2809650374542...|         0|
|[-0.3290799728058...|         0|
|[-0

In [45]:
evaluator = ClusteringEvaluator(featuresCol='scaledFeatures', predictionCol='prediction')

In [46]:
print('Silhouette metric: {}'.format(evaluator.evaluate(model_preds)))

Silhouette metric: 0.5928460631863557
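
The silhouette score ranges from -1 to 1; values near 1 indicate tight, well-separated clusters. The sketch below computes it directly from the definition using squared Euclidean distance, which is the default distance measure of `ClusteringEvaluator`. It is a brute-force illustration on made-up 1-D points, not how Spark computes the metric internally, and it assumes at least two clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so divide by n - 1).
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(squared_distance(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 1-D data (made up): two obvious groups.
points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette(points, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

The sensible grouping scores close to 1 and the mixed one scores below 0, which gives a rough scale for reading the value reported above.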


In [47]:
centers = model.clusterCenters()
print('Cluster Centers: {}'.format(centers))

Cluster Centers: [array([-0.14078309, -0.16963724,  0.44853463, -0.25719987,  0.00164301,
       -0.66034122, -0.58449646]), array([-1.02779666, -1.00424915, -0.96260496, -0.89554512, -1.08299564,
        0.693148  , -0.62331915]), array([ 1.25368596,  1.25895795,  0.55912833,  1.23493193,  1.1620751 ,
       -0.04511088,  1.28922727])]
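
These centers are expressed in standardized units, because the model was trained on `scaledFeatures`. To read them in the original measurement units, the scaling can be inverted as x = z * std + mean. The sketch below shows the arithmetic with made-up numbers; in this notebook the real per-feature values would presumably come from `scaler_model.mean` and `scaler_model.std`, which is an assumption about how you would wire it up rather than something run here.

```python
def unscale(center, mean, std):
    """Invert z = (x - mean) / std for every coordinate of a center."""
    return [z * s + m for z, m, s in zip(center, mean, std)]

# Hypothetical two-feature example; these are NOT the seeds statistics.
mean = [10.0, 5.0]
std = [2.0, 0.5]
scaled_center = [1.5, -1.0]

original_center = unscale(scaled_center, mean, std)
print(original_center)
```

Read this way, a positive coordinate in a scaled center means the cluster sits above the dataset average for that feature, and a negative one means below.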
