# Seeds group prediction

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.


Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured: 
1. area A, 
2. perimeter P, 
3. compactness C = 4*pi*A/P^2, 
4. length of kernel, 
5. width of kernel, 
6. asymmetry coefficient, 
7. length of kernel groove. 

All of these parameters are real-valued and continuous.
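
As a quick sanity check on the compactness definition, here is a minimal sketch that recomputes C from area and perimeter. The function name `compactness` is illustrative and not part of the dataset tooling. A perfect circle should give C = 1, and the first kernel in the data (area 15.26, perimeter 14.84) should come out at roughly 0.871.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2; equals 1 for a perfect circle."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 2: A = pi*r^2, P = 2*pi*r
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# First kernel in the dataset: area 15.26, perimeter 14.84
kernel_c = compactness(15.26, 14.84)

print(round(circle_c, 6), round(kernel_c, 3))
```

Values closer to 1 indicate a rounder kernel outline.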

Clustering them into 3 groups with K-means!

In [38]:
# Initialize pyspark
import findspark
findspark.init()
import pyspark

In [39]:
# Initialize and create a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('seeds').getOrCreate()

In [40]:
# Using Spark to read in the wheat kernels data
data = spark.read.csv('seeds_dataset.csv', header=True, inferSchema=True)

In [41]:
# Printing the first row of the dataframe
data.head()

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)

In [42]:
# Printing the schema of the dataframe
data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [43]:
data.describe().show()

+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|
+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|  count|               210|               210|                 210|                210|               210|                  210|                210|
|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|
| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|
|    min|             10.59|             12.41|              0.8081|              4.899|            

### **Formatting the data**

In [44]:
# Import statements to setup ML
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

In [45]:
data.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [46]:
# Assembling all the input feature columns into a single vector column "features"

assembler = VectorAssembler(inputCols=data.columns, outputCol='features')

In [47]:
output = assembler.transform(data)

In [48]:
output.select('features').show(3, truncate=False)

+--------------------------------------------------------+
|features                                                |
+--------------------------------------------------------+
|[15.26,14.84,0.871,5.763,3.312,2.221,5.22]              |
|[14.88,14.57,0.8811,5.553999999999999,3.333,1.018,4.956]|
|[14.29,14.09,0.905,5.291,3.3369999999999997,2.699,4.825]|
+--------------------------------------------------------+
only showing top 3 rows



## Scaling the Data

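It is a good idea to scale the data before clustering: K-means relies on Euclidean distances, so features with larger numeric ranges (such as area) would otherwise dominate features with smaller ranges (such as compactness).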

In [49]:
from pyspark.ml.feature import StandardScaler

In [50]:
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')

In [51]:
# Compute summary statistics by fitting the StandardScaler
scaled_model = scaler.fit(output)

In [52]:
# Normalize each feature to have unit standard deviation.
final_data = scaled_model.transform(output)

In [53]:
final_data.select('features','scaled_features').show(3)

+--------------------+--------------------+
|            features|     scaled_features|
+--------------------+--------------------+
|[15.26,14.84,0.87...|[5.24452795332028...|
|[14.88,14.57,0.88...|[5.11393027165175...|
|[14.29,14.09,0.90...|[4.91116018695588...|
+--------------------+--------------------+
only showing top 3 rows



In [54]:
final_data.select('scaled_features').show(3, truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+
|scaled_features                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------+
|[5.244527953320284,11.363299389287777,36.860833906302894,13.007165541092315,8.76852883087142,1.4771618831975104,10.62097073949694]|
|[5.113930271651758,11.156554723849252,37.28826722714521,12.53544983779745,8.824126386864265,0.6770602418257837,10.08381819634997] |
|[4.911160186955888,10.789008651958541,38.29971835270278,11.94185543604363,8.834716397529569,1.7950742560783792,9.817276593500525] |
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows
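
The scaled values above can be checked by hand. The sketch below assumes the `StandardScaler` defaults (divide by the standard deviation, no mean centering) and that the sample (n-1) standard deviation is used, which is consistent with the `stddev` row printed by `describe()`. The helper name `scale_to_unit_std` and the toy column are illustrative only.

```python
import statistics

def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # sample (n-1) standard deviation
    return [x / std for x in column]

# Toy column, not taken from the dataset
toy_area = [10.0, 12.0, 14.0, 16.0, 18.0]
scaled = scale_to_unit_std(toy_area)

# Check against the notebook: first area value divided by the stddev from describe()
first_scaled_area = 15.26 / 2.9096994306873647

print(statistics.stdev(scaled), first_scaled_area)
```

The second printed value should agree with the first entry of the first `scaled_features` row, about 5.2445.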



__Creating a K-means model, training and evaluating it__

In [55]:
from pyspark.ml.clustering import KMeans

In [56]:
# Train a k-means model.
# Here the value of k is 3, since we already know that there are 3 groups of wheat seeds

kmeans = KMeans(featuresCol='scaled_features', k=3)

In [57]:
model = kmeans.fit(final_data)

Evaluating the clustering by computing the Within Set Sum of Squared Errors (WSSSE).

In [58]:
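# Note: computeCost is deprecated since Spark 2.4; newer releases expose the
# training cost as model.summary.trainingCost instead.
wssse = model.computeCost(final_data)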

In [59]:
print("Within Set Sum of Squared Errors:",wssse)

Within Set Sum of Squared Errors: 428.60820118716356
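
To make the metric concrete, here is a minimal pure-Python sketch of what WSSSE measures: for every point, take the squared Euclidean distance to its nearest cluster centre, then sum over all points. The function names and the toy points are illustrative and are not Spark's implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, centres):
    """Sum, over all points, of the squared distance to the nearest centre."""
    return sum(min(squared_distance(p, c) for c in centres) for p in points)

# Toy data: two tight pairs of points, each pair centred on one centre
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_centres = [(0.0, 1.0), (10.0, 1.0)]

toy_cost = wssse(toy_points, toy_centres)
print(toy_cost)
```

Each toy point lies at distance 1 from its nearest centre, so the total is 4.0. Lower values mean tighter clusters, but the number is only comparable between models fitted on the same scaled data.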


Displaying the cluster centres

In [60]:
centres = model.clusterCenters()

In [61]:
print(centres)

[array([ 4.96198582, 10.97871333, 37.30930808, 12.44647267,  8.62880781,
        1.80061978, 10.41913733]), array([ 6.35645488, 12.40730852, 37.41990178, 13.93860446,  9.7892399 ,
        2.41585013, 12.29286107]), array([ 4.07497225, 10.14410142, 35.89816849, 11.80812742,  7.54416916,
        3.15410901, 10.38031464])]


In [62]:
print("Cluster Centres:")
for centre in centres:
    print(centre)

Cluster Centres:
[ 4.96198582 10.97871333 37.30930808 12.44647267  8.62880781  1.80061978
 10.41913733]
[ 6.35645488 12.40730852 37.41990178 13.93860446  9.7892399   2.41585013
 12.29286107]
[ 4.07497225 10.14410142 35.89816849 11.80812742  7.54416916  3.15410901
 10.38031464]


***Displaying the Predictions (groups of wheat seeds)***

In [63]:
predictions = model.transform(final_data)

In [64]:
predictions.select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         1|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         2|
+----------+
only showing top 20 rows
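
The prediction for a row is the index of the nearest cluster centre. As a sketch, the following pure-Python snippet reproduces that assignment for the first two rows, using the scaled feature values and cluster centres printed earlier in this notebook. The helper names are illustrative, and the centre indices depend on this particular fitted model.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_cluster(point, centres):
    """Return the index of the centre closest to the point."""
    distances = [squared_distance(point, c) for c in centres]
    return distances.index(min(distances))

# Cluster centres as printed by model.clusterCenters() above
centres = [
    [4.96198582, 10.97871333, 37.30930808, 12.44647267, 8.62880781, 1.80061978, 10.41913733],
    [6.35645488, 12.40730852, 37.41990178, 13.93860446, 9.7892399, 2.41585013, 12.29286107],
    [4.07497225, 10.14410142, 35.89816849, 11.80812742, 7.54416916, 3.15410901, 10.38031464],
]

# First two rows of scaled_features as printed above
first_row = [5.244527953320284, 11.363299389287777, 36.860833906302894,
             13.007165541092315, 8.76852883087142, 1.4771618831975104, 10.62097073949694]
second_row = [5.113930271651758, 11.156554723849252, 37.28826722714521,
              12.53544983779745, 8.824126386864265, 0.6770602418257837, 10.08381819634997]

print(assign_cluster(first_row, centres), assign_cluster(second_row, centres))
```

Both rows land in cluster 0, matching the first two entries of the `prediction` column.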



In [65]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   67|
|         2|   72|
|         0|   71|
+----------+-----+



###### Thus, 71 seeds belong to group 0, 67 seeds belong to group 1 and 72 seeds belong to group 2!

In [None]:
# Closing the Spark session
spark.stop()