# Seeds group prediction

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

1. area A,
2. perimeter P,
3. compactness C = 4πA/P²,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.

The goal of this notebook is to cluster the kernels into 3 groups with K-means.

### Initialize and create a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("seeds").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577722198702)
SparkSession available as 'spark'


2019-12-30 21:40:14 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7bbc8e96


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Import statements to setup ML

In [3]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Using Spark to read in the wheat kernels data

In [4]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("seeds_dataset.csv")

data: org.apache.spark.sql.DataFrame = [area: double, perimeter: double ... 5 more fields]


### Printing the first row of the dataframe

In [5]:
val colnames = data.columns
val firstRow = data.head(1)(0)

colnames: Array[String] = Array(area, perimeter, compactness, length_of_kernel, width_of_kernel, asymmetry_coefficient, length_of_groove)
firstRow: org.apache.spark.sql.Row = [15.26,14.84,0.871,5.763,3.312,2.221,5.22]


In [6]:
for (i <- Range(0,colnames.size)){
    println(s"Column Name: ${colnames(i)}")
    println(s"Column Data: ${firstRow(i)}")
    println()
}

Column Name: area
Column Data: 15.26

Column Name: perimeter
Column Data: 14.84

Column Name: compactness
Column Data: 0.871

Column Name: length_of_kernel
Column Data: 5.763

Column Name: width_of_kernel
Column Data: 3.312

Column Name: asymmetry_coefficient
Column Data: 2.221

Column Name: length_of_groove
Column Data: 5.22
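
As a quick sanity check, the compactness formula from the attribute list, C = 4πA/P^2, can be applied to the area and perimeter of the first row printed above. The sketch below is plain Scala with no Spark dependency; the names `CompactnessCheck` and `compactness` are illustrative and not part of the dataset or of Spark.

```scala
object CompactnessCheck {
  // Compactness C = 4 * pi * A / P^2; it equals 1 for a perfect circle.
  def compactness(area: Double, perimeter: Double): Double =
    4 * math.Pi * area / (perimeter * perimeter)

  // Area and perimeter taken from the first row of the dataframe.
  val firstRowCompactness: Double = compactness(15.26, 14.84)
}
```

Rounded to three decimals, `CompactnessCheck.firstRowCompactness` agrees with the stored `compactness` value of 0.871 for that row, which suggests the column was derived from `area` and `perimeter` in this way.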



### Printing the schema of the dataframe

In [7]:
data.printSchema

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [8]:
data.describe().show()

+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|
+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|  count|               210|               210|                 210|                210|               210|                  210|                210|
|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|
| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|
|    min|             10.59|             12.41|              0.8081|              4.899|            

### Count

In [9]:
data.count()

res4: Long = 210


### Formatting the data

In [10]:
data.columns

res5: Array[String] = Array(area, perimeter, compactness, length_of_kernel, width_of_kernel, asymmetry_coefficient, length_of_groove)


#### Assembling all the feature columns into a single vector column "features"

In [13]:
val assembler = new VectorAssembler().setInputCols(data.columns).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_cb3f1e6dd6a0


In [14]:
val output = assembler.transform(data)

output: org.apache.spark.sql.DataFrame = [area: double, perimeter: double ... 6 more fields]


In [16]:
output.select("features").show(5,false)

+---------------------------------------------------------+
|features                                                 |
+---------------------------------------------------------+
|[15.26,14.84,0.871,5.763,3.312,2.221,5.22]               |
|[14.88,14.57,0.8811,5.553999999999999,3.333,1.018,4.956] |
|[14.29,14.09,0.905,5.291,3.3369999999999997,2.699,4.825] |
|[13.84,13.94,0.8955,5.324,3.3789999999999996,2.259,4.805]|
|[16.14,14.99,0.9034,5.6579999999999995,3.562,1.355,5.175]|
+---------------------------------------------------------+
only showing top 5 rows



### Scaling the Data

In [17]:
import org.apache.spark.ml.feature.StandardScaler

import org.apache.spark.ml.feature.StandardScaler


In [18]:
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")

scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_431b9a61f34f


#### Compute summary statistics by fitting the StandardScaler

In [19]:
val scaledModel = scaler.fit(output)

scaledModel: org.apache.spark.ml.feature.StandardScalerModel = stdScal_431b9a61f34f


#### Normalize each feature to have unit standard deviation.

In [20]:
val final_data = scaledModel.transform(output)

final_data: org.apache.spark.sql.DataFrame = [area: double, perimeter: double ... 7 more fields]


In [23]:
final_data.select("scaledFeatures").show(3,false)

+----------------------------------------------------------------------------------------------------------------------------------+
|scaledFeatures                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------+
|[5.244527953320284,11.363299389287777,36.860833906302894,13.007165541092315,8.76852883087142,1.4771618831975104,10.62097073949694]|
|[5.113930271651758,11.156554723849252,37.28826722714521,12.53544983779745,8.824126386864265,0.6770602418257837,10.08381819634997] |
|[4.911160186955888,10.789008651958541,38.29971835270278,11.94185543604363,8.834716397529569,1.7950742560783792,9.817276593500525] |
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows
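
With its default settings, `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, which is why the scaled values above are all positive. The sketch below reproduces that arithmetic in plain Scala for a single column; `ScalingSketch`, `sampleStd` and `scaleByStd` are hypothetical helper names, not Spark APIs.

```scala
object ScalingSketch {
  // Corrected (n - 1) sample standard deviation of one feature column.
  def sampleStd(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
  }

  // Divide every value by the column's standard deviation (no centring).
  def scaleByStd(xs: Seq[Double]): Seq[Double] = {
    val std = sampleStd(xs)
    xs.map(_ / std)
  }

  // First-row area divided by the area stddev reported by describe().
  val firstScaledArea: Double = 15.26 / 2.9096994306873647
}
```

`ScalingSketch.firstScaledArea` comes out at roughly 5.2445, matching the first component of the first `scaledFeatures` row.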



### Creating a K-means model, training and evaluating it

In [24]:
import org.apache.spark.ml.clustering.KMeans

import org.apache.spark.ml.clustering.KMeans


### Training a K-means model

In [25]:
// k is set to 3 because we already know that there are 3 groups (varieties) of wheat seeds
val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(3)

kmeans: org.apache.spark.ml.clustering.KMeans = kmeans_f6b942c34fdf


In [26]:
val k_model = kmeans.fit(final_data)

2019-12-30 22:43:43 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-30 22:43:43 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


k_model: org.apache.spark.ml.clustering.KMeansModel = kmeans_f6b942c34fdf
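
To make the fitting step less of a black box, here is a minimal sketch of the basic K-means (Lloyd's) iteration in plain Scala: assign every point to its nearest centre, then move each centre to the mean of its points, and repeat until nothing changes. It is only an illustration under simplifying assumptions (caller-supplied initial centres, a fixed iteration cap); Spark's `KMeans` uses its own initialisation and a distributed implementation, and the names `LloydSketch`, `nearest` and `kMeans` are made up for this example.

```scala
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centre closest to point p.
  def nearest(p: Point, centres: Seq[Point]): Int =
    centres.indices.minBy(i => sqDist(p, centres(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / col.size).toVector

  // Alternate assignment and update steps until the centres stop moving
  // or the iteration cap is reached.
  def kMeans(points: Seq[Point], init: Seq[Point], maxIter: Int = 20): Seq[Point] = {
    var centres = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centres))
      // A centre with no assigned points keeps its previous position.
      val updated = centres.indices.map(i => groups.get(i).map(mean).getOrElse(centres(i)))
      changed = updated != centres
      centres = updated
      iter += 1
    }
    centres
  }
}
```

For example, `LloydSketch.kMeans(Seq(Vector(0.0, 0.0), Vector(0.0, 2.0), Vector(10.0, 0.0), Vector(10.0, 2.0)), Seq(Vector(1.0, 1.0), Vector(9.0, 1.0)))` settles on the two centres `(0.0, 1.0)` and `(10.0, 1.0)`.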


### Evaluating clustering by computing Within Set Sum of Squared Errors.

In [27]:
val wssse = k_model.computeCost(final_data)

wssse: Double = 429.07559671506715


#### WSSSE

In [29]:
println("Within Set Sum of Squared Errors: " + wssse)

Within Set Sum of Squared Errors: 429.07559671506715
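
The WSSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster centre; lower values mean tighter clusters. A small plain-Scala sketch of that definition follows. `WssseSketch` and `wssse` are illustrative names, and the toy points in the usage note are invented for the example, not taken from the seeds data.

```scala
object WssseSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum of squared distances from every point to its closest centre.
  def wssse(points: Seq[Point], centres: Seq[Point]): Double =
    points.map(p => centres.map(c => sqDist(p, c)).min).sum
}
```

With points `(0,0)`, `(0,2)`, `(10,0)`, `(10,2)` and centres `(0,1)`, `(10,1)`, every point is at squared distance 1 from its centre, so `wssse` returns 4.0. Note that later Spark releases deprecate `computeCost` in favour of `ClusteringEvaluator`, so the call above may need adapting on versions newer than the 2.3.0 used here.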


#### Displaying the cluster centres

In [30]:
val centres = k_model.clusterCenters

centres: Array[org.apache.spark.ml.linalg.Vector] = Array([4.061059164337848,10.139795061513128,35.805369844737804,11.821330945345327,7.503959365003412,3.271847318221192,10.421260178583319], [6.316705461695198,12.371097591766675,37.39491395670191,13.911550623916485,9.748066995945349,2.3984996835040806,12.266174803060888], [4.8725765911787295,10.88120145832446,37.27692543209712,12.341015696873171,8.554434115254532,1.8164901104858355,10.329985983042679])


In [32]:
println("Cluster Centres:")
for (centre <- centres){
    println(centre)
}

Cluster Centres:
[4.061059164337848,10.139795061513128,35.805369844737804,11.821330945345327,7.503959365003412,3.271847318221192,10.421260178583319]
[6.316705461695198,12.371097591766675,37.39491395670191,13.911550623916485,9.748066995945349,2.3984996835040806,12.266174803060888]
[4.8725765911787295,10.88120145832446,37.27692543209712,12.341015696873171,8.554434115254532,1.8164901104858355,10.329985983042679]


### Displaying the Predictions (groups of wheat seeds)

In [33]:
val predictions = k_model.transform(final_data)

predictions: org.apache.spark.sql.DataFrame = [area: double, perimeter: double ... 8 more fields]


In [34]:
predictions.select("prediction").show(10)

+----------+
|prediction|
+----------+
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         1|
|         1|
+----------+
only showing top 10 rows
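
The `prediction` column is simply the index of the cluster centre nearest to each row's `scaledFeatures` vector. The plain-Scala sketch below checks this for the first row, using the cluster centres and the first scaled row from the outputs above, rounded to four decimals; `PredictionSketch` and `predict` are hypothetical names used only for this illustration.

```scala
object PredictionSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // The predicted cluster is the index of the closest centre.
  def predict(p: Point, centres: Seq[Point]): Int =
    centres.indices.minBy(i => sqDist(p, centres(i)))

  // Cluster centres from the notebook output, rounded to four decimals.
  val centres: Seq[Point] = Seq(
    Vector(4.0611, 10.1398, 35.8054, 11.8213, 7.5040, 3.2718, 10.4213),
    Vector(6.3167, 12.3711, 37.3949, 13.9116, 9.7481, 2.3985, 12.2662),
    Vector(4.8726, 10.8812, 37.2769, 12.3410, 8.5544, 1.8165, 10.3300)
  )

  // First row of scaledFeatures, rounded to four decimals.
  val firstRow: Point =
    Vector(5.2445, 11.3633, 36.8608, 13.0072, 8.7685, 1.4772, 10.6210)
}
```

`PredictionSketch.predict(PredictionSketch.firstRow, PredictionSketch.centres)` evaluates to 2, in line with the first value in the `prediction` column.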



In [36]:
predictions.groupBy("prediction").count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   70|
|         2|   75|
|         0|   65|
+----------+-----+



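Thus, 65 seeds belong to group 0, 70 to group 1 and 75 to group 2. Since each variety contributed 70 kernels, these cluster sizes are close to the true split, although the cluster indices are arbitrary and do not by themselves identify Kama, Rosa or Canadian.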

### Closing spark session

In [37]:
spark.stop()

## Thank You!