Our task is to cluster the clients of a wholesale distributor based on their annual spending across several product categories.

Source of the data: http://archive.ics.uci.edu/ml/datasets/Wholesale+customers

The dataset has the following attributes (m.u. = monetary units):

1. **FRESH**: annual spending (m.u.) on fresh products (Continuous)
2. **MILK**: annual spending (m.u.) on milk products (Continuous)
3. **GROCERY**: annual spending (m.u.) on grocery products (Continuous)
4. **FROZEN**: annual spending (m.u.) on frozen products (Continuous)
5. **DETERGENTS_PAPER**: annual spending (m.u.) on detergents and paper products (Continuous)
6. **DELICATESSEN**: annual spending (m.u.) on delicatessen products (Continuous)
7. **CHANNEL**: customer channel, either Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
8. **REGION**: customer region, one of Lisbon, Oporto or Other (Nominal)


### Initialize and create a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("wholesale").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577730674847)
SparkSession available as 'spark'


2019-12-31 00:01:28 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3572494f


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Load the Wholesale Customers Data

In [3]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("Wholesale customers data.csv")

data: org.apache.spark.sql.DataFrame = [Channel: int, Region: int ... 6 more fields]


### Show

In [4]:
data.show(3)

+-------+------+-----+----+-------+------+----------------+----------+
|Channel|Region|Fresh|Milk|Grocery|Frozen|Detergents_Paper|Delicassen|
+-------+------+-----+----+-------+------+----------------+----------+
|      2|     3|12669|9656|   7561|   214|            2674|      1338|
|      2|     3| 7057|9810|   9568|  1762|            3293|      1776|
|      2|     3| 6353|8808|   7684|  2405|            3516|      7844|
+-------+------+-----+----+-------+------+----------------+----------+
only showing top 3 rows



### Count

In [5]:
data.count

res2: Long = 440


### Schema

In [6]:
data.printSchema

root
 |-- Channel: integer (nullable = true)
 |-- Region: integer (nullable = true)
 |-- Fresh: integer (nullable = true)
 |-- Milk: integer (nullable = true)
 |-- Grocery: integer (nullable = true)
 |-- Frozen: integer (nullable = true)
 |-- Detergents_Paper: integer (nullable = true)
 |-- Delicassen: integer (nullable = true)



### Filtering the Data

Since Channel and Region are nominal labels rather than spending on a product category, we drop them and cluster on the six spending columns only.

In [7]:
val feature_data = data.drop("Channel","Region")

feature_data: org.apache.spark.sql.DataFrame = [Fresh: int, Milk: int ... 4 more fields]


### Imports to setup ML for KMeans Algorithm

In [9]:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Create a VectorAssembler called assembler that takes the feature columns as input and writes an output column called features. There is no label column, since clustering is unsupervised

In [10]:
val assembler = new VectorAssembler().setInputCols(feature_data.columns).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_653e68f2402e


### Use the assembler to transform feature_data and call the result training_data

In [11]:
val training_data = assembler.transform(feature_data)

training_data: org.apache.spark.sql.DataFrame = [Fresh: int, Milk: int ... 5 more fields]


In [12]:
training_data.show(5)

+-----+----+-------+------+----------------+----------+--------------------+
|Fresh|Milk|Grocery|Frozen|Detergents_Paper|Delicassen|            features|
+-----+----+-------+------+----------------+----------+--------------------+
|12669|9656|   7561|   214|            2674|      1338|[12669.0,9656.0,7...|
| 7057|9810|   9568|  1762|            3293|      1776|[7057.0,9810.0,95...|
| 6353|8808|   7684|  2405|            3516|      7844|[6353.0,8808.0,76...|
|13265|1196|   4221|  6404|             507|      1788|[13265.0,1196.0,4...|
|22615|5410|   7198|  3915|            1777|      5185|[22615.0,5410.0,7...|
+-----+----+-------+------+----------------+----------+--------------------+
only showing top 5 rows
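
As a rough illustration of what the `features` column holds, the sketch below packs one row's six integer columns into a single vector of doubles, in column order. It is plain Scala and does not use Spark; the `AssembleSketch` object, the `CustomerRow` case class, and its field names are hypothetical names introduced only for this example.

```scala
// Plain-Scala sketch (no Spark): mimics the shape of the "features" column.
object AssembleSketch {
  // Hypothetical stand-in for one row of feature_data.
  final case class CustomerRow(
      fresh: Int,
      milk: Int,
      grocery: Int,
      frozen: Int,
      detergentsPaper: Int,
      delicassen: Int
  )

  // Pack the six columns, in order, into one vector of doubles.
  def assemble(r: CustomerRow): Vector[Double] =
    Vector(r.fresh, r.milk, r.grocery, r.frozen, r.detergentsPaper, r.delicassen)
      .map(_.toDouble)
}
```

Applied to the first row shown above, `AssembleSketch.assemble(AssembleSketch.CustomerRow(12669, 9656, 7561, 214, 2674, 1338))` yields the same six values that appear (truncated) in the `features` column.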



### Create a KMeans Model with K=3

In [13]:
val kmeans = new KMeans().setK(3).setSeed(1L)

kmeans: org.apache.spark.ml.clustering.KMeans = kmeans_550f26c4e862


### Fit that model to the training_data

In [14]:
val k_model = kmeans.fit(training_data)

2019-12-31 00:11:24 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 00:11:24 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


k_model: org.apache.spark.ml.clustering.KMeansModel = kmeans_550f26c4e862
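
For intuition about what fitting a k-means model involves, here is a minimal plain-Scala sketch of a single assign-then-update iteration (Lloyd's algorithm). It is not Spark's implementation, which runs distributed and chooses its own initial centres; the `LloydSketch` object and all of its members are hypothetical names used only for illustration.

```scala
// Plain-Scala sketch (no Spark) of one k-means iteration.
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centre closest to the point.
  def nearest(p: Point, centres: Seq[Point]): Int =
    centres.indices.minBy(i => squaredDistance(p, centres(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(group: Seq[Point]): Point =
    group.transpose.map(col => col.sum / col.size).toVector

  // One iteration: assign every point to its nearest centre, then move each
  // centre to the mean of its points. Empty clusters keep their old centre.
  def step(points: Seq[Point], centres: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centres))
    centres.indices.map(i => groups.get(i).map(mean).getOrElse(centres(i)))
  }
}
```

Repeating `step` until the centres stop moving (or an iteration limit is reached) gives the basic algorithm. With made-up points `(0,0)`, `(0,2)`, `(10,10)`, `(10,12)` and starting centres `(1,1)` and `(9,9)`, one step moves the centres to `(0,1)` and `(10,11)`.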


### Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE)

In [15]:
val WSSSE = k_model.computeCost(training_data)

WSSSE: Double = 8.095172370767671E10


In [16]:
println(s"Within Set Sum of Squared Errors = ${WSSSE}")

Within Set Sum of Squared Errors = 8.095172370767671E10
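
To make the metric concrete: WSSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster centre. The plain-Scala sketch below computes it by hand on a tiny made-up dataset. The `WssseSketch` object and its toy values are assumptions for illustration and are not part of Spark or of the wholesale data.

```scala
// Plain-Scala sketch (no Spark): recomputes WSSSE by hand on toy data.
object WssseSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum, over all points, of the squared distance to the nearest centre.
  def wssse(points: Seq[Point], centres: Seq[Point]): Double =
    points.map(p => centres.map(c => squaredDistance(p, c)).min).sum

  // Made-up data: two tight groups, each point exactly 1 unit from its centre.
  val toyPoints: Seq[Point] =
    Seq(Vector(0.0, 0.0), Vector(0.0, 2.0), Vector(10.0, 10.0), Vector(10.0, 12.0))
  val toyCentres: Seq[Point] =
    Seq(Vector(0.0, 1.0), Vector(10.0, 11.0))
}
```

Here `WssseSketch.wssse(WssseSketch.toyPoints, WssseSketch.toyCentres)` evaluates to `4.0`, since each of the four points contributes a squared distance of 1. Lower values mean tighter clusters, but the number depends on the scale of the features, so it is mainly useful for comparing different values of K on the same data.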


### Cluster Centres

In [19]:
println("Cluster Centres: ")
k_model.clusterCenters.foreach(println)

Cluster Centres: 
[7993.574780058651,4196.803519061584,5837.4926686217,2546.624633431085,2016.2873900293255,1151.4193548387098]
[9928.18918918919,21513.081081081084,30993.486486486487,2960.4324324324325,13996.594594594595,3772.3243243243246]
[35273.854838709674,5213.919354838709,5826.096774193548,6027.6612903225805,1006.9193548387096,2237.6290322580644]
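
The centres are easier to read when each value is paired with its column name. The sketch below is plain Scala with the printed centres copied in by hand and rounded to one decimal place. The `CentreSummary` object and its members are hypothetical names, and the column order is assumed to match `feature_data.columns`.

```scala
// Plain-Scala sketch (no Spark): labels the printed cluster centres.
object CentreSummary {
  // Assumed to match the column order of feature_data in the notebook.
  val featureNames: Seq[String] =
    Seq("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")

  // Centres copied from the output above, rounded to one decimal place.
  val centres: Seq[Seq[Double]] = Seq(
    Seq(7993.6, 4196.8, 5837.5, 2546.6, 2016.3, 1151.4),
    Seq(9928.2, 21513.1, 30993.5, 2960.4, 13996.6, 3772.3),
    Seq(35273.9, 5213.9, 5826.1, 6027.7, 1006.9, 2237.6)
  )

  // Name of the category with the highest average spend in a centre.
  def topCategory(centre: Seq[Double]): String =
    featureNames.zip(centre).maxBy(_._2)._1

  // (cluster index, top category) for every centre.
  def summary: Seq[(Int, String)] =
    centres.zipWithIndex.map { case (c, i) => (i, topCategory(c)) }
}
```

Read this way, the second centre is dominated by Grocery, Milk and Detergents_Paper spending, while the third is dominated by Fresh; the first has lower average spending across all categories, with Fresh still the largest.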


### Closing Spark Session

In [20]:
spark.stop()

## Thank You!