# Customer segmentation with K-means clustering

### Importing MLlib libraries 

In [3]:
import org.apache.spark.mllib.clustering._
import org.apache.spark.mllib.linalg._
import org.apache.spark.rdd._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.sql.functions._
// sqlContext is not a stable identifier in this notebook kernel,
// so bind it to a val before importing its implicits.
val sqlCtx = sqlContext
import sqlCtx.implicits._

We use the Wholesale customers dataset from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. You can download it from https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#.

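The dataset contains 440 customers (observations) of a wholesale distributor. For each customer it records the annual spend, in monetary units, on six product categories: Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen (spelled `Delicassen` in the file header). The file also has two leading columns, Channel and Region, which are dropped before clustering.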

### Reading and pre-processing the data

In [2]:
val data = sc.textFile("data/customer.csv")
data.take(5).foreach(println)
data.count() // counts every line, including the header row

Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
2,3,6353,8808,7684,2405,3516,7844
1,3,13265,1196,4221,6404,507,1788


441

In [3]:
val header = data.first // the first line holds the column names
val temp = data.filter(l => l != header) // drop the header, keeping only the data rows

In [4]:
temp.take(5).foreach(println)

2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
2,3,6353,8808,7684,2405,3516,7844
1,3,13265,1196,4221,6404,507,1788
2,3,22615,5410,7198,3915,1777,5185


In [5]:
// Skip Channel and Region (columns 0-1) and keep the six spend columns as a dense vector.
val customer = temp.map(line => Vectors.dense(line.split(',').slice(2, 8).map(_.toDouble)))

In [6]:
customer.take(5).foreach(println)

[12669.0,9656.0,7561.0,214.0,2674.0,1338.0]
[7057.0,9810.0,9568.0,1762.0,3293.0,1776.0]
[6353.0,8808.0,7684.0,2405.0,3516.0,7844.0]
[13265.0,1196.0,4221.0,6404.0,507.0,1788.0]
[22615.0,5410.0,7198.0,3915.0,1777.0,5185.0]
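To make the parsing step concrete, here is a minimal plain-Scala sketch (no Spark needed) of what the `map` does to a single line. `parseSpend` is a hypothetical helper name introduced only for illustration; the sample line is the first data row shown above.

```scala
// Hypothetical helper: mirrors the lambda passed to temp.map above.
// Columns 0-1 are Channel and Region; columns 2-7 are the six spend categories.
def parseSpend(line: String): Array[Double] =
  line.split(',').slice(2, 8).map(_.toDouble)

// First data row of the file, as printed earlier.
val sampleLine = "2,3,12669,9656,7561,214,2674,1338"
val spend = parseSpend(sampleLine)

println(spend.mkString("[", ",", "]"))
```

The printed values should match the first vector in the `customer.take(5)` output above.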


### Perform cluster analysis

In [7]:
val numClusters = 5    // k, the number of customer segments
val numIterations = 10 // maximum number of iterations
val model = KMeans.train(customer, numClusters, numIterations)
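`KMeans.train` needs the number of clusters up front, and the value 5 here is a choice rather than something derived from the data. One common way to sanity-check it is to compare the within-set sum of squared errors (WSSSE) across several values of k, using `computeCost` on each trained model. The sketch below is one possible way to do that; `wssseByK` is a hypothetical helper, not part of the original analysis.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: trains one model per k and returns (k, WSSSE) pairs.
def wssseByK(data: RDD[Vector], ks: Seq[Int], numIterations: Int): Seq[(Int, Double)] =
  ks.map { k =>
    val m = KMeans.train(data, k, numIterations)
    (k, m.computeCost(data)) // sum of squared distances to the nearest center
  }
```

With the `customer` RDD from the cells above, `wssseByK(customer, 2 to 8, 10).foreach(println)` prints one `(k, WSSSE)` pair per line; the range 2 to 8 is an arbitrary illustration. WSSSE tends to fall as k grows, so look for the point where the improvement levels off. Initialization is randomized, so the numbers vary between runs.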

In [8]:
val summary: MultivariateStatisticalSummary = Statistics.colStats(customer)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

[12000.297727272728,5796.265909090909,7951.277272727274,3071.931818181818,2881.493181818182,1524.8704545454548]
[1.5995492742140704E8,5.44699672389263E7,9.03101037543798E7,2.3567853166183483E7,2.2732436036399867E7,7952997.497986122]
[440.0,440.0,440.0,440.0,440.0,440.0]
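The variances above differ by more than an order of magnitude (about 1.6e8 for Fresh versus 8.0e6 for Delicassen). K-means uses Euclidean distance, so high-variance columns weigh more heavily in the clustering. This notebook clusters the raw values; one option, not taken here, is to standardize each column first. A minimal plain-Scala sketch of that z-score step follows, using the printed means and variances (rounded) and the first customer row. `zScore` is a hypothetical helper.

```scala
// Column means and variances as printed by Statistics.colStats above (rounded).
val means     = Array(12000.30, 5796.27, 7951.28, 3071.93, 2881.49, 1524.87)
val variances = Array(1.5995e8, 5.4470e7, 9.0310e7, 2.3568e7, 2.2732e7, 7.9530e6)

// Hypothetical helper: (x - mean) / standard deviation, column by column.
def zScore(row: Array[Double]): Array[Double] =
  row.indices.map(i => (row(i) - means(i)) / math.sqrt(variances(i))).toArray

// First customer row from the data set.
val first = Array(12669.0, 9656.0, 7561.0, 214.0, 2674.0, 1338.0)
println(zScore(first).map(z => f"$z%.3f").mkString("[", ", ", "]"))
```

In MLlib the same transformation is available through `StandardScaler` in `org.apache.spark.mllib.feature`, which can be fitted on the `customer` RDD before training.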


In [9]:
model.clusterCenters.foreach(println)

[8569.241258741258,3211.3811188811187,4121.104895104895,2745.527972027972,1183.2272727272727,1052.4195804195804]
[34782.0,30367.0,16898.0,48701.5,755.5,26776.0]
[5176.25,12308.75,19113.214285714286,1655.0595238095236,8426.45238095238,1980.7142857142856]
[34872.12698412698,5078.253968253968,5924.746031746032,5028.9047619047615,1115.079365079365,2166.269841269841]
[25603.0,43460.600000000006,61472.200000000004,2636.0,29974.2,2708.8]


In [10]:
val cusclu = model.predict(customer) // cluster index (0 to 4) for each customer

In [11]:
cusclu.take(5).foreach(println)

0
0
0
0
3
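`model.predict` assigns each vector to the cluster whose center is nearest in Euclidean distance. The plain-Scala sketch below reproduces that rule for the first and fifth rows shown above, using the cluster centers printed earlier, rounded to two decimals. `sqDist` and `nearestCenter` are hypothetical helper names, and this illustrates the rule rather than MLlib's actual implementation.

```scala
// Cluster centers as printed by model.clusterCenters, rounded to two decimals.
val centers = Array(
  Array(8569.24, 3211.38, 4121.10, 2745.53, 1183.23, 1052.42),
  Array(34782.00, 30367.00, 16898.00, 48701.50, 755.50, 26776.00),
  Array(5176.25, 12308.75, 19113.21, 1655.06, 8426.45, 1980.71),
  Array(34872.13, 5078.25, 5924.75, 5028.90, 1115.08, 2166.27),
  Array(25603.00, 43460.60, 61472.20, 2636.00, 29974.20, 2708.80)
)

// Squared Euclidean distance between two equal-length vectors.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to the given point.
def nearestCenter(point: Array[Double]): Int =
  centers.indices.minBy(i => sqDist(point, centers(i)))

val firstCustomer = Array(12669.0, 9656.0, 7561.0, 214.0, 2674.0, 1338.0)
val fifthCustomer = Array(22615.0, 5410.0, 7198.0, 3915.0, 1777.0, 5185.0)

println(nearestCenter(firstCustomer)) // 0, as in the predict output above
println(nearestCenter(fifthCustomer)) // 3, as in the predict output above
```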


In [12]:
println("PMML Model:\n" + model.toPMML)

PMML Model:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="1.6.0"/>
        <Timestamp>2016-03-23T09:00:41</Timestamp>
    </Header>
    <DataDictionary numberOfFields="6">
        <DataField name="field_0" optype="continuous" dataType="double"/>
        <DataField name="field_1" optype="continuous" dataType="double"/>
        <DataField name="field_2" optype="continuous" dataType="double"/>
        <DataField name="field_3" optype="continuous" dataType="double"/>
        <DataField name="field_4" optype="continuous" dataType="double"/>
        <DataField name="field_5" optype="continuous" dataType="double"/>
    </DataDictionary>
    <ClusteringModel modelName="k-means" functionName="clustering" modelClass="centerBased" numberOfClusters="5">
        <MiningSchema>
            <MiningField name="field_0" usageType="activ
