# Machine Learning Clustering using K-Means

Customer segmentation divides a customer base into groups of individuals who are similar in ways relevant to marketing, such as age, gender, or interests. This project clusters a company's customers to help grow its revenue.
The customer dataset contains the Gender, Age, Annual Income, and Spending Score features and has no label column. Clustering is therefore a suitable method, because it does not need labels. In this project we segment the customers with K-Means, one of the most popular clustering algorithms.


QUESTION: How many clusters should the customers be divided into?

The business decision will be based on the answer to this question.

## 1. Configuration 

In [1]:
from pyspark.sql import SparkSession 

# Name the session `spark`, because the later cells call spark.read
spark = SparkSession.builder \
.master("local[4]")\
.appName("KMeansClustering")\
.config("spark.executor.memory","3g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

sc = spark.sparkContext

## 2. Load Dataset

In [2]:
customer_df = spark.read.format("csv")\
.option("header","True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.load("Mall-Customers.csv")

In [3]:
customer_df.show(5)

+----------+------+---+------------+-------------+
|CustomerID|Gender|Age|AnnualIncome|SpendingScore|
+----------+------+---+------------+-------------+
|         1|  Male| 19|       15000|           39|
|         2|  Male| 21|       15000|           81|
|         3|Female| 20|       16000|            6|
|         4|Female| 23|       16000|           77|
|         5|Female| 31|       17000|           40|
+----------+------+---+------------+-------------+
only showing top 5 rows



## 3. Data Cleaning

In [4]:
# Import only col: a wildcard import would shadow Python built-ins such as max, min and sum
from pyspark.sql.functions import col

#### 3.1. Dropping the CustomerID 

CustomerID carries no information for clustering, so we drop it from the dataset.

In [5]:
customer_df2 = customer_df.drop("CustomerID")
customer_df2.show(3)

+------+---+------------+-------------+
|Gender|Age|AnnualIncome|SpendingScore|
+------+---+------------+-------------+
|  Male| 19|       15000|           39|
|  Male| 21|       15000|           81|
|Female| 20|       16000|            6|
+------+---+------------+-------------+
only showing top 3 rows



#### 3.2. Checking NULL values

In [6]:
# Count the NULLs in each column and report the result in colour
for count_for_null, column in enumerate(customer_df2.columns, start=1):
    if customer_df2.filter(col(column).isNull()).count() > 0:
        print(count_for_null, ".", column, "--> \033[1;31;1m has null values \033[0m")
    else:
        print(count_for_null, ".", column, "--> \033[1;32;1m is clean \033[0m")

1 . Gender --> is clean
2 . Age --> is clean
3 . AnnualIncome --> is clean
4 . SpendingScore --> is clean


#### 3.3. Checking the gender text values

Do all gender values start with an uppercase letter?

In [7]:
customer_df2.groupBy("Gender").count().show()

+------+-----+
|Gender|count|
+------+-----+
|Female|  112|
|  Male|   88|
+------+-----+



#### 3.4. Checking max and min ranges

Do Age, AnnualIncome, and SpendingScore contain any outliers?

In [8]:
customer_df2.select("Age","AnnualIncome","SpendingScore").describe().show()

+-------+-----------------+------------------+------------------+
|summary|              Age|      AnnualIncome|     SpendingScore|
+-------+-----------------+------------------+------------------+
|  count|              200|               200|               200|
|   mean|            38.85|           60560.0|              50.2|
| stddev|13.96900733155888|26264.721165271247|25.823521668370173|
|    min|               18|             15000|                 1|
|    max|               70|            137000|                99|
+-------+-----------------+------------------+------------------+



## 4. Data Preparation (Transform) 

#### 4.1. StringIndexer Process (Categorical Features)


StringIndexer maps each category label to a numeric index: the categories (A, B, C) become (0, 1, 2).

In [9]:
from pyspark.ml.feature import StringIndexer

In [10]:
gender_index = StringIndexer()\
.setInputCol("Gender")\
.setOutputCol("Gender_Index")

gender_index_model = gender_index.fit(customer_df2)
gender_index_df = gender_index_model.transform(customer_df2)
gender_index_df.show(3)

+------+---+------------+-------------+------------+
|Gender|Age|AnnualIncome|SpendingScore|Gender_Index|
+------+---+------------+-------------+------------+
|  Male| 19|       15000|           39|         1.0|
|  Male| 21|       15000|           81|         1.0|
|Female| 20|       16000|            6|         0.0|
+------+---+------------+-------------+------------+
only showing top 3 rows
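
The index assignment above can be sketched in plain Python, assuming StringIndexer's default ordering (most frequent label first). That assumption matches the output: Female (112 rows) gets 0.0 and Male (88 rows) gets 1.0. The helper name `frequency_index` is hypothetical and is not part of Spark.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to a float index, most frequent label first (hypothetical helper)."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically so that ties are deterministic
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Gender counts taken from the groupBy output in section 3.3
genders = ["Female"] * 112 + ["Male"] * 88
mapping = frequency_index(genders)
print(mapping)  # {'Female': 0.0, 'Male': 1.0}
```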



#### 4.2. OneHotEncoderEstimator Process (Categorical Features)

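One-hot encoding turns a category index into a binary vector with a single 1. For example, (0,0,1,0,0) marks the third of five categories. In Spark 3.0 and later, the same class is named OneHotEncoder.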


In [11]:
from pyspark.ml.feature import OneHotEncoderEstimator

In [12]:
encoder = OneHotEncoderEstimator()\
.setInputCols(["Gender_Index"])\
.setOutputCols(["Gender_Encoded"])

encoder_model = encoder.fit(gender_index_df)
encoder_df = encoder_model.transform(gender_index_df)
encoder_df.show(3)

+------+---+------------+-------------+------------+--------------+
|Gender|Age|AnnualIncome|SpendingScore|Gender_Index|Gender_Encoded|
+------+---+------------+-------------+------------+--------------+
|  Male| 19|       15000|           39|         1.0|     (1,[],[])|
|  Male| 21|       15000|           81|         1.0|     (1,[],[])|
|Female| 20|       16000|            6|         0.0| (1,[0],[1.0])|
+------+---+------------+-------------+------------+--------------+
only showing top 3 rows
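
The Gender_Encoded column prints sparse vectors as (size, [indices], [values]). With two categories the encoded vector has length 1, because the encoder drops the last category by default, so Male (index 1.0) becomes the all-zero vector. The sketch below expands that notation in plain Python; `to_dense` is a hypothetical helper, not a Spark function.

```python
def to_dense(size, indices, values):
    """Expand a (size, [indices], [values]) sparse vector into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

male = to_dense(1, [], [])          # printed as (1,[],[]) in the output above
female = to_dense(1, [0], [1.0])    # printed as (1,[0],[1.0]) in the output above
third_of_five = to_dense(5, [2], [1.0])

print(male)           # [0.0]
print(female)         # [1.0]
print(third_of_five)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```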



#### 4.3. VectorAssembler  (Transforming features into vector)

Spark ML algorithms expect all input features in a single vector column, so we combine them with VectorAssembler.

In [13]:
from pyspark.ml.feature import VectorAssembler

In [14]:
assembler = VectorAssembler()\
.setInputCols(["Gender_Encoded","Age","AnnualIncome","SpendingScore"])\
.setOutputCol("vectorized_features")

assembler_df = assembler.transform(encoder_df)

#### 4.4. Normalization


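StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. With the default settings used here, it scales each feature to unit standard deviation and does not center the mean.

##### 4.4.1 Using StandardScaler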

In [15]:
from pyspark.ml.feature import StandardScaler

standard_scaler = StandardScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("features")

standard_scale_model = standard_scaler.fit(assembler_df)
standard_df = standard_scale_model.transform(assembler_df)
standard_df.show(3)

+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
|Gender|Age|AnnualIncome|SpendingScore|Gender_Index|Gender_Encoded| vectorized_features|            features|
+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
|  Male| 19|       15000|           39|         1.0|     (1,[],[])|[0.0,19.0,15000.0...|[0.0,1.3601539142...|
|  Male| 21|       15000|           81|         1.0|     (1,[],[])|[0.0,21.0,15000.0...|[0.0,1.5033280104...|
|Female| 20|       16000|            6|         0.0| (1,[0],[1.0])|[1.0,20.0,16000.0...|[2.00951470525829...|
+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
only showing top 3 rows
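
The scaled values can be checked by hand. This is a minimal sketch, assuming the scaler only divides each feature by its sample standard deviation (no mean centering). It uses the standard deviations from the `describe()` output in section 3 and the first customer row; the variable names are illustrative.

```python
# Sample standard deviations copied from the describe() output
stddev = {
    "Age": 13.96900733155888,
    "AnnualIncome": 26264.721165271247,
    "SpendingScore": 25.823521668370173,
}

# First customer row from the dataset
row = {"Age": 19, "AnnualIncome": 15000, "SpendingScore": 39}

# Divide each value by the standard deviation of its column
scaled = {name: value / stddev[name] for name, value in row.items()}

for name, value in scaled.items():
    print(name, value)
```

The scaled Age agrees with the second entry of the first `features` vector above (1.3601539142...).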



##### 4.4.2 Using Normalizer

In [16]:
from pyspark.ml.feature import Normalizer

In [17]:
normalizer = Normalizer()\
.setInputCol("vectorized_features")\
.setOutputCol("features")

normal_df = normalizer.transform(assembler_df)
normal_df.show(3)

+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
|Gender|Age|AnnualIncome|SpendingScore|Gender_Index|Gender_Encoded| vectorized_features|            features|
+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
|  Male| 19|       15000|           39|         1.0|     (1,[],[])|[0.0,19.0,15000.0...|[0.0,0.0012666613...|
|  Male| 21|       15000|           81|         1.0|     (1,[],[])|[0.0,21.0,15000.0...|[0.0,0.0013999782...|
|Female| 20|       16000|            6|         0.0| (1,[0],[1.0])|[1.0,20.0,16000.0...|[6.24999466553417...|
+------+---+------------+-------------+------------+--------------+--------------------+--------------------+
only showing top 3 rows
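
Unlike StandardScaler, Normalizer works row by row. The sketch below assumes the default L2 norm and reproduces the first row by hand: each component is divided by the Euclidean length of the row's vector. The variable names are illustrative.

```python
import math

# vectorized_features of the first customer: [Gender_Encoded, Age, AnnualIncome, SpendingScore]
vector = [0.0, 19.0, 15000.0, 39.0]

# Euclidean (L2) length of the row
l2_norm = math.sqrt(sum(x * x for x in vector))

# Divide every component by the length so that the row has unit norm
unit = [x / l2_norm for x in vector]
print(unit)
```

The Age component agrees with the table above (0.0012666613...). Because AnnualIncome dominates each row, its normalized component is close to 1.0 and the other components become very small.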



#### 4.5 Split Train and Test

In [18]:
train_df, test_df = normal_df.randomSplit([0.8, 0.2], seed=142)

In [19]:
# This split reassigns train_df and test_df, so the cells below use the StandardScaler output
train_df, test_df = standard_df.randomSplit([0.8, 0.2], seed=142)

## 5. Spark ML (Using Pipeline) 

#### 5.1. Applying K-Means Algorithm

In [20]:
from pyspark.ml.clustering import KMeans

In [21]:
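# Note: the model is fit on the unscaled "vectorized_features" column,
# while the evaluator in section 6 scores the scaled "features" column
k_means = KMeans()\
.setFeaturesCol("vectorized_features")\
.setPredictionCol("cluster")\
.setK(3)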

kmeans_model = k_means.fit(train_df)
result_df = kmeans_model.transform(test_df)

result_df.select("Gender","Age","AnnualIncome","SpendingScore","features","cluster").toPandas().head()

   Gender  Age  AnnualIncome  SpendingScore                                           features  cluster
0  Female   19         65000             50  [2.0095147052582996, 1.360153914235199, 2.4748...        2
1  Female   21         30000             73  [2.0095147052582996, 1.503328010470483, 1.1422...        0
2  Female   21         62000             42  [2.0095147052582996, 1.503328010470483, 2.3605...        2
3  Female   22         17000             76  [2.0095147052582996, 1.574915058588125, 0.6472...        0
4  Female   23         70000             29  [2.0095147052582996, 1.6465021067057672, 2.665...        2


In [22]:
result_df.groupBy("cluster").count().show()

+-------+-----+
|cluster|count|
+-------+-----+
|      1|    4|
|      2|   20|
|      0|   19|
+-------+-----+
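
K-Means assigns points by Euclidean distance, so on unscaled vectors the feature with the largest range dominates. A small sketch of that effect, using the first and third customer rows from section 2 as unscaled `vectorized_features` (the variable names are illustrative):

```python
# [Gender_Encoded, Age, AnnualIncome, SpendingScore]
names = ["Gender_Encoded", "Age", "AnnualIncome", "SpendingScore"]
first = [0.0, 19.0, 15000.0, 39.0]
third = [1.0, 20.0, 16000.0, 6.0]

# Contribution of each feature to the squared Euclidean distance
contributions = [(a - b) ** 2 for a, b in zip(first, third)]
total = sum(contributions)

for name, part in zip(names, contributions):
    print(name, part, round(part / total, 4))

income_share = contributions[2] / total
```

For this pair AnnualIncome accounts for more than 99% of the squared distance, which is the reason for scaling the features before clustering.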



## 6. Calculate Silhouette Score

In [23]:
from pyspark.ml.evaluation import ClusteringEvaluator

#### 6.1. Method 1 (using single K value)

In [24]:
evaluator = ClusteringEvaluator()\
.setFeaturesCol("features")\
.setPredictionCol("cluster")\
.setMetricName("silhouette")

score = evaluator.evaluate(result_df)
print("Silhouette Score (for k=3): ", score)

Silhouette Score (for k=3):  0.13655564392757646
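
For intuition about the number itself, here is a plain-Python sketch of the textbook silhouette definition. It assumes squared Euclidean distance, the default distance measure of ClusteringEvaluator, and at least two clusters. It is an illustration on toy data, not Spark's implementation, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_distance(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight, well-separated groups on a line
points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below indicate that points sit as close to another cluster as to their own.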


#### 6.2. Method 2 (using for loop)

In [25]:
def bestKmeans(kValue):
    # The same evaluator scores every model, so create it once
    evaluator = ClusteringEvaluator()\
    .setFeaturesCol("features")\
    .setPredictionCol("cluster")\
    .setMetricName("silhouette")

    # Fit K-Means for every k from 2 to kValue and print each silhouette score
    for k in range(2, kValue + 1):
        k_means = KMeans()\
        .setFeaturesCol("vectorized_features")\
        .setPredictionCol("cluster")\
        .setK(k)
        kmeans_model = k_means.fit(train_df)
        result_df = kmeans_model.transform(test_df)

        score = evaluator.evaluate(result_df)
        print("Silhouette Score (for k=", k, "): ", score)

### Result
In the run shown below, k=2 gives the highest silhouette score (about 0.25). The score falls as k grows and turns negative from k=8 onward.

In [26]:
maxKValue = 10
bestKmeans(maxKValue)

Silhouette Score (for k= 2 ):  0.25460377577865234
Silhouette Score (for k= 3 ):  0.13655564392757646
Silhouette Score (for k= 4 ):  0.07646596820179498
Silhouette Score (for k= 5 ):  0.05283117853305441
Silhouette Score (for k= 6 ):  0.07634479551635422
Silhouette Score (for k= 7 ):  0.0025455579209068243
Silhouette Score (for k= 8 ):  -0.12702671178000344
Silhouette Score (for k= 9 ):  -0.15919934575144346
Silhouette Score (for k= 10 ):  -0.21882903110324425
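
The best k can also be picked programmatically. The sketch below hard-codes the scores printed above. In the notebook itself, `bestKmeans` would have to return the scores instead of printing them, which is an assumed change and not part of the code shown here.

```python
# Silhouette scores copied from the output above, keyed by k
scores = {
    2: 0.25460377577865234,
    3: 0.13655564392757646,
    4: 0.07646596820179498,
    5: 0.05283117853305441,
    6: 0.07634479551635422,
    7: 0.0025455579209068243,
    8: -0.12702671178000344,
    9: -0.15919934575144346,
    10: -0.21882903110324425,
}

# The k whose score is highest
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # 2 0.25460377577865234
```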
