# Unsupervised Learning and Customer Segmentation with Spark

## Business Problem

#### The firm wants to segment its customers and set marketing strategies for each segment. To do this, customer behaviors are defined, and customers are grouped according to the clusters that emerge from these behaviors.

## Task 1: Data preparation and first look

### Step 1: Start 

In [1]:
import findspark
findspark.init("/opt/manual/spark/")

In [2]:
from pyspark.sql import SparkSession, functions as F
import pandas as pd

In [3]:
spark = SparkSession.builder \
.appName("Customer Segmentation with Spark") \
.master("yarn") \
.config("spark.sql.shuffle.partitions","20") \
.config("spark.driver.memory", "1g") \
.config("spark.executor.memory","2g") \
.enableHiveSupport() \
.getOrCreate()

In [4]:
df = spark.read \
.format("csv") \
.option("header",True) \
.option("inferSchema", True) \
.option("sep", "|") \
.load("/user/train/datasets/flo100k.csv")

### Step 2: First look

In [5]:
df.limit(5).toPandas()

Unnamed: 0,master_id,order_channel,platform_type,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,online_product_group_amount_top_name_12,offline_product_group_name_12,last_order_date_new,store_type
0,b3ace094-a17f-11e9-a2fc-000d3a38a36f,Offline,Offline,Offline,2019-02-23 12:59:17,2019-02-23 12:59:17,NaT,2019-02-23 12:59:17,,1.0,212.98,0.0,,,,2019-02-23,A
1,c57d7c4c-a950-11e9-a2fc-000d3a38a36f,Offline,OmniChannel,Offline,2019-12-01 16:48:09,2019-12-01 16:48:09,NaT,2019-12-01 16:48:09,,1.0,199.98,0.0,,,,2019-12-01,A
2,602897a6-cdac-11ea-b31f-000d3a38a36f,Offline,Offline,Offline,2020-07-24 15:49:47,2020-07-24 15:49:47,NaT,2020-07-24 15:49:47,,1.0,140.49,0.0,[ERKEK],,ERKEK,2020-07-24,A
3,388e4c4e-af86-11e9-a2fc-000d3a38a36f,Mobile,Online,Mobile,2018-12-31 07:22:07,2018-12-31 07:22:07,2018-12-31 07:22:07,NaT,1.0,,0.0,174.99,,,,2018-12-31,A
4,80664354-adf0-11eb-8f64-000d3a299ebf,Desktop,Online,Desktop,2021-05-05 21:07:02,2021-05-05 22:39:36,2021-05-05 22:39:36,NaT,2.0,,0.0,283.95,[],,,2021-05-05,A


In [6]:
df_count = df.count()
print(df_count)

100000


In [7]:
len(df.columns)

17

In [8]:
df.dtypes

[('master_id', 'string'),
 ('order_channel', 'string'),
 ('platform_type', 'string'),
 ('last_order_channel', 'string'),
 ('first_order_date', 'timestamp'),
 ('last_order_date', 'timestamp'),
 ('last_order_date_online', 'timestamp'),
 ('last_order_date_offline', 'timestamp'),
 ('order_num_total_ever_online', 'double'),
 ('order_num_total_ever_offline', 'double'),
 ('customer_value_total_ever_offline', 'double'),
 ('customer_value_total_ever_online', 'double'),
 ('interested_in_categories_12', 'string'),
 ('online_product_group_amount_top_name_12', 'string'),
 ('offline_product_group_name_12', 'string'),
 ('last_order_date_new', 'string'),
 ('store_type', 'string')]

### Step 3: Null check

In [9]:
for col_name, _ in df.dtypes:
    # Count rows where the column is null, empty, or the literal string "NA"
    col = F.col(col_name)
    null_count = df.filter(col.isNull() | (col == "") | (col == "NA")).count()

    if null_count > 0:
        print(f"{col_name}:  {null_count} % {null_count / df_count * 100} ")
        

last_order_date_online:  70784 % 70.784 
last_order_date_offline:  21703 % 21.703 
order_num_total_ever_online:  70784 % 70.784 
order_num_total_ever_offline:  21703 % 21.703 
interested_in_categories_12:  56590 % 56.589999999999996 
online_product_group_amount_top_name_12:  88295 % 88.295 
offline_product_group_name_12:  77209 % 77.209 
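
The raw percentages above carry floating-point noise (for example `56.589999999999996`). The following is a minimal plain-Python sketch, independent of Spark, of the same missing-value rule (null, empty string, or the literal `"NA"`) with the percentage rounded to two decimals. The helper names `is_missing` and `missing_report` and the toy column are hypothetical and only illustrate the idea.

```python
def is_missing(value):
    """Mirror the notebook's rule: None, empty string, or the literal "NA"."""
    return value is None or value == "" or value == "NA"


def missing_report(name, values):
    """Return a one-line summary with the percentage rounded to two decimals."""
    missing = sum(1 for v in values if is_missing(v))
    pct = missing / len(values) * 100
    return f"{name}: {missing} ({pct:.2f}%)"


# Toy column with made-up values, not taken from the dataset
toy = ["ERKEK", None, "", "NA", "ERKEK"]
print(missing_report("toy_column", toy))

# The same formatting applied to a count reported in the output above
pct_interested = f"{56590 / 100000 * 100:.2f}"
print(pct_interested)
```

In the Spark loop itself, the equivalent change would be a format specifier such as `:.2f` in the `print` call.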


## Task 2: Data analysis

### Step 1: Check the unique variable

In [10]:
df.select("master_id").distinct().count()

100000

### Step 2: Understanding the data

In [11]:
df.select("platform_type","order_channel").groupBy("platform_type","order_channel").count().orderBy(F.desc("count")).show()

+-------------+-------------+-----+
|platform_type|order_channel|count|
+-------------+-------------+-----+
|      Offline|      Offline|65991|
|       Online|  Android App| 8728|
|       Online|       Mobile| 6451|
|  OmniChannel|      Offline| 4793|
|  OmniChannel|  Android App| 3261|
|       Online|      Desktop| 3253|
|       Online|      Ios App| 3008|
|  OmniChannel|       Mobile| 2061|
|  OmniChannel|      Desktop| 1498|
|  OmniChannel|      Ios App|  956|
+-------------+-------------+-----+



### Step 3: Omnichannel

In [12]:
# Replace missing order counts and customer values with 0
 
df = df.na.fill(value=0,
               subset=['order_num_total_ever_offline',
                      'order_num_total_ever_online',
                      'customer_value_total_ever_offline',
                      'customer_value_total_ever_online'])

In [13]:
# Keep only rows whose order counts and customer values are non-negative
df = df.filter((df.order_num_total_ever_offline >= 0) &
               (df.order_num_total_ever_online >= 0) &
               (df.customer_value_total_ever_offline >=0) &
               (df.customer_value_total_ever_online >=0))
                

In [14]:
df = df.withColumn("order_num_total",
                  df.order_num_total_ever_offline + df.order_num_total_ever_online)

In [15]:
df = df.withColumn("customer_values_total",
                  df.customer_value_total_ever_offline + df.customer_value_total_ever_online)

### Step 4: Channel review

In [16]:
df.groupBy("order_channel")\
.agg(F.count("master_id").alias("customer_count")
    ,F.avg("order_num_total").alias("avg_order_num_total")
     ,F.avg("customer_values_total").alias("avg_customer_value_total")).show()

+-------------+--------------+-------------------+------------------------+
|order_channel|customer_count|avg_order_num_total|avg_customer_value_total|
+-------------+--------------+-------------------+------------------------+
|  Android App|         11989|   3.50971724080407|       532.8462840937639|
|       Mobile|          8512|  2.798637218045113|       391.7418761748012|
|      Ios App|          3964|  3.377648839556004|       568.4640312815283|
|      Desktop|          4751|  2.538623447695222|      376.91355504103785|
|      Offline|         70784| 1.6003475361663653|      218.16394849129728|
+-------------+--------------+-------------------+------------------------+



### Step 5: Sales values by platform

In [17]:
df.groupBy("platform_type")\
.agg(F.count("master_id").alias("customer_count")
    ,F.avg("order_num_total").alias("avg_order_num_total")
     ,F.avg("customer_values_total").alias("avg_customer_value_total")).show()

+-------------+--------------+-------------------+------------------------+
|platform_type|customer_count|avg_order_num_total|avg_customer_value_total|
+-------------+--------------+-------------------+------------------------+
|  OmniChannel|         12569| 3.4947092051873656|       500.7937512928769|
|       Online|         21440| 2.6207089552238805|       404.7896576492903|
|      Offline|         65991| 1.5837917291751906|       215.7303068601296|
+-------------+--------------+-------------------+------------------------+



## Task 3: Calculate K-Means metrics

### Step 1: RFM

In [18]:
rfm = df.select("master_id", "last_order_date_new", "order_num_total", "customer_values_total")

In [19]:
rfm.show(5)

+--------------------+-------------------+---------------+---------------------+
|           master_id|last_order_date_new|order_num_total|customer_values_total|
+--------------------+-------------------+---------------+---------------------+
|b3ace094-a17f-11e...|         2019-02-23|            1.0|               212.98|
|c57d7c4c-a950-11e...|         2019-12-01|            1.0|               199.98|
|602897a6-cdac-11e...|         2020-07-24|            1.0|               140.49|
|388e4c4e-af86-11e...|         2018-12-31|            1.0|               174.99|
|80664354-adf0-11e...|         2021-05-05|            2.0|               283.95|
+--------------------+-------------------+---------------+---------------------+
only showing top 5 rows



### Step 2: Last purchase date

In [20]:
last_order_date = df.agg({"last_order_date":"max"}).collect()[0][0]
last_order_date.date()

datetime.date(2021, 5, 30)

### Step 3: Recency

In [21]:
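# Recency: days between the analysis date (2021-06-01, two days after the last order) and each customer's last order
rfm = rfm.withColumn("Recency", F.expr("datediff('2021-06-01', last_order_date_new)"))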
rfm.show()

+--------------------+-------------------+---------------+---------------------+-------+
|           master_id|last_order_date_new|order_num_total|customer_values_total|Recency|
+--------------------+-------------------+---------------+---------------------+-------+
|b3ace094-a17f-11e...|         2019-02-23|            1.0|               212.98|    829|
|c57d7c4c-a950-11e...|         2019-12-01|            1.0|               199.98|    548|
|602897a6-cdac-11e...|         2020-07-24|            1.0|               140.49|    312|
|388e4c4e-af86-11e...|         2018-12-31|            1.0|               174.99|    883|
|80664354-adf0-11e...|         2021-05-05|            2.0|               283.95|     27|
|47511f36-aeb4-11e...|         2018-11-11|            1.0|               139.98|    933|
|77f7c318-3407-11e...|         2020-12-01|            1.0|                269.9|    182|
|399d6dd2-ecf1-11e...|         2020-09-02|            1.0|                95.73|    272|
|b3d4a6f2-a368-11e...
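
As a sanity check on the `Recency` column, the day differences can be reproduced outside Spark. The sketch below uses only the standard library and the last-order dates of three customers shown above. The name `recency_days` is hypothetical, and the analysis date of June 1, 2021 is the one hard-coded in the notebook.

```python
from datetime import date

ANALYSIS_DATE = date(2021, 6, 1)  # reference date hard-coded in the notebook


def recency_days(last_order: date, analysis_date: date = ANALYSIS_DATE) -> int:
    """Days elapsed between the last order and the analysis date."""
    return (analysis_date - last_order).days


# Last-order dates of three customers from the table above
for last_order in (date(2019, 2, 23), date(2019, 12, 1), date(2021, 5, 5)):
    print(last_order, recency_days(last_order))
```

The results should match the 829, 548, and 27 days reported by `datediff` for those rows.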

### Step 4: Renaming

In [22]:
rfm = rfm.withColumnRenamed("order_num_total","Frequency")\
.withColumnRenamed("customer_values_total","Monetary")

In [23]:
rfm.columns

['master_id', 'last_order_date_new', 'Frequency', 'Monetary', 'Recency']

### Step 5: Drop the date column

In [24]:
rfm = rfm.drop("last_order_date_new")
rfm.columns

['master_id', 'Frequency', 'Monetary', 'Recency']

## Task 4: Model preparation

### Step 1: Vector Assembler

In [25]:
from pyspark.ml.feature import VectorAssembler

In [26]:
rfm_cols = ['Frequency', 'Monetary', 'Recency']

In [27]:
assembler = VectorAssembler()\
.setHandleInvalid("skip") \
.setInputCols(rfm_cols)\
.setOutputCol("unscaled_features")

### Step 2: Standard Scaler

In [28]:
from pyspark.ml.feature import StandardScaler

In [29]:
scaler = StandardScaler() \
.setInputCol("unscaled_features") \
.setOutputCol("features")

### Step 3: Pipeline

In [30]:
from pyspark.ml import Pipeline

In [31]:
pipeline_obj = Pipeline().setStages([assembler,scaler])

In [32]:
pipeline_model = pipeline_obj.fit(rfm)

In [33]:
pipeline_df = pipeline_model.transform(rfm)

In [34]:
pipeline_df.limit(5).toPandas()

Unnamed: 0,master_id,Frequency,Monetary,Recency,unscaled_features,features
0,b3ace094-a17f-11e9-a2fc-000d3a38a36f,1.0,212.98,829,"[1.0, 212.98, 829.0]","[0.32770120750468923, 0.4598810576445157, 3.24..."
1,c57d7c4c-a950-11e9-a2fc-000d3a38a36f,1.0,199.98,548,"[1.0, 199.98, 548.0]","[0.32770120750468923, 0.43181056393910344, 2.1..."
2,602897a6-cdac-11ea-b31f-000d3a38a36f,1.0,140.49,312,"[1.0, 140.49, 312.0]","[0.32770120750468923, 0.3033556662056438, 1.22..."
3,388e4c4e-af86-11e9-a2fc-000d3a38a36f,1.0,174.99,883,"[1.0, 174.99, 883.0]","[0.32770120750468923, 0.3778504379623148, 3.45..."
4,80664354-adf0-11eb-8f64-000d3a299ebf,2.0,283.95,27,"[2.0, 283.95, 27.0]","[0.6554024150093785, 0.613124360588601, 0.1057..."
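
With its default settings, Spark's `StandardScaler` scales each feature to unit standard deviation but does not center it (`withStd=True`, `withMean=False`). This is why a `Frequency` of 1.0 and 2.0 map to 0.3277 and 0.6554 above: the scaled values stay proportional to the raw ones. The following plain-Python sketch shows the same transformation on made-up values. The function name `scale_to_unit_std` is hypothetical.

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation, without centering."""
    s = stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / s for v in values]


# Made-up values, not taken from the dataset
toy_frequency = [1.0, 2.0, 6.0]
scaled = scale_to_unit_std(toy_frequency)
print(scaled)

# Without centering, doubling the raw value doubles the scaled value
print(abs(scaled[1] - 2 * scaled[0]) < 1e-12)
```

Because the features are not centered, they remain non-negative here. Calling `.setWithMean(True)` on the scaler would subtract the mean as well.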


## Task 5: K-Means

### Step 1: Optimum number of clusters

In [35]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [36]:
def compute_kmeans_model(df,k):
    kmeans_obj = KMeans()\
    .setSeed(142)\
    .setK(k)
    
    return kmeans_obj.fit(df)

In [37]:
evaluator = ClusteringEvaluator()

In [38]:
for k in range(2,12):
    
    kmeans_model = compute_kmeans_model(pipeline_df,k)
    
    transformed_df = kmeans_model.transform(pipeline_df)
    
    score = evaluator.evaluate(transformed_df)
    
    print(f"k:{k}  - score:{score}")

k:2  - score:0.4403139929816374
k:3  - score:0.655033488104594
k:4  - score:0.6658979549823151
k:5  - score:0.6777415637605754
k:6  - score:0.5104147694093093
k:7  - score:0.6780801245958924
k:8  - score:0.6494848988252858
k:9  - score:0.5039729078361186
k:10  - score:0.5132605636800477
k:11  - score:0.5526101794299998
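
`ClusteringEvaluator` reports the silhouette score by default, where higher values indicate better-separated clusters. A small sketch of picking the best `k` from the scores printed above is shown here. The dictionary is copied from that output, and the variable names are illustrative.

```python
# Silhouette scores copied from the loop output above
scores = {
    2: 0.4403139929816374,
    3: 0.655033488104594,
    4: 0.6658979549823151,
    5: 0.6777415637605754,
    6: 0.5104147694093093,
    7: 0.6780801245958924,
    8: 0.6494848988252858,
    9: 0.5039729078361186,
    10: 0.5132605636800477,
    11: 0.5526101794299998,
}

# k with the highest silhouette score
best_k = max(scores, key=scores.get)

# All candidates ranked from best to worst
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(best_k)
print(ranked[:3])
```

Note that `k=5` trails `k=7` by only about 0.0003, so a smaller number of segments may be a reasonable alternative if it is easier to act on.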


### Step 2: Final Model

In [39]:
kmeans_model = compute_kmeans_model(pipeline_df,7)

In [40]:
# Prediction

transformed_df = kmeans_model.transform(pipeline_df)

In [41]:
transformed_df.limit(5).toPandas()

Unnamed: 0,master_id,Frequency,Monetary,Recency,unscaled_features,features,prediction
0,b3ace094-a17f-11e9-a2fc-000d3a38a36f,1.0,212.98,829,"[1.0, 212.98, 829.0]","[0.32770120750468923, 0.4598810576445157, 3.24...",1
1,c57d7c4c-a950-11e9-a2fc-000d3a38a36f,1.0,199.98,548,"[1.0, 199.98, 548.0]","[0.32770120750468923, 0.43181056393910344, 2.1...",1
2,602897a6-cdac-11ea-b31f-000d3a38a36f,1.0,140.49,312,"[1.0, 140.49, 312.0]","[0.32770120750468923, 0.3033556662056438, 1.22...",0
3,388e4c4e-af86-11e9-a2fc-000d3a38a36f,1.0,174.99,883,"[1.0, 174.99, 883.0]","[0.32770120750468923, 0.3778504379623148, 3.45...",1
4,80664354-adf0-11eb-8f64-000d3a299ebf,2.0,283.95,27,"[2.0, 283.95, 27.0]","[0.6554024150093785, 0.613124360588601, 0.1057...",0


### Step 3: Descriptive statistics

In [42]:
transformed_df.groupBy("prediction").agg(F.count("Monetary").alias("count"),
                                        F.mean("Monetary").alias("avg_monetary"),
                                        F.mean("Recency").alias("avg_recency"),
                                        F.mean("Frequency").alias("avg_frequency")).sort(F.desc("avg_monetary")).show()

+----------+-----+------------------+------------------+------------------+
|prediction|count|      avg_monetary|       avg_recency|     avg_frequency|
+----------+-----+------------------+------------------+------------------+
|         6|    2|44160.100000000006|             629.0|             337.0|
|         5|    6|21260.864999999998|             223.5|150.16666666666666|
|         2|   58| 6210.019655172414|190.31034482758622| 44.86206896551724|
|         4| 1243| 2249.367779565568|188.83588093322606|13.729686242960579|
|         3| 9554| 841.2370232363438|191.11953108645594| 5.167050450073268|
|         0|36205|235.98326419006816|179.86767021129677|1.6333931777378816|
|         1|52932|174.87422202077803| 619.1671011864279|1.4147018816594876|
+----------+-----+------------------+------------------+------------------+
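
The clusters differ greatly in size, which matters when turning them into marketing segments. The sketch below computes each cluster's share of customers from the counts in the table above. The 0.1% cutoff used to flag very small clusters is an assumption chosen for illustration, not a rule from the notebook.

```python
# Cluster sizes copied from the table above (prediction -> count)
cluster_counts = {6: 2, 5: 6, 2: 58, 4: 1243, 3: 9554, 0: 36205, 1: 52932}

total = sum(cluster_counts.values())
shares = {c: n / total * 100 for c, n in cluster_counts.items()}

# Print clusters from largest to smallest share
for cluster, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"cluster {cluster}: {cluster_counts[cluster]:>6} customers ({share:.3f}%)")

# Assumed cutoff: clusters holding under 0.1% of customers are flagged for review
tiny = sorted(c for c, share in shares.items() if share < 0.1)
print(tiny)
```

Clusters 0 and 1 together hold roughly 89% of customers, while the flagged clusters contain only a handful of very high-value customers and may be better treated as outliers than as separate segments.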



In [43]:
spark.stop()