PHASE 1 – Ingestion & Cleaning
Use the same cleaning logic as Case Study 1:
1. Read orders.csv as all StringType.
2. Trim text columns.
3. Normalize city, category, product.
4. Clean amount:
   Remove commas
   Convert to IntegerType
   Handle invalid values safely
5. Parse order_date into DateType → order_date_clean.
6. Remove duplicate order_id.
7. Keep only Completed orders.
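From this point onward, the dataset is considered clean_orders_df. Note that the cells below write the parsed date back into order_date rather than into a separate order_date_clean column.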

In [74]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
from pyspark.sql.functions import (
    trim, col, when, to_date, sum as spark_sum, avg, desc, rank, lit, broadcast,
    coalesce, isnull, try_to_timestamp, regexp_extract, initcap, count, countDistinct,
    min as spark_min, max as spark_max,
)
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("Project1").getOrCreate()

In [2]:
df_raw = spark.read \
.option("header", "true") \
.option("inferSchema", "false") \
.csv("orders.csv")

df_raw.show()
df_raw.printSchema()
df_raw.count()
df_raw.describe()

+-----------+-----------+-----------+-----------+-----------+-------+----------+---------+
|   order_id|customer_id|       city|   category|    product| amount|order_date|   status|
+-----------+-----------+-----------+-----------+-----------+-------+----------+---------+
|ORD00000000|    C000000| hyderabad |   grocery |       Oil |invalid|01/01/2024|Cancelled|
|ORD00000001|    C000001|       Pune|    Grocery|      Sugar|  35430|2024-01-02|Completed|
|ORD00000002|    C000002|       Pune|Electronics|     Mobile|  65358|2024-01-03|Completed|
|ORD00000003|    C000003|  Bangalore|Electronics|     Laptop|   5558|2024-01-04|Completed|
|ORD00000004|    C000004|       Pune|       Home|AirPurifier|  33659|2024-01-05|Completed|
|ORD00000005|    C000005|      Delhi|    Fashion|      Jeans|   8521|2024-01-06|Completed|
|ORD00000006|    C000006|      Delhi|    Grocery|      Sugar|  42383|2024-01-07|Completed|
|ORD00000007|    C000007|       Pune|    Grocery|       Rice|  45362|2024-01-08|Completed|

DataFrame[summary: string, order_id: string, customer_id: string, city: string, category: string, product: string, amount: string, order_date: string, status: string]

In [3]:
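cols_to_clean = ["city", "category", "product", "amount", "status"]
df_cleaned_str = df_raw
for col_name in cols_to_clean:
    # trim() strips surrounding whitespace; initcap() upper-cases the first letter of each
    # word and lower-cases the rest, so "AirPurifier" becomes "Airpurifier".
    df_cleaned_str = df_cleaned_str.withColumn(col_name, initcap(trim(col(col_name))))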

In [4]:
df_cleaned_str.show()
print(df_cleaned_str.count())

+-----------+-----------+---------+-----------+-----------+-------+----------+---------+
|   order_id|customer_id|     city|   category|    product| amount|order_date|   status|
+-----------+-----------+---------+-----------+-----------+-------+----------+---------+
|ORD00000000|    C000000|Hyderabad|    Grocery|        Oil|Invalid|01/01/2024|Cancelled|
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar|  35430|2024-01-02|Completed|
|ORD00000002|    C000002|     Pune|Electronics|     Mobile|  65358|2024-01-03|Completed|
|ORD00000003|    C000003|Bangalore|Electronics|     Laptop|   5558|2024-01-04|Completed|
|ORD00000004|    C000004|     Pune|       Home|Airpurifier|  33659|2024-01-05|Completed|
|ORD00000005|    C000005|    Delhi|    Fashion|      Jeans|   8521|2024-01-06|Completed|
|ORD00000006|    C000006|    Delhi|    Grocery|      Sugar|  42383|2024-01-07|Completed|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice|  45362|2024-01-08|Completed|
|ORD00000008|    C000

In [5]:
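from pyspark.sql.functions import regexp_replace

# Remove thousands separators first; otherwise "1,234" would be extracted as "1".
numeric_price_str = regexp_extract(regexp_replace(col("amount"), ",", ""), r"(\d+)", 0)

# Amounts with no digits become 0 rather than null, so a later dropna() on amount keeps them.
df_error_free = df_cleaned_str.withColumn(
    "amount",
    when((numeric_price_str == "") | numeric_price_str.isNull(), lit(0))
    .otherwise(numeric_price_str.cast("int")),
).withColumn(
    "order_date",
    # Try each known date layout in turn; rows matching none of them stay null.
    coalesce(
        to_date(try_to_timestamp(col("order_date"), lit("yyyy-MM-dd"))),
        to_date(try_to_timestamp(col("order_date"), lit("dd/MM/yyyy"))),
        to_date(try_to_timestamp(col("order_date"), lit("yyyy/MM/dd"))),
    ),
)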

df_error_free.show()
df_error_free.printSchema()

+-----------+-----------+---------+-----------+-----------+------+----------+---------+
|   order_id|customer_id|     city|   category|    product|amount|order_date|   status|
+-----------+-----------+---------+-----------+-----------+------+----------+---------+
|ORD00000000|    C000000|Hyderabad|    Grocery|        Oil|     0|2024-01-01|Cancelled|
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar| 35430|2024-01-02|Completed|
|ORD00000002|    C000002|     Pune|Electronics|     Mobile| 65358|2024-01-03|Completed|
|ORD00000003|    C000003|Bangalore|Electronics|     Laptop|  5558|2024-01-04|Completed|
|ORD00000004|    C000004|     Pune|       Home|Airpurifier| 33659|2024-01-05|Completed|
|ORD00000005|    C000005|    Delhi|    Fashion|      Jeans|  8521|2024-01-06|Completed|
|ORD00000006|    C000006|    Delhi|    Grocery|      Sugar| 42383|2024-01-07|Completed|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice| 45362|2024-01-08|Completed|
|ORD00000008|    C000008|Bangalo

In [6]:
print(f"Count before cleaning: {df_error_free.count()}")

df_valid = df_error_free.dropna(subset=["amount", "order_date"])

print(f"Count after cleaning: {df_valid.count()}")


Count before cleaning: 300000
Count after cleaning: 297405


In [7]:
df_clean = df_valid.dropDuplicates(["order_id"])

print(f"Count after dropping duplicates: {df_clean.count()}")

Count after dropping duplicates: 297405


In [8]:
clean_orders_df = df_clean.filter(col("status") == "Completed")

print(f"Count after filtering: {clean_orders_df.count()}")

Count after filtering: 282535


PHASE 2 – Customer Metrics
Compute the following for each customer:
1. Total number of orders.
2. Total spending.
3. Average order value.
4. First purchase date.
5. Last purchase date.
6. Number of distinct cities ordered from.
7. Number of distinct categories ordered from.
These define the customer profile.

In [16]:
customer_per_order = clean_orders_df.groupBy("customer_id").agg(count("order_id").alias("Total Order"))
customer_per_order.show()

+-----------+-----------+
|customer_id|Total Order|
+-----------+-----------+
|    C000142|          6|
|    C000299|          6|
|    C000433|          6|
|    C001115|          6|
|    C001875|          6|
|    C002484|          5|
|    C002512|          6|
|    C002837|          6|
|    C003194|          6|
|    C003484|          6|
|    C004744|          6|
|    C004804|          6|
|    C005119|          6|
|    C005781|          6|
|    C006654|          6|
|    C007013|          6|
|    C008123|          6|
|    C008343|          6|
|    C008471|          6|
|    C009248|          6|
+-----------+-----------+
only showing top 20 rows


In [17]:
total_spending = clean_orders_df.groupBy("customer_id").agg(spark_sum("amount").alias("Total Spending"))
total_spending.show()

+-----------+--------------+
|customer_id|Total Spending|
+-----------+--------------+
|    C000142|        301300|
|    C000299|        216273|
|    C000433|        285507|
|    C001115|        163614|
|    C001875|        213381|
|    C002484|        106303|
|    C002512|        336838|
|    C002837|        213260|
|    C003194|        227384|
|    C003484|        300712|
|    C004744|        170512|
|    C004804|        170880|
|    C005119|        197410|
|    C005781|        197168|
|    C006654|        130659|
|    C007013|        241427|
|    C008123|        220137|
|    C008343|        230184|
|    C008471|        214694|
|    C009248|        203246|
+-----------+--------------+
only showing top 20 rows


In [18]:
avg_order_value = clean_orders_df.groupBy("customer_id").agg(avg("amount").alias("Avg Order Spending"))
avg_order_value.show()

+-----------+------------------+
|customer_id|Avg Order Spending|
+-----------+------------------+
|    C000142|50216.666666666664|
|    C000299|           36045.5|
|    C000433|           47584.5|
|    C001115|           27269.0|
|    C001875|           35563.5|
|    C002484|           21260.6|
|    C002512|56139.666666666664|
|    C002837|35543.333333333336|
|    C003194|37897.333333333336|
|    C003484|50118.666666666664|
|    C004744|28418.666666666668|
|    C004804|           28480.0|
|    C005119|32901.666666666664|
|    C005781|32861.333333333336|
|    C006654|           21776.5|
|    C007013|40237.833333333336|
|    C008123|           36689.5|
|    C008343|           38364.0|
|    C008471|35782.333333333336|
|    C009248|33874.333333333336|
+-----------+------------------+
only showing top 20 rows


In [45]:
first_purchase_date = clean_orders_df.groupBy("customer_id").agg(spark_min(col("order_date")).alias("First_Purchase_Date"))
first_purchase_date.show()

+-----------+-------------------+
|customer_id|First_Purchase_Date|
+-----------+-------------------+
|    C000142|         2024-01-03|
|    C000299|         2024-01-20|
|    C000433|         2024-01-14|
|    C001115|         2024-01-16|
|    C001875|         2024-01-16|
|    C002484|         2024-01-05|
|    C002512|         2024-01-13|
|    C002837|         2024-01-18|
|    C003194|         2024-01-15|
|    C003484|         2024-01-05|
|    C004744|         2024-01-05|
|    C004804|         2024-01-05|
|    C005119|         2024-01-20|
|    C005781|         2024-01-02|
|    C006654|         2024-01-15|
|    C007013|         2024-01-14|
|    C008123|         2024-01-04|
|    C008343|         2024-01-04|
|    C008471|         2024-01-12|
|    C009248|         2024-01-09|
+-----------+-------------------+
only showing top 20 rows


In [46]:
last_purchase_date = clean_orders_df.groupBy("customer_id").agg(spark_max(col("order_date")).alias("Last_Purchase_Date"))
last_purchase_date.show()

+-----------+------------------+
|customer_id|Last_Purchase_Date|
+-----------+------------------+
|    C000142|        2024-02-12|
|    C000299|        2024-02-29|
|    C000433|        2024-02-23|
|    C001115|        2024-02-25|
|    C001875|        2024-02-25|
|    C002484|        2024-02-14|
|    C002512|        2024-02-22|
|    C002837|        2024-02-27|
|    C003194|        2024-02-24|
|    C003484|        2024-02-14|
|    C004744|        2024-02-14|
|    C004804|        2024-02-14|
|    C005119|        2024-02-29|
|    C005781|        2024-02-11|
|    C006654|        2024-02-24|
|    C007013|        2024-02-23|
|    C008123|        2024-02-13|
|    C008343|        2024-02-13|
|    C008471|        2024-02-21|
|    C009248|        2024-02-18|
+-----------+------------------+
only showing top 20 rows


In [29]:
distinct_cities = clean_orders_df.groupBy("customer_id").agg(countDistinct("city").alias("Distinct_Cities"))
distinct_cities.show()

+-----------+---------------+
|customer_id|Distinct_Cities|
+-----------+---------------+
|    C001875|              5|
|    C030046|              5|
|    C042719|              4|
|    C026844|              4|
|    C021938|              4|
|    C001019|              4|
|    C037904|              4|
|    C046477|              5|
|    C003053|              5|
|    C013125|              4|
|    C008538|              4|
|    C041743|              4|
|    C012569|              5|
|    C002837|              5|
|    C027384|              4|
|    C002646|              5|
|    C042151|              5|
|    C046807|              4|
|    C006517|              4|
|    C027419|              5|
+-----------+---------------+
only showing top 20 rows


In [30]:
distinct_categories = clean_orders_df.groupBy("customer_id").agg(countDistinct("category").alias("Distinct_Categories"))
distinct_categories.show()

+-----------+-------------------+
|customer_id|Distinct_Categories|
+-----------+-------------------+
|    C022166|                  4|
|    C014126|                  3|
|    C019758|                  4|
|    C008343|                  4|
|    C032026|                  3|
|    C001875|                  3|
|    C018567|                  2|
|    C005781|                  4|
|    C017258|                  4|
|    C017645|                  4|
|    C040925|                  2|
|    C008471|                  4|
|    C048710|                  3|
|    C036742|                  3|
|    C012244|                  3|
|    C030565|                  4|
|    C030828|                  3|
|    C014566|                  3|
|    C038722|                  4|
|    C040253|                  4|
+-----------+-------------------+
only showing top 20 rows


PHASE 3 – Customer Segmentation
Create customer segments using business logic:

Total Spend >= 200000 AND Orders >= 5 → "VIP"
Total Spend >= 100000 → "Premium"
Else → "Regular"

Add a column:

customer_segment

Count customers in each segment.

In [47]:
all_customer_metrics = customer_per_order \
    .join(total_spending, "customer_id") \
    .join(avg_order_value, "customer_id") \
    .join(first_purchase_date, "customer_id") \
    .join(last_purchase_date, "customer_id") \
    .join(distinct_cities, "customer_id") \
    .join(distinct_categories, "customer_id")

all_customer_metrics.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+
|    C000142|          6|        301300|50216.666666666664|         2024-01-03|        2024-02-12|              4|                  4|
|    C000299|          6|        216273|           36045.5|         2024-01-20|        2024-02-29|              4|                  4|
|    C000433|          6|        285507|           47584.5|         2024-01-14|        2024-02-23|              4|                  3|
|    C001115|          6|        163614|           27269.0|         2024-01-16|        2024-02-25|              6|                  4|
|    C001875|          6|        213381|           3556

In [48]:
segmented_df = all_customer_metrics.withColumn("customer_segment", when((col("Total Spending") >= 200000) & (col("Total Order") >= 5), "VIP")
                                                   .when((col("Total Spending") >= 100000), "Premium")
                                                   .otherwise("Regular"))

segmented_df.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|    C000142|          6|        301300|50216.666666666664|         2024-01-03|        2024-02-12|              4|                  4|             VIP|
|    C000299|          6|        216273|           36045.5|         2024-01-20|        2024-02-29|              4|                  4|             VIP|
|    C000433|          6|        285507|           47584.5|         2024-01-14|        2024-02-23|              4|                  3|             VIP|
|    C001115|          6|        163614|           27269.0|         2024-01-16|        2
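
The segmentation rule depends on the order in which the conditions are checked, and Phase 3 also asks for a count per segment. The plain-Python sketch below mirrors the rule so that boundary cases can be checked by hand. The function name `customer_segment` and the two `X_` rows are hypothetical; the first three rows reuse figures from the outputs above.

```python
from collections import Counter


def customer_segment(total_spending, total_orders):
    """Mirror of the Phase 3 rule; conditions are checked top to bottom."""
    if total_spending >= 200000 and total_orders >= 5:
        return "VIP"
    if total_spending >= 100000:
        return "Premium"
    return "Regular"


# (customer_id, total spending, total orders)
samples = [
    ("C000142", 301300, 6),
    ("C001115", 163614, 6),
    ("C002484", 106303, 5),
    ("X_high_spend_few_orders", 250000, 4),  # hypothetical edge case
    ("X_low_spend", 99999, 10),              # hypothetical edge case
]

segments = {cid: customer_segment(spend, orders) for cid, spend, orders in samples}
segment_counts = Counter(segments.values())

print(segments)
print(segment_counts)
```

A customer with high spending but fewer than five orders falls through to "Premium", which is why the VIP test has to come first. On the Spark side, the per-segment count would typically be `segmented_df.groupBy("customer_segment").count()`.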

PHASE 4 – Window Functions
Using Window functions:
1. Rank customers by total spending (overall).
2. Rank customers inside each city by total spending.
3. Identify top 3 customers per city.
4. Identify top 10 customers across all cities.
This phase must use:

Window.partitionBy()

In [49]:
customer_window = Window.orderBy(col("Total Spending").desc())
df_customer_rank = segmented_df.withColumn("rank", rank().over(customer_window))
df_customer_rank.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+----+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|rank|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+----+
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|              5|                  4|             VIP|   1|
|    C034689|          6|        486879|           81146.5|         2024-01-10|        2024-02-19|              4|                  3|             VIP|   2|
|    C039985|          6|        484057| 80676.16666666667|         2024-01-06|        2024-02-15|              3|                  4|             VIP|   3|
|    C026691|          6|        477147|           79524.5

In [50]:
total_data = segmented_df.join(clean_orders_df, "customer_id")
total_data.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|   order_id|     city|   category|product|amount|order_date|   status|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+
|    C000002|          6|        143436|           23906.0|         2024-01-03|        2024-02-12|              4|                  4|         Premium|ORD00150002|     Pune|    Grocery|   Rice| 16360|2024-01-03|Completed|
|    C000002|          6|        143436|           23906.0|         2024-01-03|        2024-02-12|              

In [51]:
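# customer_window has no partitionBy("city"), so this ranks order rows globally by the customer's total spending; it is not a per-city ranking.
df_customer_city_rank = total_data.withColumn("rank", rank().over(customer_window))
df_customer_city_rank.show()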

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-----------+------+----------+---------+----+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|   order_id|     city|   category|    product|amount|order_date|   status|rank|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-----------+------+----------+---------+----+
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|              5|                  4|             VIP|ORD00143076|Bangalore|       Home|     Vacuum| 87330|2024-02-06|Completed|   1|
|    C043076|          6|        493949| 82324.83333333333|         2024-01-

In [52]:
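# limit(3) keeps three order rows (here all from the same top customer), not the top 3 customers of each city.
df_customer_city_rank.limit(3).show()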

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+--------+-------+------+----------+---------+----+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|   order_id|     city|category|product|amount|order_date|   status|rank|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+--------+-------+------+----------+---------+----+
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|              5|                  4|             VIP|ORD00143076|Bangalore|    Home| Vacuum| 87330|2024-02-06|Completed|   1|
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|      

In [53]:
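# limit(10) returns ten order rows, not ten customers; for the overall top 10, filter df_customer_rank on rank <= 10 instead.
df_customer_city_rank.limit(10).show()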

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+----+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|   order_id|     city|   category|product|amount|order_date|   status|rank|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+----+
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|              5|                  4|             VIP|ORD00293076|Hyderabad|    Grocery|  Sugar| 73381|2024-02-06|Completed|   1|
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-

PHASE 5 – Customer Loyalty Analysis
Define loyalty:
A loyal customer is one who:
Has purchases on at least 3 different dates
Has ordered from at least 2 different categories
Tasks:
1. Identify loyal customers.
2. Count loyal customers per city.
3. Compare loyal vs non-loyal customer revenue contribution.

In [55]:
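# First_Purchase_Date != Last_Purchase_Date only guarantees two distinct purchase dates, which is looser than the three required by the definition above.
loyal_customer = total_data.withColumn("Type", when((col("Distinct_Categories") >= 2) & (col("First_Purchase_Date") != col("Last_Purchase_Date")), "Loyal").otherwise("Non-loyal"))
loyal_customer.show()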

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+-----+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|   order_id|     city|   category|product|amount|order_date|   status| Type|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-----------+---------+-----------+-------+------+----------+---------+-----+
|    C000002|          6|        143436|           23906.0|         2024-01-03|        2024-02-12|              4|                  4|         Premium|ORD00150002|     Pune|    Grocery|   Rice| 16360|2024-01-03|Completed|Loyal|
|    C000002|          6|        143436|           23906.0|         2024-01-03|        2

In [60]:
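# count("city") counts order rows per customer, so this column equals the customer's order count; it is neither a distinct-city count nor a count of loyal customers per city.
city_wise = loyal_customer.groupBy("customer_id", "Type").agg(count("city").alias("Different City of Purchase"))
city_wise.show()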

+-----------+-----+--------------------------+
|customer_id| Type|Different City of Purchase|
+-----------+-----+--------------------------+
|    C000142|Loyal|                         6|
|    C000299|Loyal|                         6|
|    C000433|Loyal|                         6|
|    C001115|Loyal|                         6|
|    C001875|Loyal|                         6|
|    C002484|Loyal|                         5|
|    C002512|Loyal|                         6|
|    C002837|Loyal|                         6|
|    C003194|Loyal|                         6|
|    C003484|Loyal|                         6|
|    C004744|Loyal|                         6|
|    C004804|Loyal|                         6|
|    C005119|Loyal|                         6|
|    C005781|Loyal|                         6|
|    C006654|Loyal|                         6|
|    C007013|Loyal|                         6|
|    C008123|Loyal|                         6|
|    C008343|Loyal|                         6|
|    C008471|

In [62]:
type_wise = loyal_customer.groupBy("Type").agg(spark_sum("amount").alias("Revenue"))
type_wise.show()

+---------+-----------+
|     Type|    Revenue|
+---------+-----------+
|    Loyal|11180454642|
|Non-loyal|   15638036|
+---------+-----------+



PHASE 6 – Time-Based Analysis
Using order_date_clean:
1. Compute monthly revenue per city.
2. Compute monthly order count per category.
3. Identify growth or decline trends.
This introduces:

Date functions
Time-series thinking

In [66]:
from pyspark.sql.functions import month, date_format

new_df = clean_orders_df.withColumn("month_integer", month(col("order_date")))

new_df.show()

+-----------+-----------+---------+-----------+-----------+------+----------+---------+-------------+
|   order_id|customer_id|     city|   category|    product|amount|order_date|   status|month_integer|
+-----------+-----------+---------+-----------+-----------+------+----------+---------+-------------+
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar| 35430|2024-01-02|Completed|            1|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice| 45362|2024-01-08|Completed|            1|
|ORD00000008|    C000008|Bangalore|    Fashion|      Jeans| 10563|2024-01-09|Completed|            1|
|ORD00000010|    C000010|Bangalore|    Grocery|      Sugar| 66576|2024-01-11|Completed|            1|
|ORD00000011|    C000011|  Kolkata|Electronics|     Tablet| 50318|2024-01-12|Completed|            1|
|ORD00000012|    C000012|Bangalore|    Grocery|      Sugar| 84768|2024-01-13|Completed|            1|
|ORD00000014|    C000014|   Mumbai|Electronics|     Tablet| 79469|2024-01-15|Compl

In [69]:
monthly_revenue_city = new_df.groupBy("city", "month_integer").agg(spark_sum("amount").alias("monthly_revenue"))
monthly_revenue_city.show()

+---------+-------------+---------------+
|     city|month_integer|monthly_revenue|
+---------+-------------+---------------+
|   Mumbai|            1|      806518278|
|   Mumbai|            2|      786301679|
|    Delhi|            1|      807202773|
|Hyderabad|            2|      786254815|
|  Chennai|            2|      786495303|
|  Chennai|            1|      808473493|
|Bangalore|            2|      783004473|
|Hyderabad|            1|      823005673|
|  Kolkata|            2|      775433858|
|    Delhi|            2|      795483411|
|     Pune|            2|      787817529|
|  Kolkata|            1|      814526860|
|     Pune|            1|      823485156|
|Bangalore|            1|      812089377|
+---------+-------------+---------------+



In [71]:
monthly_order_count_category = new_df.groupBy("category", "month_integer").agg(count("order_id").alias("monthly_order_count"))
monthly_order_count_category.show()

+-----------+-------------+-------------------+
|   category|month_integer|monthly_order_count|
+-----------+-------------+-------------------+
|Electronics|            1|              35994|
|       Home|            1|              36163|
|Electronics|            2|              34766|
|    Fashion|            1|              35571|
|       Home|            2|              34631|
|    Grocery|            2|              34672|
|    Fashion|            2|              34720|
|    Grocery|            1|              36018|
+-----------+-------------+-------------------+
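
Task 3 asks for growth or decline trends. As a worked example, the plain-Python sketch below computes the January to February change per city from the monthly_revenue_city figures shown above. The helper `pct_change` and the dictionary layout are assumptions made for illustration.

```python
# Monthly revenue per city, copied from the monthly_revenue_city output above
# as city -> (January revenue, February revenue).
monthly_revenue = {
    "Mumbai": (806518278, 786301679),
    "Delhi": (807202773, 795483411),
    "Hyderabad": (823005673, 786254815),
    "Chennai": (808473493, 786495303),
    "Bangalore": (812089377, 783004473),
    "Kolkata": (814526860, 775433858),
    "Pune": (823485156, 787817529),
}


def pct_change(previous, current):
    """Percentage change from the previous month to the current month."""
    return (current - previous) / previous * 100


changes = {city: pct_change(jan, feb) for city, (jan, feb) in monthly_revenue.items()}

# Print cities from the steepest decline to the mildest.
for city, change in sorted(changes.items(), key=lambda item: item[1]):
    trend = "growth" if change > 0 else "decline" if change < 0 else "flat"
    print(f"{city:<10} {change:+.2f}%  {trend}")
```

In Spark the same comparison is usually written with lag("monthly_revenue") over Window.partitionBy("city").orderBy("month_integer"). Part of any January to February drop may simply reflect month length (31 days versus 29 in 2024), and month_integer alone would merge the same month of different years if the data spanned more than one year.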



PHASE 7 – Performance Engineering
1. Identify which DataFrames are reused.
2. Apply caching.
3. Use explain(True) on:
Customer aggregation
Window ranking
4. Identify shuffle stages.
5. Justify any repartitioning strategy.

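1 & 2. Reused DataFrames: `clean_orders_df` and `segmented_df`. Both are read by several later phases, so they are the candidates for caching.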


In [72]:
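# total_data joins the per-customer aggregates back to the orders, so its plan contains every aggregation and join from Phases 2 and 3.
print("--- Phase 7: Explain Plan for Customer Aggregation ---")
total_data.explain(True)

# new_df only adds month_integer and contains no Window; the ranked DataFrame from Phase 4 is df_customer_rank.
print("--- Phase 7: Explain Plan for Window Ranking ---")
new_df.explain(True)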



--- Phase 7: Explain Plan for Customer Aggregation ---
== Parsed Logical Plan ==
'Join UsingJoin(Inner, [customer_id])
:- Project [customer_id#18, Total Order#504L, Total Spending#569L, Avg Order Spending#646, First_Purchase_Date#6697, Last_Purchase_Date#6774, Distinct_Cities#960L, Distinct_Categories#1043L, CASE WHEN ((Total Spending#569L >= cast(200000 as bigint)) AND (Total Order#504L >= cast(5 as bigint))) THEN VIP WHEN (Total Spending#569L >= cast(100000 as bigint)) THEN Premium ELSE Regular END AS customer_segment#7501]
:  +- Project [customer_id#18, Total Order#504L, Total Spending#569L, Avg Order Spending#646, First_Purchase_Date#6697, Last_Purchase_Date#6774, Distinct_Cities#960L, Distinct_Categories#1043L]
:     +- Join Inner, (customer_id#18 = customer_id#6932)
:        :- Project [customer_id#18, Total Order#504L, Total Spending#569L, Avg Order Spending#646, First_Purchase_Date#6697, Last_Purchase_Date#6774, Distinct_Cities#960L]
:        :  +- Join Inner, (customer_id#18 =

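4 & 5. Strategy:
Shuffles occur at groupBy and Window operations and at joins that are not broadcast; they appear as Exchange nodes in the physical plan.
Repartitioning by 'city' before a city-partitioned Window in Phase 4 can help when several city-keyed steps follow, because Spark can reuse that partitioning instead of shuffling again. It does not remove skew by itself: all rows of a large city still land in the same partition.

Example: city_spend_df.repartition("city"), where city_spend_df would be a DataFrame of spending per customer and city.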

PHASE 8 – Broadcast Join (Light Use)
Create a small lookup:

segment_code,segment_label
1,VIP
2,Premium
3,Regular

Map:

VIP → 1
Premium → 2
Regular → 3

Tasks:
1. Create this as a small DataFrame.
2. Join with customer segmentation output.
3. Force broadcast join.
4. Verify BroadcastHashJoin in plan.

In [73]:
data_lookup = [("VIP", 1), ("Premium", 2), ("Regular", 3)]
lookup_df = spark.createDataFrame(data_lookup, ["segment_label", "segment_code"])

In [75]:
broadcast_result = segmented_df.join(
    broadcast(lookup_df),
    segmented_df.customer_segment == lookup_df.segment_label,
    "inner",
)

In [76]:
broadcast_result.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [customer_segment#7501], [segment_label#15104], Inner, BuildRight, false
   :- Project [customer_id#15107, Total Order#504L, Total Spending#569L, Avg Order Spending#646, First_Purchase_Date#6697, Last_Purchase_Date#6774, Distinct_Cities#960L, Distinct_Categories#1043L, CASE WHEN ((Total Spending#569L >= 200000) AND (Total Order#504L >= 5)) THEN VIP WHEN (Total Spending#569L >= 100000) THEN Premium ELSE Regular END AS customer_segment#7501]
   :  +- BroadcastHashJoin [customer_id#15107], [customer_id#15191], Inner, BuildRight, false
   :     :- Project [customer_id#15107, Total Order#504L, Total Spending#569L, Avg Order Spending#646, First_Purchase_Date#6697, Last_Purchase_Date#6774, Distinct_Cities#960L]
   :     :  +- BroadcastHashJoin [customer_id#15107], [customer_id#15177], Inner, BuildRight, false
   :     :     :- Project [customer_id#15107, Total Order#504L, Total Spending#569L, Avg Order Spending#646, 

In [77]:
broadcast_result.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-------------+------------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|segment_label|segment_code|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+-------------+------------+
|    C000142|          6|        301300|50216.666666666664|         2024-01-03|        2024-02-12|              4|                  4|             VIP|          VIP|           1|
|    C000299|          6|        216273|           36045.5|         2024-01-20|        2024-02-29|              4|                  4|             VIP|          VIP|           1|
|    C000433|          6|        285507|           47584.5|         2024-01-14|        2024-02-23|       

PHASE 9 – Sorting & Set Operations
1. Sort customers by:

Total spend descending
Order count descending

2. Create two sets:
Customers who bought Electronics
Customers who bought Grocery

3. Find:
Customers in both sets
Customers in only one set

In [79]:
segmented_df.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|    C000142|          6|        301300|50216.666666666664|         2024-01-03|        2024-02-12|              4|                  4|             VIP|
|    C000299|          6|        216273|           36045.5|         2024-01-20|        2024-02-29|              4|                  4|             VIP|
|    C000433|          6|        285507|           47584.5|         2024-01-14|        2024-02-23|              4|                  3|             VIP|
|    C001115|          6|        163614|           27269.0|         2024-01-16|        2

In [81]:
sorted_customers = segmented_df.orderBy(col("Total Spending").desc(), col("Total Order").desc())
sorted_customers.show()

+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|customer_id|Total Order|Total Spending|Avg Order Spending|First_Purchase_Date|Last_Purchase_Date|Distinct_Cities|Distinct_Categories|customer_segment|
+-----------+-----------+--------------+------------------+-------------------+------------------+---------------+-------------------+----------------+
|    C043076|          6|        493949| 82324.83333333333|         2024-01-17|        2024-02-26|              5|                  4|             VIP|
|    C034689|          6|        486879|           81146.5|         2024-01-10|        2024-02-19|              4|                  3|             VIP|
|    C039985|          6|        484057| 80676.16666666667|         2024-01-06|        2024-02-15|              3|                  4|             VIP|
|    C026691|          6|        477147|           79524.5|         2024-01-12|        2

In [82]:
electronics_cust = clean_orders_df.filter(col("category") == "Electronics").select("customer_id").distinct()
grocery_cust = clean_orders_df.filter(col("category") == "Grocery").select("customer_id").distinct()

both_sets = electronics_cust.intersect(grocery_cust)

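only_electronics = electronics_cust.subtract(grocery_cust)
only_grocery = grocery_cust.subtract(electronics_cust)

# Customers in exactly one of the two sets.
only_one_set = only_electronics.union(only_grocery)

print(f"Both: {both_sets.count()}, Only Electronics: {only_electronics.count()}")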

Both: 31006, Only Electronics: 7831


PHASE 10 – Storage Strategy
1. Write customer master dataset to:

Parquet

Partitioned by:

customer_segment

2. Write monthly analytics to:

ORC

3. Read back and validate.

In [None]:
segmented_df.write.mode("overwrite").partitionBy("customer_segment").parquet("output/customer_master_parquet")

monthly_revenue_city.write.mode("overwrite").orc("output/monthly_analytics_orc")



PHASE 11 – Debugging
Explain why this is dangerous:

df = df.groupBy("customer_id").sum("amount").show()

Explain:
What df becomes
Why pipeline breaks
Correct approach

EXPLANATION:
1. What df becomes:
The .show() method is an action: it prints the rows to the console and returns None.

Therefore, the variable df is rebound to None and no longer refers to a DataFrame.