## Retail Data Analysis with PySpark 🚀

🗂️ Sections in the Notebook:

- Setup Spark Session

- Load & Explore Dataset

- Data Cleaning

- Feature Engineering

- Exploratory Analysis (Aggregations)

- Joins

- Window Functions

- Rollup & Cube (Advanced Grouping)

- Conclusion

### Setup Spark Session

In [1]:
from pyspark.sql import SparkSession

spark=SparkSession.builder.appName("Retail Data analysis").getOrCreate()

### Load and Explore Dataset

In [2]:
#read and load csv
df = spark.read.format('csv')\
    .option("header","true")\
    .option("inferSchema","true")\
    .load("retail-data/all/online-retail-dataset.csv")

In [3]:
#show schema
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



In [4]:
#Show the first few rows
df.show(5)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
only showing top 5 rows



### Data Cleaning

In [5]:
#Remove rows with nulls in important columns
df_clean=df.dropna(subset=["CustomerID","Description"])

In [6]:
#Keep only rows with a positive quantity and a positive unit price
df_clean=df_clean.filter((df_clean["Quantity"]>0) & (df_clean["UnitPrice"]>0))

In [7]:
#Remove canceled orders (InvoiceNo starts with 'C')
df_clean=df_clean.filter(~df_clean["InvoiceNo"].startswith("C"))

In [8]:
#show cleaned data
df_clean.show()

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|     7.65|     17850|United Kingdom|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|     4.

In [9]:
df_clean.select("InvoiceDate").show(5, False)


+--------------+
|InvoiceDate   |
+--------------+
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
+--------------+
only showing top 5 rows



In [10]:
df_clean.select("InvoiceDate").show(10,truncate=False)

+--------------+
|InvoiceDate   |
+--------------+
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:28|
|12/1/2010 8:28|
|12/1/2010 8:34|
+--------------+
only showing top 10 rows



### Feature Engineering

In [11]:
from pyspark.sql.functions import to_timestamp

#Convert InvoiceDate from string to timestamp
df_clean=df_clean.withColumn("InvoiceDate",to_timestamp("InvoiceDate","M/d/yyyy H:mm"))

In [12]:
df_clean.show(2)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 2 rows



In [13]:
df_clean.select("InvoiceDate").show(5, False)


+-------------------+
|InvoiceDate        |
+-------------------+
|2010-12-01 08:26:00|
|2010-12-01 08:26:00|
|2010-12-01 08:26:00|
|2010-12-01 08:26:00|
|2010-12-01 08:26:00|
+-------------------+
only showing top 5 rows
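
The raw `InvoiceDate` strings have no leading zeros on the month, day, or hour, which is why the single-letter pattern `M/d/yyyy H:mm` is used above. As a quick sanity check outside Spark, a minimal sketch of roughly the same parse with Python's standard library follows. The `strptime` directives are Python's analogue rather than Spark's pattern syntax, and `parse_invoice_date` is a hypothetical helper name.

```python
from datetime import datetime

def parse_invoice_date(raw: str) -> datetime:
    """Parse a raw InvoiceDate string such as '12/1/2010 8:26'."""
    # When parsing, %m, %d and %H accept values without leading zeros
    return datetime.strptime(raw, "%m/%d/%Y %H:%M")

parsed = parse_invoice_date("12/1/2010 8:26")
print(parsed)  # 2010-12-01 08:26:00
```

The printed value matches the first `InvoiceDate` shown in the converted column.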



In [14]:
from pyspark.sql.functions import col
#Add TotalPrice column
df_clean = df_clean.withColumn("TotalPrice",col("Quantity")*col("UnitPrice"))

In [15]:
df_clean.show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+------------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|        TotalPrice|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+------------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|     17850|United Kingdom|15.299999999999999|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|             20.34|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|     17850|United Kingdom|              22.0|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|             20.34|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|    
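
The first `TotalPrice` value prints as `15.299999999999999` rather than `15.3` because `UnitPrice` is a double, and 2.55 has no exact binary representation. The plain-Python sketch below reproduces the effect and shows two common remedies, rounding for display and exact decimal arithmetic. In Spark the analogous options would be `round` from `pyspark.sql.functions` or a cast to a decimal type. That is a suggestion, not something the notebook does.

```python
from decimal import Decimal

quantity = 6
unit_price = 2.55

raw_total = quantity * unit_price         # binary floating point
rounded_total = round(raw_total, 2)       # rounded for display
exact_total = quantity * Decimal("2.55")  # exact decimal arithmetic

print(raw_total)      # 15.299999999999999
print(rounded_total)  # 15.3
print(exact_total)    # 15.30
```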

In [16]:
from pyspark.sql.functions import month,dayofweek
#Extract month and weekday
df_clean=df_clean.withColumn("Month",month("InvoiceDate"))
df_clean=df_clean.withColumn("Weekday",dayofweek("InvoiceDate"))

In [17]:
df_clean.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+------------------+-----+-------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|        TotalPrice|Month|Weekday|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+------------------+-----+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|     17850|United Kingdom|15.299999999999999|   12|      4|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|             20.34|   12|      4|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|     17850|United Kingdom|              22.0|   12|      4|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|     17850|United Kingdom|             20.34|   12|      4|
|   536365|  

In [18]:
df_clean.select("InvoiceNo","Quantity","UnitPrice","TotalPrice","Month","Weekday").show(5)

+---------+--------+---------+------------------+-----+-------+
|InvoiceNo|Quantity|UnitPrice|        TotalPrice|Month|Weekday|
+---------+--------+---------+------------------+-----+-------+
|   536365|       6|     2.55|15.299999999999999|   12|      4|
|   536365|       6|     3.39|             20.34|   12|      4|
|   536365|       8|     2.75|              22.0|   12|      4|
|   536365|       6|     3.39|             20.34|   12|      4|
|   536365|       6|     3.39|             20.34|   12|      4|
+---------+--------+---------+------------------+-----+-------+
only showing top 5 rows
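
The `Weekday` value of 4 for 2010-12-01 follows the `dayofweek` convention of 1 = Sunday through 7 = Saturday, so 4 is a Wednesday. The sketch below reproduces that numbering with the standard library to make the mapping explicit. `spark_style_dayofweek` is a hypothetical helper, not part of PySpark.

```python
from datetime import date

def spark_style_dayofweek(d: date) -> int:
    """Return 1 for Sunday through 7 for Saturday."""
    # isoweekday() gives Monday=1 ... Sunday=7, so shift Sunday to the front
    return d.isoweekday() % 7 + 1

names = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]
code = spark_style_dayofweek(date(2010, 12, 1))
print(code, names[code - 1])  # 4 Wednesday
```

Keeping this mapping in mind helps when reading the weekday totals in the Cube section, where the all-country rows list only weekdays 1 to 6.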



### Aggregations - Product & Country Analysis

In [19]:
#Top 10 products by revenue
from pyspark.sql.functions import col

df_clean.groupBy("Description") \
    .agg({"TotalPrice": "sum"}) \
    .withColumnRenamed("sum(TotalPrice)", "Revenue") \
    .orderBy(col("Revenue").desc()) \
    .show(10)

+--------------------+------------------+
|         Description|           Revenue|
+--------------------+------------------+
|PAPER CRAFT , LIT...|          168469.6|
|REGENCY CAKESTAND...| 142592.9500000001|
|WHITE HANGING HEA...|100448.15000000013|
|JUMBO BAG RED RET...| 85220.77999999987|
|MEDIUM CERAMIC TO...| 81416.73000000001|
|             POSTAGE| 77803.95999999999|
|       PARTY BUNTING| 68844.32999999994|
|ASSORTED COLOUR B...| 56580.34000000013|
|              Manual|          53779.93|
|  RABBIT NIGHT LIGHT|51346.199999999866|
+--------------------+------------------+
only showing top 10 rows



In [20]:
#Top 5 countries by number of customers
df_clean.select("CustomerID","Country").distinct()\
    .groupBy("Country")\
    .count()\
    .orderBy(col("count").desc())\
    .show(5)

+--------------+-----+
|       Country|count|
+--------------+-----+
|United Kingdom| 3920|
|       Germany|   94|
|        France|   87|
|         Spain|   30|
|       Belgium|   25|
+--------------+-----+
only showing top 5 rows



### Joins - Customer Summary + Product Summary

In [21]:
# Customer summary (TotalOrders counts invoice lines, not distinct invoices)
customer_df=df_clean.groupBy("CustomerID","Country")\
    .agg({"TotalPrice":"sum","InvoiceNo":"count"})\
    .withColumnRenamed("sum(TotalPrice)","TotalSpend")\
    .withColumnRenamed("count(InvoiceNo)","TotalOrders")

# Product summary (not used in the join below)
product_df=df_clean.groupBy("StockCode","Description")\
    .agg({"Quantity":"avg","TotalPrice":"sum"})\
    .withColumnRenamed("avg(Quantity)","Avgqty")\
    .withColumnRenamed("sum(TotalPrice)","ProductRevenue")

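# Join the customer summary back onto the line items.
# Both sides carry a Country column, so it appears twice in the result;
# joining on ["CustomerID", "Country"] would keep a single copy.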
joined_df=customer_df.join(df_clean,on="CustomerID",how="inner")
joined_df.show(5)

+----------+--------------+-----------------+-----------+---------+---------+--------------------+--------+-------------------+---------+--------------+----------+-----+-------+
|CustomerID|       Country|       TotalSpend|TotalOrders|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|       Country|TotalPrice|Month|Weekday|
+----------+--------------+-----------------+-----------+---------+---------+--------------------+--------+-------------------+---------+--------------+----------+-----+-------+
|     17420|United Kingdom|598.8299999999999|         30|   536385|    22783|SET 3 WICKER OVAL...|       1|2010-12-01 09:56:00|    19.95|United Kingdom|     19.95|   12|      4|
|     17420|United Kingdom|598.8299999999999|         30|   536385|    22961|JAM MAKING SET PR...|      12|2010-12-01 09:56:00|     1.45|United Kingdom|      17.4|   12|      4|
|     17420|United Kingdom|598.8299999999999|         30|   536385|    22960|JAM MAKING SET WI...|       6|201

### Window Functions - Rank Top Customers in Each Country

In [22]:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("Country").orderBy(col("TotalSpend").desc())

ranked_df = customer_df.withColumn("Rank",rank().over(windowSpec))

ranked_df.filter("Rank<=3").orderBy("Country","Rank").show()

+----------+---------------+------------------+-----------+----+
|CustomerID|        Country|        TotalSpend|TotalOrders|Rank|
+----------+---------------+------------------+-----------+----+
|     12415|      Australia|124914.53000000003|        714|   1|
|     12431|      Australia|           5514.67|        181|   2|
|     12388|      Australia|2780.6600000000003|        100|   3|
|     12360|        Austria|           2662.06|        129|   1|
|     12865|        Austria|1568.2299999999996|         95|   2|
|     12818|        Austria|           1542.08|         14|   3|
|     12355|        Bahrain|             459.4|         13|   1|
|     12353|        Bahrain|              89.0|          4|   2|
|     12362|        Belgium|5226.2300000000005|        266|   1|
|     12449|        Belgium|           4067.29|        191|   2|
|     12417|        Belgium|            3212.8|        169|   3|
|     12769|         Brazil|1143.6000000000001|         32|   1|
|     17444|         Cana
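
One detail worth knowing about `rank()`: customers with equal `TotalSpend` share a rank, and the next rank skips ahead, so a `Rank<=3` filter can return more than three rows for a country when there are ties. The plain-Python sketch below mimics that behavior on made-up rows. `rank_within_partitions` is a hypothetical helper written for illustration, not Spark's implementation.

```python
from itertools import groupby

def rank_within_partitions(rows, partition_key, order_key):
    """Assign SQL-style rank: ties share a rank and the next rank skips ahead."""
    ranked = []
    # Sort by partition, then by the ordering column in descending order
    ordered = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        previous_value, current_rank = None, 0
        for position, row in enumerate(group, start=1):
            if row[order_key] != previous_value:
                current_rank = position
                previous_value = row[order_key]
            ranked.append({**row, "Rank": current_rank})
    return ranked

# Made-up rows, not taken from the dataset
toy = [
    {"CustomerID": 1, "Country": "A", "TotalSpend": 50.0},
    {"CustomerID": 2, "Country": "A", "TotalSpend": 50.0},
    {"CustomerID": 3, "Country": "A", "TotalSpend": 20.0},
    {"CustomerID": 4, "Country": "B", "TotalSpend": 10.0},
]
ranked_toy = rank_within_partitions(toy, "Country", "TotalSpend")
for row in ranked_toy:
    print(row["Country"], row["CustomerID"], row["Rank"])
```

Here the two tied customers in country A both get rank 1 and the third gets rank 3. If gap-free ranks or exactly one row per position are needed, `dense_rank` and `row_number` from `pyspark.sql.functions` are the usual alternatives.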

### Rollup & Cube

In [24]:
#Rollup: Revenue by Country & Month
df_clean.rollup("Country","Month")\
    .agg({"TotalPrice":"sum"})\
    .withColumnRenamed("sum(TotalPrice)","Revenue")\
    .orderBy("Country","Month")\
    .show(20)

+---------+-----+------------------+
|  Country|Month|           Revenue|
+---------+-----+------------------+
|     NULL| NULL| 8911407.903999643|
|Australia| NULL|         138521.31|
|Australia|    1| 9017.709999999995|
|Australia|    2|14695.419999999996|
|Australia|    3|17223.990000000013|
|Australia|    4| 771.6000000000001|
|Australia|    5|13638.410000000005|
|Australia|    6|25187.769999999997|
|Australia|    7|  4964.37999999999|
|Australia|    8|22489.199999999993|
|Australia|    9|5106.7300000000005|
|Australia|   10|          17150.53|
|Australia|   11| 7242.719999999999|
|Australia|   12|           1032.85|
|  Austria| NULL|          10198.68|
|  Austria|    2| 518.3600000000001|
|  Austria|    3|1708.1200000000001|
|  Austria|    4| 680.7800000000002|
|  Austria|    5|1249.4299999999998|
|  Austria|    7|1191.9499999999998|
+---------+-----+------------------+
only showing top 20 rows
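
In the rollup output, a `NULL` in `Month` marks a per-country subtotal, and the row with `NULL` in both columns is the grand total. The sketch below computes the same three levels in plain Python on made-up rows to show where those `NULL` rows come from. `rollup_sum` is a hypothetical helper, and `None` stands in for Spark's `NULL`.

```python
from collections import defaultdict

def rollup_sum(rows, keys, value):
    """Sum `value` for each prefix of `keys`, down to the grand total."""
    totals = defaultdict(float)
    for row in rows:
        for depth in range(len(keys), -1, -1):
            # Columns beyond `depth` are rolled up and reported as None
            group = tuple(row[k] if i < depth else None
                          for i, k in enumerate(keys))
            totals[group] += row[value]
    return dict(totals)

# Made-up rows, not taken from the dataset
toy = [
    {"Country": "A", "Month": 1, "TotalPrice": 10.0},
    {"Country": "A", "Month": 2, "TotalPrice": 5.0},
    {"Country": "B", "Month": 1, "TotalPrice": 7.0},
]
result = rollup_sum(toy, ["Country", "Month"], "TotalPrice")
print(result[(None, None)])  # 22.0 (grand total)
print(result[("A", None)])   # 15.0 (subtotal for country A)
print(result[("A", 1)])      # 10.0
```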



In [26]:
# Cube: Revenue by Country & Weekday

df_clean.cube("Country","Weekday")\
    .agg({"TotalPrice":"sum"})\
    .withColumnRenamed("sum(TotalPrice)","Revenue")\
    .orderBy("Country","Weekday")\
    .show(20)

+---------+-------+------------------+
|  Country|Weekday|           Revenue|
+---------+-------+------------------+
|     NULL|   NULL| 8911407.903999643|
|     NULL|      1| 792514.2209999886|
|     NULL|      2|1367146.4109999998|
|     NULL|      3|1700634.6310000008|
|     NULL|      4|1588336.1700000032|
|     NULL|      5| 1976859.070000013|
|     NULL|      6|1485917.4009999963|
|Australia|   NULL|         138521.31|
|Australia|      1|1743.9299999999998|
|Australia|      2|1392.6399999999999|
|Australia|      3|32526.090000000004|
|Australia|      4| 45057.23000000001|
|Australia|      5| 53555.98999999999|
|Australia|      6|           4245.43|
|  Austria|   NULL|          10198.68|
|  Austria|      1| 586.8499999999999|
|  Austria|      2|2807.8899999999994|
|  Austria|      3|2574.0199999999995|
|  Austria|      4|1695.8400000000001|
|  Austria|      5|           1556.33|
+---------+-------+------------------+
only showing top 20 rows
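
Compared with rollup, cube also emits the rows where `Country` is `NULL` but `Weekday` is set: the per-weekday totals across all countries. The difference comes down to which grouping sets each operation covers, which the following sketch enumerates. The two helper functions are hypothetical and only list column combinations; they do not aggregate anything.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of `columns`, which is what a cube aggregates over."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets

def rollup_grouping_sets(columns):
    """Only the prefixes of `columns`, which is what a rollup aggregates over."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

print(cube_grouping_sets(["Country", "Weekday"]))
# [('Country', 'Weekday'), ('Country',), ('Weekday',), ()]
print(rollup_grouping_sets(["Country", "Weekday"]))
# [('Country', 'Weekday'), ('Country',), ()]
```

With two columns the cube adds one grouping set, `('Weekday',)`, on top of what rollup produces. With more columns the gap widens, since a cube over n columns covers 2^n sets while a rollup covers n + 1.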



### Conclusion

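- Spark's DataFrame API covers the full workflow for a large transactional dataset, from loading raw CSV to grouped and windowed analysis.
- We covered data cleaning, feature engineering, aggregations, joins, window functions, and rollup/cube grouping.
- The cleaned dataset is a natural starting point for RFM (recency, frequency, monetary) analysis or market basket analysis.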