### Scenario: E-commerce Sales Data Analysis 
#### Business Context
You're working with an e-commerce company that needs to analyze sales data to identify top-performing products and customer buying patterns.

##### Tasks
- Task 1: Calculate Total Revenue per Order
- Task 2: Filter High-Value Orders (> $500)
- Task 3: Aggregate Sales by Category
- Task 4: Find Top 3 Products by Revenue
- Task 5: Add Ranking within Category
- Task 6: Convert Date and Extract Features
- Task 7: Customer Segmentation (Multiple Orders)

In [3]:
# Create the Spark session
from pyspark.sql import SparkSession
# Note: the wildcard import shadows Python built-ins such as sum, min, max, and round
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()


In [4]:
# Sample sales data
sales_data = [
    (1, "2024-01-15", "PROD001", "Electronics", 5, 299.99, "CUST001", "USA"),
    (2, "2024-01-15", "PROD002", "Clothing", 2, 49.99, "CUST002", "UK"),
    (3, "2024-01-16", "PROD001", "Electronics", 1, 299.99, "CUST003", "USA"),
    (4, "2024-01-16", "PROD003", "Books", 3, 15.99, "CUST001", "USA"),
    (5, "2024-01-17", "PROD002", "Clothing", 1, 49.99, "CUST004", "Canada"),
    (6, "2024-01-17", "PROD004", "Electronics", 2, 799.99, "CUST002", "UK"),
]

columns = ["order_id", "order_date", "product_id", "category", 
           "quantity", "unit_price", "customer_id", "country"]

df = spark.createDataFrame(sales_data, columns)

In [None]:
# Task 1: Calculate Total Revenue per Order
df_revenue = df.withColumn("total_revenue", col("quantity") * col("unit_price"))

# Task 2: Filter High-Value Orders (> $500)
df_high_value = df_revenue.filter(col("total_revenue") > 500)

# Task 3: Aggregate Sales by Category
category_sales = df_revenue.groupBy("category").agg(
    sum("total_revenue").alias("total_sales"),
    count("order_id").alias("order_count"),
    avg("total_revenue").alias("avg_order_value")
).orderBy(desc("total_sales"))
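
The cells above do not display their results, so as a cross-check the same arithmetic for Tasks 1 to 3 can be reproduced in plain Python on the six sample rows. This is only a sanity-check sketch, not part of the Spark pipeline; the names `orders`, `revenue`, `high_value`, and `by_category` are introduced here for illustration, and values are rounded to two decimals to sidestep floating-point noise.

```python
from collections import defaultdict

# (order_id, category, quantity, unit_price) taken from the sample sales data
orders = [
    (1, "Electronics", 5, 299.99),
    (2, "Clothing", 2, 49.99),
    (3, "Electronics", 1, 299.99),
    (4, "Books", 3, 15.99),
    (5, "Clothing", 1, 49.99),
    (6, "Electronics", 2, 799.99),
]

# Task 1: revenue per order = quantity * unit_price
revenue = {oid: round(qty * price, 2) for oid, _, qty, price in orders}

# Task 2: orders whose revenue exceeds $500
high_value = sorted(oid for oid, rev in revenue.items() if rev > 500)

# Task 3: total revenue per category
totals = defaultdict(float)
for oid, category, _, _ in orders:
    totals[category] += revenue[oid]
by_category = {category: round(total, 2) for category, total in totals.items()}

print(high_value)   # [1, 6]
print(by_category)  # {'Electronics': 3399.92, 'Clothing': 149.97, 'Books': 47.97}
```

If the Spark results differ from these figures beyond small floating-point differences, the input rows or the revenue expression are the first things to check.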

In [None]:
# Task 4: Find Top 3 Products by Revenue
top_products = df_revenue.groupBy("product_id", "category").agg(
    sum("total_revenue").alias("product_revenue")
).orderBy(desc("product_revenue")).limit(3)