# Lab Solution - Grouping and Aggregating E-Commerce Data

In this lab, you'll practice working with grouping and aggregation in Spark using a dataset of e-commerce transactions. You'll perform various analyses to uncover patterns and insights in customer purchasing behavior.

### Objectives
- Use `groupBy` operations to summarize data
- Implement multiple aggregations
- Apply different ordering techniques
- (Bonus) Use window functions for advanced analytics

## Initial Setup

Load the retail transactions data and examine its structure.

In [0]:
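## Note: this wildcard import shadows Python built-ins such as sum and round
## with Spark's column functions of the same name
from pyspark.sql.functions import *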

## Read the e-commerce transactions data
transactions_df = spark.read.table("samples.bakehouse.sales_transactions")

## Print the schema, then display a sample of the data
transactions_df.printSchema()

display(transactions_df.limit(10))

## Basic Grouping Operations

Let's start with simple grouping operations to understand product sales patterns.

In [0]:
# 1. Group the data by products and count the number of sales
# 2. Order the results by the most popular products

In [0]:
## Count transactions by product
product_counts = transactions_df \
    .groupBy("product") \
    .count() \
    .orderBy(desc("count"))

display(product_counts)

## Multiple Aggregations

Now let's perform multiple aggregations to get deeper insights.

In [0]:
# 1. Analyze sales by payment method
# 2. Calculate the total revenue, average transaction value, and count of transactions for each payment method
# 3. Order by total revenue (highest first)

In [0]:
## Analyze sales by payment method
payment_analysis = transactions_df \
    .groupBy("paymentMethod") \
    .agg(
        round(sum(col("totalPrice")), 2).alias("total_revenue"),
        round(avg(col("totalPrice")), 2).alias("avg_transaction_value"),
        count("*").alias("transaction_count")
    ) \
    .orderBy(desc("total_revenue"))

display(payment_analysis)
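
To make the three aggregates concrete, here is a plain-Python sketch of roughly what `groupBy("paymentMethod").agg(...)` followed by `orderBy(desc("total_revenue"))` computes. It does not use Spark, and the rows and payment method names are made-up values for illustration, not data from the `sales_transactions` table.

```python
from collections import defaultdict

# Hypothetical (paymentMethod, totalPrice) rows; not taken from the real table.
rows = [
    ("visa", 10.0),
    ("visa", 20.0),
    ("amex", 5.0),
    ("mastercard", 12.0),
    ("amex", 7.5),
]

# Group: collect every totalPrice under its paymentMethod key.
groups = defaultdict(list)
for method, price in rows:
    groups[method].append(price)

# Aggregate: one output row per group with sum, average, and count.
summary = [
    {
        "paymentMethod": method,
        "total_revenue": round(sum(prices), 2),
        "avg_transaction_value": round(sum(prices) / len(prices), 2),
        "transaction_count": len(prices),
    }
    for method, prices in groups.items()
]

# Order by total revenue, highest first (the orderBy(desc(...)) step).
summary.sort(key=lambda r: r["total_revenue"], reverse=True)

for row in summary:
    print(row)
```

Each input row contributes to exactly one group, and each group collapses to a single output row, which is why all three aggregates can be computed in one `agg` call.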

## Bonus Challenge: Window Functions

If you have time, try using window functions for advanced analytics.

In [0]:

## First, calculate total revenue by product
product_revenue_df = transactions_df \
    .groupBy("product") \
    .agg(
        round(sum(col("totalPrice")), 2).alias("total_revenue")
    )

## Using window functions to add rankings
## Ranking products by total revenue

from pyspark.sql.window import Window

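## Create window spec for ranking products by revenue
## Note: with no partitionBy, Spark moves all rows to a single partition;
## that is acceptable here because the data is already aggregated to one row per product
window_by_revenue = Window.orderBy(desc("total_revenue"))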

## Add rankings
ranked_products_df = product_revenue_df \
    .withColumn("revenue_rank", rank().over(window_by_revenue))

## Display the rankings
display(ranked_products_df)
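
One detail worth knowing about `rank()` is how it treats ties: tied rows share a rank, and the next rank skips ahead, whereas `dense_rank()` leaves no gaps. The plain-Python sketch below models that difference without Spark. The product names and revenue figures are hypothetical and chosen so that two products tie.

```python
# Hypothetical (product, total_revenue) pairs with a deliberate tie at 500.0.
revenues = [
    ("Cookie A", 500.0),
    ("Cookie B", 300.0),
    ("Cookie C", 500.0),
    ("Cookie D", 100.0),
]

# Sort by revenue, highest first, like Window.orderBy(desc("total_revenue")).
ordered = sorted(revenues, key=lambda r: r[1], reverse=True)

ranked = []
for position, (product, revenue) in enumerate(ordered, start=1):
    if ranked and ranked[-1]["total_revenue"] == revenue:
        # Tie with the previous row: reuse both of its rank values.
        rank_value = ranked[-1]["rank"]
        dense_value = ranked[-1]["dense_rank"]
    else:
        # rank() jumps to the row position, so it leaves gaps after ties.
        rank_value = position
        # dense_rank() moves to the next integer, so it never leaves gaps.
        dense_value = ranked[-1]["dense_rank"] + 1 if ranked else 1
    ranked.append(
        {
            "product": product,
            "total_revenue": revenue,
            "rank": rank_value,
            "dense_rank": dense_value,
        }
    )

for row in ranked:
    print(row)
```

If gaps in the ranking would be confusing for a report, swapping `rank()` for `dense_rank()` in the solution is a reasonable alternative.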

# Lab Solution - Working with Complex Data Types in E-Commerce Data

In this lab, you'll practice working with complex data types in Spark, including handling JSON strings, converting them to structured types, and manipulating nested data structures.

## Scenario

You are a data engineer at an e-commerce company that collects data about customer orders, product reviews, and customer browsing behavior. The data contains nested structures that need to be properly processed for analysis.

### Objectives
- Convert JSON string data to Spark SQL native complex types
- Work with arrays and structs
- Use functions like `explode`, `collect_list`, and `collect_set`
- Extract and analyze valuable insights from nested data

## Dataset Setup

Run the following cell to create the sample dataset used in this lab.

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json

# Define our sample e-commerce data with JSON strings
data = [
    (1001, "Jordan Smith", "jordan.smith@email.com", "2022-03-15",
     """["loyal", "premium", "tech-enthusiast"]""",
     """[
         {"order_id": "O8823", "date": "2023-01-05", "total": 799.99, "items": [
           {"product_id": "PHONE-256", "name": "Smartphone XS", "price": 699.99, "quantity": 1},
           {"product_id": "CASE-101", "name": "Phone Case", "price": 29.99, "quantity": 1},
           {"product_id": "CHGR-201", "name": "Fast Charger", "price": 49.99, "quantity": 1}
         ]},
         {"order_id": "O9012", "date": "2023-02-18", "total": 129.95, "items": [
           {"product_id": "HDPHN-110", "name": "Wireless Headphones", "price": 129.95, "quantity": 1}
         ]}
       ]""",
     """["smartphones", "accessories", "audio", "wearables"]"""
    ),
    
    (1002, "Alex Johnson", "alex.j@email.com", "2021-11-20",
     """["new", "standard", "home-office"]""",
     """[
         {"order_id": "O8901", "date": "2023-01-10", "total": 1299.99, "items": [
           {"product_id": "LAPTOP-15", "name": "Ultrabook Pro", "price": 1199.99, "quantity": 1},
           {"product_id": "MOUSE-202", "name": "Ergonomic Mouse", "price": 49.99, "quantity": 1},
           {"product_id": "KYBRD-303", "name": "Mechanical Keyboard", "price": 89.99, "quantity": 1}
         ]}
       ]""",
     """["laptops", "office-equipment", "monitors", "storage"]"""
    ),
    
    (1003, "Taylor Williams", "t.williams@email.com", "2022-08-05",
     """["standard", "gamer"]""",
     """[
         {"order_id": "O9188", "date": "2023-02-01", "total": 2099.97, "items": [
           {"product_id": "GPU-3080", "name": "Graphics Card RTX", "price": 899.99, "quantity": 1},
           {"product_id": "CPU-i9", "name": "Processor i9", "price": 499.99, "quantity": 1},
           {"product_id": "RAM-32GB", "name": "Gaming RAM 32GB", "price": 189.99, "quantity": 2},
           {"product_id": "MBOARD-Z", "name": "Gaming Motherboard", "price": 319.99, "quantity": 1}
         ]}
       ]""",
     """["gaming", "pc-components", "monitors", "accessories"]"""
    ),
    
    (1004, "Morgan Lee", "morgan.lee@email.com", "2022-06-10",
     """["standard", "photography"]""",
     """[
         {"order_id": "O9021", "date": "2023-01-15", "total": 3299.98, "items": [
           {"product_id": "CAM-DSLR", "name": "Professional Camera", "price": 2499.99, "quantity": 1},
           {"product_id": "LENS-50mm", "name": "Prime Lens", "price": 349.99, "quantity": 1},
           {"product_id": "TRIPOD-P", "name": "Premium Tripod", "price": 149.99, "quantity": 1},
           {"product_id": "SDCARD-128", "name": "Memory Card 128GB", "price": 79.99, "quantity": 3}
         ]},
         {"order_id": "O9254", "date": "2023-02-28", "total": 299.98, "items": [
           {"product_id": "BAG-CAM", "name": "Camera Bag", "price": 189.99, "quantity": 1},
           {"product_id": "CLEAN-KIT", "name": "Lens Cleaning Kit", "price": 29.99, "quantity": 1}
         ]}
       ]""",
     """["cameras", "photography", "lenses", "accessories"]"""
    ),
    
    (1005, "Casey Rivera", "casey.r@email.com", "2021-09-30",
     """["premium", "smart-home"]""",
     """[
         {"order_id": "O8765", "date": "2023-01-02", "total": 1029.95, "items": [
           {"product_id": "SMHUB-01", "name": "Smart Home Hub", "price": 249.99, "quantity": 1},
           {"product_id": "SMSPK-02", "name": "Smart Speaker", "price": 179.99, "quantity": 2},
           {"product_id": "SMBLB-03", "name": "Smart Bulbs Pack", "price": 119.99, "quantity": 3},
           {"product_id": "SMSENS-04", "name": "Motion Sensors", "price": 89.99, "quantity": 1}
         ]},
         {"order_id": "O9181", "date": "2023-02-15", "total": 349.98, "items": [
           {"product_id": "SMDLOCK-05", "name": "Smart Door Lock", "price": 249.99, "quantity": 1},
           {"product_id": "SMCAM-06", "name": "Indoor Camera", "price": 99.99, "quantity": 1}
         ]}
       ]""",
     """["smart-home", "security", "automation", "speakers"]"""
    )
]

# Define the schema for the raw data
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("registration_date", StringType(), True),
    StructField("tags", StringType(), True),
    StructField("recent_orders", StringType(), True),
    StructField("browsing_history", StringType(), True)
])

# Create DataFrame
ecommerce_df = spark.createDataFrame(data, schema)

# Create temporary view
ecommerce_df.createOrReplaceTempView("ecommerce_raw")

#### Querying the newly created temporary view

In [0]:
%sql
select * from ecommerce_raw

## Load and Inspect Raw Data with JSON Strings

Load the sample dataset from the temporary view and examine its columns, several of which contain JSON strings.

In [0]:
## Read the sample dataset from the temporary view
ecommerce_df = spark.read.table("ecommerce_raw")

## Examine the schema and display sample data
ecommerce_df.printSchema()
display(ecommerce_df)

## Convert JSON Strings to Structured Types

The `tags`, `recent_orders`, and `browsing_history` columns contain JSON strings. Let's convert them to proper Spark structured types.

In [0]:
# 1. Get a sample of the JSON strings in each column
# 2. Infer schemas from the JSON samples
# 3. Convert the JSON strings to structured types using from_json and display the resulting DataFrame

In [0]:
## Get sample JSON strings
tags_json = ecommerce_df.select("tags").limit(1).collect()[0][0]
recent_orders_json = ecommerce_df.select("recent_orders").limit(1).collect()[0][0]
browsing_history_json = ecommerce_df.select("browsing_history").limit(1).collect()[0][0]

print("Tags sample:", tags_json)
print("\nRecent orders sample:", recent_orders_json)
print("\nBrowsing history sample:", browsing_history_json)

In [0]:
## Infer a schema for each column from its JSON sample
tags_schema = schema_of_json(lit(tags_json))
recent_orders_schema = schema_of_json(lit(recent_orders_json))
browsing_history_schema = schema_of_json(lit(browsing_history_json))
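
Because each schema is inferred from a single sample string, any field missing from that sample will not appear in the inferred schema, and `from_json` will then ignore it in other rows. Before relying on the inference, it can help to check the sample's shape. The sketch below does this with Python's standard `json` module on a shortened copy of the first customer's `recent_orders` string. The `describe` helper is hypothetical and only approximates Spark's type names; the real `schema_of_json` output may differ in format and field order.

```python
import json

# Shortened copy of the first customer's recent_orders string from the setup cell.
recent_orders_json = """[
  {"order_id": "O8823", "date": "2023-01-05", "total": 799.99, "items": [
    {"product_id": "PHONE-256", "name": "Smartphone XS", "price": 699.99, "quantity": 1}
  ]}
]"""


def describe(value):
    """Return a rough, Spark-like type description of a parsed JSON value."""
    if isinstance(value, list):
        # Describe an array by its first element; empty arrays give no type hint.
        inner = describe(value[0]) if value else "unknown"
        return f"array<{inner}>"
    if isinstance(value, dict):
        fields = ", ".join(f"{key}: {describe(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"


shape = describe(json.loads(recent_orders_json))
print(shape)
```

Here the first row happens to contain every field used in the dataset, so inferring from it is safe. With messier data, an explicitly declared schema is usually the more robust choice.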

In [0]:
## Convert the JSON strings to structured types using from_json
parsed_df = ecommerce_df.select(
    "customer_id",
    "name",
    "email",
    "registration_date",
    from_json("tags", tags_schema).alias("tags"),
    from_json("recent_orders", recent_orders_schema).alias("recent_orders"),
    from_json("browsing_history", browsing_history_schema).alias("browsing_history")
)

## Examine the schema and display sample data
parsed_df.printSchema()
display(parsed_df)

## Working with Arrays

Now that we have proper structured data, let's analyze the customer tags and browsing history.

In [0]:
# 1. Calculate the number of tags and browsing history items for each customer
# 2. Explode the tags array to see all unique customer tags
# 3. Find the most common customer tags across all customers
# HINT: use the `size` function (or the closely related `array_size`) to count array elements

In [0]:
## Calculate the number of tags and browsing history items for each customer
array_sizes_df = parsed_df.select(
    "customer_id",
    "name",
    size("tags").alias("num_tags"),
    size("browsing_history").alias("num_browsing_categories")
)

display(array_sizes_df)

In [0]:
## Explode tags to see all customer categorizations
exploded_tags_df = parsed_df.select(
    "customer_id",
    "name",
    explode("tags").alias("tag")
)

display(exploded_tags_df)
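
`explode` produces one output row per array element and drops rows whose array is empty or null; `explode_outer` keeps such rows and emits a null instead. Every customer in this lab has at least one tag, so the difference does not show up here. The plain-Python sketch below models both behaviors on hypothetical rows. The `_model` functions are illustrations, not Spark APIs.

```python
# Hypothetical (customer_id, tags) rows; the last two have nothing to explode.
customers = [
    ("1001", ["loyal", "premium"]),
    ("1002", ["new"]),
    ("1003", []),
    ("1004", None),
]


def explode_model(rows):
    """Model of explode: one row per element; empty or null arrays yield no rows."""
    return [(cid, tag) for cid, tags in rows for tag in (tags or [])]


def explode_outer_model(rows):
    """Model of explode_outer: empty or null arrays yield one row with None."""
    out = []
    for cid, tags in rows:
        if tags:
            out.extend((cid, tag) for tag in tags)
        else:
            out.append((cid, None))
    return out


print(explode_model(customers))
print(explode_outer_model(customers))
```

When customers without tags must still appear in downstream counts, `explode_outer` is the variant to reach for.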

In [0]:
## Find the most common customer tags
## Count frequency of each tag
tag_counts_df = exploded_tags_df.groupBy("tag").count().orderBy(desc("count"))
display(tag_counts_df)

In [0]:
# 1. Explode the recent_orders array to analyze individual orders
# 2. Calculate total revenue per customer

In [0]:
## Explode recent_orders to analyze individual orders
orders_df = parsed_df.select(
    "customer_id",
    "name",
    explode("recent_orders").alias("order")
)

## Calculate total revenue per customer
customer_revenue_df = orders_df.groupBy(
    "customer_id",
    "name"
).agg(
    round(sum("order.total"), 2).alias("total_revenue"),
    count("order.order_id").alias("order_count")
).orderBy(desc("total_revenue"))

display(customer_revenue_df)

## Bonus Challenge: Analyze Customer Purchasing Patterns

Let's use the `collect_list` and `collect_set` aggregate functions to create summaries of customer purchasing patterns.

In [0]:
## First, create a flattened view of orders
order_items_df = orders_df.select(
    "customer_id",
    "name",
    "order.order_id",
    "order.date",
    explode("order.items").alias("item")
)

## Now extract the name field from each item
item_details_df = order_items_df.selectExpr(
    "customer_id",
    "name",
    "item.name as product_name"
)

# Inspect the data
display(item_details_df)

In [0]:
## Collect all products purchased by each customer_id into two new columns:
## "all_products_purchased" (keeps duplicates) and "unique_products_purchased" (deduplicated)
customer_products_df = item_details_df.groupBy(
    "customer_id"
).agg(
    collect_list("product_name").alias("all_products_purchased"),
    collect_set("product_name").alias("unique_products_purchased")
)

display(customer_products_df)
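
In this lab's sample data no customer has the same product name twice, so the two columns end up holding the same elements. The difference only appears when a product repeats: `collect_list` keeps every occurrence, while `collect_set` keeps each distinct value once. The plain-Python sketch below models this on hypothetical rows where one customer buys the same product twice.

```python
from collections import defaultdict

# Hypothetical (customer_id, product_name) rows; customer 1001 repeats a product.
rows = [
    ("1001", "Phone Case"),
    ("1001", "Fast Charger"),
    ("1001", "Phone Case"),
    ("1002", "Ergonomic Mouse"),
]

all_products = defaultdict(list)    # models collect_list: duplicates kept
unique_products = defaultdict(set)  # models collect_set: duplicates removed

for customer_id, product_name in rows:
    all_products[customer_id].append(product_name)
    unique_products[customer_id].add(product_name)

for customer_id in all_products:
    print(customer_id, all_products[customer_id], sorted(unique_products[customer_id]))
```

In Spark, the element order of both results depends on row order after the shuffle and is not guaranteed, so avoid relying on it when comparing outputs.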