Dataset Name: **Northwind Traders Database**

Kaggle Link: https://www.kaggle.com/datasets/jealousleopard/northwind

Why this dataset?

- Contains 8 related tables (perfect for joins)
- Medium-sized dataset (realistic for practice)
- Classic relational database structure
- Includes customers, orders, products, employees, etc.

Dataset Tables:

- `Customers` - Customer information
- `Orders` - Order headers
- `OrderDetails` - Order line items
- `Products` - Product information
- `Employees` - Employee data
- `Categories` - Product categories
- `Shippers` - Shipping companies

## Initialize Spark Session and Load Data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("NorthwindAnalysis") \
    .getOrCreate()

In [2]:
# Load all tables
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
order_details = spark.read.csv("order_details.csv", header=True, inferSchema=True)
products = spark.read.csv("products.csv", header=True, inferSchema=True)
employees = spark.read.csv("employees.csv", header=True, inferSchema=True)
categories = spark.read.csv("categories.csv", header=True, inferSchema=True)
shippers = spark.read.csv("shippers.csv", header=True, inferSchema=True)
# suppliers = spark.read.csv("suppliers.csv", header=True, inferSchema=True)

AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/content/customers.csv.

## Transformation Examples with Northwind Data

Basic Transformations

In [None]:
# Filter - German customers
german_customers = customers.filter(col("Country") == "Germany")

# Select - Specific columns
customer_contact = customers.select("CustomerID", "CompanyName", "ContactName")

# WithColumn - Calculate order age
from pyspark.sql.functions import datediff, current_date
orders_with_age = orders.withColumn(
    "OrderAgeDays",
    datediff(current_date(), col("OrderDate"))
)

# Drop - Remove unnecessary columns
products_clean = products.drop("QuantityPerUnit")

Aggregations

In [None]:
# Product sales aggregation
product_sales = order_details.groupBy("ProductID") \
    .agg(
        sum("Quantity").alias("TotalUnitsSold"),
        sum(col("Quantity") * col("UnitPrice")).alias("TotalRevenue"),
        avg("UnitPrice").alias("AvgUnitPrice")
    )

# Employee order count
employee_performance = orders.groupBy("EmployeeID") \
    .agg(
        count("OrderID").alias("OrderCount"),
        min("OrderDate").alias("FirstOrderDate"),
        max("OrderDate").alias("LastOrderDate")
    )

Join Operations

In [None]:
# Complete order information
complete_orders = orders.join(order_details, "OrderID") \
                      .join(products.withColumnRenamed("unitPrice", "product_unitPrice"), "ProductID") \
                      .join(customers.withColumnRenamed("country", "customer_country"), "CustomerID") \
                      .join(employees.withColumnRenamed("city", "employee_city").withColumnRenamed("country", "employee_country"), "EmployeeID")

# Products with category names
products_with_categories = products.join(categories, "CategoryID")

# Left join to find unsold products
unsold_products = products.join(
    order_details,
    "ProductID",
    "left"
).filter(col("OrderID").isNull())

Window Functions

In [None]:
# Customer order ranking
customer_window = Window.partitionBy("CustomerID").orderBy(col("OrderDate").desc())
customer_orders_ranked = orders.withColumn(
    "OrderRank",
    rank().over(customer_window)
)

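# Monthly sales growth
# Group by year AND month so the same month in different years is not merged
monthly_sales = orders.join(order_details, "OrderID") \
    .groupBy(year("OrderDate").alias("Year"), month("OrderDate").alias("Month")) \
    .agg(sum(col("Quantity") * col("UnitPrice")).alias("MonthlySales"))

# Order chronologically so lag() really returns the previous month
sales_window = Window.orderBy("Year", "Month")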
monthly_growth = monthly_sales.withColumn(
    "PrevMonthSales",
    lag("MonthlySales").over(sales_window)
).withColumn(
    "GrowthPct",
    (col("MonthlySales") - col("PrevMonthSales")) / col("PrevMonthSales") * 100
)

## Action Examples

In [None]:
# Show results
complete_orders.show(5)

# Count records
print(f"Total customers: {customers.count()}")
print(f"Total orders: {orders.count()}")

# Collect top products to driver
top_products = product_sales.orderBy(col("TotalRevenue").desc()).take(5)

# Save results
complete_orders.write.mode("overwrite").parquet("output/complete_orders.parquet")
product_sales.write.mode("overwrite").csv("output/product_sales.csv", header=True)

## Complex Business Queries

Top 5 Customers by Revenue

In [None]:
top_customers = complete_orders.groupBy(
    "CustomerID", "CompanyName"
).agg(
    sum(col("Quantity") * col("UnitPrice")).alias("TotalSpent"),
    countDistinct("OrderID").alias("OrderCount")
).orderBy(
    col("TotalSpent").desc()
).limit(5)

Employee Sales Performance

In [None]:
employee_sales = complete_orders.groupBy(
    "EmployeeID",
    "employeeName"
).agg(
    sum(col("Quantity") * col("UnitPrice")).alias("TotalSales"),
    countDistinct("OrderID").alias("OrderCount"),
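    # Revenue per distinct order; avg() over the joined rows would average line items instead
    (sum(col("Quantity") * col("UnitPrice")) / countDistinct("OrderID")).alias("AvgOrderValue")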
).orderBy(
    col("TotalSales").desc()
)

Product Category Analysis

In [None]:
category_performance = products_with_categories.join(
    order_details.withColumnRenamed("unitPrice", "order_unitPrice"), "ProductID"
).groupBy(
    "CategoryID", "CategoryName"
).agg(
    sum(col("Quantity") * col("order_unitPrice")).alias("CategoryRevenue"),
    avg(col("order_unitPrice")).alias("AvgProductPrice"),
    countDistinct("ProductID").alias("ProductCount")
)

This notebook demonstrates various PySpark transformations and actions applied to the Northwind Traders database. The goal is to showcase common data manipulation techniques using Spark DataFrames.

The dataset tables used include:
- `Customers`
- `Orders`
- `OrderDetails`
- `Products`
- `Employees`
- `Categories`
- `Shippers`

### PySpark Transformations

Transformations are lazy operations that define the data manipulation logic. They do not execute immediately but build a plan that is executed when an action is called.

---

#### Filtering Data

Filtering selects rows based on a condition. Below is an example of filtering the `customers` DataFrame to get only customers from Germany.


#### Selecting Columns

- `select` returns a new DataFrame containing only the named columns. Here, we select `CustomerID`, `CompanyName`, and `ContactName` from the `customers` DataFrame.



#### Adding New Columns (`withColumn`)

- `withColumn` is used to add a new column to a DataFrame or replace an existing one. In this example, we calculate the age of each order in days.



#### Dropping Columns (`drop`)

- The `drop` transformation removes specified columns from a DataFrame. Here, we remove the `QuantityPerUnit` column from the `products` DataFrame.

---

#### Aggregations (`groupBy` and `agg`)

Aggregations are used to group data by one or more columns and then perform aggregate functions (like sum, count, average, min, max) on other columns.

Below is an example of calculating total units sold, total revenue, and average unit price for each product.



Here, we aggregate the `orders` DataFrame to find the order count, first order date, and last order date for each employee.

---

#### Join Operations

Joins combine data from two or more DataFrames based on related columns. Different types of joins exist, such as inner, outer, left, and right joins.

This example demonstrates joining multiple tables (`orders`, `order_details`, `products`, `customers`, `employees`) to create a comprehensive view of orders.

This code joins `products` with `categories` to add category names to the product information.

This example uses a left join to identify products that have not been sold by checking for null OrderID values in the joined result.

---


#### Window Functions

Window functions perform calculations across a set of DataFrame rows that are related to the current row. They are used for tasks like ranking, calculating moving averages, and accessing previous or subsequent rows.

This example uses a window function to rank orders for each customer based on the order date.

This code calculates the monthly sales and then uses a window function (`lag`) to determine the previous month's sales and calculate the monthly growth percentage.

---

### PySpark Actions

Actions are operations that trigger execution of the transformation plan and either return a result to the driver program or write data to storage.

#### Displaying Results (`show`)

- The `show()` action displays the top rows of a DataFrame.

#### Counting Records (`count`)

- The `count()` action returns the number of rows in a DataFrame.

#### Collecting Data (`collect`, `take`)

- Actions like `collect()` and `take()` return data from the DataFrame to the driver program. `collect()` brings all data (use with caution on large datasets), while `take(n)` brings the first `n` rows. Here, we collect the top 5 products by revenue.

#### Saving Data (`write`)

- The `write` action saves the contents of a DataFrame to various data sources (e.g., Parquet, CSV, JSON). The `mode("overwrite")` option is used here to replace the file if it already exists.