Silver Table

In [0]:
from pyspark.sql.functions import col, to_timestamp  # explicit imports avoid shadowing Python builtins such as sum and max

orders_df = spark.table("ecommerce.bronze_olist.orders")

Applying Silver transformations: parse the timestamp columns and remove duplicate orders so the table is safe for business use

In [0]:
orders_clean_df = (
    orders_df
    .withColumn("order_purchase_ts", to_timestamp("order_purchase_timestamp"))
    .withColumn("order_approved_ts", to_timestamp("order_approved_at"))
    .withColumn("order_delivered_customer_ts", to_timestamp("order_delivered_customer_date"))
    .withColumn("order_estimated_delivery_ts", to_timestamp("order_estimated_delivery_date"))
    .dropDuplicates(["order_id"])
)

Creating the Silver schema

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS ecommerce.silver_olist;

In [0]:
orders_clean_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("ecommerce.silver_olist.orders_clean")

In [0]:
%sql
SELECT COUNT(*) FROM ecommerce.silver_olist.orders_clean;

COUNT(*)
99441


orders_clean Silver table created successfully
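
The `dropDuplicates(["order_id"])` step above keeps one row per order, but Spark does not promise which row survives when duplicates differ in other columns. The plain-Python sketch below illustrates a deterministic alternative: keep the most recent row per key. The `ingested_at` column and the sample rows are hypothetical and are not part of the Olist tables; in Spark the same rule is usually expressed with `row_number()` over a window partitioned by the key.

```python
from datetime import datetime


def dedupe_keep_latest(rows, key, order_by):
    """Keep one row per `key`, choosing the row with the largest `order_by` value."""
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when this one is strictly newer.
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


# Hypothetical sample: order "a1" was ingested twice with different statuses.
rows = [
    {"order_id": "a1", "order_status": "shipped", "ingested_at": datetime(2018, 1, 1, 10, 0)},
    {"order_id": "a1", "order_status": "delivered", "ingested_at": datetime(2018, 1, 2, 9, 30)},
    {"order_id": "b2", "order_status": "delivered", "ingested_at": datetime(2018, 1, 3, 8, 0)},
]

deduped = dedupe_keep_latest(rows, key="order_id", order_by="ingested_at")
```

If the Bronze data contains only exact duplicate rows, `dropDuplicates` alone is sufficient and this extra rule is not needed.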

Creating customers_clean (Silver): remove duplicates and standardize column names so the table can serve as the customer dimension for joins

Read Bronze customers

In [0]:
customers_df = spark.table("ecommerce.bronze_olist.customers")

Clean customers (Silver logic): deduplicate on customer_id and standardize the zip code column name

In [0]:
customers_clean_df = (
    customers_df
    .dropDuplicates(["customer_id"])
    .withColumnRenamed("customer_zip_code_prefix", "zip_code_prefix")
)

Write Silver table

In [0]:
customers_clean_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("ecommerce.silver_olist.customers_clean")

Verifying with SQL

In [0]:
%sql
SELECT COUNT(*) FROM ecommerce.silver_olist.customers_clean;

COUNT(*)
99441


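Creating orders_with_customers (first join). This first version keeps the customer_id column from both tables, so the result has two columns with the same name.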

In [0]:
orders_with_customers_df = (
    spark.table("ecommerce.silver_olist.orders_clean").alias("o")
    .join(
        spark.table("ecommerce.silver_olist.customers_clean").alias("c"),
        col("o.customer_id") == col("c.customer_id"),
        "left"
    )
)

Joined Silver table, with the customers-side customer_id column dropped

In [0]:
from pyspark.sql.functions import col

orders_with_customers_df = (
    spark.table("ecommerce.silver_olist.orders_clean").alias("o")
    .join(
        spark.table("ecommerce.silver_olist.customers_clean").alias("c"),
        col("o.customer_id") == col("c.customer_id"),
        "left"
    )
    .drop(col("c.customer_id"))   # drop the customers-side customer_id so the column name is not duplicated
)

In [0]:
orders_with_customers_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("ecommerce.silver_olist.orders_with_customers")

In [0]:
%sql
SELECT COUNT(*) FROM ecommerce.silver_olist.orders_with_customers;

COUNT(*)
99441


Silver orders_with_customers table created successfully
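
The matching counts (99441 before and after the join) are a useful check, because a left join preserves the left row count only when the join key is unique on the right side. The plain-Python sketch below illustrates that rule with hypothetical key lists; `left_join_row_count` is a name introduced here for illustration, and null keys are ignored for simplicity.

```python
from collections import Counter


def left_join_row_count(left_keys, right_keys):
    """Number of rows a left join produces for the given join keys."""
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or exactly one output row when there is no match.
    return sum(max(right_counts[k], 1) for k in left_keys)


order_customer_ids = ["c1", "c2", "c3"]  # hypothetical orders side

unique_customers = ["c1", "c2"]            # key is unique on the right
duplicated_customers = ["c1", "c1", "c2"]  # "c1" appears twice

count_unique = left_join_row_count(order_customer_ids, unique_customers)
count_duplicated = left_join_row_count(order_customer_ids, duplicated_customers)
```

With the unique key list the output has as many rows as the orders side; the duplicated list inflates the count, which is why customers are deduplicated on customer_id before the join.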

SILVER DELIVERY-DELAY LOGIC

Read the Silver Base Table

In [0]:
from pyspark.sql.functions import col, datediff, when  # only the functions this section uses

orders_wc_df = spark.table(
    "ecommerce.silver_olist.orders_with_customers"
)

Applying delivery-delay business logic

Delay (in days) = actual delivery date − estimated delivery date

If delay > 0 → the order is delayed

Orders with no delivery date get a null delay and are flagged is_delayed = 0

In [0]:
orders_delivery_status_df = (
    orders_wc_df
    .withColumn(
        "delivery_delay_days",
        datediff(
            col("order_delivered_customer_ts"),
            col("order_estimated_delivery_ts")
        )
    )
    .withColumn(
        "is_delayed",
        when(col("delivery_delay_days") > 0, 1).otherwise(0)
    )
)

Write the New Silver Table

In [0]:
orders_delivery_status_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(
        "ecommerce.silver_olist.orders_delivery_status"
    )

Verify

In [0]:
%sql
SELECT 
  COUNT(*) AS total_orders,
  SUM(is_delayed) AS delayed_orders
FROM ecommerce.silver_olist.orders_delivery_status;

total_orders,delayed_orders
99441,6535


An order is counted as delayed if its actual delivery date is later than its estimated delivery date; here that is 6535 of 99441 orders.
**Silver delivery-delay table created successfully**
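
The delay rule can also be written as a small plain-Python reference, which may help when unit-testing the Spark columns. This is a sketch, not the notebook's code: the function names are introduced here for illustration, and it assumes the same semantics as `datediff`, which compares calendar dates and ignores the time of day.

```python
from datetime import datetime
from typing import Optional


def delivery_delay_days(delivered: Optional[datetime],
                        estimated: Optional[datetime]) -> Optional[int]:
    """Whole calendar days between estimated and actual delivery, or None if either is missing."""
    if delivered is None or estimated is None:
        return None
    # Compare dates only, so the time of day does not affect the result.
    return (delivered.date() - estimated.date()).days


def is_delayed(delay_days: Optional[int]) -> int:
    """1 when the order arrived after the estimate, 0 otherwise (including undelivered orders)."""
    return 1 if delay_days is not None and delay_days > 0 else 0


estimate = datetime(2018, 1, 5, 0, 0)
same_day = delivery_delay_days(datetime(2018, 1, 5, 23, 0), estimate)
next_day = delivery_delay_days(datetime(2018, 1, 6, 0, 30), estimate)
undelivered = delivery_delay_days(None, estimate)
```

Because undelivered orders map to 0, the delayed_orders total counts only orders that were actually delivered late. A separate flag for undelivered orders would be needed to distinguish them from on-time deliveries.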