# 🏗️ Simple Medallion Pipeline

A beginner-friendly guide to building data pipelines with the medallion architecture.

## What is Medallion Architecture?

**Bronze → Silver → Gold**

- **Bronze**: Raw data (exactly as received)
- **Silver**: Clean data (validated and standardized) 
- **Gold**: Business data (aggregated and ready for analytics)

## Benefits
- **Traceability**: Any issue in a downstream table can be traced back to the raw data
- **Flexibility**: Silver and gold tables can be rebuilt from bronze when business rules change
- **Quality**: Each layer adds its own validation step
- **Performance**: Each layer's tables can be optimized for their own use case
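
The flexibility point can be illustrated without Spark. The snippet below is a minimal plain-Python sketch, not part of the notebook: `build_silver` and its `min_amount` parameter are hypothetical names used only to show that an untouched bronze layer lets you replay the data under a new rule.

```python
# Bronze: raw records kept untouched (amounts are still strings).
BRONZE = (
    {"order_id": "001", "amount": "1200.00"},
    {"order_id": "002", "amount": "25"},
    {"order_id": "003", "amount": "800.0"},
)


def build_silver(bronze, min_amount):
    """Rebuild the silver layer from bronze under a given business rule.

    `min_amount` is a hypothetical rule parameter used only for illustration.
    """
    silver = []
    for record in bronze:
        amount = float(record["amount"])
        if amount > min_amount:
            silver.append({"order_id": record["order_id"], "amount": amount})
    return silver


# Original rule: keep every order with a positive amount.
silver_v1 = build_silver(BRONZE, min_amount=0)

# Changed rule: only orders above 100 count. Bronze is simply replayed.
silver_v2 = build_silver(BRONZE, min_amount=100)

print([r["order_id"] for r in silver_v1])
print([r["order_id"] for r in silver_v2])
```

Because bronze is never modified, changing the rule means rerunning one function rather than asking the source system for the data again.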

## 1. Setup Environment

Let's start with basic setup and libraries.

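In [0]:
# Import libraries
# Note: the wildcard import replaces Python built-ins such as sum and round
# with Spark column functions; the cells below rely on that.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Reuse the session the notebook platform provides, or create one when running elsewhere
try:
    spark
except NameError:
    spark = SparkSession.builder \
        .appName("SimpleMedallionPipeline") \
        .getOrCreate()

print("✅ Spark session created")
print(f"Spark version: {spark.version}")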

✅ Spark session created
Spark version: 4.0.0


## 🥉 Bronze Layer - Raw Data

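Store data exactly as received from the source systems. Every value stays a string, and inconsistencies such as mixed-case names and categories are left untouched; the only addition is an `ingested_at` timestamp.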

In [0]:
# Sample raw sales data (as it comes from different systems)
raw_sales_data = [
    ("001", "John", "2024-01-15", "laptop", "1200.00", "electronics"),
    ("002", "jane", "2024-01-15", "book", "25", "books"),
    ("003", "Bob", "2024-01-16", "phone", "800.0", "Electronics"),
    ("004", "alice", "2024-01-16", "desk", "300.00", "furniture"),
    ("005", "Charlie", "2024-01-17", "chair", "150", "Furniture")
]

# Create DataFrame (Bronze layer)
bronze_df = spark.createDataFrame(raw_sales_data, [
    "order_id", "customer_name", "order_date", 
    "product", "amount", "category"
])

# Add metadata
bronze_df = bronze_df.withColumn("ingested_at", current_timestamp())

print("🥉 Bronze Layer - Raw Data:")
bronze_df.show()

# Save bronze data
bronze_df.write.mode("overwrite").saveAsTable("bronze_sales")
print("✅ Bronze data saved")

🥉 Bronze Layer - Raw Data:
+--------+-------------+----------+-------+-------+-----------+--------------------+
|order_id|customer_name|order_date|product| amount|   category|         ingested_at|
+--------+-------------+----------+-------+-------+-----------+--------------------+
|     001|         John|2024-01-15| laptop|1200.00|electronics|2025-09-16 11:43:...|
|     002|         jane|2024-01-15|   book|     25|      books|2025-09-16 11:43:...|
|     003|          Bob|2024-01-16|  phone|  800.0|Electronics|2025-09-16 11:43:...|
|     004|        alice|2024-01-16|   desk| 300.00|  furniture|2025-09-16 11:43:...|
|     005|      Charlie|2024-01-17|  chair|    150|  Furniture|2025-09-16 11:43:...|
+--------+-------------+----------+-------+-------+-----------+--------------------+

✅ Bronze data saved
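
To make "exactly as received" concrete, here is a minimal plain-Python sketch of the same idea. It is an illustration, not the notebook's code: `to_bronze_record` is a hypothetical helper, and the field-count check is an added assumption about what a bronze step might reject.

```python
from datetime import datetime, timezone

COLUMNS = ["order_id", "customer_name", "order_date", "product", "amount", "category"]


def to_bronze_record(raw_row, ingested_at=None):
    """Wrap one raw row with ingestion metadata without touching its values."""
    if len(raw_row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(raw_row)}")
    record = dict(zip(COLUMNS, raw_row))
    # Metadata is added next to the raw values, never in place of them.
    record["ingested_at"] = ingested_at or datetime.now(timezone.utc)
    return record


row = ("003", "Bob", "2024-01-16", "phone", "800.0", "Electronics")
record = to_bronze_record(row)
print(record["amount"], record["category"])
```

The amount is still the string `"800.0"` and the category keeps its capital `E`. Fixing either one is the silver layer's job.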


## 🥈 Silver Layer - Clean Data

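Clean and standardize the data: capitalize customer names, cast amounts to numbers, lowercase categories, parse order dates, and drop rows without a positive amount.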

In [0]:
# Read from bronze layer
bronze_data = spark.table("bronze_sales")

# Clean and standardize data
silver_df = bronze_data \
    .withColumn("customer_name", initcap(col("customer_name"))) \
    .withColumn("amount", col("amount").cast("double")) \
    .withColumn("category", lower(col("category"))) \
    .withColumn("order_date", to_date(col("order_date"))) \
    .filter(col("amount") > 0) \
    .withColumn("processed_at", current_timestamp())

print("🥈 Silver Layer - Clean Data:")
silver_df.show()

# Save silver data
silver_df.write.mode("overwrite").saveAsTable("silver_sales")
print("✅ Silver data saved")
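
The cleaning rules are easier to unit-test when written as a plain function. The sketch below approximates the silver step in plain Python under a few assumptions: `init_cap` and `clean_record` are hypothetical helpers, `init_cap` only approximates Spark's `initcap`, and a value that cannot be parsed is treated as null. Depending on the ANSI setting, Spark may raise an error on an unparseable amount instead of producing a null.

```python
from datetime import date


def init_cap(text):
    """Capitalize the first letter of each space-separated word, lowercasing the rest."""
    return " ".join(word.capitalize() for word in text.split(" "))


def clean_record(record):
    """Apply the silver rules to one bronze record; return None if the row is dropped."""
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None  # unparseable amount: treated like a null, so the filter drops it
    if not amount > 0:
        return None  # zero, negative, or NaN amounts are dropped
    try:
        order_date = date.fromisoformat(record["order_date"])
    except (TypeError, ValueError):
        order_date = None  # unparseable date becomes null, the row is kept
    return {
        **record,
        "customer_name": init_cap(record["customer_name"]),
        "amount": amount,
        "category": record["category"].lower(),
        "order_date": order_date,
    }


raw = {"order_id": "003", "customer_name": "bob", "order_date": "2024-01-16",
       "product": "phone", "amount": "800.0", "category": "Electronics"}
print(clean_record(raw))
print(clean_record({**raw, "amount": "-5"}))
```

The second call prints `None`, because the row has no positive amount.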

## 🥇 Gold Layer - Business Ready

Create aggregated tables for business analytics: a daily sales summary and a per-category performance table.

In [0]:
# Read from silver layer
silver_data = spark.table("silver_sales")

# Create business metrics
daily_sales = silver_data \
    .groupBy("order_date") \
    .agg(
        count("*").alias("total_orders"),
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_order_value")
    ) \
    .withColumn("avg_order_value", round(col("avg_order_value"), 2)) \
    .orderBy("order_date")

print("🥇 Gold Layer - Daily Sales Summary:")
daily_sales.show()

# Category performance
category_performance = silver_data \
    .groupBy("category") \
    .agg(
        count("*").alias("total_orders"),
        sum("amount").alias("total_revenue")
    ) \
    .orderBy(desc("total_revenue"))

print("📊 Category Performance:")
category_performance.show()

# Save gold tables
daily_sales.write.mode("overwrite").saveAsTable("gold_daily_sales")
category_performance.write.mode("overwrite").saveAsTable("gold_category_performance")
print("✅ Gold tables saved")
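
As a cross-check on the aggregations, the same numbers can be computed by hand from the five sample orders. This is a plain-Python sketch rather than the notebook's code: `summarize` is a hypothetical helper, and `SILVER_ROWS` assumes all five sample rows pass the silver filter, which they do because every amount is positive.

```python
from collections import defaultdict

# (order_date, category, amount) for the five sample orders after silver cleaning.
SILVER_ROWS = [
    ("2024-01-15", "electronics", 1200.0),
    ("2024-01-15", "books", 25.0),
    ("2024-01-16", "electronics", 800.0),
    ("2024-01-16", "furniture", 300.0),
    ("2024-01-17", "furniture", 150.0),
]


def summarize(rows, key_index):
    """Group rows by one field; return {key: (orders, revenue, avg_order_value)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        bucket = totals[row[key_index]]
        bucket[0] += 1       # order count
        bucket[1] += row[2]  # revenue
    return {
        key: (orders, revenue, round(revenue / orders, 2))
        for key, (orders, revenue) in totals.items()
    }


daily = summarize(SILVER_ROWS, key_index=0)
by_category = summarize(SILVER_ROWS, key_index=1)

for day in sorted(daily):
    print(day, daily[day])
for category in sorted(by_category, key=lambda c: by_category[c][1], reverse=True):
    print(category, by_category[category])
```

For the sample data this gives 2 orders and 1225.0 in revenue on 2024-01-15, 2 orders and 1100.0 on 2024-01-16, and 1 order and 150.0 on 2024-01-17. By category, electronics leads with 2000.0, followed by furniture with 450.0 and books with 25.0.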

## 📊 Data Quality Checks

Simple data quality validation.

In [0]:
print("🔍 Data Quality Report:")
print("=" * 30)

# Check record counts
bronze_count = spark.table("bronze_sales").count()
silver_count = spark.table("silver_sales").count()

print(f"Bronze records: {bronze_count}")
print(f"Silver records: {silver_count}")
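# Guard against an empty bronze table to avoid a division by zero
pass_pct = silver_count / bronze_count * 100 if bronze_count else 0.0
print(f"Data quality: {pass_pct:.1f}% records passed")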

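# Check for missing values: when() yields a value only for null cells and
# count() skips nulls, so each column ends up holding its number of nulls
silver_data = spark.table("silver_sales")
null_counts = silver_data.select([count(when(col(c).isNull(), c)).alias(c) for c in silver_data.columns])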

print("\n📋 Null Value Check:")
null_counts.show()

# Revenue validation (the sum is None when silver_sales is empty, so fall back to 0.0)
total_revenue = silver_data.agg(sum("amount")).collect()[0][0] or 0.0
print(f"\n💰 Total Revenue: ${total_revenue:,.2f}")

print("\n✅ Data quality checks complete!")
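
A printed report is easy to overlook, so one possible next step is to turn the same numbers into pass/fail checks. The following is a plain-Python sketch under stated assumptions: `pass_rate` and `check_quality` are hypothetical helpers, and the 95% threshold is an arbitrary value chosen for illustration.

```python
def pass_rate(bronze_count, silver_count):
    """Percentage of bronze records that reached silver; 0.0 for an empty bronze table."""
    if bronze_count == 0:
        return 0.0
    return silver_count / bronze_count * 100


def check_quality(bronze_count, silver_count, null_counts, min_pass_rate=95.0):
    """Return a list of readable failures; an empty list means every check passed.

    `min_pass_rate` is a hypothetical threshold chosen for illustration.
    """
    failures = []
    rate = pass_rate(bronze_count, silver_count)
    if rate < min_pass_rate:
        failures.append(f"pass rate {rate:.1f}% is below {min_pass_rate:.1f}%")
    for column, nulls in null_counts.items():
        if nulls > 0:
            failures.append(f"column '{column}' has {nulls} null value(s)")
    return failures


print(check_quality(5, 5, {"order_id": 0, "amount": 0}))
print(check_quality(5, 4, {"order_id": 0, "amount": 1}))
```

The first call prints an empty list, the second reports both the low pass rate and the null in `amount`. In the notebook, the `null_counts` dictionary could be obtained from the one-row DataFrame with `null_counts.first().asDict()`.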

## 🔄 Complete Pipeline Function

Put it all together in one reusable function.

In [0]:
def run_medallion_pipeline(raw_data):
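    """
    Run the bronze, silver, and gold steps on in-memory rows.

    raw_data: list of 6-tuples (order_id, customer_name, order_date,
    product, amount, category), with every value given as a string.
    Returns the (bronze_df, silver_df, gold_df) DataFrames; nothing is
    written to tables. Relies on the notebook's global `spark` session.
    """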
    print("🚀 Starting Medallion Pipeline...")
    
    # 1. Bronze Layer
    bronze_df = spark.createDataFrame(raw_data, [
        "order_id", "customer_name", "order_date", 
        "product", "amount", "category"
    ]).withColumn("ingested_at", current_timestamp())
    
    # 2. Silver Layer  
    silver_df = bronze_df \
        .withColumn("customer_name", initcap(col("customer_name"))) \
        .withColumn("amount", col("amount").cast("double")) \
        .withColumn("category", lower(col("category"))) \
        .withColumn("order_date", to_date(col("order_date"))) \
        .filter(col("amount") > 0)
    
    # 3. Gold Layer
    gold_df = silver_df \
        .groupBy("category") \
        .agg(
            count("*").alias("orders"),
            sum("amount").alias("revenue")
        )
    
    print("✅ Pipeline completed successfully!")
    return bronze_df, silver_df, gold_df

# Test the pipeline
test_data = [
    ("T001", "mike", "2024-01-20", "mouse", "25.99", "electronics"),
    ("T002", "sara", "2024-01-20", "table", "199.99", "furniture")
]

bronze, silver, gold = run_medallion_pipeline(test_data)

print("\n📊 Pipeline Results:")
gold.show()

## 📚 Summary

**What we learned:**

✅ **Bronze Layer**: Store raw data exactly as received  
✅ **Silver Layer**: Clean and validate data  
✅ **Gold Layer**: Create business-ready analytics tables  
✅ **Data Quality**: Check and monitor data quality  
✅ **Pipeline**: Combine all steps into a reusable function  

**Next Steps:**
- Add Delta Lake for ACID transactions
- Implement real-time streaming
- Add data lineage tracking
- Build monitoring and alerting

🎯 **Key Takeaway**: The medallion architecture provides a systematic approach to building reliable, traceable data pipelines!