# 6.1 Modular Design and Project Structure

This notebook demonstrates how to organize functional PySpark code into modular, reusable, and maintainable projects.

## Learning Objectives
- Understand how to structure PySpark projects for scalability
- Learn to create reusable functional modules
- Master abstraction layers and separation of concerns
- Design idempotent transformations for fault tolerance
- Build production-ready project structures

In [None]:
# For local development: Uncomment the next line
# %run 00_Environment_Setup.ipynb

## Why Modular Design Matters

As PySpark projects grow from simple notebooks to production data pipelines, proper modular design becomes critical:

**Benefits of Modular Design**:
1. **Reusability**: Transform once, use everywhere
2. **Testability**: Pure functions are easy to unit test
3. **Maintainability**: Smaller, focused modules reduce cognitive load
4. **Collaboration**: Teams can work on separate modules independently
5. **Scalability**: Add new features without breaking existing code

**Functional Programming Alignment**:
- Modular design naturally supports pure functions
- Separation of concerns isolates side effects
- Composable modules enable powerful pipelines
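
To make the testability point concrete, here is a minimal, Spark-free sketch. The names `TAX_RATE`, `add_tax_impure`, and `add_tax` are hypothetical and exist only for this illustration; the same contrast applies to functions that take and return DataFrames.

```python
# Illustrative only: plain-Python stand-ins, not part of this notebook's modules.
TAX_RATE = 0.08  # module-level state that the impure version silently depends on


def add_tax_impure(amount: float) -> float:
    """Impure: the result depends on a global that can change between calls."""
    return round(amount * (1 + TAX_RATE), 2)


def add_tax(amount: float, tax_rate: float) -> float:
    """Pure: every input is an explicit parameter, so tests need no setup."""
    return round(amount * (1 + tax_rate), 2)


# A unit test for the pure version is a single assertion with explicit inputs.
assert add_tax(100.0, 0.08) == 108.0
assert add_tax(100.0, 0.10) == 110.0
```

The impure version can only be tested by patching `TAX_RATE`, which is the kind of hidden coupling that modular, parameterized functions avoid.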

In [None]:
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from typing import List, Dict, Optional, Callable
from dataclasses import dataclass
from datetime import datetime

# Create sample customer transaction data
def create_sample_transactions(num_records: int = 10000) -> DataFrame:
    """
    Generate sample e-commerce transaction data.
    Deterministic: a locally seeded RNG yields the same rows on every call
    without touching the global `random` state.
    Relies on the notebook's global `spark` session.
    """
    import random
    rng = random.Random(42)  # Local, seeded generator for reproducibility
    
    data = []
    products = ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse', 'Headphones']
    categories = ['Electronics', 'Accessories', 'Computers']
    statuses = ['completed', 'pending', 'cancelled', 'returned']
    max_customer = max(1, num_records // 10)  # Avoid an empty range for tiny inputs
    
    for i in range(num_records):
        product = rng.choice(products)
        data.append((
            i + 1,  # transaction_id
            f"CUST_{rng.randint(1, max_customer):05d}",  # customer_id
            product,  # product_name
            rng.choice(categories),  # category
            round(rng.uniform(50, 2000), 2),  # amount
            rng.randint(1, 5),  # quantity
            rng.choice(statuses),  # status
            f"2023-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",  # transaction_date
            rng.choice(['US', 'CA', 'UK', 'DE', 'FR'])  # country
        ))
    
    schema = StructType([
        StructField("transaction_id", IntegerType(), False),
        StructField("customer_id", StringType(), False),
        StructField("product_name", StringType(), False),
        StructField("category", StringType(), False),
        StructField("amount", DoubleType(), False),
        StructField("quantity", IntegerType(), False),
        StructField("status", StringType(), False),
        StructField("transaction_date", StringType(), False),
        StructField("country", StringType(), False)
    ])
    
    return spark.createDataFrame(data, schema)

# Generate sample data
transactions_df = create_sample_transactions(5000)
transactions_df = transactions_df.withColumn("transaction_date", F.to_date("transaction_date"))

print(f"Generated {transactions_df.count():,} transactions")
print("\nSample data:")
transactions_df.show(5)
transactions_df.printSchema()

## Anti-Pattern: Monolithic Notebook Pipelines

Let's first see what NOT to do - a monolithic pipeline with everything in one place:

In [None]:
print("=== ❌ ANTI-PATTERN: Monolithic Pipeline ===")

# ❌ BAD: Everything in one massive function with hardcoded logic
def monolithic_pipeline(df):
    """
    ANTI-PATTERN: All logic in one function
    Problems:
    - Hard to test individual steps
    - No reusability
    - Difficult to debug
    - Hardcoded configuration
    - Side effects mixed with transformations
    """
    # Hardcoded filtering
    df = df.filter(F.col("status") == "completed")
    
    # Hardcoded tax calculation
    df = df.withColumn("tax", F.col("amount") * 0.08)
    df = df.withColumn("total", F.col("amount") + F.col("tax"))
    
    # More hardcoded logic
    df = df.withColumn("revenue", F.col("total") * F.col("quantity"))
    
    # Aggregation mixed in
    summary = df.groupBy("category").agg(
        F.sum("revenue").alias("total_revenue"),
        F.count("*").alias("transaction_count")
    )
    
    # Side effect: show() is an action, so every call triggers a Spark job
    summary.show()  # ❌ Action inside transformation!
    
    return summary

result = monolithic_pipeline(transactions_df)

print("\n⚠️  Problems with monolithic design:")
print("  - Cannot test individual transformation steps")
print("  - Cannot reuse logic in other pipelines")
print("  - Hardcoded values (tax rate, status filter)")
print("  - Side effects (show()) break testability")
print("  - Single point of failure")
print("  - Difficult to extend or modify")

## Modular Design Pattern: Separation of Concerns

Let's refactor the monolithic pipeline into modular, reusable components:

In [None]:
print("=== ✅ BEST PRACTICE: Modular Design ===")

# ========================================
# MODULE 1: Data Cleaning
# ========================================

class DataCleaning:
    """
    Module for data cleaning and validation transformations.
    All methods are pure functions (DataFrame in, DataFrame out).
    """
    
    @staticmethod
    def filter_by_status(df: DataFrame, status: str) -> DataFrame:
        """
        Pure function: Filter transactions by status.
        Configurable and reusable.
        """
        return df.filter(F.col("status") == status)
    
    @staticmethod
    def filter_valid_amounts(df: DataFrame, min_amount: float = 0) -> DataFrame:
        """
        Pure function: Keep rows whose amount is non-null and strictly greater than min_amount.
        """
        return df.filter(
            (F.col("amount") > min_amount) & 
            (F.col("amount").isNotNull())
        )
    
    @staticmethod
    def remove_duplicates(df: DataFrame, key_columns: List[str]) -> DataFrame:
        """
        Pure function: Remove duplicate records based on key columns.
        """
        return df.dropDuplicates(key_columns)
    
    @staticmethod
    def standardize_dates(df: DataFrame, date_column: str) -> DataFrame:
        """
        Pure function: Ensure date column is in proper date format.
        """
        return df.withColumn(date_column, F.to_date(F.col(date_column)))

# ========================================
# MODULE 2: Business Logic
# ========================================

class BusinessLogic:
    """
    Module for business logic transformations.
    All methods are pure functions with configurable parameters.
    """
    
    @staticmethod
    def calculate_tax(df: DataFrame, tax_rate: float) -> DataFrame:
        """
        Pure function: Calculate tax based on configurable rate.
        """
        return df.withColumn("tax", F.col("amount") * F.lit(tax_rate))
    
    @staticmethod
    def calculate_total(df: DataFrame) -> DataFrame:
        """
        Pure function: Calculate total including tax.
        Assumes 'amount' and 'tax' columns exist.
        """
        return df.withColumn("total", F.col("amount") + F.col("tax"))
    
    @staticmethod
    def calculate_revenue(df: DataFrame) -> DataFrame:
        """
        Pure function: Calculate revenue (total * quantity).
        """
        return df.withColumn("revenue", F.col("total") * F.col("quantity"))
    
    @staticmethod
    def apply_discount(df: DataFrame, discount_rate: float, 
                      min_amount: float = 100) -> DataFrame:
        """
        Pure function: Add a 'discount' column for transactions at or above min_amount.
        Only computes the discount amount; it does not change 'amount' or 'total'.
        """
        return df.withColumn("discount",
            F.when(F.col("amount") >= min_amount, 
                   F.col("amount") * F.lit(discount_rate))
             .otherwise(0))

# ========================================
# MODULE 3: Analytics
# ========================================

class Analytics:
    """
    Module for analytical aggregations and metrics.
    All methods are pure functions.
    """
    
    @staticmethod
    def revenue_by_category(df: DataFrame) -> DataFrame:
        """
        Pure function: Aggregate revenue by category.
        """
        return df.groupBy("category").agg(
            F.sum("revenue").alias("total_revenue"),
            F.count("*").alias("transaction_count"),
            F.avg("revenue").alias("avg_revenue")
        )
    
    @staticmethod
    def revenue_by_country(df: DataFrame) -> DataFrame:
        """
        Pure function: Aggregate revenue by country.
        """
        return df.groupBy("country").agg(
            F.sum("revenue").alias("total_revenue"),
            F.countDistinct("customer_id").alias("unique_customers")
        )
    
    @staticmethod
    def customer_lifetime_value(df: DataFrame) -> DataFrame:
        """
        Pure function: Calculate customer lifetime value metrics.
        """
        return df.groupBy("customer_id").agg(
            F.sum("revenue").alias("lifetime_value"),
            F.count("*").alias("purchase_count"),
            F.avg("revenue").alias("avg_purchase_value"),
            F.min("transaction_date").alias("first_purchase"),
            F.max("transaction_date").alias("last_purchase")
        )

print("✅ Modular design created with three separate modules:")
print("  1. DataCleaning - Pure data cleaning functions")
print("  2. BusinessLogic - Pure business transformation functions")
print("  3. Analytics - Pure analytical aggregation functions")
print("\n✅ Benefits:")
print("  - Each function is independently testable")
print("  - Functions are reusable across pipelines")
print("  - Configuration through parameters (not hardcoded)")
print("  - Clear separation of concerns")
print("  - Easy to extend and maintain")

## Composing Modular Functions into Pipelines

Now let's compose these modules into complete, readable pipelines:
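
Before the Spark version, the composition idea can be shown in isolation. The sketch below is plain Python and uses a list of dicts as a stand-in for a DataFrame; `Rows`, `Step`, `compose`, `keep_positive`, and `with_tax` are hypothetical names introduced for illustration, not part of PySpark or of this notebook's modules.

```python
from functools import reduce
from typing import Callable, Dict, List

Rows = List[Dict[str, float]]  # stand-in for a DataFrame
Step = Callable[[Rows], Rows]  # one transformation: Rows in, Rows out


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single transformation."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


def keep_positive(rows: Rows) -> Rows:
    """Pure step: drop rows with non-positive amounts."""
    return [r for r in rows if r["amount"] > 0]


def with_tax(tax_rate: float) -> Step:
    """Return a step with the tax rate bound, so it fits the Step signature."""
    return lambda rows: [{**r, "tax": round(r["amount"] * tax_rate, 2)} for r in rows]


pipeline = compose(keep_positive, with_tax(0.10))
result = pipeline([{"amount": 200.0}, {"amount": -5.0}])
print(result)  # [{'amount': 200.0, 'tax': 20.0}]
```

With DataFrames, each step would have the shape `DataFrame -> DataFrame`, and chaining with `df.transform(...)` plays the role of `compose`.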

In [None]:
print("=== Composing Modular Functions ===")

# ✅ GOOD: Pipeline composition with modular functions
# Note: passing extra arguments through DataFrame.transform requires PySpark 3.3+
def build_revenue_pipeline(df: DataFrame, tax_rate: float = 0.08) -> DataFrame:
    """
    Pure function: Compose modular functions into a revenue calculation pipeline.
    No side effects - returns transformed DataFrame.
    """
    return (df
        .transform(DataCleaning.filter_by_status, "completed")
        .transform(DataCleaning.filter_valid_amounts)
        .transform(DataCleaning.remove_duplicates, ["transaction_id"])
        .transform(BusinessLogic.calculate_tax, tax_rate)
        .transform(BusinessLogic.calculate_total)
        .transform(BusinessLogic.calculate_revenue)
    )

# ✅ GOOD: Different pipeline for different analysis
def build_discount_pipeline(df: DataFrame, tax_rate: float = 0.08, 
                           discount_rate: float = 0.10) -> DataFrame:
    """
    Pure function: Pipeline with discount calculation.
    Reuses existing modules and adds discount logic.
    """
    return (df
        .transform(DataCleaning.filter_by_status, "completed")
        .transform(DataCleaning.filter_valid_amounts)
        .transform(BusinessLogic.calculate_tax, tax_rate)
        .transform(BusinessLogic.calculate_total)
        .transform(BusinessLogic.apply_discount, discount_rate)
        .transform(BusinessLogic.calculate_revenue)
    )

# Test the modular pipelines
print("\n1. Testing revenue pipeline:")
revenue_df = build_revenue_pipeline(transactions_df, tax_rate=0.10)
print(f"Processed {revenue_df.count():,} transactions")
revenue_df.select("transaction_id", "amount", "tax", "total", "revenue").show(5)

print("\n2. Testing discount pipeline:")
discount_df = build_discount_pipeline(transactions_df, tax_rate=0.10, discount_rate=0.15)
print(f"Processed {discount_df.count():,} transactions with discounts")
discount_df.select("transaction_id", "amount", "discount", "total", "revenue").show(5)

# Apply analytics
print("\n3. Applying analytics:")
category_summary = Analytics.revenue_by_category(revenue_df)
print("Revenue by Category:")
category_summary.show()

country_summary = Analytics.revenue_by_country(revenue_df)
print("Revenue by Country:")
country_summary.show()

print("\n✅ Modular pipelines demonstrate:")
print("  - Functions are reusable across different pipelines")
print("  - Easy to test individual components")
print("  - Configuration through parameters")
print("  - Clear data flow and transformations")

## Configuration Management with Immutable Dataclasses

Use immutable configuration objects for pipeline parameterization:
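
As a small, Spark-free illustration of what "immutable" means here, the sketch below uses a hypothetical `RateConfig` class (a cut-down stand-in, not the configuration used in this notebook). It shows that assignment on a frozen dataclass raises an error and that `dataclasses.replace` derives a validated variant instead of mutating the original.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class RateConfig:  # hypothetical, for illustration only
    tax_rate: float = 0.08

    def __post_init__(self) -> None:
        if not 0 <= self.tax_rate <= 1:
            raise ValueError(f"tax_rate must be between 0 and 1, got {self.tax_rate}")


base = RateConfig()

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    base.tax_rate = 0.5
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new instance and re-runs __post_init__ validation.
high_tax = replace(base, tax_rate=0.15)

try:
    replace(base, tax_rate=1.5)
    rejected = False
except ValueError:
    rejected = True

print(mutated, base.tax_rate, high_tax.tax_rate, rejected)  # False 0.08 0.15 True
```

Deriving variants with `replace` keeps a single validated base configuration and makes each override explicit at the call site.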

In [None]:
print("=== Configuration Management Pattern ===")

@dataclass(frozen=True)  # Immutable configuration
class PipelineConfig:
    """
    Immutable configuration for revenue pipeline.
    Frozen dataclass ensures no accidental mutations.
    """
    tax_rate: float = 0.08
    discount_rate: float = 0.10
    min_discount_amount: float = 100.0
    status_filter: str = "completed"
    min_valid_amount: float = 0.0
    
    def __post_init__(self):
        """Validate configuration on initialization"""
        if self.tax_rate < 0 or self.tax_rate > 1:
            raise ValueError(f"Tax rate must be between 0 and 1, got {self.tax_rate}")
        if self.discount_rate < 0 or self.discount_rate > 1:
            raise ValueError(f"Discount rate must be between 0 and 1, got {self.discount_rate}")

# ✅ GOOD: Pipeline using configuration object
def build_configurable_pipeline(df: DataFrame, config: PipelineConfig) -> DataFrame:
    """
    Pure function: Pipeline driven by immutable configuration.
    Enables different configurations without code changes.
    """
    return (df
        .transform(DataCleaning.filter_by_status, config.status_filter)
        .transform(DataCleaning.filter_valid_amounts, config.min_valid_amount)
        .transform(DataCleaning.remove_duplicates, ["transaction_id"])
        .transform(BusinessLogic.calculate_tax, config.tax_rate)
        .transform(BusinessLogic.calculate_total)
        .transform(BusinessLogic.apply_discount, config.discount_rate, config.min_discount_amount)
        .transform(BusinessLogic.calculate_revenue)
    )

# Test with different configurations
print("\n1. Standard configuration:")
standard_config = PipelineConfig()
standard_result = build_configurable_pipeline(transactions_df, standard_config)
print(f"Tax rate: {standard_config.tax_rate}, Discount: {standard_config.discount_rate}")
print(f"Processed: {standard_result.count():,} transactions")

print("\n2. High-tax configuration:")
high_tax_config = PipelineConfig(tax_rate=0.15, discount_rate=0.05)
high_tax_result = build_configurable_pipeline(transactions_df, high_tax_config)
print(f"Tax rate: {high_tax_config.tax_rate}, Discount: {high_tax_config.discount_rate}")
print(f"Processed: {high_tax_result.count():,} transactions")

print("\n3. Low-discount threshold configuration:")
low_threshold_config = PipelineConfig(discount_rate=0.20, min_discount_amount=50.0)
low_threshold_result = build_configurable_pipeline(transactions_df, low_threshold_config)
print(f"Discount: {low_threshold_config.discount_rate}, Min amount: ${low_threshold_config.min_discount_amount}")
print(f"Processed: {low_threshold_result.count():,} transactions")

# Compare average discounts
print("\nComparing average discounts:")
for name, result in [("Standard", standard_result), 
                     ("High Tax", high_tax_result),
                     ("Low Threshold", low_threshold_result)]:
    avg_discount = result.agg(F.avg("discount")).collect()[0][0]
    print(f"  {name}: ${avg_discount:.2f} average discount")

print("\n✅ Configuration benefits:")
print("  - Immutable configuration prevents accidental changes")
print("  - Easy to test with different configurations")
print("  - Validation at configuration creation time")
print("  - Typed parameters (checkable with static type checkers, not enforced at runtime)")

## Abstraction Layers: Extract-Transform-Load Pattern

Separate data extraction, transformation, and loading into distinct layers:
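
One way to see why this separation helps is to pass the I/O functions in as arguments. The sketch below is plain Python with lists of dicts standing in for DataFrames; `run_pipeline`, `completed_only`, and the in-memory `sink` are hypothetical names for illustration and are not the classes defined in this notebook.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]  # stand-in for a DataFrame


def run_pipeline(
    read: Callable[[], Rows],
    transform: Callable[[Rows], Rows],
    write: Callable[[Rows], None],
) -> Dict[str, int]:
    """Orchestrate extract -> transform -> load; I/O lives only in read/write."""
    raw = read()
    out = transform(raw)
    write(out)
    return {"input_records": len(raw), "output_records": len(out)}


def completed_only(rows: Rows) -> Rows:
    """Pure transform: keep completed transactions."""
    return [r for r in rows if r["status"] == "completed"]


# In-memory fakes replace storage, so the flow can be tested without a cluster.
sink: Rows = []
metrics = run_pipeline(
    read=lambda: [{"id": 1, "status": "completed"}, {"id": 2, "status": "pending"}],
    transform=completed_only,
    write=sink.extend,
)
print(metrics)  # {'input_records': 2, 'output_records': 1}
print(sink)     # [{'id': 1, 'status': 'completed'}]
```

Swapping a source or destination then means passing a different `read` or `write` callable; the transform and the orchestration logic stay untouched.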

In [None]:
print("=== Abstraction Layers: ETL Pattern ===")

# ========================================
# LAYER 1: Extraction
# ========================================

class DataExtraction:
    """
    Extraction layer: Responsible for reading data from sources.
    Encapsulates all I/O operations in one place.
    """
    
    @staticmethod
    def read_transactions(path: str, format: str = "parquet") -> DataFrame:
        """
        Read transaction data from storage.
        Centralized data reading logic.
        """
        return spark.read.format(format).load(path)
    
    @staticmethod
    def read_with_schema(path: str, schema: StructType, format: str = "parquet") -> DataFrame:
        """
        Read data with explicit schema enforcement.
        """
        return spark.read.format(format).schema(schema).load(path)
    
    @staticmethod
    def read_incremental(path: str, checkpoint_column: str, 
                        last_checkpoint: str) -> DataFrame:
        """
        Read incremental data from a Delta table: only rows whose
        checkpoint column is greater than the last checkpoint.
        """
        df = spark.read.format("delta").load(path)
        return df.filter(F.col(checkpoint_column) > F.lit(last_checkpoint))

# ========================================
# LAYER 2: Transformation (Already defined)
# ========================================
# Uses DataCleaning, BusinessLogic, Analytics modules

# ========================================
# LAYER 3: Loading
# ========================================

class DataLoading:
    """
    Loading layer: Responsible for writing data to destinations.
    Encapsulates all write operations and side effects.
    """
    
    @staticmethod
    def write_to_delta(df: DataFrame, path: str, mode: str = "overwrite",
                      partition_by: Optional[List[str]] = None) -> None:
        """
        Write DataFrame to Delta Lake table.
        Side effect: Persists data to storage.
        """
        writer = df.write.format("delta").mode(mode)
        if partition_by:
            writer = writer.partitionBy(*partition_by)
        writer.save(path)
    
    @staticmethod
    def write_to_parquet(df: DataFrame, path: str, mode: str = "overwrite") -> None:
        """
        Write DataFrame to Parquet format.
        """
        df.write.format("parquet").mode(mode).save(path)
    
    @staticmethod
    def create_or_replace_view(df: DataFrame, view_name: str, 
                              global_view: bool = False) -> None:
        """
        Create temporary view for SQL access.
        """
        if global_view:
            df.createOrReplaceGlobalTempView(view_name)
        else:
            df.createOrReplaceTempView(view_name)

# ========================================
# ORCHESTRATION LAYER
# ========================================

class PipelineOrchestrator:
    """
    Orchestration layer: Coordinates ETL flow.
    Separates pure transformation logic from I/O side effects.
    """
    
    @staticmethod
    def run_revenue_analytics_pipeline(source_path: str, destination_path: str,
                                      config: PipelineConfig) -> Dict[str, int]:
        """
        Complete ETL pipeline with clear separation of concerns.
        Returns metrics about the pipeline execution.
        """
        # EXTRACT (Side effect: I/O)
        print("📥 EXTRACT: Reading source data...")
        raw_df = DataExtraction.read_transactions(source_path)
        input_count = raw_df.count()
        
        # TRANSFORM (Pure functions - no side effects)
        print("🔄 TRANSFORM: Applying transformations...")
        transformed_df = build_configurable_pipeline(raw_df, config)
        output_count = transformed_df.count()
        
        # Analytical transformations
        category_summary = Analytics.revenue_by_category(transformed_df)
        country_summary = Analytics.revenue_by_country(transformed_df)
        
        # LOAD (Side effects: I/O and view creation)
        print("💾 LOAD: Writing results...")
        DataLoading.write_to_delta(transformed_df, f"{destination_path}/transactions", 
                                  partition_by=["country"])
        DataLoading.write_to_delta(category_summary, f"{destination_path}/category_summary")
        DataLoading.write_to_delta(country_summary, f"{destination_path}/country_summary")
        DataLoading.create_or_replace_view(transformed_df, "revenue_transactions")
        
        # Return metrics (not side effect - just return value)
        return {
            "input_records": input_count,
            "output_records": output_count,
            "filtered_records": input_count - output_count,
            "categories": category_summary.count(),
            "countries": country_summary.count()
        }

# Demonstrate the layered architecture
print("\nDemonstrating layered ETL architecture:")

# First, write some sample data for extraction
sample_path = "/tmp/sample_transactions"
output_path = "/tmp/revenue_analytics"

try:
    dbutils.fs.rm(sample_path, True)
    dbutils.fs.rm(output_path, True)
except Exception:
    # dbutils exists only on Databricks; ignore cleanup failures elsewhere
    pass

# Write sample data
transactions_df.write.format("parquet").mode("overwrite").save(sample_path)
print(f"✅ Sample data written to {sample_path}")

# Run the orchestrated pipeline
print("\n🚀 Running orchestrated ETL pipeline...")
config = PipelineConfig(tax_rate=0.10, discount_rate=0.15)
metrics = PipelineOrchestrator.run_revenue_analytics_pipeline(
    source_path=sample_path,
    destination_path=output_path,
    config=config
)

print("\n📊 Pipeline Execution Metrics:")
for key, value in metrics.items():
    print(f"  {key}: {value:,}")

print("\n✅ Layered architecture benefits:")
print("  - Clear separation: Extract, Transform, Load")
print("  - Side effects isolated to Extract and Load layers")
print("  - Transform layer is pure and testable")
print("  - Orchestrator coordinates the flow")
print("  - Easy to swap implementations (e.g., different sources)")

## Idempotent Transformations for Fault Tolerance

Design transformations that produce the same result when applied multiple times:
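
Formally, a transformation `f` is idempotent when `f(f(x)) == f(x)`. The sketch below checks that property on one concrete input using plain Python lists as a stand-in for DataFrames; `is_idempotent`, `dedupe_by_id`, and `append_audit_row` are hypothetical names for illustration. Passing the check for a single input is evidence, not a proof.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, int]]  # stand-in for a DataFrame


def is_idempotent(f: Callable[[Rows], Rows], rows: Rows) -> bool:
    """Check f(f(x)) == f(x) for one concrete input."""
    once = f(rows)
    return f(once) == once


def dedupe_by_id(rows: Rows) -> Rows:
    """Keep the first row seen for each id."""
    seen: Dict[int, Dict[str, int]] = {}
    for row in rows:
        seen.setdefault(row["id"], row)
    return list(seen.values())


def append_audit_row(rows: Rows) -> Rows:
    """Counter-example: every application adds another row."""
    return rows + [{"id": -1}]


data = [{"id": 1}, {"id": 1}, {"id": 2}]
print(is_idempotent(dedupe_by_id, data))      # True
print(is_idempotent(append_audit_row, data))  # False
```

A retried job behaves like a second application of `f`, which is why the append-style step would corrupt the output on retry while the deduplication step would not.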

In [None]:
print("=== Idempotent Transformations ===")

@dataclass(frozen=True)
class ProcessingMetadata:
    """
    Metadata to track processing for idempotency.
    """
    processed_at: str
    processing_version: str
    is_processed: bool = True

class IdempotentTransformations:
    """
    Idempotent transformation patterns.
    Multiple applications yield the same result.
    """
    
    @staticmethod
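    def mark_as_processed(df: DataFrame, version: str = "1.0") -> DataFrame:
        """
        Idempotent: Mark records as processed.
        Adds the metadata columns if they are missing and fills only null
        values, so already marked records are left unchanged.
        """
        # Referencing a column that does not exist raises AnalysisException,
        # so create the metadata columns as typed nulls on first use
        if "processed_at" not in df.columns:
            df = df.withColumn("processed_at", F.lit(None).cast("timestamp"))
        if "processing_version" not in df.columns:
            df = df.withColumn("processing_version", F.lit(None).cast("string"))
        
        return df.withColumn("processed_at", 
            F.when(F.col("processed_at").isNull(), F.current_timestamp())
             .otherwise(F.col("processed_at"))
        ).withColumn("processing_version",
            F.when(F.col("processing_version").isNull(), F.lit(version))
             .otherwise(F.col("processing_version"))
        )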
    
    @staticmethod
    def deduplicate_by_key(df: DataFrame, key_columns: List[str], 
                          order_by: str = "transaction_date") -> DataFrame:
        """
        Idempotent: Remove duplicates, keeping most recent.
        Multiple applications yield the same result.
        """
        from pyspark.sql.window import Window
        
        window = Window.partitionBy(*key_columns).orderBy(F.desc(order_by))
        
        return (df
            .withColumn("row_num", F.row_number().over(window))
            .filter(F.col("row_num") == 1)
            .drop("row_num")
        )
    
    @staticmethod
    def standardize_column_names(df: DataFrame) -> DataFrame:
        """
        Idempotent: Standardize column names to lowercase with underscores.
        Multiple applications don't change already standardized names.
        """
        for col_name in df.columns:
            # Convert to lowercase and replace spaces with underscores
            new_name = col_name.lower().replace(" ", "_").replace("-", "_")
            if new_name != col_name:
                df = df.withColumnRenamed(col_name, new_name)
        return df

# Test idempotency
print("\nTesting idempotent transformations:")

test_df = transactions_df.limit(100)

# Apply idempotent transformation once
print("\n1. First application:")
# current_timestamp() is re-evaluated on every action, so cache the result to
# pin the timestamps (a real pipeline would persist the output to storage)
result1 = IdempotentTransformations.mark_as_processed(test_df).cache()
processed_count1 = result1.filter(F.col("processed_at").isNotNull()).count()
print(f"Records with processed_at: {processed_count1}")

# Apply same transformation again (idempotent)
print("\n2. Second application (should be idempotent):")
result2 = IdempotentTransformations.mark_as_processed(result1)
processed_count2 = result2.filter(F.col("processed_at").isNotNull()).count()
print(f"Records with processed_at: {processed_count2}")

# Verify idempotency on the same record (lowest transaction_id)
sample1 = result1.select("transaction_id", "processed_at", "processing_version").orderBy("transaction_id").first()
sample2 = result2.select("transaction_id", "processed_at", "processing_version").orderBy("transaction_id").first()

print("\n✅ Idempotency check:")
print(f"  First application: {sample1['processed_at']}")
print(f"  Second application: {sample2['processed_at']}")
print(f"  Values unchanged: {sample1['processed_at'] == sample2['processed_at']}")

# Test deduplication idempotency
print("\n3. Testing deduplication idempotency:")

# Create data with duplicates
with_duplicates = test_df.union(test_df.limit(10))  # Add 10 duplicates
print(f"Original with duplicates: {with_duplicates.count()} records")

# First deduplication
dedup1 = IdempotentTransformations.deduplicate_by_key(with_duplicates, ["transaction_id"])
print(f"After first deduplication: {dedup1.count()} records")

# Second deduplication (should yield same result)
dedup2 = IdempotentTransformations.deduplicate_by_key(dedup1, ["transaction_id"])
print(f"After second deduplication: {dedup2.count()} records")
print(f"✅ Idempotent: {dedup1.count() == dedup2.count()}")

print("\n✅ Idempotent transformation benefits:")
print("  - Safe to retry failed jobs")
print("  - No unintended side effects from re-processing")
print("  - Fault-tolerant pipeline design")
print("  - Deterministic behavior")

## Project Structure Best Practices

Recommended directory structure for production PySpark projects:

In [None]:
print("=== Recommended Project Structure ===")

project_structure = """
my_pyspark_project/
│
├── src/                           # Source code
│   ├── __init__.py
│   ├── extraction/                # Data extraction layer
│   │   ├── __init__.py
│   │   ├── readers.py            # Data readers (I/O side effects)
│   │   └── sources.py            # Source configurations
│   │
│   ├── transformation/            # Transformation layer
│   │   ├── __init__.py
│   │   ├── cleaning.py           # Data cleaning module
│   │   ├── business_logic.py     # Business transformations
│   │   ├── enrichment.py         # Data enrichment functions
│   │   └── analytics.py          # Analytical aggregations
│   │
│   ├── loading/                   # Data loading layer
│   │   ├── __init__.py
│   │   ├── writers.py            # Data writers
│   │   └── destinations.py       # Destination configurations
│   │
│   ├── pipelines/                 # Pipeline orchestration
│   │   ├── __init__.py
│   │   ├── revenue_pipeline.py   # Revenue analytics pipeline
│   │   └── customer_pipeline.py  # Customer analytics pipeline
│   │
│   ├── utils/                     # Utility functions
│   │   ├── __init__.py
│   │   ├── spark_utils.py        # Spark session management
│   │   ├── schema_utils.py       # Schema definitions and validation
│   │   └── config_utils.py       # Configuration management
│   │
│   └── config/                    # Configuration files
│       ├── __init__.py
│       ├── pipeline_config.py    # Pipeline configurations
│       └── environments.py       # Environment-specific configs
│
├── tests/                         # Test directory
│   ├── __init__.py
│   ├── conftest.py               # Pytest fixtures
│   ├── unit/                     # Unit tests
│   │   ├── test_cleaning.py
│   │   ├── test_business_logic.py
│   │   └── test_analytics.py
│   │
│   ├── integration/              # Integration tests
│   │   ├── test_revenue_pipeline.py
│   │   └── test_customer_pipeline.py
│   │
│   └── fixtures/                 # Test data fixtures
│       └── sample_data.py
│
├── notebooks/                     # Databricks notebooks
│   ├── exploratory/              # Exploratory analysis
│   ├── production/               # Production job notebooks
│   └── validation/               # Data validation notebooks
│
├── scripts/                       # Utility scripts
│   ├── setup_environment.py
│   └── run_tests.py
│
├── docs/                          # Documentation
│   ├── architecture.md
│   ├── pipeline_design.md
│   └── deployment.md
│
├── .github/                       # CI/CD configuration
│   └── workflows/
│       └── ci.yml
│
├── requirements.txt               # Python dependencies
├── setup.py                       # Package setup
├── pytest.ini                     # Pytest configuration
├── .gitignore                     # Git ignore rules
└── README.md                      # Project documentation
"""

print(project_structure)

print("\n✅ Project Structure Principles:")
print("\n1. Separation of Concerns:")
print("   - extraction/ - Data reading logic")
print("   - transformation/ - Pure transformation functions")
print("   - loading/ - Data writing logic")
print("   - pipelines/ - Orchestration")

print("\n2. Modular Organization:")
print("   - Each module has single responsibility")
print("   - Related functions grouped together")
print("   - Clear import paths")

print("\n3. Testability:")
print("   - tests/ kept separate from src/, split into unit/ and integration/")
print("   - Unit tests for pure functions")
print("   - Integration tests for pipelines")
print("   - Shared fixtures in conftest.py")

print("\n4. Configuration Management:")
print("   - Centralized config/ directory")
print("   - Environment-specific configurations")
print("   - Immutable configuration objects")

print("\n5. Documentation:")
print("   - README.md for project overview")
print("   - docs/ for detailed documentation")
print("   - Docstrings in all functions")

print("\n6. CI/CD Integration:")
print("   - .github/workflows/ for automation")
print("   - Automated testing on commits")
print("   - Deployment scripts")

## Summary

**Key Takeaways:**

1. **Modular Design Benefits**:
   - Reusability across projects and pipelines
   - Independent testing of components
   - Easier maintenance and debugging
   - Clear separation of concerns
   - Scalable team collaboration

2. **Functional Module Patterns**:
   - Pure functions for all transformations
   - Immutable configuration objects
   - Composable pipeline building
   - Parameter injection over hardcoding

3. **Abstraction Layers**:
   - Extract layer: I/O for reading data
   - Transform layer: Pure business logic
   - Load layer: I/O for writing data
   - Orchestration layer: Coordinates flow

4. **Idempotent Transformations**:
   - Safe to retry on failure
   - Deterministic behavior
   - Fault-tolerant design
   - Production-ready patterns

5. **Project Structure**:
   - Organized by layer and concern
   - Testable architecture
   - Clear module boundaries
   - CI/CD ready

**Best Practices for Modular PySpark**:
- Keep transformation modules pure (DataFrame in, DataFrame out)
- Isolate side effects to extraction and loading layers
- Use immutable configuration objects
- Design idempotent transformations for fault tolerance
- Structure projects for team collaboration
- Write comprehensive unit tests for all modules
- Document module responsibilities clearly

**Next Steps**: In the next notebook (6.2), we'll explore dependency management and package distribution for PySpark projects in Databricks.

## Exercise

Practice modular design with your own data:

1. Create a modular data cleaning module with at least 3 pure functions
2. Create a business logic module specific to your domain
3. Create an analytics module with aggregation functions
4. Build a configurable pipeline using immutable configuration
5. Implement idempotent transformations for your use case
6. Design extraction and loading layers
7. Write unit tests for each module

In [None]:
# Your exercise code here

# 1. Create your data cleaning module
class MyDataCleaning:
    """Your data cleaning transformations"""
    
    @staticmethod
    def your_cleaning_function(df: DataFrame) -> DataFrame:
        # Your implementation
        pass

# 2. Create your business logic module
class MyBusinessLogic:
    """Your business transformations"""
    
    @staticmethod
    def your_business_function(df: DataFrame) -> DataFrame:
        # Your implementation
        pass

# 3. Create your analytics module
class MyAnalytics:
    """Your analytical aggregations"""
    
    @staticmethod
    def your_analytics_function(df: DataFrame) -> DataFrame:
        # Your implementation
        pass

# 4. Create your configuration
@dataclass(frozen=True)
class MyPipelineConfig:
    """Your pipeline configuration"""
    # Your configuration parameters
    pass

# 5. Build your pipeline
def build_my_pipeline(df: DataFrame, config: MyPipelineConfig) -> DataFrame:
    """Your modular pipeline"""
    # Your pipeline composition
    pass

# 6. Test your modules
def test_my_modules():
    """Test your modular functions"""
    # Your tests
    pass

# Run your exercise
# test_my_modules()
# my_config = MyPipelineConfig(...)
# result = build_my_pipeline(your_data, my_config)