# 2.3 Effective Chaining and Composition of Transformations

This notebook demonstrates patterns for chaining and composing PySpark transformations while keeping the code readable and maintainable.

## Learning Objectives
- Master transformation chaining patterns in PySpark
- Use schema contracts with structured `select` statements
- Balance functional composition with code readability
- Break down complex transformations into manageable functions
- Apply best practices for method chaining
- Leverage `.transform()` for clean pipeline composition

In [None]:
# For local development: Uncomment the next line
# %run 00_Environment_Setup.ipynb

## Why Chaining and Composition Matter

**Functional Composition** is the heart of functional programming - combining simple functions to build complex operations.

**In PySpark**:
- Transformations return new DataFrames (immutability)
- Transformations are lazy: building a chain only extends the query plan, and no data is processed until an action such as `show()` or `count()` runs
- The Catalyst optimizer sees the whole chain and optimizes it as a single plan

**Benefits**:
1. **Declarative Code**: Express *what* you want, not *how* to get it
2. **Optimizable**: Spark optimizes the entire transformation chain
3. **Composable**: Build complex pipelines from simple building blocks
4. **Testable**: Each transformation can be tested independently

**Challenge**: Balance composition power with code readability

In [None]:
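from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, BooleanType
)
from typing import List, Callable
import random

# Reuse the active session if one exists (e.g. from the environment setup notebook)
spark = SparkSession.builder.getOrCreate()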

# Create sample employee data for demonstrations
def create_employee_data(num_records: int = 1000) -> DataFrame:
    """
    Generate sample employee data.
    Deterministic for a given num_records (fixed seed). Not strictly pure:
    it reseeds the global `random` module and uses the global `spark` session.
    """
    random.seed(42)
    
    departments = ['Engineering', 'Sales', 'Marketing', 'HR', 'Finance', 'Operations']
    locations = ['New York', 'San Francisco', 'Austin', 'Seattle', 'Boston']
    job_levels = ['Junior', 'Mid', 'Senior', 'Lead', 'Principal']
    
    data = []
    for i in range(num_records):
        data.append((
            f"EMP{i+1:05d}",  # employee_id
            f"Employee {i+1}",  # name
            random.choice(departments),  # department
            random.choice(locations),  # location
            random.choice(job_levels),  # job_level
            round(random.uniform(50000, 200000), 2),  # salary
            random.randint(0, 20),  # years_experience
            random.randint(0, 30),  # vacation_days
            f"2020-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}",  # hire_date
            random.choice([True, False])  # is_remote
        ))
    
    schema = StructType([
        StructField("employee_id", StringType(), False),
        StructField("name", StringType(), False),
        StructField("department", StringType(), False),
        StructField("location", StringType(), False),
        StructField("job_level", StringType(), False),
        StructField("salary", DoubleType(), False),
        StructField("years_experience", IntegerType(), False),
        StructField("vacation_days", IntegerType(), False),
        StructField("hire_date", StringType(), False),
        StructField("is_remote", BooleanType(), False)
    ])
    
    return spark.createDataFrame(data, schema)

# Generate sample data
employees_df = create_employee_data(500)
employees_df = employees_df.withColumn("hire_date", F.to_date("hire_date"))

print(f"Generated {employees_df.count():,} employee records")
print("\nSample data:")
employees_df.show(5, truncate=False)
employees_df.printSchema()

## Basic Chaining Patterns

Because every transformation returns a new DataFrame, PySpark calls chain naturally:

In [None]:
print("=== Basic Method Chaining ===")

# ✅ GOOD: Simple, readable chain
result = (employees_df
    .filter(F.col("department") == "Engineering")
    .filter(F.col("salary") > 100000)
    .select("employee_id", "name", "salary", "job_level")
    .orderBy(F.desc("salary"))
)

print("High-salary engineers:")
result.show(5)

print("\n✅ Why this works well:")
print("  - Each method returns a new DataFrame")
print("  - Operations are clearly sequenced")
print("  - Easy to read top-to-bottom")
print("  - Lazy evaluation - nothing executed until show()")
print("  - Catalyst optimizer sees entire chain")

## Anti-Pattern: Overly Long Chains

While chaining is powerful, excessive chaining can harm readability:

In [None]:
print("=== ❌ ANTI-PATTERN: Overly Long Chain ===")

# ❌ BAD: Too many operations in one chain (hard to read and debug)
bad_result = (employees_df
    .filter(F.col("department").isin(["Engineering", "Sales"]))
    .filter(F.col("years_experience") >= 3)
    .withColumn("salary_grade", F.when(F.col("salary") < 75000, "L1").when(F.col("salary") < 100000, "L2").when(F.col("salary") < 150000, "L3").otherwise("L4"))
    .withColumn("bonus", F.when(F.col("job_level") == "Senior", F.col("salary") * 0.15).when(F.col("job_level") == "Lead", F.col("salary") * 0.20).when(F.col("job_level") == "Principal", F.col("salary") * 0.25).otherwise(F.col("salary") * 0.10))
    .withColumn("total_comp", F.col("salary") + F.col("bonus"))
    .withColumn("tenure_months", F.months_between(F.current_date(), F.col("hire_date")))
    .withColumn("is_high_performer", (F.col("salary") > 120000) & (F.col("years_experience") > 5))
    .filter(F.col("total_comp") > 100000)
    .groupBy("department", "salary_grade").agg(F.avg("total_comp").alias("avg_comp"), F.count("*").alias("employee_count"))
    .orderBy(F.desc("avg_comp"))
)

print("Result (but hard to understand how we got here):")
bad_result.show(5)

print("\n⚠️  Problems with this approach:")
print("  - Hard to read and understand")
print("  - Difficult to debug intermediate steps")
print("  - Complex withColumn expressions are unreadable")
print("  - Can't easily test individual transformations")
print("  - Exceeds the guideline of at most 5 operations per chain")

## Best Practice: Breaking Down Complex Chains

Extract complex logic into named, reusable functions:

In [None]:
print("=== ✅ BEST PRACTICE: Modular Transformation Functions ===")

# ✅ GOOD: Extract complex logic into pure functions

def add_salary_grade(df: DataFrame) -> DataFrame:
    """
    Pure function: Categorize employees by salary grade.
    Single responsibility - easy to test.
    """
    return df.withColumn("salary_grade",
        F.when(F.col("salary") < 75000, "L1")
         .when(F.col("salary") < 100000, "L2")
         .when(F.col("salary") < 150000, "L3")
         .otherwise("L4")
    )

def calculate_bonus(df: DataFrame) -> DataFrame:
    """
    Pure function: Calculate performance bonus by job level.
    """
    return df.withColumn("bonus",
        F.when(F.col("job_level") == "Senior", F.col("salary") * 0.15)
         .when(F.col("job_level") == "Lead", F.col("salary") * 0.20)
         .when(F.col("job_level") == "Principal", F.col("salary") * 0.25)
         .otherwise(F.col("salary") * 0.10)
    )

def calculate_total_compensation(df: DataFrame) -> DataFrame:
    """
    Pure function: Calculate total compensation.
    """
    return df.withColumn("total_comp", F.col("salary") + F.col("bonus"))

def add_tenure_metrics(df: DataFrame) -> DataFrame:
    """
    Add tenure in months, measured from hire_date to the current date.
    Not strictly pure: the result depends on the day the job runs.
    """
    return df.withColumn("tenure_months", 
                        F.months_between(F.current_date(), F.col("hire_date")))

def identify_high_performers(df: DataFrame) -> DataFrame:
    """
    Pure function: Identify high performers.
    """
    return df.withColumn("is_high_performer",
        (F.col("salary") > 120000) & (F.col("years_experience") > 5)
    )

# ✅ GOOD: Clean, readable chain using modular functions
good_result = (employees_df
    .filter(F.col("department").isin(["Engineering", "Sales"]))
    .filter(F.col("years_experience") >= 3)
    .transform(add_salary_grade)
    .transform(calculate_bonus)
    .transform(calculate_total_compensation)
    .transform(add_tenure_metrics)
    .transform(identify_high_performers)
    .filter(F.col("total_comp") > 100000)
    .groupBy("department", "salary_grade")
    .agg(
        F.avg("total_comp").alias("avg_comp"),
        F.count("*").alias("employee_count")
    )
    .orderBy(F.desc("avg_comp"))
)

print("Result (with readable transformation chain):")
good_result.show(5)

print("\n✅ Benefits of modular functions:")
print("  - Each function has single responsibility")
print("  - Easy to test independently")
print("  - Reusable across pipelines")
print("  - Self-documenting code")
print("  - Chain remains readable")
print("  - Uses .transform() for clean composition")

## Schema Contracts with Structured Select Statements

Use `select` statements to explicitly define schema contracts:

In [None]:
print("=== Schema Contracts with Select ===")

# ✅ GOOD: Select as schema contract at pipeline boundaries
def prepare_employee_summary(df: DataFrame) -> DataFrame:
    """
    Transform employee data with explicit output schema.
    """
    # Input schema contract: fail fast if required columns are missing
    required_columns = {"employee_id", "name", "department", "salary", "job_level"}
    missing = required_columns - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    
    # Transformations
    result = (df
        .transform(add_salary_grade)
        .transform(calculate_bonus)
        .transform(calculate_total_compensation)
    )
    
    # Output schema contract (explicit select)
    return result.select(
        F.col("employee_id"),
        F.col("name"),
        F.col("department"),
        F.col("job_level"),
        F.col("salary"),
        F.col("salary_grade"),
        F.col("bonus"),
        F.col("total_comp").alias("total_compensation")
    )

# Test the function
summary_df = prepare_employee_summary(employees_df)

print("Employee summary with explicit schema:")
summary_df.show(5)
print("\nOutput schema:")
summary_df.printSchema()

print("\n✅ Schema contract benefits:")
print("  - Explicit output schema definition")
print("  - Downstream consumers know what to expect")
print("  - Easy to spot breaking changes")
print("  - Self-documenting data contracts")
print("  - A narrow final select lets Catalyst prune unused columns upstream")
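
The input check in `prepare_employee_summary` can be factored into a small reusable guard. The block below is a minimal sketch, not part of PySpark: `require_columns`, `missing_columns`, and `MissingColumnsError` are hypothetical names, and the helper only assumes the object exposes a `.columns` list, as PySpark DataFrames do. It raises an exception rather than using `assert`, because assertions are skipped when Python runs with `-O`.

```python
from typing import Iterable, List


class MissingColumnsError(ValueError):
    """Raised when a DataFrame lacks columns that a transformation needs."""


def missing_columns(df, required: Iterable[str]) -> List[str]:
    """Return the required column names absent from df.columns, sorted."""
    present = set(df.columns)
    return sorted(set(required) - present)


def require_columns(df, required: Iterable[str]):
    """Return df unchanged if every required column is present.

    Hypothetical helper: works with any object exposing `.columns`.
    """
    missing = missing_columns(df, required)
    if missing:
        raise MissingColumnsError(f"Missing required columns: {missing}")
    return df
```

Because the helper returns its input, it can sit inside a chain, for example `df.transform(lambda d: require_columns(d, {"employee_id", "salary"}))`, so the contract is checked at the point where the pipeline starts.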

## Advanced: Select Statement Guidelines

Best practices for structuring select statements:

In [None]:
print("=== Select Statement Best Practices ===")

# ❌ BAD: Complex logic inside select (hard to test)
bad_select = employees_df.select(
    F.col("employee_id"),
    F.col("name"),
    F.when(F.col("salary") < 75000, "L1").when(F.col("salary") < 100000, "L2").when(F.col("salary") < 150000, "L3").otherwise("L4").alias("salary_grade"),
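    (F.col("salary") * F.when(F.col("job_level") == "Senior", 0.15).when(F.col("job_level") == "Lead", 0.20).when(F.col("job_level") == "Principal", 0.25).otherwise(0.10)).alias("bonus"),
    # The rate expression is duplicated so that total_comp agrees with bonus
    (F.col("salary") * (1 + F.when(F.col("job_level") == "Senior", 0.15).when(F.col("job_level") == "Lead", 0.20).when(F.col("job_level") == "Principal", 0.25).otherwise(0.10))).alias("total_comp")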
)

print("❌ Complex select (hard to read):")
bad_select.show(3)

# ✅ GOOD: Simple select with one function per column
# First, compute complex columns
computed_df = (employees_df
    .transform(add_salary_grade)
    .transform(calculate_bonus)
    .transform(calculate_total_compensation)
)

# Then, select with simple column references
good_select = computed_df.select(
    F.col("employee_id"),
    F.col("name"),
    F.col("department"),
    F.col("salary"),
    F.col("salary_grade"),
    F.col("bonus"),
    F.col("total_comp")
)

print("\n✅ Simple select (readable):")
good_select.show(3)

print("\n✅ Select statement guidelines:")
print("  1. Keep select statements simple")
print("  2. One function per selected column (ideally just F.col())")
print("  3. Complex expressions → separate withColumn or function")
print("  4. Use select at beginning (input schema) and end (output schema)")
print("  5. Avoid nesting complex when() logic in select")
print("  6. Use aliases for clarity")

## Chaining Limit Guidelines

Balance between composition power and readability:

In [None]:
print("=== Chaining Limit Guidelines ===")

# ✅ GOOD: Chain of 5 or fewer operations
short_chain = (employees_df
    .filter(F.col("department") == "Engineering")
    .withColumn("bonus", F.col("salary") * 0.10)
    .select("employee_id", "name", "salary", "bonus")
    .orderBy(F.desc("salary"))
)

print("✅ Good: Short chain (4 operations)")
print("Easy to read and understand at a glance")

# ⚠️  ACCEPTABLE: Longer chain with .transform() and named functions
medium_chain = (employees_df
    .filter(F.col("department").isin(["Engineering", "Sales"]))
    .filter(F.col("years_experience") >= 3)
    .transform(add_salary_grade)  # Named function - self-documenting
    .transform(calculate_bonus)    # Named function - self-documenting
    .transform(calculate_total_compensation)  # Named function
    .select("employee_id", "name", "department", "salary_grade", "total_comp")
    .orderBy(F.desc("total_comp"))
)

print("\n⚠️  Acceptable: Medium chain (7 operations)")
print("Still readable due to named .transform() functions")

# ✅ BEST: Break very long chains into logical steps
# Step 1: Filtering
filtered_df = (employees_df
    .filter(F.col("department").isin(["Engineering", "Sales"]))
    .filter(F.col("years_experience") >= 3)
)

# Step 2: Enrichment
enriched_df = (filtered_df
    .transform(add_salary_grade)
    .transform(calculate_bonus)
    .transform(calculate_total_compensation)
    .transform(add_tenure_metrics)
)

# Step 3: Aggregation
aggregated_df = (enriched_df
    .groupBy("department", "salary_grade")
    .agg(
        F.avg("total_comp").alias("avg_compensation"),
        F.count("*").alias("employee_count"),
        F.max("tenure_months").alias("max_tenure")
    )
)

# Step 4: Output formatting
final_df = (aggregated_df
    .orderBy(F.desc("avg_compensation"))
    .select(
        F.col("department"),
        F.col("salary_grade"),
        F.col("employee_count"),
        F.round("avg_compensation", 2).alias("avg_compensation"),
        F.round("max_tenure", 0).alias("max_tenure_months")
    )
)

print("\n✅ Best: Logical grouping of operations")
final_df.show(5)

print("\n📋 Chaining Guidelines:")
print("  ✅ 1-5 operations: Single chain is fine")
print("  ⚠️  6-10 operations: Use .transform() with named functions")
print("  ❌ More than 10 operations: Break into logical step variables")
print("  💡 Use meaningful variable names for intermediate steps")
print("  💡 Group operations by logical phase (filter, enrich, aggregate, format)")

## Higher-Order Functions for Composition

Create composable transformation pipelines:

In [None]:
print("=== Higher-Order Function Composition ===")

def compose(*functions: Callable[[DataFrame], DataFrame]) -> Callable[[DataFrame], DataFrame]:
    """
    Compose multiple transformation functions into a single function.
    Higher-order function that returns a function.
    
    Example:
        pipeline = compose(add_salary_grade, calculate_bonus, calculate_total_compensation)
        result = pipeline(df)
    """
    def composed_function(df: DataFrame) -> DataFrame:
        result = df
        for func in functions:
            result = func(result)
        return result
    return composed_function

# Create reusable pipelines
compensation_pipeline = compose(
    add_salary_grade,
    calculate_bonus,
    calculate_total_compensation
)

full_enrichment_pipeline = compose(
    add_salary_grade,
    calculate_bonus,
    calculate_total_compensation,
    add_tenure_metrics,
    identify_high_performers
)

# Use the composed pipelines
print("Using composed compensation pipeline:")
comp_result = compensation_pipeline(employees_df)
comp_result.select("employee_id", "name", "salary", "salary_grade", "bonus", "total_comp").show(5)

print("\nUsing full enrichment pipeline:")
full_result = full_enrichment_pipeline(employees_df)
print(f"Added columns: {[c for c in full_result.columns if c not in employees_df.columns]}")

print("\n✅ Benefits of higher-order composition:")
print("  - Reusable transformation pipelines")
print("  - Declarative pipeline definitions")
print("  - Easy to test composed pipelines")
print("  - Can be parameterized and customized")
print("  - Functional programming pattern")
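
The point about parameterized and customized pipelines can be made concrete with `functools.partial` and a small conditional wrapper. This is a sketch under stated assumptions: `when_enabled` is a hypothetical helper, `compose` here is a `reduce`-based variant of the function above, and plain lists stand in for DataFrames so the example runs without a Spark session. The same shapes apply to `DataFrame -> DataFrame` functions.

```python
from functools import partial, reduce
from typing import Callable, TypeVar

T = TypeVar("T")
Step = Callable[[T], T]


def compose(*steps: Step) -> Step:
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def when_enabled(enabled: bool, step: Step) -> Step:
    """Return step if enabled, otherwise an identity step (hypothetical helper)."""
    return step if enabled else (lambda value: value)


# Stand-ins for DataFrame transformations, using plain lists of numbers.
def scale(values, factor):
    return [v * factor for v in values]


def drop_below(values, minimum):
    return [v for v in values if v >= minimum]


pipeline = compose(
    partial(scale, factor=2),                            # bind the parameter up front
    when_enabled(True, partial(drop_below, minimum=5)),  # applied
    when_enabled(False, partial(scale, factor=100)),     # skipped
)

print(pipeline([1, 2, 3, 4]))  # [6, 8]
```

Binding parameters with `partial` keeps every step a one-argument function, which is the shape both `compose` and `.transform()` expect.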

## Pipeline Builder Pattern

Advanced pattern for configurable transformation pipelines:

In [None]:
print("=== Pipeline Builder Pattern ===")

class TransformationPipeline:
    """
    Builder pattern for composing transformation pipelines.
    Fluent API for readable pipeline construction.
    """
    
    def __init__(self, df: DataFrame):
        self._df = df
        self._transformations: List[Callable[[DataFrame], DataFrame]] = []
    
    def add_transformation(self, func: Callable[[DataFrame], DataFrame]) -> 'TransformationPipeline':
        """Add a transformation function to the pipeline."""
        self._transformations.append(func)
        return self  # Return self for chaining
    
    def filter_by(self, condition) -> 'TransformationPipeline':
        """Add a filter operation."""
        def filter_func(df: DataFrame) -> DataFrame:
            return df.filter(condition)
        return self.add_transformation(filter_func)
    
    def with_columns(self, **col_definitions) -> 'TransformationPipeline':
        """Add multiple columns."""
        def add_cols(df: DataFrame) -> DataFrame:
            result = df
            for col_name, col_expr in col_definitions.items():
                result = result.withColumn(col_name, col_expr)
            return result
        return self.add_transformation(add_cols)
    
    def select_columns(self, *columns) -> 'TransformationPipeline':
        """Add a select operation."""
        def select_func(df: DataFrame) -> DataFrame:
            return df.select(*columns)
        return self.add_transformation(select_func)
    
    def build(self) -> DataFrame:
        """Apply all transformations in order and return the resulting DataFrame (still lazy until an action runs)."""
        result = self._df
        for transform in self._transformations:
            result = transform(result)
        return result

# Use the builder pattern
pipeline_result = (
    TransformationPipeline(employees_df)
    .filter_by(F.col("department") == "Engineering")
    .filter_by(F.col("years_experience") >= 5)
    .add_transformation(add_salary_grade)
    .add_transformation(calculate_bonus)
    .add_transformation(calculate_total_compensation)
    .with_columns(
        comp_ratio=F.col("total_comp") / F.col("salary"),
        is_senior_eng=(F.col("job_level").isin(["Senior", "Lead", "Principal"]))
    )
    .select_columns(
        "employee_id", "name", "job_level", "salary", 
        "salary_grade", "total_comp", "comp_ratio"
    )
    .build()
)

print("Pipeline builder result:")
pipeline_result.show(5)

print("\n✅ Builder pattern benefits:")
print("  - Fluent, readable API")
print("  - Separates pipeline construction from execution")
print("  - Easy to add conditional transformations")
print("  - Can inspect pipeline before execution")
print("  - Reusable pipeline templates")
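
Two of the listed benefits, conditional steps and inspecting the pipeline before it runs, are not exposed as methods by `TransformationPipeline` as written. One way they could be added is sketched below. `InspectablePipeline`, `add_if`, and `describe` are hypothetical names, and the class is value-agnostic, so it is demonstrated with plain lists rather than DataFrames.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]


class InspectablePipeline:
    """Builder that records a name for each step and supports conditional steps."""

    def __init__(self, source: Any):
        self._source = source
        self._steps: List[Tuple[str, Step]] = []

    def add(self, func: Step, name: str = "") -> "InspectablePipeline":
        # Fall back to the function's own name when no label is given.
        self._steps.append((name or getattr(func, "__name__", repr(func)), func))
        return self

    def add_if(self, condition: bool, func: Step, name: str = "") -> "InspectablePipeline":
        # Register the step only when the condition holds.
        return self.add(func, name) if condition else self

    def describe(self) -> List[str]:
        """List step names in application order, without running anything."""
        return [name for name, _ in self._steps]

    def build(self) -> Any:
        result = self._source
        for _, func in self._steps:
            result = func(result)
        return result


def drop_negatives(values):
    return [v for v in values if v >= 0]


def double(values):
    return [v * 2 for v in values]


pipeline = (
    InspectablePipeline([3, -1, 2])
    .add(drop_negatives)
    .add_if(False, double, name="double_disabled")
    .add(double)
)

print(pipeline.describe())  # ['drop_negatives', 'double']
print(pipeline.build())     # [6, 4]
```

Storing a name next to each function is a design choice: it makes `describe()` useful for logging, at the cost of a slightly larger API than the original builder.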

## Handling Complex Business Logic

Strategies for managing complex transformations:

In [None]:
print("=== Handling Complex Business Logic ===")

# Complex business rule: Employee rating system
def calculate_employee_rating(df: DataFrame) -> DataFrame:
    """
    Complex business logic broken into readable components.
    """
    # Step 1: Salary score
    df = df.withColumn("salary_score",
        F.when(F.col("salary") > 150000, 5)
         .when(F.col("salary") > 120000, 4)
         .when(F.col("salary") > 90000, 3)
         .when(F.col("salary") > 60000, 2)
         .otherwise(1)
    )
    
    # Step 2: Experience score
    df = df.withColumn("experience_score",
        F.when(F.col("years_experience") > 15, 5)
         .when(F.col("years_experience") > 10, 4)
         .when(F.col("years_experience") > 5, 3)
         .when(F.col("years_experience") > 2, 2)
         .otherwise(1)
    )
    
    # Step 3: Job level score
    job_level_scores = {
        "Junior": 1,
        "Mid": 2,
        "Senior": 3,
        "Lead": 4,
        "Principal": 5
    }
    
    mapping_expr = F.create_map([F.lit(x) for pair in job_level_scores.items() for x in pair])
    df = df.withColumn("level_score", mapping_expr[F.col("job_level")])
    
    # Step 4: Calculate overall rating
    df = df.withColumn("overall_rating",
        (
            (F.col("salary_score") * 0.4) +
            (F.col("experience_score") * 0.3) +
            (F.col("level_score") * 0.3)
        )
    )
    
    # Step 5: Rating category
    df = df.withColumn("rating_category",
        F.when(F.col("overall_rating") >= 4.5, "Exceptional")
         .when(F.col("overall_rating") >= 3.5, "Strong")
         .when(F.col("overall_rating") >= 2.5, "Solid")
         .when(F.col("overall_rating") >= 1.5, "Developing")
         .otherwise("Entry")
    )
    
    return df

# Apply complex business logic
rated_employees = calculate_employee_rating(employees_df)

print("Employee ratings:")
rated_employees.select(
    "employee_id", "name", "job_level", "years_experience", "salary",
    "salary_score", "experience_score", "level_score", 
    "overall_rating", "rating_category"
).show(10)

# Distribution of ratings
print("\nRating distribution:")
rated_employees.groupBy("rating_category").count().orderBy(F.desc("count")).show()

print("\n✅ Complex logic best practices:")
print("  - Break into logical steps with comments")
print("  - Use intermediate columns for clarity")
print("  - Extract to separate function")
print("  - Document business rules")
print("  - Make weights/thresholds configurable")
print("  - Test each component separately")
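
The advice to make weights and thresholds configurable can be sketched by moving them into plain data. The block below is a pure-Python reference of the rating rule, assuming the same cut-offs and weights as `calculate_employee_rating`; `RatingConfig`, `threshold_score`, and `rate_employee` are hypothetical names. A reference like this could serve as an oracle when testing the Spark version on a handful of rows.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class RatingConfig:
    # A value scores 1 + the number of thresholds it strictly exceeds.
    salary_thresholds: Tuple[float, ...] = (60000, 90000, 120000, 150000)
    experience_thresholds: Tuple[float, ...] = (2, 5, 10, 15)
    level_scores: Dict[str, int] = field(default_factory=lambda: {
        "Junior": 1, "Mid": 2, "Senior": 3, "Lead": 4, "Principal": 5,
    })
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)  # salary, experience, level
    # A rating falls into the highest category whose cut-off it reaches.
    category_cutoffs: Tuple[float, ...] = (1.5, 2.5, 3.5, 4.5)
    categories: Tuple[str, ...] = ("Entry", "Developing", "Solid", "Strong", "Exceptional")


def threshold_score(value: float, thresholds: Tuple[float, ...]) -> int:
    """Mirror a chain of `when(value > t, ...)` checks with strict comparisons."""
    return 1 + bisect_left(thresholds, value)


def rate_employee(salary: float, years: int, level: str,
                  config: RatingConfig = RatingConfig()) -> Tuple[float, str]:
    w_salary, w_exp, w_level = config.weights
    rating = (
        w_salary * threshold_score(salary, config.salary_thresholds)
        + w_exp * threshold_score(years, config.experience_thresholds)
        + w_level * config.level_scores[level]
    )
    # bisect_right counts cut-offs <= rating, matching the `>=` comparisons.
    category = config.categories[bisect_right(config.category_cutoffs, rating)]
    return rating, category


rating, category = rate_employee(130000, 7, "Senior")
print(round(rating, 2), category)  # 3.4 Solid
```

Changing a weight or cut-off then means constructing a different `RatingConfig` rather than editing the transformation itself.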

## Performance Considerations

How chaining and composition affect performance:

In [None]:
print("=== Performance Considerations ===")

import time

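# Create a larger dataset for performance testing; cache and materialize it
# so the first timed approach does not also pay for data creation
large_df = create_employee_data(5000).cache()
large_df.count()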

# Approach 1: Single long chain
start = time.time()
result1 = (
    large_df
    .filter(F.col("department") == "Engineering")
    .withColumn("bonus", F.col("salary") * 0.10)
    .withColumn("total_comp", F.col("salary") + F.col("bonus"))
    .groupBy("job_level")
    .agg(F.avg("total_comp").alias("avg_comp"))
).count()  # Trigger action
time1 = time.time() - start

# Approach 2: Broken into steps
start = time.time()
filtered = large_df.filter(F.col("department") == "Engineering")
with_bonus = filtered.withColumn("bonus", F.col("salary") * 0.10)
with_total = with_bonus.withColumn("total_comp", F.col("salary") + F.col("bonus"))
result2 = with_total.groupBy("job_level").agg(F.avg("total_comp").alias("avg_comp"))
count2 = result2.count()  # Trigger action
time2 = time.time() - start

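# Approach 3: Using .transform()
# Note: calculate_bonus applies level-based rates, so its plan differs slightly from approaches 1 and 2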
start = time.time()
result3 = (
    large_df
    .filter(F.col("department") == "Engineering")
    .transform(calculate_bonus)
    .transform(calculate_total_compensation)
    .groupBy("job_level")
    .agg(F.avg("total_comp").alias("avg_comp"))
).count()  # Trigger action
time3 = time.time() - start

print(f"Approach 1 (single chain): {time1:.3f}s")
print(f"Approach 2 (broken steps): {time2:.3f}s")
print(f"Approach 3 (.transform()): {time3:.3f}s")

print("\n📊 Performance insights:")
print("  - Approaches 1 and 2 build the same query plan; timing gaps come from warm-up and noise")
print("  - Catalyst optimizer analyzes entire transformation graph")
print("  - Breaking into steps doesn't hurt performance")
print("  - Choose based on readability, not performance")
print("  - .transform() is an ordinary Python call made while the plan is built")
print("  - Real performance gains come from:")
print("    • Predicate pushdown")
print("    • Column pruning")
print("    • Using built-in functions vs UDFs")
print("    • Proper partitioning and caching")
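
Single wall-clock measurements like the ones above are noisy, and the first Spark action typically pays one-off start-up costs. A slightly more careful approach is sketched below. `time_action` and `compare` are hypothetical helpers: they accept any zero-argument callable (with Spark, something like `lambda: result_df.count()`), discard warm-up runs, and report the median. Plain Python workloads stand in for Spark actions so the block runs on its own.

```python
import statistics
import time
from typing import Callable, Dict


def time_action(action: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds of `action` over `repeats` runs.

    Warm-up runs are executed first and their timings discarded.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        action()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(actions: Dict[str, Callable[[], object]], **kwargs) -> Dict[str, float]:
    """Time several named actions with the same settings."""
    return {name: time_action(fn, **kwargs) for name, fn in actions.items()}


# Stand-in workloads; with Spark these would be e.g. `lambda: some_df.count()`.
timings = compare({
    "sum_range": lambda: sum(range(10_000)),
    "sorted_range": lambda: sorted(range(10_000), reverse=True),
}, repeats=3)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```

For comparing chaining styles specifically, inspecting the plans printed by `DataFrame.explain()` is often more informative than timing, since two chains that produce the same plan do the same work.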

## Summary

**Key Takeaways:**

1. **Chaining Fundamentals**:
   - Leverage immutability for natural method chaining
   - Lazy evaluation defers all data processing until an action runs, so long chains cost nothing extra to build
   - Catalyst optimizer analyzes entire transformation graph

2. **Readability Guidelines**:
   - Keep chains to 5 operations or fewer
   - Use `.transform()` with named functions for longer chains
   - Break very long chains into logical step variables
   - Group operations by phase (filter, enrich, aggregate, format)

3. **Schema Contracts**:
   - Use `select` at pipeline boundaries for explicit schemas
   - Keep select statements simple (one function per column)
   - Extract complex logic to separate functions
   - Document input/output expectations

4. **Composition Patterns**:
   - Pure functions for reusable transformations
   - `.transform()` for clean pipeline composition
   - Higher-order functions for composable pipelines
   - Builder pattern for fluent API construction

5. **Complex Logic Management**:
   - Break complex rules into logical steps
   - Use intermediate columns for clarity
   - Document business logic thoroughly
   - Make thresholds and weights configurable

**Best Practices for Chaining and Composition**:
- Prioritize readability over cleverness
- Extract complex logic into named functions
- Use explicit schema contracts with select
- Compose transformations functionally
- Test individual transformations independently
- Trust Spark's optimizer - don't micro-optimize chains
- Document complex business rules
- Use consistent patterns across team

**Next Steps**: In Section 3, we'll explore test-first development patterns to ensure our composed transformations work correctly.

## Exercise

Practice effective chaining and composition:

1. Take a complex transformation and break it into modular functions
2. Create a pipeline using `.transform()` with your functions
3. Implement schema contracts with select statements
4. Build a higher-order compose function
5. Create a pipeline builder for your domain
6. Document input/output schemas for your transformations
7. Refactor a long chain into logical step variables

In [None]:
# Your exercise code here

# 1. Create modular transformation functions
def your_transformation_1(df: DataFrame) -> DataFrame:
    """Your first transformation"""
    # Your implementation (placeholder returns the input unchanged)
    return df

def your_transformation_2(df: DataFrame) -> DataFrame:
    """Your second transformation"""
    # Your implementation (placeholder returns the input unchanged)
    return df

# 2. Compose into a pipeline
def your_pipeline(df: DataFrame) -> DataFrame:
    """Your composed pipeline"""
    return (df
        .transform(your_transformation_1)
        .transform(your_transformation_2)
        # Add more transformations
    )

# 3. Add schema contract
def your_pipeline_with_schema(df: DataFrame) -> DataFrame:
    """Pipeline with explicit output schema"""
    result = your_pipeline(df)
    
    # Output schema contract
    return result.select(
        # Your output columns
    )

# 4. Test your pipeline
# result = your_pipeline_with_schema(your_data)
# result.show()