# 2.2 Leveraging PySpark's Built-in Functions and Higher-Order Functions

This notebook demonstrates how to maximize performance and maintain functional purity by using PySpark's built-in functions and higher-order functions instead of UDFs.

## Learning Objectives
- Understand the performance benefits of built-in functions over UDFs
- Learn to use higher-order functions for complex array operations
- Practice transforming UDF-based code to built-in function equivalents
- Explore advanced built-in functions for common data processing tasks

## Performance Comparison: Built-ins vs UDFs

PySpark's built-in functions compile to Catalyst expressions, so the optimizer can analyze them and they run entirely in the JVM. A Python UDF is opaque to Catalyst, and every row it touches must be serialized to a Python worker process and back.

In [None]:
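from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql.functions import udf, pandas_udf
import time

# Reuse the notebook's session if one exists, otherwise start one
spark = SparkSession.builder.getOrCreate()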

# Create sample data for performance comparison
large_data = [(i, f"user_{i}", i * 10.5, f"category_{i % 5}") for i in range(10000)]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("username", StringType(), True),
    StructField("value", DoubleType(), True),
    StructField("category", StringType(), True)
])

large_df = spark.createDataFrame(large_data, schema)
large_df.cache()  # Cache for fair performance comparison

print("Sample data created:")
large_df.show(5)
print(f"Total records: {large_df.count()}")

## Comparison: Built-in Functions vs UDFs

Let's compare the performance of built-in functions versus UDFs for common operations:

In [None]:
print("=== Performance Comparison: Built-in vs UDF ===")

# Task: Calculate a derived value based on input

# ❌ UDF Approach (slower)
def calculate_score_udf(value, id_val):
    """Python UDF - runs in Python interpreter"""
    return (value * 1.5 + id_val * 2) / 10

calculate_score_udf_func = udf(calculate_score_udf, DoubleType())

def test_udf_performance():
    start_time = time.time()
    result = large_df.withColumn("score_udf", 
                                calculate_score_udf_func(F.col("value"), F.col("id")))
    # Aggregate over the new column to force execution: a bare count() can let
    # Catalyst prune the unused projection, so the UDF would never run
    count = result.agg(F.count("score_udf")).collect()[0][0]
    end_time = time.time()
    return end_time - start_time, count

# ✅ Built-in Functions Approach (faster)
def test_builtin_performance():
    start_time = time.time()
    result = large_df.withColumn("score_builtin", 
                                (F.col("value") * 1.5 + F.col("id") * 2) / 10)
    # Same action as the UDF test so both columns are actually evaluated
    count = result.agg(F.count("score_builtin")).collect()[0][0]
    end_time = time.time()
    return end_time - start_time, count

# Performance test
udf_time, udf_count = test_udf_performance()
builtin_time, builtin_count = test_builtin_performance()

print(f"UDF Performance: {udf_time:.3f} seconds for {udf_count:,} records")
print(f"Built-in Performance: {builtin_time:.3f} seconds for {builtin_count:,} records")
print(f"Performance improvement: {udf_time/builtin_time:.1f}x faster with built-ins")

# Verify results are equivalent
comparison_df = (large_df
                .withColumn("score_udf", calculate_score_udf_func(F.col("value"), F.col("id")))
                .withColumn("score_builtin", (F.col("value") * 1.5 + F.col("id") * 2) / 10)
                .withColumn("difference", F.abs(F.col("score_udf") - F.col("score_builtin"))))

max_diff = comparison_df.agg(F.max("difference")).collect()[0][0]
print(f"\nResults verification - Maximum difference: {max_diff} (should be ~0)")
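
Single wall-clock measurements like the ones above are noisy: the first action also pays for JVM warm-up and Python worker start-up. A minimal sketch of a fairer harness follows. `best_of` is a hypothetical helper, not part of PySpark, and the stand-in workload exists only so the block runs without a Spark session.

```python
import time

def best_of(action, repeats=3, warmup=1):
    """Run `action` several times and return the fastest wall-clock time in seconds.

    `action` is any zero-argument callable, for example a lambda that
    triggers a Spark job.
    """
    for _ in range(warmup):
        action()  # discard warm-up runs (JIT compilation, worker start-up)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; each call is recorded so the number of runs is visible
calls = []
elapsed = best_of(lambda: calls.append(sum(range(10_000))), repeats=3, warmup=1)
print(f"best of 3: {elapsed:.6f}s ({len(calls)} runs including warm-up)")
```

With a DataFrame, the action could be something like `lambda: df.agg(F.count("score_udf")).collect()`, which aggregates over the new column so that it is actually computed.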

## Built-in Functions for Common Operations

Let's explore the rich set of built-in functions available in PySpark:

In [None]:
print("=== Built-in Functions Showcase ===")

# Create test data with various data types
test_data = [
    (1, "John Doe", "john.doe@email.com", "2023-01-15", 1200.50, "USD", ["python", "spark", "sql"]),
    (2, "jane smith", "JANE.SMITH@EMAIL.COM", "2023-02-20", 1500.75, "EUR", ["java", "scala"]),
    (3, "Bob Johnson", "bob@company.org", "2023-03-10", 980.25, "USD", ["python", "pandas", "numpy"]),
    (4, "Alice Brown", "alice.brown@tech.io", "2023-04-05", 2200.00, "GBP", ["spark", "hadoop", "kafka"])
]

test_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("date", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("currency", StringType(), True),
    StructField("skills", ArrayType(StringType()), True)
])

test_df = spark.createDataFrame(test_data, test_schema)
print("Test dataset:")
test_df.show(truncate=False)

In [None]:
print("\n=== String Functions ===")

string_transformations = (test_df
    # String case operations
    .withColumn("name_upper", F.upper(F.col("name")))
    .withColumn("name_lower", F.lower(F.col("name")))
    .withColumn("name_title", F.initcap(F.col("name")))
    
    # String manipulation
    .withColumn("name_length", F.length(F.col("name")))
    .withColumn("first_name", F.split(F.col("name"), " ").getItem(0))
    .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))
    
    # Pattern matching
    .withColumn("is_gmail", F.col("email").contains("gmail"))
    .withColumn("email_valid", F.col("email").rlike(r"^[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}$"))
    
    .select("name", "name_title", "first_name", "email", "email_domain", "is_gmail", "email_valid")
)

string_transformations.show(truncate=False)

In [None]:
print("\n=== Date and Numeric Functions ===")

date_numeric_transformations = (test_df
    # Date functions
    .withColumn("date_parsed", F.to_date(F.col("date"), "yyyy-MM-dd"))
    .withColumn("year", F.year(F.col("date_parsed")))
    .withColumn("month", F.month(F.col("date_parsed")))
    .withColumn("day_of_year", F.dayofyear(F.col("date_parsed")))
    
    # Numeric functions
    .withColumn("amount_rounded", F.round(F.col("amount"), 0))
    .withColumn("amount_ceil", F.ceil(F.col("amount")))
    .withColumn("amount_floor", F.floor(F.col("amount")))
    .withColumn("amount_log", F.log(F.col("amount")))
    
    .select("id", "date", "year", "month", "day_of_year", 
           "amount", "amount_rounded", "amount_ceil", "amount_floor")
)

date_numeric_transformations.show()

In [None]:
print("\n=== Conditional Logic with Built-ins ===")

conditional_transformations = (test_df
    # Complex when/otherwise logic
    .withColumn("amount_category",
               F.when(F.col("amount") < 1000, "Low")
                .when(F.col("amount") < 2000, "Medium")
                .otherwise("High"))
    
    # Case-insensitive matching
    .withColumn("currency_region",
               F.when(F.upper(F.col("currency")) == "USD", "North America")
                .when(F.upper(F.col("currency")) == "EUR", "Europe")
                .when(F.upper(F.col("currency")) == "GBP", "UK")
                .otherwise("Other"))
    
    # Null handling
    .withColumn("safe_amount", F.coalesce(F.col("amount"), F.lit(0.0)))
    .withColumn("amount_or_default", F.when(F.col("amount").isNull(), F.lit(999.99)).otherwise(F.col("amount")))
    
    .select("id", "amount", "amount_category", "currency", "currency_region")
)

conditional_transformations.show()

## Higher-Order Functions for Array Operations

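PySpark (3.1 and later in the Python API) provides higher-order functions such as `transform`, `filter`, `exists`, and `aggregate` for array and map columns. Each takes a Python lambda that Spark translates into a JVM expression once, so no Python code runs per row: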

In [None]:
print("=== Higher-Order Functions for Arrays ===")

# Array operations using higher-order functions
array_operations = (test_df
    # Transform array elements
    .withColumn("skills_upper", 
               F.transform(F.col("skills"), lambda x: F.upper(x)))
    
    # Filter array elements
    .withColumn("python_skills", 
               F.filter(F.col("skills"), lambda x: x.contains("python")))
    
    # Check if array contains elements matching condition
    .withColumn("has_python", 
               F.exists(F.col("skills"), lambda x: x == "python"))
    
    # Aggregate array elements
    .withColumn("skills_count", F.size(F.col("skills")))
    .withColumn("skills_concat", 
               F.array_join(F.col("skills"), ", "))
    
    # Array sorting and distinct
    .withColumn("skills_sorted", F.array_sort(F.col("skills")))
    .withColumn("skills_distinct", F.array_distinct(F.col("skills")))
    
    .select("id", "name", "skills", "skills_upper", "python_skills", 
           "has_python", "skills_count", "skills_concat")
)

array_operations.show(truncate=False)

In [None]:
print("\n=== Advanced Array Operations ===")

# More complex array transformations
advanced_array_ops = (test_df
    # Add skill levels (simulate complex transformation)
    .withColumn("skill_levels", 
               F.transform(F.col("skills"), 
                         lambda skill: F.concat(skill, F.lit("_advanced"))))
    
    # Conditional array transformation
    .withColumn("enhanced_skills",
               F.transform(F.col("skills"),
                         lambda skill: F.when(skill == "python", "python_expert")
                                        .when(skill == "spark", "spark_ninja")
                                        .otherwise(skill)))
    
    # Aggregate with reduce (simulate counting characters)
    .withColumn("total_skill_chars",
               F.aggregate(F.col("skills"), 
                         F.lit(0),  # Initial value
                         lambda acc, x: acc + F.length(x)))  # Accumulator function
    
    .select("id", "name", "skills", "skill_levels", "enhanced_skills", "total_skill_chars")
)

advanced_array_ops.show(truncate=False)

## Converting UDF Logic to Built-in Functions

Let's practice converting common UDF patterns to built-in function equivalents:

In [None]:
print("=== Converting UDFs to Built-in Functions ===")

# Example 1: Email validation
print("\n1. Email Validation:")

# ❌ UDF approach
def validate_email_udf(email):
    import re
    if email is None:
        return None  # mirror rlike, which yields null for null input
    pattern = r'^[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

validate_email_udf_func = udf(validate_email_udf, BooleanType())

# ✅ Built-in approach
def validate_email_builtin(df):
    return df.withColumn("email_valid_builtin", 
                        F.col("email").rlike(r'^[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}$'))

# Compare results
comparison1 = (test_df
              .withColumn("email_valid_udf", validate_email_udf_func(F.col("email")))
              .transform(validate_email_builtin)
              .select("email", "email_valid_udf", "email_valid_builtin"))

comparison1.show(truncate=False)

In [None]:
# Example 2: Complex scoring logic
print("\n2. Complex Scoring Logic:")

# ❌ UDF approach
def calculate_user_score_udf(amount, skills_count, email_domain):
    base_score = amount / 100
    skill_bonus = skills_count * 10
    domain_bonus = 5 if email_domain in ['gmail.com', 'company.org'] else 0
    # Cast to float: a Python int returned from a DoubleType UDF comes back as null
    return float(min(base_score + skill_bonus + domain_bonus, 100))

calculate_score_udf_func = udf(calculate_user_score_udf, DoubleType())

# ✅ Built-in approach
def calculate_user_score_builtin(df):
    return (df
            # Extract domain first
            .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))
            .withColumn("skills_count", F.size(F.col("skills")))
            
            # Calculate components using built-ins
            .withColumn("base_score", F.col("amount") / 100)
            .withColumn("skill_bonus", F.col("skills_count") * 10)
            .withColumn("domain_bonus", 
                       F.when(F.col("email_domain").isin(['gmail.com', 'company.org']), 5)
                        .otherwise(0))
            
            # Combine and cap at 100
            .withColumn("user_score_builtin", 
                       F.least(F.col("base_score") + F.col("skill_bonus") + F.col("domain_bonus"), 
                              F.lit(100))))

# Compare results
comparison2 = (test_df
              .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))
              .withColumn("skills_count", F.size(F.col("skills")))
              .withColumn("user_score_udf", 
                         calculate_score_udf_func(F.col("amount"), 
                                                 F.col("skills_count"), 
                                                 F.col("email_domain")))
              .transform(calculate_user_score_builtin)
              .select("id", "name", "amount", "skills_count", "email_domain",
                     "user_score_udf", "user_score_builtin"))

comparison2.show()

## When to Use Pandas UDFs

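Sometimes a UDF is unavoidable. In that case, prefer a Pandas UDF to a regular Python UDF: it exchanges data with the JVM in Arrow batches and operates on whole pandas objects instead of one row at a time: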

In [None]:
print("=== When UDFs Are Necessary - Use Pandas UDFs ===")

import pandas as pd

# Scenario: Complex statistical calculation not available as built-in
# Example: Calculate z-scores within groups

def add_group_zscore(pdf: pd.DataFrame) -> pd.DataFrame:
    """
    Grouped-map pandas function for complex statistical calculations.
    Receives all rows of one group as a pandas DataFrame, so the
    statistics are computed per group with vectorized operations.
    """
    values = pdf["value"]
    std_val = values.std()
    if pd.isna(std_val) or std_val == 0:
        return pdf.assign(zscore=0.0)
    return pdf.assign(zscore=(values - values.mean()) / std_val)

# Create test data with groups
group_data = [
    ("A", 10), ("A", 15), ("A", 20), ("A", 25), ("A", 30),
    ("B", 100), ("B", 110), ("B", 120), ("B", 130), ("B", 140),
    ("C", 5), ("C", 8), ("C", 12), ("C", 15), ("C", 18)
]

group_df = spark.createDataFrame(group_data, ["group", "value"])

# A Series-to-Series pandas UDF cannot be applied with .over(window), and it
# sees arbitrary batches rather than whole groups. The pandas function API
# applyInPandas hands each group to the function in one piece.
zscore_schema = StructType(group_df.schema.fields +
                           [StructField("zscore", DoubleType(), True)])

result_with_zscore = (group_df
                     .groupBy("group")
                     .applyInPandas(add_group_zscore, schema=zscore_schema))

print("Z-score calculation using applyInPandas:")
result_with_zscore.orderBy("group", "value").show()

print("\n✅ Use Pandas UDFs when:")
print("- Complex statistical operations not available as built-ins")
print("- Leveraging existing Pandas/NumPy libraries")
print("- Vectorized operations that can't be expressed with built-ins")
print("\n❌ Avoid regular Python UDFs due to row-by-row processing overhead")

## Building a Function Library

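The class below collects reusable transformation functions written only with built-in functions. Each one takes a DataFrame and returns a new one, so they chain naturally with `DataFrame.transform`.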
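
Functions of this shape are easy to combine into a single pipeline. The sketch below is one possible approach, not part of PySpark: `compose` and `add_field` are hypothetical helpers, and plain dictionaries stand in for DataFrames so the block runs without Spark. `functools.partial` binds the extra arguments so that every step takes exactly one input.

```python
from functools import partial, reduce

def compose(*steps):
    """Chain single-argument transformations left to right into one callable."""
    def run(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return run

# Stand-in transformation on a plain dict so the sketch runs without Spark
def add_field(record, key, value):
    return {**record, key: value}  # returns a new dict; the input is not mutated

pipeline = compose(
    partial(add_field, key="source", value="crm"),
    partial(add_field, key="version", value=2),
)

original = {"id": 1}
print(pipeline(original))  # {'id': 1, 'source': 'crm', 'version': 2}
print(original)            # {'id': 1}
```

With DataFrames, the same helper could wrap the `DataTransformations` methods from this section, for example `df.transform(compose(DataTransformations.standardize_names, partial(DataTransformations.categorize_amounts, thresholds=[500, 1500])))`.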

In [None]:
print("=== Building a Reusable Function Library ===")

class DataTransformations:
    """Library of pure transformation functions using built-in functions"""
    
    @staticmethod
    def standardize_names(df, name_column="name"):
        """Standardize name format to Title Case"""
        return df.withColumn(f"{name_column}_standardized", 
                           F.initcap(F.trim(F.col(name_column))))
    
    @staticmethod
    def extract_email_components(df, email_column="email"):
        """Extract username and domain from email"""
        return (df
               .withColumn("email_username", F.split(F.col(email_column), "@").getItem(0))
               .withColumn("email_domain", F.split(F.col(email_column), "@").getItem(1))
               .withColumn("email_tld", 
                          # element_at accepts negative indexes; getItem(-1) does not return the last element
                          F.element_at(F.split(F.col("email_domain"), "\\."), -1)))
    
    @staticmethod
    def categorize_amounts(df, amount_column="amount", 
                          thresholds=(1000, 2000, 5000)):
        """Categorize amounts into buckets"""
        condition = F.when(F.col(amount_column) < thresholds[0], "Low")
        
        for i, threshold in enumerate(thresholds[1:], 1):
            condition = condition.when(F.col(amount_column) < threshold, 
                                     f"Medium_{i}")
        
        return df.withColumn(f"{amount_column}_category", 
                           condition.otherwise("High"))
    
    @staticmethod
    def add_array_metrics(df, array_column="skills"):
        """Add various metrics for array columns"""
        return (df
               .withColumn(f"{array_column}_count", F.size(F.col(array_column)))
               .withColumn(f"{array_column}_unique_count", 
                          F.size(F.array_distinct(F.col(array_column))))
               .withColumn(f"{array_column}_joined", 
                          F.array_join(F.col(array_column), ", "))
               .withColumn(f"{array_column}_sorted", 
                          F.array_sort(F.col(array_column))))
    
    @staticmethod
    def add_date_features(df, date_column="date"):
        """Add comprehensive date-based features"""
        date_col = F.to_date(F.col(date_column), "yyyy-MM-dd")
        
        return (df
               .withColumn(f"{date_column}_parsed", date_col)
               .withColumn(f"{date_column}_year", F.year(date_col))
               .withColumn(f"{date_column}_month", F.month(date_col))
               .withColumn(f"{date_column}_quarter", F.quarter(date_col))
               .withColumn(f"{date_column}_day_of_week", F.dayofweek(date_col))
               .withColumn(f"{date_column}_day_of_year", F.dayofyear(date_col))
               .withColumn(f"{date_column}_is_weekend", 
                          F.dayofweek(date_col).isin([1, 7])))

# Demonstrate the function library
print("\nApplying transformation library:")

enhanced_df = (test_df
              .transform(DataTransformations.standardize_names)
              .transform(DataTransformations.extract_email_components)
              .transform(DataTransformations.categorize_amounts)
              .transform(DataTransformations.add_array_metrics)
              .transform(DataTransformations.add_date_features))

print(f"Original columns: {len(test_df.columns)}")
print(f"Enhanced columns: {len(enhanced_df.columns)}")
print(f"New columns added: {len(enhanced_df.columns) - len(test_df.columns)}")

# Show sample of enhanced data
enhanced_df.select("id", "name_standardized", "email_domain", "email_tld",
                  "amount_category", "skills_count", "date_quarter", "date_is_weekend").show()

## Performance Best Practices Summary

Let's summarize the key performance considerations:

In [None]:
print("=== Performance Best Practices Summary ===")

performance_table = [
    ("Built-in Functions", "High", "JVM execution, Catalyst optimization", "Always prefer"),
    ("Pandas UDFs", "Medium", "Vectorized, Arrow serialization", "When built-ins insufficient"),
    ("Python UDFs", "Low", "Row-by-row Python execution", "Avoid if possible"),
    ("Higher-order Functions", "High", "Array operations in JVM", "For complex array logic"),
    ("SQL Functions", "High", "Native Catalyst optimization", "Alternative to DataFrame API")
]

perf_schema = StructType([
    StructField("Function_Type", StringType(), True),
    StructField("Performance", StringType(), True),
    StructField("Execution_Details", StringType(), True),
    StructField("Usage_Recommendation", StringType(), True)
])

perf_df = spark.createDataFrame(performance_table, perf_schema)
perf_df.show(truncate=False)

print("\n🎯 Key Takeaways:")
print("1. Built-in functions are optimized by Catalyst and execute in JVM")
print("2. Higher-order functions enable complex array operations without UDFs")
print("3. When UDFs are necessary, prefer Pandas UDFs for vectorization")
print("4. Most data transformations can be expressed with built-in functions")
print("5. Built-ins usually outperform Python UDFs; measure the gap on your own workload")

## Summary

**Key Takeaways:**

1. **Built-in Function Advantages**:
   - JVM execution (no Python serialization overhead)
   - Catalyst optimizer can analyze and optimize
   - Significant performance improvements over UDFs

2. **Higher-Order Functions**:
   - Enable complex array and map operations
   - Functions like `transform()`, `filter()`, `exists()`, `aggregate()`
   - Maintain functional programming patterns

3. **UDF Guidelines**:
   - Avoid Python UDFs when possible
   - Use Pandas UDFs for vectorized operations when built-ins are insufficient
   - Reserve for truly complex logic not expressible with built-ins

4. **Function Library Design**:
   - Create reusable transformation functions using built-ins
   - Maintain functional purity and composability
   - Document performance characteristics

**Next Steps**: In the next notebook, we'll explore effective chaining and composition patterns to build complex, readable transformation pipelines.

## Exercise

Practice converting UDF logic to built-in functions:

1. Write a UDF that calculates a "risk score" based on multiple factors
2. Convert the same logic using only built-in functions
3. Compare performance between the two approaches
4. Create a higher-order function to process an array of risk factors
5. Build a reusable transformation function for your domain

In [None]:
# Your exercise code here

# 1. UDF approach
def calculate_risk_score_udf(amount, age, transaction_count):
    """Calculate risk score using UDF"""
    # Your UDF implementation
    pass

# 2. Built-in functions approach
def calculate_risk_score_builtin(df):
    """Calculate same risk score using built-in functions"""
    # Your built-in implementation
    pass

# 3. Performance comparison
def compare_performance(df):
    """Compare performance of both approaches"""
    # Your performance test
    pass

# 4. Higher-order function for arrays
def process_risk_factors(df, risk_factors_column):
    """Process array of risk factors using higher-order functions"""
    # Your higher-order function implementation
    pass

# Run your exercise
# compare_performance(your_test_data)