
## UDFs (User-Defined Functions) Overview

UDFs let you extend Spark with custom logic. Python offers three main styles: regular (pickled) UDFs, Pandas UDFs (vectorized UDFs), and the Arrow-optimized Python UDFs introduced in Spark 3.5.

### 1. Regular Python UDFs

Regular UDFs operate on a **row-by-row basis**, similar to a standard Python function applied iteratively to each element of a column.

*   **Definition:**
    *   Define a Python function.
    *   Wrap it with `pyspark.sql.functions.udf`.
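    *   **Specify the `returnType`** (e.g., `StringType()`, `IntegerType()`) so Spark knows the schema of the new column. If omitted, it defaults to `StringType()`, and a returned value that does not match the declared type typically comes back as null instead of raising an error.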

*   **Example (Python):**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, IntegerType

spark = SparkSession.builder.appName("RegularUDFs").getOrCreate()

data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

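# Define a Python function
def age_category(age):
    if age is None:  # SQL NULL arrives as None; None < 25 would raise TypeError
        return None
    if age < 25:
        return "Young"
    elif age < 35:  # age >= 25 is already guaranteed here
        return "Mid"
    else:
        return "Senior"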

# Register the Python function as a UDF
# Specify the returnType: StringType() in this case
age_category_udf = udf(age_category, StringType())

# Apply the UDF to the DataFrame
print("\nDataFrame with Age Category (using Regular UDF):")
df.withColumn("AgeCategory", age_category_udf(col("Age"))).show()

# Another UDF example: simple addition
def add_one(value):
    return value + 1

add_one_udf = udf(add_one, IntegerType())

print("\nDataFrame with Age + 1 (using Regular UDF):")
df.withColumn("AgePlusOne", add_one_udf(col("Age"))).show()

spark.stop()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+


DataFrame with Age Category (using Regular UDF):
+-------+---+-----------+
|   Name|Age|AgeCategory|
+-------+---+-----------+
|  Alice| 30|        Mid|
|    Bob| 25|        Mid|
|Charlie| 35|     Senior|
+-------+---+-----------+


DataFrame with Age + 1 (using Regular UDF):
+-------+---+----------+
|   Name|Age|AgePlusOne|
+-------+---+----------+
|  Alice| 30|        31|
|    Bob| 25|        26|
|Charlie| 35|        36|
+-------+---+----------+



*   **Performance Drawbacks (Why they are generally discouraged):**
    1.  **Serialization/Deserialization Overhead:** Input values are converted from Spark's internal (Tungsten) format into pickled Python objects, sent to a Python worker, and unpickled; results make the same trip back. This round trip is slow.
    2.  **Optimization Barrier (Black Box):** Spark's Catalyst Optimizer cannot "see inside" a regular UDF. It treats the call as a black box, so it cannot simplify or reorder the logic, cannot push a filter on the UDF's result down to the data source, and cannot include the UDF in whole-stage code generation.
    3.  **Python Process Overhead:** Executors run UDFs in separate Python worker processes, so every call path pays for starting, managing, and communicating between the JVM (Spark core) and Python. This cost applies to every kind of Python UDF, not only regular ones.
    4.  **No Vectorization:** Regular UDFs process data one row at a time, which is inefficient compared to vectorized operations that process data in batches.

*   **When to Use:** Sparingly, and only when a built-in Spark SQL function cannot achieve the desired logic, or for simple, non-performance-critical transformations on small datasets.
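
Because a regular UDF is just a Python function until it is wrapped, its logic can be checked without starting Spark. The sketch below is one possible way to do that for the `age_category` function from the example above; the `None` guard is an added assumption about how null ages should be handled, not something the original example specifies.

```python
from typing import Optional


def age_category(age: Optional[int]) -> Optional[str]:
    """Same bucketing logic as the UDF above, plus a guard for null ages."""
    if age is None:
        # Assumption: a null age should stay null (Spark passes SQL NULL as None).
        return None
    if age < 25:
        return "Young"
    if age < 35:
        return "Mid"
    return "Senior"


# Boundary checks run in plain Python; no SparkSession is required.
assert age_category(24) == "Young"
assert age_category(25) == "Mid"
assert age_category(34) == "Mid"
assert age_category(35) == "Senior"
assert age_category(None) is None

# Wrapping for Spark is unchanged from the example above:
# age_category_udf = udf(age_category, StringType())
```

Testing the boundaries (24, 25, 34, 35) this way catches off-by-one mistakes before they show up as wrong values in a distributed job.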

### 2. Pandas UDFs (Vectorized UDFs)

Pandas UDFs significantly improve performance by leveraging **Apache Arrow** for efficient data transfer and processing data in **batches** using Pandas Series/DataFrames.

*   **Definition:**
    *   Define a Python function that takes one or more Pandas Series as input.
    *   It **must return a Pandas Series of the same length** as the input.
    *   Decorate the function with `@pandas_udf(returnType)`, specifying a Spark SQL data type.
    *   Operates on `pandas.Series` objects, enabling vectorized operations.

*   **Example (Python):**


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType, StringType
import pandas as pd
import numpy as np # numpy is typically used with pandas for such operations

spark = SparkSession.builder.appName("PandasUDFs").getOrCreate()

data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

# Define a Pandas UDF (Series to Series)
@pandas_udf(LongType()) # Return type is LongType
def multiply_by_ten(series: pd.Series) -> pd.Series:
    return series * 10

print("\nDataFrame with Age * 10 (using Pandas UDF):")
df.withColumn("AgeTimesTen", multiply_by_ten(col("Age"))).show()

# Another Pandas UDF: categorize ages with vectorized NumPy conditions
@pandas_udf(StringType())
def categorize_age_pandas(ages: pd.Series) -> pd.Series:
    conditions = [
        ages < 25,
        (ages >= 25) & (ages < 35),
        ages >= 35
    ]
    choices = ["Young", "Mid", "Senior"]
    # np.select returns a NumPy array; wrap it in a Series that keeps the input's index
    return pd.Series(np.select(conditions, choices, default="Unknown"), index=ages.index)

print("\nDataFrame with Age Category (using Pandas UDF):")
df.withColumn("AgeCategoryPandas", categorize_age_pandas(col("Age"))).show()

spark.stop()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+


DataFrame with Age * 10 (using Pandas UDF):
+-------+---+-----------+
|   Name|Age|AgeTimesTen|
+-------+---+-----------+
|  Alice| 30|        300|
|    Bob| 25|        250|
|Charlie| 35|        350|
+-------+---+-----------+


DataFrame with Age Category (using Pandas UDF):
+-------+---+-----------------+
|   Name|Age|AgeCategoryPandas|
+-------+---+-----------------+
|  Alice| 30|              Mid|
|    Bob| 25|              Mid|
|Charlie| 35|           Senior|
+-------+---+-----------------+



*   **Performance Advantages (Vectorized, Arrow-based):**
    1.  **Vectorized Execution:** Processes batches of rows (as Pandas Series) instead of one row at a time. This allows efficient use of Pandas' optimized operations (often implemented in C).
    2.  **Apache Arrow Optimization:** Spark uses Apache Arrow, an in-memory columnar data format, for efficient data transfer between the JVM and Python processes. This minimizes serialization/deserialization overhead.
    3.  **Same Optimizer Limits, Lower Overhead:** A Pandas UDF is still a black box to the Catalyst Optimizer. The gain over regular UDFs comes from Arrow-based data transfer and batch execution, not from better query optimization.

*   **When to Use:**
    *   When existing Python libraries (like NumPy, Pandas, Scikit-learn) have functions well-suited for vectorized operations and are not available as Spark SQL functions.
    *   For complex custom logic that benefits significantly from batch processing.
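
To make the row-by-row versus vectorized distinction concrete, here is a minimal sketch in plain pandas and NumPy, with no Spark involved. The function names `age_category_rowwise` and `age_category_vectorized` and the sample ages are illustrative assumptions; the bucketing rules mirror the examples above.

```python
import numpy as np
import pandas as pd


def age_category_rowwise(age: int) -> str:
    """Row-at-a-time logic, as a regular UDF would run it."""
    if age < 25:
        return "Young"
    if age < 35:
        return "Mid"
    return "Senior"


def age_category_vectorized(ages: pd.Series) -> pd.Series:
    """Whole-batch logic, as the body of a Pandas UDF would run it."""
    # np.select picks the first matching condition for each element.
    labels = np.select([ages < 25, ages < 35], ["Young", "Mid"], default="Senior")
    return pd.Series(labels, index=ages.index)


ages = pd.Series([30, 25, 35, 18])

# One Python call per element.
row_wise = ages.apply(age_category_rowwise)

# One call for the whole batch.
batch = age_category_vectorized(ages)

# Both styles must agree on every row.
assert row_wise.tolist() == batch.tolist()
```

The vectorized version is the shape a function needs before it can be decorated with `@pandas_udf(StringType())`: one Series in, one Series of the same length out.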

### 3. Arrow-Optimized Python UDFs (Spark 3.5+)

Spark 3.5 added **Arrow-optimized Python UDFs**. They are ordinary row-at-a-time Python UDFs, but Spark moves data between the JVM and the Python worker with **Apache Arrow** instead of pickle. Spark does not turn a type-annotated Python function into a UDF on its own: the function must still be wrapped with `udf`, and the return type must still be declared.

*   **Key Features:**
    1.  **Opt-in:** Pass `useArrow=True` to `udf` for a single function, or set `spark.sql.execution.pythonUDF.arrow.enabled` to `true` for the whole session. PyArrow must be installed.
    2.  **Cheaper Data Transfer:** Arrow's columnar format reduces serialization cost compared with pickling.
    3.  **Same Programming Model:** The function still receives one value per call, so an existing regular UDF can usually be switched over without rewriting its body.

*   **Example (Python - Spark 3.5+):**


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("ArrowPythonUDFs").getOrCreate()

data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

# Arrow-optimized Python UDF (Spark 3.5+, requires PyArrow).
# The return type is still declared explicitly; the type hints are documentation only.
@udf(returnType=LongType(), useArrow=True)
def increment_age(age: int) -> int:
    return age + 1

print("\nDataFrame with Age + 1 (using Arrow-optimized Python UDF):")
df.withColumn("AgePlusOne", increment_age(col("Age"))).show()

# Not a UDF: a plain Python helper that builds a Column expression from built-ins.
# Catalyst can optimize it fully, so prefer this whenever the logic allows.
def get_age_status(age_col): # Pass the column object
    return when(age_col < 25, "Young") \
           .when((age_col >= 25) & (age_col < 35), "Adult") \
           .otherwise("Senior")

print("\nDataFrame with Age Status (using built-in when/otherwise):")
# Apply the helper to the column
df.withColumn("AgeStatus", get_age_status(col("Age"))).show()

spark.stop()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+


DataFrame with Age + 1 (using Arrow-optimized Python UDF):
+-------+---+----------+
|   Name|Age|AgePlusOne|
+-------+---+----------+
|  Alice| 30|        31|
|    Bob| 25|        26|
|Charlie| 35|        36|
+-------+---+----------+


DataFrame with Age Status (using built-in when/otherwise):
+-------+---+---------+
|   Name|Age|AgeStatus|
+-------+---+---------+
|  Alice| 30|    Adult|
|    Bob| 25|    Adult|
|Charlie| 35|   Senior|
+-------+---+---------+



*   **Important Note:** Arrow-optimized Python UDFs reduce data-transfer cost, but the function is still called once per row. For heavy numerical computation, **`@pandas_udf` usually remains the faster choice** because the function body itself is vectorized. Always profile your UDFs to determine the most efficient approach for your specific workload.

---

## UDF Comparison Table for Beginner Data Engineers

| Feature / UDF Type       | Regular UDF (Pickled)   | Pandas UDF (Vectorized)      | Arrow-Optimized Python UDF (Spark 3.5+)  |
| :----------------------- | :---------------------- | :--------------------------- | :--------------------------------------- |
| **Processing Mode**      | Row-by-row              | Batch (Pandas Series)        | Row-by-row (data moved in Arrow batches) |
| **Syntax**               | `udf(func, returnType)` | `@pandas_udf(returnType)`    | `udf(func, returnType, useArrow=True)`   |
| **Data Transfer**        | Pickled Python objects  | Apache Arrow                 | Apache Arrow                             |
| **Serialization Overhead**| High                   | Low                          | Low                                      |
| **Catalyst Optimization**| No (Black Box)          | No (Black Box)               | No (Black Box)                           |
| **Performance**          | Low                     | High                         | Medium                                   |
| **Key Use Case**         | Avoid if possible       | Complex vectorized logic, NumPy/Pandas/Scikit-learn integration | Row-wise custom logic that cannot be vectorized |
| **Recommendation**       | Avoid for large scale   | Preferred for high-perf and vectorized tasks | Reasonable default for row-wise UDFs, but profile! |

---

## Conclusion

Mastering UDFs is part of becoming a proficient Spark Data Engineer. Always prioritize built-in Spark SQL functions when available, as they are typically the most optimized. When custom logic is needed, favor Pandas UDFs for vectorizable work and Arrow-optimized Python UDFs (Spark 3.5+) for row-wise logic. Whichever you choose, profile your Spark jobs to confirm that the UDF approach performs well for your specific workload.
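
As a starting point for the profiling advice above, here is a minimal wall-clock timing helper. It is a sketch rather than a full profiler: `time_action` is a hypothetical name, and the Spark usage in the trailing comment assumes a `df` and UDFs like the ones defined earlier in this notebook.

```python
import time
from typing import Callable


def time_action(action: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of `action`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()  # must trigger the actual work, e.g. a Spark action
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the helper can be tried without Spark.
elapsed = time_action(lambda: sum(range(100_000)))
print(f"best of 3 runs: {elapsed:.6f} s")

# With Spark, wrap an action that forces the UDF to run on every row.
# The "noop" sink (Spark 3.0+) executes the full plan but writes nothing:
# time_action(lambda: df.withColumn("AgePlusOne", add_one_udf(col("Age")))
#                       .write.format("noop").mode("overwrite").save())
```

Timing an action matters because transformations such as `withColumn` are lazy; without an action, the UDF never runs and the measurement is meaningless.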