<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🧬 Serialization & Cross-Language UDFs

This notebook demonstrates how **heavy Python UDF usage** can slow down Spark jobs
and how to redesign the pipeline using:

- Built-in Spark SQL functions
- Vectorized (Pandas) UDFs with Arrow
- Efficient serialization formats (Avro)

The goal is to **improve performance** and **enable clean interoperability**
with a downstream microservice.


## 📂 Dataset

**Dataset Name:** `events_for_udf_50k.csv` 

### Example Columns
- `user_id`
- `country`
- `segment`
- `event_type`
- `amount`

This dataset simulates user activity events used to compute
a **risk / engagement score**.


## 🗂️ Scenario

Your Spark job:
- Processes a medium-sized events dataset
- Uses **Python UDFs heavily**
- Runs noticeably slower than equivalent Scala jobs

Additionally:
- The processed data must be sent to a **downstream microservice**
- The microservice expects **Avro-encoded data**

You want to:
- Reduce Python–JVM serialization overhead
- Improve Spark execution performance
- Use a standardized serialization format for interoperability

---

## 🎯 Task

Redesign the job to:

1. Minimize Python UDF usage
2. Prefer Spark SQL / built-in expressions
3. Use vectorized UDFs only if necessary
4. Keep data columnar inside Spark
5. Serialize final output in **Avro**

---

## 🧩 Assumptions

- Dataset fits comfortably in Spark but has enough rows to show UDF overhead
- Business logic computes a per-row engagement score
- Microservice understands Avro schema
- Cluster is shared and performance-sensitive

---

## 📦 Deliverables

- Optimized engagement score computation
- Reduced serialization overhead
- Avro output for downstream systems

### Expected Outcome

| Area | Result |
|----|----|
| Python UDF overhead | Reduced |
| Execution speed | Improved |
| Serialization | Standardized (Avro) |
| Interoperability | Simplified |

---

## 🧠 Notes

- **Python UDFs are slow** because every row's arguments and results must be serialized between the JVM and a Python worker process, and the function is called once per row.
- Spark SQL / DataFrame expressions run **inside the JVM** and benefit from:
  - whole-stage code generation
  - vectorized execution
- **Always try built-in functions first** before writing any UDF.
- If custom logic is unavoidable:
  - Prefer **Pandas (vectorized) UDFs** over normal Python UDFs.
  - Pandas UDFs use **Apache Arrow**, which reduces serialization overhead.
- Keep data in **columnar formats (Parquet / Delta)** while processing inside Spark.
- **Serialize only once at system boundaries** (for example, when sending data to a microservice).
- **Avro is ideal for cross-language systems** because:
  - it enforces schema
  - it is compact and fast
  - it supports schema evolution
- A common anti-pattern is:
  - heavy Python UDFs + CSV/JSON everywhere
- A common production pattern is:
  - Spark SQL / Scala logic → columnar storage → Avro at the boundary




## 🧠 Solution Strategy (High-Level)

1. Identify Python UDF bottlenecks
2. Replace row-based UDFs with Spark SQL expressions
3. Use Pandas UDFs only if custom Python logic is unavoidable
4. Keep data in columnar formats inside Spark
5. Serialize once at the boundary using Avro


In [0]:
from pyspark.sql import functions as F


## 🛢️ Input Data


In [0]:
events = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/your_data")
)

display(events.limit(5))


## ❌ Naive Approach: Heavy Python UDF (Slow)

This approach:
- Executes Python code **row by row**
- Requires Python ↔ JVM serialization for every record
- Does not benefit from Spark code generation


In [0]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

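def python_score(segment, event_type, amount):
    # Return null for a missing amount instead of failing on float(None)
    if amount is None:
        return None

    score = 0.0

    # Segment contribution
    if segment == "churn_risk":
        score += 5
    elif segment == "active":
        score += 2

    # Event-type contribution
    if event_type == "purchase":
        score += 10
    elif event_type == "add_to_cart":
        score += 4

    # Amount contribution
    score += float(amount) / 50.0
    return score

# Row-at-a-time Python UDF: Spark calls python_score once per record
score_udf = udf(python_score, DoubleType())

scored_naive = events.withColumn(
    "engagement_score",
    score_udf("segment", "event_type", "amount")
)

display(scored_naive.limit(5))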
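
Before replacing the UDF, it can help to pin down the scoring rule with a few hand-computed cases that any rewrite should reproduce. The sketch below restates the rule in plain Python, with no Spark needed. `reference_score` and `EXPECTED_CASES` are hypothetical names, the sample inputs are made up for illustration rather than taken from the dataset, and treating a missing amount as a missing score is an assumption.

```python
def reference_score(segment, event_type, amount):
    """Plain-Python restatement of the scoring rule, used only as a test oracle."""
    if amount is None:
        return None  # assumption: a missing amount gives a missing score
    segment_points = {"churn_risk": 5.0, "active": 2.0}.get(segment, 0.0)
    event_points = {"purchase": 10.0, "add_to_cart": 4.0}.get(event_type, 0.0)
    return segment_points + event_points + float(amount) / 50.0


# (segment, event_type, amount) -> expected score, computed by hand.
EXPECTED_CASES = [
    (("churn_risk", "purchase", 100), 17.0),   # 5 + 10 + 100/50
    (("active", "add_to_cart", 25), 6.5),      # 2 + 4 + 25/50
    (("new", "view", 0), 0.0),                 # unrecognized labels score 0
    ((None, "purchase", 50), 11.0),            # missing segment scores 0
    (("active", "purchase", None), None),      # missing amount
]

for args, expected in EXPECTED_CASES:
    assert reference_score(*args) == expected, (args, expected)
print("all reference cases pass")
```

The same inputs can later be loaded into a small DataFrame to confirm that each optimized version returns identical scores.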


## ✅ Optimized Approach: Spark SQL Expressions

This version:
- Runs entirely inside the JVM
- Uses whole-stage code generation
- Avoids Python serialization overhead


In [0]:
scored_expr = (
    events
        .withColumn(
            "engagement_score",
            F.when(F.col("segment") == "churn_risk", F.lit(5.0))
             .when(F.col("segment") == "active", F.lit(2.0))
             .otherwise(F.lit(0.0))
            +
            F.when(F.col("event_type") == "purchase", F.lit(10.0))
             .when(F.col("event_type") == "add_to_cart", F.lit(4.0))
             .otherwise(F.lit(0.0))
            +
            (F.col("amount") / F.lit(50.0))
        )
)

display(scored_expr.limit(5))
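
The same rule can also be written as a single SQL string, which some teams find easier to review. The sketch below builds such a string and, purely as an assumption for quick local checking, evaluates it with Python's built-in `sqlite3` as a stand-in for Spark SQL. `SCORE_SQL` and the sample rows are hypothetical, and SQLite's type handling is not identical to Spark's.

```python
import sqlite3

# Scoring rule as one SQL expression. The CAST keeps the division in floating
# point even when amount is an integer column.
SCORE_SQL = """
    CASE WHEN segment = 'churn_risk' THEN 5.0
         WHEN segment = 'active' THEN 2.0
         ELSE 0.0 END
  + CASE WHEN event_type = 'purchase' THEN 10.0
         WHEN event_type = 'add_to_cart' THEN 4.0
         ELSE 0.0 END
  + CAST(amount AS DOUBLE) / 50
"""

# Local sanity check against made-up rows (SQLite stand-in, not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (segment TEXT, event_type TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("churn_risk", "purchase", 100),
        ("active", "add_to_cart", 25),
        (None, "view", None),  # null amount should give a null score
    ],
)
scores = [row[0] for row in conn.execute(f"SELECT {SCORE_SQL} FROM events ORDER BY rowid")]
conn.close()
print(scores)
```

In the notebook, the string would be applied with `events.withColumn("engagement_score", F.expr(SCORE_SQL))`. The result should be compared against `scored_expr` before relying on it.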


## 🟡 Alternative: Pandas (Vectorized) UDF

Use **only if custom Python logic cannot be expressed in SQL**.

Pandas UDFs:
- Process data in batches
- Use Apache Arrow for efficient serialization
- Are usually faster than row-at-a-time Python UDFs


In [0]:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

@pandas_udf(DoubleType())
def pandas_score(segment: pd.Series,
                 event_type: pd.Series,
                 amount: pd.Series) -> pd.Series:

    base = segment.map(
        {"churn_risk": 5.0, "active": 2.0}
    ).fillna(0.0)

    event_bonus = event_type.map(
        {"purchase": 10.0, "add_to_cart": 4.0}
    ).fillna(0.0)

    return base + event_bonus + (amount.astype(float) / 50.0)

scored_pandas = events.withColumn(
    "engagement_score",
    pandas_score("segment", "event_type", "amount")
)

display(scored_pandas.limit(5))


## 🧠 Comparison Summary

| Approach | Performance | Notes |
|------|-----------|------|
| Python UDF | ❌ Slow | Per-row execution |
| Pandas UDF | ⚠️ Medium | Vectorized, Arrow-based |
| Spark SQL | ✅ Fastest | JVM + codegen |


## 📦 Preparing Data for Microservice (Avro)


In [0]:
final_df = (
    scored_expr
        .select(
            "user_id",
            "country",
            "segment",
            "event_type",
            "amount",
            "engagement_score"
        )
)


## 💾 Writing Output as Avro

Avro provides:
- Compact binary format
- Schema enforcement
- Cross-language compatibility


In [0]:
(
    final_df
        .write
        .mode("overwrite")
        .format("avro")
        .save("your_directory")
)


## 🔍 Why Avro Works Well Here

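- Spark writes Avro through the `avro` data source (included in Databricks Runtime; on open-source Spark it ships as the separate `spark-avro` module that must be added to the job)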
- Microservices can read Avro directly
- One schema shared across systems
- No JSON / CSV parsing overhead
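
Sharing one schema usually means writing it down as an Avro record definition that both the Spark job and the microservice agree on. The following is a sketch of what such a contract might look like for `final_df`. The record name, namespace, field types (for example `user_id` as a string), and the added `score_version` field are all assumptions for illustration, so the actual types should be checked against `final_df.printSchema()`.

```python
import copy
import json


def nullable(avro_type):
    """Avro union that allows null; paired with a null default below."""
    return ["null", avro_type]


# Hypothetical v1 contract for the scored events (names and types are assumptions).
scored_event_v1 = {
    "type": "record",
    "name": "ScoredEvent",
    "namespace": "com.example.events",
    "fields": [
        {"name": "user_id", "type": nullable("string"), "default": None},
        {"name": "country", "type": nullable("string"), "default": None},
        {"name": "segment", "type": nullable("string"), "default": None},
        {"name": "event_type", "type": nullable("string"), "default": None},
        {"name": "amount", "type": nullable("double"), "default": None},
        {"name": "engagement_score", "type": nullable("double"), "default": None},
    ],
}

# Schema evolution: a new field with a default lets a reader on v2 still
# decode records that were written with v1.
scored_event_v2 = copy.deepcopy(scored_event_v1)
scored_event_v2["fields"].append(
    {"name": "score_version", "type": "int", "default": 1}  # hypothetical field
)

schema_json_v1 = json.dumps(scored_event_v1, indent=2)
print(schema_json_v1)
```

Spark's Avro writer accepts a schema string through `.option("avroSchema", schema_json_v1)`, provided it matches the DataFrame's columns and types. Without that option, Spark derives the Avro schema from the DataFrame schema.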


## ✅ Summary

- Python UDFs are expensive due to serialization
- Spark SQL expressions are fastest
- Pandas UDFs are a good compromise when needed
- Avro enables clean cross-language interoperability

This design improves **performance**, **scalability**, and **system integration**.
