<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🧬 Serialization & Cross-Language UDFs

This notebook demonstrates how **heavy Python UDF usage** can slow down Spark jobs
and how to redesign the pipeline using:

- Built-in Spark SQL functions
- Vectorized (Pandas) UDFs with Arrow
- Efficient serialization formats (Avro)

The goal is to **improve performance** and **enable clean interoperability**
with a downstream microservice.


## 📂 Dataset

**Dataset Name:** `events_for_udf_50k.csv` 

### Example Columns
- `user_id`
- `country`
- `segment`
- `event_type`
- `amount`

This dataset simulates user activity events used to compute
a **risk / engagement score**.


## 🗂️ Scenario

Your Spark job:
- Processes a medium-sized events dataset
- Uses **Python UDFs heavily**
- Runs noticeably slower than equivalent Scala jobs

Additionally:
- The processed data must be sent to a **downstream microservice**
- The microservice expects **Avro-encoded data**

You want to:
- Reduce Python-to-JVM serialization overhead
- Improve Spark execution performance
- Use a standardized serialization format for interoperability

---

## 🎯 Task

Redesign the job to:

1. Minimize Python UDF usage
2. Prefer Spark SQL / built-in expressions
3. Use vectorized UDFs only if necessary
4. Keep data columnar inside Spark
5. Serialize final output in **Avro**

---

## 🧩 Assumptions

- Dataset fits comfortably in Spark but has enough rows to show UDF overhead
- Business logic computes a per-row engagement score
- Microservice can read Avro records written with an agreed schema
- Cluster is shared and performance-sensitive

---

## 📦 Deliverables

- Optimized engagement score computation
- Reduced serialization overhead
- Avro output for downstream systems

### Expected Outcome

| Area | Result |
|------|--------|
| Python UDF overhead | Reduced |
| Execution speed | Improved |
| Serialization | Standardized (Avro) |
| Interoperability | Simplified |

---

## 🧠 Notes

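- **Python UDFs are slow** because every row is serialized, shipped from the JVM to a Python worker process, evaluated there, and shipped back. Spark's optimizer also cannot see inside the UDF.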
- Spark SQL / DataFrame expressions run **inside the JVM** and benefit from:
  - whole-stage code generation
  - vectorized execution
- **Always try built-in functions first** before writing any UDF.
- If custom logic is unavoidable:
  - Prefer **Pandas (vectorized) UDFs** over normal Python UDFs.
  - Pandas UDFs use **Apache Arrow**, which reduces serialization overhead.
- Keep data in **columnar formats (Parquet / Delta)** while processing inside Spark.
- **Serialize only once at system boundaries** (for example, when sending data to a microservice).
- **Avro is ideal for cross-language systems** because:
  - it enforces schema
  - it is compact and fast
  - it supports schema evolution
- A common anti-pattern is:
  - heavy Python UDFs + CSV/JSON everywhere
- A common production pattern is:
  - Spark SQL / Scala logic → columnar storage → Avro at the boundary




## 🧠 Solution Strategy (High-Level)

1. Identify Python UDF bottlenecks
2. Replace row-based UDFs with Spark SQL expressions
3. Use Pandas UDFs only if custom Python logic is unavoidable
4. Keep data in columnar formats inside Spark
5. Serialize once at the boundary using Avro
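
Step 5 could look roughly like the sketch below. The record name, namespace, field types, the `engagement_score` column, and the `write_for_microservice` helper are all assumptions made for illustration; the actual contract with the microservice is not given here. Every field is declared nullable, which matches how Spark usually maps nullable columns to Avro. Writing with `format("avro")` also assumes the external `spark-avro` module is available to the Spark session.

```python
import json

# Hypothetical Avro contract for one scored event.
# Field types are assumptions; adjust them to the real column types.
SCORED_EVENT_SCHEMA = {
    "type": "record",
    "name": "ScoredEvent",
    "namespace": "com.example.events",
    "fields": [
        {"name": "user_id", "type": ["null", "long"], "default": None},
        {"name": "country", "type": ["null", "string"], "default": None},
        {"name": "segment", "type": ["null", "string"], "default": None},
        {"name": "event_type", "type": ["null", "string"], "default": None},
        {"name": "amount", "type": ["null", "double"], "default": None},
        {"name": "engagement_score", "type": ["null", "double"], "default": None},
    ],
}

# Spark's Avro writer accepts the schema as a JSON string.
SCORED_EVENT_SCHEMA_JSON = json.dumps(SCORED_EVENT_SCHEMA)


def write_for_microservice(df, path):
    """Write a scored Spark DataFrame as Avro files at the system boundary.

    `df` is expected to have columns matching the schema above.
    Requires the spark-avro module on the Spark classpath.
    """
    (
        df.write.format("avro")
        .option("avroSchema", SCORED_EVENT_SCHEMA_JSON)  # pin the output schema
        .mode("overwrite")
        .save(path)
    )
```

Keeping the schema in one place means the Spark job and the microservice can both be checked against the same definition, and intermediate results can stay in Parquet or Delta until this final write.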
