<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🛡️ Fault Tolerance & Recomputation in Spark

This notebook explains how Apache Spark can **recover from node failures**
during long-running jobs **without restarting from scratch**.

We focus on Spark’s **lineage-based fault tolerance model** and how it enables
automatic recomputation of lost data.


## 📂 Dataset

**Dataset Name:** `sales_orders_large.csv`

**Example Columns:**
- `order_id`
- `order_date`
- `region`
- `customer_id`
- `category`
- `quantity`
- `amount`

> ⚠️ The exact size of the dataset is not critical for this scenario.  
The goal is to build a **multi-step transformation pipeline** and understand
how Spark can recompute parts of it if something goes wrong.

The dataset is assumed to be available in **your catalog / database storage**.


## 🗂️ Scenario

You are running a **long Spark job** with multiple transformation steps.

Midway through execution:
- One executor (node) in the Spark cluster **fails**
- Some partitions of intermediate data are **lost**

Despite this failure:
- The job **does not restart from the beginning**
- Spark **re-executes only the lost work**
- The job **still completes successfully**

Your task is to explain:
- **What Spark is doing internally**
- **Why the job can recover**
- **How lineage enables recomputation**

---

## 🎯 Task

Perform the following steps:

1. Build a **multi-step transformation pipeline** using Spark.
2. Observe how Spark tracks transformations as a **lineage DAG**.
3. Explain what happens when an executor fails and loses partitions.
4. Understand how Spark **recomputes only the missing partitions**.
5. Learn when **caching or checkpointing** helps shorten recomputation paths.

---

## 🧩 Assumptions

- Spark is running on a distributed cluster.
- Executors may fail due to:
  - hardware issues
  - network problems
  - resource pressure
- Spark uses **lazy evaluation**.
- Either Databricks Serverless compute or a classic cluster may be used.

---

## 📦 Deliverables

- A clear explanation of Spark’s **fault tolerance mechanism**
- A working example showing a **multi-step Spark pipeline**
- Evidence of how Spark tracks transformations using **lineage**

---

## 🧠 Notes

- Spark is **fault-tolerant by design**.
- Spark does not store all intermediate data eagerly.
- Instead, Spark stores **how to recompute the data**.
- This design makes Spark resilient to executor failures.





## 🧠 Solution Strategy (High-Level)

1. Spark represents all computations as a **Directed Acyclic Graph (DAG)**.
2. Each RDD or DataFrame records its **lineage**:
   - source data
   - transformations applied
3. When an executor fails, Spark identifies **which partitions were lost**.
4. Spark **re-runs only the tasks needed** to rebuild those partitions.
5. Other completed partitions remain untouched.
6. Optional caching or checkpointing can **shorten the recomputation path**.

Spark handles:
- Task re-execution
- Dependency tracking
- Partition-level recovery
- Automatic retry logic
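
The recovery steps above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: `ToyDataset`, `transform`, `run`, and the sample rows are hypothetical names and data invented for the illustration. Each dataset keeps its source partitions and the list of functions applied to them, so a lost partition can be rebuilt by replaying that list on the matching source partition alone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Partition = List[dict]


@dataclass
class ToyDataset:
    """Toy stand-in for an RDD: source partitions plus a recorded chain of steps."""

    source: Dict[int, Partition]  # partition id -> input rows (durable source)
    lineage: List[Callable[[Partition], Partition]] = field(default_factory=list)
    materialized: Dict[int, Partition] = field(default_factory=dict)
    recompute_log: List[int] = field(default_factory=list)

    def transform(self, fn: Callable[[Partition], Partition]) -> "ToyDataset":
        # Lazy: record the step in the lineage instead of running it
        return ToyDataset(self.source, self.lineage + [fn])

    def compute_partition(self, pid: int) -> Partition:
        # Replay the full lineage for one partition only
        rows = self.source[pid]
        for fn in self.lineage:
            rows = fn(rows)
        self.recompute_log.append(pid)
        return rows

    def run(self) -> Dict[int, Partition]:
        # Compute only the partitions that are currently missing
        for pid in self.source:
            if pid not in self.materialized:
                self.materialized[pid] = self.compute_partition(pid)
        return self.materialized


# Made-up input: three partitions of order rows
source = {
    0: [{"amount": 100.0}, {"amount": -5.0}],
    1: [{"amount": 40.0}],
    2: [{"amount": 0.0}, {"amount": 10.0}],
}

pipeline = (
    ToyDataset(source)
    .transform(lambda rows: [r for r in rows if r["amount"] > 0])
    .transform(lambda rows: [{**r, "amount_with_tax": r["amount"] * 1.18} for r in rows])
)

first = {pid: list(rows) for pid, rows in pipeline.run().items()}

# Simulate an executor failure that loses partition 1
del pipeline.materialized[1]
pipeline.recompute_log.clear()

recovered = pipeline.run()
print(pipeline.recompute_log)  # prints [1]
```

Only partition 1 is replayed; partitions 0 and 2 are left as they were, and the recovered result matches the first run. Real Spark adds stages, shuffles, and task scheduling on top of this basic idea.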


In [0]:
from pyspark.sql import functions as F


In [0]:
# Read the dataset (replace "your_data" with the path to sales_orders_large.csv)
sales_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("your_data")
)


In [0]:
sales_df.printSchema()


## 🛢️ Input Data


In [0]:
display(sales_df.limit(5))

## 🔄 Building a Multi-Step Transformation Pipeline

We create a chain of transformations to simulate a long-running job.


In [0]:
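# Step 1: keep only orders with a positive amount
step1 = sales_df.filter(F.col("amount") > 0)

# Step 2: add the amount including 18% tax
step2 = step1.withColumn(
    "amount_with_tax",
    F.col("amount") * 1.18
)

# Step 3: extract the order year from the order date
step3 = step2.withColumn(
    "year",
    F.year("order_date")
)

# Step 4: total sales per year and region (this aggregation requires a shuffle)
step4 = (
    step3
        .groupBy("year", "region")
        .agg(F.sum("amount_with_tax").alias("total_sales"))
)

# Nothing has executed yet: these are lazy transformations with no action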


## 🔍 Viewing the Lineage (Logical Plan)

Spark tracks the **entire transformation chain**, not just the final result.


In [0]:
step4.explain(True)


### What You’ll See

- The parsed, analyzed, and optimized logical plans, followed by the physical plan, showing:
  - CSV read
  - filters
  - projections
  - aggregation
- This plan represents the **lineage DAG**
- Spark uses this DAG for both **execution and recovery**
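
If the extended output is hard to read, a more compact view of the same plan may help. The sketch below assumes Spark 3.0 or later, where `DataFrame.explain` accepts a `mode` argument; `show_plan` is a hypothetical helper name introduced here only for illustration.

```python
# Plan modes accepted by DataFrame.explain(mode=...) in Spark 3.0+
PLAN_MODES = {"simple", "extended", "codegen", "cost", "formatted"}


def show_plan(df, mode="formatted"):
    """Print the query plan of a DataFrame in the requested mode.

    "formatted" prints a short operator outline followed by per-operator
    details, which is often easier to scan than the extended output.
    """
    if mode not in PLAN_MODES:
        raise ValueError(f"Unknown explain mode: {mode!r}")
    df.explain(mode=mode)


# Hypothetical usage in this notebook:
# show_plan(step4)            # compact, formatted physical plan
# show_plan(step4, "simple")  # physical plan only
```

In the physical plan, the `Exchange` operator marks the shuffle introduced by `groupBy`, which is where Spark splits the job into stages.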


## 💥 What Happens When a Node Fails?

If an executor fails:
- In-flight tasks on that executor fail, and any partitions it held are **lost**
- The driver consults the lineage DAG to find which tasks produced them
- Spark re-executes **only the tasks needed** to rebuild those partitions
- Partitions that completed on healthy executors are **not recomputed**

This is possible because:
- RDDs and DataFrames are **immutable**
- Transformations are assumed to be **deterministic**, so replaying them on the same input rebuilds the same partition
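
To see why determinism matters for recomputation, consider this plain-Python sketch. It is an illustration only, not Spark code: the function names and the sample amounts are made up. Each function is called twice on the same partition, standing in for the first run and a replay after a failure.

```python
import random


def with_tax(rows):
    # Pure function of the input rows: a replay gives identical values
    return [x * 1.18 for x in rows]


def with_seeded_noise(rows, partition_id, seed=42):
    # Random state derived only from fixed inputs: a replay gives identical values
    rng = random.Random(seed * 1_000 + partition_id)
    return [x + rng.random() for x in rows]


def with_unseeded_noise(rows):
    # Fresh random state on every call: a replay gives different values
    rng = random.Random()
    return [x + rng.random() for x in rows]


partition = [10.0, 20.0, 30.0]  # made-up amounts for one partition

# Compare the "first run" with the "recomputation after a failure"
tax_matches = with_tax(partition) == with_tax(partition)
seeded_matches = with_seeded_noise(partition, 0) == with_seeded_noise(partition, 0)
unseeded_matches = with_unseeded_noise(partition) == with_unseeded_noise(partition)

print(tax_matches, seeded_matches, unseeded_matches)
```

The first two comparisons are `True`, while the unseeded one is almost certainly `False`. In a Spark pipeline, similar care applies to steps that depend on unseeded random values or the current time, since a recomputed partition may then differ from the one that was lost.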


## ⏱️ Checkpointing (Conceptual)

In classic Spark clusters, **checkpointing** is used to:
- Persist intermediate results to reliable storage
- Truncate very long lineage chains
- Reduce recomputation cost after failures

On **Databricks Serverless compute**:
- Direct access to `sparkContext` is not available
- Checkpointing behavior is **platform-managed**
- Users cannot manually configure checkpoint directories

Even though we do not execute checkpointing here,
the concept is important for understanding how Spark
limits recomputation depth in long-running pipelines.
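
For reference, on a classic cluster where `spark.sparkContext` is accessible, checkpointing an intermediate step might look like the sketch below. It is not executed in this notebook and will not work on Serverless compute. `checkpoint_intermediate` and the directory path are hypothetical names chosen for illustration; `setCheckpointDir` and `DataFrame.checkpoint` are the standard PySpark calls.

```python
def checkpoint_intermediate(spark, df, checkpoint_dir):
    """Write df to reliable storage and return a DataFrame with truncated lineage.

    Assumes a classic cluster: spark.sparkContext must be accessible.
    """
    # Checkpoint files go under this directory; it should live on reliable
    # storage (for example HDFS or cloud object storage), not local disk
    spark.sparkContext.setCheckpointDir(checkpoint_dir)

    # eager=True materializes the checkpoint immediately instead of
    # waiting for the next action
    return df.checkpoint(eager=True)


# Hypothetical usage on a classic cluster (path is a placeholder):
# step3_ckpt = checkpoint_intermediate(spark, step3, "/tmp/spark-checkpoints")
# step4 = (
#     step3_ckpt
#         .groupBy("year", "region")
#         .agg(F.sum("amount_with_tax").alias("total_sales"))
# )
```

After the checkpoint, a failure during the aggregation would be recovered by re-reading the checkpointed data rather than replaying the CSV read, filter, and column derivations.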


## ✅ Summary

- Spark achieves fault tolerance using **lineage-based recomputation**
- An executor failure does not by itself fail the job; Spark reschedules the affected tasks on healthy executors
- Only **lost partitions** are recomputed
- Lineage + immutability make this safe and deterministic
- Caching and checkpointing help optimize recovery for long pipelines

This notebook demonstrates one of the **core design principles**
that makes Spark suitable for large-scale, distributed data processing.
