# Spark & Data Engineering Questions

## Part 1: Apache Spark Concepts

### 1. What is Apache Spark, and how does it fit into the big data ecosystem (storage vs compute vs resource management)?

Apache Spark is a **unified analytics engine for large-scale data processing**. It can keep intermediate data in memory between steps, which often makes it much faster than MapReduce, a framework that writes intermediate results to disk between jobs. The gain is largest for iterative and interactive workloads.

In the big data ecosystem:
- **Storage**: Spark does NOT store data. It reads from and writes to external storage systems like HDFS, S3, Azure Blob Storage, or databases.
- **Compute**: Spark IS a compute engine. It processes data in memory across a distributed cluster, performing transformations and analytics.
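- **Resource Management**: In most deployments Spark does NOT manage cluster resources itself. It requests CPU and memory from a cluster manager such as YARN or Kubernetes (Mesos support is deprecated as of Spark 3.2). Spark also ships a simple standalone cluster manager for dedicated Spark clusters.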

### 2. Explain lazy evaluation in Spark. Why does Spark delay execution until an action is called?

**Lazy evaluation** means Spark does not execute transformations immediately. Instead, it builds a logical execution plan (DAG - Directed Acyclic Graph) and waits until an action is called.

Why delay execution:
- **Optimization**: Spark can analyze the entire computation plan and optimize it (e.g., predicate pushdown, combining operations, eliminating unnecessary steps)
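- **Reduced I/O**: Knowing the full plan lets Spark skip data it does not need (for example, column pruning and filter pushdown to the source) and pipeline narrow transformations without materializing intermediate results
- **Fault tolerance**: The same recorded lineage lets Spark recompute only the lost partitions instead of re-running the entire job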
- **Resource efficiency**: Only computes what's actually needed for the final result
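
The idea can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark's implementation: `LazyDataset` is a hypothetical toy class that records transformations as a plan and applies them only when the `collect()` action is called.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source: Iterable[int], plan: Optional[List[tuple]] = None):
        self._source = source
        self._plan = plan or []   # recorded (kind, function) steps, i.e. the lineage

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset with a longer plan; no data is touched.
        return LazyDataset(self._source, self._plan + [("map", fn)])

    def filter(self, fn: Callable) -> "LazyDataset":
        # Transformation: also only extends the plan.
        return LazyDataset(self._source, self._plan + [("filter", fn)])

    def collect(self) -> list:
        # Action: only now is the recorded plan applied to the data.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):      # kind == "filter" and the predicate failed
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []  # records every time the mapped function actually runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has executed yet
result = ds.collect()              # the action triggers execution
```

Because the whole plan is known before anything runs, a real engine has the chance to reorder or fuse steps. This toy version only demonstrates the deferral itself.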

### 3. What is the difference between transformations and actions? What happens internally when an action is triggered?

**Transformations**: Operations that create a new RDD/DataFrame from an existing one. They are lazy and only define the computation logic.
- Examples: `map()`, `filter()`, `select()`, `groupBy()`, `join()`

**Actions**: Operations that trigger actual computation and return results to the driver or write to external storage.
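- Examples: `collect()`, `count()`, `show()`, `reduce()`, and writes such as `df.write.save(...)` (`write` itself only returns a `DataFrameWriter`; the save call triggers the job)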

When an action is triggered:
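1. Spark submits a job for the action; for DataFrames and SQL, the Catalyst optimizer first turns the logical plan into an optimized physical plan
2. The DAG Scheduler analyzes the resulting lineage of transformations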
3. The plan is divided into stages based on shuffle boundaries
4. Stages are broken into tasks (one per partition)
5. Tasks are distributed to executors for parallel execution
6. Results are collected or written as specified

### 4. What is an RDD? Why are immutability and partitioning important in Spark's design?

**RDD (Resilient Distributed Dataset)** is Spark's fundamental data abstraction - a fault-tolerant collection of elements that can be operated on in parallel.

**Immutability importance**:
- Enables safe parallel processing without locks or synchronization
- Supports fault tolerance through lineage - if a partition is lost, it can be recomputed from the original transformation chain
- Allows Spark to cache intermediate results safely
- Simplifies reasoning about distributed computations

**Partitioning importance**:
- Enables parallel processing - each partition can be processed independently on different nodes
- Controls data distribution across the cluster
- Affects performance - proper partitioning minimizes data shuffling
- Determines the level of parallelism in computations
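
As a rough illustration of both points, the sketch below hash-partitions key/value records in plain Python and processes each partition independently. The function name `hash_partition` and the use of `zlib.crc32` are assumptions made for this example; Spark's own partitioner works differently in detail.

```python
import zlib
from typing import Dict, List, Tuple

Record = Tuple[str, int]


def hash_partition(records: List[Record], num_partitions: int) -> Dict[int, List[Record]]:
    """Assign each (key, value) record to a partition by a stable hash of its key."""
    partitions: Dict[int, List[Record]] = {i: [] for i in range(num_partitions)}
    for key, value in records:
        # crc32 is used instead of hash() because Python salts str hashes per process.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = hash_partition(records, num_partitions=3)

# Each partition can be processed on its own; the input list is never mutated,
# so a lost partition could be rebuilt by re-running hash_partition on the input.
partial_sums = {i: sum(v for _, v in part) for i, part in partitions.items()}
```

All records with the same key land in the same partition, which is what lets a later per-key aggregation run without further data movement.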

### 5. Explain the execution hierarchy in Spark: Application → Job → Stage → Task. What determines stage boundaries?

**Execution Hierarchy**:
- **Application**: A complete Spark program with its own driver and executors
- **Job**: Triggered by each action; represents all work needed to compute that action's result
- **Stage**: A set of tasks that can run in parallel without shuffling data; jobs are divided into stages
- **Task**: The smallest unit of work; one task processes one partition within a stage

**Stage boundaries are determined by**:
- **Shuffle operations** (wide transformations) like `groupBy()`, `reduceByKey()`, `join()`, `repartition()`
- When data needs to be redistributed across partitions, Spark must complete all tasks in one stage before starting the next
- Each shuffle creates a new stage boundary because all map-side tasks must finish before reduce-side tasks can begin
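
The rule "cut a new stage at every shuffle" can be sketched in a few lines. This is a simplified model that assumes a linear chain of operations and a hand-written `WIDE_OPS` set; the real DAG Scheduler works on a graph in which operations such as joins have several parents.

```python
from typing import List

# Assumption for this sketch: a hand-picked set of operations that need a shuffle.
WIDE_OPS = {"groupBy", "reduceByKey", "join", "repartition"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS:
            # The map side of the shuffle ends the current stage...
            stages[-1].append(f"{op} (shuffle write)")
            # ...and the reduce side starts the next one.
            stages.append([f"{op} (shuffle read)"])
        else:
            stages[-1].append(op)   # narrow ops are pipelined into the current stage
    return stages


stages = split_into_stages(["map", "filter", "reduceByKey", "map", "join", "filter"])
for number, stage in enumerate(stages):
    print(f"Stage {number}: {stage}")
```

With two wide operations in the chain, the sketch produces three stages, matching the rule that each shuffle adds one boundary.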

### 6. What is the difference between narrow and wide transformations? Why are wide transformations usually more expensive?

**Narrow Transformations**:
- Each input partition contributes to at most one output partition
- No data shuffling required
- Examples: `map()`, `filter()`, `flatMap()`, `union()`
- Can be pipelined together in a single stage

**Wide Transformations**:
- Input partitions contribute to multiple output partitions
- Require data shuffling across the network
- Examples: `groupByKey()`, `reduceByKey()`, `join()`, `repartition()`
- Create new stage boundaries

**Why wide transformations are more expensive**:
- **Network I/O**: Data must be transferred between executors across the cluster
- **Disk I/O**: Map-side shuffle output is written to local disk as intermediate files before the next stage fetches it
- **Synchronization**: All tasks in the previous stage must complete before the next stage can begin
- **Serialization overhead**: Data must be serialized for transfer and deserialized on receipt
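
To make the difference concrete, here is a small plain-Python simulation, offered as a sketch rather than a model of Spark internals. Partitions are lists, `narrow_map` works partition by partition, and `wide_reduce_by_key` routes records by key with a toy partitioner (`ord(k[0]) % num_out`, an assumption for this example) while counting how many records would have to move.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Partition = List[Tuple[str, int]]


def narrow_map(partitions: List[Partition]) -> List[Partition]:
    """Narrow: each output partition is built from exactly one input partition."""
    return [[(k, v * 10) for k, v in part] for part in partitions]


def wide_reduce_by_key(partitions: List[Partition], num_out: int) -> Tuple[List[Dict[str, int]], int]:
    """Wide: records are routed by key, so one input partition can feed many outputs."""
    out = [defaultdict(int) for _ in range(num_out)]
    moved = 0
    for src_index, part in enumerate(partitions):
        for k, v in part:
            dst_index = ord(k[0]) % num_out      # toy partitioner on the key
            if dst_index != src_index:
                moved += 1                       # this record would cross the network
            out[dst_index][k] += v
    return [dict(d) for d in out], moved


parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
mapped = narrow_map(parts)                            # no record leaves its partition
reduced, moved = wide_reduce_by_key(parts, num_out=2)
```

The simulation ignores map-side combining, which `reduceByKey()` uses to shrink the data before it is shuffled. That optimization is one reason `reduceByKey()` is usually preferred over `groupByKey()`.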

### 7. Explain the roles of Driver and Executor. What kind of work happens on each side?

**Driver**:
- The main control process that runs the user's main() function
- Responsibilities:
  - Creates SparkContext/SparkSession
  - Converts user code into a DAG of stages and tasks
  - Schedules tasks on executors
  - Coordinates job execution
  - Collects results from executors
  - Maintains metadata about the application
- Runs on: A single process, either on the machine that submitted the application (client mode) or on a node inside the cluster (cluster mode)

**Executor**:
- Worker processes that run on cluster nodes
- Responsibilities:
  - Execute tasks assigned by the driver
  - Store data in memory or disk for caching
  - Return computed results to the driver
  - Report task status and metrics back to driver
- Runs on: Multiple worker nodes (one or more executors per node)
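
The division of labor can be mimicked on one machine. In this sketch, which is an analogy and not Spark code, a `driver` function splits the data into partitions and schedules one task per partition, while a thread pool stands in for the executors. The names `driver` and `run_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_task(partition: List[int]) -> int:
    """Executor side: process one partition and return a small partial result."""
    return sum(x * x for x in partition)


def driver(data: List[int], num_partitions: int) -> int:
    """Driver side: split work into tasks, schedule them, combine partial results."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Threads stand in for executors here; real executors are separate JVM processes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(run_task, partitions))
    return sum(partials)


total = driver(list(range(10)), num_partitions=3)
```

Only the small partial sums travel back to the driver. The same reasoning explains why calling `collect()` on a large dataset is risky: it pulls every row into the driver's memory.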

### 8. What is the Medallion Architecture (Bronze, Silver, Gold)? What type of data and logic belongs in each layer?

**Medallion Architecture** is a data design pattern that organizes data into three layers based on quality and refinement level.

**Bronze Layer (Raw)**:
- Raw, unprocessed data exactly as ingested from sources
- Maintains original format and schema
- Logic: Minimal - just ingestion, timestamp addition, source tracking
- Purpose: Data lineage, reprocessing capability, audit trail

**Silver Layer (Cleaned/Conformed)**:
- Cleaned, validated, and standardized data
- Deduplication applied, data types enforced
- Logic: Data quality checks, schema enforcement, joins with reference data, business rules validation
- Purpose: Create a reliable, consistent dataset for analysis

**Gold Layer (Business-Level Aggregates)**:
- Aggregated, business-ready data optimized for consumption
- Dimensional models, KPIs, metrics
- Logic: Business aggregations, summary calculations, reporting transformations
- Purpose: Power dashboards, reports, ML features, and business analytics
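
A minimal sketch of the three layers in plain Python, using hypothetical sample orders. In practice each layer would typically be a table in a lakehouse, and invalid rows are often quarantined for inspection rather than silently dropped as they are here.

```python
from collections import defaultdict

# Bronze: raw records as ingested (hypothetical sample data; strings, duplicates, bad values).
bronze = [
    {"order_id": "1", "amount": "10.50", "country": "us"},
    {"order_id": "1", "amount": "10.50", "country": "us"},   # duplicate
    {"order_id": "2", "amount": "abc", "country": "DE"},     # invalid amount
    {"order_id": "3", "amount": "4.50", "country": "US"},
]


def to_silver(rows):
    """Silver: enforce types, standardize values, drop invalid rows and duplicates."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue                      # fails the data quality check
        if row["order_id"] in seen:
            continue                      # deduplicate on the business key
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return clean


def to_gold(rows):
    """Gold: business-level aggregate, here revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping `bronze` untouched means the silver and gold layers can be rebuilt at any time if a cleaning rule changes.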

### 9. Explain SCD Type 0, Type 1, and Type 2 in simple terms. When would you choose each one?

**SCD (Slowly Changing Dimension)** strategies handle how to track changes in dimension data over time.

**Type 0 - Retain Original**:
- Never update the record; keep the original value forever
- Use when: Original value must be preserved (e.g., original signup date, birth date, original credit score at loan application)

**Type 1 - Overwrite**:
- Simply overwrite the old value with the new value
- No history is kept
- Use when: Historical values don't matter, only current state is needed (e.g., fixing typos, updating phone numbers, current address for shipping)

**Type 2 - Add New Row**:
- Insert a new row for each change while keeping old rows
- Includes effective dates (start_date, end_date) and/or a current flag
- Full history is preserved
- Use when: Historical tracking is critical for analysis (e.g., price changes, customer status changes, employee role changes for auditing)
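
The three strategies can be compared side by side on a single change, a customer moving from one city to another. This is a sketch on plain Python dictionaries with hypothetical column names (`city`, `start_date`, `end_date`, `is_current`). A warehouse would usually express the same logic as a `MERGE` statement.

```python
from datetime import date
from typing import Dict, List


def apply_type0(row: Dict, new_city: str) -> Dict:
    """Type 0: ignore the change and keep the original value."""
    return dict(row)


def apply_type1(row: Dict, new_city: str) -> Dict:
    """Type 1: overwrite the value; the previous value is lost."""
    return {**row, "city": new_city}


def apply_type2(history: List[Dict], customer_id: int, new_city: str, change_date: date) -> List[Dict]:
    """Type 2: close the current row and append a new current row."""
    updated = []
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            row = {**row, "end_date": change_date, "is_current": False}
        updated.append(row)
    updated.append({
        "customer_id": customer_id, "city": new_city,
        "start_date": change_date, "end_date": None, "is_current": True,
    })
    return updated


current = {"customer_id": 1, "city": "Austin"}
type0_result = apply_type0(current, "Denver")
type1_result = apply_type1(current, "Denver")

history = [{"customer_id": 1, "city": "Austin", "start_date": date(2023, 1, 1),
            "end_date": None, "is_current": True}]
type2_result = apply_type2(history, 1, "Denver", date(2024, 6, 1))
```

After the Type 2 update the table holds two rows for the customer, so a query can ask either "where do they live now?" (`is_current`) or "where did they live on a given date?" (the date range).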



## Part 2: SQL Problem - 7-Day Moving Average

### Problem Statement

```text
+---------------+---------+ 
| Column Name   | Type    | 
+---------------+---------+ 
| customer_id   | int     | 
| name          | varchar | 
| visited_on    | date    | 
| amount        | int     | 
+---------------+---------+ 
```

- `(customer_id, visited_on)` is the primary key for this table.
- This table contains data about customer transactions in a restaurant.
- `visited_on` is the date on which the customer with ID `customer_id` visited the restaurant.
- `amount` is the total paid by a customer.

You are the restaurant owner and you want to analyze a possible expansion (there will be **at least one customer every day**).

Compute the **moving average** of how much customers paid in a seven-day window (i.e., **current day + 6 days before**). `average_amount` should be rounded to two decimal places.

Return the result table ordered by visited_on in ascending order.

### Solution

```sql
-- PostgreSQL
-- Total customer payments per day
WITH daily_amount AS (
    SELECT
        visited_on,
        SUM(amount) AS total_amount
    FROM Customer
    GROUP BY visited_on
)

SELECT
    visited_on,
    SUM(total_amount) OVER (
        ORDER BY visited_on
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- creates a 7-day rolling window
    ) AS amount,
    ROUND(
        AVG(total_amount) OVER (
            ORDER BY visited_on
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ),
        2
    ) AS average_amount
FROM daily_amount
ORDER BY visited_on
OFFSET 6;  -- skip the first 6 days, which do not have a full 7-day window
-- MySQL requires LIMIT together with OFFSET, e.g.: LIMIT 18446744073709551615 OFFSET 6
-- Alternative: wrap the query above in a subquery and filter in the OUTER query:
--   WHERE visited_on >= (SELECT MIN(visited_on) FROM Customer) + 6
-- (a WHERE in the same SELECT would drop rows before the window functions run)
```

**Explanation**:
1. First, aggregate daily totals since multiple customers can visit on the same day
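2. Use window functions with `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` for the 7-day window; this 7-row frame equals 7 calendar days only because the problem guarantees at least one customer every day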
3. Filter out the first 6 days since they don't have a complete 7-day history
4. Round the average to 2 decimal places
5. Order by visited_on ascending
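
To sanity-check the query on a small dataset, the same logic can be reproduced in plain Python. The sketch below is a reference implementation, not a replacement for the SQL; the function name `moving_average` and the sample rows are hypothetical, and it relies on the same guarantee of one or more visits on every day.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import List, Tuple


def moving_average(rows: List[Tuple[date, int]], window: int = 7) -> List[Tuple[date, int, float]]:
    """Return (visited_on, window_sum, window_avg) for every day with a full window."""
    daily = defaultdict(int)
    for visited_on, amount in rows:
        daily[visited_on] += amount            # same role as the daily_amount CTE

    days = sorted(daily)
    result = []
    for i in range(window - 1, len(days)):     # skip the first window-1 days (OFFSET 6)
        total = sum(daily[d] for d in days[i - window + 1 : i + 1])
        result.append((days[i], total, round(total / window, 2)))
    return result


# Hypothetical sample: 8 consecutive days, two visits on the first day.
start = date(2019, 1, 1)
rows = [(start, 100), (start, 50)] + [(start + timedelta(days=n), 10 * n) for n in range(1, 8)]
output = moving_average(rows)
```

One caveat when comparing results: Python's `round()` on floats and PostgreSQL's `ROUND()` on `numeric` can disagree on values that sit exactly on a rounding boundary, so a mismatch in the last digit does not necessarily mean the query is wrong.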