# Table of Contents 

- [Spark RDD (Resilient Distributed Dataset)](#spark-rdd-resilient-distributed-dataset)
- [How to Use Spark SQL](#how-to-use-spark-sql)
- [Catalyst Optimizer in Spark](#catalyst-optimizer-in-spark)

# Spark RDD (Resilient Distributed Dataset)

## 1. What is an RDD?
- **RDD** = Resilient Distributed Dataset.  
- The **core abstraction** in Spark (low-level API).  
- Represents a **distributed collection of objects**:
  - Partitioned across cluster nodes.
  - Fault-tolerant → if a partition is lost, Spark can **recompute** it from lineage.
- Think of it as a **big list/array split across machines**.
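
The lineage idea can be sketched in plain Python. This is a toy model, not Spark's implementation: `ToyRDD`, `recompute`, and the sample data are hypothetical names invented for illustration, and the `map` here runs eagerly, whereas real Spark transformations are lazy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the recipe (lineage) to rebuild them."""
    partitions: List[Optional[list]]
    parent: Optional["ToyRDD"] = None            # dataset this one was derived from
    fn: Optional[Callable[[list], list]] = None  # how one partition derives from the parent's

    def map(self, f: Callable) -> "ToyRDD":
        derive = lambda part: [f(x) for x in part]
        # Eager for simplicity; real Spark would only record this step.
        return ToyRDD([derive(p) for p in self.partitions], parent=self, fn=derive)

    def recompute(self, i: int) -> None:
        """Rebuild partition i from the parent's partition i using the recorded function."""
        if self.parent is None or self.fn is None:
            raise ValueError("source data has no lineage; it would be re-read from storage")
        self.partitions[i] = self.fn(self.parent.partitions[i])


base = ToyRDD([[1, 2], [3, 4], [5, 6]])  # three "partitions"
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None             # simulate losing one partition
doubled.recompute(1)                     # only that partition is rebuilt
print(doubled.partitions)
```

Only the lost partition is rebuilt, which is the point of tracking lineage instead of replicating every intermediate result.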

---

## 2. What does it do?
- Lets you perform **parallel operations** on data:
  - Transformations: `map()`, `filter()`, `flatMap()`, `reduceByKey()`.
  - Actions: `collect()`, `count()`, `saveAsTextFile()`.
- Works with **raw objects** (no schema, no columns).  
- You can build custom logic with Python/Scala/Java functions.
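
To make the transformation/action split concrete, here is a minimal single-machine sketch. `LazyCollection` is a hypothetical class written for this note, and its snake_case method names only mirror Spark's `flatMap()` and `reduceByKey()`; nothing here is distributed. Transformations only record a step, and an action (`collect()` or `count()`) runs the recorded steps.

```python
class LazyCollection:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def _then(self, step):
        return LazyCollection(self._data, self._steps + (step,))

    # --- transformations: return a new LazyCollection, do no work yet ---
    def map(self, f):
        return self._then(lambda items: (f(x) for x in items))

    def filter(self, pred):
        return self._then(lambda items: (x for x in items if pred(x)))

    def flat_map(self, f):
        return self._then(lambda items: (y for x in items for y in f(x)))

    def reduce_by_key(self, combine):
        def step(pairs):
            acc = {}
            for key, value in pairs:
                acc[key] = combine(acc[key], value) if key in acc else value
            return iter(acc.items())
        return self._then(step)

    # --- actions: execute the recorded steps ---
    def collect(self):
        items = iter(self._data)
        for step in self._steps:
            items = step(items)
        return list(items)

    def count(self):
        return len(self.collect())


lines = LazyCollection(["a b a", "b c"])
counts = (
    lines.flat_map(str.split)                 # "a b a" -> "a", "b", "a"
         .map(lambda word: (word, 1))         # word -> (word, 1)
         .reduce_by_key(lambda a, b: a + b)   # sum the 1s per word
)
word_counts = dict(counts.collect())          # the action triggers the work
print(word_counts)
```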

---

## 3. Why it’s not used much today
- **No schema awareness**: Spark doesn’t know about columns/types.
- **Hard to optimize**: RDD code bypasses the Catalyst optimizer (Spark's query planner), and Spark cannot see inside arbitrary user functions to reorder or prune work.
- **Developer overhead**: You must manually parse, structure, and manage data types.
- **DataFrames/SQL are better**:
  - Provide schema (rows + columns).
  - Optimized by Catalyst engine (faster execution).
  - Easier to write and maintain with SQL-like operations.
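
The developer-overhead point can be illustrated without Spark. The sketch below uses hypothetical `Country,Age` records and a made-up `SurveyRow` type: the first half handles raw objects by position, the second half attaches names and types once, which is roughly what a DataFrame schema provides.

```python
from typing import NamedTuple

raw_lines = ["US,34", "DE,51", "US,29"]  # hypothetical "Country,Age" records


# Raw-object style: you parse and remember positions yourself.
def parse(line: str) -> tuple:
    country, age = line.split(",")
    return (country, int(age))  # forget int() and the comparison below breaks


records = [parse(line) for line in raw_lines]
under_40_raw = [r for r in records if r[1] < 40]  # r[1] is "age" only by convention


# Schema style: names and types are declared once and travel with the data.
class SurveyRow(NamedTuple):
    Country: str
    Age: int


rows = [SurveyRow(country, age) for country, age in records]
under_40_named = [r for r in rows if r.Age < 40]
print(under_40_named)
```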

---

## ✅ Summary
- RDD = Spark’s original, low-level API for distributed collections.  
- Still used internally by Spark, but **rarely written by developers** today.  
- Modern Spark code → **DataFrames/SQL API** (faster, easier, optimized).

# How to Use Spark SQL

Below is a simple pattern you can follow. It has five steps:
1) **Create a SparkSession** → 2) **Load data into a DataFrame** (CSV/Parquet/etc.) →  
3) **Register a temp view** → 4) **Write SQL with `spark.sql(...)`** → 5) **Action** (e.g., `show()`/`write`).

---

## 📌 One-file Example (read → temp view → SQL)

```python
from pyspark.sql import SparkSession

# --- 1) Create SparkSession ---
spark = (
    SparkSession.builder
        .appName("SparkSQLDemo")
        .config("spark.ui.port", "4050")  # optional: keep UI off default 4040
        .getOrCreate()
)

print("Spark UI:", spark.sparkContext.uiWebUrl)

# --- 2) Load data into a DataFrame ---
# Option A: CSV (use inferSchema if columns are typed; header if first line has names)
df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/path/to/your/survey.csv")
)

# # Option B: Parquet (schema is embedded; faster + columnar)
# df = spark.read.parquet("/path/to/your/survey.parquet")

# --- 3) Create or replace a temp view (session-scoped) ---
df.createOrReplaceTempView("survey_tbl")
# Tip: a global temp view is shared across SparkSessions within the same Spark application:
# df.createOrReplaceGlobalTempView("survey_tbl")  # then query as SELECT * FROM global_temp.survey_tbl

# --- 4) Write SQL using spark.sql(...) ---
# Use COUNT(*) or COUNT(1) to count rows; write predicates with explicit operators (e.g., Age < 40).
# Example: count rows per Country where Age < 40
sql_query = """
SELECT Country, COUNT(*) AS cnt
FROM survey_tbl
WHERE Age < 40
GROUP BY Country
ORDER BY cnt DESC
"""

result_df = spark.sql(sql_query)

# --- 5) Trigger an action to execute the plan ---
result_df.show(truncate=False)

# (Optional) Save results
# result_df.write.mode("overwrite").parquet("/tmp/survey_by_country_lt40")

spark.stop()
```
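
To see what the query itself computes without starting Spark, the same SQL can be run against a few made-up rows. SQLite is swapped in here purely as a local stand-in (its dialect differs from Spark SQL in places), and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey_tbl (Country TEXT, Age INTEGER)")

# Hypothetical sample data, not taken from any real survey file.
sample_rows = [
    ("US", 34), ("US", 29), ("US", 45),
    ("DE", 51), ("DE", 22),
    ("IN", 31), ("IN", 38), ("IN", 25),
]
conn.executemany("INSERT INTO survey_tbl VALUES (?, ?)", sample_rows)

sql_query = """
SELECT Country, COUNT(*) AS cnt
FROM survey_tbl
WHERE Age < 40
GROUP BY Country
ORDER BY cnt DESC
"""

result = conn.execute(sql_query).fetchall()
conn.close()
print(result)  # [('IN', 3), ('US', 2), ('DE', 1)]
```

Rows with `Age >= 40` are dropped before grouping, so `US` counts 2 rather than 3.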

## 🧭 Notes & Fixes (common gotchas)
- **COUNT syntax**: prefer `COUNT(*)` or `COUNT(1)` rather than `count()` with empty parentheses.
- **Temp view scope**:
  - `createOrReplaceTempView("name")` → visible only in this SparkSession.
  - `createOrReplaceGlobalTempView("name")` → query as `global_temp.name` (shared across sessions in the same Spark application).
- **Schema**:
  - CSV often needs `.option("inferSchema", "true")`, which costs an extra pass over the data.
  - Parquet/ORC already carry schema (faster + better for analytics).
- **SQL vs DataFrame**:
  - Most things you can do in SQL have an equivalent in the DataFrame API and vice versa (use `result_df.explain("formatted")` to inspect the plan).
- **Execution**:
  - Transformations are lazy; calling `show()`, `collect()`, or a write (e.g., `write.parquet(...)`) triggers the job.
  - Check the Spark UI (printed URL) for Jobs → Stages → Tasks and shuffles.

---

## ✅ TL;DR
- Load to DataFrame → `createOrReplaceTempView("table")` → run SQL with `spark.sql("...")` → action.
- Use `COUNT(*)`, ensure predicates are valid (e.g., `WHERE Age < 40`), and prefer Parquet for speed in production.

# Catalyst Optimizer in Spark

![Catalyst Optimizer](./images/catalyst_optimiser.png)

## 🔎 What is Catalyst?
- Catalyst is the **query optimization framework** inside the **Spark SQL Engine**.  
- It powers **DataFrame** and **SQL APIs**.  
- Its job: **analyze, optimize, and generate efficient execution plans** for your code.

---

## ⚙️ Catalyst Workflow (Four Phases)

### 1. **Analysis**
- Spark parses your SQL/DataFrame query into an **unresolved logical plan** (a tree derived from the **Abstract Syntax Tree (AST)**), then resolves it against the catalog.  
- Resolves:
  - Table/view names
  - Column names
  - SQL functions  
- Errors like “column not found” appear at this stage.
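
The resolution step can be pictured as a lookup against a catalog. The sketch below is a toy model under stated assumptions: `CATALOG`, `analyze`, and `AnalysisError` are hypothetical names for illustration and are not Spark classes or APIs.

```python
class AnalysisError(Exception):
    """Raised when a name in the query cannot be resolved (toy equivalent)."""


# Hypothetical catalog: view name -> {column name: type}
CATALOG = {"survey_tbl": {"Country": "string", "Age": "int"}}


def analyze(table: str, columns: list) -> dict:
    """Resolve a table and its referenced columns, or fail before any data is read."""
    if table not in CATALOG:
        raise AnalysisError(f"table or view not found: {table}")
    schema = CATALOG[table]
    missing = [c for c in columns if c not in schema]
    if missing:
        raise AnalysisError(f"cannot resolve column(s) {missing} in {table}")
    return {c: schema[c] for c in columns}


resolved = analyze("survey_tbl", ["Country", "Age"])
print(resolved)

try:
    analyze("survey_tbl", ["Salary"])
    error_message = None
except AnalysisError as exc:
    error_message = str(exc)
print(error_message)
```

The failure happens during resolution, before any execution, which is why a misspelled column is reported without a job being launched.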

---

### 2. **Logical Optimization**
- Catalyst applies mainly **rule-based** optimizations, plus optional **cost-based** ones when table statistics are available:
  - **Predicate Pushdown** → push filters closer to the data source.  
  - **Projection Pruning** → drop unused columns early.  
  - **Constant Folding** → pre-compute constants (e.g., `2+2 → 4`).  
  - **Boolean Simplification** → simplify conditions (`x AND true → x`).  
- Produces a single **optimized logical plan**; choosing among alternative execution strategies happens in physical planning.
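
Two of the rules above, constant folding and boolean simplification, can be sketched as a bottom-up rewrite of an expression tree. This is a toy under stated assumptions: the tuple encoding (`("lit", v)`, `("col", name)`, `(op, left, right)`) and the `simplify` function are invented for illustration and do not reflect Catalyst's actual classes.

```python
def is_lit(expr, value) -> bool:
    """True if expr is exactly the literal `value` (identity check avoids 1 == True)."""
    return expr[0] == "lit" and expr[1] is value


def simplify(expr):
    """Rewrite an expression tree bottom-up using two rules."""
    op = expr[0]
    if op in ("lit", "col"):
        return expr                      # leaves are already as simple as possible
    left, right = simplify(expr[1]), simplify(expr[2])

    # Constant folding: both operands are known, so compute the result now.
    if op == "add" and left[0] == "lit" and right[0] == "lit":
        return ("lit", left[1] + right[1])

    # Boolean simplification: x AND true -> x, x AND false -> false.
    if op == "and":
        if is_lit(left, True):
            return right
        if is_lit(right, True):
            return left
        if is_lit(left, False) or is_lit(right, False):
            return ("lit", False)

    return (op, left, right)


# Age > (2 + 2) AND true
expr = ("and", ("gt", ("col", "Age"), ("add", ("lit", 2), ("lit", 2))), ("lit", True))
optimized = simplify(expr)
print(optimized)  # ('gt', ('col', 'Age'), ('lit', 4))
```

Each rule is a local pattern match that preserves the meaning of the expression, so rules can be applied repeatedly without consulting the data.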

---

### 3. **Physical Planning**
- The chosen logical plan is mapped to a **physical plan**:  
  - Concrete execution strategy (joins, shuffles, scans).  
  - Uses **RDD operations** under the hood.  
- Spark may consider multiple strategies (e.g., broadcast join vs shuffle join) and choose the one it estimates to be cheapest, based on statistics such as table size.
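
The broadcast-vs-shuffle choice can be sketched as a size check. This is a deliberately simplified model: `choose_join` is a hypothetical function, and Spark's real selection also considers join type, hints, and other factors. The threshold mirrors the 10 MB default of the `spark.sql.autoBroadcastJoinThreshold` setting.

```python
MB = 1024 * 1024
BROADCAST_THRESHOLD_BYTES = 10 * MB  # same value as Spark's default threshold


def choose_join(left_bytes: int, right_bytes: int,
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from estimated input sizes (toy heuristic)."""
    if min(left_bytes, right_bytes) <= threshold:
        # The small side fits in memory on every executor: ship it, avoid a shuffle.
        return "broadcast_hash_join"
    # Both sides are large: shuffle both by join key, then merge sorted partitions.
    return "sort_merge_join"


small_dim_join = choose_join(5 * MB, 2048 * MB)
large_join = choose_join(500 * MB, 2048 * MB)
print(small_dim_join, large_join)
```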

---

### 4. **Code Generation (Whole-Stage Codegen)**
- Introduced with **Spark 2.0** as part of **Project Tungsten**.  
- Fuses the operators in a stage into a single generated Java function, which is compiled to **optimized Java bytecode** at runtime.  
- Benefits:
  - Fewer virtual function calls
  - Better CPU utilization
  - Faster execution on each executor
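
As a rough analogy in Python, the sketch below contrasts an "interpreted" pipeline, which calls a separate function per operator per row, with a generated function that fuses filter and projection into one loop. `generate_fused` and `interpreted` are hypothetical helpers; Spark generates Java source rather than Python, and `exec` on untrusted strings is unsafe outside a toy like this.

```python
def interpreted(rows, predicate, projection):
    """One function call per operator per row, chained through an iterator."""
    filtered = (row for row in rows if predicate(row))
    return [projection(row) for row in filtered]


def generate_fused(filter_expr: str, project_expr: str):
    """Generate source for a single loop doing filter + projection, then compile it."""
    source = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {filter_expr}:\n"
        f"            out.append({project_expr})\n"
        "    return out\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["fused"], source


rows = [
    {"Country": "US", "Age": 34},
    {"Country": "DE", "Age": 51},
    {"Country": "IN", "Age": 25},
]

fused, source = generate_fused("row['Age'] < 40", "row['Country']")
fused_result = fused(rows)
interpreted_result = interpreted(rows, lambda r: r["Age"] < 40, lambda r: r["Country"])
print(source)
print(fused_result)
```

Both paths return the same rows; the generated version simply removes the per-operator indirection, which is the effect whole-stage codegen aims for.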

---

## ✅ Why it matters
- As a developer, you **rarely need to hand-optimize** queries.  
- Just use **DataFrames/SQL**, and Spark’s **Catalyst Optimizer** automatically:
  - Builds efficient plans
  - Reduces shuffles
  - Improves speed

---

## 🔑 TL;DR
- Catalyst = Spark’s brain for **query optimization**.  
- Phases: **Analysis → Logical Optimization → Physical Planning → Code Generation**.  
- Use DataFrames/SQL to benefit from it automatically.