# Table of Contents 
 - [Data Lake, Data Sources & Data Sinks](#data-lake-data-sources--data-sinks)
 - [Spark DataFrameReader API](#spark-dataframereader-api--quick-setup-with-visual)
 - [Spark DataFrame Schema](#spark-dataframe-schema)
 - [Spark DataFrameWriter and Data Sink](#spark-data-sink-api--writing-dataframes)
 - [Spark Databases and Tables](#spark-databases-and-tables)
    - [Managed Tables](#managed-tables)
    - [Unmanaged Tables](#unmanaged-external-tables)

# Data Lake, Data Sources & Data Sinks
![data_lake](./images/data_lake.png)
## 🔹 What is a Data Lake?
A **Data Lake** is a central storage system that can hold **structured, semi-structured, and unstructured data** in its **raw format**.  
- Usually built on **distributed storage** like HDFS, Amazon S3, Azure Data Lake Storage, or Google Cloud Storage.  
- Designed for **scalability, cost-efficiency, and flexibility**.  
- Uses **schema-on-read** (apply schema only when processing).  
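
To make schema-on-read concrete, here is a minimal plain-Python sketch (no Spark required). The raw lines, the field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration. The stored data stays as raw text, and a schema is applied only when a consumer reads it.

```python
import json

# Raw records land in the lake exactly as produced; nothing is validated on write.
raw_lines = [
    '{"id": "1", "name": "Alice", "age": "30"}',
    '{"id": "2", "name": "Bob", "age": "41", "city": "Pune"}',
]


def read_with_schema(lines, schema):
    """Apply a {field: caster} schema at read time (hypothetical helper)."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Keep only the fields this reader asked for, cast to the requested type.
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two consumers read the same raw data with two different schemas.
people = read_with_schema(raw_lines, {"id": int, "name": str})
ages = read_with_schema(raw_lines, {"id": int, "age": int})

print(people)
print(ages)
```

A schema-on-write system would instead reject or reshape the second record (with its extra `city` field) at load time.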

---

## 🔹 Why Databases & Warehouses are NOT Data Lakes
- **Databases** (e.g., MySQL, Oracle, PostgreSQL):  
  - Optimized for **transactions (OLTP)**.  
  - Store only **structured data**.  
  - Limited capacity; not designed to hold petabytes of logs or images.  

- **Data Warehouses** (e.g., Redshift, Snowflake, BigQuery):  
  - Optimized for **analytics (OLAP)**.  
  - Store **clean, structured, aggregated data**.  
  - Require schema **before loading (schema-on-write)**.  
  - Expensive for raw/unstructured data storage.  

📌 Hence, they are not considered Data Lakes. They are **external systems** that may send data to or receive data from the lake.

---

## 🔹 Data Sources vs Data Sinks
- **Data Source:** Where the data comes from before Spark processes it.  
- **Data Sink:** Where the data goes after Spark processes it.  

👉 **Rule of Thumb:**  
- **Reading → Source**  
- **Writing → Sink**  

---

## 🔹 Internal vs External Storage (from a Data Lake POV)
- **Internal Storage:**  
  - Data already inside the Data Lake.  
  - Examples: HDFS, S3, ADLS, GCS, Spark SQL tables, Delta Lake tables.  
  - File formats: CSV, JSON, Parquet, Avro, ORC.  

- **External Storage:**  
  - Data that exists **outside the Data Lake**.  
  - Examples:  
    - **Databases:** MySQL, Oracle, SQL Server (via JDBC).  
    - **NoSQL:** MongoDB, Cassandra.  
    - **Cloud Warehouses:** Redshift, Snowflake, BigQuery.  
    - **Streams:** Kafka.  

---

## ✅ Summary
- **Data Lake** = Distributed storage system for raw + processed data (internal).  
- **Databases/Warehouses** = Specialized external systems for transactions/analytics.  
- **Internal Storage** = Files and cloud-based storage inside the Data Lake.  
- **External Storage** = Databases, warehouses, and streaming systems outside the Data Lake.  

# Spark DataFrameReader API — Quick Setup (with Visual)

![Spark Data Source API](./images/dataframe_api.png)

## ✅ Standard Pattern
Use the **same, consistent structure** for every source (CSV/JSON/Parquet/JDBC, etc.):
```python
# Parentheses are used instead of backslashes because Python does not allow
# a comment after a line-continuation backslash.
df = (
    spark.read
    .format("<source>")          # "csv" | "json" | "parquet" | "jdbc" | ...
    .option("<key>", "<value>")  # source-specific options
    .schema(mySchema)            # optional (many formats can infer it)
    .load("<path-or-uri>")       # directory or single file; no path for JDBC
)
```

### Key Parts
- **`.format()`** → declares the data source/connector.  
- **`.option(k, v)`** → controls how to read (header, delimiter, multiline, etc.).  
- **`.schema()`** → explicit schema (faster, safer than inference).  
- **`.load()`** → reads and returns a **DataFrame**.

---

## 🧰 Read Modes for Semi-Structured Files (CSV/JSON/XML)
Set with `.option("mode", "...")`:
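- **`PERMISSIVE`** *(default)* → keeps bad rows, sets fields it cannot parse to `null`, and puts the raw text in `_corrupt_record`. With an explicit schema, that column is populated only if the schema declares it as a string field.  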
- **`DROPMALFORMED`** → drops malformed rows.  
- **`FAILFAST`** → aborts on first malformed row.

> Tip: Pair with `.option("columnNameOfCorruptRecord", "_corrupt_record")` if you need a custom column name.

---

## 🧪 Minimal, Practical Examples

### 1) CSV
```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

mySchema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

csv_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .schema(mySchema) \
    .load("/data/mycsvfiles/")    # folder or file path
```

### 2) JSON
```python
json_df = spark.read \
    .format("json") \
    .option("multiline", "true") \
    .option("mode", "FAILFAST") \
    .load("/data/events/*.json")
```

### 3) Parquet (schema embedded; fastest, columnar)
```python
pq_df = spark.read \
    .format("parquet") \
    .load("/data/clean/parquet/")
```

### 4) JDBC (example)
```python
jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.orders") \
    .option("user", "<USER>") \
    .option("password", "<PASS>") \
    .load()
```

---

## 🔎 Checklist (Production-friendly)
- Prefer **standard API** (`format → option → schema → load`) over shortcuts (`.csv()`, `.json()`).
- **Define schemas** for CSV/JSON to avoid inference surprises and speed up reads.
- Set an explicit **`mode`** for resilient ingestion (PERMISSIVE for exploration, FAILFAST for strict jobs).
- Keep paths/environment **config-driven** (don’t hardcode).
- Validate data with small samples before full runs (`df.limit(10).show(truncate=False)`).
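
One way to keep paths and options config-driven is sketched below. The `READER_CONFIG` dictionary, the `APP_ENV` variable name, the paths, and both helper functions are hypothetical; in practice the settings would more likely live in a YAML/JSON file or a configuration service.

```python
import os

# Hypothetical per-environment settings, kept out of the job logic.
READER_CONFIG = {
    "dev": {
        "format": "csv",
        "path": "/data/dev/mycsvfiles/",
        "options": {"header": "true", "mode": "PERMISSIVE"},
    },
    "prod": {
        "format": "csv",
        "path": "/data/prod/mycsvfiles/",
        "options": {"header": "true", "mode": "FAILFAST"},
    },
}


def reader_settings(env=None):
    """Pick settings for the given environment (defaults to APP_ENV, then dev)."""
    env = env or os.environ.get("APP_ENV", "dev")
    return READER_CONFIG[env]


def read_from_config(spark, settings, schema=None):
    """Build the standard format -> options -> schema -> load chain from settings."""
    reader = spark.read.format(settings["format"]).options(**settings["options"])
    if schema is not None:
        reader = reader.schema(schema)
    return reader.load(settings["path"])


print(reader_settings("prod"))
```

With this shape, the same job code can run exploratory reads in `dev` and strict reads in `prod` by changing only the environment setting.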

---

# Spark DataFrame Schema

![Spark Data Types](./images/spark_data_types.png)

---

## 🔹 Why Define Schema?
- Schema inference (like in CSV/JSON) is **not always reliable**.  
- Defining schema explicitly ensures **correct column names** and **data types**.  
- Spark provides two main ways:
  1. **Programmatically using StructType & StructField**  
  2. **Using a DDL (Data Definition Language) String**  

---

## 1️⃣ Programmatic Schema (StructType)

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Define schema
schema = StructType([
    StructField("FL_DATE", DateType(), True),
    StructField("AIRLINE", StringType(), True),
    StructField("FL_NUM", IntegerType(), True)
])

# Read CSV with explicit schema
df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("dateFormat", "M/d/y")  # specify date parsing format
    .schema(schema)
    .load("/data/flights.csv")
)

df.printSchema()
df.show(5)
```

---

## 2️⃣ Schema with DDL String

```python
# Define schema using DDL string
ddl_schema = "FL_DATE DATE, AIRLINE STRING, FL_NUM INT"

# Read JSON with explicit DDL schema
df = spark.read \
    .format("json") \
    .option("dateFormat", "M/d/y") \
    .schema(ddl_schema) \
    .load("/data/flights.json")

df.printSchema()
df.show(5)
```

---

## ✅ Summary
- **StructType/StructField** → More explicit, flexible, IDE-friendly.  
- **DDL String** → Short and simple for quick schema definitions.  
- Set **`dateFormat`** (or **`timestampFormat`**) whenever date/time fields in CSV/JSON do not follow Spark's default `yyyy-MM-dd` style patterns.  

# Spark Data Sink API — Writing DataFrames

![Spark Data Sink API](./images/dataframewriter.png)

---

## 🔹 Introduction
In Spark, you write data using the **DataFrameWriter API**.  
You access it through the `.write` property of a DataFrame.  
This API is standardized and works with both **internal formats** (Parquet, JSON, CSV, ORC, Avro, Delta Lake) and **external sinks** (JDBC, Cassandra, MongoDB, Kafka, etc.).

---

## 🔹 General Structure
```python
(
    df.write
    .format("parquet")            # Output format (default = Parquet)
    .mode("overwrite")            # Save mode: append/overwrite/errorIfExists/ignore
    .option("path", "/data/out")  # Target location or config
    .save()
)
```

### Write Modes
- **append** → Add new data alongside existing files.  
- **overwrite** → Delete old data and write fresh output.  
- **errorIfExists** → Fail if data already exists.  
- **ignore** → Do nothing if data already exists.  

---

## 🔹 Layout Controls
Beyond format and mode, Spark gives you tools to control **how data is organized and stored**.

---

### 1) **Repartition**
- **What it does:** Changes the number of partitions → controls number of output files.  
- **Why use:** Balance file counts, manage parallelism.  
- **Example:**
  ```python
  df.repartition(5).write.parquet("/data/output/repartitioned/")
  ```
- **Advantage:** Prevents too many small files or one giant file.

---

### 2) **partitionBy**
- **What it does:** Creates **folders by column values**.  
- **Why use:** Logical data organization + faster queries (partition pruning).  
- **Example:**
  ```python
  df.write \
    .partitionBy("country", "state") \
    .parquet("/data/output/partitioned/")
  ```
- **Advantage:** Queries can scan only needed partitions (`WHERE country='US'` → reads only US folder).  
- **Caution:** Too many unique values = too many small folders/files.

---

### 3) **bucketBy** (for Spark-managed tables)
- **What it does:** Hashes rows into a fixed number of buckets (files).  
- **Why use:** Optimize **joins** and **groupBy** on the bucketed column.  
- **Example:**
  ```python
  df.write \
    .bucketBy(8, "user_id") \
    .sortBy("user_id") \
    .saveAsTable("bucketed_users")
  ```
- **Advantage:** Spark can skip shuffle during joins when both tables share same bucket column + bucket count.  
- **Caution:** Only works when writing a table with `saveAsTable()`, not when writing plain files with `save()`.

- **Bucket column:**
  - The column you tell Spark to use when distributing rows into buckets.
  - Spark takes the value of that column, applies a hash function, and assigns the row to a bucket.

- **Bucket count:**
  - The number of buckets you want Spark to create; it is the `8` in the example above.
  - Each write task can emit its own file per bucket, so the number of files may be higher than the bucket count.
  - Spark spreads all rows across these 8 buckets based on the hash of `user_id`.


---

### 4) **sortBy**
- **What it does:** Sorts rows inside each bucket.  
- **Why use:** Improves range queries, window functions, and some joins.  
- **Example:**
  ```python
  df.write \
    .bucketBy(8, "user_id") \
    .sortBy("timestamp") \
    .saveAsTable("bucketed_sorted_users")
  ```
- **Advantage:** Data is ordered within buckets → faster queries on ranges.

---

### 5) **maxRecordsPerFile**
- **What it does:** Limits number of rows per file.  
- **Why use:** Prevent very large files, balance parallelism.  
- **Example:**
  ```python
  df.write \
    .option("maxRecordsPerFile", 1000000) \
    .parquet("/data/output/maxrecords/")
  ```
- **Advantage:** Avoids giant files (hard to read) and ensures manageable chunk sizes.

---

## ✅ Summary
- Use **write modes** to control overwrite/append behavior.  
- Use **repartition** to manage file counts.  
- Use **partitionBy** for logical folder-based organization.  
- Use **bucketBy + sortBy** for optimized joins and range queries.  
- Use **maxRecordsPerFile** to avoid massive files and keep outputs balanced.  

Together, these make your Spark writes more **efficient, organized, and query-friendly**.

# 🪣 Bucketing with Two Tables in Spark

## 🔹 Scenario
You have **two tables** (`Table A` and `Table B`) and you want to join them on `user_id`.

---

## 🔹 Applying Bucketing
- You bucket **Table A** with:
  ```python
  dfA.write.bucketBy(4, "user_id").saveAsTable("tableA")
  ```
  → Each row in Table A goes to one of **4 buckets** based on:
  ```
  bucket_number = hash(user_id) % 4
  ```

- You bucket **Table B** with:
  ```python
  dfB.write.bucketBy(4, "user_id").saveAsTable("tableB")
  ```
  → Same rule is applied for Table B.

---

## 🔹 What Happens?
- **Each table has its own set of bucket files** (they don’t share physical files).  
- But since both use the **same column (`user_id`)** and the **same bucket count (4)**,  
  the same `user_id` values end up in the **same bucket number** in both tables.  

👉 Example with 4 buckets (simplified hash = user_id):
- `user_id = 1 → 1 % 4 = 1 → Bucket 1 in both Table A and B`  
- `user_id = 2 → 2 % 4 = 2 → Bucket 2 in both Table A and B`  
- `user_id = 3 → 3 % 4 = 3 → Bucket 3 in both Table A and B`  
- `user_id = 4 → 4 % 4 = 0 → Bucket 0 in both Table A and B`
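
The same alignment can be simulated in plain Python. This is only a sketch of the idea: it reuses the simplified rule `user_id % 4`, whereas Spark applies its own internal hash function, and the table contents and helper names are hypothetical.

```python
NUM_BUCKETS = 4


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    # Simplified stand-in for Spark's hash: use the integer itself.
    return user_id % num_buckets


def bucketize(rows):
    """Group rows into buckets by the key stored in the first position."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_for(row[0])].append(row)
    return buckets


table_a = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
table_b = [(4, "order-40"), (1, "order-10"), (2, "order-20"), (1, "order-11")]

buckets_a = bucketize(table_a)
buckets_b = bucketize(table_b)

# Join bucket by bucket: bucket i of A is only ever compared with bucket i of B.
joined = []
for b in range(NUM_BUCKETS):
    for user_id, name in buckets_a[b]:
        for other_id, order in buckets_b[b]:
            if user_id == other_id:
                joined.append((user_id, name, order))

print(joined)
```

The bucket-wise result matches a full join of the two lists, yet no row ever had to be compared with a row from a different bucket.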

---

## 🔹 Why is This Useful?
When Spark performs a join:
- It knows **Bucket 0 of Table A only needs to join with Bucket 0 of Table B**.  
- Same for Bucket 1 with Bucket 1, and so on.  

⚡ This avoids a **full shuffle of all rows** (wide dependency).  
Instead, the join happens **bucket-to-bucket** (narrower dependency).  

---

## ✅ Key Point
- **Table B does not append to Table A’s buckets**.  
- Each table keeps its own files, but **follows the same bucketing logic**.  
- This alignment is enough for Spark to optimize joins and reduce shuffle dramatically.

# Spark Databases and Tables

Spark can act like a **database system**, allowing you to create **databases**, **tables**, and **views**.  
A table in Spark consists of two parts:  

1. **Table Data** → stored as files in distributed storage (HDFS, S3, etc.).  
2. **Table Metadata** → stored in the **catalog (metastore)**, which holds schema, table name, database name, partitions, and data location.  

By default, Spark uses an **in-memory catalog** that lasts only for the session; for persistence it relies on the **Hive Metastore** (enabled with `enableHiveSupport()` on the session builder).

---

## 🔹 Types of Spark Tables
Spark supports two types of tables:  
1. **Managed Tables**  
2. **Unmanaged (External) Tables**

---

## Managed Tables
- Spark manages **both data and metadata**.  
- Data is stored under the **warehouse directory**:  
  ```
  spark.sql.warehouse.dir
  ```
- You don’t specify a location — Spark decides it.  
- Dropping a managed table deletes **both metadata and data**.  

### Example
```python
df.write.saveAsTable("my_managed_table")
```

![Managed Table](./images/managed_table.png)

**Advantages:**
- Simple management (Spark handles location and lifecycle).  
- Supports advanced features like **bucketing** and **sorting**.  
- Future Spark SQL improvements target managed tables.  

---

## Unmanaged (External) Tables
- Spark manages **only metadata**, **not the data**.  
- You must specify the **data location** when creating the table.  
- Dropping an unmanaged table deletes **only metadata**; the underlying data files remain intact.  

### Example
```sql
CREATE TABLE my_unmanaged_table (
    col1 INT,
    col2 STRING
)
USING PARQUET
LOCATION '/data/external/myfiles/';
```

![Unmanaged Table](./images/unmanaged.png)

**Use Case:**
- Reuse existing data files (already in S3, HDFS, etc.) as a Spark SQL table.  
- No control over data lifecycle (Spark won’t move or delete the files).  

---

## 🔎 Key Differences: Managed vs Unmanaged

| Feature              | Managed Table                        | Unmanaged Table (External)      |
|----------------------|--------------------------------------|---------------------------------|
| Data Location        | Stored in `spark.sql.warehouse.dir`  | Must be provided by user        |
| Data Lifecycle       | Spark manages data + metadata        | Spark manages only metadata     |
| Drop Table Behavior  | Deletes data + metadata              | Deletes only metadata           |
| Advanced Features    | Supports bucketing & sorting         | No bucketing/sorting (no control over data) |
| Use Case             | Default, clean management, full SQL  | Reuse existing external data     |

---

## ✅ Summary
- **Managed Tables**: Spark controls everything; best for new data pipelines, with support for bucketing & sorting.  
- **Unmanaged Tables**: Only metadata is managed; best for mapping **existing data** to Spark SQL, since Spark has no control over the data files.  