# Table of Contents 
 - [Data Lake, Data Sources & Data Sinks](#data-lake-data-sources--data-sinks)
 - [Spark DataFrameReader API](#spark-dataframereader-api--quick-setup-with-visual)

# Data Lake, Data Sources & Data Sinks
![data_lake](./images/data_lake.png)
## 🔹 What is a Data Lake?
A **Data Lake** is a central storage system that can hold **structured, semi-structured, and unstructured data** in its **raw format**.  
- Usually built on **distributed storage** like HDFS, Amazon S3, Azure Data Lake Storage, or Google Cloud Storage.  
- Designed for **scalability, cost-efficiency, and flexibility**.  
- Uses **schema-on-read** (apply schema only when processing).  
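
To make schema-on-read concrete, here is a minimal plain-Python sketch (not Spark). The raw lines are stored untouched, and a schema is applied only when they are read. The names `RAW_LINES` and `read_with_schema` and the sample fields are hypothetical, chosen only for illustration.

```python
import json

# Raw records as they might land in the lake: stored as-is, no schema enforced.
RAW_LINES = [
    '{"id": "1", "name": "Ana", "age": "34"}',
    '{"id": "2", "name": "Raj", "extra": "ignored"}',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> cast function) at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the read.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

# Two consumers read the same raw data with different schemas.
people = read_with_schema(RAW_LINES, {"id": int, "name": str, "age": int})
names_only = read_with_schema(RAW_LINES, {"name": str})
```

The stored data never changes; only the schema chosen by each reader does. A schema-on-write system would instead reject or reshape the second record at load time.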

---

## 🔹 Why Databases & Warehouses are NOT Data Lakes
- **Databases** (e.g., MySQL, Oracle, PostgreSQL):  
  - Optimized for **transactions (OLTP)**.  
  - Store only **structured data**.  
  - Limited capacity; not designed to hold petabytes of logs or images.  

- **Data Warehouses** (e.g., Redshift, Snowflake, BigQuery):  
  - Optimized for **analytics (OLAP)**.  
  - Store **clean, structured, aggregated data**.  
  - Require schema **before loading (schema-on-write)**.  
  - Expensive for raw/unstructured data storage.  

📌 Hence, they are not considered Data Lakes. They are **external systems** that may send data to or receive data from the lake.

---

## 🔹 Data Sources vs Data Sinks
- **Data Source:** Where the data comes from before Spark processes it.  
- **Data Sink:** Where the data goes after Spark processes it.  

👉 **Rule of Thumb:**  
- **Reading → Source**  
- **Writing → Sink**  
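
The rule of thumb can be sketched as one small job: `spark.read` touches the source, `df.write` touches the sink. This is an illustrative sketch that assumes an existing `SparkSession` is passed in as `spark`. The function name `copy_csv_to_parquet` and both paths are hypothetical.

```python
def copy_csv_to_parquet(spark, source_path, sink_path):
    """Read CSV files (the source) and write them as Parquet (the sink)."""
    # Source side: DataFrameReader, reached through spark.read
    df = (
        spark.read
        .format("csv")
        .option("header", "true")
        .load(source_path)
    )
    # Sink side: DataFrameWriter, reached through df.write
    (
        df.write
        .format("parquet")
        .mode("overwrite")   # replace any existing output at sink_path
        .save(sink_path)
    )
    return df
```

The same storage location can be a sink for one job and a source for the next; the role depends only on whether Spark is reading or writing.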

---

## 🔹 Internal vs External Storage (from a Data Lake POV)
- **Internal Storage:**  
  - Data already inside the Data Lake.  
  - Examples: HDFS, S3, ADLS, GCS, Spark SQL tables, Delta Lake tables.  
  - File formats: CSV, JSON, Parquet, Avro, ORC.  

- **External Storage:**  
  - Data that exists **outside the Data Lake**.  
  - Examples:  
    - **Databases:** MySQL, Oracle, SQL Server (via JDBC).  
    - **NoSQL:** MongoDB, Cassandra.  
    - **Cloud Warehouses:** Redshift, Snowflake, BigQuery.  
    - **Streams:** Kafka.  

---

## ✅ Summary
- **Data Lake** = Distributed storage system for raw + processed data (internal).  
- **Databases/Warehouses** = Specialized external systems for transactions/analytics.  
- **Internal Storage** = Files and cloud-based storage inside the Data Lake.  
- **External Storage** = Databases, warehouses, and streaming systems outside the Data Lake.  

# Spark DataFrameReader API — Quick Setup (with Visual)

![Spark Data Source API](./images/dataframe_api.png)

## ✅ Standard Pattern
Use the **same, consistent structure** for every source (CSV/JSON/Parquet/JDBC, etc.):
```python
df = (
    spark.read
    .format("<source>")           # "csv" | "json" | "parquet" | "jdbc" | ...
    .option("<key>", "<value>")   # source-specific options
    .schema(mySchema)             # optional (many formats can infer it)
    .load("<path-or-uri>")        # file or directory; JDBC uses load() with no path
)
```

### Key Parts
- **`.format()`** → declares the data source/connector.  
- **`.option(k, v)`** → controls how to read (header, delimiter, multiline, etc.).  
- **`.schema()`** → explicit schema; skips the extra pass over the data that CSV/JSON inference needs and pins the column types.  
- **`.load()`** → reads and returns a **DataFrame**.
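
One way to keep every read on this pattern is a small wrapper. The sketch below is a hypothetical helper (`read_source` is not part of Spark) that assumes `spark` is an existing `SparkSession`. It only chains the four calls described above.

```python
def read_source(spark, fmt, path=None, options=None, schema=None):
    """Build a read as format -> options -> schema -> load."""
    reader = spark.read.format(fmt)
    if options:
        reader = reader.options(**options)   # same effect as chaining .option(k, v)
    if schema is not None:
        reader = reader.schema(schema)       # StructType or DDL string
    # Path-less sources such as JDBC call load() with no argument.
    return reader.load(path) if path is not None else reader.load()
```

For example, `read_source(spark, "csv", "/data/mycsvfiles/", {"header": "true"}, "id INT, name STRING, age INT")` would mirror the CSV example further down, with the schema given as a DDL string instead of a `StructType`.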

---

## 🧰 Read Modes for Semi-Structured Files (CSV/JSON/XML)
Set with `.option("mode", "...")`:
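- **`PERMISSIVE`** *(default)* → keeps malformed rows: fields that cannot be parsed become `null`, and the raw text goes into `_corrupt_record`. With an explicit schema, declare that column (as a string) in the schema, otherwise it is not populated.  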
- **`DROPMALFORMED`** → drops malformed rows.  
- **`FAILFAST`** → aborts on first malformed row.

> Tip: Pair with `.option("columnNameOfCorruptRecord", "_corrupt_record")` if you need a custom column name.
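
A hedged sketch of `PERMISSIVE` mode in practice: read with a schema that includes the corrupt-record column, then split clean rows from malformed ones. The function name `read_csv_split_corrupt` and the columns `id`, `name`, `age` are assumptions for illustration, and `spark` is an existing `SparkSession`.

```python
def read_csv_split_corrupt(spark, path):
    """Read CSV in PERMISSIVE mode and split good rows from malformed ones."""
    df = (
        spark.read
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        # The corrupt-record column must be in the schema to be populated.
        .schema("id INT, name STRING, age INT, _corrupt_record STRING")
        .load(path)
        .cache()  # cache first: Spark rejects queries on raw files that use only this column
    )
    good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
    bad = df.filter("_corrupt_record IS NOT NULL")
    return good, bad
```

The `bad` DataFrame can be written to a quarantine location for inspection, while `good` continues through the pipeline.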

---

## 🧪 Minimal, Practical Examples

### 1) CSV
```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

mySchema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

csv_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .schema(mySchema) \
    .load("/data/mycsvfiles/")    # folder or file path
```

### 2) JSON
```python
json_df = spark.read \
    .format("json") \
    .option("multiLine", "true") \
    .option("mode", "FAILFAST") \
    .load("/data/events/*.json")
```

### 3) Parquet (columnar; schema embedded, usually the fastest to read)
```python
pq_df = spark.read \
    .format("parquet") \
    .load("/data/clean/parquet/")
```

### 4) JDBC (example)
```python
jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.orders") \
    .option("user", "<USER>") \
    .option("password", "<PASS>") \
    .load()
```

---

## 🔎 Checklist (Production-friendly)
- Prefer **standard API** (`format → option → schema → load`) over shortcuts (`.csv()`, `.json()`).
- **Define schemas** for CSV/JSON to avoid inference surprises and speed up reads.
- Set an explicit **`mode`** for resilient ingestion (PERMISSIVE for exploration, FAILFAST for strict jobs).
- Keep paths/environment **config-driven** (don’t hardcode).
- Validate data with small samples before full runs (`df.limit(10).show(truncate=False)`).
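
As a sketch of the config-driven and sample-first items, the connection details below come from environment variables rather than literals, and the job previews a few rows before any full run. The variable names (`ORDERS_JDBC_URL` and so on) and both function names are hypothetical; `spark` is assumed to be an existing `SparkSession`.

```python
import os

def jdbc_options_from_env(env=os.environ):
    """Build JDBC reader options from environment variables (hypothetical names)."""
    required = {
        "url": "ORDERS_JDBC_URL",
        "dbtable": "ORDERS_JDBC_TABLE",
        "user": "ORDERS_JDBC_USER",
        "password": "ORDERS_JDBC_PASSWORD",
    }
    missing = [var for var in required.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {option: env[var] for option, var in required.items()}

def preview_orders(spark, env=os.environ, rows=10):
    """Load the table via JDBC and show a small sample before a full run."""
    df = spark.read.format("jdbc").options(**jdbc_options_from_env(env)).load()
    df.limit(rows).show(truncate=False)
    return df
```

Failing early on a missing variable keeps a misconfigured environment from surfacing later as a confusing connection error.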

---