# Table of Contents 
 - [Data Lake, Data Sources & Data Sinks](#data-lake-data-sources--data-sinks)
 - [Spark DataFrameReader API](#spark-dataframereader-api--quick-setup-with-visual)
 - [Spark DataFrame Schema](#spark-dataframe-schema)

# Data Lake, Data Sources & Data Sinks
![data_lake](./images/data_lake.png)
## 🔹 What is a Data Lake?
A **Data Lake** is a central storage system that can hold **structured, semi-structured, and unstructured data** in its **raw format**.  
- Usually built on **distributed storage** like HDFS, Amazon S3, Azure Data Lake Storage, or Google Cloud Storage.  
- Designed for **scalability, cost-efficiency, and flexibility**.  
- Uses **schema-on-read** (apply schema only when processing).  
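
To make schema-on-read concrete, here is a minimal sketch in plain Python (no Spark involved). Raw lines are stored untouched and a schema is applied only when a consumer reads them. The `RAW_EVENTS` sample, its field names, and the `read_with_schema` helper are hypothetical, made up for illustration.

```python
import json

# Raw records land in the lake exactly as they arrived (untyped strings).
RAW_EVENTS = [
    '{"id": "1", "name": "Asha", "age": "31"}',
    '{"id": "2", "name": "Ben", "age": "27"}',
]

def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and cast only the fields named in `schema`."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows

# Two consumers apply two different schemas to the same raw data.
people = read_with_schema(RAW_EVENTS, {"id": int, "name": str, "age": int})
ids_only = read_with_schema(RAW_EVENTS, {"id": int})
```

The stored data never changes; only the schema chosen at read time does, which is the opposite of the schema-on-write approach used by warehouses.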

---

## 🔹 Why Databases & Warehouses are NOT Data Lakes
- **Databases** (e.g., MySQL, Oracle, PostgreSQL):  
  - Optimized for **transactions (OLTP)**.  
  - Store only **structured data**.  
  - Limited capacity; not designed for petabytes of logs or images.  

- **Data Warehouses** (e.g., Redshift, Snowflake, BigQuery):  
  - Optimized for **analytics (OLAP)**.  
  - Store **clean, structured, aggregated data**.  
  - Require schema **before loading (schema-on-write)**.  
  - Expensive for raw/unstructured data storage.  

📌 Hence, they are not considered Data Lakes. They are **external systems** that may send data to or receive data from the lake.

---

## 🔹 Data Sources vs Data Sinks
- **Data Source:** Where the data comes from before Spark processes it.  
- **Data Sink:** Where the data goes after Spark processes it.  

👉 **Rule of Thumb:**  
- **Reading → Source**  
- **Writing → Sink**  
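
As a rough illustration of the rule of thumb, the sketch below reads CSV files (the source) and writes them back as Parquet (the sink). The function name `csv_to_parquet` and the `overwrite` save mode are assumptions chosen for the example, and `spark` is expected to be an existing `SparkSession`.

```python
def csv_to_parquet(spark, source_path, sink_path):
    """Read CSV files (the source) and write them out as Parquet (the sink)."""
    # Source: the DataFrameReader pulls data into Spark.
    df = (
        spark.read
        .format("csv")
        .option("header", "true")
        .load(source_path)
    )
    # Sink: the DataFrameWriter pushes the processed data out of Spark.
    (
        df.write
        .format("parquet")
        .mode("overwrite")  # assumption: replacing existing output is acceptable
        .save(sink_path)
    )
    return df

# Example call (paths are placeholders):
# csv_to_parquet(spark, "/data/mycsvfiles/", "/data/clean/parquet/")
```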

---

## 🔹 Internal vs External Storage (from a Data Lake POV)
- **Internal Storage:**  
  - Data already inside the Data Lake.  
  - Examples: HDFS, S3, ADLS, GCS, Spark SQL tables, Delta Lake tables.  
  - File formats: CSV, JSON, Parquet, Avro, ORC.  

- **External Storage:**  
  - Data that exists **outside the Data Lake**.  
  - Examples:  
    - **Databases:** MySQL, Oracle, SQL Server (via JDBC).  
    - **NoSQL:** MongoDB, Cassandra.  
    - **Cloud Warehouses:** Redshift, Snowflake, BigQuery.  
    - **Streams:** Kafka.  

---

## ✅ Summary
- **Data Lake** = Distributed storage system for raw + processed data (internal).  
- **Databases/Warehouses** = Specialized external systems for transactions/analytics.  
- **Internal Storage** = Files and cloud-based storage inside the Data Lake.  
- **External Storage** = Databases, warehouses, and streaming systems outside the Data Lake.  

# Spark DataFrameReader API – Quick Setup (with Visual)

![Spark Data Source API](./images/dataframe_api.png)

## ✅ Standard Pattern
Use the **same, consistent structure** for every source (CSV/JSON/Parquet/JDBC, etc.):
```python
# Parentheses allow inline comments; a trailing "\" followed by a comment is a syntax error.
df = (
    spark.read
    .format("<source>")          # "csv" | "json" | "parquet" | "jdbc" | ...
    .option("<key>", "<value>")  # source-specific options
    .schema(mySchema)            # optional (many formats can infer it)
    .load("<path-or-uri>")       # directory or single file (JDBC takes no path)
)
```

### Key Parts
- **`.format()`** → declares the data source/connector.  
- **`.option(k, v)`** → controls how to read (header, delimiter, multiline, etc.).  
- **`.schema()`** → explicit schema (faster, safer than inference).  
- **`.load()`** → reads and returns a **DataFrame**.
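
One way to keep this order consistent across a codebase is a small wrapper. The `read_source` helper below is a hypothetical sketch, not part of Spark; it only chains the real `DataFrameReader` methods `format`, `options`, `schema`, and `load` on a `SparkSession` passed in as `spark`.

```python
def read_source(spark, fmt, path=None, schema=None, **options):
    """Apply format -> options -> schema -> load in a fixed order."""
    reader = spark.read.format(fmt).options(**options)
    if schema is not None:
        # Accepts a StructType or a DDL string such as "id INT, name STRING".
        reader = reader.schema(schema)
    # File sources take a path; JDBC-style sources are configured by options only.
    return reader.load(path) if path is not None else reader.load()

# Example calls (paths and option values are placeholders):
# csv_df = read_source(spark, "csv", "/data/mycsvfiles/",
#                      schema="id INT, name STRING, age INT", header="true")
# pq_df = read_source(spark, "parquet", "/data/clean/parquet/")
```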

---

## 🧰 Read Modes for Semi-Structured Files (CSV/JSON/XML)
Set with `.option("mode", "...")`:
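- **`PERMISSIVE`** *(default)* → keeps malformed rows, sets the fields that fail to parse to `null`, and puts the raw text in `_corrupt_record`. With an explicit schema, that column only appears if the schema declares it as a string field.  
- **`DROPMALFORMED`** → drops malformed rows.  
- **`FAILFAST`** → aborts on the first malformed row.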

> Tip: `_corrupt_record` is the default column name. Use `.option("columnNameOfCorruptRecord", "<name>")` if you need a custom one.
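
A minimal sketch of how `PERMISSIVE` mode might be used to separate clean rows from malformed ones. The helper name `split_good_and_bad` and the `id`/`name`/`age` columns are assumptions for illustration. The schema is given as a DDL string and declares `_corrupt_record` explicitly so that Spark populates it.

```python
def split_good_and_bad(spark, path):
    """Read CSV permissively and separate parseable rows from malformed ones."""
    # The corrupt-record column must be declared in the schema to be populated.
    schema = "id INT, name STRING, age INT, _corrupt_record STRING"
    df = (
        spark.read
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(schema)
        .load(path)
        # Caching avoids Spark's restriction on queries that reference only
        # the corrupt-record column of an uncached raw CSV/JSON file.
        .cache()
    )
    good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
    bad = df.filter("_corrupt_record IS NOT NULL").select("_corrupt_record")
    return good, bad

# Example call (path is a placeholder):
# good_df, bad_df = split_good_and_bad(spark, "/data/mycsvfiles/")
```

The `bad` DataFrame can then be written to a quarantine location for inspection instead of silently dropping the rows.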

---

## 🧪 Minimal, Practical Examples

### 1) CSV
```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

mySchema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

csv_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .schema(mySchema) \
    .load("/data/mycsvfiles/")    # folder or file path
```

### 2) JSON
```python
json_df = spark.read \
    .format("json") \
    .option("multiline", "true") \
    .option("mode", "FAILFAST") \
    .load("/data/events/*.json")
```

### 3) Parquet (schema embedded; fastest, columnar)
```python
pq_df = spark.read \
    .format("parquet") \
    .load("/data/clean/parquet/")
```

### 4) JDBC (example)
```python
jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.orders") \
    .option("user", "<USER>") \
    .option("password", "<PASS>") \
    .load()
```

---

## 🔎 Checklist (Production-friendly)
- Prefer **standard API** (`format → option → schema → load`) over shortcuts (`.csv()`, `.json()`).
- **Define schemas** for CSV/JSON to avoid inference surprises and speed up reads.
- Set an explicit **`mode`** for resilient ingestion (PERMISSIVE for exploration, FAILFAST for strict jobs).
- Keep paths/environment **config-driven** (don't hardcode).
- Validate data with small samples before full runs (`df.limit(10).show(truncate=False)`).
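
For the config-driven item, one possible approach is to keep path templates in configuration and fill them from environment variables. The `resolve_path` helper and the `LAKE_ROOT` variable name below are hypothetical; the sketch uses only the standard library.

```python
import os
from string import Template

def resolve_path(template, env=None):
    """Fill ${VAR} placeholders in a path template from the environment."""
    variables = os.environ if env is None else env
    # substitute() raises KeyError when a placeholder has no value,
    # so a missing setting fails loudly instead of reading the wrong path.
    return Template(template).substitute(variables)

# Hypothetical variable name; in practice it would be set per environment.
flights_path = resolve_path("${LAKE_ROOT}/flights/", {"LAKE_ROOT": "/data"})
# flights_path == "/data/flights/"
```

The resolved string can be passed straight to `.load(...)`, so the same job runs against dev, test, or production storage without code changes.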

---

# Spark DataFrame Schema

![Spark Data Types](./images/spark_data_types.png)

---

## 🔹 Why Define Schema?
- Schema inference (like in CSV/JSON) is **not always reliable**.  
- Defining schema explicitly ensures **correct column names** and **data types**.  
- Spark provides two main ways:
  1. **Programmatically using StructType & StructField**  
  2. **Using a DDL (Data Definition Language) String**  

---

## 1️⃣ Programmatic Schema (StructType)

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Define schema
schema = StructType([
    StructField("FL_DATE", DateType(), True),
    StructField("AIRLINE", StringType(), True),
    StructField("FL_NUM", IntegerType(), True)
])

# Read CSV with explicit schema
# Parentheses allow the inline comment; "\" followed by a comment is a syntax error.
df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("dateFormat", "M/d/y")  # pattern used to parse FL_DATE
    .schema(schema)
    .load("/data/flights.csv")
)

df.printSchema()
df.show(5)
```

---

## 2️⃣ Schema with DDL String

```python
# Define schema using DDL string
ddl_schema = "FL_DATE DATE, AIRLINE STRING, FL_NUM INT"

# Read JSON with explicit DDL schema
df = spark.read \
    .format("json") \
    .option("dateFormat", "M/d/y") \
    .schema(ddl_schema) \
    .load("/data/flights.json")

df.printSchema()
df.show(5)
```
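
When column definitions live in configuration rather than in code, the DDL string can be assembled from name/type pairs. The `to_ddl` helper below is a hypothetical convenience, not a Spark API, and it assumes simple column names that need no backtick quoting.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)

ddl_schema = to_ddl([
    ("FL_DATE", "DATE"),
    ("AIRLINE", "STRING"),
    ("FL_NUM", "INT"),
])
# ddl_schema == "FL_DATE DATE, AIRLINE STRING, FL_NUM INT"
```

The result can be passed to `.schema(ddl_schema)` exactly like the hand-written string.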

---

## ✅ Summary
- **StructType/StructField** → More explicit, flexible, IDE-friendly.  
- **DDL String** → Short and simple for quick schema definitions.  
- Always set **`dateFormat`** (or **`timestampFormat`** for timestamps) when working with date/time fields in CSV/JSON.  