
## Table of Contents
1. [Introduction to Big Data](#introduction-to-big-data)
   - [The 4 Vs of Big Data](#the-4-vs-of-big-data)
   - [Limitations of Traditional Databases](#limitations-of-traditional-databases)
2. [Big Data System](#big-data-system)
   - [Three Core Challenges](#three-core-challenges)
3. [MapReduce](#5-mapreduce)
4. [Apache Spark](#6-apache-spark)
5. [Batch vs Streaming Processing](#7-batch-vs-streaming-processing)
6. [Data Lake](#10-data-lake)
7. [ETL vs ELT](#11-etl-vs-elt)
8. [Data Warehouse](#12-data-warehouse)
9. [Data Mart](#13-data-mart)
10. [Cloud Platforms & AWS Services](#15-cloud-platforms--aws-services)

---



# Introduction to Big Data

## [What is Big Data](https://www.upgrad.com/blog/what-is-big-data-types-characteristics-benefits-and-examples/)

Big Data refers to datasets that are **so large, complex, or fast-moving** that they cannot be managed or analyzed using traditional data processing tools.  

As Gartner defines it, big data is "**high-volume, high-velocity, and high-variety information assets** that require innovative processing for enhanced decision-making, insight discovery, and process optimization."


<img src="./pic/1_What_Is_Big_Data.jpg" width=400>  

<img src="./pic/1_how_big_data_works.webp" width=400>



## Why Big Data Matters

Big Data has become essential in modern computing due to several converging factors:

| Driver | Description |
|--------|-------------|
| **Explosive Data Growth** | Global data doubles every 2-3 years; by 2025, estimated 175+ zettabytes worldwide |
| **Digital Interactions** | Every click, swipe, purchase, and social media post generates data |
| **IoT Devices & Sensors** | Smart devices, wearables, industrial sensors produce continuous data streams |
| **Real-time Analytics** | Businesses need instant insights for competitive advantage |
| **ML/AI Requirements** | Machine learning models require massive datasets for training |
| **Cheap Storage** | Cost per GB has dropped dramatically, making it economical to store everything |

**Example: Daily Data Generation**    

- 500 million tweets per day
- 4 petabytes of data created on Facebook daily
- 720,000 hours of video uploaded to YouTube daily
- 6 billion Google searches per day






### Limitations of Traditional Databases

#### Why a Single Database Isn't Enough

Traditional RDBMS (like MySQL, PostgreSQL) face fundamental limitations at scale:

| Limitation | Explanation | Impact |
|------------|-------------|--------|
| **Cannot Scale to TB/PB** | Vertical scaling has physical limits | Cannot handle Big Data volumes |
| **Limited CPU/Memory** | Single server bottleneck | Processing becomes the constraint |
| **Single Point of Failure** | One server = one failure point | Downtime affects entire system |
| **Slow for Real-time Analytics** | Designed for transactions, not analytics | Cannot meet velocity requirements |
| **Tight Coupling** | Storage and compute bound together | Cannot scale independently |

### Example: The Scaling Problem

```text
Scenario: E-commerce company with 100M users

Traditional Approach:
├── Single PostgreSQL server
├── 500GB RAM (maximum practical)
├── 10TB storage
└── Result: Cannot handle Black Friday traffic spike

Big Data Approach:
├── Distributed cluster (100 nodes)
├── 50TB total RAM
├── 1PB distributed storage
└── Result: Scales horizontally to meet demand
```





## The 4 Vs of Big Data

The **four defining characteristics of Big Data** form the foundation for understanding its challenges:

<img src="./pic/1_4Vs-of-Big-Data-Infographic.png" width=500>

### Volume
**Definition:** The **sheer size of data** being generated and stored.

| Scale | Size | Example |
|-------|------|---------|
| Gigabyte (GB) | 10⁹ bytes | A few hours of HD video |
| Terabyte (TB) | 10¹² bytes | Large enterprise database |
| Petabyte (PB) | 10¹⁵ bytes | Netflix's entire video library |
| Exabyte (EB) | 10¹⁸ bytes | All words ever spoken by humans |

**Example:** Walmart processes 2.5 petabytes of customer transaction data per hour.

### Velocity
**Definition:** The **speed** at which *data flows in, is processed, and analyzed*.

| Type | Latency | Use Case |
|------|---------|----------|
| **Batch** | Hours/Days | Monthly reports, historical analysis |
| **Near Real-time** | Minutes | Inventory updates, email campaigns |
| **Real-time** | Milliseconds | Fraud detection, stock trading |

**Example:** Credit card fraud detection must analyze transactions in <100ms to block fraudulent purchases.

### Variety
**Definition:** Different **types and formats of data** from multiple sources.

| Data Type | Format | Examples |
|-----------|--------|----------|
| **Structured** | Tables, rows, columns | SQL databases, spreadsheets |
| **Semi-structured** | Has organization but flexible | JSON, XML, CSV, logs |
| **Unstructured** | No predefined format | Images, videos, emails, PDFs |

**Example:** A hospital processes structured patient records, semi-structured lab results (JSON), and unstructured MRI images.

### Veracity
**Definition:** The **trustworthiness, accuracy, and quality** of data.

| Challenge | Description | Solution |
|-----------|-------------|----------|
| **Inconsistency** | Same entity represented differently | Data normalization |
| **Incompleteness** | Missing values | Imputation, validation rules |
| **Ambiguity** | Unclear meaning | Data governance, metadata |
| **Noise** | Erroneous data points | Outlier detection, cleaning |

**Example:** Customer addresses may be entered in different formats: "123 Main St", "123 Main Street", "123 Main St."
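
The inconsistency in the address example can be reduced with a normalization step. The sketch below is a minimal illustration, not a production address parser: the `SUFFIXES` map and the `normalize_address` helper are hypothetical names introduced here, and a real pipeline would rely on a fuller, locale-aware reference list.

```python
import re

# Hypothetical abbreviation map (an assumption for this example only).
SUFFIXES = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and expand common street suffixes."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    return " ".join(SUFFIXES.get(tok, tok) for tok in tokens)


variants = ["123 Main St", "123 Main Street", "123 Main St."]
canonical = {normalize_address(v) for v in variants}
print(canonical)  # all three variants collapse to a single canonical form
```

Once every variant maps to the same canonical string, records can be matched or deduplicated on that string instead of the raw input.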


### Value - the fifth V
Value is widely considered the most critical of the 5 Vs because it represents the ultimate purpose of any big data system. 

While the other Vs (Volume, Velocity, Variety, Veracity) describe the **technical challenges** of the data, Value focuses on the **actionable insights and tangible benefits** derived from it.

Value is often viewed as the result of successfully managing the other four characteristics:   
$$\text{Volume + Velocity + Variety + Veracity = Value}$$

<img src="./pic/1_sources-of-big-data.jpg" width=400>

# Big Data System

## What is a Big Data System
An integrated **framework** of software and hardware designed to **collect, store, process, and analyze datasets** that are too massive or complex for traditional database systems. 

As of 2025, these systems are essential for handling the "5 Vs"—Volume (petabytes of data), Velocity (real-time speed), Variety (text, video, sensors), Veracity (accuracy), and Value.  

Big Data -> The Asset  
Big Data System -> The Framework

### Core Architecture Layers
Modern big data systems are typically organized into several functional layers:
- **Ingestion**: The entry point where raw data is captured from sources like IoT sensors, social media, or financial transactions using tools like **Apache Kafka**.
- **Storage**: **Distributed environments**, such as **Data Lakes** (for raw data) or **Lakehouses**, that store information across many servers using systems like HDFS or cloud object storage (Amazon S3, Google Cloud Storage).
- **Processing**: Frameworks like [**Apache Spark**](https://spark.apache.org/) that transform raw data into usable formats through **parallel computing**, often handling both **real-time streams** and **historical batches**.
- **Analysis & Visualization**: The final stage where data scientists use **machine learning models or BI tools** (like Tableau) to extract patterns and present actionable insights. 

```text
┌─────────────────────────────────────────────────────────────────────────┐
│                        BIG DATA LIFECYCLE                               │
├─────────────────────────────────────────────────────────────────────────┤
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐│
│  │  INGEST  │──▶│  STORE   │──▶│ PROCESS  │──▶│ ANALYZE  │──▶│VISUALIZE││
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └─────────┘│
│  "Collect"      "Persist"      "Transform"    "Understand"   "Present"  │
│                                                                         │
│  Kafka          S3/HDFS        Spark          SQL/Python     Tableau    │
│  Kinesis        Snowflake      Flink          ML Models      Power BI   │
│  Fivetran       PostgreSQL     dbt            Statistics     Looker     │
└─────────────────────────────────────────────────────────────────────────┘
```

### Key Characteristics in 2025
- **Distributed Computing**: Instead of one large computer, big data systems use "clusters" of connected machines to **process data in parallel**.
- **AI & Machine Learning Integration**: These systems now serve as the primary training ground for large language models (LLMs) and predictive AI.
- **Scalability**: They are designed to **scale "horizontally"** by simply adding more commodity servers as data grows.
- **Fault Tolerance**: To prevent data loss, they **automatically replicate** data across multiple nodes so the system continues to work even if a server fails.

### Common Use Cases
| Industry | Big Data System Application |
|----------|-----------------------------|
| Finance | Detecting fraudulent transactions in milliseconds. |
| Healthcare | Predicting disease outbreaks and personalizing patient care. |
| Retail | Optimizing inventory and delivering hyper-personalized recommendations. |
| Manufacturing | Using sensor data for predictive maintenance to reduce equipment downtime. |


## What do Big Data Systems Solve
Big data systems solve the problem of processing and extracting value from information that is too large, complex, or fast-moving for traditional database tools. By addressing the "5 Vs" (Volume, Velocity, Variety, Veracity, and Value), these systems transform raw, unorganized data into actionable intelligence.

### Three Core Challenges

Big Data systems must solve three fundamental problems:

1. **Distributed Compute (Scaling Processing)**: split workloads across many machines    
   
    **The Problem**:  
    A single processor cannot analyze petabytes of data within a useful timeframe.   
    
    **The Solution**: 
    - Systems use parallelism to split a large job into thousands of small tasks executed simultaneously across different machines. 
    - Frameworks like **Apache Spark** also prioritize data locality, moving code to the data rather than vice versa to minimize network traffic.
2. **Distributed Storage (Scaling Volume)**: partition and replicate data     
    
    **The Problem**:   
    Data grows faster than the disk capacity of any individual server.    
    
    **The Solution**: 
    - **Distributed File Systems (DFS)** like **HDFS** partition large files into smaller blocks (shards) and scatter them across multiple nodes. 
    - This allows storage to **scale horizontally** by simply adding more commodity hardware to the cluster.
3. **Fault Tolerance (Maintaining Reliability)**: recover automatically when nodes fail  
  
   
    **The Problem**:    
    In a system with thousands of components, hardware failure is a daily occurrence, not a rarity.     
    
    **The Solution**: 
    - Systems use **replication** to store multiple copies of the same data across different nodes. 
    - If a node fails, the system automatically redirects tasks to a healthy replica and initiates automatic recovery to restore the lost copy elsewhere.
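
The three challenges can be seen together in a toy model. The sketch below is an in-process simulation under stated assumptions, not how HDFS or Spark are implemented: the node names, the round-robin replica placement, and all helper functions are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION = 3


def split_into_blocks(data, block_size):
    """Distributed storage: partition the dataset into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks):
    """Assign each block to REPLICATION distinct nodes, round-robin."""
    return {
        b: [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }


def pick_live_replica(replicas, failed):
    """Fault tolerance: fall back to any replica on a healthy node."""
    for node in replicas:
        if node not in failed:
            return node
    raise RuntimeError("all replicas lost")


data = list(range(1, 101))
blocks = split_into_blocks(data, block_size=25)
placement = place_replicas(len(blocks))
failed_nodes = {"node-a"}  # simulate one server going down


def process_block(block_id):
    """Distributed compute: each task sums one block on a live replica."""
    node = pick_live_replica(placement[block_id], failed_nodes)
    return node, sum(blocks[block_id])


with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_block, range(len(blocks))))

total = sum(partial for _, partial in partials)
print(total)
```

Even with `node-a` marked as failed, every block is still served from another replica and the partial sums combine into the full result.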

### Key Design Questions

| Question | Consideration | Common Approaches |
|----------|---------------|-------------------|
| **How to split data?** | Partitioning strategy.  Ensuring even distribution to avoid "hotspots." | **Hash partitioning** (for balance) or **Range partitioning** (for sorted access). |
| **How to schedule tasks?** | Resource allocation | YARN, Kubernetes (containerized orchestration), Mesos |
| **How to handle failures?** | Fault tolerance. Balancing cost vs. recovery speed. | Replication (fastest recovery), checkpointing, or erasure coding (cheaper storage) |
| **How to aggregate results?** | Combining partial results. Minimizing data "shuffling" over the network. | Reduce operations, shuffling. MapReduce workflows or Vectorized execution in modern engines. |
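
The two partitioning strategies in the first row can be compared side by side. This is a minimal sketch: the function names, the CRC32 hash, and the split points in `boundaries` are assumptions chosen for illustration, not the scheme any particular engine uses.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Stable hash so the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Keys below boundaries[0] go to partition 0, and so on upward."""
    return bisect.bisect_right(boundaries, key)


keys = ["alice", "bob", "carol", "dave", "mallory", "zed"]
boundaries = ["h", "p"]  # hypothetical split points: below h, h to p, p and up

hashed = {k: hash_partition(k, 3) for k in keys}
ranged = {k: range_partition(k, boundaries) for k in keys}

print(hashed)  # spread depends only on the hash, not on key order
print(ranged)  # neighbouring keys stay together, which helps sorted scans
```

Range partitioning keeps sorted neighbours together but can create hotspots when keys cluster (four of the six sample keys fall into the first range), while hashing trades sorted access for a more even spread.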

---



# Big Data Technologies
[Big data technology](https://www.datacamp.com/blog/big-data-technologies) refers to the tools and frameworks that process, store, and analyze complex and large datasets. 

To understand the Big Data Ecosystem, instead of thinking in isolated categories, think of big data as a **pipeline with layers**:

<img src="./pic/1_data-pipeline.png" width=500>


## Layer 1: Data Ingestion

Collect data from **various sources** and move it into storage systems.

There are two Modes of Ingestion:  

<img src="./pic/1_ingestion_modes.png" width=500>


### Key Technologies

| Technology | Type | Best For |
|------------|------|----------|
| **Apache Kafka** | Streaming | High-throughput event streaming |
| **AWS Kinesis** | Streaming | AWS-native streaming |
| **Debezium** | CDC (Change Data Capture) | Database change capture |
| **Fivetran** | Batch | Managed ELT connectors |
| **Airbyte** | Batch | Open-source ELT |
| **Apache NiFi** | Both | Complex data routing |

### Kafka Deep Dive

<img src="./pic/1_kafka-architecture.png" width=500>

Key Concepts:                                                
- Topics: Categories of messages                             
- Partitions: Parallel lanes for scalability                 
- Consumer Groups: Load balance consumption                  
- Retention: How long messages are kept
- Producers: Write messages to topics (for example, applications or IoT sensors)
- Consumers: Read messages from topics (for example, Spark Streaming, Flink, or an S3 archive sink)
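
How topics, partitions, and consumer groups relate can be sketched with a small in-memory model. This does not use a Kafka client and is not Kafka's implementation: `produce`, `assign`, and `partition_for` are hypothetical helpers, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keyed messages.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical topic with four partitions


def partition_for(key):
    # Stand-in hash; Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# A topic modelled as a list of append-only partition logs.
topic = [[] for _ in range(NUM_PARTITIONS)]


def produce(key, value):
    """Append the message to the partition chosen by its key."""
    topic[partition_for(key)].append((key, value))


def assign(consumers):
    """Spread partitions over the members of one consumer group."""
    assignment = defaultdict(list)
    for p in range(NUM_PARTITIONS):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


for i in range(8):
    produce(f"sensor-{i % 3}", f"reading-{i}")

group = assign(["consumer-1", "consumer-2"])
print(group)
```

Because the partition is derived from the key, all readings from one sensor land in the same partition and keep their order, while the consumer group splits the partitions so each one is read by exactly one member.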


## Layer 2: Data Storage

**Overview**:  

<img src="./pic/1_big_data_storage_landscape.png" width=600>  

Traditional databases (PostgreSQL, MySQL, MongoDB) serve as operational sources. **Some NoSQL (Cassandra, HBase) can scale to big data levels**.

**Storage Types Comparison**   

| Type | Examples | Data Format | Best For | Latency | Cost |
|------|----------|-------------|----------|---------|------|
| **OLTP Database** | PostgreSQL, MySQL | Structured | Transactions | ms | $$$ |
| **NoSQL** | MongoDB, Cassandra | Flexible | Scale, Speed | ms | $$ |
| **Data Lake** | S3, HDFS | Any | Raw storage, ML | sec | $ |
| **Data Warehouse** | Snowflake, BigQuery | Structured | Analytics | sec | $$$ |
| **Lakehouse** | Delta Lake, Iceberg | Any + ACID | Modern analytics | sec | $$ |




### Data Lakes
   
A Data Lake is a centralized repository that stores **all** your data in its **raw, native format** at any scale.  

**PURPOSE**: Store everything cheaply, process later.


```text
┌─────────────────────────────────────────────────────────────────┐
│                        DATA LAKE ZONES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   S3 Bucket: s3://company-data-lake/                            │
│   │                                                              │
│   ├── raw/                    ◄── BRONZE: Landing zone (as-is)  │
│   │   ├── sales/                                                │
│   │   │   └── 2024/01/01/transactions.json                     │
│   │   ├── logs/                                                 │
│   │   │   └── 2024/01/01/app.log.gz                            │
│   │   ├── images/                                               │
│   │   │   └── product_photos/                                   │
│   │   └── social/                                               │
│   │       └── twitter_feed.json                                 │
│   │                                                              │
│   ├── processed/              ◄── SILVER: Cleaned & validated   │
│   │   ├── sales/                                                │
│   │   │   └── daily_transactions.parquet                       │
│   │   └── customers/                                            │
│   │       └── profiles_deduplicated.parquet                    │
│   │                                                              │
│   └── curated/                ◄── GOLD: Business-ready          │
│       ├── marketing/                                            │
│       │   └── campaign_metrics.parquet                         │
│       ├── finance/                                              │
│       │   └── revenue_by_region.parquet                        │
│       └── ml_features/                                          │
│           └── customer_features.parquet                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
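
The progression through the three zones can be sketched in plain Python. The sample records and the `to_silver` / `to_gold` helpers are hypothetical; in practice each step would usually be a Spark or SQL job reading from one prefix and writing to the next.

```python
from collections import defaultdict

# Bronze: raw events kept exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "19.99"},
    {"order_id": "1", "region": "EU", "amount": "19.99"},  # duplicate
    {"order_id": "2", "region": "US", "amount": "not-a-number"},
    {"order_id": "3", "region": "US", "amount": "5.00"},
    {"order_id": "4", "region": "EU", "amount": "10.01"},
]


def to_silver(rows):
    """Silver: validate types and drop duplicates by order_id."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine the row instead
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({"order_id": int(row["order_id"]),
                      "region": row["region"], "amount": amount})
    return clean


def to_gold(rows):
    """Gold: business-ready aggregate, here revenue per region."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["region"]] += row["amount"]
    return {region: round(total, 2) for region, total in revenue.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The bronze data is never modified, so the silver and gold layers can be rebuilt from it whenever the cleaning or aggregation rules change.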



#### Concept

**Data Lake Characteristics:**
| Characteristic | Description |
|----------------|-------------|
| **Raw Data Storage** | Data stored as-is, without transformation |
| **Any Format** | JSON, CSV, Parquet, ORC, Avro, images, video |
| **Schema-on-Read** | Define structure when querying, not when writing |
| **Cheap Storage** | ~$0.02/GB/month for standard, less for infrequent |
| **Unlimited Scale** | Exabytes of data possible |
| **Decoupled Compute** | Any engine can read (Spark, Trino, Athena) |
| **Foundation for ML** | Data scientists access raw data for exploration |
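
Schema-on-read means the structure is imposed by the reader at query time. The sketch below illustrates the idea with JSON lines; the sample records, the `schema` mapping, and `read_with_schema` are assumptions made for this example rather than any engine's API.

```python
import json

# Raw lines as they might sit in the lake; nothing was validated on write.
raw_lines = [
    '{"user": "a1", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "b2", "amount": 7, "extra_field": true}',
    '{"user": "c3"}',
]

# The reader, not the writer, decides which fields matter and their types.
schema = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Project each raw record onto the schema, casting values on the fly."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(raw_lines, schema))
print(rows)
```

A different consumer could read the same files with a different schema, for example one that also extracts `ts`, without rewriting any stored data.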



#### Organization 

The layout above follows the **Medallion Architecture (Bronze/Silver/Gold)**, a popular design pattern. It is not the only way to organize a data lake: although it is widely recommended for its clear separation of data quality stages, many organizations in 2025 use hybrid, alternative, or modified frameworks depending on their needs:

| Architecture | Best For | Focus |
|--------------|----------|-------|
| Medallion (Bronze → Silver → Gold zones) | General-purpose lakehouse | Progressive data refinement & quality |
| Data Vault | Complex, highly regulated audits | Long-term history and auditability |
| Data Mesh | Large, global organizations | Domain autonomy and scalability |
| Lambda | High-velocity streaming + batch | Balancing real-time and historical accuracy |





#### File Formats in Data Lakes

| Format | Type | Compression | Splittable | Best For |
|--------|------|-------------|------------|----------|
| **CSV** | Row | Poor | Yes | Simple interchange, small data |
| **Avro** | Row | Good | Yes | Streaming, schema evolution |
| **JSON** | Semi-structured | Moderate | Yes (line-delimited) | APIs, nested data |
| **Parquet** | Columnar | Excellent | Yes | Analytics, Spark |
| **ORC** | Columnar | Excellent | Yes | Hive workloads |


**Parquet**:   
A **columnar** storage **file format** designed for efficient analytics. Think of it as a better CSV for analytical workloads.

<img src="./pic/1_parquet_file_format.webp" width=500>

*Key characteristics*

- Columnar: Data stored by column, not row—great for queries that only need specific columns
- Compressed: Built-in compression (snappy, gzip, zstd)—files are much smaller than CSV/JSON
- Schema-embedded: The file carries its own schema (types, column names)
- Splittable: Can be read in parallel chunks—important for distributed processing

*Why Parquet Dominates*    
- Analytical queries typically read a few columns across many rows, which is the access pattern columnar storage is designed for.

*Benefits*                                                     
- Read only columns needed (projection pushdown)             
- Skip row groups using statistics (predicate pushdown)      
- Excellent compression (same data types together)     
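
Projection and predicate pushdown can be illustrated with a simplified model of row groups. This is a sketch of the idea only, not the Parquet format or a Parquet reader: `make_row_group`, `scan`, and the sample rows are hypothetical.

```python
def make_row_group(rows):
    """Store each column separately and keep min/max statistics per column."""
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


rows = [{"day": d, "amount": d * 10, "note": f"order {d}"} for d in range(1, 13)]
# Three row groups of four rows each: days 1-4, 5-8, and 9-12.
row_groups = [make_row_group(rows[i:i + 4]) for i in range(0, len(rows), 4)]


def scan(row_groups, column, filter_col, min_value):
    """Return `column` values from rows where filter_col >= min_value."""
    out, groups_read = [], 0
    for group in row_groups:
        _, hi = group["stats"][filter_col]
        if hi < min_value:
            continue  # predicate pushdown: skip the whole group via stats
        groups_read += 1
        wanted = group["columns"][column]  # projection: only needed columns
        flags = group["columns"][filter_col]
        out.extend(v for v, f in zip(wanted, flags) if f >= min_value)
    return out, groups_read


amounts, groups_read = scan(row_groups, "amount", "day", min_value=9)
print(amounts, groups_read)
```

Two of the three row groups are skipped using statistics alone, and the `note` column is never touched, which is where most of the I/O savings in a columnar format come from.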

#### Data Lake Challenges
| Challenge | Description | Solution |
|-----------|-------------|----------|
| **No ACID** | Concurrent writes can corrupt | Use Lakehouse (Delta/Iceberg) |
| **Data Swamp** | Ungoverned, unusable data | Data catalog, governance |
| **No Schema** | Hard to understand data | Metadata management |
| **Performance** | Small files problem | Compaction, partitioning |



#### Foundation Storage 
The foundation storage layer is where the files physically live.   

- 2010s: HDFS was THE foundation                                      
  Data Lake = HDFS + Hive + MapReduce/Spark                            
                                                                
- 2020s: Cloud Object Storage is THE foundation                                            
  Data Lake = S3/GCS + Spark/Trino + Delta Lake/Iceberg                
                                                                
HDFS still exists but is a declining, legacy choice; most new projects use S3 or GCS.

| Aspect| HDFS| S3/GCS/Azure Blob| 
| ------| ----| ------------------| 
| Era| 2006+ (Hadoop era)| 2010s+ (Cloud era)| 
| Deployment| On-premise (deploy locally), self-managed| Cloud, fully managed| 
| Scaling| Add nodes manually| Unlimited, automatic| 
| Cost| Hardware + ops team| Pay per GB stored| 
| Ecosystem| Hadoop (MapReduce, Hive)| Everything (Spark, Snowflake, etc.)| 
| Current Trend| Declining| Dominant| 

##### HDFS (on-premise, legacy)

**What it is:** Hadoop Distributed File System, the primary storage system for Hadoop ecosystem, designed to store very large files across multiple machines.

**Architecture:**

<img src="./pic/1_HDFS-Architecture.webp" width=500>


**Key Features:**
| Feature | Description |
|---------|-------------|
| **Block Storage** | Files split into 128MB blocks (configurable) |
| **Replication** | Each block replicated **3x by default** |
| **Rack Awareness** | Places replicas on different racks for fault tolerance |
| **Write Once, Read Many** | Files are written once and only appended to, never edited in place; optimized for **sequential reads** |
| **Data Locality** | Moves compute to data, not data to compute |
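
The block size and replication defaults translate directly into block counts and raw disk usage. The arithmetic below is a rough sketch using the defaults from the table; `hdfs_footprint` is a hypothetical helper, and it ignores NameNode metadata and any compression.

```python
import math

BLOCK_SIZE_MB = 128  # default block size from the table above
REPLICATION = 3      # default replication factor from the table above


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw MB consumed across the cluster)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The final block only occupies the bytes it actually holds, so raw
    # usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

A 1,000 MB file therefore needs 8 blocks and roughly 3,000 MB of raw capacity, which is the storage overhead that erasure coding aims to reduce.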

**Use Cases:**
- Log file storage and analysis
- Data lake foundation for Hadoop ecosystem
- Batch processing with MapReduce/Spark
- Long-term archival storage

**Example Commands:**
```bash
# List files in HDFS
hdfs dfs -ls /user/data/

# Copy local file to HDFS
hdfs dfs -put localfile.csv /user/data/

# Copy from HDFS to local
hdfs dfs -get /user/data/file.csv ./local/

# Check replication factor
hdfs dfs -stat %r /user/data/myfile.csv

# Set replication factor
hdfs dfs -setrep 5 /user/data/important_file.csv

# Check disk usage
hdfs dfs -du -h /user/data/

# Delete file
hdfs dfs -rm /user/data/oldfile.csv
```



##### Cloud Object Storage (modern standard)

**Amazon S3 (Simple Storage Service)**

**What it is:** Highly scalable, durable **object storage** service that serves as the backbone for cloud data lakes.

**Structure:**
```text
S3 Organization:

 AWS Account                                      
 └── Bucket: my-company-data-lake                 
     ├── raw/                                     
     │   ├── sales/2024/01/transactions.parquet  
     │   ├── logs/2024/01/01/app.log.gz           
     │   └── images/product_001.jpg              
     ├── processed/                               
     │   └── sales/aggregated_daily.parquet      
     └── analytics/                               
         └── reports/monthly_summary.csv         

```

**Storage Classes:**
| Class | Use Case | Retrieval | Cost | Min Duration |
|-------|----------|-----------|------|--------------|
| **S3 Standard** | Frequently accessed | Instant | $$$ | None |
| **S3 Intelligent-Tiering** | Unknown patterns | Instant | $$ | None |
| **S3 Standard-IA** | Infrequent access | Instant | $$ | 30 days |
| **S3 One Zone-IA** | Infrequent, non-critical | Instant | $ | 30 days |
| **S3 Glacier Instant** | Archive, instant access | Instant | $ | 90 days |
| **S3 Glacier Flexible** | Archive | Minutes-12hrs | ¢ | 90 days |
| **S3 Glacier Deep Archive** | Long-term archive | 12-48hrs | ¢ | 180 days |

**Key Features:**
- 99.999999999% (11 9s) durability
- 99.99% availability
- Unlimited storage capacity
- Built-in versioning and lifecycle policies
- Event notifications (trigger Lambda on upload)
- Cross-region replication
- Server-side encryption



**Google Cloud Storage (GCS)**

**Storage Classes:**
| Class | Use Case | Retrieval |
|-------|----------|-----------|
| **Standard** | Frequently accessed | Instant |
| **Nearline** | Once per month | Instant |
| **Coldline** | Once per quarter | Instant |
| **Archive** | Once per year | Instant |



**Azure Blob Storage**

**Access Tiers:**
| Tier | Use Case | Retrieval |
|------|----------|-----------|
| **Hot** | Frequently accessed | Instant |
| **Cool** | Infrequent (30+ days) | Instant |
| **Cold** | Rarely accessed (90+ days) | Instant |
| **Archive** | Long-term (180+ days) | Hours |



**Cloud Storage Comparison**

| Feature | S3 (AWS) | GCS (Google) | Azure Blob |
|---------|----------|--------------|------------|
| **Durability** | 11 9s | 11 9s | 11 9s |
| **Availability** | 99.99% | 99.99% | 99.99% |
| **Min Object Size** | 0 bytes | 0 bytes | 0 bytes |
| **Max Object Size** | 5TB | 5TB | ~190.7 TiB (block blob; 4.75 TiB with older service versions) |
| **Versioning** | Yes | Yes | Yes |
| **Lifecycle Policies** | Yes | Yes | Yes |
| **Best Integration** | AWS services | BigQuery, Dataflow | Azure services |


### Data Warehouses 
A Data Warehouse is a centralized repository of **integrated, cleaned, structured data** optimized for high-performance reporting and business intelligence. Data warehouses serve as the "refined" analytical layer of a big data system.

**PURPOSE**: Fast analytics on clean, structured data  

**Characteristics**:                                       

| Feature | Description |
|---------|-------------|
| **Subject-Oriented** | Organized by business subjects (sales, customers) |
| **Integrated** | Unified from multiple source systems |
| **Time-Variant** | Maintains historical data with timestamps |
| **Non-Volatile** | Data is stable, not frequently updated |
| **Columnar Storage** | Optimized for analytical read patterns |
| **Schema-on-Write**	| Requires a strictly defined structure (schema) before data is loaded, ensuring high data quality and predictability.| 
| **Pre-computed Aggregations** | Materialized views for common queries |
| **MPP Architecture** | Uses Massively Parallel Processing to distribute queries across multiple nodes for sub-second responses on millions of rows. |
| **Decoupled Design** | Modern data warehouses (like Snowflake and BigQuery) separate storage from compute, allowing you to scale and pay for each independently. |

**Pros**:
- ✅ Speed: Optimized for complex SQL joins and pre-computed aggregations (materialized views).
- ✅ Accessibility: Easy for non-technical business users to query via standard SQL or BI tools.
- ✅ Governance: High levels of security, access control, and data lineage.

**Cons**:
- ❌ Cost: Generally more expensive than raw Data Lake storage (S3/GCS).
- ❌ Inflexibility: Struggles with unstructured data (videos, raw logs) and requires rigorous ETL (Extract, Transform, Load) pipelines to ingest data.

#### Data Warehouse vs Data Lake

| Aspect | Data Warehouse | Data Lake |
|--------|----------------|-----------|
| **Data Type** | Structured only | All types |
| **Schema** | Schema-on-write | Schema-on-read |
| **Users** | Business analysts | Data scientists |
| **Data Quality** | High (cleaned) | Variable (raw) |
| **Cost** | Higher | Lower |
| **Query Performance** | Fast (optimized) | Variable |
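
The schema-on-write side of this comparison can be sketched as a load step that rejects rows which do not match a declared table definition. `SALES_SCHEMA` and `load` are hypothetical names for this example; a real warehouse enforces the same idea through DDL and typed columns.

```python
from datetime import date

# Hypothetical warehouse table definition: column name -> required type.
SALES_SCHEMA = {"order_id": int, "amount": float, "order_date": date}


def load(table, row):
    """Append a row only if it matches the declared schema exactly."""
    if set(row) != set(SALES_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected in SALES_SCHEMA.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"{column} must be {expected.__name__}")
    table.append(row)


sales = []
load(sales, {"order_id": 1, "amount": 19.99, "order_date": date(2024, 1, 1)})

try:
    # The amount arrives as a string, so the load is refused.
    load(sales, {"order_id": 2, "amount": "19.99", "order_date": date(2024, 1, 2)})
    rejected = False
except TypeError:
    rejected = True

print(len(sales), rejected)
```

Bad rows are stopped at load time, which is why warehouse data quality is high but ingestion needs an upfront transformation step.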



#### Snowflake 
[Snowflake](https://www.linkedin.com/pulse/snowflake-architecture-overview-minzhen-yang/) is architected with **three independent layers**: 

**Cloud Service Layer**
- Authentication & Access Control               
- Infrastructure Management                     
- Metadata Management                           
- Query Parsing & Optimization

**Compute Layer**  
- Scale up/down instantly                      
- Auto-suspend when idle                       
- Multi-cluster for concurrency

**Data Storage Layer**
- Compressed **columnar** format                     
- Stored on **cloud object storage (S3/Azure/GCS)**  
- Automatic clustering and optimization          
- Pay only for storage used          

<img src="./pic/1_snowflake_layers.png" width=500>

**Snowflake Key Features:**
| Feature | Description |
|---------|-------------|
| **Separation of Storage/Compute** | Scale independently, pay for what you use |
| **Zero-Copy Cloning** | Instant copies that share the underlying storage; extra storage cost accrues only as the clone diverges |
| **Time Travel** | Query historical data (up to 90 days) |
| **Data Sharing** | Share live data across accounts securely |
| **Multi-Cloud** | Run on AWS, Azure, or GCP |
| **Auto-Scaling** | Automatically scale compute up/down |
| **Snowpipe** | Continuous data loading |



#### Amazon Redshift 

<img src="./pic/1_redshift_clusters.jpg" width=400>

**Key Features**:                                        
- Massively Parallel Processing (MPP)                
- Columnar storage with compression                  
- Redshift Spectrum: Query S3 directly               
- Concurrency Scaling: Handle traffic spikes         
- Redshift ML: In-database machine learning          
- Redshift Serverless: No cluster management  

**Leader Node**: 
- Query planning & aggregation
- Coordinates compute nodes       

**Compute Nodes**:
- Parallel processing


<img src="./pic/1_redshift.png" width=500>


#### Google BigQuery

**What it is:** Serverless, highly scalable enterprise data warehouse.

<img src="./pic/1_bigquery-architecture-diagram.svg" width=500>


**Key Features:**
| Feature | Description |
|---------|-------------|
| **Serverless** | No infrastructure to manage |
| **Pay-per-Query** | On-demand queries are billed per TiB scanned (list price raised from $5 to $6.25 per TiB in 2023; varies by region) |
| **Slots** | Reserved compute for predictable pricing |
| **BigQuery ML** | Train ML models with SQL |
| **BI Engine** | In-memory analysis for sub-second queries |
| **Streaming Insert** | Real-time data ingestion |
| **Geospatial** | Native GIS support |

<img src="./pic/1_GoogleBigQuery_Explained.jpg" width=500>

#### Data Warehouse Comparison

| Warehouse | Cloud | Key Differentiator | Best For |
|-----------|-------|-------------------|----------|
| **Snowflake** | Multi-cloud | Data sharing, separation | Multi-cloud, collaboration |
| **BigQuery** | GCP | Serverless, BigQuery ML | No-ops, ML in SQL |
| **Redshift** | AWS | Deep AWS integration | AWS-centric teams |
| **Azure Synapse** | Azure | Unified analytics | Microsoft ecosystem |
| **Databricks SQL** | Multi-cloud | Lakehouse-native | Existing Databricks users |
| **ClickHouse** | Any | Fastest for real-time | Real-time OLAP |





#### Data Modeling

Data modeling in warehouses typically uses dimensional modeling with Fact and Dimension tables.

| Table Type | Purpose | Example |
|------------|---------|---------|
| **Fact Table** | Stores measurable, quantitative data | Sales transactions, clicks, orders |
| **Dimension Table** | Stores descriptive attributes | Products, customers, dates, locations |


**Dimensional modeling** produces a dimensional model in one of three shapes:

- **Star schema**: one fact table surrounded by conformed, shared dimensions.
- **Constellation schema**: a few fact tables that share conformed dimensions.
- **Snowflake schema**: a star schema in which one or more dimensions have been normalized into sub-dimension tables ("snowflaked").

> The Snowflake schema is a modeling pattern and is unrelated to the Snowflake product discussed elsewhere in this document.
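
A star schema can be tried out with Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical sample data; the sketch only shows the shape of a fact table joined to its dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table holds measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0),
                                   (2, 11, 50.0), (2, 11, 25.0);
""")

# A typical star-schema query: aggregate the fact, slice by dimensions.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
conn.close()

print(rows)
```

Every analytical question follows the same pattern of joining the fact table to the dimensions it needs and grouping by their attributes, which is what makes the star shape easy for BI tools to query.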

**Design Methodology**:    
- [Inmon vs. Kimball](https://www.astera.com/type/blog/data-warehouse-concepts/) 
- [Dimensional modeling – architecture and methodology (including data warehouse definition)](./reading/Dimensional%20modeling_Data_Warehouse.pdf)

#### Slowly Changing Dimensions (SCD)

Slowly Changing Dimensions (SCD) are techniques for **tracking changes** in dimension data over time in **data warehouses**.

**Why SCD Matters**

  Dimension attributes change over time (e.g., customer address, employee department), and we need strategies to handle these changes while maintaining historical accuracy.

**1. SCD Type 0 — No Change (Retain Original)**

  **Strategy**: Original value is NEVER changed

  **Characteristics**:
  - No updates allowed
  - No history tracking
  - Data treated as static/immutable

  **Use Cases**:
  - Original registration date
  - Birth date
  - Social Security Number
  - Original customer ID

  **Example**:
  ```sql
  -- Customer table with Type 0 attribute
  CREATE TABLE customers (
      customer_id INT PRIMARY KEY,
      original_signup_date DATE,  -- Type 0: Never changes
      current_email VARCHAR(100)  -- Other attributes may change
  );

  -- Even if customer re-registers, original_signup_date stays the same
  ```

  ```text
  Initial State:
  customer_id | original_signup_date | current_email
  1           | 2020-01-15          | john@old.com

  After "Update":
  customer_id | original_signup_date | current_email
  1           | 2020-01-15          | john@new.com  ← email changed
              ↑ signup date UNCHANGED
  ```

**2. SCD Type 1 — Overwrite (No History)**

  **Strategy**: Update attribute in place, replacing old value

  **Characteristics**:
  - Old value is permanently lost
  - Table always reflects current state only
  - Simplest to implement
  - No storage overhead

  **Use Cases**:
  - Correcting data entry errors
  - Attributes where history doesn't matter
  - Non-critical descriptive attributes

  **Example**:
  ```sql
  -- Before update
  SELECT * FROM customers WHERE customer_id = 1;
  -- customer_id | name  | city
  -- 1           | John  | Boston

  -- Update: Customer moved to New York
  UPDATE customers SET city = 'New York' WHERE customer_id = 1;

  -- After update (history lost!)
  SELECT * FROM customers WHERE customer_id = 1;
  -- customer_id | name  | city
  -- 1           | John  | New York
  ```

  ```text
  Timeline with Type 1:
  Jan 2023: city = 'Boston'
  Jun 2023: city = 'New York' (Boston is GONE)
  Dec 2023: Query "Where did customer live in Feb 2023?" → Cannot answer!
  ```

**3. SCD Type 2 — Full History Tracking**

  **Strategy**: Insert new row for each change, maintaining complete history

  **Characteristics**:
  - Old records are NOT overwritten
  - New row inserted when data changes
  - Multiple records per business key
  - Tracks complete history with validity periods

  **Required Columns**:
  - Surrogate Key (unique per row)
  - Business Key (identifies the entity)
  - Effective Date / Start Date
  - Expiration Date / End Date
  - Current Flag (optional, for convenience)

  **Example**:
  ```sql
  -- Type 2 Dimension Table Structure
  CREATE TABLE dim_customer (
      surrogate_key INT PRIMARY KEY AUTO_INCREMENT,
      customer_id INT,              -- Business key
      name VARCHAR(100),
      city VARCHAR(100),
      effective_date DATE,
      expiration_date DATE,
      is_current BOOLEAN
  );
  ```

  ```sql
  -- Initial record (Jan 2023)
  INSERT INTO dim_customer VALUES 
  (1, 101, 'John', 'Boston', '2023-01-01', '9999-12-31', TRUE);

  -- Customer moves to New York (Jun 2023)
  -- Step 1: Close current record
  UPDATE dim_customer 
  SET expiration_date = '2023-06-14', is_current = FALSE
  WHERE customer_id = 101 AND is_current = TRUE;

  -- Step 2: Insert new record
  INSERT INTO dim_customer VALUES 
  (2, 101, 'John', 'New York', '2023-06-15', '9999-12-31', TRUE);
  ```

  **Resulting Table State**:

  | surrogate_key | customer_id | name | city     | effective_date | expiration_date | is_current |
  |--------------|------------|------|----------|----------------|-----------------|------------|
  | 1            | 101        | John | Boston   | 2023-01-01     | 2023-06-14      | FALSE      |
  | 2            | 101        | John | New York | 2023-06-15     | 9999-12-31      | TRUE       |

  **Querying Type 2 Dimensions**:
  ```sql
  -- Get current state
  SELECT * FROM dim_customer WHERE is_current = TRUE;

  -- Get state as of specific date (point-in-time query)
  SELECT * FROM dim_customer 
  WHERE customer_id = 101 
    AND '2023-03-15' BETWEEN effective_date AND expiration_date;
  -- Returns: Boston (historical state)

  -- Get complete history
  SELECT * FROM dim_customer WHERE customer_id = 101 ORDER BY effective_date;
  ```
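The close-then-insert steps above can also be wrapped in a small helper. The following is a minimal sketch that uses Python's built-in `sqlite3` module rather than a real warehouse. `apply_change` and `city_as_of` are hypothetical helper names, and dates are stored as ISO strings so that string comparison matches date order.

```python
import sqlite3
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # sentinel expiration date for the current row

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        surrogate_key   INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id     INTEGER,   -- business key
        city            TEXT,
        effective_date  TEXT,
        expiration_date TEXT,
        is_current      INTEGER
    )
""")

def apply_change(customer_id, city, change_date):
    """Close the current row (if any), then insert the new current row."""
    previous_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "UPDATE dim_customer SET expiration_date = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (previous_day, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, effective_date, expiration_date, is_current) "
            "VALUES (?, ?, ?, ?, 1)",
            (customer_id, city, change_date, OPEN_END),
        )

def city_as_of(customer_id, as_of):
    """Point-in-time lookup: which city was valid on the given date?"""
    row = conn.execute(
        "SELECT city FROM dim_customer WHERE customer_id = ? "
        "AND ? BETWEEN effective_date AND expiration_date",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None

apply_change(101, "Boston", "2023-01-01")
apply_change(101, "New York", "2023-06-15")

print(city_as_of(101, "2023-03-15"))
print(city_as_of(101, "2023-12-01"))
```

Running both statements in one transaction matters: if the insert failed after the update had committed, the customer would be left with no current row.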

**SCD Type Comparison**

  | Aspect | Type 0 | Type 1 | Type 2 |
  |--------|--------|--------|--------|
  | History | None (immutable) | None (overwritten) | Full |
  | Storage | Minimal | Minimal | Higher |
  | Complexity | Lowest | Low | High |
  | Query Performance | Fast | Fast | Slower |
  | Point-in-Time Analysis | N/A | Not possible | Fully supported |
  | Use Case | Static attributes | Current state only | Audit/History required |

**Additional SCD Types (Brief Overview)**  

  | Type | Strategy | Description |
  |------|----------|-------------|
  | Type 3 | Limited History | Add columns for previous value (e.g., `previous_city`) |
  | Type 4 | History Table | Separate current and history tables |
  | Type 6 | Hybrid | Combines Types 1, 2, and 3 |





### Data Mart

A Data Mart is a **subset of a Data Warehouse** focused on a specific business department, subject area, or team.


```text
┌──────────────────────────────────────────────────────────┐
│                ENTERPRISE DATA WAREHOUSE                 │
│  ┌─────────────────────────────────────────────────────┐ │
│  │               All Company Data                      │ │
│  │┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐│ │
│  ││  Sales  │   │ Finance │   │Marketing│   │   HR    ││ │
│  ││  Mart   │   │  Mart   │   │  Mart   │   │  Mart   ││ │
│  │└─────────┘   └─────────┘   └─────────┘   └─────────┘│ │
│  │     ↓             ↓             ↓             ↓     │ │
│  │Sales Team   Finance Team    Marketing     HR Team   │ │
│  │                                Team                 │ │
│  └─────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```

**Example: Sales Data Mart**

```text
Sales Data Mart (focuses only on sales-related data):
├── fact_sales (transactions)
├── dim_product
├── dim_customer
├── dim_store
├── dim_date
└── dim_salesperson

Excludes HR, Finance, Marketing data.
Pre-aggregated for common sales queries:
   - Daily sales by store
   - Monthly revenue by product category
   - Salesperson performance metrics
```
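One simple way to picture a mart is as a pre-aggregated table derived from warehouse tables. The sketch below assumes a tiny in-memory SQLite warehouse built with Python's `sqlite3` module. `fact_payroll` and `mart_daily_sales_by_store` are illustrative names, and real marts are usually refreshed on a schedule rather than built once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Warehouse tables (illustrative names and data).
    CREATE TABLE fact_sales   (store TEXT, sale_date TEXT, amount REAL);
    CREATE TABLE fact_payroll (employee TEXT, salary REAL);  -- HR data
    INSERT INTO fact_sales VALUES
        ('Boston', '2023-06-01', 100.0),
        ('Boston', '2023-06-01', 50.0),
        ('Austin', '2023-06-01', 70.0);
    INSERT INTO fact_payroll VALUES ('Ann', 5000.0);

    -- The "mart": a pre-aggregated table that contains only sales data.
    CREATE TABLE mart_daily_sales_by_store AS
        SELECT store, sale_date, SUM(amount) AS revenue
        FROM fact_sales
        GROUP BY store, sale_date;
""")

# The sales team queries the small mart table, never the payroll data.
daily = conn.execute(
    "SELECT store, revenue FROM mart_daily_sales_by_store ORDER BY store"
).fetchall()
print(daily)
```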

**Benefits of Data Marts**

| Benefit | Description |
|---------|-------------|
| **Performance** | Smaller dataset = faster queries |
| **Simplicity** | Reduced schema complexity for end users |
| **Security** | Department-level access control |
| **Autonomy** | Teams manage their own data |
| **Cost** | Lower compute requirements per mart |

### Lakehouse (Modern Approach)

**Purpose:** Combine lake flexibility with warehouse reliability.


| Traditional Data Lake Problems | Lakehouse Solutions            |
|--------------------------------|-------------------------------  |
| No ACID transactions           | Full ACID support               |
| No schema enforcement          | Schema evolution & enforcement  |
| Data corruption on failure     | Atomic commits                  |
| No versioning                  | Time travel (versioned data)    |
| Slow metadata operations       | Optimized metadata handling     |
| No upserts/deletes             | Full DML support                |

```text
┌───────────────────────────────────────────────────┐
│                   LAKEHOUSE                       │
├───────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────┐ │
│ │             TABLE FORMAT LAYER                │ │
│ │     (Delta Lake / Apache Iceberg / Hudi)      │ │
│ └───────────────────────────────────────────────┘ │
│                        │                          │
│                        ▼                          │
│ ┌───────────────────────────────────────────────┐ │
│ │               OBJECT STORAGE                  │ │
│ │              (S3 / GCS / ADLS)                │ │
│ └───────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────┘
```



#### Delta Lake

**What it is:** Open-source **storage layer** that brings **ACID** transactions to data lakes. Created by Databricks.

**Architecture:**

<img src="./pic/1_delta-lake.png" width=600>

```text
┌─────────────────────────────────────────────────────────────────┐
│                    DELTA LAKE ON DISK                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   s3://bucket/orders/                                           │
│   ├── _delta_log/                    ◄── Transaction log        │
│   │   ├── 00000000000000000000.json  ◄── Version 0 (create)     │
│   │   ├── 00000000000000000001.json  ◄── Version 1 (insert)     │
│   │   ├── 00000000000000000002.json  ◄── Version 2 (update)     │
│   │   └── 00000000000000000003.json  ◄── Version 3 (delete)     │
│   ├── part-00000-xxx.parquet         ◄── Data files             │
│   ├── part-00001-xxx.parquet                                    │
│   └── part-00002-xxx.parquet                                    │
│                                                                 │
│   Transaction Log Entry Example:                                │
│   {                                                             │
│     "add": {                                                    │
│       "path": "part-00003-xxx.parquet",                         │
│       "size": 1024000,                                          │
│       "modificationTime": 1704067200000,                        │
│       "dataChange": true,                                       │
│       "stats": "{\"numRecords\":10000,\"minValues\":{...}}"     │
│     }                                                           │
│   }                                                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
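To make the role of the transaction log more concrete, here is a toy replay of a commit log in plain Python. It assumes a heavily simplified log with only `add` and `remove` actions; the real Delta protocol has more action types, checkpoints, and metadata. `live_files` is a hypothetical helper, not part of any Delta Lake API.

```python
import json

# A toy commit log in the spirit of _delta_log: each commit is a list of
# JSON action lines. Simplified, hypothetical format for illustration only.
commits = [
    ['{"add": {"path": "part-0.parquet"}}'],         # version 0
    ['{"add": {"path": "part-1.parquet"}}'],         # version 1
    ['{"remove": {"path": "part-0.parquet"}}',
     '{"add": {"path": "part-2.parquet"}}'],         # version 2 (rewrite)
]

def live_files(log, version):
    """Replay commits 0..version and return the set of active data files."""
    files = set()
    for commit in log[: version + 1]:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(live_files(commits, 1)))
print(sorted(live_files(commits, 2)))
```

Reading an older version amounts to replaying fewer commits, which is the idea behind time travel. Because a reader only sees files named in committed log entries, a half-written data file from a failed job is never visible.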

**Delta Lake Code Examples:**
```python
# Write Delta table
df.write.format("delta").save("/data/events")

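# Write with partitioning (an alternative to the plain write above;
# saving twice to the same path fails under the default "errorifexists" mode)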
df.write.format("delta") \
    .partitionBy("date", "country") \
    .save("/data/events")

# Read Delta table
df = spark.read.format("delta").load("/data/events")

# Time Travel - Read specific version
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/data/events")

# Time Travel - Read as of timestamp
df_historical = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/data/events")

# MERGE (Upsert) operation
# Assumes `updates` is a DataFrame of new or changed rows with an `id` column
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/events")

deltaTable.alias("target").merge(
    updates.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set={
    "value": "source.value",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsert(values={
    "id": "source.id",
    "value": "source.value",
    "created_at": "source.created_at"
}).execute()

# Delete operation
deltaTable.delete("date < '2023-01-01'")

# Update operation
deltaTable.update(
    condition="country = 'USA'",
    set={"region": "'North America'"}
)

# Optimize (compaction)
deltaTable.optimize().executeCompaction()

# Z-Order (co-locate related data)
deltaTable.optimize().executeZOrderBy("customer_id")
```



#### Apache Iceberg

**What it is:** Open **table format** for **huge analytic datasets**. Created by Netflix.

**Key Features:**

| Feature | Description |
|---------|-------------|
| **Hidden Partitioning** | Partition values are derived from column transforms, so queries do not need to reference partition columns |
| **Schema Evolution** | Add, drop, rename columns without rewrite |
| **Partition Evolution** | Change partitioning without rewrite |
| **Time Travel** | Query historical snapshots |
| **Multi-Engine** | Works with Spark, Trino, Flink, Hive |



#### Apache Hudi

**What it is:** Data lake **storage layer** with incremental processing. Created by Uber.

**Key Features:**

| Feature | Description |
|---------|-------------|
| **Incremental Processing** | Process only changed data |
| **Upserts/Deletes** | Record-level mutations |
| **Two Table Types** | Copy-on-Write, Merge-on-Read |
| **Streaming Ingestion** | Built for real-time data |



#### Lakehouse Technologies Comparison

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---------|------------|----------------|-------------|
| **Created By** | Databricks | Netflix | Uber |
| **ACID** | Yes | Yes | Yes |
| **Time Travel** | Yes | Yes | Yes |
| **Schema Evolution** | Yes | Yes (best) | Yes |
| **Partition Evolution** | Limited | Yes (best) | Yes |
| **Streaming** | Good | Good | Best |
| **Best Engine** | Spark/Databricks | Trino, multi-engine | Spark, Flink |
| **Cloud Support** | All | All | All |



### Star/Snowflake Schema

Star and snowflake schemas are designed for **structured analytical data**. They require tables, relationships, and schema-on-write.   

Whether each storage type needs one:  
- **DATA LAKE: ✗ NO** - Not typically                                                                    
  - Raw data, any format                                        
  - Schema-on-read                                              
  - No enforced structure                                       
  - Just files: JSON, CSV, Parquet, images...                   
                                                                
    (But if you add structure → it becomes a Lakehouse)
- **DATA WAREHOUSE: ✓ YES** - Primary use case
  - Structured data                                     
  - Schema-on-write                                     
  - Designed for dimensional modeling                   
  - fact_sales, dim_customer, dim_product, etc. 
- **LAKEHOUSE: ✓ YES** - In the **Gold/Curated layer**
  - Bronze (Raw)    → No schema, just raw files                        
  - Silver (Clean)  → Some structure, cleaned data                     
  - Gold (Curated)  → Star/Snowflake schema for analytics ✓
- **STREAMING STORAGE: ✗ NO** - Different model
  - Event-based, not dimensional                                      
  - Time-series data                                                  
  - Append-only logs                                                  
  - No fact/dimension concept  
- **VECTOR STORAGE: ✗ NO** - Completely different 
  - Stores embeddings (high-dimensional vectors)                      
  - Similarity search, not SQL joins                                  
  - No relational model   


## Complete Storage Technology Comparison

| Technology | Category | Type | Data Format | Scale | Latency | Cost |
|------------|----------|------|-------------|-------|---------|------|
| **HDFS** | File System | Distributed | Any | PB | High | $$ |
| **S3** | Object Storage | Cloud | Any | EB | Medium | $ |
| **GCS** | Object Storage | Cloud | Any | EB | Medium | $ |
| **Azure Blob** | Object Storage | Cloud | Any | EB | Medium | $ |
| **PostgreSQL** | RDBMS | Relational | Structured | TB | Low | $$ |
| **MySQL** | RDBMS | Relational | Structured | TB | Low | $$ |
| **Oracle** | RDBMS | Relational | Structured | TB | Low | $$$$ |
| **MongoDB** | NoSQL | Document | JSON-like | TB | Low | $ |
| **Cassandra** | NoSQL | Wide-Column | Flexible | PB | Low | $ |
| **DynamoDB** | NoSQL | Key-Value | Flexible | PB | Very Low | $$ |
| **Redis** | NoSQL | In-Memory | Any | GB-TB | Sub-ms | $$ |
| **Neo4j** | NoSQL | Graph | Graph | TB | Low | $$ |
| **Snowflake** | Data Warehouse | Cloud DW | Structured | PB | Medium | $$ |
| **BigQuery** | Data Warehouse | Serverless DW | Structured | PB | Medium | $ |
| **Redshift** | Data Warehouse | Cloud DW | Structured | PB | Medium | $$ |
| **ClickHouse** | Data Warehouse | OLAP | Structured | PB | Very Low | $ |
| **Delta Lake** | Lakehouse | Table Format | Any + ACID | PB | Medium | $ |
| **Apache Iceberg** | Lakehouse | Table Format | Any + ACID | PB | Medium | $ |
| **Apache Hudi** | Lakehouse | Table Format | Any + ACID | PB | Medium | $ |





## Layer 3: Data Processing & Compute

```text
┌─────────────────────────────────────────────────────────────────┐
│   STORAGE ≠ COMPUTE                                             │
│   ═══════════════════                                           │
│   STORAGE = Where data lives      COMPUTE = How data transforms │
│   (Passive)                       (Active)                      │
│   ┌─────────────┐                ┌─────────────┐                │
│   │    HDFS     │                │   Spark     │                │
│   │    S3       │◄──── read ────▶│   Flink     │                │
│   │  Snowflake  │◄──── write ───▶│   dbt       │                │
│   └─────────────┘                └─────────────┘                │
│   Think of it like:                                             │
│   Storage = Hard drive / Filing cabinet                         │
│   Compute = CPU / Workers processing files                      │
└─────────────────────────────────────────────────────────────────┘
```



### Processing Paradigms: Batch vs Stream

```text
┌─────────────────────────────────────────────────────────────────┐
│                   PROCESSING PARADIGMS                           │
├──────────────────────────────┬────────────────────────────────────┤
│      BATCH PROCESSING        │      STREAM PROCESSING             │
├──────────────────────────────┼────────────────────────────────────┤
│                              │                                     │
│  Process bounded dataset     │  Process unbounded stream          │
│                              │                                     │
│  ┌──────────────────────┐    │  ────▶────▶────▶────▶────▶        │
│  │  Yesterday's Data    │    │  event event event event event    │
│  │  [████████████████]  │    │    │     │     │     │     │      │
│  └──────────┬───────────┘    │    ▼     ▼     ▼     ▼     ▼      │
│             │                │  ┌─────────────────────────────┐   │
│             ▼                │  │    Process immediately      │   │
│  ┌──────────────────────┐    │  └─────────────────────────────┘   │
│  │     Process All      │    │                                     │
│  │     At Once          │    │                                     │
│  └──────────────────────┘    │                                     │
│                              │                                     │
│  Latency: Minutes-Hours      │  Latency: Milliseconds-Seconds     │
│                              │                                     │
│  Use Cases:                  │  Use Cases:                        │
│  - Daily/weekly report       │  - Fraud detection                 │
│  - ML model training         │  - Real-time dashboards            │
│  - Historical analysis       │  - IoT monitoring                  │
│  - Data warehouse loads(ETL) │  - Live recommendations            │
│  - Historical trend analysis │  - Social media trend detection    │
│                              │                                    │
│  Tools:                      │  Tools:                            │
│  - Spark (batch mode)        │  - Spark Streaming                 │
│  - dbt                       │  - Apache Flink                    │
│  - AWS Glue                  │  - Kafka Streams                   │
│  - Hadoop MapReduce          │  - Apache Storm                    │
│                              │                                     │
└──────────────────────────────┴────────────────────────────────────┘
```

**Comparison**:   

| Aspect | Batch Processing | Streaming Processing |
|--------|------------------|----------------------|
| **Data Scope** | Large historical datasets | Individual events/micro-batches |
| **Timing** | Scheduled (hourly/daily/weekly) | Continuous (real-time) |
| **Latency** | High (minutes to hours) | Low (milliseconds to seconds) |
| **Throughput** | Very high | Moderate |
| **Complexity** | Lower | Higher (state management, ordering) |
| **Use Cases** | ETL, reporting, warehousing | Monitoring, fraud detection, IoT |

#### Batch Processing Examples

```text
Daily Sales Aggregation Pipeline:
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│Raw Sales  │───▶│ Clean &   │───▶│ Aggregate │───▶│  Daily    │
│   Logs    │    │ Transform │    │ by Region │    │  Report   │
└───────────┘    └───────────┘    └───────────┘    └───────────┘
     ↑
  Runs at midnight daily
```
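The pipeline above can be sketched in a few lines of plain Python. This is only an illustration of the batch shape (a bounded input processed in one run): the `region,amount` log format and the function names are assumptions made for the example, and a production job would run on an engine such as Spark.

```python
from collections import defaultdict

# Hypothetical raw log lines in "region,amount" form; some are malformed.
raw_sales_log = [
    "east,100.50",
    "west,20.00",
    "East ,9.50",
    "bad line",
    "west,not-a-number",
]

def clean(lines):
    """Clean & Transform: parse lines into (region, amount), skip bad records."""
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2:
            continue
        try:
            yield parts[0].strip().lower(), float(parts[1])
        except ValueError:
            continue

def aggregate_by_region(records):
    """Aggregate: total sales amount per region."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return dict(totals)

# The whole bounded dataset is processed in one scheduled run.
daily_report = aggregate_by_region(clean(raw_sales_log))
print(daily_report)
```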

#### Streaming Processing Examples

```text
Fraud Detection Pipeline:
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│Credit Card│───▶│  Feature  │───▶│   ML      │───▶│  Alert/   │
│Transaction│    │Extraction │    │Inference  │    │  Block    │
└───────────┘    └───────────┘    └───────────┘    └───────────┘
     ↑                                                    ↓
  Each transaction                               < 100ms latency
```
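A stream processor handles each event as it arrives and keeps only a small amount of state. The sketch below mimics the pipeline above in plain Python, with one loud assumption: a simple "too many transactions in a sliding window" rule stands in for the ML inference step. The window length, threshold, and event format are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # hypothetical sliding-window length
MAX_TXNS_IN_WINDOW = 3   # hypothetical threshold (stand-in for an ML model)

def detect_fraud(events):
    """Yield (card, timestamp) alerts as events arrive, one at a time."""
    recent = defaultdict(deque)  # per-card state: timestamps inside the window
    for card, ts in events:
        window = recent[card]
        window.append(ts)
        # Evict timestamps that have fallen out of the window.
        while window and ts - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_TXNS_IN_WINDOW:
            yield card, ts

# Events are (card_id, timestamp_in_seconds); an unbounded source in practice.
stream = [("A", 0), ("B", 5), ("A", 10), ("A", 20), ("A", 30), ("A", 200)]
alerts = list(detect_fraud(stream))
print(alerts)
```

Because `detect_fraud` is a generator, it never needs the full dataset. It reacts to each transaction immediately, which is what separates it from the batch example.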


#### Popular Frameworks Comparison

| Framework | Type | Strengths |
|-----------|------|-----------|
| **Apache Spark** | Batch + Micro-batch streaming | Unified API, mature ecosystem |
| **Apache Flink** | True streaming + Batch | Low latency, exactly-once semantics |
| **Apache Kafka Streams** | Streaming | Lightweight, Kafka-native |
| **Apache Storm** | Streaming | Real-time, low latency |
| **Amazon Kinesis** | Streaming | Managed, AWS integration |



### ETL vs ELT

<img src="./pic/1_ETLandELT.png" width=500>

The choice between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) comes down to where the "heavy lifting" (transformation) happens and how the data will be used.  

The fundamental difference is **where and when data is transformed**:  

- **ETL**: Data is transformed on a **secondary processing server** (**external** to the storage) before it is loaded.
- **ELT**: Raw data is loaded directly into the destination system (**data lake or cloud warehouse**), where transformations occur using that system's **internal compute power**.

```text
┌─────────────────────────────────────────────────────────────────┐
│   TRADITIONAL ETL               MODERN ELT                      │
│   ═══════════════               ══════════                      │
│   ┌────────┐                    ┌────────┐                      │
│   │ Source │                    │ Source │                      │
│   └───┬────┘                    └───┬────┘                      │
│       ▼                             ▼                           │
│   ┌────────┐                    ┌────────┐                      │
│   │Extract │                    │Extract │                      │
│   └───┬────┘                    │   &    │                      │
│       │                         │ Load   │ (Fivetran, Airbyte)  │
│       ▼                         └───┬────┘                      │
│   ┌─────────┐                       │                           │
│   │Transform│ ◄── External          │                           │
│   │ Server  │                       ▼                           │
│   └───┬─────┘                  ┌─────────────┐                  │
│       │                        │   Data      │                  │
│       ▼                        │ Warehouse   │                  │
│   ┌────────┐                   │             │                  │
│   │ Load   │                   │ ┌─────────┐ │                  │
│   └───┬────┘                   │ │Transform│ │◄── dbt           │
│       │                        │ │ (SQL)   │ │                  │
│       ▼                        │ └─────────┘ │                  │
│   ┌────────┐                   └─────────────┘                  │
│   │  DW    │                                                    │
│   └────────┘                                                    │
│   Pros: Control                Pros: Simpler, faster            │
│   Cons: Complex, slow          Cons: Warehouse compute cost     │
└─────────────────────────────────────────────────────────────────┘
```

#### Detailed Comparison

| Aspect | ETL | ELT |
|--------|-----|-----|
| **Processing Location** | External ETL server | Inside target data warehouse |
| **Data Movement** | Raw → ETL Tool → DW | Raw → DW (Staging) → DW (Final) |
| **Best For** | Structured data, smaller volumes | Large volumes, cloud DWs |
| **Compute Cost** | Dedicated ETL infrastructure | Uses DW compute power |
| **Flexibility** | Rigid, pre-defined transformations | Flexible, can re-transform |
| **Historical Data** | Only transformed data kept | Raw data preserved |
| **Speed** | Slower (extra hop) | Faster (direct load) |
| **Tools** | Informatica, Talend, SSIS | dbt, Snowflake, BigQuery |

#### When to Use Each

**Use ETL when:**
- Data needs extensive cleaning before loading
- Target system has limited compute
- Sensitive data must be masked before storage
- Working with on-premise legacy systems

**Use ELT when:**
- Using cloud data warehouses (Snowflake, BigQuery, Redshift)
- Need to preserve raw data
- Transformations may change over time
- High data volumes require parallel processing

#### Modern ELT Example with dbt

```sql
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        ROUND(amount, 2) AS amount,
        LOWER(status) AS status
    FROM source
    WHERE order_id IS NOT NULL
)

SELECT * FROM cleaned
```
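The same load-first, transform-later flow can be tried locally. The sketch below uses an in-memory SQLite database (Python's `sqlite3` module) as a stand-in for a cloud warehouse: raw rows are loaded untouched, and the cleaning logic runs afterwards as SQL inside the database. The sample rows are invented, and the view only loosely mirrors the dbt model above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched in a raw table.
conn.execute(
    "CREATE TABLE raw_orders (order_id, customer_id, order_date, amount, status)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 19.999, "SHIPPED"),
        (None, 11, "2024-01-06", 5.0, "NEW"),      # filtered out by the transform
        (2, 12, "2024-01-07", 7.5, "Cancelled"),
    ],
)

# Transform: runs inside the database, after loading, using its own compute.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT order_id,
           customer_id,
           DATE(order_date) AS order_date,
           ROUND(amount, 2) AS amount,
           LOWER(status)    AS status
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

staged = conn.execute(
    "SELECT order_id, amount, status FROM stg_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(staged)
print(raw_count)
```

The raw table still holds all three rows, so the transformation can be changed and re-run later without re-extracting from the source. That is the flexibility the comparison table attributes to ELT.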


### MapReduce - Batch Processing Only

- MapReduce is a programming model and distributed execution framework for processing large datasets in parallel across a cluster, with intermediate results written to disk.
- Introduced by Google (2004)
- Implemented in Hadoop (2006)
- Not suited to real-time or streaming workloads

<img src="./pic/1_map-reduce-mode.png" width=500>

**Detailed Phases**:  

| Phase | Function | Description |
|-------|----------|-------------|
| **Split** | Divide input | Break large file into chunks (typically 64-128MB) |
| **Map** | Transform | Apply function to each record, emit key-value pairs |
| **Shuffle** | Redistribute | Group all values by key across the cluster |
| **Sort** | Order | Sort intermediate data by key |
| **Reduce** | Aggregate | Combine values for each key into final result |

Classic Example: Word Count   


<img src="./pic/1_mapreduce_eg_word_count.png" width=500>
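
The phases in the table above can be imitated on a single machine. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names and the three sample splits are assumptions for illustration, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_and_sort(pairs):
    """Shuffle + Sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a final count."""
    return key, sum(values)

# Split: each list element stands in for one input chunk.
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = chain.from_iterable(map_phase(s) for s in splits)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
print(word_counts)
```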

#### Automatic Handling

MapReduce automatically manages:
- **Partitioning:** Distributes data across nodes
- **Retries:** Re-executes failed tasks
- **Shuffle:** Moves intermediate data between map and reduce
- **Scheduling:** Assigns tasks to available workers

#### Limitations of MapReduce

| Limitation | Explanation | Impact |
|------------|-------------|--------|
| **Disk-heavy** | Writes intermediate results to disk | Slow I/O operations |
| **Poor for Iterative Jobs** | Must read/write disk between iterations | ML algorithms suffer |
| **High Latency** | Batch-oriented, not real-time | Minutes to hours for results |
| **Hard to Debug** | Distributed nature complicates debugging | Development overhead |
| **Rigid Model** | Only Map and Reduce operations | Complex logic is awkward |




### Apache Spark - The Swiss Army Knife

Spark is a **unified analytics engine** that **overcomes MapReduce limitations** through in-memory processing.

#### Spark Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                      APACHE SPARK                               │
│              "Unified Analytics Engine"                         │
├─────────────────────────────────────────────────────────────────┤
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    SPARK CORE                           │   │
│   │            (Distributed Compute Engine)                 │   │
│   │                                                         │   │
│   │   • In-memory processing (up to 100x vs. MapReduce)     │   │
│   │   • DAG execution (optimizes query plans)               │   │
│   │   • Fault tolerance via lineage                         │   │
│   │   • Runs on: YARN, Kubernetes, Standalone, Mesos        │   │
│   └─────────────────────────────────────────────────────────┘   │
│        ┌────────────────────┼────────────────────┐              │
│        ▼                    ▼                    ▼              │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐          │
│   │Spark SQL │        │  MLlib   │        │Structured│          │
│   │          │        │          │        │Streaming │          │
│   │DataFrames│        │  ML at   │        │Real-time │          │
│   │& SQL     │        │  Scale   │        │Processing│          │
│   └──────────┘        └──────────┘        └──────────┘          │
│   │                   │                   │                     │
│   │ Analytics         │ Mining            │ Streaming           │
│   │ Layer             │ Layer             │ Layer               │
│                                                                 │
│   READS FROM:               WRITES TO:                          │
│   • HDFS                    • HDFS                              │
│   • S3/GCS/Azure            • S3/GCS/Azure                      │
│   • JDBC databases          • JDBC databases                    │
│   • Kafka                   • Kafka                             │
│   • Delta Lake/Iceberg      • Delta Lake/Iceberg                │
│   • Cassandra, HBase        • Data Warehouses                   │
│                                                                 │
│   ════════════════════════════════════════════════════════════  │
│   SPARK IS COMPUTE, NOT STORAGE!                                │
│   ════════════════════════════════════════════════════════════  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

```text
┌─────────────────────────────────────────────────────────────┐
│                      SPARK APPLICATION                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐                                           │
│  │    Driver    │ ◄── SparkContext, DAG Scheduler           │
│  │   Program    │                                           │
│  └──────┬───────┘                                           │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────┐                                           │
│  │   Cluster    │ ◄── YARN / Mesos / Kubernetes / Standalone│
│  │   Manager    │                                           │
│  └──────┬───────┘                                           │
│         │                                                   │
│    ┌────┴────┬────────┬────────┐                            │
│    ▼         ▼        ▼        ▼                            │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                         │
│ │Worker│ │Worker│ │Worker│ │Worker│                         │
│ │Node 1│ │Node 2│ │Node 3│ │Node N│                         │
│ └──────┘ └──────┘ └──────┘ └──────┘                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
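The Spark Core box above lists DAG execution and fault tolerance via lineage. The toy class below gives a feel for the idea in plain Python: transformations are only recorded, nothing runs until an action is called, and a lost result can be rebuilt by replaying the recorded steps. `LazyDataset` is a hypothetical stand-in, and Spark's real engine (partitions, stages, optimizers) is far more elaborate.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, lineage=()):
        self._source = source     # callable that (re-)reads the input
        self._lineage = lineage   # ordered tuple of (kind, function) steps

    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        """Action: replay the whole lineage starting from the source."""
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

reads = []  # records every time the source is actually read

def read_source():
    reads.append(1)
    return range(1, 6)

pipeline = LazyDataset(read_source).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_before_action = len(reads)   # transformations alone trigger no work
result = pipeline.collect()
print(reads_before_action, result)
```

Calling `collect()` a second time re-reads the source and recomputes the same result. That is the essence of lineage-based recovery: no replica of the intermediate data is needed, only the recipe.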

#### Spark APIs Comparison

| API | Abstraction Level | Type Safety | Use Case |
|-----|-------------------|-------------|----------|
| **RDD** | Low-level | Compile-time | Fine-grained control |
| **DataFrame** | High-level | Runtime | SQL-like operations |
| **Dataset** | High-level | Compile-time | Type-safe structured data |
| **Spark SQL** | Highest | Runtime | SQL queries on structured data |

#### Key Advantages

| Feature | Description | Benefit |
|---------|-------------|---------|
| **In-memory Processing** | Keeps data in **RAM** between operations | Up to 100x faster than MapReduce |
| **DAG Execution** | Directed Acyclic Graph optimizer | Efficient query planning |
| **Rich APIs** | RDD, DataFrame, Dataset, SQL | Flexible programming models |
| **Unified Engine** | Batch + Streaming in one framework | Simplified architecture |
| **Built-in Fault Tolerance** | Lineage-based recovery | Automatic failure handling |



**Example - Word Count in Spark (Python):**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the file as a DataFrame with one row per line (column name: "value")
text_df = spark.read.text("input.txt")

# Split each line on runs of whitespace, emit one row per word,
# drop empty tokens, then count occurrences of each word
word_counts = (
    text_df
    .select(explode(split(col("value"), r"\s+")).alias("word"))
    .filter(col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(col("count").desc())
)

word_counts.show()
```
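
The Key Advantages table lists lineage-based recovery. The sketch below is a plain-Python toy, not Spark code: `LazyDataset` is a hypothetical class that only records transformations and replays them from the source when an action runs. It is meant to show why a lost result can be rebuilt without replicating the data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded chain of steps."""
    source: Tuple[int, ...]
    lineage: Tuple[Tuple[str, Callable], ...] = ()

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformations only extend the lineage; nothing runs yet.
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self.source, self.lineage + (("filter", fn),))

    def collect(self) -> List[int]:
        # An action replays the lineage from the source. Replaying it again
        # after a result was lost yields the same values, which is the idea
        # behind lineage-based recovery.
        data: Iterable = self.source
        for kind, fn in self.lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


numbers = LazyDataset(tuple(range(1, 6)))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first_run = evens_squared.collect()
recomputed = evens_squared.collect()  # e.g. after a cached copy was lost
print(first_run, recomputed)
```

Real Spark tracks lineage per partition and recomputes only the lost partitions; this toy replays the whole dataset for simplicity.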



### Apache Flink - True Streaming

```text
    ┌─────────────────────────────────────────────────────────────────┐
    │                      APACHE FLINK                               │
    │              "Stateful Stream Processing"                       │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Flink processes data as STREAMS first (batch = bounded stream) │
    │                                                                 │
    │   ┌─────────────────────────────────────────────────────────┐   │
    │   │                    DATA STREAM                          │   │
    │   │                                                         │   │
    │   │  ──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶──▶       │   │
    │   │  event event event event event event event event        │   │
    │   │                                                         │   │
    │   └───────────────────────┬─────────────────────────────────┘   │
    │                           ▼                                     │
    │   ┌─────────────────────────────────────────────────────────┐   │
    │   │                  FLINK CLUSTER                          │   │
    │   │                                                         │   │
    │   │  ┌──────────┐    ┌──────────┐    ┌──────────┐           │   │
    │   │  │   Job    │    │   Task   │    │   Task   │           │   │
    │   │  │ Manager  │───▶│ Manager  │    │ Manager  │           │   │
    │   │  │(Master)  │    │(Worker)  │    │(Worker)  │           │   │
    │   │  └──────────┘    └──────────┘    └──────────┘           │   │
    │   │                                                         │   │
    │   │  Features:                                              │   │
    │   │  • Exactly-once semantics                               │   │
    │   │  • Event time processing                                │   │
    │   │  • Stateful computations                                │   │
    │   │  • Low latency (milliseconds)                           │   │
    │   │  • High throughput                                      │   │
    │   │                                                         │   │
    │   └─────────────────────────────────────────────────────────┘   │
    └─────────────────────────────────────────────────────────────────┘
```
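
Two of the features in the diagram, event time processing and stateful computations, can be illustrated without a Flink cluster. The following is a plain-Python sketch, not the Flink API: the window size, the event tuples, and the function name are assumptions made for the example. Each event is assigned to a tumbling window by its own timestamp, so out-of-order arrival does not change the counts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 10  # assumed window size for this illustration


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]],
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using each event's own timestamp."""
    state: Dict[Tuple[int, str], int] = defaultdict(int)  # keyed state
    for event_time, key in events:
        # Event time, not arrival order, decides which window an event joins
        window_start = event_time - (event_time % WINDOW_SECONDS)
        state[(window_start, key)] += 1
    return dict(state)


# (event_time_seconds, user) pairs, arriving out of order
events = [
    (1, "alice"), (12, "bob"), (3, "alice"),
    (15, "bob"), (9, "bob"), (11, "alice"),
]
counts = tumbling_window_counts(events)
for (window_start, user), n in sorted(counts.items()):
    print(f"window [{window_start}, {window_start + WINDOW_SECONDS}) {user}: {n}")
```

A real Flink job also uses watermarks to decide when a window can be closed and checkpoints its state for exactly-once recovery; both are left out here.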



### dbt (Data Build Tool)

```text
┌─────────────────────────────────────────────────────────────────┐
│                           dbt                                   │
│              "Transform Data in Warehouse"                      │
├─────────────────────────────────────────────────────────────────┤
│  dbt = SQL-based transformation framework                       │
│  Runs INSIDE your data warehouse (ELT, not ETL)                 │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                  dbt Project                            │   │
│   │                                                         │   │
│   │   models/                                               │   │
│   │   ├── staging/           ◄── Clean raw data             │   │
│   │   │   ├── stg_orders.sql                                │   │
│   │   │   └── stg_customers.sql                             │   │
│   │   ├── intermediate/      ◄── Business logic             │   │
│   │   │   └── int_orders_enriched.sql                       │   │
│   │   └── marts/             ◄── Final tables               │   │
│   │       ├── dim_customers.sql                             │   │
│   │       └── fct_orders.sql                                │   │
│   │                                                         │   │
│   │   tests/                 ◄── Data quality               │   │
│   │   └── assert_positive_amounts.sql                       │   │
│   │                                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│   Example dbt model (stg_orders.sql):                           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  SELECT                                                 │   │
│   │      order_id,                                          │   │
│   │      customer_id,                                       │   │
│   │      CAST(order_date AS DATE) AS order_date,            │   │
│   │      ROUND(amount, 2) AS amount,                        │   │
│   │      LOWER(status) AS status                            │   │
│   │  FROM {{ source('raw', 'orders') }}                     │   │
│   │  WHERE order_id IS NOT NULL                             │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
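
To see what the `stg_orders.sql` model does to raw rows, the sketch below runs an equivalent query against an in-memory SQLite database from Python. Several things here are assumptions for illustration only: the `raw_orders` table name and sample rows, the use of SQLite's `DATE()` in place of `CAST(... AS DATE)`, and the body of the `assert_positive_amounts` check, which the diagram names but does not show. It relies on dbt's convention that a singular test selects the failing rows, so zero rows means the test passes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, "
    "order_date TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05 09:30:00", 19.999, "SHIPPED"),
        (2, 11, "2024-01-06 14:00:00", 5.5, "Pending"),
        (None, 12, "2024-01-07 08:00:00", 7.0, "shipped"),  # dropped by WHERE
    ],
)

# Same shape as stg_orders.sql, with the source() reference replaced by a
# plain table name and SQLite's DATE() standing in for CAST(... AS DATE).
conn.execute(
    """
    CREATE VIEW stg_orders AS
    SELECT
        order_id,
        customer_id,
        DATE(order_date) AS order_date,
        ROUND(amount, 2) AS amount,
        LOWER(status) AS status
    FROM raw_orders
    WHERE order_id IS NOT NULL
    """
)

staged = conn.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()

# Singular-test style check: the query selects rows that violate the rule,
# so an empty result means the check passes.
failing_rows = conn.execute(
    "SELECT order_id FROM stg_orders WHERE amount <= 0"
).fetchall()

print(staged)
print("assert_positive_amounts:", "pass" if not failing_rows else "fail")
```

In a real project dbt compiles `{{ source('raw', 'orders') }}` to the warehouse table and materializes the model for you; only the SQL logic carries over from this sketch.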



### Processing Technology Comparison

| Technology | Type | Latency | Throughput | Best For |
|------------|------|---------|------------|----------|
| **MapReduce** | Batch | Very High | High | Legacy Hadoop jobs |
| **Apache Spark** | Batch + Micro-batch | Medium | Very High | General purpose, ML |
| **Apache Flink** | True Streaming | Very Low | High | Real-time, event-driven |
| **dbt** | Batch (SQL) | Medium | Medium | SQL transformations |
| **Kafka Streams** | Streaming | Low | High | Kafka-native apps |
| **AWS Glue** | Batch | Medium | High | Serverless ETL on AWS |
| **Dataflow** | Batch + Streaming | Low | High | GCP-native |
| **Presto/Trino** | Interactive | Low | Medium | Ad-hoc queries |



## Layer 4: Data Mining & Machine Learning

### Where Mining Fits

```text
┌─────────────────────────────────────────────────────────────────┐
│                                                                  │
│   Data Mining = Finding patterns you didn't know existed        │
│   Analytics   = Answering questions you already have            │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DATA                                  │   │
│   └─────────────────────────┬───────────────────────────────┘   │
│                             │                                    │
│           ┌─────────────────┴─────────────────┐                 │
│           │                                   │                 │
│           ▼                                   ▼                 │
│   ┌───────────────┐                   ┌───────────────┐         │
│   │  DATA MINING  │                   │   ANALYTICS   │         │
│   │               │                   │               │         │
│   │ "What patterns│                   │ "How much did │         │
│   │  exist in     │                   │  we sell last │         │
│   │  customer     │                   │  month?"      │         │
│   │  behavior?"   │                   │               │         │
│   │               │                   │ (Known        │         │
│   │ (Unknown      │                   │  questions)   │         │
│   │  discoveries) │                   │               │         │
│   └───────────────┘                   └───────────────┘         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### Mining Techniques Overview

```text
┌─────────────────────────────────────────────────────────────────┐
│                    DATA MINING TECHNIQUES                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   SUPERVISED LEARNING            UNSUPERVISED LEARNING          │
│   (Have labels/answers)          (No labels)                    │
│                                                                  │
│   ┌─────────────────┐            ┌─────────────────┐            │
│   │ CLASSIFICATION  │            │   CLUSTERING    │            │
│   │                 │            │                 │            │
│   │ Predict category│            │ Find groups     │            │
│   │                 │            │                 │            │
│   │ • Spam/Not spam │            │ • Customer      │            │
│   │ • Fraud/Legit   │            │   segments      │            │
│   │ • Churn/Stay    │            │ • Similar docs  │            │
│   │ • Disease/No    │            │ • Anomalies     │            │
│   │                 │            │                 │            │
│   │ Algorithms:     │            │ Algorithms:     │            │
│   │ • Random Forest │            │ • K-Means       │            │
│   │ • XGBoost       │            │ • DBSCAN        │            │
│   │ • Neural Nets   │            │ • Hierarchical  │            │
│   └─────────────────┘            └─────────────────┘            │
│                                                                  │
│   ┌─────────────────┐            ┌─────────────────┐            │
│   │   REGRESSION    │            │  ASSOCIATION    │            │
│   │                 │            │                 │            │
│   │ Predict number  │            │ Find rules      │            │
│   │                 │            │                 │            │
│   │ • House price   │            │ • Market basket │            │
│   │ • Sales forecast│            │ • "Customers    │            │
│   │ • CLV           │            │   who bought X  │            │
│   │ • Stock price   │            │   also buy Y"   │            │
│   │                 │            │                 │            │
│   │ Algorithms:     │            │ Algorithms:     │            │
│   │ • Linear Reg    │            │ • Apriori       │            │
│   │ • Gradient Boost│            │ • FP-Growth     │            │
│   │ • Neural Nets   │            │                 │            │
│   └─────────────────┘            └─────────────────┘            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
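
The association box ("customers who bought X also buy Y") rests on a few simple ratios. The sketch below computes support, confidence, and lift by brute force over item pairs; the baskets and the 0.4 minimum-support threshold are invented for the example. Apriori and FP-Growth exist to avoid this exhaustive enumeration on large data.

```python
from itertools import combinations
from typing import FrozenSet, List

# Hypothetical transactions, invented for this illustration
baskets: List[FrozenSet[str]] = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "jam"}),
    frozenset({"milk", "butter"}),
    frozenset({"bread", "butter", "jam"}),
]


def support(itemset: FrozenSet[str]) -> float:
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str]) -> float:
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


items = sorted(set().union(*baskets))
for x, y in combinations(items, 2):
    pair = frozenset({x, y})
    if support(pair) >= 0.4:  # assumed minimum-support threshold
        conf = confidence(frozenset({x}), frozenset({y}))
        # Lift above 1 means the pair co-occurs more often than chance predicts
        lift = conf / support(frozenset({y}))
        print(f"{x} -> {y}: support={support(pair):.2f} "
              f"confidence={conf:.2f} lift={lift:.2f}")
```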

### ML/Mining Tools at Scale

| Tool | Scale | Best For | Integration |
|------|-------|----------|-------------|
| **Scikit-learn** | Single machine (GB) | Prototyping, small data | Python |
| **Spark MLlib** | Distributed (PB) | Large-scale ML | Spark ecosystem |
| **TensorFlow** | Single/Distributed | Deep learning | TFX, Vertex AI |
| **PyTorch** | Single/Distributed | Deep learning, research | TorchServe |
| **XGBoost** | Single/Distributed | Tabular data, competitions | Spark, Dask |
| **Amazon SageMaker** | Managed | End-to-end ML | AWS |
| **Vertex AI** | Managed | End-to-end ML | GCP |
| **Databricks ML** | Distributed | Unified ML platform | Spark, MLflow |

### Example: Customer Segmentation with Spark MLlib

```python
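from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

# Reuse the active session, or create one when running as a standalone script
spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()

# Load data from warehouse (STORAGE)
customers = spark.read.table("analytics.customer_features")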

# Prepare features
assembler = VectorAssembler(
    inputCols=["recency", "frequency", "monetary"],
    outputCol="features_raw"
)

scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features"
)

# K-Means clustering (MINING)
kmeans = KMeans(k=4, seed=42, featuresCol="features")

# Build pipeline
pipeline = Pipeline(stages=[assembler, scaler, kmeans])

# Fit model
model = pipeline.fit(customers)

# Get predictions
segmented = model.transform(customers)

# Analyze segments
segmented.groupBy("prediction").agg(
    F.avg("recency").alias("avg_recency"),
    F.avg("frequency").alias("avg_frequency"),
    F.avg("monetary").alias("avg_monetary"),
    F.count("*").alias("count")
).show()

# Save results back to warehouse (STORAGE)
# mode("overwrite") lets the job be rerun without failing on an existing table
segmented.write.mode("overwrite").saveAsTable("analytics.customer_segments")
```
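
The pipeline scales the features before clustering because K-Means relies on Euclidean distance, so an unscaled column with a large range (here `monetary`) would dominate the result. The plain-Python sketch below shows the effect of dividing each column by its sample standard deviation, which is what Spark's `StandardScaler` does with its default settings (`withStd=True`, `withMean=False`). The customer rows are invented for the example.

```python
from statistics import stdev
from typing import Dict, List

# Hypothetical RFM rows, invented for this illustration
customers: List[Dict[str, float]] = [
    {"recency": 5.0, "frequency": 12.0, "monetary": 2400.0},
    {"recency": 40.0, "frequency": 2.0, "monetary": 150.0},
    {"recency": 10.0, "frequency": 8.0, "monetary": 900.0},
    {"recency": 90.0, "frequency": 1.0, "monetary": 60.0},
]
features = ["recency", "frequency", "monetary"]

# Sample standard deviation of each column
scale = {name: stdev([row[name] for row in customers]) for name in features}

# Divide each value by its column's standard deviation (no centering)
scaled = [
    {name: row[name] / scale[name] for name in features} for row in customers
]

for name in features:
    scaled_std = stdev([row[name] for row in scaled])
    print(f"{name}: raw std={scale[name]:.1f}, scaled std={scaled_std:.2f}")
```

After scaling, every column has unit standard deviation, so each feature contributes on a comparable scale to the distance calculation.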



## Layer 5: Data Analytics

### Analytics Maturity Levels

```text
┌─────────────────────────────────────────────────────────────────┐
│                    ANALYTICS SPECTRUM                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   VALUE & COMPLEXITY                                            │
│        ▲                                                         │
│        │                                    ┌───────────────┐   │
│        │                             ┌─────▶│ PRESCRIPTIVE  │   │
│        │                             │      │ "What should  │   │
│        │                    ┌────────┴──┐   │  we do?"      │   │
│        │             ┌─────▶│ PREDICTIVE │   │               │   │
│        │             │      │ "What will │   │ • Optimization│   │
│        │    ┌────────┴──┐   │  happen?"  │   │ • Simulation  │   │
│        │ ┌─▶│DIAGNOSTIC │   │            │   │ • Recommend   │   │
│        │ │  │ "Why did  │   │ • ML Models│   └───────────────┘   │
│   ┌────┴─┴┐ │ it happen?"│   │ • Forecast │                      │
│   │DESCRIP│ │            │   └────────────┘                      │
│   │-TIVE  │ │ • Drill-down│                                       │
│   │"What  │ │ • Root cause│                                       │
│   │happened│ └────────────┘                                       │
│   │?"     │                                                       │
│   │       │                                                       │
│   │• Reports                                                      │
│   │• Dashboards                                                  │
│   │• KPIs                                                        │
│   └───────┘                                                       │
│   └───────────────────────────────────────────────────────▶     │
│                            TIME                                  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
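
The four levels can be made concrete on a tiny dataset. The sketch below is only an illustration: the sales records are invented, the forecast is a deliberately naive average of past changes, and the "prescriptive" step is a one-line rule rather than real optimization.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (month, region, revenue) records, invented for illustration
sales = [
    ("2024-01", "north", 120.0), ("2024-01", "south", 80.0),
    ("2024-02", "north", 125.0), ("2024-02", "south", 45.0),
    ("2024-03", "north", 130.0), ("2024-03", "south", 40.0),
]

# Descriptive: what happened? Total revenue per month.
monthly = defaultdict(float)
for month, _region, revenue in sales:
    monthly[month] += revenue


# Diagnostic: why did it happen? Break the change between two months down by region.
def change_by_region(before: str, after: str) -> dict:
    delta = defaultdict(float)
    for month, region, revenue in sales:
        if month == after:
            delta[region] += revenue
        elif month == before:
            delta[region] -= revenue
    return dict(delta)


drop = change_by_region("2024-01", "2024-02")

# Predictive: what will happen? Naive forecast from the average monthly change.
totals = [monthly[m] for m in sorted(monthly)]
avg_change = mean([b - a for a, b in zip(totals, totals[1:])])
forecast = totals[-1] + avg_change

# Prescriptive: what should we do? Focus on the region with the largest decline.
focus_region = min(drop, key=drop.get)

print(dict(monthly), drop, forecast, focus_region)
```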

### Analytics Tools

| Type | Tools | Use Case |
|------|-------|----------|
| **Descriptive** | SQL, Excel, Tableau | Reports, dashboards |
| **Diagnostic** | SQL, Python | Root cause analysis |
| **Predictive** | Python, Spark MLlib | Forecasting, ML |
| **Prescriptive** | OR tools, simulation | Optimization |



## Layer 6: Data Visualization & Consumption

### How Data Reaches End Users

```text
┌─────────────────────────────────────────────────────────────────┐
│                   DATA CONSUMPTION METHODS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DATA WAREHOUSE                        │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│       ┌───────────────────────┼───────────────────────┐         │
│       │                       │                       │         │
│       ▼                       ▼                       ▼         │
│   ┌───────────┐         ┌───────────┐         ┌───────────┐    │
│   │DASHBOARDS │         │  REPORTS  │         │   APIs    │    │
│   │           │         │           │         │           │    │
│   │ Interactive│        │ Scheduled │         │Programmatic│   │
│   │ exploration│        │ delivery  │         │ access    │    │
│   │           │         │           │         │           │    │
│   │ Tableau   │         │ PDF/Email │         │ REST/     │    │
│   │ Power BI  │         │ Exports   │         │ GraphQL   │    │
│   │ Looker    │         │           │         │           │    │
│   └─────┬─────┘         └─────┬─────┘         └─────┬─────┘    │
│         │                     │                     │           │
│         ▼                     ▼                     ▼           │
│   ┌───────────┐         ┌───────────┐         ┌───────────┐    │
│   │ Analysts  │         │ Executives│         │Applications│   │
│   │ Explore   │         │ Consume   │         │ Integrate  │    │
│   └───────────┘         └───────────┘         └───────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### Visualization Tools

| Tool | Type | Best For | Users |
|------|------|----------|-------|
| **Tableau** | Enterprise BI | Beautiful visuals | Analysts |
| **Power BI** | Enterprise BI | Microsoft ecosystem | Business users |
| **Looker** | Semantic layer | Governed metrics | Data teams |
| **Metabase** | Open-source | Self-service | Everyone |
| **Grafana** | Monitoring | Time-series, real-time | DevOps |
| **Apache Superset** | Open-source | SQL-based exploration | Analysts |



## Layer 7: Orchestration & Governance

### Orchestration

```text
┌─────────────────────────────────────────────────────────────────┐
│                    APACHE AIRFLOW DAG                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Daily Sales Pipeline:                                         │
│                                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                 │
│   │ Extract  │───▶│Transform │───▶│  Load    │                 │
│   │ from DB  │    │ (Spark)  │    │   (DW)   │                 │
│   └──────────┘    └──────────┘    └────┬─────┘                 │
│                                        │                        │
│                           ┌────────────┼────────────┐          │
│                           │            │            │          │
│                           ▼            ▼            ▼          │
│                    ┌──────────┐ ┌──────────┐ ┌──────────┐      │
│                    │  Update  │ │  Train   │ │  Send    │      │
│                    │Dashboard │ │ ML Model │ │  Alert   │      │
│                    └──────────┘ └──────────┘ └──────────┘      │
│                                                                  │
│   Schedule: Daily at 2:00 AM                                    │
│   Retries: 3 with exponential backoff                          │
│   Alerts: On failure → PagerDuty                               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
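
What an orchestrator does with the diagram above can be sketched with the standard library alone. This is not Airflow code: the task names are hypothetical identifiers derived from the boxes, and the 60-second base delay is an assumption. The sketch derives a valid run order from the dependencies and lists the wait times for three retries with exponential backoff.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on (its upstream tasks)
dependencies = {
    "extract_from_db": set(),
    "transform_spark": {"extract_from_db"},
    "load_dw": {"transform_spark"},
    "update_dashboard": {"load_dw"},
    "train_ml_model": {"load_dw"},
    "send_alert": {"load_dw"},
}

# A valid run order: every task appears after all of its upstream tasks
run_order = list(TopologicalSorter(dependencies).static_order())


def retry_delays(retries: int = 3, base_seconds: float = 60.0) -> list:
    """Exponential backoff: each retry waits twice as long as the previous one."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


print(run_order)
print(retry_delays())
```

The three downstream tasks have no dependencies on each other, so an orchestrator is free to run them in parallel once the load step succeeds.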

| Tool | Best For |
|------|----------|
| **Apache Airflow** | Complex pipelines, Python DAGs |
| **Dagster** | Modern, asset-centric |
| **Prefect** | Python-native, cloud-first |
| **dbt Cloud** | SQL transformations |

### Data Governance

| Component | Purpose | Tools |
|-----------|---------|-------|
| **Data Catalog** | What data exists? | DataHub, Atlan, Glue Catalog |
| **Data Quality** | Is data accurate? | Great Expectations, dbt tests |
| **Data Lineage** | Where did data come from? | OpenLineage, Marquez |
| **Data Security** | Who can access what? | IAM, column-level security |

---



## Tech Summary

```text
┌─────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Big Data is a PIPELINE: Ingest → Store → Process → Analyze │
│                                                                  │
│  2. Storage ≠ Compute                                           │
│     • HDFS, S3 = Storage                                        │
│     • Spark, Flink = Compute                                    │
│                                                                  │
│  3. Modern Trends:                                              │
│     • Lakehouse (Delta Lake, Iceberg)                          │
│     • ELT over ETL (transform in warehouse)                    │
│     • SQL-first (dbt + Snowflake)                              │
│     • Managed services (less ops)                              │
│                                                                  │
│  4. Choose based on:                                            │
│     • Scale (GB vs TB vs PB)                                    │
│     • Latency (batch vs real-time)                             │
│     • Team skills (SQL vs Python)                              │
│     • Cloud strategy (single vs multi)                         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```


### All Technologies Mapping


| Layer | Open Source | AWS | GCP | Azure |
|-------|-------------|-----|-----|-------|
| **Ingestion** | Kafka, Airbyte | Kinesis, DMS | Pub/Sub, Dataflow | Event Hubs |
| **Lake Storage** | HDFS, MinIO | S3 | GCS | ADLS |
| **Databases** | PostgreSQL, Cassandra | RDS, DynamoDB | Cloud SQL, Bigtable | Cosmos DB |
| **Warehouse** | ClickHouse | Redshift | BigQuery | Synapse |
| **Lakehouse** | Delta, Iceberg | Lake Formation | BigLake | Delta on Azure |
| **Processing** | Spark, Flink | EMR, Glue | Dataproc, Dataflow | HDInsight |
| **ML** | MLflow, Kubeflow | SageMaker | Vertex AI | Azure ML |
| **Visualization** | Superset, Metabase | QuickSight | Looker | Power BI |
| **Orchestration** | Airflow, Dagster | MWAA, Step Functions | Composer | Data Factory |
| **Governance** | DataHub, Atlas | Glue Catalog | Data Catalog | Purview |




### Architecture Patterns

#### Modern Data Stack (Most Companies)

```text
Fivetran → Snowflake → dbt → Tableau
```

#### Lakehouse (Data-Intensive)

```text
Kafka → S3 + Delta Lake → Spark → Databricks SQL → BI Tools
```

#### Real-Time Analytics

```text
Kafka → Flink → ClickHouse → Grafana
```





# Cloud Platforms & AWS Services

## Major Cloud Providers

| Provider | Strengths | Data Services |
|----------|-----------|---------------|
| **AWS** | Largest market share, most services | S3, Redshift, EMR, Glue, Athena |
| **Azure** | Enterprise integration, Microsoft ecosystem | Blob Storage, Synapse, Azure Databricks, Data Factory |
| **Google Cloud** | BigQuery, AI/ML leadership | GCS, BigQuery, Dataflow, Dataproc |

## AWS Data Services

<img src="./pic/1_aws_overview.jpg">

### Key AWS Services for Big Data

| Service | Category | Purpose |
|---------|----------|---------|
| **S3** | Storage | Object storage, data lake foundation |
| **Redshift** | Data Warehouse | Petabyte-scale columnar analytics |
| **EMR** | Processing | Managed Spark, Hadoop, Presto clusters |
| **Athena** | Query | Serverless SQL queries on S3 |
| **Glue** | ETL | Serverless data integration |
| **Kinesis** | Streaming | Real-time data ingestion |
| **DynamoDB** | Database | NoSQL for OLTP workloads |
| **RDS/Aurora** | Database | Managed relational databases |
| **QuickSight** | BI | Dashboards and visualization |
| **SageMaker** | ML | Machine learning platform |

### AWS Service Categories

**Compute:**

| Service | Type | Use Case |
|---------|------|----------|
| EC2 | Virtual Machines | Custom workloads, legacy apps |
| Lambda | Serverless | Event-driven, short tasks |
| ECS/EKS | Containers | Microservices, Kubernetes |

**Databases:**

| Service | Type | Use Case |
|---------|------|----------|
| RDS | Relational | Traditional OLTP |
| Aurora | Relational | High-performance MySQL/PostgreSQL |
| DynamoDB | NoSQL | Key-value, high scale |
| DocumentDB | NoSQL | MongoDB-compatible |
| ElastiCache | Cache | Redis/Memcached |

**Analytics:**

| Service | Type | Use Case |
|---------|------|----------|
| Redshift | Data Warehouse | BI, complex analytics |
| Athena | Serverless Query | Ad-hoc S3 queries |
| OpenSearch | Search | Log analytics, full-text search |
| QuickSight | BI Tool | Dashboards, visualization |

---



# Summary: Big Data Ecosystem

```text
┌─────────────────────────────────────────────────────────────────────┐
│                      BIG DATA ECOSYSTEM                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   SOURCES              INGESTION           STORAGE                  │
│   ────────             ─────────           ───────                  │
│   Databases            Kinesis             S3 (Data Lake)           │
│   Applications         Kafka               HDFS                     │
│   IoT/Sensors          Flume               Cloud Storage            │
│   Logs                 NiFi                                         │
│                                                                      │
│   PROCESSING           TRANSFORMATION      SERVING                  │
│   ──────────           ──────────────      ───────                  │
│   Spark (Batch)        dbt                 Redshift (OLAP)          │
│   Flink (Stream)       Glue                PostgreSQL (OLTP)        │
│   EMR                  Dataflow            Elasticsearch            │
│                                                                      │
│   ORCHESTRATION        VISUALIZATION       GOVERNANCE               │
│   ─────────────        ─────────────       ──────────               │
│   Airflow              Tableau             Glue Catalog             │
│   Step Functions       Power BI            Atlas                    │
│   Prefect              Looker              DataHub                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---



# Quick Reference Card

## The 4 Vs
- **Volume:** Terabytes to Petabytes
- **Velocity:** Batch to Real-time
- **Variety:** Structured, Semi-structured, Unstructured
- **Veracity:** Data quality and reliability

## Key Comparisons

| Batch | Streaming |
|-------|-----------|
| High latency | Low latency |
| Historical data | Real-time events |
| Scheduled | Continuous |

---


| OLTP | OLAP |
|------|------|
| Transactions | Analytics |
| Row-based | Column-based |
| Current data | Historical data |

---

| ETL | ELT |
|-----|-----|
| Transform first | Load first |
| External processing | In-warehouse processing |
| On-premises | Cloud-native |

---

| Star Schema | Snowflake Schema |
|-------------|------------------|
| Denormalized | Normalized |
| Faster queries | Less storage |
| Simpler | Complex |

