# Table of Contents
- [Data Warehouse - Overview & Architecture](#data-warehouse--overview--architecture)
- [Database vs Data Lake vs Data Warehouse](#database-vs-data-lake-vs-data-warehouse--comparison-table)


# Data Warehouse — Overview & Architecture

## 📌 Key Characteristics of a Data Warehouse

![Data Warehouse](./images/data_warehouse.png)

- **Subject-Oriented** 🧠  
  Organizes and stores data around key business domains such as Customers, Products, Sales, Finance.  
  → Data is modeled to support **decision-making**, not transactions.

- **Integrated** 🔗  
  Combines data from multiple, heterogeneous sources into a **consistent schema**.

- **Non-Volatile** 📄  
  Once loaded, data is treated as **read-only**: it is not routinely updated or deleted by day-to-day operations.  
  → Preserves snapshots for historical analysis.

- **Time-Variant** 🕰  
  Stores both **current and historical** data, unlike OLTP systems, which typically keep only the current state.  
  → Enables trend analysis over time.
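
The non-volatile and time-variant properties can be illustrated with an append-only snapshot table. This is only a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the `customer_snapshot` table, its columns, and the helper names are hypothetical and not taken from any specific product.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot ("
    " snapshot_date TEXT, customer_id INTEGER, city TEXT)"
)


def load_snapshot(conn, snapshot_date, rows):
    """Append one dated snapshot; earlier rows are never modified."""
    conn.executemany(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        [(snapshot_date, customer_id, city) for customer_id, city in rows],
    )


def city_as_of(conn, customer_id, as_of):
    """Return the customer's city as known on the given ISO date, or None."""
    row = conn.execute(
        "SELECT city FROM customer_snapshot"
        " WHERE customer_id = ? AND snapshot_date <= ?"
        " ORDER BY snapshot_date DESC LIMIT 1",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None


# Two loads: the second one adds a new version instead of overwriting the first.
load_snapshot(conn, "2024-01-01", [(1, "Berlin")])
load_snapshot(conn, "2024-06-01", [(1, "Munich")])
```

Because each load only appends rows, a query such as `city_as_of(conn, 1, "2024-03-15")` can still see the value that was current on that date.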

---

## 🧱 Data Warehouse–Centric Architecture (ETL Pattern)

![Data Warehouse ETL Architecture](./images/data_warehouse-centric_architecture.png)

1. **Extract**  
   Pull data from various operational sources — databases, APIs, files, etc.

2. **Transform**  
   Clean, standardize, and model the data in a **staging area**.

3. **Load**  
   Push transformed data into the **Data Warehouse**, which holds the integrated, enterprise-wide schema.

4. **Data Marts** 🏪  
   - Department-specific subsets (e.g. Sales, Marketing, Finance).  
   - Often follow **simplified or denormalized schemas**.  
   - Improve query performance for specific use cases.

5. **Analytics & Reports** 📊  
   BI tools and analysts use the Data Marts & Warehouse for dashboards and decision-making.
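
The steps above can be sketched end to end in a few lines. This is a toy illustration, not a production pipeline: the raw rows, the `fact_orders` table, and the `mart_sales` view are hypothetical names, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw rows, standing in for what an extract step might pull.
RAW_ORDERS = [
    {"order_id": 1, "customer": " alice ", "amount": "19.50", "dept": "sales"},
    {"order_id": 2, "customer": "BOB", "amount": "5.00", "dept": "Sales"},
    {"order_id": 3, "customer": "carol", "amount": None, "dept": "marketing"},
]


def extract():
    """Extract: read rows from the (simulated) operational source."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: drop incomplete rows and standardize names, amounts, departments."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # skip rows that cannot be priced
        cleaned.append(
            (
                row["order_id"],
                row["customer"].strip().title(),
                float(row["amount"]),
                row["dept"].lower(),
            )
        )
    return cleaned


def load(conn, rows):
    """Load: write cleaned rows into the warehouse table and define a data mart."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        " order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, dept TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    # A data mart modeled here as a department-specific view over the warehouse.
    conn.execute(
        "CREATE VIEW IF NOT EXISTS mart_sales AS"
        " SELECT customer, SUM(amount) AS total"
        " FROM fact_orders WHERE dept = 'sales' GROUP BY customer"
    )


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

# Analytics step: a report reads from the mart rather than from the raw source.
sales_mart = warehouse.execute(
    "SELECT customer, total FROM mart_sales ORDER BY customer"
).fetchall()
```

Modeling the mart as a view is just one option chosen to keep the sketch short; marts are also often materialized as separate tables.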

---

## 🔄 Change Data Capture (CDC)

![Change Data Capture](./images/data_warehouse-centric_archi_cdc.png)

- Instead of extracting the **entire dataset** every time,  
  **CDC tracks only changes** (inserts, updates, deletes) in the source systems.

- Reduces load on production OLTP databases ✅  
- Keeps the Data Warehouse **incrementally in sync** with source systems.  
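- Commonly implemented by reading the source database's transaction log, by triggers, or by timestamp-based queries, with the captured changes then fed into ETL pipelines.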
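
As a rough illustration of the idea, the sketch below applies a stream of change events to a target copy and remembers how far it has read, so that a later run only processes newer events. The event format (`seq`, `op`, `key`, `row`) and the function name are assumptions made for this example; real CDC tools define their own formats.

```python
def apply_changes(target, change_log, last_applied_seq):
    """Apply events newer than last_applied_seq; return the new high-water mark."""
    for event in sorted(change_log, key=lambda e: e["seq"]):
        if event["seq"] <= last_applied_seq:
            continue  # already applied in an earlier run
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:
            # Inserts and updates are both handled as an upsert.
            target[event["key"]] = event["row"]
        last_applied_seq = event["seq"]
    return last_applied_seq


# Hypothetical change events captured from a source system.
change_log = [
    {"seq": 1, "op": "insert", "key": 101, "row": {"name": "Alice", "city": "Berlin"}},
    {"seq": 2, "op": "insert", "key": 102, "row": {"name": "Bob", "city": "Paris"}},
    {"seq": 3, "op": "update", "key": 101, "row": {"name": "Alice", "city": "Munich"}},
    {"seq": 4, "op": "delete", "key": 102, "row": None},
]

replica = {}
# First sync sees only the first two events.
watermark = apply_changes(replica, change_log[:2], last_applied_seq=0)
# Second sync receives the full log but skips what was already applied.
watermark = apply_changes(replica, change_log, last_applied_seq=watermark)
```

Only events 3 and 4 are processed in the second call, which is the point of CDC: the work is proportional to what changed, not to the size of the dataset.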

---

## ⚡ Evolution of Data Warehouse Implementations

![Data Warehouse Implementation](./images/data_warehouse_implementations.png)

| Era | Architecture | Key Features |
|-----|-------------|--------------|
| **Early DW** 🧱 | Monolithic servers | Limited performance, single big machine |
| **MPP DW** ⚙️ | Massively Parallel Processing | Distributes queries across multiple nodes, scans data in parallel |
| **Modern Cloud DW** ☁️ | Cloud-native managed services (e.g. Snowflake, BigQuery, Redshift) | Separates **compute & storage**, scales elastically, cost-efficient |

### ✨ Modern Cloud Data Warehouses
- **Amazon Redshift**  
- **Google BigQuery**  
- **Snowflake**

✅ **Key Advantage**:  
- Compute and storage are **independent**, allowing cost-effective scaling.  
- Ideal for analytical workloads on very large datasets.
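
The MPP idea from the table above (split the data, scan the pieces in parallel, merge the partial results) can be sketched as a scatter-gather aggregation. This is a conceptual toy: the `SALES` rows and function names are hypothetical, and Python threads are used only to show the shape of the computation, not to demonstrate a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (region, amount) rows standing in for a large fact table.
SALES = [
    ("north", 100), ("south", 50), ("north", 25),
    ("east", 75), ("south", 10), ("east", 5),
]


def scan_partition(partition):
    """Work done by one 'node': aggregate only its own slice of the data."""
    partial = Counter()
    for region, amount in partition:
        partial[region] += amount
    return partial


def parallel_sum_by_region(rows, node_count=3):
    """Scatter rows across nodes, scan in parallel, then merge partial sums."""
    # Round-robin distribution: node i gets rows i, i + node_count, ...
    partitions = [rows[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(scan_partition, partitions))
    # Gather step: the coordinator adds up the per-node results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


totals = parallel_sum_by_region(SALES)
```

The result does not depend on how many nodes are used, because sums can be computed per partition and combined afterwards.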

---

## 📝 Recap

- OLTP systems are designed for **transactions**, not heavy analytics.  
- Data Warehouses were introduced to:
  - Consolidate data from multiple sources
  - Provide **historical context**
  - Enable **analytical queries** efficiently
- **ETL pipelines** → clean & load data into a **centralized warehouse**  
- **Data Marts** → business function–specific views for easier analysis  
- **CDC** → keeps data up to date without full reloads  
- **Modern DWs** leverage **cloud + MPP** for scale & performance.

---

# Database vs Data Lake vs Data Warehouse — Comparison Table

| Feature 🧠                         | Database (OLTP) 🧾                          | Data Lake 🌊                                      | Data Warehouse 🏢                                                   |
|-------------------------------------|---------------------------------------------|--------------------------------------------------|---------------------------------------------------------------------|
| **Primary Purpose**                | Handle **day-to-day transactions**         | Store **large volumes of raw / semi-structured data** cheaply | **Analytics, reporting & decision-making**                         |
| **Schema**                          | Strict, normalized                         | Flexible or **schema-on-read**                   | Structured, **modeled** (e.g. Star / Snowflake schema)              |
| **Data Type**                       | Mostly structured                          | Structured, semi-structured, unstructured        | Structured, cleaned & integrated                                   |
| **Data Processing**                | OLTP (row-based inserts/updates)          | Batch or streaming (big data frameworks)        | OLAP (columnar scans, aggregations, joins)                         |
| **Storage Layer**                  | Disk + RAM (local)                         | Disk / Object storage (e.g., S3, ADLS, HDFS)     | Disk-based (often columnar format on cloud object storage)         |
| **RAM Usage**                      | High — for fast concurrent writes & lookups | Low for storage; RAM needed only during Spark/compute jobs | Moderate; used mainly during **query execution**, not for storage |
| **Query Patterns**                 | Fast reads/writes on individual records   | Heavy ETL/ELT, ML workloads, large scans        | Analytical queries, aggregations, dashboards                       |
| **Performance Optimization**       | Indexes, B-trees, in-memory caching       | Parallel processing, schema-on-read            | Columnar storage, MPP (Massively Parallel Processing), caching     |
| **Cost**                            | Expensive to scale for big analytics       | Cheap storage, variable compute cost            | Cost-effective analytics, separates **compute & storage** in modern cloud DWs |
| **Data Freshness**                 | Real-time (transactions)                  | Near real-time or batch                        | Batch or near real-time (depending on ETL/CDC setup)               |
| **Examples**                        | MySQL, PostgreSQL, Oracle, SQL Server     | S3, ADLS, HDFS, Delta Lake                      | Snowflake, BigQuery, Redshift, Synapse                             |
| **Best For**                        | Transactional apps, operational systems   | Data exploration, staging, ML feature stores   | Business intelligence, dashboards, historical analysis             |

---

## ✨ Key Insights

- 🧾 **Databases** are great for **fast, reliable transactions** but are **not optimized for analytical queries** over large datasets.  
- 🌊 **Data Lakes** are great for **raw, large-scale data storage** at low cost, but **query performance depends on the compute engine** (e.g., Spark).  
- 🏢 **Data Warehouses** sit in between: they combine structured modeling, columnar storage, and parallel processing to support **high-performance analytics**.
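
To make the row-based versus columnar distinction concrete, the sketch below stores the same hypothetical orders in both layouts. It is a simplified model using plain Python lists and dicts; real engines add compression, indexes, and on-disk formats that are not shown here.

```python
# Row-oriented layout: one record per order, as an OLTP database might keep it.
ROWS = [
    {"order_id": 1, "region": "north", "amount": 100.0},
    {"order_id": 2, "region": "south", "amount": 50.0},
    {"order_id": 3, "region": "north", "amount": 25.0},
]


def to_columns(rows):
    """Pivot row records into one list per column (column-oriented layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)


def lookup_order(rows, order_id):
    """Point lookup: natural in a row store, where a record sits together."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


def total_amount_row_store(rows):
    """Aggregation over rows touches every full record to read one field."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Aggregation over columns reads only the 'amount' column."""
    return sum(columns["amount"])
```

Both aggregation functions return the same total; the difference is how much data each one has to read, which is why analytical engines tend to favor the columnar layout.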

---

## ✅ Final One-liner

> “A **database** is built for transactions, a **data lake** is built for storage, and a **data warehouse** is built for analytics.”