# Table of Contents 
- [Storage Hierarchy](#storage-hierarchy)
- [Storage Raw Ingredients & Their Role](#storage-raw-ingredients--their-role)
- [Networking, Serialization & Compression](#networking-serialization--compression)
    - [Compression Algorithms (Coursera Supplement)](https://www.coursera.org/learn/data-storage-and-queries/supplement/ebNtg/optional-compression-algorithms)
- [Cloud Storage Options](#cloud-storage-options)
- [Cloud Storage Tiers](#storage-tiers-in-the-cloud)
- [Distributed Storage Systems](#distributed-storage-systems)
    - [Database Partitioning/sharding methods](https://www.coursera.org/learn/data-storage-and-queries/supplement/wxN0S/optional-database-partitioning-sharding-methods)

# Storage Hierarchy

Storage in data engineering can be thought of as a **hierarchy with three layers**:

---

## 1. Raw Ingredients  
The foundation of all storage systems.  
- **Physical Components**:  
  - Magnetic disks (HDD)  
  - Solid-state drives (SSD)  
  - Memory (RAM)  

- **Processes**:  
  - Networking  
  - Serialization  
  - Compression  
  - CPU utilization  
  - Caching  

---

## 2. Storage Systems  
Built from raw ingredients. Provide management and interaction with data.  
- Databases (OLTP & OLAP)  
- Distributed storage systems  
- Specialized systems:  
  - Graph databases  
  - Vector databases  
- Caching systems  

---

## 3. Storage Abstractions  
High-level assemblies of storage systems.  
- Data warehouses  
- Data lakes  
- Data lakehouses  

---

## Visual Representation
![Storage Hierarchy](./images/storage_hierarchy.png)

# Storage Raw Ingredients & Their Role

Storage systems are built on raw ingredients that balance **cost, speed, and durability**. Here’s what each does:

---

## 🗄 Persistent Storage
- **Magnetic Disk (HDD)** → Good for bulk storage where cost matters more than speed.  
- **Solid-State Drive (SSD)** → Faster access, great for transactional (OLTP) workloads.

![Performance Comparison](./images/performance_1.png)

---

## ⚡ Improving Performance
- **Distributed Storage** → Splits data across machines to improve read/write throughput.  
- **Partitioning** → Divides SSDs into chunks so multiple controllers can work in parallel.

![Improving Performance](./images/performace_2.png)

---

## 🔋 Volatile Memory
- **RAM** → Temporary workspace for fast processing, caching, and indexing.  
- **CPU Cache** → Ultra-fast memory right on the processor, keeps frequently used data close.

![Volatile Memory Ingredients](./images/performace_3.png)
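
Caching in RAM can be sketched with Python's built-in `functools.lru_cache`. This is a minimal illustration only: `slow_lookup` is a hypothetical stand-in for a read from disk or a remote store, and the `calls` counter exists just to show when the slow path actually runs.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the "slow" path actually runs


@lru_cache(maxsize=128)
def slow_lookup(key: str) -> str:
    """Hypothetical stand-in for a read from disk or a remote store."""
    calls["count"] += 1
    return key.upper()


slow_lookup("user:1")  # miss: runs the function body
slow_lookup("user:1")  # hit: served from the in-memory cache
slow_lookup("user:2")  # miss: new key

info = slow_lookup.cache_info()
print(info.hits, info.misses, calls["count"])
```

The repeated key never reaches the slow path a second time, which is the effect a cache layer aims for.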

# Networking, Serialization & Compression

When working with storage systems, data doesn’t stay in one place: it moves between **memory, disk, and networks**. Several processes make that movement efficient.  

---

## 🌐 Networking & CPU
Networking connects multiple servers in distributed systems.  
It helps with:  
- Faster reads/writes  
- Data durability  
- High availability  

![Networking](./images/networking.png)

---

## 📦 Serialization
To write in-memory data to **disk** or send it across a **network**, it must first be transformed.  
- **Serialization** → Convert in-memory structures into a format that can be stored or transmitted  
- **Deserialization** → Rebuild the original in-memory structure from that format  

![Serialization](./images/serialize.png)
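
A minimal sketch of the round trip, using two serializers from the Python standard library. The `record` dictionary is made up for illustration; `json` produces a text format, while `pickle` produces a Python-specific binary format.

```python
import json
import pickle

record = {"id": 42, "name": "Ada", "tags": ["admin", "beta"]}

# Serialize: in-memory dict -> bytes suitable for disk or network.
as_json = json.dumps(record).encode("utf-8")  # human-readable text format
as_pickle = pickle.dumps(record)              # Python-specific binary format

# Deserialize: bytes -> an equivalent in-memory structure.
from_json = json.loads(as_json.decode("utf-8"))
from_pickle = pickle.loads(as_pickle)

print(from_json == record, from_pickle == record)
```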

---

## 📊 Serialization Approaches
- **Row-based** → Store data record by record (good for transactions).  
- **Column-based** → Store data column by column (good for analytics).  

![Serialization Example](./images/serialization.png)
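
The difference between the two layouts can be sketched with plain Python structures. The order data below is invented for illustration; real row and column stores lay out bytes on disk, but the access patterns are similar in spirit.

```python
# Row-based layout: one dict per record.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]

# Column-based layout: one list per column, built from the same data.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Transactional access: fetch one whole record (natural in the row layout).
second_order = rows[1]

# Analytical access: aggregate a single column (natural in the column layout).
total_amount = sum(columns["amount"])

print(second_order, total_amount)
```

The aggregate touches only the `amount` list, which is why column layouts tend to suit analytics.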

---

## 📝 Serialization Formats
You can choose different file formats for serialized data:  

- **Human-readable**:  
  - CSV → Simple, row-based, but error-prone (because there is no schema)  
  - XML → Legacy, slow to process  
  - JSON → Widely used for APIs  

- **Binary**:  
  - Parquet → Column-based, efficient for big data  
  - Avro → Row-based, schema-driven, supports evolution  

![Serialization Formats](./images/serialization_formats.png)
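
One way to see why a schema-less format is error-prone: a CSV round trip with the standard library returns every value as a string, while JSON keeps basic number types. The `records` list is illustrative; Parquet and Avro need third-party libraries and are not shown here.

```python
import csv
import io
import json

records = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]

# CSV round trip: the file carries no schema, so values come back as strings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# JSON round trip: numbers keep their type.
json_rows = json.loads(json.dumps(records))

print(csv_rows[0])
print(json_rows[0])
```

Any code reading the CSV has to guess or hard-code the types, which is where mistakes tend to creep in.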

---

## 🗜 Compression
Once serialized, data can be **compressed** before saving to disk.  
- Reduces disk space  
- Speeds up queries  
- Cuts down I/O time  

![Compression Flow](./images/compressin_1.png)
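
A minimal sketch of compressing serialized bytes with the standard-library `zlib` module. The repeated log line is made-up sample data; highly repetitive input like this compresses far better than typical real data.

```python
import zlib

# Repetitive data, e.g. the same log line written many times.
raw = b"2024-01-01 INFO request ok\n" * 1000

compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))  # compressed size is much smaller here
```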

Compression works by removing **redundancy and repetition** in data.  
For example, frequently used characters get shorter encodings.  

![Compression Example](./images/compression.png)
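
The "shorter encodings for frequent characters" idea can be sketched with Huffman coding, one classic technique of this kind. The notes do not name a specific algorithm, so treat this as an illustrative choice; the function name and sample text are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code where frequent characters get shorter bit strings."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # edge case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {char: code_so_far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "aaaaaaaabbbc"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(codes, encoded_bits, "bits vs", len(text) * 8, "bits uncompressed")
```

Here `a` is the most frequent character and ends up with the shortest code.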

---

## 🔑 Flow Summary
1. **Networking/CPU** → Move and coordinate data across servers.  
2. **Serialization** → Convert memory data into storable/transmittable format.  
3. **Serialization Formats** → Choose a format: CSV/JSON/XML for readability, Parquet/Avro for efficiency.  
4. **Compression** → Save space and improve performance before writing to disk.  

# Cloud Storage Options

As a data engineer, you have three main cloud storage options: **File Storage**, **Block Storage**, and **Object Storage**.  
Each has different strengths and trade-offs depending on your workload.

---

## 📂 File Storage
Organizes files in a **hierarchical directory tree** (like folders on your laptop).  
Each directory stores metadata such as name, owner, permissions, and location pointers.

- ✅ Easy to use and manage  
- ✅ Great for sharing files across users and machines  
- ❌ Lower read/write performance (must track hierarchy)  
- ❌ Limited scalability compared to other options  

![File Storage](./images/file_storage.png)  
![File Storage Use Case](./images/file_storage_usecase.png)

---

## 📦 Block Storage
Divides data into **small fixed-size blocks**, each with a unique ID.  
You can directly update or retrieve specific blocks without rewriting the entire file.

- ✅ High performance & low latency → ideal for **transactional workloads (OLTP)**  
- ✅ Supports frequent small reads/writes  
- ✅ Used in **databases, VMs (e.g., Amazon EBS)**  
- ❌ Limited scalability (usually a few TBs)  
- ❌ Tied closely to compute instances (not independent like object storage)  

![Block Storage](./images/block_storage.png)  
![Block Storage Lookup](./images/block_storage_2.png)  
![Block Storage Use Case](./images/block_storage_usecase.png)
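
A toy sketch of the block idea: data is split into fixed-size blocks keyed by ID, and a single block can be replaced without rewriting the rest. The 4-byte block size and helper names are assumptions chosen to keep the example readable; real block devices use much larger blocks.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def split_into_blocks(data: bytes, size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Split data into fixed-size blocks keyed by block ID."""
    return {i // size: data[i:i + size] for i in range(0, len(data), size)}


def read_all(blocks: dict[int, bytes]) -> bytes:
    """Reassemble the blocks in ID order."""
    return b"".join(blocks[i] for i in sorted(blocks))


blocks = split_into_blocks(b"AAAABBBBCCCC")
blocks[1] = b"XXXX"  # update one block in place; the others are untouched

print(read_all(blocks))
```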

---

## 🗃️ Object Storage
Stores data as **immutable objects** in a flat structure with unique keys.  
Objects are replicated across nodes and scale almost infinitely.

- ✅ Extremely scalable (petabytes and beyond)  
- ✅ Great for **analytics (OLAP), data lakes, ML pipelines**  
- ✅ High durability via replication  
- ❌ Not good for transactional workloads (must rewrite entire object to update)  
- ❌ Higher latency for frequent small updates  

![Object Storage](./images/object_storage.png)  
![Object Storage Use Cases](./images/object_storage_use_Cases.png)
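
A toy in-memory sketch of the object model: a flat key space, and updates that replace the whole object. `ObjectStore` is a hypothetical class written for this example, not the API of any real service.

```python
class ObjectStore:
    """Toy flat key -> immutable object store (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # bytes are immutable; a "change" always stores a whole new object.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()
key = "logs/2024/01/app.log"  # "/" is just part of the key, not a real folder
store.put(key, b"line 1\n")

# To append one line, read the whole object and write a full replacement.
current = store.get(key)
store.put(key, current + b"line 2\n")

print(store.get(key))
```

The append costs a full read plus a full write, which is why frequent small updates fit this model poorly.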

---

## ⚖️ Trade-offs at a Glance

![Storage Options](./images/storage_option.png)

- **File Storage** → simple sharing & management, but not for performance-heavy tasks  
- **Block Storage** → best for **transactional systems** with frequent updates  
- **Object Storage** → best for **analytics & big data** with massive scalability  

👉 Rule of Thumb:  
- Use **File Storage** for team collaboration and shared access  
- Use **Block Storage** for **databases/VMs**  
- Use **Object Storage** for **data lakes, warehouses, ML workloads**  

# Storage Tiers in the Cloud

Cloud storage tiers are based on **how often data is accessed** and the trade-off between **cost vs speed**.

- Hot storage → fast but expensive.  
- Cold storage → cheap but slow.  
- Warm storage → balance between the two.  
- Best practice: **combine tiers** depending on workload needs.

![Storage Tiers](./images/storage_tiers.png)
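
Tier selection by access frequency can be sketched as a simple rule. The function name and the thresholds below are purely illustrative assumptions; real cut-offs depend on the provider's pricing and your latency needs.

```python
def choose_tier(reads_per_month: int) -> str:
    """Pick a storage tier from access frequency (illustrative thresholds)."""
    if reads_per_month >= 100:
        return "hot"   # accessed often: pay more for fast retrieval
    if reads_per_month >= 1:
        return "warm"  # accessed occasionally: balance cost and speed
    return "cold"      # rarely accessed: cheapest storage, slow retrieval


for reads in (500, 5, 0):
    print(reads, "->", choose_tier(reads))
```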

---

## ☁️ AWS Storage Tier Examples

Amazon S3 provides multiple storage options mapped to these tiers.  
You choose based on access frequency and cost requirements.

![AWS Storage Tiers](./images/aws_storage_tiers.png)

# Distributed Storage Systems

Distributed storage splits and replicates data across **multiple servers (nodes)**.  
Nodes form a **cluster**, giving fault tolerance, scalability, and parallel performance.

---

## 🔹 How Distributed Storage Works

![Distributed Storage](./images/dds.png)

- Nodes store data on **magnetic disks** or **SSDs**.  
- Total storage = sum of all node capacities.  
- Each node can also handle replication and access control.

---

## 🔹 Advantages of Distributed Storage

![Distributed Storage Advantages](./images/dds_adv.png)

- Fault tolerance and durability.  
- High availability even if some nodes fail.  
- Parallel reads/writes → faster performance.  
- Horizontal scaling (add more nodes).

---

## 🔹 Methods of Distribution

![Types of DDS](./images/types_of_dds.png)

- **Replication** → multiple copies for availability.  
- **Partitioning (Sharding)** → dataset split across nodes.  
- Often combined → partition first, then replicate each shard.
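
"Partition first, then replicate each shard" can be sketched as below. Hash partitioning is only one of several sharding methods, and the node names, shard count, and replica count are assumptions made for this example.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
NUM_SHARDS = 4
REPLICAS = 2


def shard_for(key: str) -> int:
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def nodes_for(shard: int) -> list[str]:
    """Replicate each shard on REPLICAS consecutive nodes."""
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICAS)]


placement = {k: nodes_for(shard_for(k)) for k in ["user:1", "user:2", "user:3"]}
for key, nodes in placement.items():
    print(key, "->", nodes)
```

Each key lands on two distinct nodes, so losing a single node still leaves one copy of every shard.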

---

## 🔹 Replication & Partition Trade-off

![Replication & Partition](./images/dds_default.png)

- Replication adds redundancy, but replicas can lag behind the latest update.  
- Partitioning spreads data efficiently across nodes, but reads and writes that span partitions need coordination.  

---

# ⚖️ CAP Theorem in Distributed Storage

A distributed system can guarantee at most **two of three** properties at the same time:

- **Consistency (C)** → latest write is always reflected.  
- **Availability (A)** → every request gets a response.  
- **Partition Tolerance (P)** → system works despite network failures.  

---

## 🔹 Consistency

![Consistency](./images/cap_consistancy.png)  
Every read reflects the latest write.

---

## 🔹 Availability

![Availability](./images/availability.png)  
Every request receives *some* response.

---

## 🔹 Partition Tolerance

![Partition Tolerance](./images/partition_tolerance.png)  
System continues working even if some nodes are unreachable.

---

# 🏦 Example: Banking System (CAP Trade-off)

Two ATMs connected to the same account database:

- **CP System (Consistency + Partition Tolerance)**  
  - ATM #2 waits until ATM #1’s withdrawal is synced.  
  - ✅ Balance always correct.  
  - ❌ Sometimes no response (reduced availability).  
  - Example: **RDBMS**.

- **AP System (Availability + Partition Tolerance)**  
  - ATM #2 shows old balance immediately.  
  - ✅ Always responds fast.  
  - ❌ Risk of temporary inconsistency.  
  - Example: **Cassandra, DynamoDB**.

- **CA System (Consistency + Availability, no Partition Tolerance)**  
  - Only works when there are no network partitions, which a distributed system cannot guarantee.  
  - Example: **single-node DB**.
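
The CP and AP behaviours above can be sketched as a tiny simulation. Everything here is hypothetical (the `Replica` class, the `read_balance` function, the `partitioned` flag); it models only the choice an ATM makes when it cannot reach the primary copy of the account.

```python
class Replica:
    """Holds one copy of the account balance."""

    def __init__(self, balance: int) -> None:
        self.balance = balance


def read_balance(local: Replica, primary: Replica, partitioned: bool, mode: str) -> int:
    """Read at an ATM whose local replica may lag behind the primary."""
    if not partitioned:
        local.balance = primary.balance  # sync succeeds, so the read is fresh
        return local.balance
    if mode == "CP":
        # Refuse to answer rather than risk returning a stale balance.
        raise TimeoutError("cannot reach primary; refusing stale read")
    return local.balance  # AP: answer immediately, possibly stale


primary = Replica(100)
atm2 = Replica(100)
primary.balance -= 40  # withdrawal at ATM #1, not yet synced to ATM #2

ap_answer = read_balance(atm2, primary, partitioned=True, mode="AP")
try:
    read_balance(atm2, primary, partitioned=True, mode="CP")
    cp_answered = True
except TimeoutError:
    cp_answered = False
healed_answer = read_balance(atm2, primary, partitioned=False, mode="CP")

print(ap_answer, cp_answered, healed_answer)
```

During the partition the AP read returns the old balance, the CP read returns nothing, and once the network heals both modes see the synced value.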

---

# 🛒 Example: Online Shopping (CAP Trade-off)

- **Consistency (CP)** → System rejects your order until stock count is synced everywhere.  
- **Availability (AP)** → Lets everyone order instantly, but may oversell stock.  

---

# 🔹 ACID vs BASE in Distributed Databases

![ACID](./images/acid.png)  
![BASE](./images/base.png)  
![ACID vs BASE](./images/acid_vs_base.png)

- **ACID (Relational DBs)** → Correctness first; in CAP terms, typically favors consistency over availability.  
- **BASE (NoSQL)** → Scale and speed first; in CAP terms, typically favors availability and accepts eventual consistency.  
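
The "correctness first" side can be illustrated with atomicity, the A in ACID, using the standard-library `sqlite3` module. This is a single-node sketch rather than a distributed database, and the table, names, and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount: int) -> bool:
    """Move money from alice to bob; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # succeeds
failed = transfer(500)  # violates the CHECK, so bob's credit is rolled back too
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The failed transfer leaves no partial credit on bob's account, which is the guarantee a BASE-style system may relax in exchange for availability.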