# Table of Contents
- [Storage Hierarchy](#storage-hierarchy)
- [Storage Raw Ingredients & Their Role](#storage-raw-ingredients--their-role)
- [Networking, Serialization & Compression](#networking-serialization--compression)

# Storage Hierarchy

Storage in data engineering can be thought of as a **hierarchy with three layers**:

---

## 1. Raw Ingredients  
The foundation of all storage systems.  
- **Physical Components**:  
  - Magnetic disks (HDD)  
  - Solid-state drives (SSD)  
  - Memory (RAM)  

- **Processes**:  
  - Networking  
  - Serialization  
  - Compression  
  - CPU utilization  
  - Caching  

---

## 2. Storage Systems  
Built from the raw ingredients, these systems manage data and provide ways to interact with it.  
- Databases (OLTP & OLAP)  
- Distributed storage systems  
- Specialized systems:  
  - Graph databases  
  - Vector databases  
- Caching systems  

---

## 3. Storage Abstractions  
High-level assemblies of storage systems.  
- Data warehouses  
- Data lakes  
- Data lakehouses  

---

## Visual Representation
![Storage Hierarchy](./images/storage_hierarchy.png)

# Storage Raw Ingredients & Their Role

Storage systems are built on raw ingredients that balance **cost, speed, and durability**. Here’s what each does:

---

## 🗄 Persistent Storage
- **Magnetic Disk (HDD)** → Good for bulk storage where cost matters more than speed.  
- **Solid-State Drive (SSD)** → Faster access, great for transactional (OLTP) workloads.

![Performance Comparison](./images/performance_1.png)

---

## ⚡ Improving Performance
- **Distributed Storage** → Splits data across machines to improve read/write throughput.  
- **Partitioning** → Divides the data on SSDs into partitions so multiple storage controllers can work on them in parallel.

![Improving Performance](./images/performace_2.png)
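
To make the idea of splitting data across machines concrete, here is a minimal sketch of hash-based placement. The node names and the `node_for` and `partition` helpers are hypothetical, and hashing is only one common placement scheme; real distributed storage systems may use different strategies.

```python
import hashlib

# Hypothetical storage nodes, used only for illustration.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key: str, nodes: list[str] = NODES) -> str:
    """Pick a node by hashing the key, so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def partition(keys: list[str]) -> dict[str, list[str]]:
    """Group keys by the node they are assigned to."""
    placement: dict[str, list[str]] = {node: [] for node in NODES}
    for key in keys:
        placement[node_for(key)].append(key)
    return placement


keys = [f"order-{i}" for i in range(12)]
for node, assigned in partition(keys).items():
    print(node, assigned)
```

Because each node holds only part of the data, reads and writes for different keys can be served by different machines at the same time.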

---

## 🔋 Volatile Memory
- **RAM** → Temporary workspace for fast processing, caching, and indexing.  
- **CPU Cache** → Ultra-fast memory right on the processor, keeps frequently used data close.

![Volatile Memory Ingredients](./images/performace_3.png)
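
As a software-level analogy for keeping frequently used data close, the sketch below caches results in memory with Python's `functools.lru_cache`. The `read_record` function and the `calls` counter are hypothetical stand-ins for a slow read from persistent storage.

```python
from functools import lru_cache

# Counts how often the "slow" path actually runs.
calls = {"slow_reads": 0}


@lru_cache(maxsize=128)
def read_record(record_id: int) -> str:
    """Pretend to read from slow persistent storage; results are cached in RAM."""
    calls["slow_reads"] += 1
    return f"record-{record_id}"


read_record(1)  # slow read, result stored in the cache
read_record(1)  # served from the in-memory cache
read_record(2)  # slow read for a new key

print(calls["slow_reads"])      # number of slow reads performed
print(read_record.cache_info())  # hit/miss statistics for the cache
```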

# Networking, Serialization & Compression

When working with storage systems, data doesn’t stay in one place — it moves between **memory, disk, and networks**. To make this efficient, several processes come into play.  

---

## 🌐 Networking & CPU
Networking connects the servers in a distributed storage system, while CPUs handle the work of servicing requests and coordinating reads and writes.  
Together they help with:  
- Faster reads/writes  
- Data durability  
- High availability  

![Networking](./images/networking.png)

---

## 📦 Serialization
To move data from **memory to disk** or send it across a network, it must first be transformed.  
- **Serialization** → Convert in-memory structures into a format suited to disk storage or network transfer  
- **Deserialization** → Rebuild the original in-memory structures from that format  

![Serialization](./images/serialize.png)
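
A minimal round trip might look like the sketch below. It uses JSON from Python's standard library purely as an illustration; the `record` contents are made up, and other formats follow the same serialize/deserialize pattern.

```python
import json

# An in-memory structure (a Python dict).
record = {"order_id": 42, "customer": "Ada", "total": 19.99}

# Serialization: turn the in-memory structure into bytes that can be
# written to disk or sent over a network.
payload: bytes = json.dumps(record).encode("utf-8")

# Deserialization: rebuild the in-memory structure from those bytes.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, len(payload))
print(restored == record)
```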

---

## 📊 Serialization Approaches
- **Row-based** → Store data record by record (good for transactions).  
- **Column-based** → Store data column by column (good for analytics).  

![Serialization Example](./images/serialization.png)
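
The difference between the two layouts can be sketched with plain Python structures. The sample `rows` are invented for illustration; real row and column stores use binary layouts on disk, but the access patterns are the same in spirit.

```python
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 25.5},
    {"id": 3, "city": "Oslo", "amount": 7.25},
]

# Row-based layout: all fields of one record are stored together.
row_layout = [tuple(r.values()) for r in rows]

# Column-based layout: all values of one field are stored together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}

# Transactional access: fetch one complete record in a single lookup.
second_order = row_layout[1]

# Analytical access: aggregate one column without touching the others.
total_amount = sum(column_layout["amount"])

print(second_order)
print(total_amount)
```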

---

## 📝 Serialization Formats
You can choose different file formats for serialized data:  

- **Human-readable**:  
  - CSV → Simple, row-based, but error-prone (because it carries no schema)  
  - XML → Legacy, slow to process  
  - JSON → Widely used for APIs  

- **Binary**:  
  - Parquet → Column-based, efficient for big data  
  - Avro → Row-based, schema-driven, supports evolution  

![Serialization Formats](./images/serialization_formats.png)
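
To illustrate why a missing schema makes CSV error-prone, the sketch below writes the same made-up record to CSV and JSON using only the standard library. Parquet and Avro are left out because they need third-party packages.

```python
import csv
import io
import json

records = [{"id": 1, "active": True, "score": 9.5}]

# CSV round trip: the file stores only text, so types are lost.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "active", "score"])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# JSON round trip: numbers and booleans keep their types.
from_json = json.loads(json.dumps(records))

print(from_csv[0])   # every value comes back as a string
print(from_json[0])  # values keep their original types
```

With CSV, the reader has to know or guess that `id` is an integer and `active` is a boolean, which is where many parsing errors come from.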

---

## 🗜 Compression
Once serialized, data can be **compressed** before saving to disk.  
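- Reduces disk space  
- Cuts down I/O time, since fewer bytes are read and written  
- Can speed up queries as a result, at the cost of some CPU time to compress and decompress  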

![Compression Flow](./images/compressin_1.png)
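
The effect on size can be checked with Python's built-in `zlib` module. This is only a sketch: the sample byte strings are invented, and actual ratios depend heavily on the data and the codec.

```python
import os
import zlib

# Highly repetitive data, similar to a column with few distinct values.
repetitive = b"status=shipped;" * 1000

# Random bytes of the same length, with almost no redundancy to remove.
random_like = os.urandom(len(repetitive))

compressed_repetitive = zlib.compress(repetitive)
compressed_random = zlib.compress(random_like)

print("repetitive:", len(repetitive), "->", len(compressed_repetitive))
print("random:    ", len(random_like), "->", len(compressed_random))

# Compression here is lossless: decompressing restores the original bytes.
restored = zlib.decompress(compressed_repetitive)
print(restored == repetitive)
```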

Compression works by removing **redundancy and repetition** in data.  
For example, frequently used characters get shorter encodings.  

![Compression Example](./images/compression.png)
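
One classic way to give frequent characters shorter encodings is Huffman coding. The sketch below computes only the code length per character; `huffman_code_lengths` is a hypothetical helper written for this note, and it is not necessarily the scheme used by any particular file format.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict[str, int]:
    """Return the Huffman code length in bits for each character in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {next(iter(freq)): 1}

    # Heap entries are (weight, tie_breaker, {char: depth_in_tree}).
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees; their characters move one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1

    return heap[0][2]


text = "aaaaaaaabbbc"
lengths = huffman_code_lengths(text)
encoded_bits = sum(lengths[ch] for ch in text)

print(lengths)
print(encoded_bits, "bits vs", 8 * len(text), "bits at one byte per character")
```

In this sample, `a` appears most often and receives the shortest code, while the rarer `b` and `c` receive longer ones.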

---

## 🔑 Flow Summary
1. **Networking/CPU** → Move and coordinate data across servers.  
2. **Serialization** → Convert memory data into storable/transmittable format.  
3. **Serialization Formats** → Choose a format: CSV/JSON/XML for readability, Parquet/Avro for efficiency.  
4. **Compression** → Save space and improve performance before writing to disk.  
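
The serialize, compress, and write steps (and their reverse) can be sketched end to end as below. The file name, the sample records, and the `write_records` and `read_records` helpers are assumptions made for illustration; JSON and gzip stand in for whatever format and codec a real system uses.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_records(records: list[dict], path: Path) -> tuple[int, int]:
    """Serialize, compress, and write records; return (raw size, stored size) in bytes."""
    serialized = json.dumps(records).encode("utf-8")  # serialization
    compressed = gzip.compress(serialized)            # compression
    path.write_bytes(compressed)                      # write to disk
    return len(serialized), len(compressed)


def read_records(path: Path) -> list[dict]:
    """Read, decompress, and deserialize records."""
    compressed = path.read_bytes()
    serialized = gzip.decompress(compressed)
    return json.loads(serialized.decode("utf-8"))


records = [{"id": i, "status": "shipped"} for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "orders.json.gz"
    raw_size, stored_size = write_records(records, path)
    restored = read_records(path)

print(raw_size, "bytes serialized,", stored_size, "bytes on disk")
print(restored == records)
```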