# Table of Contents 
- [Storage Hierarchy](#storage-hierarchy)
- [Storage Raw Ingredients & their role](#storage-raw-ingredients--their-role)
- [Networking, Serialization & Compression](#networking-serialization--compression)
    - [Compression Algorithms (Coursera Supplement)](https://www.coursera.org/learn/data-storage-and-queries/supplement/ebNtg/optional-compression-algorithms)
- [Cloud Storage Options](#cloud-storage-options)
- [Cloud Storage Tiers](#storage-tiers-in-the-cloud)
- [Distributed Storage Systems](#distributed-storage-systems)
    - [Database Partitioning/sharding methods](https://www.coursera.org/learn/data-storage-and-queries/supplement/wxN0S/optional-database-partitioning-sharding-methods)
- [How Databases Store Data](#how-databases-store-data)
- [Row-Oriented Storage vs Column-Oriented Storage](#row-oriented-storage-vs-column-oriented-storage)
    - [The Parquet Format](https://www.coursera.org/learn/data-storage-and-queries/supplement/lGyaX/optional-the-parquet-format)
    - [Wide-Column Databases](https://www.coursera.org/learn/data-storage-and-queries/supplement/gRrOm/optional-wide-column-databases)
- [Graph Databases](#graph-databases)
- [Vector Databases](#vector-databases)
    - [ANN Algorithm: Hierarchical Navigable Small World (HNSW) (Coursera Supplement)](https://www.coursera.org/learn/data-storage-and-queries/supplement/NPkIA/optional-ann-algorithm-hierarchical-navigable-small-world-hnsw)
- [Neo4j and the Property Graph Model](#neo4j-and-the-property-graph-model)


# Storage Hierarchy

Storage in data engineering can be thought of as a **hierarchy with three layers**:

---

## 1. Raw Ingredients  
The foundation of all storage systems.  
- **Physical Components**:  
  - Magnetic disks (HDD)  
  - Solid-state drives (SSD)  
  - Memory (RAM)  

- **Processes**:  
  - Networking  
  - Serialization  
  - Compression  
  - CPU utilization  
  - Caching  

---

## 2. Storage Systems  
Built from raw ingredients. Provide management and interaction with data.  
- Databases (OLTP & OLAP)  
- Distributed storage systems  
- Specialized systems:  
  - Graph databases  
  - Vector databases  
- Caching systems  

---

## 3. Storage Abstractions  
High-level assemblies of storage systems.  
- Data warehouses  
- Data lakes  
- Data lakehouses  

---

## Visual Representation
![Storage Hierarchy](./images/storage_hierarchy.png)

# Storage Raw Ingredients & Their Role

Storage systems are built on raw ingredients that balance **cost, speed, and durability**. Here’s what each does:

---

## 🗄 Persistent Storage
- **Magnetic Disk (HDD)** → Good for bulk storage where cost matters more than speed.  
- **Solid-State Drive (SSD)** → Faster access, great for transactional (OLTP) workloads.

![Performance Comparison](./images/performance_1.png)

---

## ⚡ Improving Performance
- **Distributed Storage** → Splits data across machines to improve read/write throughput.  
- **Partitioning** → Divides SSDs into chunks so multiple controllers can work in parallel.

![Improving Performance](./images/performace_2.png)

---

## 🔋 Volatile Memory
- **RAM** → Temporary workspace for fast processing, caching, and indexing.  
- **CPU Cache** → Ultra-fast memory right on the processor, keeps frequently used data close.

![Volatile Memory Ingredients](./images/performace_3.png)

# Networking, Serialization & Compression

When working with storage systems, data doesn’t stay in one place — it moves between **memory, disk, and networks**. To make this efficient, several processes come into play.  

---

## 🌐 Networking & CPU
Networking connects multiple servers in distributed systems.  
It helps with:  
- Faster reads/writes  
- Data durability  
- High availability  

![Networking](./images/networking.png)

---

## 📦 Serialization
To store data from **memory to disk** or send it across a network, it must be transformed.  
- **Serialization** → Convert in-memory structures into disk-friendly format  
- **Deserialization** → Rebuild original memory format  

![Serialization](./images/serialize.png)
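
A minimal sketch of the round trip, using only Python's standard `json` and `struct` modules. The `order` record and its field names are hypothetical; the point is only that the same in-memory object can be serialized to text or to compact binary and then rebuilt.

```python
import json
import struct

# In-memory structure: a Python dict (hypothetical order record).
order = {"order_id": 4, "price": 50.0, "quantity": 13}

# Serialization: turn the in-memory object into bytes that can be
# written to disk or sent over a network.
as_json = json.dumps(order).encode("utf-8")  # human-readable text

# Compact fixed-width binary: little-endian int32, float64, int32.
as_binary = struct.pack("<idi", order["order_id"], order["price"], order["quantity"])

# Deserialization: rebuild the original in-memory structure.
from_json = json.loads(as_json.decode("utf-8"))
order_id, price, quantity = struct.unpack("<idi", as_binary)
from_binary = {"order_id": order_id, "price": price, "quantity": quantity}

print(len(as_json), "bytes as JSON,", len(as_binary), "bytes as packed binary")
```

The binary form is smaller, but a reader must already know the layout (`"<idi"`) to decode it, which is the role a schema plays in formats like Avro and Parquet.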

---

## 📊 Serialization Approaches
- **Row-based** → Store data record by record (good for transactions).  
- **Column-based** → Store data column by column (good for analytics).  

![Serialization Example](./images/serialization.png)
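
The two layouts can be sketched with plain Python lists. The three records below are made up for illustration; real engines write bytes to disk, but the ordering of values is the idea being shown.

```python
# Three hypothetical order records held in memory.
records = [
    {"order_id": 1, "price": 40, "sku": "A1"},
    {"order_id": 2, "price": 23, "sku": "B7"},
    {"order_id": 3, "price": 45, "sku": "C3"},
]

# Row-based: the values of one record sit next to each other.
row_layout = [value for record in records for value in record.values()]

# Column-based: the values of one column sit next to each other.
column_names = list(records[0].keys())
column_layout = {name: [record[name] for record in records] for name in column_names}

# An aggregate only needs to touch one contiguous list in the column layout.
total_price = sum(column_layout["price"])

print(row_layout)
print(column_layout)
print(total_price)
```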

---

## 📝 Serialization Formats
You can choose different file formats for serialized data:  

- **Human-readable**:  
  - CSV → Simple, row-based, but error-prone (because there is no schema)  
  - XML → Legacy, verbose, slow to process  
  - JSON → Widely used for APIs  

- **Binary**:  
  - Parquet → Column-based, efficient for big data  
  - Avro → Row-based, schema-driven, supports evolution  

![Serialization Formats](./images/serialization_formats.png)
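
Why "no schema" makes CSV error-prone can be seen in a small round trip with the standard `csv` and `json` modules. The sample row is hypothetical. Parquet and Avro are left out here because they need third-party libraries.

```python
import csv
import io
import json

rows = [{"order_id": 4, "price": 50.5, "in_stock": True}]

# CSV round trip: the file stores only text, so types are not preserved.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["order_id", "price", "in_stock"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# JSON round trip: numbers and booleans keep their types.
json_rows = json.loads(json.dumps(rows))

print(csv_rows[0])   # every value comes back as a string
print(json_rows[0])  # values keep their original types
```

With CSV, the reader has to guess or be told that `"50.5"` is a number and `"True"` is a boolean; schema-driven formats carry that information with the data.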

---

## 🗜 Compression
Once serialized, data can be **compressed** before saving to disk.  
- Reduces disk space  
- Speeds up queries  
- Cuts down I/O time  

![Compression Flow](./images/compressin_1.png)

Compression works by removing **redundancy and repetition** in data.  
For example, frequently used characters get shorter encodings.  

![Compression Example](./images/compression.png)
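
One classic way to give frequent characters shorter encodings is Huffman coding. The sketch below is an illustrative implementation, not necessarily the algorithm shown in the figure, and the input string is made up.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code where frequent characters get shorter bit strings."""
    counts = Counter(text)
    if len(counts) == 1:
        # Edge case: a single distinct symbol still needs one bit.
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie_breaker, {char: code_so_far}).
    heap = [(freq, i, {char: ""}) for i, (char, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing their codes with 0 and 1.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


text = "aaaaaaaabbbc"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)

print(codes)
print(len(encoded), "bits instead of", 8 * len(text))
```

Here `a` appears most often, so it receives the shortest code, and the encoded text needs far fewer bits than one byte per character.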

---

## 🔑 Flow Summary
1. **Networking/CPU** → Move and coordinate data across servers.  
2. **Serialization** → Convert memory data into storable/transmittable format.  
3. **Serialization Formats** → Choose (CSV/JSON/XML for readability, Parquet/Avro for efficiency).  
4. **Compression** → Save space and improve performance before writing to disk.  

# Cloud Storage Options

As a data engineer, you have three main cloud storage options: **File Storage**, **Block Storage**, and **Object Storage**.  
Each has different strengths and trade-offs depending on your workload.

---

## 📂 File Storage
Organizes files in a **hierarchical directory tree** (like folders on your laptop).  
Each directory stores metadata such as name, owner, permissions, and location pointers.

- ✅ Easy to use and manage  
- ✅ Great for sharing files across users and machines  
- ❌ Lower read/write performance (must track hierarchy)  
- ❌ Limited scalability compared to other options  

![File Storage](./images/file_storage.png)  
![File Storage Use Case](./images/file_storage_usecase.png)

---

## 📦 Block Storage
Divides data into **small fixed-size blocks**, each with a unique ID.  
You can directly update or retrieve specific blocks without rewriting the entire file.

- ✅ High performance & low latency → ideal for **transactional workloads (OLTP)**  
- ✅ Supports frequent small reads/writes  
- ✅ Used in **databases, VMs (e.g., Amazon EBS)**  
- ❌ Limited scalability (usually a few TBs)  
- ❌ Tied closely to compute instances (not independent like object storage)  

![Block Storage](./images/block_storage.png)  
![Block Storage Lookup](./images/block_storage_2.png)  
![Block Storage Use Case](./images/block_storage_usecase.png)
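
A toy model of block-level updates, assuming a tiny 4-byte block size for readability (real volumes use much larger blocks). `write_block` is a hypothetical helper, not an API of any storage service.

```python
BLOCK_SIZE = 4  # bytes per block; hypothetical, chosen to keep the example small

# A "volume" modelled as a list of fixed-size blocks addressed by index.
data = b"AAAABBBBCCCCDDDD"
volume = [bytearray(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


def write_block(blocks, block_id, payload):
    """Overwrite exactly one block; the other blocks are not touched."""
    if len(payload) != BLOCK_SIZE:
        raise ValueError("payload must be exactly one block")
    blocks[block_id][:] = payload


write_block(volume, 2, b"XXXX")
contents = b"".join(bytes(block) for block in volume)
print(contents)
```

Only block 2 changes; this is what makes frequent small writes cheap compared with rewriting a whole file or object.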

---

## 🗃️ Object Storage
Stores data as **immutable objects** in a flat structure with unique keys.  
Objects are replicated across nodes and scale almost infinitely.

- ✅ Extremely scalable (petabytes and beyond)  
- ✅ Great for **analytics (OLAP), data lakes, ML pipelines**  
- ✅ High durability via replication  
- ❌ Not good for transactional workloads (must rewrite entire object to update)  
- ❌ Higher latency for frequent small updates  

![Object Storage](./images/object_storage.png)  
![Object Storage Use Cases](./images/object_storage_use_Cases.png)
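
The "flat keys, whole-object writes" idea can be sketched with a dictionary. `ObjectStore` and the key below are hypothetical and do not mirror any provider's SDK.

```python
class ObjectStore:
    """Toy flat key-value store; each object is an immutable bytes value."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # No partial writes: a put always stores a complete new object.
        self._objects[key] = bytes(body)

    def get(self, key):
        return self._objects[key]


store = ObjectStore()
key = "logs/2024/01/events.json"  # slashes are part of the key, not real folders
store.put(key, b'{"clicks": 1}')

# "Updating" means reading the whole object, changing it, and writing it all back.
old = store.get(key)
store.put(key, old.replace(b"1", b"2"))
print(store.get(key))
```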

---

## ⚖️ Trade-offs at a Glance

![Storage Options](./images/storage_option.png)

- **File Storage** → simple sharing & management, but not for performance-heavy tasks  
- **Block Storage** → best for **transactional systems** with frequent updates  
- **Object Storage** → best for **analytics & big data** with massive scalability  

👉 Rule of Thumb:  
- Use **File Storage** for team collaboration and shared access  
- Use **Block Storage** for **databases/VMs**  
- Use **Object Storage** for **data lakes, warehouses, ML workloads**  

# Storage Tiers in the Cloud

Cloud storage tiers are based on **how often data is accessed** and the trade-off between **cost vs speed**.

- Hot storage → fast but expensive.  
- Cold storage → cheap but slow.  
- Warm storage → balance between the two.  
- Best practice: **combine tiers** depending on workload needs.

![Storage Tiers](./images/storage_tiers.png)

---

## ☁️ AWS Storage Tier Examples

Amazon S3 provides multiple storage options mapped to these tiers.  
You choose based on access frequency and cost requirements.

![AWS Storage Tiers](./images/aws_storage_tiers.png)

# Distributed Storage Systems

Distributed storage splits and replicates data across **multiple servers (nodes)**.  
Nodes form a **cluster**, giving fault tolerance, scalability, and parallel performance.

---

## 🔹 How Distributed Storage Works

![Distributed Storage](./images/dds.png)

- Nodes store data on **magnetic disks** or **SSDs**.  
- Raw capacity = sum of all node capacities (usable capacity is lower once data is replicated).  
- Each node can also handle replication and access control.

---

## 🔹 Advantages of Distributed Storage

![Distributed Storage Advantages](./images/dds_adv.png)

- Fault tolerance and durability.  
- High availability even if some nodes fail.  
- Parallel reads/writes → faster performance.  
- Horizontal scaling (add more nodes).

---

## 🔹 Methods of Distribution

![Types of DDS](./images/types_of_dds.png)

- **Replication** → multiple copies for availability.  
- **Partitioning (Sharding)** → dataset split across nodes.  
- Often combined → partition first, then replicate each shard.
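
"Partition first, then replicate each shard" can be sketched as follows. The node names, the replica count, and the choice of hash partitioning with ring-style replica placement are all assumptions for illustration; real systems use a variety of schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICAS = 2  # copies kept of every shard


def shard_for(key, num_shards):
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def nodes_for(key):
    """Partition first, then place replicas on the following nodes."""
    primary = shard_for(key, len(NODES))
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]


placement = {key: nodes_for(key) for key in ["user:1", "user:2", "user:3"]}
for key, nodes in placement.items():
    print(key, "->", nodes)
```

Each key lives on `REPLICAS` distinct nodes, so losing one node does not lose the data, while different keys spread over the cluster.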

---

## 🔹 Replication & Partition Trade-off

![Replication & Partition](./images/dds_default.png)

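- **Replication** → adds redundancy, but replicas need time to receive each update, so reads may briefly be stale.  
- **Partitioning** → spreads data efficiently across nodes, but queries that span shards need coordination.  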

---

# ⚖️ CAP Theorem in Distributed Storage

When a network partition occurs, a distributed system must choose between consistency and availability, so it can fully guarantee only **two of three**:

- **Consistency (C)** → latest write is always reflected.  
- **Availability (A)** → every request gets a response.  
- **Partition Tolerance (P)** → system works despite network failures.  

---

## 🔹 Consistency

![Consistency](./images/cap_consistancy.png)  
Every read reflects the latest write.

---

## 🔹 Availability

![Availability](./images/availability.png)  
Every request receives *some* response.

---

## 🔹 Partition Tolerance

![Partition Tolerance](./images/partition_tolerance.png)  
System continues working even if some nodes are unreachable.

---

# 🏦 Example: Banking System (CAP Trade-off)

Two ATMs connected to the same account database:

- **CP System (Consistency + Partition Tolerance)**  
  - ATM #2 waits until ATM #1’s withdrawal is synced.  
  - ✅ Balance always correct.  
  - ❌ Sometimes no response (reduced availability).  
  - Example: **RDBMS**.

- **AP System (Availability + Partition Tolerance)**  
  - ATM #2 shows old balance immediately.  
  - ✅ Always responds fast.  
  - ❌ Risk of temporary inconsistency.  
  - Example: **Cassandra, DynamoDB**.

- **CA System (Consistency + Availability, no Partition Tolerance)**  
  - Only works if no network issues (rare in distributed systems).  
  - Example: **single-node DB**.
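
The CP and AP behaviours of the two ATMs can be simulated in a few lines. `Replica` and `read_balance` are hypothetical names, and the model is deliberately simplified: one replica takes the write, the other is cut off by a partition.

```python
class Replica:
    def __init__(self, balance):
        self.balance = balance


def read_balance(local, primary, partitioned, mode):
    """Return the balance an ATM shows, or None if it refuses to answer."""
    if not partitioned:
        local.balance = primary.balance  # replicas can sync
        return local.balance
    if mode == "CP":
        return None  # refuse rather than risk a stale answer
    return local.balance  # "AP": answer now, possibly stale


atm1 = Replica(100)  # handles the withdrawal
atm2 = Replica(100)  # cut off from atm1 by a network partition
atm1.balance -= 40

cp_answer = read_balance(atm2, atm1, partitioned=True, mode="CP")
ap_answer = read_balance(atm2, atm1, partitioned=True, mode="AP")
healed = read_balance(atm2, atm1, partitioned=False, mode="AP")

print(cp_answer, ap_answer, healed)
```

During the partition, the CP variant gives no answer and the AP variant returns the old balance; once the partition heals, both converge on the correct value.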

---

# 🛒 Example: Online Shopping (CAP Trade-off)

- **Consistency (CP)** → System rejects your order until stock count is synced everywhere.  
- **Availability (AP)** → Lets everyone order instantly, but may oversell stock.  

---

# 🔹 ACID vs BASE in Distributed Databases

![ACID](./images/acid.png)  
![BASE](./images/base.png)  
![ACID vs BASE](./images/acid_vs_base.png)

- **ACID (Relational DBs)** → Correctness first; in CAP terms these systems usually lean toward consistency (CP).  
- **BASE (NoSQL)** → Scale and speed first; these systems usually lean toward availability (AP) and accept eventual consistency.  

# How Databases Store Data

Databases rely on a **Database Management System (DBMS)**, which provides a structured way to store, retrieve, and manage data efficiently. Let’s break down how this works step by step.

---

## 🏗️ Database Management System (DBMS)

A DBMS consists of multiple components that work together to process queries and manage data.

![DBMS](./images/dbms.png)

- **Transport System** → Manages communication between the database and clients.  
- **Query Processor** → Interprets and optimizes query languages.  
- **Execution Engine** → Executes the optimized query plan.  
- **Storage Engine** → Handles how data is stored and retrieved from disk.  

---

## 💾 Storage Engine

The **Storage Engine** is responsible for the physical storage and retrieval of data.

![Storage Engine](./images/storage_engine.png)

Key responsibilities:
- **Serialization** → Converting data into storable formats.  
- **Arrangement on Disk** → Efficient placement of data for quick access.  
- **Indexing** → Creating auxiliary data structures for fast lookup.  

### ⚡ Modern Storage Engines
- Optimized for **SSDs** to leverage high-speed performance.  
- Support **complex data types** like variable-length strings, arrays, and nested data.  
- Provide **columnar storage** for analytical applications.  

---

## 📑 Indexing for Faster Retrieval

Indexes are special **data structures** that make it faster to locate specific data without scanning entire tables.

![Index Table](./images/index_table.png)

- **Without Index** → The DBMS scans all rows (`O(n)` complexity).  
- **With Index** → An index keeps column values in sorted order and maps them to row addresses, enabling **binary search** or a tree lookup (`O(log n)` complexity).  
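
The difference between a full scan and an indexed lookup can be sketched with Python's `bisect` module. The table contents are made up, and a sorted list stands in for the index; production databases typically use tree structures such as B-trees.

```python
import bisect

# Hypothetical table: row address (list position) -> (customer_id, name).
table = [(42, "Ada"), (7, "Grace"), (19, "Linus"), (3, "Edsger"), (88, "Barbara")]


def scan(customer_id):
    """Without an index: check every row, O(n)."""
    for address, row in enumerate(table):
        if row[0] == customer_id:
            return address
    return None


# The index: column values kept sorted, each paired with its row address.
index = sorted((row[0], address) for address, row in enumerate(table))
keys = [key for key, _ in index]


def lookup(customer_id):
    """With an index: binary search over the sorted keys, O(log n)."""
    pos = bisect.bisect_left(keys, customer_id)
    if pos < len(keys) and keys[pos] == customer_id:
        return index[pos][1]
    return None


print(scan(19), lookup(19), lookup(5))
```

Both functions return the same row address; the indexed version simply inspects far fewer entries as the table grows.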

---

## ⚡ In-Memory Databases

Some databases use **RAM as the primary storage layer**:
- **Pros**: Ultra-fast retrieval, low latency.  
- **Cons**: Volatile → Data is lost when the machine restarts unless it is also persisted to disk.  

### Examples:
- **Memcached** → Key-value store used for caching API/database results.  
- **Redis** → Key-value store supporting complex data types, with persistence options such as snapshots (RDB) and an append-only log (AOF).  

Common use cases:
- Caching applications  
- Real-time bidding  
- Gaming leaderboards  
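
The caching use case can be illustrated with a small in-process cache. This is a toy stand-in, not the Redis or Memcached API: `TTLCache`, `slow_query`, and `get_user` are hypothetical names, and the pattern shown (check the cache, fall back to the slow source, store the result) is commonly called cache-aside.

```python
import time


class TTLCache:
    """Toy in-process key-value cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)


calls = []


def slow_query(user_id):
    """Stand-in for an expensive database or API call."""
    calls.append(user_id)
    return {"user_id": user_id, "score": 1200}


def get_user(cache, user_id):
    """Cache-aside: try the cache first, fall back to the slow source."""
    cached = cache.get(user_id)
    if cached is None:
        cached = slow_query(user_id)
        cache.set(user_id, cached)
    return cached


cache = TTLCache(ttl_seconds=60)
first = get_user(cache, 7)
second = get_user(cache, 7)  # served from memory; slow_query is not called again
print(first == second, calls)
```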

---

✅ **In summary:**  
Databases use **DBMS components** (query processor, execution engine, storage engine) to handle queries efficiently.  
The **storage engine** manages physical data organization, serialization, and indexing.  
**Indexes** speed up retrieval by reducing search complexity.  
For ultra-fast applications, **in-memory databases** like Redis and Memcached are widely used.

# Row-Oriented Storage vs Column-Oriented Storage

As a data engineer, it’s important to understand **how data is physically stored on disk** because this determines query performance and use cases. Let’s explore **row-oriented storage** and **column-oriented storage**.

---

## 🟦 Row-Oriented Storage

Row-oriented databases store **complete records row by row** on disk.  
This means all the values for a single row (Order ID, Price, SKU, Quantity, Customer ID, etc.) are stored together consecutively.

![Row Storage](./images/row_storage_.png)

### 🔹 How it works on disk
- Each row is serialized and stored sequentially.  
- Example: For a record, all fields (`Order ID = 4, Price = 50, SKU = 45682, Quantity = 13, Customer ID = 98q`) are stored together.  

![Row Storage Example](./images/row_storage.png)

### 🔹 Querying Example
Suppose you want to calculate the **sum of all prices**:  
- The system must **read all rows** from disk into memory.  
- Then, the CPU extracts the price column from each row and sums it up.  
- Even though you only need one column (Price), the database still loads **all columns of all rows**.  

📉 **Downside**: Very slow for analytical queries on large datasets (e.g., scanning 1 billion rows of 30 columns, about 3000 GB, may take around 4 hours).  

✅ **Best for**: **Transactional workloads (OLTP)** → frequent inserts, updates, deletes, and quick row lookups.

---

## 🟩 Column-Oriented Storage

Column-oriented databases store **data column by column** instead of row by row.  
This means all values of a single column (e.g., Price) are stored together on disk.

![Columnar Storage](./images/columar_storage.png)

### 🔹 How it works on disk
- Each column is serialized separately.  
- Example: All prices (40, 23, 45, 50, …) are stored together in one block, all SKUs together in another block, etc.  

![Columnar Storage Example](./images/columar_storage_example.png)

### 🔹 Querying Example
Suppose you want to calculate the **sum of all prices**:  
- The system directly **reads only the Price column** from disk into memory.  
- No need to load irrelevant columns like Customer ID or SKU.  
- For 1 billion rows, instead of loading 3000 GB (all rows with 30 columns), only **100 GB** (just the Price column) needs to be read.  

📈 **Upside**: Much faster for analytical queries (e.g., only ~8 minutes vs 4 hours with row storage).  

❌ **Downside**: Slow for transactional operations because reconstructing an entire row requires reading from multiple columns.  

✅ **Best for**: **Analytical workloads (OLAP)** → aggregations, averages, sums, and large-scale reporting.
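
The "4 hours vs ~8 minutes" comparison can be reproduced with back-of-the-envelope arithmetic. The 100 bytes per value follows from 100 GB per column over 1 billion rows; the 200 MB/s read throughput is an assumption chosen because it is consistent with the quoted times, not a figure stated in these notes.

```python
ROWS = 1_000_000_000
COLUMNS = 30
BYTES_PER_VALUE = 100           # assumption: implied by 100 GB per column
THROUGHPUT_BYTES_PER_S = 200e6  # assumption: ~200 MB/s sequential disk read

row_store_bytes = ROWS * COLUMNS * BYTES_PER_VALUE  # every column of every row
column_store_bytes = ROWS * BYTES_PER_VALUE         # the Price column only

row_store_hours = row_store_bytes / THROUGHPUT_BYTES_PER_S / 3600
column_store_minutes = column_store_bytes / THROUGHPUT_BYTES_PER_S / 60

print(f"row store:    {row_store_bytes / 1e9:.0f} GB, about {row_store_hours:.1f} hours")
print(f"column store: {column_store_bytes / 1e9:.0f} GB, about {column_store_minutes:.1f} minutes")
```

Reading 30 times less data gives a 30 times shorter scan, which is the whole advantage of columnar layouts for single-column aggregates.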

---

## ⚖️ Summary

| Feature                  | Row-Oriented Storage | Column-Oriented Storage |
|---------------------------|----------------------|--------------------------|
| Data Stored              | Row by row           | Column by column         |
| Best For                 | OLTP (transactions)  | OLAP (analytics)         |
| Query: SUM(price)        | Must read entire table | Reads only price column |
| Performance              | Fast for single record updates | Fast for aggregations |
| Example Use Case         | Banking systems, sales DB | Data warehouses, BI tools |

✅ **Conclusion**:  
- Use **row storage** for transactional systems where low-latency reads/writes of full records are needed.  
- Use **columnar storage** for analytical systems where aggregations over large datasets are common.  

# Graph Databases

A **graph database** stores data using a **graph data structure**, where:
- **Nodes** represent entities (e.g., users, products, credit cards, IP addresses).  
- **Edges** represent relationships or connections between these entities.  

This makes graph databases ideal for modeling **complex, interconnected data** where relationships are just as important as the data itself.

---

## 🧩 Graph Database Basics

![Graph Database](./images/graph_database.png)

- **Nodes** → Data items like people, products, or locations.  
- **Edges** → Connections between nodes (e.g., "friend", "purchased", "used").  
- **Why Graph?** → Relationships are treated as **first-class citizens**, making it much easier to traverse and query compared to relational joins.  

---

## 🔍 Querying Data

![Graph Example](./images/graph_example.png)

### Graph Database Approach
- Traverse the graph directly:  
  - Start with a user → follow edges labeled "friend" → get friends.  
  - From friends → follow edges labeled "purchased" → get purchased products.  
  - Recommend those products to the user.  
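
The traversal above can be sketched with a dictionary-based graph. The users, products, and the extra step of skipping products the user already owns are assumptions added for illustration.

```python
# Edges stored per node and per label (hypothetical users and products).
graph = {
    "alice": {"friend": ["bob", "carol"], "purchased": ["laptop"]},
    "bob": {"friend": ["alice"], "purchased": ["laptop", "headphones"]},
    "carol": {"friend": ["alice"], "purchased": ["keyboard"]},
}


def neighbors(node, label):
    """Follow all edges with the given label out of a node."""
    return graph.get(node, {}).get(label, [])


def recommend(user):
    """Products bought by the user's friends that the user does not own yet."""
    owned = set(neighbors(user, "purchased"))
    found = []
    for friend in neighbors(user, "friend"):
        for product in neighbors(friend, "purchased"):
            if product not in owned and product not in found:
                found.append(product)
    return found


recommendations = recommend("alice")
print(recommendations)
```

Each hop is a direct lookup from a node to its neighbours, with no table-wide join.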

### Relational Database Approach
- Requires multiple **JOINs** between friendship and purchase tables.  
- Complex relationships (friends of friends, multi-hop queries) become inefficient.  

👉 Graph databases are **more efficient** for traversing multi-level relationships.

---

## ⚠️ Use Case: Fraud Detection

![Fraud Detection Graph](./images/fraud_detection_graph_database.png)

Graph databases can model suspicious behavior by linking:  
- **Users → Credit cards → IP addresses → Purchases**  
- Known fraudulent users can be compared against patterns.  
- Example: If a **credit card** is reused by a new user with a **new IP**, it can be flagged as **suspicious**.  

---

## 📚 Use Case: Knowledge Graphs

![Knowledge Graph](./images/knowledge_graph.png)

Knowledge graphs connect diverse data sources (products, customers, shipments, etc.) into a unified network.  
They are often used with **Retrieval-Augmented Generation (RAG)** techniques to provide **contextual information** to LLMs.  

Example:  
- A chatbot can query the knowledge graph for product, customer, and shipping data.  
- RAG supplies relevant context to the model → more accurate and business-specific answers.  

---

## ⚙️ Popular Graph Databases

- **Neo4j** → Widely used graph database, queried with the Cypher query language.  
- **ArangoDB** → Multi-model DB supporting graph queries.  
- **Amazon Neptune** → Managed graph database service on AWS.  
- Others include **TigerGraph**, **OrientDB**, etc.  

---

## ✅ Summary

- Graph databases use **nodes + edges** to represent **entities + relationships**.  
- They excel in workloads where relationships matter most:  
  - **Product recommendations**  
  - **Fraud detection**  
  - **Social networks**  
  - **Knowledge graphs**  

Compared to relational databases, graph databases avoid heavy joins and make traversing complex relationships **faster and more intuitive**.

# Vector Databases

With the rise of machine learning applications, **vector databases** have become increasingly important. They allow you to efficiently store, index, and query data based on **semantic similarities**, enabling use cases like recommendations, anomaly detection, and information retrieval.

---

## 📊 What is Vector Data?

![Vector Database](./images/vector_database.png)

- **Vector data** → Consists of numerical values arranged in arrays (e.g., rainfall data, unrolled image pixels).  
- **Vector embeddings** → Represent semantic meaning of items like text, documents, or images.  
  - Original data → passed through an **ML model** → transformed into **vector embeddings**.  
  - These embeddings are stored in a **vector database**.  

✅ Benefits:  
- Makes it faster to **find and retrieve similar items**.  
- Example: Instead of comparing raw text, embeddings capture semantic similarity → enabling **search by meaning**.

---

## 📏 Distance Metrics

Vector databases determine similarity using **distance metrics**.

![Distance Metric](./images/vectore_distance_metric.png)

Common distance measures:  
- **Euclidean Distance** → Straight-line distance between two vectors.  
- **Cosine Distance** → Based on the angle between two vectors (1 minus cosine similarity); captures directional similarity regardless of magnitude.  
- **Manhattan Distance** → Sum of absolute differences along each axis, useful in some sparse data scenarios.  
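
The three metrics written out in plain Python, as a minimal sketch. The vectors `u` and `v` are arbitrary examples, and cosine distance is taken here as 1 minus cosine similarity.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cos(angle between a and b); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u, v = [3.0, 4.0], [6.0, 8.0]
print(euclidean(u, v), manhattan(u, v), cosine_distance(u, v))
```

`u` and `v` point in the same direction, so their cosine distance is zero even though their Euclidean and Manhattan distances are not.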

---

## 🔎 Similarity Search

![Vector ANN Algorithm](./images/vector_ann_algo.png)

### 🟦 K-Nearest Neighbors (KNN)
- Calculates the distance of a query vector to **all embeddings** in the database.  
- Returns the **k most similar items**.  
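
A brute-force version of this search, as a sketch. The 2-D embeddings are invented for readability (real embeddings usually have hundreds of dimensions), and Euclidean distance is just one possible metric.

```python
import heapq
import math

# Hypothetical 2-D embeddings keyed by item name.
embeddings = {
    "cat": [0.9, 0.1],
    "kitten": [0.85, 0.2],
    "car": [0.1, 0.9],
    "truck": [0.15, 0.95],
}


def knn(query, k):
    """Exact search: measure the distance to every stored vector, keep the k closest."""
    scored = ((math.dist(query, vector), item) for item, vector in embeddings.items())
    return [item for _, item in heapq.nsmallest(k, scored)]


nearest = knn([0.88, 0.12], k=2)
print(nearest)
```

Every query touches every vector, so the cost grows linearly with the size of the collection.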

⚠️ **Problems with KNN**:  
- **Inefficient** for large datasets (needs to compute distance against every vector).  
- Suffers from the **curse of dimensionality** → in very high-dimensional spaces, distances become less meaningful.  

---

### 🟩 Approximate Nearest Neighbors (ANN)
- Instead of calculating the exact distance for all vectors, ANN **estimates the nearest neighbors** using probabilistic or graph-based index structures.  
- **Much faster** and more scalable than exact KNN.  
- Widely used in production-grade vector databases (e.g., Pinecone, Weaviate, Milvus) and similarity-search libraries (e.g., FAISS).  

✅ ANN trades **a little accuracy** for **huge performance gains**, making it the preferred method for **real-world similarity search**.  

---

## ✅ Summary

- **Vector databases** store embeddings that represent the semantic meaning of data.  
- They use **distance metrics** to measure similarity.  
- **KNN** → Accurate but slow, not scalable for big data.  
- **ANN** → Fast and efficient, widely used in vector databases for similarity search.  

🚀 In practice, ANN algorithms (e.g., HNSW — Hierarchical Navigable Small World) power real-world applications in **recommendations, fraud detection, semantic search, and generative AI retrieval systems**.

# Neo4j and the Property Graph Model

Graph databases like **Neo4j** allow us to represent and query data as **graphs**, making it easier to model complex relationships compared to traditional relational databases. Neo4j uses the **Property Graph Model**, which consists of:

- **Nodes (entities)** → e.g., Customer, Order, Product, Supplier, Category  
- **Relationships (edges)** → e.g., PURCHASED, ORDERS, SUPPLIES, PART_OF  
- **Properties (key-value pairs)** → metadata associated with nodes and relationships  

---

## 🔹 Basic Property Graph Model

Each node has a **label** (type of entity), and relationships describe how nodes are connected.  

![Property Graph Model](./images/property_graph_model.png)

---

## 🔹 Example with Real Data

Here’s how actual data fits into the model:  
- Customers (pink) are linked to Orders (green) by `PURCHASED`  
- Orders (green) are linked to the Products (blue) they contain by `ORDERS`  
- Products are `PART_OF` Categories (red), and Suppliers (brown) are linked to Products by `SUPPLIES`  

![Graph Example](./images/property_graph_model_.png)

---

## 🔹 Node Properties

Nodes can have multiple descriptive properties. For example:  

- **Product Properties:** productID, productName, unitPrice, unitsInStock  
- **Customer Properties:** address, city, companyName, customerID  
- **Supplier Properties:** address, contactName, supplierID  
- **Order Properties:** orderID, orderDate, shipAddress  

![Node Properties](./images/graph_properties.png)

---

## 🔹 Relationship Properties

Not only nodes but also **relationships** can store properties.  
For example, the `ORDERS` relationship may have:  
- **discount**  
- **quantity**  
- **unitPrice**  

![Relationship Properties](./images/relation_ship_properties.png)
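
The model (labelled nodes, typed relationships, properties on both) can be mirrored with two small dataclasses. This is an illustrative in-memory sketch, not how Neo4j stores data; the property values are made up, and the direction of `ORDERS` from order to product is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


order = Node("Order", {"orderID": 10248, "shipAddress": "1 Example Street"})
product = Node("Product", {"productID": 11, "productName": "Example Cheese", "unitPrice": 14.0})

# The ORDERS relationship connects the order to the product and carries its own properties.
orders = Relationship(
    "ORDERS", order, product, {"quantity": 12, "unitPrice": 14.0, "discount": 0.0}
)

props = orders.properties
line_total = props["quantity"] * props["unitPrice"] * (1 - props["discount"])
print(orders.start.label, f"-[{orders.type}]->", orders.end.label, line_total)
```

Keeping `quantity` and `discount` on the relationship rather than on either node reflects that they describe this particular order line, not the order or the product in general.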

---

## 🔹 Creating a Graph Database in Neo4j

You can create a graph in **Neo4j** by providing:  
1. **Instructions** describing the graph model (nodes, relationships, properties)  
2. **CSV files** containing the data  

Neo4j ingests this data and creates the graph, after which you can query it using **Cypher Query Language**.

![Creating Graph Database](./images/creating_neo4j_database.png)

---

## 🔹 Key Takeaways

- Neo4j follows the **Property Graph Model** (Nodes + Relationships + Properties).  
- **Nodes** carry **labels**, **relationships** carry a **type**, and both can have **properties**.  
- Relationships are **directional** and define how entities interact.  
- **Cypher**, Neo4j's query language, is used to create, query, and manage data.  
