# 📚 Table of Contents

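- [Source Systems for Data Ingestion](#source-systems-for-data-ingestion)
- [Understanding Relational Databases vs One Big Table](#-understanding-relational-databases-vs-one-big-table)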

## Source Systems for Data Ingestion

As a data engineer, one of your core responsibilities is extracting **raw data** from different source systems. This raw data can be **structured**, **semi-structured**, or **unstructured**, and it must be ingested before it can be processed downstream.

---

### 🧱 Types of Data

There are **three main categories** of data based on structure:

- **Structured Data**:  
  Data organized as tables of rows and columns.  
  *Example*: SQL tables, CSV files.

- **Semi-Structured Data**:  
  Data not in strict tabular form but still containing structure like tags or keys.  
  *Example*: JSON, XML.

- **Unstructured Data**:  
  Data with no predefined structure.  
  *Example*: Text, audio, video, images.

![Types of Data](./images/types_of_data.png)
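
As a rough illustration of the three categories, the sketch below loads one tiny sample of each using only the Python standard library. The sample values and field names are made up for this example.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "order_id,customer,amount\n1,Jane Doe,19.99\n2,John Roe,5.00\n"
structured_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys give structure, but records may differ in shape.
json_text = '[{"id": 1, "tags": ["new"]}, {"id": 2, "address": {"city": "Austin"}}]'
semi_structured_records = json.loads(json_text)

# Unstructured: no predefined fields; any structure has to be inferred.
unstructured_text = "Customer called to say the package arrived late."

print(structured_rows[0]["customer"])      # access by column name
print(sorted(semi_structured_records[1]))  # keys vary from record to record
print(len(unstructured_text.split()))      # only generic operations apply
```

The structured rows can be addressed by column name, the JSON records must be inspected for the keys they happen to carry, and the free text offers nothing beyond generic string operations.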

---

### 🔑 Example: Semi-Structured Data (JSON Format)

A common example of semi-structured data is **JSON (JavaScript Object Notation)**, which stores information as a collection of key-value pairs. These can also be **nested**, allowing complex data structures.

![Semi-Structured JSON Example](./images/semi_structured.png)
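
A minimal sketch of reading nested JSON with Python's built-in `json` module follows. The order document and all of its field names are hypothetical and are not taken from the figure above.

```python
import json

# Hypothetical order document; the field names are illustrative only.
raw = """
{
  "order_id": 1001,
  "customer": {"name": "Jane Doe", "address": {"city": "Austin"}},
  "items": [
    {"sku": "w31", "quantity": 2},
    {"sku": "k12", "quantity": 1}
  ]
}
"""

order = json.loads(raw)

# Nested objects become dicts and arrays become lists.
city = order["customer"]["address"]["city"]
total_quantity = sum(item["quantity"] for item in order["items"])

# .get() returns None instead of raising when an optional key is absent.
coupon = order.get("coupon")

print(city, total_quantity, coupon)
```

Because keys can be missing from one record to the next, ingestion code for semi-structured data usually has to handle absent fields explicitly, as `order.get("coupon")` does here.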

---

### 🗃️ Where Data is Stored

Where data lives depends on its structure and on how it is produced:

- **Structured / Semi-Structured**: Typically stored in relational or NoSQL databases.
- **Unstructured**: Typically stored as files (text, images, audio, etc.).
- **Streaming**: Events that arrive continuously, in real time, from producers such as sensors or application logs.

![Source Systems](./images/source_systems.png)
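
To make the streaming case concrete, here is a small sketch in which a Python generator stands in for an event producer and a second generator consumes events one at a time. The sensor, its readings, and the function names are all hypothetical; a real pipeline would read from a streaming platform rather than an in-process generator.

```python
from typing import Iterator


def sensor_events() -> Iterator[dict]:
    """Hypothetical producer: yields one reading at a time, like a sensor feed."""
    readings = [21.5, 21.7, 22.1, 21.9]
    for seq, temperature in enumerate(readings):
        yield {"seq": seq, "sensor_id": "s-1", "temperature_c": temperature}


def running_max(events: Iterator[dict]) -> Iterator[float]:
    """Consumer: handles each event as it arrives and keeps only a running maximum."""
    current = float("-inf")
    for event in events:
        current = max(current, event["temperature_c"])
        yield current


maxima = list(running_max(sensor_events()))
print(maxima)
```

The consumer never holds the full dataset; it updates its state per event, which is the main difference from reading a table or a file in one batch.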

---

### 🗄️ Relational vs Non-Relational Databases

- **Relational Databases (SQL)**:  
  Store data in tables with a fixed set of columns. Best for structured data.

- **Non-Relational Databases (NoSQL)**:  
  Store data in more flexible forms such as key-value pairs or documents. Good for semi-structured or nested data.

![Databases](./images/databases.png)
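
The contrast can be sketched with an in-memory SQLite table on the relational side and a plain Python dict standing in for a document store. Both are stand-ins chosen for this example, and the table, keys, and values are hypothetical.

```python
import sqlite3

# Relational: columns are declared up front (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Jane Doe"))

try:
    # A column that is not part of the schema is rejected.
    conn.execute("INSERT INTO customers (id, name, nickname) VALUES (2, 'John Roe', 'JR')")
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True

# Document-style: a dict standing in for a key-value / document store.
document_store = {}
document_store["customer:1"] = {"name": "Jane Doe"}
document_store["customer:2"] = {"name": "John Roe", "nickname": "JR", "phones": ["555-0100"]}

print(schema_enforced, sorted(document_store["customer:2"]))
```

The table refuses data that does not fit its declared columns, while the document store accepts records of different shapes and leaves validation to the application.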

---

### 🔁 Putting It All Together: Source System Ingestion

Whether from **databases**, **files**, or **streaming systems**, all types of data eventually flow into the **ingestion pipeline**. These source systems are the starting point of the data engineering lifecycle.

![Source System Ingestion Overview](./images/source_system_ingestion.png)

## 🧩 Understanding Relational Databases vs One Big Table

In data engineering, how you **structure your data** plays a critical role in ensuring **data integrity, efficiency, and scalability**. This section compares two common approaches to storing data in tabular form.

---

### 📦 One Big Table (OBT) Approach

The **One Big Table** approach stores all information (customer, product, order) in a single wide table. It looks simple, but it leads to **serious problems**:

- 🔁 **Redundancy**: The same customer and product details are repeated across many rows.  
- ⚠️ **Inconsistency**: If a customer's or product's details change, **every** row that repeats them must be updated, or the table ends up holding conflicting values.  
- 🐢 **Slow Updates**: One logical change touches many rows, and without a suitable index, finding those rows means scanning the whole table.

![One Big Table with Inconsistency](./images/inconsistancey_rd.png)

> In the table above, the same customer “Jane Doe” has inconsistent addresses across rows. Similarly, the product SKU “w31” is suddenly changed to “w40” in one row, which can lead to reporting and transactional issues.
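
The kind of inconsistency described above can be detected with a short check. The rows below are hypothetical and only loosely mirror the figure; the column names are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical rows of a single wide table; values are illustrative.
one_big_table = [
    {"order_id": 1, "customer": "Jane Doe", "address": "12 Oak St", "sku": "w31"},
    {"order_id": 2, "customer": "Jane Doe", "address": "98 Elm Ave", "sku": "w31"},
    {"order_id": 3, "customer": "John Roe", "address": "5 Pine Rd", "sku": "k12"},
]

# Collect every distinct address recorded for each customer.
addresses = defaultdict(set)
for row in one_big_table:
    addresses[row["customer"]].add(row["address"])

# A customer with more than one address is a candidate inconsistency.
inconsistent = {
    name: sorted(addrs) for name, addrs in addresses.items() if len(addrs) > 1
}
print(inconsistent)
```

Nothing in the table itself prevents the second address from being written, which is why such checks have to be bolted on afterwards.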

---

### 🧱 Relational Database Model

Instead of putting everything in one table, **relational databases** split the data into **multiple related tables** using **primary keys** and **foreign keys**.

This has several advantages:

- ✅ Minimal duplicate data (each fact is stored once and referenced by key)  
- ✅ Easier updates and maintenance  
- ✅ Ensures **data integrity**  
- ✅ Follows a **database schema** (predefined structure)

![Relational Table Schema](./images/rd.png)
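
The "easier updates" advantage can be sketched with in-memory SQLite. The table and column names here are assumptions made for the example, not the schema from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sku TEXT);
INSERT INTO customers VALUES (1, 'Jane Doe', '12 Oak St');
INSERT INTO orders VALUES (100, 1, 'w31'), (101, 1, 'k12');
""")

# The address lives in exactly one row, so a single UPDATE is enough.
conn.execute(
    "UPDATE customers SET address = ? WHERE customer_id = ?", ("98 Elm Ave", 1)
)

# Every order sees the new address through the join.
rows = conn.execute("""
    SELECT o.order_id, c.address
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)
```

Because orders store only `customer_id`, there is no second copy of the address that could drift out of sync.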

---

### 🔐 Primary Keys and Foreign Keys

- **Primary Key**: A column (or set of columns) that uniquely identifies each record in a table.  
- **Foreign Key**: A column (or set of columns) that references the primary key of another table, establishing a relationship between the two tables.

This is how a relational schema looks for an e-commerce platform:

- `Customers` table → info about each customer  
- `Products` table → info about each product  
- `Orders` table → each purchase, linked to a customer and a product by their IDs

![Advantages of Relational Databases](./images/adv_rd.png)
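
A minimal sketch of this three-table schema in SQLite follows. The column names are assumptions, since the source only names the tables. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other RDBMSs enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# Hypothetical columns; only the table names come from the text above.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT
);
CREATE TABLE products (
    sku  TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    sku         TEXT    NOT NULL REFERENCES products(sku)
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Jane Doe', '12 Oak St')")
conn.execute("INSERT INTO products VALUES ('w31', 'Widget')")
conn.execute("INSERT INTO orders VALUES (100, 1, 'w31')")

try:
    # Customer 999 does not exist, so the foreign key rejects this order.
    conn.execute("INSERT INTO orders VALUES (101, 999, 'w31')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The database itself refuses an order that points at a missing customer, so this class of integrity error cannot reach downstream consumers.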

---

### 💻 RDBMS Software

A **Relational Database Management System (RDBMS)** is a software layer that sits on top of the database to help you manage and query it using **SQL (Structured Query Language)**.

Popular RDBMS systems:

- **MySQL**  
- **PostgreSQL**  
- **Oracle Database**  
- **Microsoft SQL Server**

These tools power many **enterprise applications**, from e-commerce platforms to banking systems.

![RDBMS Software](./images/rdbms.png)
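
As a small illustration of querying through SQL, the sketch below uses Python's built-in `sqlite3` module. SQLite is used only because it needs no server; it is a stand-in for the systems listed above, and the `orders` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Jane Doe", 20.0), ("Jane Doe", 5.0), ("John Roe", 12.5)],
)

# Declarative query: describe the result and let the engine decide how to compute it.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid SQL injection when the values come from outside the program.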

---

### 🧠 When to Use What?

| Feature                  | One Big Table                              | Relational Database                  |
|--------------------------|--------------------------------------------|--------------------------------------|
| 🔁 Redundancy            | High                                       | Low                                  |
| ⚠️ Inconsistency Risk    | High                                       | Low                                  |
| ⚙️ Update Performance    | Poor (one change touches many rows)        | Good (each fact is stored once)      |
| 🔗 Relationship Handling | Difficult                                  | Natural with foreign keys            |
| 🧪 Use Case              | Simple analytics, ML feature tables (OLAP) | Complex transactional systems (OLTP) |
| 🚀 Query Speed (Simple)  | Fast for flat, large-scale datasets        | May require joins, slightly slower   |
| 💾 Storage Cost          | High due to duplicates                     | Lower, efficient structure           |

---

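### ✅ Summary

- **OLTP**: Online Transaction Processing (many small reads and writes, such as placing an order)
- **OLAP**: Online Analytical Processing (large read-heavy queries, such as aggregating sales)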

- Use **Relational Databases** when your system needs **data consistency, transactional reliability**, and **normalized structure**.  
- Use **One Big Table** (OBT) if you're optimizing for **query performance** in read-heavy scenarios, especially in data lakes or analytics.

You’ll often **ingest relational data** from OLTP systems, but later denormalize it (into OBT) during transformation for analytics or machine learning pipelines.
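
The denormalization step can be sketched as a join that materializes one wide table. This example uses in-memory SQLite with hypothetical tables and values; in practice the transformation would run in whatever warehouse or processing engine the pipeline uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE products (sku TEXT PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, sku TEXT, quantity INTEGER
);

INSERT INTO customers VALUES (1, 'Jane Doe', 'Austin'), (2, 'John Roe', 'Boston');
INSERT INTO products VALUES ('w31', 'Widget', 4.0), ('k12', 'Kettle', 25.0);
INSERT INTO orders VALUES (100, 1, 'w31', 3), (101, 1, 'k12', 1), (102, 2, 'w31', 2);

-- Denormalize: one wide row per order, with customer and product columns copied in.
CREATE TABLE orders_obt AS
SELECT o.order_id, c.name AS customer_name, c.city,
       p.product_name, o.quantity, o.quantity * p.price AS line_total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN products  AS p ON p.sku = o.sku;
""")

# Analytical reads against the wide table need no joins.
revenue_by_city = conn.execute(
    "SELECT city, SUM(line_total) FROM orders_obt GROUP BY city ORDER BY city"
).fetchall()
print(revenue_by_city)
```

The join cost is paid once at transformation time, and the redundancy in `orders_obt` is acceptable because the table is rebuilt from the normalized source rather than updated in place.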

> 🧠 As storage becomes cheaper and time becomes more expensive, data engineers often use **One Big Table (OBT)** approaches for analytical workloads.