# 📚 Table of Contents

- [Source Systems for Data Ingestion](#source-systems-for-data-ingestion)
- [Understanding Relational Databases vs One Big Table](#understanding-relational-databases-vs-one-big-table)

##  Source Systems for Data Ingestion

As a data engineer, one of your core responsibilities is to extract **raw data** from different source systems. This raw data can be **structured**, **semi-structured**, or **unstructured**, and it needs to be ingested and processed downstream.

---

### 🧱 Types of Data

There are **three main categories** of data based on structure:

- **Structured Data**:  
  Data organized as tables of rows and columns.  
  *Example*: SQL tables, CSV files.

- **Semi-Structured Data**:  
  Data not in strict tabular form but still containing structure like tags or keys.  
  *Example*: JSON, XML.

- **Unstructured Data**:  
  Data with no predefined structure.  
  *Example*: Text, audio, video, images.

![Types of Data](./images/types_of_data.png)

---

### 🔑 Example: Semi-Structured Data (JSON Format)

A common example of semi-structured data is **JSON (JavaScript Object Notation)**, which stores information as a collection of key-value pairs. These can also be **nested**, allowing complex data structures.

![Semi-Structured JSON Example](./images/semi_structured.png)

---
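As a rough illustration of the nested key-value structure described above, the sketch below parses a small JSON string with Python's standard `json` module. The record and its field names (`address`, `favorite_bands`, `phone`) are hypothetical and only chosen for the example.

```python
import json

# A hypothetical user-profile record; the field names are illustrative only.
raw = """
{
  "id": 101,
  "name": "Jane Doe",
  "address": {"city": "Pune", "zip": "411001"},
  "favorite_bands": ["Band A", "Band B"]
}
"""

profile = json.loads(raw)                  # JSON text -> Python dicts and lists

city = profile["address"]["city"]          # nested object access
first_band = profile["favorite_bands"][0]  # nested array access
# Semi-structured records may lack a key, so .get() with a default is safer.
phone = profile.get("phone", "unknown")

print(city, first_band, phone)
```

Because keys can be present in one record and absent in another, code that reads semi-structured data usually has to handle missing fields explicitly, as the `.get()` call does here.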

### 🗃️ Where Data is Stored

Depending on the structure, data can be stored in various mediums:

- **Structured / Semi-Structured**: Stored in relational or NoSQL databases.
- **Unstructured**: Stored as files (text, images, audio, etc.).
- **Streaming**: Real-time events from producers like sensors or logs.

![Source Systems](./images/source_systems.png)

---

### 🗄️ Relational vs Non-Relational Databases

- **Relational Databases (SQL)**:  
  Store data in fixed tables (rows and columns). Best for structured data.

- **Non-Relational Databases (NoSQL)**:  
  Store data in key-value pairs or documents. Good for semi-structured or nested data.

![Databases](./images/databases.png)

---

### 🔁 Putting It All Together: Source System Ingestion

Whether from **databases**, **files**, or **streaming systems**, all types of data eventually flow into the **ingestion pipeline**. These source systems are the starting point of the data engineering lifecycle.

![Source System Ingestion Overview](./images/source_system_ingestion.png)

## Understanding Relational Databases vs One Big Table

In data engineering, how you **structure your data** plays a critical role in ensuring **data integrity, efficiency, and scalability**. Let's explore the two most common approaches to storing data in a tabular format:

---

### 📦 One Big Table (OBT) Approach

The **One Big Table** approach stores all information — customer, product, order — in one single massive table. While it may look simple, it leads to **serious problems** like:

- 🔁 **Redundancy**: The same customer/product info is repeated across multiple rows.  
- ⚠️ **Inconsistency**: If a customer or product info changes, you have to manually update **every** row — or risk inconsistencies.  
- 🐢 **Slow Updates**: Changing one customer's or product's details means finding and rewriting every row that repeats them, which can become a performance bottleneck as the table grows.

![One Big Table with Inconsistency](./images/inconsistancey_rd.png)

> In the table above, the same customer “Jane Doe” has inconsistent addresses across rows. Similarly, the product SKU “w31” is suddenly changed to “w40” in one row, which can lead to reporting and transactional issues.

---
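The inconsistency problem in the One Big Table section can be checked for programmatically. The following is a minimal sketch, assuming the table has already been loaded as a list of dictionaries; the rows, addresses, and column names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a "one big table": every order repeats the customer details.
orders = [
    {"order_id": 1, "customer": "Jane Doe", "address": "12 Oak St", "sku": "w31"},
    {"order_id": 2, "customer": "Jane Doe", "address": "98 Elm Ave", "sku": "w31"},
    {"order_id": 3, "customer": "John Roe", "address": "5 Pine Rd", "sku": "k20"},
]

# Collect every distinct address seen for each customer.
addresses_by_customer = defaultdict(set)
for row in orders:
    addresses_by_customer[row["customer"]].add(row["address"])

# A customer with more than one address is a candidate inconsistency.
inconsistent = {
    customer: sorted(addresses)
    for customer, addresses in addresses_by_customer.items()
    if len(addresses) > 1
}
print(inconsistent)
```

A check like this only flags candidates: a customer may genuinely have moved, which is exactly why storing the address once in a separate table is less ambiguous.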

### 🧱 Relational Database Model

Instead of putting everything in one table, **relational databases** split the data into **multiple related tables** using **primary keys** and **foreign keys**.

This has several advantages:

- ✅ Little or no duplicate data  
- ✅ Easier updates and maintenance  
- ✅ Ensures **data integrity**  
- ✅ Follows a **database schema** (predefined structure)

![Relational Table Schema](./images/rd.png)

---

### 🔐 Primary Keys and Foreign Keys

- **Primary Key**: A column (or set of columns) that uniquely identifies each record in a table.  
- **Foreign Key**: A reference to a primary key in another table, establishing relationships between tables.

This is how a relational schema looks for an e-commerce platform:

- `Customers` table → info about each customer  
- `Products` table → info about each product  
- `Orders` table → each purchase linked by customer & product IDs

![Advantages of Relational Databases](./images/adv_rd.png)

---
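The e-commerce schema described in the primary and foreign key section can be sketched with Python's built-in `sqlite3` module. The column names (`customer_id`, `product_id`, `sku`) and the sample rows are assumptions made for this example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE Products (
    product_id INTEGER PRIMARY KEY,
    sku        TEXT NOT NULL
);
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),
    product_id  INTEGER NOT NULL REFERENCES Products(product_id)
);
INSERT INTO Customers VALUES (1, 'Jane Doe');
INSERT INTO Products  VALUES (10, 'w31');
INSERT INTO Orders    VALUES (100, 1, 10);
""")

# Follow the foreign keys with joins to rebuild the full order.
rows = conn.execute("""
    SELECT o.order_id, c.name, p.sku
    FROM Orders o
    JOIN Customers c ON c.customer_id = o.customer_id
    JOIN Products  p ON p.product_id  = o.product_id
""").fetchall()
print(rows)

# An order that points at a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (101, 999, 10)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

The customer's name lives in exactly one row of `Customers`, so changing it is a single-row update no matter how many orders reference it.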

### 💻 RDBMS Software

A **Relational Database Management System (RDBMS)** is a software layer that sits on top of the database to help you manage and query it using **SQL (Structured Query Language)**.

Popular RDBMS systems:

- **MySQL**  
- **PostgreSQL**  
- **Oracle Database**  
- **SQL Server**

These tools power many **enterprise applications**, from e-commerce platforms to banking systems.

![RDBMS Software](./images/rdbms.png)

---

### 🧠 When to Use What?

| Feature                    | One Big Table                          | Relational Database                      |
|---------------------------|----------------------------------------|------------------------------------------|
| 🔁 Redundancy             | High                                   | Low                                      |
| ⚠️ Inconsistency Risk     | High                                   | Low                                      |
| ⚙️ Update Performance     | Poor (slow)                            | Optimized                                |
| 🔗 Relationship Handling   | Difficult                              | Natural with foreign keys                |
| 🧪 Use Case               | Analytics and ML on pre-joined data (OLAP) | Complex transactional systems (OLTP)     |
| 🚀 Query Speed (Simple)   | Fast for flat, large-scale datasets    | May require joins, slightly slower       |
| 💾 Storage Cost           | High due to duplicates                 | Lower, efficient structure               |

---

### ✅ Summary
- **OLTP**: Online Transaction Processing
- **OLAP**: Online Analytical Processing

- Use **Relational Databases** when your system needs **data consistency, transactional reliability**, and **normalized structure**.  
- Use **One Big Table** (OBT) if you're optimizing for **query performance** in read-heavy scenarios, especially in data lakes or analytics.

You’ll often **ingest relational data** from OLTP systems, but later denormalize it (into OBT) during transformation for analytics or machine learning pipelines.
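
The denormalization step mentioned above can be sketched in plain Python. This is a simplified illustration that assumes the relational tables are already in memory as dictionaries keyed by ID; the table contents and column names are hypothetical.

```python
# Hypothetical normalized data: lookup tables keyed by their primary keys.
customers = {1: {"name": "Jane Doe", "city": "Pune"}}
products = {10: {"sku": "w31", "price": 500}}
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": 10},
    {"order_id": 101, "customer_id": 1, "product_id": 10},
]

# Denormalize: copy the customer and product columns onto every order row.
one_big_table = [
    {
        "order_id": order["order_id"],
        **customers[order["customer_id"]],
        **products[order["product_id"]],
    }
    for order in orders
]

for row in one_big_table:
    print(row)
```

Each output row now repeats the customer and product details, which is the redundancy that OBT accepts in exchange for join-free reads.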

> 🧠 As storage becomes cheaper and time becomes more expensive, data engineers often use **One Big Table (OBT)** approaches for analytical workloads.

# 🗃️ NoSQL Databases Explained

---

## ❓ What is NoSQL and Why Was It Needed?

In the early 2000s, tech companies like **Google** and **Amazon** were growing fast. They had to handle huge amounts of data — much of it was messy, came from many different sources, and didn’t fit into neat rows and columns like traditional **SQL** databases expected.

As data became more complex and unstructured (like logs, clicks, social media posts, etc.), relational databases started showing limitations:

- Difficult to scale out horizontally (across many servers)
- Required fixed schemas
- Slower for huge data writes and reads

So, these companies helped invent a new type of database — **NoSQL**, which means "**Not Only SQL**". It wasn't built to replace SQL but to handle situations where SQL struggles.

---

## 🧱 What Makes NoSQL Special?

NoSQL databases do **not use traditional tabular structures**. Instead, they support:

- **Key-Value** format
- **Document** format
- **Wide-Column** format
- **Graph** structures

This makes them more flexible for storing data in real-world formats — such as user profiles, sensor data, logs, and web activity.

📸 *Visual: NoSQL Structures*

![NoSQL Structures](./images/no_sql_structures.png)

---

## 📝 Real-World Example — Document Database

Let’s say you’re building a music app and need to store user profiles. In SQL, you’d need multiple tables: users, bands, user_band_links, etc. But in a **NoSQL document database**, you can store everything about a user in one document.

That makes it:

- Faster to retrieve complete data
- Easier to scale
- Flexible to change (you don’t need to update table schemas)

📸 *Visual: Document DB vs Relational DB*

![Document DB Comparison](./images/document_database_2.png)

---

## 🧩 Internals of a Document Store

Each user record is stored as a **document**. These documents are grouped into a **collection** (like a table).

📸 *Visual: Collection and Document Structure*

![Document Internals](./images/document_database.png)

- A document is like a row
- Each has a unique key (like `id`)
- Flexible structure — one document can have a different structure than another

---
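To make the collection and document idea concrete, here is a minimal sketch that models a collection as a Python list of dictionaries. It does not use a real document database; the `find_by_id` helper and the field names are hypothetical.

```python
# A "collection" modeled as a list of documents (plain dicts).
users = [
    {"id": 1, "name": "Alice", "favorite_bands": ["Band A", "Band B"]},
    {"id": 2, "name": "Bob", "country": "India"},  # no favorite_bands, extra country field
]

def find_by_id(collection, doc_id):
    """Return the first document whose 'id' matches, or None if there is none."""
    return next((doc for doc in collection if doc["id"] == doc_id), None)

alice = find_by_id(users, 1)
bob = find_by_id(users, 2)

print(alice["favorite_bands"])
print(bob.get("favorite_bands", []))  # a missing field needs a default
```

Both documents live in the same collection even though their fields differ, which is what "flexible structure" means in practice.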

## 🚀 What SQL Can’t Do Easily (but NoSQL Can)

| Feature                  | SQL                       | NoSQL                            |
|--------------------------|---------------------------|----------------------------------|
| Fixed Schema             | Yes                       | ❌ Flexible schema               |
| Horizontal Scalability   | Difficult                 | ✅ Easy with sharding            |
| Data Format              | Tabular only              | ✅ JSON-like, Key-Value, Graph   |
| Fast Writes & Reads      | Slower at massive scale   | ✅ Designed for speed            |

**Example Use Case**:
An e-commerce website tracking user activity like page views, clicks, cart updates, and more. These interactions happen every second. NoSQL helps by **quickly writing this data** and letting engineers **query it later** for analytics or ML.

---

## ⚖️ Eventual Consistency vs Strong Consistency

Relational databases use **strong consistency**:

- You can **only read** data after **all servers are updated**
- ✅ Always correct  
- ❌ Slower (has to wait)

NoSQL often uses **eventual consistency**:

- You can **read from one server** before others are updated
- ❌ Might get old data briefly
- ✅ But it’s **fast** and **available**

📌 **Real-World Analogy**:

Imagine you post a comment on Instagram. You see it instantly, but your friend may see it a second later. That’s **eventual consistency** — good enough for most real-time systems where speed matters more than perfection.

📸 *Visual: ACID Compliance & MongoDB*

![ACID Properties](./images/acid.png)

Some NoSQL databases offer ACID guarantees as well (MongoDB, for example, supports multi-document ACID transactions), but using them usually costs some of the speed and scalability that motivated NoSQL in the first place.

---
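The difference between reading stale and fresh data in this section can be simulated with a toy model. The sketch below is not how any real database replicates data; the `EventuallyConsistentStore` class and its `sync()` method are hypothetical stand-ins for a primary server, a replica, and background replication.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on the primary at once and reach the replica on sync()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def sync(self):
        # Stand-in for background replication catching up.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("comment:1", "Nice photo!")

stale = store.read_from_replica("comment:1")  # replica has not caught up yet
store.sync()
fresh = store.read_from_replica("comment:1")  # replica now agrees with the primary

print(stale, fresh)
```

A strongly consistent system would, in effect, refuse to answer the first read until the replica had caught up, trading latency for correctness.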

## ✅ Summary

- NoSQL databases were designed to handle **big, messy, fast-moving data**.
- They offer **schema flexibility**, **horizontal scaling**, and, at very large scale, often **faster reads and writes** than a traditional SQL setup.
- They trade **strong consistency** for **eventual consistency** in many cases — which is often a good trade for **user-facing web apps, real-time systems, and distributed services**.

## 🔥 Understanding ACID Properties in Databases

Relational databases are **ACID compliant**, which means they follow 4 important rules:  
**Atomicity**, **Consistency**, **Isolation**, and **Durability**.

These rules help keep your data safe and reliable — especially when many users or systems are working with the same data at the same time.

---

### 📊 ACID vs NoSQL

|                  | Relational DBs     | NoSQL DBs              |
|------------------|--------------------|------------------------|
| ACID Compliant   | ✅ Yes              | ❌ Not by default       |

![ACID Compliance](./images/acid_compliance.png)

![ACID Definition](./images/acid_definition.png)

---

### ⚛️ 1. Atomicity

**Meaning**: All operations in a transaction must succeed or none should happen.  
If one part fails, the whole transaction is rolled back.

**Example**:  
A user tries to place an online order:
- Step 1: Deduct ₹500 from their wallet  
- Step 2: Add the product to the orders table  

If step 2 fails (e.g., database error), step 1 will also be undone. The user won’t lose money without an order.

---
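The wallet-and-order example for atomicity can be reproduced with `sqlite3`. This is a minimal sketch: the `wallets` and `orders` tables are hypothetical, and the second step is made to fail on purpose by violating a `NOT NULL` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wallets (user_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, product TEXT NOT NULL);
INSERT INTO wallets VALUES (1, 1000);
""")

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        # Step 1: deduct 500 from the wallet.
        conn.execute("UPDATE wallets SET balance = balance - 500 WHERE user_id = 1")
        # Step 2 fails on purpose: product is NOT NULL, so inserting NULL raises an error.
        conn.execute("INSERT INTO orders (user_id, product) VALUES (1, NULL)")
except sqlite3.IntegrityError:
    pass  # the whole transaction, including step 1, has been rolled back

balance = conn.execute("SELECT balance FROM wallets WHERE user_id = 1").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(balance, order_count)
```

The balance is back at its starting value and no order exists, so the user has not paid for something they did not get.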

### ✅ 2. Consistency

**Meaning**: After a transaction, the data should follow all the rules of the database.

**Example**:  
Let’s say your app has a rule:  
**A user’s age must be between 0 and 120.**  

If someone tries to insert `age = 900`, the transaction will fail. This keeps the database **valid and clean**.

---
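The age rule from the consistency example can be expressed as a `CHECK` constraint. The sketch below uses `sqlite3` with a hypothetical `users` table; other relational databases support similar constraints with their own syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        age     INTEGER NOT NULL CHECK (age BETWEEN 0 AND 120)
    )
""")

conn.execute("INSERT INTO users (age) VALUES (30)")  # valid, accepted

try:
    conn.execute("INSERT INTO users (age) VALUES (900)")  # violates the rule
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

ages = [row[0] for row in conn.execute("SELECT age FROM users")]
print(rejected, ages)
```

Because the database enforces the rule itself, no application bug can leave an invalid age in the table.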

### 🧩 3. Isolation

**Meaning**: Transactions happen independently. Even if two people order something at the same time, the database makes the result look as if the transactions had run **one after another**, so they cannot interfere with each other.

**Example**:  
- User A adds a product to cart and places an order  
- At the exact same time, User B also places an order for the same item  

Even if they both acted simultaneously, the database ensures proper stock adjustment without mixing the transactions.

---

### 💾 4. Durability

**Meaning**: Once a transaction is complete, it’s **saved forever**, even if there's a system crash.

**Example**:  
A payment is successful and the order is confirmed.  
Even if the server goes down after that, the order won’t disappear, because the change was written to durable storage (such as disk) before the commit was confirmed.

---

### 🧠 Summary

- **Atomicity** → All or nothing  
- **Consistency** → Follow the rules  
- **Isolation** → No interference between concurrent transactions  
- **Durability** → Saved forever  

These principles are **critical** for financial apps, inventory systems, or any place where **reliable data** is a must.

#  Understanding Amazon DynamoDB with Python (Boto3 SDK)

---

## 📦 What is DynamoDB?

Amazon DynamoDB is a fully managed NoSQL key-value database offered by AWS that delivers single-digit millisecond performance at any scale.

- It is serverless, highly scalable, and highly available.
- Commonly used in real-time applications like IoT, gaming, recommendation engines, etc.

---

## 🗝️ Key-Value Storage

In DynamoDB, each table stores items (rows), and each item is a set of attributes (columns).

- Each item is uniquely identified using a primary key.
- Key = Attribute that uniquely identifies an item.
- Value = The actual contents of that attribute.

💡 Terminology note: table diagrams often say “the key is 101”, but strictly speaking `PersonID` is the key (attribute) name and `101` is its value.

---

### 🏫 Analogy: School ID System

A student's ID card has: ID = 123, Name = Alice, Grade = A

In this case:
- Key = ID (like PersonID)
- Value = 123
- All other details are attributes.

---

### 📸 Visual: Key-Value Table (Simple Primary Key)

![Simple Primary Key](./images/dynamodb.png)

- Table Name: Person  
- Primary Key: PersonID  
- Attributes: FirstName, LastName, Phone, Country, FavoriteBands

---

## 🧩 Composite Primary Key

Sometimes one key is not enough!

DynamoDB allows composite primary keys consisting of:
- Partition Key (aka Hash Key)
- Sort Key (aka Range Key)

This allows grouping multiple related items under one partition.

🧠 Analogy:  
Partition Key = OrderID (e.g., 1234)  
Sort Key = ItemNum (e.g., Item1, Item2)

---

### 📸 Visual: Composite Primary Key

![Composite Primary Key](./images/dynamodb_2.png)

---
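A composite primary key is declared when the table is created. The sketch below only builds the keyword arguments for Boto3's `create_table` call and does not contact AWS; the table name `Orders` and the attribute types are assumptions based on the OrderID / ItemNum analogy above.

```python
# Keyword arguments for the DynamoDB client's create_table call.
# The table name and attribute types are hypothetical.
create_table_kwargs = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "OrderID", "AttributeType": "N"},  # N = Number
        {"AttributeName": "ItemNum", "AttributeType": "S"},  # S = String
    ],
    "KeySchema": [
        {"AttributeName": "OrderID", "KeyType": "HASH"},   # partition key
        {"AttributeName": "ItemNum", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# With AWS credentials configured, the table could then be created with:
#   import boto3
#   boto3.client("dynamodb").create_table(**create_table_kwargs)

key_roles = {k["AttributeName"]: k["KeyType"] for k in create_table_kwargs["KeySchema"]}
print(key_roles)
```

Only the key attributes are declared up front; every other attribute can vary from item to item.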

## 🧱 DynamoDB is Schema-less

Each item in a DynamoDB table can have a different set of attributes. Apart from the primary key attributes, you do not need to pre-define a schema as you would in a relational database.

💡 This flexibility makes DynamoDB great for agile and fast-moving applications, as your data model can evolve over time.

---

## ⚙️ CRUD Operations in DynamoDB using Boto3

![Boto3 SDK](./images/boto3.png)

### 🔵 What is Boto3?

- Boto3 is the AWS SDK for Python.
- Allows you to create, configure, and manage AWS services using Python.
- You can perform all CRUD operations on DynamoDB using Boto3.

---

### 📊 CRUD Operations Summary

| Operation | Boto3 Method             | Description                                      |
|-----------|--------------------------|--------------------------------------------------|
| Create    | create_table, put_item   | Create a new table; add a new item               |
| Read      | get_item, query, scan    | Read one item, a key range, or the whole table   |
| Update    | update_item              | Modify an item                                   |
| Delete    | delete_item              | Delete an item                                   |

---
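The item-level operations from the CRUD summary might look like the following. To stay runnable without AWS access, this sketch only builds the request dictionaries and shows the client calls as comments; the `Person` table comes from the earlier diagram, while the attribute values are made up.

```python
TABLE = "Person"
KEY = {"PersonID": {"N": "101"}}  # the low-level API sends numbers as strings

put_kwargs = {
    "TableName": TABLE,
    "Item": {**KEY, "FirstName": {"S": "Alice"}, "Country": {"S": "India"}},
}
get_kwargs = {"TableName": TABLE, "Key": KEY}
update_kwargs = {
    "TableName": TABLE,
    "Key": KEY,
    # #c and :c are placeholders for the attribute name and the new value.
    "UpdateExpression": "SET #c = :c",
    "ExpressionAttributeNames": {"#c": "Country"},
    "ExpressionAttributeValues": {":c": {"S": "Canada"}},
}
delete_kwargs = {"TableName": TABLE, "Key": KEY}

# With credentials configured, each dict is unpacked into the matching client method:
#   import boto3
#   client = boto3.client("dynamodb")
#   client.put_item(**put_kwargs)        # Create
#   client.get_item(**get_kwargs)        # Read
#   client.update_item(**update_kwargs)  # Update
#   client.delete_item(**delete_kwargs)  # Delete

print(sorted(update_kwargs))
```

Every operation identifies the item through the same `Key`, which is why choosing the primary key well matters so much in DynamoDB.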

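## 🧬 Understanding `**kwargs` in Python (used in Boto3)

### What is `**kwargs`?

- In a function definition, `**kwargs` collects any extra keyword (named) arguments into a dictionary.
- In a function call, `**some_dict` does the reverse: it unpacks the dictionary's keys and values into separate keyword arguments. Boto3 calls are often written this way.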

Example:
```python
def my_function(**kwargs):  
    print(kwargs)

my_function(name="Alice", age=30)  
# Output: {'name': 'Alice', 'age': 30}
```
🧠 It’s like saying: “I don’t know how many named arguments I’ll get, just pass them all!”

---
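The `**` operator also works in the other direction, at the call site. The short sketch below uses two hypothetical functions, `describe` and `collect`, to show a dictionary being unpacked into keyword arguments and then collected again.

```python
def describe(name, age):
    return f"{name} is {age}"

person = {"name": "Alice", "age": 30}

# Unpacking: equivalent to describe(name="Alice", age=30)
sentence = describe(**person)
print(sentence)

def collect(**kwargs):
    # Collecting: the keyword arguments arrive as a new dict
    return kwargs

round_trip = collect(**person)
print(round_trip == person)
```

This is the pattern used when a dictionary of request parameters is prepared first and then passed to a Boto3 method with `**`.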

## 🧑‍💻 Parameters vs Arguments vs Attributes

- Parameter: A variable in a function definition.
- Argument: Actual value passed to that function.
- Keyword Argument: Argument passed with a name (e.g., name='Alice')
- Attributes: Variables belonging to an object.

Example:
```python
class Person:  
    def __init__(self, name, age):        # name & age = parameters  
        self.name = name                  # name = attribute  
        self.age = age                    # age = attribute  
```
---

### ☕️ Analogy: Coffee Machine

- Object = Coffee Machine  
- Attributes = Temperature, Size  
- Parameter = `coffee_type` in `brew(coffee_type)`  
- Argument = `'espresso'` in `brew(coffee_type='espresso')`  
- When you call the method, the argument passed for `coffee_type` decides how the internal attributes are used.

---
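The coffee machine analogy can be written out as a small class. This is only an illustration: the `CoffeeMachine` class, its attributes, and the `brew` method are hypothetical names taken from the analogy.

```python
class CoffeeMachine:
    def __init__(self, temperature, size):  # temperature and size are parameters
        self.temperature = temperature      # attributes stored on the object
        self.size = size

    def brew(self, coffee_type="filter"):   # coffee_type is a parameter with a default
        return f"{self.size} {coffee_type} brewed at {self.temperature} C"

machine = CoffeeMachine(92, "small")           # 92 and "small" are positional arguments
result = machine.brew(coffee_type="espresso")  # "espresso" is a keyword argument
default_result = machine.brew()                # no argument, so the default is used

print(result)
print(default_result)
```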

## 🧾 Loading JSON into DynamoDB

![Data Load Diagram](./images/data_dynamo.png)

---

### What is Parsing?

Parsing means reading data in one format (for example, JSON text) and converting it into a structure that a program can work with (for example, a Python dictionary).

---

### 🔄 Example: Parsing JSON into Dictionary
```python
import json

with open('forum.json', 'r') as file:  
    data = json.load(file)   # parsing JSON → Python dict
```
---

## 🚀 Inserting Items into DynamoDB

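Each record in the JSON file is wrapped in a `PutRequest`, and every attribute value carries a type tag:  
`S` = String, `N` = Number (the number itself is still written as a string, such as `"2"`).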

Example JSON:
```json
{
  "PutRequest": {
    "Item": {
      "Name": { "S": "Amazon DynamoDB" },
      "Category": { "S": "Amazon Web Services" },
      "Threads": { "N": "2" },
      "Messages": { "N": "4" },
      "Views": { "N": "1000" }
    }
  }
}
```
---
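If the source data is plain JSON without type tags, it has to be converted into the typed format shown above before it can be sent through the low-level client. The helper below is a hand-written sketch that handles only strings and numbers; `to_attribute_value` is a hypothetical name, not a Boto3 function.

```python
def to_attribute_value(value):
    """Wrap a plain Python value in DynamoDB's typed format (strings and numbers only)."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are sent as strings
    raise TypeError(f"unsupported type: {type(value).__name__}")

plain = {"Name": "Amazon DynamoDB", "Threads": 2, "Views": 1000}
item = {key: to_attribute_value(val) for key, val in plain.items()}
put_request = {"PutRequest": {"Item": item}}

print(put_request)
```

Boto3 also ships a `TypeSerializer` class in `boto3.dynamodb.types` that covers the full set of DynamoDB types, so a hand-rolled helper like this is mainly useful for understanding the format.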

![Forum Insert Visual](./images/forum_json.png)

---
## 📥 Full Code to Load JSON into DynamoDB

```python

import json  
import boto3

# Initialize DynamoDB client  
client = boto3.client('dynamodb')

# Load JSON data  
with open('forum.json', 'r') as f:  
    forum_data = json.load(f)

# Loop through each item and insert into DynamoDB  
for record in forum_data['Forum']:  
    item = record['PutRequest']['Item']  
    client.put_item(TableName='Forum', Item=item)
```
---