# 📚 Table of Contents

- [Source Systems for Data Ingestion](#source-systems-for-data-ingestion)
- [Understanding Relational Databases vs One Big Table](#understanding-relational-databases-vs-one-big-table)
- [nosql-databases-explained](#nosql-databases-explained)
- [acid-properties-in-databases](#acid-properties-in-databases)
- [amazon-dynamodb-with-python-boto3-sdk](#amazon-dynamodb-with-python-boto3-sdk)
- [object-storage](#object-storage)
- [reference for sql, dynamodb, s3 bucket practice lab](#note)
- [Logs](#logs)
- [streaming-terminology-and-architecture](#streaming-terminology-and-architecture)
- [connecting to systems](#connecting-to-systems)
- [IAM](#iam-identity-and-access-management)
- [basics-of-networking-in-cloud-aws](#basics-of-networking-in-cloud-aws)
- [aws networking basics vpc subnets ip cidr bits](#aws-networking-basics--vpc-subnets-ip-cidr-bits)
- [aws vpc setup](#aws-vpc-setup)
- [aws internet and nat gateway setup in a vpc](#aws-internet--nat-gateways-setup-in-a-vpc)

##  Source Systems for Data Ingestion

As a data engineer, one of the core responsibilities is to extract **raw data** from different source systems. This raw data can be **structured**, **semi-structured**, or **unstructured**, and needs to be ingested and processed downstream.

---

### 🧱 Types of Data

There are **three main categories** of data based on structure:

- **Structured Data**:  
  Data organized as tables of rows and columns.  
  *Example*: SQL tables, CSV files.

- **Semi-Structured Data**:  
  Data not in strict tabular form but still containing structure like tags or keys.  
  *Example*: JSON, XML.

- **Unstructured Data**:  
  Data with no predefined structure.  
  *Example*: Text, audio, video, images.

![Types of Data](./images/types_of_data.png)

---

### 🔑 Example: Semi-Structured Data (JSON Format)

A common example of semi-structured data is **JSON (JavaScript Object Notation)**, which stores information as a collection of key-value pairs. These can also be **nested**, allowing complex data structures.

![Semi-Structured JSON Example](./images/semi_structured.png)
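
As a minimal sketch of what "nested key-value pairs" means in practice, the snippet below parses a small JSON document with Python's standard `json` module. The record and its field names are made up for illustration.

```python
import json

# Hypothetical user record; the field names are made up for illustration.
raw = '''
{
  "user_id": 101,
  "name": "Jane Doe",
  "address": {"city": "Pune", "zip": "411001"},
  "favorite_bands": ["Band A", "Band B"]
}
'''

record = json.loads(raw)          # parse JSON text into a Python dict

# Nested values are reached by chaining keys and list indexes.
city = record["address"]["city"]
first_band = record["favorite_bands"][0]
print(city, first_band)
```

The nested `address` object and the `favorite_bands` list are what make this hard to fit into a single flat row.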

---

### 🗃️ Where Data is Stored

Depending on the structure, data can be stored in various mediums:

- **Structured / Semi-Structured**: Stored in relational or NoSQL databases.
- **Unstructured**: Stored as files (text, images, audio, etc.).
- **Streaming**: Real-time events from producers like sensors or logs.

![Source Systems](./images/source_systems.png)

---

### 🗄️ Relational vs Non-Relational Databases

- **Relational Databases (SQL)**:  
  Store data in fixed tables (rows and columns). Best for structured data.

- **Non-Relational Databases (NoSQL)**:  
  Store data in key-value pairs or documents. Good for semi-structured or nested data.

![Databases](./images/databases.png)

---

### 🔁 Putting It All Together: Source System Ingestion

Whether from **databases**, **files**, or **streaming systems**, all types of data eventually flow into the **ingestion pipeline**. These source systems are the starting point of the data engineering lifecycle.

![Source System Ingestion Overview](./images/source_system_ingestion.png)

## Understanding Relational Databases vs One Big Table

In data engineering, how you **structure your data** plays a critical role in ensuring **data integrity, efficiency, and scalability**. Let's explore the two most common approaches to storing data in a tabular format:

---

### 📦 One Big Table (OBT) Approach

The **One Big Table** approach stores all information — customer, product, order — in one single massive table. While it may look simple, it leads to **serious problems** like:

- 🔁 **Redundancy**: The same customer/product info is repeated across multiple rows.  
- ⚠️ **Inconsistency**: If a customer or product info changes, you have to manually update **every** row — or risk inconsistencies.  
- 🐢 **Slow Updates**: Updates require full-table scans and can cause performance bottlenecks.

![One Big Table with Inconsistency](./images/inconsistancey_rd.png)

> In the table above, the same customer “Jane Doe” has inconsistent addresses across rows. Similarly, the product SKU “w31” is suddenly changed to “w40” in one row, which can lead to reporting and transactional issues.

---

### 🧱 Relational Database Model

Instead of putting everything in one table, **relational databases** split the data into **multiple related tables** using **primary keys** and **foreign keys**.

This has several advantages:

- ✅ Minimal duplicate data (each fact is stored once)  
- ✅ Easier updates and maintenance  
- ✅ Ensures **data integrity**  
- ✅ Follows a **database schema** (predefined structure)

![Relational Table Schema](./images/rd.png)

---

### 🔐 Primary Keys and Foreign Keys

- **Primary Key**: A column (or set of columns) that uniquely identifies each record in a table.  
- **Foreign Key**: A reference to a primary key in another table, establishing relationships between tables.

This is how a relational schema looks for an e-commerce platform:

- `Customers` table → info about each customer  
- `Products` table → info about each product  
- `Orders` table → each purchase linked by customer & product IDs

![Advantages of Relational Databases](./images/adv_rd.png)
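
The three-table layout above can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumed column names (`customer_id`, `sku`, and so on), not the schema of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT
);
CREATE TABLE Products (
    product_id INTEGER PRIMARY KEY,
    sku        TEXT NOT NULL,
    price      REAL
);
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),
    product_id  INTEGER NOT NULL REFERENCES Products(product_id),
    quantity    INTEGER
);
INSERT INTO Customers VALUES (1, 'Jane Doe', '12 Elm St');
INSERT INTO Products  VALUES (10, 'w31', 19.99);
INSERT INTO Orders    VALUES (100, 1, 10, 2), (101, 1, 10, 1);
""")

# The address lives in exactly one row, so one UPDATE fixes every order.
conn.execute("UPDATE Customers SET address = '99 Oak Ave' WHERE customer_id = 1")

rows = conn.execute("""
    SELECT o.order_id, c.name, c.address, p.sku
    FROM Orders o
    JOIN Customers c ON c.customer_id = o.customer_id
    JOIN Products  p ON p.product_id  = o.product_id
    ORDER BY o.order_id
""").fetchall()
print(rows)

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (102, 999, 10, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

Because both orders reference the same customer row, the address change shows up consistently in every joined result.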

---

### 💻 RDBMS Software

A **Relational Database Management System (RDBMS)** is a software layer that sits on top of the database to help you manage and query it using **SQL (Structured Query Language)**.

Popular RDBMS systems:

- **MySQL**  
- **PostgreSQL**  
- **Oracle Database**  
- **SQL Server**

These tools power many **enterprise applications**, from e-commerce platforms to banking systems.

![RDBMS Software](./images/rdbms.png)

---

### 🧠 When to Use What?

| Feature                    | One Big Table                          | Relational Database                      |
|---------------------------|----------------------------------------|------------------------------------------|
| 🔁 Redundancy             | High                                   | Low                                      |
| ⚠️ Inconsistency Risk     | High                                   | Low                                      |
| ⚙️ Update Performance     | Poor (slow)                            | Optimized                                |
| 🔗 Relationship Handling   | Difficult                              | Natural with foreign keys                |
| 🧪 Use Case               | Simple analytics / ML joins (OLAP)           | Complex transactional systems (OLTP)     |
| 🚀 Query Speed (Simple)   | Fast for flat, large-scale datasets    | May require joins, slightly slower       |
| 💾 Storage Cost           | High due to duplicates                 | Lower, efficient structure               |

---

### ✅ Summary
- **OLTP**: Online Transaction Processing
- **OLAP**: Online Analytical Processing

- Use **Relational Databases** when your system needs **data consistency, transactional reliability**, and **normalized structure**.  
- Use **One Big Table** (OBT) if you're optimizing for **query performance** in read-heavy scenarios, especially in data lakes or analytics.

You’ll often **ingest relational data** from OLTP systems, but later denormalize it (into OBT) during transformation for analytics or machine learning pipelines.
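
A small sketch of that denormalization step, again using `sqlite3`. The table and column names (`orders_obt`, `customer_name`) are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO orders VALUES (100, 1, 20.0), (101, 1, 5.0), (102, 2, 7.5);

-- Denormalize: copy customer columns onto every order row.
CREATE TABLE orders_obt AS
SELECT o.order_id, o.amount, c.customer_id, c.name AS customer_name
FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
""")

# Analytical query on the flat table: no join needed at read time.
totals = conn.execute("""
    SELECT customer_name, SUM(amount)
    FROM orders_obt
    GROUP BY customer_name
    ORDER BY customer_name
""").fetchall()
print(totals)
```

The join cost is paid once during transformation, which is the trade the OBT approach makes in exchange for repeated customer columns.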

> 🧠 As storage becomes cheaper and time becomes more expensive, data engineers often use **One Big Table (OBT)** approaches for analytical workloads.

# NoSQL Databases Explained

---

## ❓ What is NoSQL and Why Was It Needed?

In the early 2000s, tech companies like **Google** and **Amazon** were growing fast. They had to handle huge amounts of data — much of it was messy, came from many different sources, and didn’t fit into neat rows and columns like traditional **SQL** databases expected.

As data became more complex and unstructured (like logs, clicks, social media posts, etc.), relational databases started showing limitations:

- Difficult to scale out horizontally (across many servers)
- Required fixed schemas
- Slower for huge data writes and reads

So, these companies helped invent a new type of database — **NoSQL**, which means "**Not Only SQL**". It wasn't built to replace SQL but to handle situations where SQL struggles.

---

## 🧱 What Makes NoSQL Special?

NoSQL databases do **not use traditional tabular structures**. Instead, they support:

- **Key-Value** format
- **Document** format
- **Wide-Column** format
- **Graph** structures

This makes them more flexible for storing data in real-world formats — such as user profiles, sensor data, logs, and web activity.

📸 *Visual: NoSQL Structures*

![NoSQL Structures](./images/no_sql_structures.png)

---

## 📝 Real-World Example — Document Database

Let’s say you’re building a music app and need to store user profiles. In SQL, you’d need multiple tables: users, bands, user_band_links, etc. But in a **NoSQL document database**, you can store everything about a user in one document.

That makes it:

- Faster to retrieve complete data
- Easier to scale
- Flexible to change (you don’t need to update table schemas)

📸 *Visual: Document DB vs Relational DB*

![Document DB Comparison](./images/document_database_2.png)

---

## 🧩 Internals of a Document Store

Each user record is stored as a **document**. These documents are grouped into a **collection** (like a table).

📸 *Visual: Collection and Document Structure*

![Document Internals](./images/document_database.png)

- A document is like a row
- Each has a unique key (like `id`)
- Flexible structure — one document can have a different structure than another
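
To make the collection/document idea concrete, here is a plain-Python sketch that models a collection as a dict of documents. It is not the API of any real document database; the data and the `get_document` helper are hypothetical.

```python
# A "collection" sketched as a dict keyed by document id (hypothetical data).
users = {
    "u1": {"id": "u1", "name": "Alice", "favorite_bands": ["Band A", "Band B"]},
    # This document has a field the first one lacks, and omits another.
    "u2": {"id": "u2", "name": "Bob", "country": "IN"},
}

def get_document(collection, doc_id):
    """Fetch one whole document by its key, with no joins involved."""
    return collection.get(doc_id)

alice = get_document(users, "u1")
bob = get_document(users, "u2")
print(alice["favorite_bands"])
print(bob.get("favorite_bands", []))   # missing fields need a default
```

The flexibility has a cost on the reading side: code that consumes documents has to cope with fields that may be absent.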

---

## 🚀 What SQL Can’t Do Easily (but NoSQL Can)

| Feature                  | SQL                       | NoSQL                            |
|--------------------------|---------------------------|----------------------------------|
| Fixed Schema             | Yes                       | ❌ Flexible schema               |
| Horizontal Scalability   | Difficult                 | ✅ Easy with sharding            |
| Data Format              | Tabular only              | ✅ JSON-like, Key-Value, Graph   |
| Fast Writes & Reads      | Slower at massive scale   | ✅ Designed for speed            |

**Example Use Case**:
An e-commerce website tracking user activity like page views, clicks, cart updates, and more. These interactions happen every second. NoSQL helps by **quickly writing this data** and letting engineers **query it later** for analytics or ML.

---

## ⚖️ Eventual Consistency vs Strong Consistency

Relational databases use **strong consistency**:

- You can **only read** data after **all servers are updated**
- ✅ Always correct  
- ❌ Slower (has to wait)

NoSQL often uses **eventual consistency**:

- You can **read from one server** before others are updated
- ❌ Might get old data briefly
- ✅ But it’s **fast** and **available**

📌 **Real-World Analogy**:

Imagine you post a comment on Instagram. You see it instantly, but your friend may see it a second later. That’s **eventual consistency** — good enough for most real-time systems where speed matters more than perfection.
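
The difference can be sketched with a toy two-copy store. This is a simplified model for intuition only (real systems replicate asynchronously over a network); the `ReplicatedStore` class is invented for the example.

```python
class ReplicatedStore:
    """Toy model: writes go to a primary and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []            # writes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_eventual(self, key):
        # Fast path: answer from the replica, which may be behind.
        return self.replica.get(key)

    def read_strong(self, key):
        # Strong read: wait for replication to finish before answering.
        self.replicate()
        return self.replica.get(key)

    def replicate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.write("comment:1", "Nice photo!")
stale = store.read_eventual("comment:1")    # replica has not caught up yet
fresh = store.read_strong("comment:1")      # forces the replica to catch up
print(stale, fresh)
```

The eventual read returns immediately but misses the new comment; the strong read sees it at the price of waiting for replication.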

📸 *Visual: ACID Compliance & MongoDB*

![ACID Properties](./images/acid.png)

Some NoSQL databases (like MongoDB) **do offer ACID transactions**, but using them can cost some of the speed and scale that eventual consistency provides.

---

## ✅ Summary

- NoSQL databases were designed to handle **big, messy, fast-moving data**.
- They offer **schema flexibility**, **horizontal scaling**, and **faster performance** than SQL in many modern applications.
- They trade **strong consistency** for **eventual consistency** in many cases — which is often a good trade for **user-facing web apps, real-time systems, and distributed services**.

## ACID Properties in Databases

Relational databases are **ACID compliant**, which means they follow 4 important rules:  
**Atomicity**, **Consistency**, **Isolation**, and **Durability**.

These rules help keep your data safe and reliable — especially when many users or systems are working with the same data at the same time.

---

### 📊 ACID vs NoSQL

|                  | Relational DBs     | NoSQL DBs              |
|------------------|--------------------|------------------------|
| ACID Compliant   | ✅ Yes              | ❌ Not by default       |

![ACID Compliance](./images/acid_compliance.png)

![ACID Definition](./images/acid_definition.png)

---

### ⚛️ 1. Atomicity

**Meaning**: All operations in a transaction must succeed or none should happen.  
If one part fails, the whole transaction is rolled back.

**Example**:  
A user tries to place an online order:
- Step 1: Deduct ₹500 from their wallet  
- Step 2: Add the product to the orders table  

If step 2 fails (e.g., database error), step 1 will also be undone. The user won’t lose money without an order.
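
A minimal sketch of this rollback behavior with `sqlite3`, assuming a hypothetical `wallets`/`orders` schema. The second call fails on purpose at step 2 so the rollback is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wallets (user_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                      product TEXT NOT NULL);
INSERT INTO wallets VALUES (1, 1000);
""")

def place_order(conn, user_id, product, price):
    """Deduct the price and record the order as one transaction."""
    try:
        with conn:   # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE wallets SET balance = balance - ? WHERE user_id = ?",
                         (price, user_id))
            conn.execute("INSERT INTO orders (user_id, product) VALUES (?, ?)",
                         (user_id, product))
        return True
    except sqlite3.Error:
        return False

ok = place_order(conn, 1, "book", 500)       # both steps succeed
failed = place_order(conn, 1, None, 500)     # step 2 violates NOT NULL
balance = conn.execute("SELECT balance FROM wallets WHERE user_id = 1").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, failed, balance, order_count)
```

Only the first order is charged: the deduction made by the failed attempt is undone together with its insert.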

---

### ✅ 2. Consistency

**Meaning**: After a transaction, the data should follow all the rules of the database.

**Example**:  
Let’s say your app has a rule:  
**A user’s age must be between 0 and 120.**  

If someone tries to insert `age = 900`, the transaction will fail. This keeps the database **valid and clean**.
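
One way to express such a rule is a `CHECK` constraint. The sketch below uses `sqlite3` and a hypothetical `users` table to show the invalid insert being rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        age     INTEGER CHECK (age BETWEEN 0 AND 120)
    )
""")
conn.execute("INSERT INTO users (age) VALUES (30)")       # passes the rule

try:
    conn.execute("INSERT INTO users (age) VALUES (900)")  # violates the rule
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(rejected, count)
```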

---

### 🧩 3. Isolation

**Meaning**: Concurrent transactions do not interfere with each other. Even if two people order something at the same time, the result is the same as if the database had processed the transactions **one after another**.

**Example**:  
- User A adds a product to cart and places an order  
- At the exact same time, User B also places an order for the same item  
Even if they both acted simultaneously, the database ensures proper stock adjustment without mixing the transactions.

---

### 💾 4. Durability

**Meaning**: Once a transaction is complete, it’s **saved forever**, even if there's a system crash.

**Example**:  
A payment is successful and the order is confirmed.  
Even if the server goes down after that, the order won’t disappear. It’s written to disk or backup properly.

---

### 🧠 Summary

- **Atomicity** → All or nothing  
- **Consistency** → Follow the rules  
- **Isolation** → One at a time  
- **Durability** → Saved forever  

These principles are **critical** for financial apps, inventory systems, or any place where **reliable data** is a must.

# Amazon DynamoDB with Python (Boto3 SDK)

---

## 📦 What is DynamoDB?

Amazon DynamoDB is a fully managed NoSQL key-value database offered by AWS that delivers single-digit millisecond performance at any scale.

- It is serverless, highly scalable, and available.
- Commonly used in real-time applications like IoT, gaming, recommendation engines, etc.

---

## 🗝️ Key-Value Storage

In DynamoDB, each table stores items (rows), and each item is a set of attributes (columns).

- Each item is uniquely identified using a primary key.
- Key = Attribute that uniquely identifies an item.
- Value = The actual contents of that attribute.

💡 Terminology note: While we say “key is 101” in table diagrams, in formal terms, `PersonID` is the key name, and `101` is the value.

---

### 🏫 Analogy: School ID System

A student's ID card has: ID = 123, Name = Alice, Grade = A

In this case:
- Key = ID (like PersonID)
- Value = 123
- All other details are attributes.

---

### 📸 Visual: Key-Value Table (Simple Primary Key)

![Simple Primary Key](./images/dynamodb.png)

- Table Name: Person  
- Primary Key: PersonID  
- Attributes: FirstName, LastName, Phone, Country, FavoriteBands

---

## 🧩 Composite Primary Key

Sometimes one key is not enough!

DynamoDB allows composite primary keys consisting of:
- Partition Key (aka Hash Key)
- Sort Key (aka Range Key)

This allows grouping multiple related items under one partition.

🧠 Analogy:  
Partition Key = OrderID (e.g., 1234)  
Sort Key = ItemNum (e.g., Item1, Item2)

---

### 📸 Visual: Composite Primary Key

![Composite Primary Key](./images/dynamodb_2.png)
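
As a sketch, a composite key is declared through the `KeySchema` of a table definition, where `HASH` marks the partition key and `RANGE` the sort key. The table name, attribute types, and billing mode below are assumptions for illustration; the snippet only builds the request and does not contact AWS.

```python
# Keyword arguments in the shape the DynamoDB CreateTable API expects.
# The table and attribute names below are hypothetical.
table_definition = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "OrderID", "KeyType": "HASH"},    # partition key
        {"AttributeName": "ItemNum", "KeyType": "RANGE"},   # sort key
    ],
    # Only key attributes are declared; other attributes are schema-less.
    "AttributeDefinitions": [
        {"AttributeName": "OrderID", "AttributeType": "S"},
        {"AttributeName": "ItemNum", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# With AWS credentials configured, this would typically be sent as:
#   boto3.client("dynamodb").create_table(**table_definition)
key_types = {k["AttributeName"]: k["KeyType"] for k in table_definition["KeySchema"]}
print(key_types)
```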

---

## 🧱 DynamoDB is Schema-less

Each item in a DynamoDB table can have a different set of attributes. You do not need to pre-define a schema like in relational databases.

💡 This flexibility makes DynamoDB great for agile and fast-moving applications, as your data model can evolve over time.

---

## ⚙️ CRUD Operations in DynamoDB using Boto3

![Boto3 SDK](./images/boto3.png)

### 🔵 What is Boto3?

- Boto3 is the AWS SDK for Python.
- Allows you to create, configure, and manage AWS services using Python.
- You can perform all CRUD operations on DynamoDB using Boto3.

---

### 📊 CRUD Operations Summary

| Operation | Boto3 Method            | Description                                                |
|-----------|-------------------------|------------------------------------------------------------|
| Create    | put_item                | Write a new item (`create_table` creates the table itself) |
| Read      | get_item, query, scan   | Read one item by key, or many items from the table         |
| Update    | update_item             | Modify attributes of an existing item                      |
| Delete    | delete_item             | Delete an item                                             |
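
The item-level calls can be sketched as follows. To keep the example runnable without AWS credentials, it uses a hypothetical `RecordingClient` stand-in that only records calls; with a real `boto3.client('dynamodb')` the same method names and keyword arguments apply. The table and attribute values are illustrative.

```python
class RecordingClient:
    """Stand-in for boto3.client('dynamodb') that only records calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, method_name):
        def record(**kwargs):
            self.calls.append((method_name, kwargs))
            return {}
        return record


def crud_demo(client, table="Person"):
    key = {"PersonID": {"N": "101"}}
    # Create: write a whole item (replaces any item with the same key).
    client.put_item(TableName=table,
                    Item={**key, "FirstName": {"S": "Alice"}})
    # Read: fetch one item by its primary key.
    client.get_item(TableName=table, Key=key)
    # Update: change one attribute in place.
    client.update_item(TableName=table, Key=key,
                       UpdateExpression="SET Country = :c",
                       ExpressionAttributeValues={":c": {"S": "India"}})
    # Delete: remove the item.
    client.delete_item(TableName=table, Key=key)


client = RecordingClient()
crud_demo(client)
called = [name for name, _ in client.calls]
print(called)
```

Passing the client in as a parameter also makes the function easy to test, since the real client can be swapped for a stub like this one.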

---

## 🧬 Understanding **kwargs in Python (used in Boto3)

### What is **kwargs?

- In a function definition, `**kwargs` collects any extra named (keyword) arguments into a dictionary.
- In a function call, `**some_dict` does the reverse: it unpacks the dictionary's keys and values into separate named arguments.

Example:
```python
def my_function(**kwargs):  
    print(kwargs)

my_function(name="Alice", age=30)  
# Output: {'name': 'Alice', 'age': 30}
```
🧠 It’s like saying: “I don’t know how many named arguments I’ll get, just pass them all!”
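
Both directions of `**` can be seen side by side in this short sketch. The function names are made up; `TableName` and `Limit` merely mimic the style of Boto3 keyword arguments.

```python
def describe(name, age):
    return f"{name} is {age}"

settings = {"name": "Alice", "age": 30}

# ** in a call spreads the dict into keyword arguments:
# describe(**settings) is the same as describe(name="Alice", age=30)
unpacked = describe(**settings)

def collect(**kwargs):
    # ** in a definition gathers keyword arguments into one dict
    return kwargs

collected = collect(TableName="Forum", Limit=5)
print(unpacked, collected)
```

This is why a Boto3 request is often prepared as a dictionary first and then passed with `**`.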

---

## 🧑‍💻 Parameters vs Arguments vs Attributes

- Parameter: A variable in a function definition.
- Argument: Actual value passed to that function.
- Keyword Argument: Argument passed with a name (e.g., name='Alice')
- Attributes: Variables belonging to an object.

Example:
```python
class Person:  
    def __init__(self, name, age):        # name & age = parameters  
        self.name = name                  # name = attribute  
        self.age = age                    # age = attribute  
```
---

### ☕️ Analogy: Coffee Machine

- Object = Coffee Machine  
- Attributes = Temperature, Size  
- Parameter = `coffee_type` in the definition of `brew`; argument = `'espresso'` in the call `brew(coffee_type='espresso')`  
- When you call the function, the argument passed for `coffee_type` decides how the internal attributes are used.

---

## 🧾 Loading JSON into DynamoDB

![Data Load Diagram](./images/data_dynamo.png)

---

### What is Parsing?

Parsing means reading data in one format (for example, JSON text) and converting it into a structure that a program can work with (for example, a Python dictionary).

---

### 🔄 Example: Parsing JSON into Dictionary
```python
import json

with open('forum.json', 'r') as file:  
    data = json.load(file)   # parsing JSON → Python dict
```
---

## 🚀 Inserting Items into DynamoDB

Each JSON record is inserted using the PutRequest format.  
S = String, N = Number

Example JSON:
```json
{
  "PutRequest": {
    "Item": {
      "Name": { "S": "Amazon DynamoDB" },
      "Category": { "S": "Amazon Web Services" },
      "Threads": { "N": "2" },
      "Messages": { "N": "4" },
      "Views": { "N": "1000" }
    }
  }
}
```
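
If the source data is a plain dictionary, it first has to be wrapped in this typed format. The helper below is a hand-written sketch that covers only strings and numbers; the function names are hypothetical.

```python
def to_attribute_value(value):
    """Wrap a Python value in a DynamoDB type descriptor (S or N only)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}      # numbers travel as strings
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_put_request(plain_item):
    """Build one PutRequest entry from a plain dict."""
    return {"PutRequest": {"Item": {k: to_attribute_value(v)
                                    for k, v in plain_item.items()}}}


forum = {"Name": "Amazon DynamoDB", "Threads": 2, "Views": 1000}
request = to_put_request(forum)
print(request)
```

For real workloads, Boto3 ships a `TypeSerializer` in `boto3.dynamodb.types` that handles the full set of DynamoDB types.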
---

![Forum Insert Visual](./images/forum_json.png)

---
## 📥 Full Code to Load JSON into DynamoDB

```python
import json  
import boto3

# Initialize DynamoDB client  
client = boto3.client('dynamodb')

# Load JSON data  
with open('forum.json', 'r') as f:  
    forum_data = json.load(f)

# Loop through each item and insert into DynamoDB  
for record in forum_data['Forum']:  
    item = record['PutRequest']['Item']  
    client.put_item(TableName='Forum', Item=item)
```
---

# Object Storage

Object Storage is a modern storage architecture where files (called **objects**) are stored in a flat structure rather than a hierarchical folder-based system.

---

## 🗂️ Traditional vs. Object Storage

Unlike traditional file systems where data is organized into folders and subfolders, object storage treats all files as flat objects without a hierarchy.

![Object Storage Overview](./images/object_storage.png)

> 🔍 **No hierarchy**: What may look like folders in the UI (e.g., Amazon S3) is just a visual trick. Internally, **all files are stored at the top level**.

---

## 🧾 Types of Files Stored

Object storage can hold files of any type, including:

- CSV, JSON (structured and semi-structured)
- TXT, PNG, MP4, MP3, BIN (unstructured)

This makes it a great fit for use cases like training machine learning models.

![Supported File Types](./images/object_stoarge_2.png)

---

## 🔐 Key Properties of Object Storage

![Object Write and UUID](./images/os.png)

When programs write files to object storage, each file becomes an object with the following properties:

- **UUID (Universally Unique Identifier)**: Each object gets a unique key.
- **Metadata**: Includes file creation date, file type, owner, etc.
- **Immutable**: Once written, **objects cannot be modified**.

> ⚠️ **Important:** You **cannot append to or modify** an object in place. To change it, write a new object under the same key:
>
> - **Versioning disabled:** the key now points to the new object, and the old one is replaced.
> - **Versioning enabled:** the new object is stored as a new version alongside the old one.
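
The write-once behavior can be illustrated with a toy in-memory store. This is a conceptual sketch, not the S3 API; the `ObjectStore` class and its methods are invented for the example.

```python
import uuid

class ObjectStore:
    """Toy in-memory store: objects are never edited, only replaced."""

    def __init__(self, versioning=False):
        self.versioning = versioning
        self.objects = {}            # key -> list of (version_id, bytes)

    def put(self, key, data: bytes):
        version_id = str(uuid.uuid4())
        if self.versioning:
            self.objects.setdefault(key, []).append((version_id, data))
        else:
            self.objects[key] = [(version_id, data)]   # old object is dropped
        return version_id

    def get(self, key):
        return self.objects[key][-1][1]                # latest version

    def versions(self, key):
        return [v for v, _ in self.objects[key]]


store = ObjectStore(versioning=True)
store.put("logs/app.txt", b"line 1\n")
# "Appending" means reading the old bytes and writing a whole new object.
store.put("logs/app.txt", store.get("logs/app.txt") + b"line 2\n")
print(store.get("logs/app.txt"), len(store.versions("logs/app.txt")))
```

Note that `logs/app.txt` is a single flat key here; the slash is part of the name rather than a real folder.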

---

## ✅ Why Use Object Storage?

- 📂 **Flat structure** simplifies storage.
- 🧠 **Great for ML workflows** (data lakes, lakehouses).
- 💰 **Cost-effective** for infrequent access.
- 🔁 **Highly durable** (e.g., S3 is designed for 99.999999999%, or "11 nines", durability).
- 🌍 **Replicated** across multiple Availability Zones within a region.

---

## 🛠️ Real-World Uses

- Backing data lakes
- Storing logs, images, and training datasets
- Archiving backups
- Supporting cloud-native apps

---

## Next Steps

You’ll now explore Amazon S3 in practice:
- Create an S3 bucket
- Upload/query data
- Use object versioning

---

## Note

To revisit concepts related to **SQL**, **DynamoDB**, and **S3**, refer to the `C2_W1_Assignment` folder.  
It contains documentation and examples useful for reinforcing these topics.

## Logs 

Logs are **append-only records** that capture events in a software system over time. They’re considered the simplest form of a streaming system and are often referred to as the "exhaust" or "byproduct" of application processes.

Logs play a key role in:
- Monitoring systems
- Debugging errors
- Tracking user behavior
- Feeding machine learning pipelines

---

### 🖥️ What Do Logs Look Like?

Logs are often generated by software applications and record various types of events, such as user actions, system errors, or data updates.

![logs](./images/logs.png)

---

### 🔧 Logs for Monitoring, Debugging, and Beyond

Logs aren't just technical exhaust — they’re **valuable data sources** for multiple downstream tasks in data engineering. Below is a breakdown of the different types of logs and how each can be used.

---

### 1. 🧭 Web Server Logs  
**Purpose**: Capture detailed **user activity** on web and mobile applications.  
**Example Use Cases**:
- Tracking user behavior and clickstreams
- Analyzing bounce rate, session duration, and navigation paths
- Building **analytics dashboards** and conversion funnels  
- A/B testing user flows or feature performance

👉 **Used For**:  
📊 *Behavior analytics*, 📈 *UX optimization*, 🎯 *Personalization engines*

---

### 2. 🗄️ Database System Logs  
**Purpose**: Record all changes and operations on a database.  
**Example Use Cases**:
- Monitoring inserts, updates, and deletes (CRUD)
- Performing **Change Data Capture (CDC)** for incremental ETL
- Triggering alerts for schema changes or performance issues
- Auditing for compliance or rollback

👉 **Used For**:  
🛠️ *Data replication*, 🔄 *CDC pipelines*, 🧩 *Data quality tracking*

---

### 3. 🔐 Security System Logs  
**Purpose**: Track access and authentication activities, especially for secure systems.  
**Example Use Cases**:
- Logging login attempts, IP addresses, token usage
- Flagging unauthorized access or multiple failed logins
- Feeding into **anomaly detection models** for threat identification

👉 **Used For**:  
🧠 *Machine learning models* (fraud, intrusion), 🔐 *Security audits*, 🚨 *Alert systems*

---

### 🔁 Summary Diagram

![logs usage](./images/logs_usage.png)

Logs form the **foundation of observability** in modern systems and are an integral input to real-time and batch analytics workflows.

---

### 📋 Log Structure

A typical log contains:
- **Timestamp**: When the event occurred
- **User/System ID**: Who performed the action
- **Status**: Success or failure
- **Action**: What exactly happened

![log format](./images/log_formate.png)
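
As a sketch of how such fields are pulled out of raw text, the snippet below parses one log line with a regular expression. The log format itself (`user=`, `action=`, `status=`) is an assumption; real systems each define their own layout.

```python
import re
from datetime import datetime

# Hypothetical log format: timestamp, level, user, action, status.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>\w+) user=(?P<user>\S+) "
    r"action=(?P<action>\S+) status=(?P<status>\S+)"
)

def parse_line(line):
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None                      # line does not follow the format
    entry = match.groupdict()
    entry["timestamp"] = datetime.fromisoformat(entry["timestamp"])
    return entry

entry = parse_line("2024-05-01T10:15:30 INFO user=u42 action=login status=success")
print(entry)
```

Returning `None` for lines that do not match lets a pipeline count or quarantine malformed entries instead of crashing on them.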

---

### 🏷️ Log Levels

Each log entry is often tagged with a **log level** to indicate severity or importance:

- `debug` – Low-level technical info
- `info` – General operation messages
- `warn` – Something unexpected, but not harmful
- `error` – A serious issue occurred
- `fatal` – A critical failure occurred

![log level](./images/log_level.png)
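
Python's standard `logging` module uses the same idea. In this sketch the logger threshold is set to `WARNING`, so lower-severity entries are dropped; the logger name and messages are made up, and output is captured in memory only so the result can be inspected.

```python
import io
import logging

buffer = io.StringIO()                       # capture output for the demo
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.WARNING)             # drop debug and info entries
logger.propagate = False

logger.debug("row parsed")                   # filtered out
logger.info("batch loaded")                  # filtered out
logger.warning("retrying request")
logger.error("load failed")
logger.critical("pipeline stopped")          # Python's name for fatal

# asctime is "date time", so the level is the third whitespace-separated field.
levels = [line.split()[2] for line in buffer.getvalue().splitlines()]
print(levels)
```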

---

Logs are powerful data sources for a data engineer. Understanding them helps in building resilient pipelines, debugging systems, and enabling downstream applications like anomaly detection and user analytics.

# Streaming Terminology and Architecture

To truly understand how data moves in a streaming system like Kafka or Kinesis, it's helpful to get familiar with some core concepts. Let's break them down step-by-step.

---

## 🔹 Event, Message, and Stream

![Streaming Terminology](./images/streaming_terminology.png)

Streaming systems revolve around **three key ideas**: **Event**, **Message**, and **Stream**.

- An **Event** is anything that happens in the world — for example, a user clicking on a link, or a temperature sensor recording a value. It's a change in the state of a system that we care about.

- A **Message** is a record of that event. It contains:
  - **Event Details** (what happened)
  - **Event Metadata** (extra info like user ID or device type)
  - **Timestamp** (when it happened)

- A **Stream** is just a continuous sequence of such messages — think of it as an ongoing pipeline of data.
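
These three terms map naturally onto a small data model. The sketch below is one possible representation, with hypothetical field values, where a generator plays the role of the stream.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator

@dataclass
class Message:
    details: dict                      # what happened
    metadata: dict                     # extra context, e.g. user or device
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # when it happened

def click_stream(user_ids) -> Iterator[Message]:
    """Yield one message per click event; a stream is this sequence over time."""
    for user_id in user_ids:
        yield Message(details={"event": "click", "target": "/home"},
                      metadata={"user_id": user_id})

messages = list(click_stream(["u1", "u2", "u3"]))
print(len(messages), messages[0].metadata)
```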

---

## 🔸 What Does a Streaming System Look Like?

![Streaming System](./images/streaming_systems.png)

In a typical streaming system:
- The **Event Producer** emits events (this could be an app, sensor, service, etc.).
- These events are picked up by a **Streaming Broker** (like Kafka), which acts as an **event router**.

The broker does two important things:
1. **Acts as a buffer** — it collects and holds messages briefly so they don’t overwhelm consumers.
2. **Decouples producers and consumers** — they don’t need to interact directly or work at the same pace.

This separation improves scalability and resilience in distributed systems.

---

## 📬 What Is a Message Queue?

![Message Queue](./images/message_queue.png)

A **Message Queue** (like Amazon SQS) is a simpler form of messaging system.

- It stores messages in the order they arrive (FIFO — First In, First Out).
- Consumers take messages one at a time asynchronously.
- It's useful when **reliable delivery and ordering** are important (in Amazon SQS, strict ordering requires a FIFO queue; standard queues offer best-effort ordering).

This is more like a temporary handoff — once the message is delivered, it’s typically gone from the queue.
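
The first-in, first-out handoff can be sketched with `collections.deque`. This is an in-process illustration of the queue discipline, not an SQS client.

```python
from collections import deque

queue = deque()

# Producer side: messages join the back of the queue.
for order_id in (1, 2, 3):
    queue.append({"order_id": order_id})

# Consumer side: messages leave from the front, oldest first.
processed = []
while queue:
    message = queue.popleft()      # once taken, the message is gone
    processed.append(message["order_id"])

print(processed, len(queue))
```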

---

## 🌊 Event Streaming Platforms (Kafka, Kinesis)

![Event Streaming Platform](./images/event_streaming_platform.png)

Platforms like **Apache Kafka** and **Amazon Kinesis** are built for high-volume real-time event processing.

- Events are stored in a **log** — an append-only record.
- Consumers can **read at different speeds** and **from different positions** in the log.
- You can **re-read or replay** events anytime — which is extremely helpful for:
  - Debugging
  - Backfilling analytics
  - Machine learning model training

This means Kafka isn’t just passing messages — it’s also storing and managing them in a powerful, flexible way.
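
The log-plus-offsets idea can be modeled in a few lines. This toy `EventLog` class is invented for illustration and leaves out partitions, persistence, and networking; it only shows that each consumer keeps its own position and can rewind.

```python
class EventLog:
    """Toy append-only log; consumers track their own read position."""

    def __init__(self):
        self._events = []
        self._offsets = {}              # consumer name -> next index to read

    def append(self, event):
        self._events.append(event)      # existing entries are never changed

    def poll(self, consumer, max_events=10):
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        self._offsets[consumer] = offset   # rewind to replay old events


log = EventLog()
for n in range(5):
    log.append(f"event-{n}")

fast = log.poll("dashboard")                  # reads all five
slow = log.poll("ml_training", max_events=2)  # reads only the first two
log.seek("dashboard", 0)
replayed = log.poll("dashboard", max_events=3)
print(fast, slow, replayed)
```

Unlike the queue, reading does not remove anything: events stay in the log, which is what makes replay possible.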

---


## Connecting to Systems


For step-by-step instructions on connecting to an Amazon RDS MySQL instance, see the course supplement:

👉 [Link for connecting to an Amazon RDS MySQL database procedure](https://www.coursera.org/learn/source-systems-data-ingestion-and-pipelines/supplement/li6KL/optional-connecting-to-an-amazon-rds-mysql-database)

## IAM (Identity and Access Management)

> **For better understanding, refer to this official course resource**:  
[Basics of AWS IAM – Coursera Supplement](https://www.coursera.org/learn/source-systems-data-ingestion-and-pipelines/supplement/HlS45/basics-of-aws-iam)

---

### 🔐 What is IAM?

**IAM is a framework for managing permissions.**  
It defines **who** (person/application) can perform **what actions** on **which resources** (e.g., databases, S3, Glue jobs).

IAM helps enforce the **principle of least privilege**, ensuring users or systems get **only the access they need**, and **only when needed**.

---

### 📸 Visual Overview

![IAM Overview](./images/iam.png)

---

### 🧑‍💼 IAM Identity Types

In AWS IAM, **identities** are the actors who interact with your resources. These include:

- **Root User**:  
  - The account owner with unrestricted access.  
  - Should be used rarely.

- **IAM Users**:  
  - Individual identities with long-term credentials (like passwords or access keys).  
  - Usually assigned to humans.

- **IAM Groups**:  
  - Collections of users.  
  - Policies are attached to the group and apply to all its users.

- **IAM Roles**:  
  - Temporary identities assumed by users, apps, or services.  
  - Used for granting temporary access **without exposing credentials**.  
  - Example: An EC2 instance can assume a role to access an S3 bucket.

---

### 📜 IAM Policies

Policies are **JSON documents** that define **what actions** (e.g., `s3:GetObject`) are allowed on **which resources** (e.g., a specific S3 bucket).

There are 3 main components:

- **Effect** (Allow or Deny)
- **Action** (like `s3:GetObject`, `glue:StartJobRun`)
- **Resource** (e.g., specific bucket or Glue job ARN)
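
A minimal sketch of how the three components sit together in a policy document, built as a Python dictionary and serialized to JSON. The bucket name is a placeholder, and the single statement is only an illustration of the layout.

```python
import json

# Bucket name is a placeholder, not a real resource.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```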

Example actions in a policy:
```json
{
  "Action": [
    "s3:List*",
    "s3:Get*"
  ],
  "Resource": "arn:aws:s3:::dlai-data-engineering*"
}
```

# Basics of Networking in Cloud (AWS)

When you build a data pipeline in the cloud, you're really creating a **network of connected resources**. Understanding this network is key to ensuring smooth data flow, proper access, and secure configurations.

---

## 📍 Cloud Infrastructure Hierarchy

Cloud computing is not abstract—it relies on **real, physical data centers** distributed globally.

- **Region**: A geographical area (like Mumbai, Tokyo, N. Virginia).
- **Availability Zone (AZ)**: Each region has multiple AZs; each AZ consists of one or more isolated data centers with their own power and networking.
- **Data Center**: The actual facility hosting servers and networking hardware.

### 🧠 Why Regions Matter

When choosing a region to deploy resources, consider:

![Region Considerations](./images/region_consideration.png)

- **Legal compliance**: Does your data need to stay within a specific country's jurisdiction?
- **Latency**: Lower if your users are closer to the region.
- **Availability**: Distributing across multiple AZs helps with disaster recovery.

---

## 🏗️ Virtual Private Cloud (VPC)

A **VPC (Virtual Private Cloud)** is a customizable private network in AWS that lets you launch AWS resources in a logically isolated section of the cloud.

- It spans **multiple availability zones**.
- Gives you control over networking—IP ranges, subnets, route tables, and gateways.

![VPC](./images/vpc.png)

---

## 🌍 Subnets within a VPC

Inside a VPC, we divide the network into **subnets**:

- **Public Subnet**: Has access to the internet (usually for web servers).
- **Private Subnet**: No direct internet access (used for databases, internal services).

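Note that **public** and **private** are not inherent properties of a subnet: the label follows from its **network configuration and routing**. A subnet is public when its route table sends internet-bound traffic to an internet gateway, and private when it has no such route.

The diagram below shows both kinds of subnet: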

![Public and Private Subnets](./images/vpc_public_private_subnet.png)

---

## 🔐 Subnet Configuration: ACLs & Gateways

Traffic in and out of a subnet is shaped by:

- **Network ACLs (Access Control Lists)**: Security rules that apply to traffic in/out of the subnet.
- **Internet Gateway**: Attached to the VPC; a public subnet routes internet-bound traffic to it.
- **Route Tables**: Define where traffic should go (e.g., internet or internal).

Visualizing the configuration:

![Subnet Routing and Gateway](./images/subnets.png)
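
The role of the route table can be sketched with plain dictionaries. The route tables and the `igw-example` target below are hypothetical, and the check is a simplification of how AWS evaluates routes.

```python
# Hypothetical route tables: destination CIDR -> target.
public_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-example"}
private_routes = {"10.0.0.0/16": "local"}

def is_public(route_table):
    """Treat a subnet as public when its default route targets an internet gateway."""
    return route_table.get("0.0.0.0/0", "").startswith("igw-")

print(is_public(public_routes), is_public(private_routes))
```

Both tables keep the `local` route, so resources in either subnet can still reach each other inside the VPC.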

---

## ✅ Summary

- 🏢 **Region**: Where your resources are hosted—choose based on latency, compliance, and cost.
- 🧱 **Availability Zones**: Improve reliability—spread your infrastructure.
- 🧠 **VPC**: Your own private network inside AWS.
- 📦 **Subnets**: Divide VPC into public (internet-facing) and private (internal-only).
- 🔐 **ACLs & Gateways**: Control who can talk to what.

# AWS Networking Basics – VPC, Subnets, IP, CIDR, Bits
---

## 📦 What is a VPC?

- **VPC** stands for **Virtual Private Cloud**.
- It's like a **protected network boundary** for your AWS resources.
- You use it to **isolate and control** networking for services like EC2, RDS, etc.
- You define how they **connect to each other** and to the **internet**.

🧱 Think of it as a **walled city**:
- Inside: EC2 instances, databases, etc.
- Wall: Firewall + routing rules + IP controls.

---

## 🌍 IPv4 Addressing

- IPv4 = **Internet Protocol Version 4**
- Address format: `x.x.x.x` (e.g., `192.168.0.1`)
- Each number ranges from `0–255` (8 bits per segment)
- So total: `8 + 8 + 8 + 8 = 32 bits`
- Total possible unique addresses: `2^32 = ~4.3 billion`
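
The arithmetic above can be checked with a minimal sketch using Python's standard `ipaddress` module. The address `192.168.0.1` is the example from the list; the variable names are only illustrative.

```python
import ipaddress

# The example address from the list above.
addr = ipaddress.IPv4Address("192.168.0.1")

# Each of the four segments (octets) fits in 8 bits, i.e. 0-255.
octets = [int(part) for part in str(addr).split(".")]
binary_octets = [format(octet, "08b") for octet in octets]

# Under the hood the whole address is one 32-bit integer.
as_int = int(addr)

# 32 bits give 2**32 possible addresses.
total_addresses = 2 ** 32

print(octets)           # [192, 168, 0, 1]
print(binary_octets)    # ['11000000', '10101000', '00000000', '00000001']
print(as_int)           # 3232235521
print(total_addresses)  # 4294967296
```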

---

## 📏 CIDR – Classless Inter-Domain Routing

- CIDR notation: `10.0.0.0/16`
  - `10.0.0.0` → **Network prefix**
  - `/16` → **First 16 bits are fixed** (used for network)
- Remaining bits → used to assign **host addresses**.

Example:
- `10.0.0.0/16` gives us:
  - `2^(32-16) = 2^16 = 65,536` private IP addresses
  - All starting with `10.0.*.*`

✅ Other prefix lengths such as `/8`, `/12`, `/22`, and `/24` are valid CIDR blocks too (an AWS VPC itself must be between `/16` and `/28`).
- The smaller the suffix (`/8`), the **larger the network**.
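
To see how the prefix length changes the size of a network, here is a small sketch with the standard `ipaddress` module. The helper name `describe` and the sample addresses are assumptions made for illustration.

```python
import ipaddress

def describe(cidr: str) -> tuple:
    """Return (host_bits, address_count) for a CIDR block."""
    network = ipaddress.ip_network(cidr)
    host_bits = network.max_prefixlen - network.prefixlen  # 32 - prefix for IPv4
    return host_bits, network.num_addresses

for cidr in ["10.0.0.0/8", "10.0.0.0/16", "10.0.0.0/24"]:
    host_bits, count = describe(cidr)
    print(f"{cidr}: {host_bits} host bits -> {count} addresses")

# Membership test: does an address fall inside the block?
vpc_range = ipaddress.ip_network("10.0.0.0/16")
inside = ipaddress.ip_address("10.0.42.7") in vpc_range   # True: starts with 10.0
outside = ipaddress.ip_address("10.1.0.1") in vpc_range   # False: second octet differs
print(inside, outside)
```

A `/8` leaves 24 host bits (16,777,216 addresses), while a `/24` leaves only 8 (256 addresses), which is why a smaller suffix means a larger network.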

---

## 🧠 How Bits Work

- A **bit** is the **smallest unit of data** → 0 or 1.
- **8 bits = 1 byte**
- Bits represent **electrical states** (via transistors).
- Combining bits creates **characters** humans understand (e.g., ASCII: `01100001` = 'a').
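
The ASCII example above can be reproduced in a couple of lines. This is only a sketch; the word `"VPC"` is an arbitrary sample chosen for illustration.

```python
bits = "01100001"           # the 8-bit pattern from the example above

code_point = int(bits, 2)   # read the string as a base-2 number
char = chr(code_point)      # look up the character with that code

# Going the other way: text -> bytes -> bit strings.
word = "VPC"
encoded = [format(byte, "08b") for byte in word.encode("ascii")]

print(code_point, char)     # 97 a
print(encoded)              # ['01010110', '01010000', '01000011']
```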

---

## 💻 64-bit Processors

- A **64-bit CPU** works on **64 bits of data per instruction** (this is a width, not a speed).
- It can address a huge memory space (up to 2^64 bytes in theory).
- Modern systems are typically 64-bit.
- ✅ 64-bit is **not the max**: wider architectures exist in research or specialized chips.

---

## 📡 Public vs Private Subnet

| Type | Description | Example Resources | Internet Access |
|------|-------------|-------------------|-----------------|
| **Public Subnet** | Direct access to the Internet | Load Balancer, Web Server | Yes |
| **Private Subnet** | Isolated, no direct access | Database, Backend services | No |

---

## 🌐 Internet Gateway vs NAT Gateway

| Gateway Type | Used By | Allows Internet Access | Public IP Required |
|--------------|---------|-------------------------|---------------------|
| **Internet Gateway** | Public Subnet | Yes (inbound/outbound) | Yes |
| **NAT Gateway** | Private Subnet | Yes (outbound only) | No |

---

## 📘 Default VPC

- AWS gives you a **default VPC** in each region.
- It includes:
  - One public subnet per AZ
  - An internet gateway
  - Routing set up
- ⚠️ **Not recommended** for production — it’s mainly for testing.
- Create **custom VPCs** for fine-grained control.

---

## 🧠 Memory: Bits to Megabytes

- `1 byte = 8 bits`
- `1 kilobyte (KB) = 1024 bytes`
- `1 megabyte (MB) = 1024 KB = 1024 × 1024 × 8 bits = 8,388,608 bits`
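
The same conversion as a tiny sketch, using the binary (1024-based) convention from the list above. The constant and function names are made up for this example.

```python
BITS_PER_BYTE = 8
BYTES_PER_KB = 1024   # binary convention used in these notes
KB_PER_MB = 1024

def megabytes_to_bits(megabytes: int) -> int:
    """Convert megabytes to bits: MB -> KB -> bytes -> bits."""
    return megabytes * KB_PER_MB * BYTES_PER_KB * BITS_PER_BYTE

one_mb_in_bits = megabytes_to_bits(1)
print(one_mb_in_bits)   # 8388608
```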

---


# AWS VPC Setup 

## 🎯 Scenario

We are setting up the networking components for a web application that runs on an EC2 instance and queries a backend RDS database. 

In this setup:

- EC2 + RDS will reside in **private subnets**.
- NAT Gateways will be used in **public subnets** for outbound internet access.
- An **Application Load Balancer** will sit in the public subnets to forward traffic to EC2.

We are starting by building the **VPC and subnets** only. We'll add EC2, RDS, and gateways later.

---

## 🧠 Why Not Use the Default VPC?

Each AWS region comes with a **default VPC**, which has:

- Public subnets
- An Internet Gateway

This makes it easy to quickly launch internet-facing instances.

> However, for most **real-world applications**, we don't want resources like databases to be directly internet-accessible.  
> ✅ Best Practice: Always create **custom VPCs** with fine-grained control over networking and security.

---

## 📦 VPC Basics

- A **VPC (Virtual Private Cloud)** is a private, isolated section of AWS cloud where we can define our networking.
- Each **VPC can span multiple Availability Zones (AZs)** within a region.
- Each **VPC must have a CIDR block**, which defines the range of private IPs.

📷  
![VPC Overview](./images/vpc_example.png)  
![AWS VPC Components](./images/vpc_aws.png)

---

## 🔧 Step 1: Create a Custom VPC

- Go to AWS Console → Search for `VPC` → Click on `Create VPC`
- Choose: "VPC only" (manual creation of subnets)

🖼️  
![Create VPC](./images/create_vpc.png)

### ✍️ VPC Settings

| Setting            | Value              |
|--------------------|--------------------|
| VPC Name           | `project-1`        |
| Region             | `us-east-1`        |
| IPv4 CIDR block    | `10.0.0.0/16`      |

📌 This gives us **2¹⁶ = 65,536 IPs** in this private IP range.

🧠 CIDR Explanation:
- IPs are written as: `A.B.C.D/n`, where `n` is how many **prefix bits** are fixed.
- `10.0.0.0/16` = First 16 bits (or first 2 octets) are fixed.
- IPs inside this VPC will start with `10.0`.
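
The console steps above could also be scripted with boto3. The following is a hedged sketch, not the course's method: `build_vpc_request` and `create_vpc` are hypothetical helper names, and the client is passed in as a parameter so the block runs without AWS credentials. `create_vpc`, `CidrBlock`, and `TagSpecifications` are the boto3 EC2 client call and parameters.

```python
def build_vpc_request(name: str, cidr_block: str) -> dict:
    """Build keyword arguments for the EC2 client's create_vpc call."""
    return {
        "CidrBlock": cidr_block,
        "TagSpecifications": [
            {"ResourceType": "vpc", "Tags": [{"Key": "Name", "Value": name}]}
        ],
    }

def create_vpc(ec2_client, name: str, cidr_block: str) -> str:
    """Create the VPC and return its ID. ec2_client is a boto3 EC2 client."""
    response = ec2_client.create_vpc(**build_vpc_request(name, cidr_block))
    return response["Vpc"]["VpcId"]

# Values taken from the settings table above.
request = build_vpc_request("project-1", "10.0.0.0/16")
print(request)

# Usage against a real account (needs boto3 installed and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   vpc_id = create_vpc(ec2, "project-1", "10.0.0.0/16")
```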

📷  
![CIDR Block Input](./images/vpc_cider.png)  
![CIDR Notation Explanation](./images/ip_cider.png)

---

## 🔄 Communication Between VPCs

Each VPC is **isolated by default**.

If needed, VPCs can connect through **VPC Peering** or **Transit Gateways**.

📷  
![VPC-to-VPC Communication](./images/vpc_to_vpc.png)

---

## 🌐 Step 2: Create Public and Private Subnets

A **Subnet** is a subdivision of a VPC. Each subnet:

- Is assigned to a specific **Availability Zone**
- Uses a CIDR block that is a **subset of the VPC CIDR range**
- Can be **public** or **private**

We’ll create:

| Subnet Type     | AZ           | CIDR Block      |
|------------------|--------------|------------------|
| Public Subnet 1 | us-east-1a   | 10.0.1.0/24     |
| Private Subnet 1| us-east-1a   | 10.0.2.0/24     |
| Public Subnet 2 | us-east-1b   | 10.0.3.0/24     |
| Private Subnet 2| us-east-1b   | 10.0.4.0/24     |
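
Before creating the subnets, the plan in the table can be sanity-checked offline. This sketch uses only the standard `ipaddress` module. The one fact not stated in these notes is that AWS reserves 5 addresses in every subnet, which is why a `/24` ends up with 251 usable addresses.

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/16")

# (name, availability zone, CIDR) rows copied from the table above.
subnets = [
    ("Public Subnet 1", "us-east-1a", "10.0.1.0/24"),
    ("Private Subnet 1", "us-east-1a", "10.0.2.0/24"),
    ("Public Subnet 2", "us-east-1b", "10.0.3.0/24"),
    ("Private Subnet 2", "us-east-1b", "10.0.4.0/24"),
]

networks = {name: ipaddress.ip_network(cidr) for name, _, cidr in subnets}

# Every subnet must sit inside the VPC range.
all_inside = all(net.subnet_of(vpc) for net in networks.values())

# No two subnets may overlap.
overlapping = [
    (name_a, name_b)
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
    if net_a.overlaps(net_b)
]

# AWS reserves 5 addresses per subnet (assumption stated in the lead-in).
usable = {name: net.num_addresses - 5 for name, net in networks.items()}

print(all_inside)                  # True
print(overlapping)                 # []
print(usable["Public Subnet 1"])   # 251
```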

📷  
![VPC with Subnets](./images/vpc_subnet.png)

🖼️  
![Subnet Settings](./images/subnet_setting.png)

---

## ✅ Current Status

We’ve created:

- A VPC named `project-1`
- 2 Public Subnets
- 2 Private Subnets

None of the subnets currently have Internet access. The EC2 and RDS instances we’ll deploy **cannot yet access the Internet**.


# AWS Internet & NAT Gateways setup in a VPC

---

## 🔒 Problem Statement

By default, **a VPC is isolated** — no Internet access in or out.  
So, even if we launch an EC2 instance inside a public or private subnet, **it can't connect to the Internet**.

To solve this:
- Public resources (like Load Balancers) use an **Internet Gateway**
- Private resources (like EC2s, RDS) use a **NAT Gateway** to access the Internet **outbound only**. Outbound means traffic that leaves your server for the Internet or another external system.

---

## 🧱 Architecture Overview

![EC2 Use Case](./images/ec2_consideration.png)

- EC2 & RDS in private subnets
- Need for:
  - App updates (EC2 → Internet ✅)
  - Users accessing app (Internet → ALB → EC2 ✅)

---

## 🚪 Step 1: Create and Attach Internet Gateway

An **Internet Gateway (IGW)** acts like the **main door** to your VPC.

### Steps:
1. Go to **VPC Dashboard** → Internet Gateways
2. Click **Create Internet Gateway**
3. Give it a name (e.g., `Project1-Gateway`)
4. Click **Attach to VPC** and select your VPC

📸  
![Internet Gateway](./images/internet_gateway.png)

⛓ 1 Internet Gateway ↔ 1 VPC (One-to-One)
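
The same create-and-attach steps could be scripted with boto3. This is a sketch under assumptions: `create_and_attach_igw` is a hypothetical helper, and the IDs `vpc-example` and `igw-example` are placeholders. A `MagicMock` stands in for the real client so the block runs without AWS access.

```python
from unittest.mock import MagicMock

def create_and_attach_igw(ec2_client, vpc_id: str, name: str) -> str:
    """Create an Internet Gateway, attach it to the VPC, and return its ID."""
    response = ec2_client.create_internet_gateway(
        TagSpecifications=[
            {
                "ResourceType": "internet-gateway",
                "Tags": [{"Key": "Name", "Value": name}],
            }
        ]
    )
    igw_id = response["InternetGateway"]["InternetGatewayId"]
    ec2_client.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
    return igw_id

# Dry run with a stand-in client (placeholder IDs, no AWS calls are made).
fake_client = MagicMock()
fake_client.create_internet_gateway.return_value = {
    "InternetGateway": {"InternetGatewayId": "igw-example"}
}

igw_id = create_and_attach_igw(fake_client, "vpc-example", "Project1-Gateway")
print(igw_id)   # igw-example
```

With a real account you would pass `boto3.client("ec2", region_name="us-east-1")` instead of the stand-in.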

---

## 🚪 Step 2: Create NAT Gateway(s) in Public Subnet(s)

A **NAT Gateway** allows:
- EC2 instances in **private subnets** to **send traffic out** to the Internet (e.g., `apt-get update`)
- But it **blocks incoming Internet connections** to those EC2s

### Steps:
1. Go to **VPC Dashboard** → NAT Gateways
2. Click **Create NAT Gateway**
3. Select:
   - Public Subnet 1
   - Allocate new **Elastic IP**
4. Click **Create**
5. Repeat for Public Subnet 2 (for High Availability)

📸  
![NAT Gateway](./images/nat_gateway.png)

✅ A public NAT Gateway **requires an Elastic IP**: it is the public address that traffic from private instances appears to come from on the Internet.

🔗 [More on Elastic IPs (AWS Docs)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html)
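
A hedged boto3 sketch of the NAT Gateway steps: allocate an Elastic IP, then create the gateway in a public subnet. The helper name `create_nat_gateway` and the subnet IDs are placeholders, and a `MagicMock` replaces the real client so nothing is created in AWS.

```python
from unittest.mock import MagicMock

def create_nat_gateway(ec2_client, public_subnet_id: str) -> str:
    """Allocate an Elastic IP and create a public NAT Gateway in the subnet."""
    allocation = ec2_client.allocate_address(Domain="vpc")
    response = ec2_client.create_nat_gateway(
        SubnetId=public_subnet_id,
        AllocationId=allocation["AllocationId"],
    )
    return response["NatGateway"]["NatGatewayId"]

# Dry run with a stand-in client (placeholder IDs, no AWS calls are made).
fake_client = MagicMock()
fake_client.allocate_address.return_value = {"AllocationId": "eipalloc-example"}
fake_client.create_nat_gateway.return_value = {
    "NatGateway": {"NatGatewayId": "nat-example"}
}

# One NAT Gateway per public subnet, mirroring the high-availability step.
public_subnets = ["subnet-public-1", "subnet-public-2"]
nat_ids = [create_nat_gateway(fake_client, subnet) for subnet in public_subnets]
print(nat_ids)
```

With a real client the gateway takes a little while to become usable; boto3 offers a `nat_gateway_available` waiter that can be used to wait before adding routes.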

---

## 🔁 EC2 → Internet via NAT: Step-by-Step

Let’s say EC2 runs: `sudo apt-get update`

### Flow:

1. EC2 (Private Subnet) → Sends request to Internet
2. **Route Table** sends 0.0.0.0/0 traffic to NAT Gateway
3. **NAT Gateway**:
   - Replaces EC2's private IP with **Elastic IP**
   - Sends request to Internet
4. **Ubuntu repo** replies to the **Elastic IP**
5. **NAT Gateway**:
   - Matches reply to original EC2
   - Sends it back to EC2 in Private Subnet

💡 Internet never sees the EC2 directly.
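
The flow above can be modeled with a toy translation table. This is a simplified sketch, not how AWS implements it: a real NAT tracks full connection tuples, while this model keys only on a public-side port. The class name, IP addresses, and port numbers are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNatGateway:
    elastic_ip: str
    next_port: int = 1024
    # Maps a public-side port to the private (ip, port) that opened the flow.
    table: dict = field(default_factory=dict)

    def outbound(self, private_ip: str, private_port: int) -> tuple:
        """Rewrite the source of an outgoing packet to the Elastic IP."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (private_ip, private_port)
        return (self.elastic_ip, public_port)

    def inbound(self, public_port: int):
        """Return the private destination for a reply, or None if unsolicited."""
        return self.table.get(public_port)

nat = ToyNatGateway(elastic_ip="203.0.113.10")   # documentation-range address

# Steps 1-3: the EC2 instance sends a request, the NAT rewrites the source.
seen_by_internet = nat.outbound("10.0.2.15", 43512)

# Steps 4-5: the reply arrives at the Elastic IP and is mapped back.
reply_target = nat.inbound(seen_by_internet[1])

# A packet nobody asked for has no table entry and is dropped.
unsolicited = nat.inbound(9999)

print(seen_by_internet)   # ('203.0.113.10', 1024)
print(reply_target)       # ('10.0.2.15', 43512)
print(unsolicited)        # None
```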

---

## 🔐 Security Summary

| Action                             | Allowed? |
|-----------------------------------|----------|
| Internet → EC2 (Private Subnet)   | ❌ Blocked |
| EC2 → Internet (Outbound Request) | ✅ Allowed |
| Internet → NAT Gateway (new inbound connection) | ❌ Blocked |
| NAT Gateway → Internet            | ✅ Allowed |

---

## 🏗 Final Architecture

![Final Architecture](./images/final_architecture.png)

---

## ✅ Summary

- **Internet Gateway** enables **Internet access** (inbound and outbound) for public subnets
- **NAT Gateway** enables **outbound Internet** for private subnets
- **Elastic IP** is a must-have for NAT Gateways

---

> ✅ Tip: Always associate the correct **Route Tables** to ensure subnets know where to send outbound traffic.

# VPC Route Tables Configuration: Internet Gateway & NAT Gateway

In this notebook, we will configure **route tables** to enable internet access for instances in both public and private subnets.

---

## 🧱 Architecture Overview

The VPC contains:
- 2 Public Subnets (each with a NAT Gateway)
- 2 Private Subnets (containing EC2 & RDS)
- 1 Internet Gateway attached to the VPC
- 1 Application Load Balancer (ALB) to route traffic

This is the high-level network setup:

![Route Table Diagram](./images/route_table_diagram.png)

---

## 📍 What are Route Tables?

Route tables direct network traffic within your VPC:
- Every route table includes a **local route** (`10.0.0.0/16` → `local`) that allows communication inside the VPC
- Custom routes are needed for internet access
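
How a route table picks a target can be sketched in plain Python. When several routes match, the most specific one (longest prefix) wins. The table contents and the `igw-example` target are placeholders, and `lookup` is a hypothetical helper.

```python
import ipaddress

# Hypothetical route table for a public subnet: (destination CIDR, target).
routes = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "igw-example"),
]

def lookup(route_table, destination_ip: str) -> str:
    """Return the target of the most specific route matching the address."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    # The longest prefix is the most specific match.
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target

internal = lookup(routes, "10.0.2.15")   # matches both routes; /16 is more specific
external = lookup(routes, "8.8.8.8")     # only the 0.0.0.0/0 route matches
print(internal, external)                # local igw-example
```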

---

## ✅ Public Subnet Configuration (via Internet Gateway)

To enable **direct internet access** for public subnets:

1. Go to **Route Tables** in VPC Dashboard
2. Click `Create route table`

![Create Route Table](./images/create_route_table.png)

3. Associate it with the appropriate **public subnet**

![Associate Subnet](./images/selecting.png)

4. Click **Actions > Edit Routes**

![Edit Routes](./images/actions.png)

5. Add this route:
   - **Destination:** `0.0.0.0/0`
   - **Target:** `Internet Gateway`

![Internet Gateway Route](./images/internet_gateway_routel.png)

✅ This allows instances with **public IPs** to send/receive internet traffic.

Repeat these steps for all public subnets.

---

## 🔒 Private Subnet Configuration (via NAT Gateway)

To allow **outbound internet access only** for private subnets:

1. Go to the **route table** associated with the private subnet
2. Click **Edit Routes**
3. Add this route:
   - **Destination:** `0.0.0.0/0`
   - **Target:** `NAT Gateway` (in the **public subnet**)

![NAT Gateway Route](./images/nat_gateway_route.png)

🔁 **Repeat this for all private subnets.**

🛡️ This setup **prevents direct inbound internet access**, while still allowing private instances to:
- Pull updates
- Download software
- Connect to external APIs

---

## 📘 Summary

| Subnet Type   | Destination    | Target            | Purpose                              |
|---------------|----------------|-------------------|--------------------------------------|
| Public Subnet | `0.0.0.0/0`     | Internet Gateway  | Full internet access (in/out)        |
| Private Subnet| `0.0.0.0/0`     | NAT Gateway       | Outbound-only internet access        |
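
The two rows in the table map onto the boto3 `create_route` call, which takes `GatewayId` for an Internet Gateway and `NatGatewayId` for a NAT Gateway. The sketch below only builds the arguments; `default_route_kwargs` and all IDs are hypothetical.

```python
def default_route_kwargs(route_table_id: str, target_id: str) -> dict:
    """Build create_route arguments for a 0.0.0.0/0 route to the given target."""
    kwargs = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",
    }
    if target_id.startswith("igw-"):
        kwargs["GatewayId"] = target_id        # public subnets
    elif target_id.startswith("nat-"):
        kwargs["NatGatewayId"] = target_id     # private subnets
    else:
        raise ValueError(f"unsupported target: {target_id}")
    return kwargs

public_route = default_route_kwargs("rtb-public-example", "igw-example")
private_route = default_route_kwargs("rtb-private-example", "nat-example")
print(public_route)
print(private_route)

# Usage against a real account (needs boto3 and AWS credentials):
#   ec2.create_route(**public_route)
#   ec2.associate_route_table(RouteTableId="rtb-public-example", SubnetId="subnet-...")
```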

---

Now your VPC is ready for both public-facing and internal workloads. 🎯

# 🔗 Resource

For a broader overview of **VPC**, **Subnets**, **Security Groups**, and **Network ACLs**, refer to this Coursera resource:

[Optional AWS Networking Overview (VPC)](https://www.coursera.org/learn/source-systems-data-ingestion-and-pipelines/supplement/j3i2z/optional-aws-networking-overview-vpc)

---
# AWS Networking: Security Groups & Network ACLs

## 🧱 VPC Architecture Overview

In this setup:
- Public subnets contain NAT Gateways
- Private subnets host EC2 and RDS instances
- An Application Load Balancer (ALB) routes Internet traffic to the EC2 instances

![VPC Overview](./images/full_breakdown.png)

---

## 🔄 Internet to Application Flow

1. A user's request enters the VPC through the Internet Gateway and reaches the ALB in a public subnet.
2. The ALB forwards the traffic to EC2 instances in the private subnets.
3. EC2 communicates with RDS.

Each resource must have proper networking rules configured to allow traffic.

---

## 🔐 What are Security Groups?

Security Groups are **instance-level virtual firewalls** that control both **inbound and outbound traffic**.

![Security Group Basics](./images/securit_group_iamge.png)

### 📌 Key Points
- By default:
  - Inbound traffic is **denied**
  - Outbound traffic is **allowed**
- **Stateful**: Return traffic is automatically allowed
- Can reference other **security groups** (security group chaining)

---

## 🔗 Security Group Chaining Example

- ALB Security Group accepts HTTP (port 80) and HTTPS (port 443) from the internet.
- EC2 Security Group allows HTTP and HTTPS **only from the ALB's SG**.
- RDS Security Group allows MySQL (port 3306) only **from the EC2's SG**.
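
The chain above can be written out as the `IpPermissions` structures that boto3's `authorize_security_group_ingress` expects: `IpRanges` for a CIDR source and `UserIdGroupPairs` for a security group source. This is a sketch; `tcp_rule` is a hypothetical helper and the group IDs are placeholders.

```python
def tcp_rule(port: int, *, cidr=None, source_sg=None) -> dict:
    """Build one inbound TCP permission for a single port."""
    rule = {"IpProtocol": "tcp", "FromPort": port, "ToPort": port}
    if cidr is not None:
        rule["IpRanges"] = [{"CidrIp": cidr}]
    if source_sg is not None:
        rule["UserIdGroupPairs"] = [{"GroupId": source_sg}]
    return rule

# Placeholder security group IDs.
ALB_SG, EC2_SG, RDS_SG = "sg-alb-example", "sg-ec2-example", "sg-rds-example"

ingress = {
    # The ALB accepts web traffic from anywhere.
    ALB_SG: [tcp_rule(80, cidr="0.0.0.0/0"), tcp_rule(443, cidr="0.0.0.0/0")],
    # EC2 accepts web traffic only from the ALB's group.
    EC2_SG: [tcp_rule(80, source_sg=ALB_SG), tcp_rule(443, source_sg=ALB_SG)],
    # RDS accepts MySQL only from the EC2 group.
    RDS_SG: [tcp_rule(3306, source_sg=EC2_SG)],
}

for group_id, rules in ingress.items():
    print(group_id, rules)

# Usage against a real account (needs boto3 and AWS credentials):
#   ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=rules)
```

Referencing a group instead of an IP range means the rule keeps working when instances are replaced and their private IPs change.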

![Security Group Chaining](./images/security_group_chaining.png)

---

## 🛠️ Creating a Security Group

To create a security group:

1. Navigate to **VPC → Security Groups**
2. Click **Create Security Group**

![Create Security Group](./images/create_sc.png)

3. Provide name, description, and select the VPC

4. Add inbound rules:
   - Type: HTTP (port 80) → Source: `0.0.0.0/0`
   - Type: HTTPS (port 443) → Source: `0.0.0.0/0`

![Creating Security Group](./images/creating_sc.png)

---

## 🔒 What are Network ACLs?

Network ACLs are **stateless** firewalls applied at the **subnet level**.

![ACL Overview](./images/acls.png)

### 📌 Key Points
- Must define both **inbound and outbound** rules explicitly
- Evaluate rules by number (lowest first)
- Can **allow** or **deny** traffic
- Useful for blocking specific IPs or enforcing strict subnet-level rules
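
Rule ordering is easiest to see in a small model: rules are checked from the lowest number up, the first match decides, and anything unmatched is denied. The rules and addresses below are invented for illustration; `evaluate` is a hypothetical helper, not an AWS API.

```python
import ipaddress

# (rule number, source CIDR, port, action): hypothetical inbound rules.
inbound_rules = [
    (200, "0.0.0.0/0", 443, "allow"),
    (100, "198.51.100.0/24", 443, "deny"),
]

def evaluate(rules, source_ip: str, port: int) -> str:
    """Return the action of the lowest-numbered matching rule."""
    ip = ipaddress.ip_address(source_ip)
    for _number, cidr, rule_port, action in sorted(rules):
        if port == rule_port and ip in ipaddress.ip_network(cidr):
            return action          # first match wins, later rules are ignored
    return "deny"                  # nothing matched: implicit deny

blocked_range = evaluate(inbound_rules, "198.51.100.7", 443)   # hits rule 100
everyone_else = evaluate(inbound_rules, "203.0.113.5", 443)    # hits rule 200
wrong_port = evaluate(inbound_rules, "203.0.113.5", 22)        # no rule matches
print(blocked_range, everyone_else, wrong_port)                # deny allow deny
```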



---

## 🧪 Troubleshooting Network Connectivity

If you're facing issues accessing EC2, RDS, or external services, verify the following:

![Connectivity Issues Checklist](./images/encounter_connectivity_issues.png)

1. Internet Gateway is attached to the VPC
2. Route Tables have correct routes
3. Subnet associations are configured properly
4. Security Groups allow the needed traffic
5. Network ACLs aren’t blocking required traffic
6. Instances are in correct subnets and SGs

# AWS Network Security Controls: VPC, Network ACLs, and Security Groups

In this notebook, we'll cover how AWS network security works using VPCs, subnets, Network ACLs, and Security Groups.

## 🏗️ Architecture Overview

AWS networking security is built on these layers:

- **VPC (Virtual Private Cloud):** isolated virtual network in AWS
- **Subnets:** smaller segments within your VPC
- **Network ACLs:** subnet-level stateless firewalls
- **Security Groups:** instance-level stateful firewalls


## 🗺️ What is a VPC?

- A Virtual Private Cloud is a logically isolated network you create in AWS.
- It functions like your own private datacenter.
- VPC is not a firewall, but it defines boundaries within which other security controls work.

**Key role:** Creates an isolated environment for AWS resources.

## 🌐 Subnets

- Subnets split your VPC into smaller networks.
- Types:
    - **Public Subnets:** have Internet access through an Internet Gateway
    - **Private Subnets:** have no direct Internet access
- Every EC2 instance is launched inside a subnet.


## 🔑 Network ACLs (Access Control Lists)

- Acts as a firewall at the subnet level.

**Properties:**

- **Stateless:** You must configure both inbound and outbound rules separately
- Can allow or deny traffic
- Rules are processed in number order (lowest first)
- Applies to all resources within a subnet

**Controls:**

- Inbound rules: control traffic into the subnet
- Outbound rules: control traffic out of the subnet

**Default behavior (default NACL):**

- Inbound: Allow all
- Outbound: Allow all

_Note: A custom NACL denies all traffic until you add rules. NACLs also affect internal subnet-to-subnet traffic._

## 🛡️ Security Groups

- Act as firewalls at the instance (resource) level.

**Properties:**

- **Stateful:** Responses are automatically allowed
- Only allow traffic, no explicit deny rules
- Applied to individual resources
- All rules evaluated together

**Controls:**

- Inbound: controls who can connect to the instance
- Outbound: controls where the instance can connect

**Default behavior:**

- Inbound: Deny all
- Outbound: Allow all
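
The stateful behavior of Security Groups, compared with the stateless Network ACLs from the previous section, can be illustrated with two toy models. These classes are a teaching sketch only and are not AWS APIs; the client address and port numbers are made up.

```python
class SecurityGroupModel:
    """Stateful: an allowed inbound connection is remembered, so its reply passes."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.connections = set()

    def inbound(self, client: str, port: int) -> bool:
        allowed = port in self.inbound_ports
        if allowed:
            self.connections.add((client, port))
        return allowed

    def reply(self, client: str, port: int) -> bool:
        # No outbound rule is consulted for a tracked connection.
        return (client, port) in self.connections


class NetworkAclModel:
    """Stateless: inbound and outbound rules are checked independently."""

    def __init__(self, inbound_ports, outbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.outbound_ports = set(outbound_ports)

    def inbound(self, port: int) -> bool:
        return port in self.inbound_ports

    def reply(self, client_port: int) -> bool:
        # The reply goes to the client's ephemeral port and needs its own rule.
        return client_port in self.outbound_ports


sg = SecurityGroupModel(inbound_ports=[443])
sg_in = sg.inbound("203.0.113.5", 443)
sg_reply = sg.reply("203.0.113.5", 443)

nacl_missing_rule = NetworkAclModel(inbound_ports=[443], outbound_ports=[])
nacl_with_rule = NetworkAclModel(inbound_ports=[443], outbound_ports=range(1024, 65536))
nacl_in = nacl_missing_rule.inbound(443)
reply_dropped = nacl_missing_rule.reply(50000)
reply_passed = nacl_with_rule.reply(50000)

print(sg_in, sg_reply)                       # True True
print(nacl_in, reply_dropped, reply_passed)  # True False True
```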


## 📊 Summary Table

| Feature | VPC | Network ACLs | Security Groups |
| :-- | :-- | :-- | :-- |
| Level | Network boundary | Subnet level | Instance level |
| Stateful | N/A | No (Stateless) | Yes (Stateful) |
| Allow/Deny | N/A | Allow and Deny | Allow only |
| Affects | All resources in the VPC | All traffic in/out of the subnet | Only attached resources |
| Default Inbound | N/A | Allow all (default NACL) | Deny all |
| Default Outbound | N/A | Allow all (default NACL) | Allow all |

## 📝 Best Practices

- Use Security Groups to tightly control instance-level traffic (for example, allow only port 22 for SSH)
- Use Network ACLs for broader subnet-level controls (such as blocking IP ranges)
- NACLs filter traffic at the subnet border
- Security Groups filter at the resource level

Your AWS networks can now be both accessible and secure!

