# What is Data Engineering?

**Data Engineering** is the practice of designing, building, and maintaining systems that enable the collection, storage, processing, and analysis of data at scale.

It provides the foundation that allows data scientists, analysts, and business teams to work with clean, reliable, and accessible data.

---

## Key Responsibilities of a Data Engineer:

- **Build data pipelines** to move data from source systems to data warehouses or data lakes.
- **Clean and transform raw data** so it's usable for analytics and machine learning.
- **Design scalable architectures** for storing and querying large volumes of data.
- **Ensure data quality, security, and governance.**
- **Collaborate with stakeholders** to understand data needs and translate them into technical systems.

---

## Why It Matters

Without data engineering:
- Data scientists and analysts waste time trying to access or clean data.
- Business decisions may be based on incomplete or incorrect data.
- Machine learning models cannot be trained or deployed effectively.

In short, **Data Engineering powers the entire data ecosystem** in a company.

# The Evolution of Data Engineering

In the early days, **data engineering** didn't exist as a dedicated role. The original data engineers were simply **software engineers**, focused on building applications. The **data** generated by these applications was treated as a byproduct — useful mainly for **debugging** or **monitoring**, but not much else.

This data was like **"exhaust" from a car** — a natural outcome, but not something with standalone value.

---

Over time, as companies recognized the **intrinsic value of data**, especially with the rise in **volume and variety**, the role of engineers shifted. Engineers started building systems *specifically* for data ingestion, transformation, and delivery. This led to the formalization of the **Data Engineer** role.

---

## What Is Data Engineering?

In the book *Fundamentals of Data Engineering*, the authors Joe Reis and Matt Housley define data engineering as:

> **"The development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning."**

---

### Data Engineering Sits at the Intersection of:

- **Security**
- **Data Management**
- **DataOps**
- **Data Architecture**
- **Orchestration**
- **Software Engineering**

These are referred to as the **undercurrents** of data engineering: they apply across every stage of the data engineering lifecycle.

---

## The Data Engineering Lifecycle

The lifecycle can be visualized as a flow from data generation to end use cases:

1. **Data Generation & Source Systems**  
   (e.g. app logs, sensor data, uploaded files)

2. **Ingestion**  
   (bringing raw data into your systems)

3. **Storage**  
   (saving raw or processed data efficiently)

4. **Transformation**  
   (cleaning, filtering, enriching, and shaping the data)

5. **Serving**  
   (making processed data available for use)

6. **End Use Cases**  
   (analytics, machine learning, reverse ETL)
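
As a rough illustration of the stages above, here is a minimal Python sketch that pushes a few made-up records through ingestion, storage, transformation, and serving. The record fields, the function names, and the in-memory list standing in for storage are all assumptions for illustration, not part of any real system.

```python
import json

# Hypothetical raw events emitted by a source system (stage 1: generation).
RAW_EVENTS = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "oops"}',  # malformed amount
    '{"user": "a", "amount": "4.5"}',
]

def ingest(lines):
    """Stage 2: bring raw data into the system as parsed records."""
    return [json.loads(line) for line in lines]

def store(records, storage):
    """Stage 3: persist records (an in-memory list stands in for real storage)."""
    storage.extend(records)
    return storage

def transform(records):
    """Stage 4: drop records with unusable amounts and convert types."""
    cleaned = []
    for record in records:
        try:
            cleaned.append({"user": record["user"], "amount": float(record["amount"])})
        except (KeyError, ValueError):
            continue  # skip records that cannot be cleaned
    return cleaned

def serve(records):
    """Stage 5: expose an aggregate that a dashboard or model could consume."""
    totals = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0.0) + record["amount"]
    return totals

storage = []
totals = serve(transform(store(ingest(RAW_EVENTS), storage)))
print(totals)
```

Raw data is stored before it is cleaned here, so the malformed record is still available for later inspection even though it never reaches the served totals.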

---

## 📌 Data Engineering Lifecycle Diagram
![data engineering life cycle](image/data_engineering_lifecycle.png)

---

## Final Thought

As a Data Engineer, your goal is to **transform raw data into useful, reliable, and accessible information** — supporting analytics, machine learning, and other business needs.

# History of Data Engineering – What I Understood

From the video, I understood that data has always existed — not just in the form of numbers or words, but also as natural signals like wind, sound, or light. But the kind of data we focus on in data engineering is **digitally recorded data**, the type that can be stored on a computer and used for analysis.

---

## The Early Days (1960s–1980s)

Data engineering didn’t start as a formal role. The first **digital databases** appeared in the 1960s. In the 1970s, IBM engineers created **relational databases** and **SQL** (Structured Query Language), which is still widely used today.

In the 1980s, **Bill Inmon** came up with the idea of the **data warehouse**, which helped businesses transform and organize data for better decision-making. These early efforts laid the foundation for what would eventually become the field of data engineering.

---

## The 1990s: Growth of BI and the Internet

In the 1990s, companies began building **data pipelines** to support **Business Intelligence (BI)** and reporting. Two important names during this time were **Bill Inmon** and **Ralph Kimball**, who proposed different approaches to data modeling for analytics.

Also, the rise of the **Internet** (and companies like Amazon) created massive growth in data. This made data pipelines and infrastructure even more important.

---

## The 2000s: Big Data Era Begins

In the early 2000s, tech giants like Google, Amazon, and Yahoo faced data at a scale that traditional systems couldn’t handle. This led to the **Big Data movement**.

Here’s what stood out:
- **Big Data** was defined by the 3 V's: **Volume**, **Velocity**, and **Variety**.
- In **2004**, Google released the **MapReduce** paper, which introduced a new way to process huge amounts of data.
- In **2006**, Yahoo open-sourced **Apache Hadoop**, inspired by Google’s paper. This was a major shift — suddenly, even small companies could work with massive datasets.
- In **2006**, **Amazon Web Services (AWS)** launched **EC2** and **S3** (with **DynamoDB** following in 2012), making cloud computing broadly accessible.

These developments created a new generation of **data engineers**, especially those focused on solving problems at scale.

---

## The 2010s: Cloud and Real-Time Data

By the 2010s, public cloud platforms like **AWS, Google Cloud, and Azure** became mainstream. This made it much easier to manage and process data without setting up massive infrastructure.

A big shift also happened from **batch processing** (processing data in chunks) to **real-time streaming**, where data could be processed as it arrived. Tools like **Apache Kafka** and **Apache Spark** helped with this transition.
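
To make the difference concrete, here is a small sketch in plain Python. It does not use Kafka or Spark; the function names and the sample values are assumptions chosen only to contrast the two processing styles.

```python
from typing import Iterable, Iterator, List

def batch_total(events: List[float]) -> float:
    """Batch: wait until the whole chunk is available, then process it once."""
    return sum(events)

def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each event arrives."""
    running = 0.0
    for value in events:
        running += value
        yield running

events = [5.0, 3.0, 2.0]
batch_result = batch_total(events)               # one answer after all data is in
stream_results = list(streaming_totals(events))  # an updated answer after every event
print(batch_result, stream_results)
```

Both styles end on the same total; the streaming version simply makes intermediate results available earlier.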

Eventually, the term **Big Data** faded because working with large-scale data became a common part of engineering. It wasn’t a “special case” anymore — just part of the job.

---

## Today: The Role of Modern Data Engineers

Today, **data engineering** is a central and strategic part of any data-driven business. Data engineers now work with:
- Cloud-native tools (like AWS Glue, Snowflake, Redshift)
- Workflow orchestrators (like Airflow or Dagster)
- Real-time data processing tools (like Kafka and Flink)

# Stakeholders in the Data Engineering Workflow

In a data engineering workflow, the data engineer plays a central role — acting as a bridge between upstream and downstream stakeholders.

---

## Downstream Stakeholders

Downstream stakeholders are the consumers of processed and served data. These include:

- **Analysts**
- **Data Scientists**
- **Machine Learning Engineers**
- **Salespeople**
- **Marketing Professionals**
- **Executives**

Each stakeholder group has different data needs, goals, and expectations.

To serve downstream stakeholders effectively, the data engineer must understand:
- **How often** data is needed (e.g., real-time vs daily)
- **What information** is needed (specific tables, metrics, joins, aggregations)
- **How much latency** is acceptable (seconds, hours, days)

An example:  
A business analyst may need to run SQL queries for dashboards and trend analysis. To support this, the data engineer must ensure:
- Data is refreshed frequently enough
- Queries run efficiently (through pre-aggregations or materialized views)
- Time zones and definitions (like start/end of "day") are consistent and well-aligned
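
The last point is easy to underestimate. The sketch below shows how one sale can land on two different "days" depending on the time zone used for reporting. The timestamp and the fixed UTC-5 reporting zone are made-up assumptions, and a fixed offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reporting zone: a fixed UTC-5 offset (no daylight saving handling).
REPORTING_TZ = timezone(timedelta(hours=-5))

def business_day(ts: datetime, tz: timezone) -> str:
    """Return the calendar date of a timestamp as seen in the given time zone."""
    return ts.astimezone(tz).date().isoformat()

# A sale recorded at 02:30 UTC.
sale = datetime(2024, 3, 15, 2, 30, tzinfo=timezone.utc)

day_utc = business_day(sale, timezone.utc)
day_local = business_day(sale, REPORTING_TZ)
print(day_utc, day_local)
```

If the pipeline buckets by UTC while the analyst thinks in local time, daily totals will disagree, so the definition of "day" is worth agreeing on up front.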

---

## Upstream Stakeholders

Upstream stakeholders are the owners of source systems that generate raw data. Most often, these are **software engineers**, either within the organization or from external systems.

From upstream stakeholders, the data engineer needs to understand:
- **Data volume** (how much data to expect)
- **Frequency** (how often data is emitted)
- **Format** (JSON, CSV, Avro, etc.)
- **Data security** and **compliance** considerations
- **Schema changes**, system outages, or disruptions to the data flow
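
Schema changes are a common cause of broken pipelines. One possible way to catch them early is to compare each incoming record against the schema you expect. In this sketch the field names, the expected types, and the `schema_problems` helper are all hypothetical.

```python
# Hypothetical expected schema for incoming order records.
EXPECTED_SCHEMA = {"order_id": int, "region": str, "amount": float}

def schema_problems(record: dict) -> list:
    """Return a list of human-readable differences from the expected schema."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {field}")
    return problems

good = {"order_id": 1, "region": "EU", "amount": 9.99}
drifted = {"order_id": "1", "region": "EU", "total": 9.99}  # renamed and retyped upstream

print(schema_problems(good))
print(schema_problems(drifted))
```

A check like this can feed an alert or quarantine step, so the pipeline reports drift instead of failing silently.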

In many ways, the data engineer becomes the **consumer** of upstream data, similar to how analysts consume the outputs produced by the data engineer.

Establishing a strong communication loop with upstream teams can help anticipate changes and design resilient data pipelines.

---

## Core Responsibilities

The data engineer must:
- Translate upstream raw data into usable, reliable formats
- Serve that data to downstream users in a way that supports their specific goals
- Maintain communication in both directions to ensure smooth data operations and system alignment

---

## Visual Summary

![Data Engineering Stakeholders Diagram](image/workflow_data_engineering.png)

---


# Requirements Gathering for Data Engineering Systems

Before building any data system (writing code, provisioning infrastructure, or deploying to the cloud), it is critical to **understand stakeholder needs** and translate them into system requirements.

---

## Types of Requirements

### 1. Business Requirements  
These define the **high-level goals of the organization**, such as:
- Increasing revenue
- Growing user base
- Improving customer experience

### 2. Stakeholder Requirements  
These are the **individual needs** of downstream stakeholders (analysts, data scientists, executives, etc.) — the tasks they need data to help accomplish.

### 3. System Requirements  
System requirements define what the system must do to meet business and stakeholder needs. These are split into:

- **Functional Requirements** (the **what**)  
  - Example: "Refresh the dashboard data every 24 hours"
  - Example: "Send alerts when data anomalies are detected"

- **Non-Functional Requirements** (the **how**)  
  - Example: "Use a cloud-based ingestion system that supports real-time streaming"
  - Example: "Ensure latency is under 2 minutes and uptime is 99.9%"
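
A target such as "99.9% uptime" is easier to reason about once it is converted into an allowed amount of downtime. The helper below is a small arithmetic sketch; the function name and the 30-day and 365-day periods are assumptions for illustration.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: int) -> float:
    """Minutes of downtime allowed in a period for a given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

monthly = downtime_budget_minutes(99.9, 30)   # about 43.2 minutes per 30 days
yearly = downtime_budget_minutes(99.9, 365)   # about 525.6 minutes per year
print(round(monthly, 1), round(yearly, 1))
```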

---

## What to Consider in System Requirements

- Business goals and stakeholder needs
- Features and attributes of your data products
- Compute, memory, and storage capacity
- Data freshness and latency expectations
- Security and regulatory compliance
- Cost constraints

---

## The Process of Requirements Gathering

Requirements gathering starts with **talking to stakeholders**. However, stakeholders rarely present their needs in the form of technical specifications. Instead, they express **business goals** or frustrations.

It’s the data engineer’s responsibility to:
- Ask the right questions
- Translate ambiguous needs into clear system requirements
- Tailor conversations based on the stakeholder's technical background

# Stakeholder Conversation: Data Engineer & Data Scientist

This section summarizes a mock stakeholder conversation between Joe (Data Engineer) and Colleen (Data Scientist). The goal of the conversation is to gather requirements for a data system that supports marketing and product analytics.

---

## 🎯 Objective

To understand the challenges faced by the data scientist and translate them into **system requirements** that the data engineer can address through ingestion, transformation, and serving of data.

---

## 🔍 Key Problems Identified

### 1. Limited and Manual Data Access
- Sales data resides in a **production database**.
- Colleen does not have direct access because of the risks involved; she receives **daily data dumps** instead.
- Data is delivered in **CSV/JSON formats**.

### 2. Dirty and Overwhelming Data
- 90% of the data received is not useful.
- Colleen spends **80% of her time** cleaning and preparing the data.
- Scripts **frequently break** due to anomalies or schema changes.

### 3. Outdated Data for Analytics
- Marketing team wants **real-time** sales data by region.
- Current dashboards show data that is **2 days old**.

---

## 📊 Current Use Cases

### Dashboards for Marketing
- Show 30-day sales trends by **product category** and **region**.
- Support **drill-down** to hourly/product-level insights.
- Require fresher data for **campaign timing and evaluation**.
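
The kind of query behind such a dashboard might look like the sketch below, which uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. The `sales` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, region TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2024-03-01", "EU", "toys", 20.0),
        ("2024-03-02", "EU", "toys", 30.0),
        ("2024-03-02", "US", "books", 15.0),
        ("2024-01-01", "EU", "toys", 99.0),  # older than 30 days, excluded
    ],
)

def sales_last_30_days(as_of: str):
    """Total sales per category and region for the 30 days ending on as_of."""
    query = """
        SELECT category, region, SUM(amount) AS total
        FROM sales
        WHERE sold_on > date(?, '-30 days') AND sold_on <= ?
        GROUP BY category, region
        ORDER BY category, region
    """
    return conn.execute(query, (as_of, as_of)).fetchall()

rows = sales_last_30_days("2024-03-10")
print(rows)
```

Drill-down to hourly or product level would need finer-grained timestamps and a product column, which this sketch leaves out.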

### Product Recommendation Engine
- Currently based on **most popular products**.
- Future goal: **personalized recommendations** using content-based filtering.
- Needs:
  - Real-time or recent **user behavior data**
  - A way to **deploy and serve model outputs**
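
For intuition, content-based filtering recommends items whose attributes resemble those of items a user already engaged with. The sketch below is a toy version using cosine similarity; the product names and feature vectors are invented, and this is not the model discussed in the conversation.

```python
import math

# Hypothetical product feature vectors: [is_toy, is_book, for_kids].
PRODUCTS = {
    "puzzle": [1.0, 0.0, 1.0],
    "board_game": [1.0, 0.0, 0.0],
    "novel": [0.0, 1.0, 0.0],
    "picture_book": [0.0, 1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(viewed: str, top_n: int = 2):
    """Rank the other products by similarity to the one just viewed."""
    scores = {
        name: cosine(PRODUCTS[viewed], vector)
        for name, vector in PRODUCTS.items()
        if name != viewed
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recs = recommend("puzzle")
print(recs)
```

From the data engineer's side, the takeaway is that the model needs product attributes and recent user behavior delivered reliably, whatever algorithm is used.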

---

## 🧠 Insights from the Conversation

- Need to **automate** data ingestion and processing.
- Requirement for **low-latency** data (exact threshold TBD).
- Marketing may benefit from:
  - Data updated **hourly or more frequently**
  - Ability to **target campaigns** using fresh user data
- Additional follow-up needed with marketing team to clarify:
  - **Actionable use cases**
  - **Required freshness/latency**

---

## ✅ Initial System Requirements

### Functional Requirements (What the system should do)
- Ingest sales data more **frequently and automatically**
- Transform and clean data to remove irrelevant parts
- Serve aggregated, cleaned data to dashboards
- Support training and deployment of **ML recommendation models**

### Non-Functional Requirements (How the system should work)
- Handle **schema changes gracefully**
- Maintain **data freshness** suitable for near real-time needs
- Be **resilient** and monitored for anomalies or failures
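
Once a freshness threshold is agreed, it can be monitored with a very small check. In this sketch the `is_fresh` helper, the one-hour threshold, and the timestamps are assumptions; the real threshold is still to be confirmed with the marketing team.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the most recent load is no older than the allowed maximum age."""
    return now - last_loaded_at <= max_age

now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
two_days_old = now - timedelta(days=2)       # like the current dashboards
thirty_minutes_old = now - timedelta(minutes=30)

stale_ok = is_fresh(two_days_old, now, timedelta(hours=1))
recent_ok = is_fresh(thirty_minutes_old, now, timedelta(hours=1))
print(stale_ok, recent_ok)
```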

---

## 📌 Next Steps

- Follow up with **marketing stakeholders** to refine data freshness and use case requirements.
- Design a pipeline that supports:
  - Reliable and automated **data ingestion**
  - Scalable **data transformation and storage**
  - Efficient **data serving** for both dashboards and ML use cases

---

This conversation highlights how initial stakeholder interviews are essential for uncovering not only technical needs but also organizational goals and pain points.

# Breaking Down a Stakeholder Conversation: Requirements Gathering

In this section, we reflect on the mock conversation between the data engineer and the data scientist (Colleen) to extract actionable requirements and outline a reusable approach to stakeholder interviews in data engineering projects.

---

## 🎯 Purpose

To demonstrate how to:
- Extract system requirements from stakeholder conversations.
- Clarify vague terminology like “real-time.”
- Identify additional stakeholders to speak with.
- Translate business needs into data engineering solutions.

---

## ✅ Key Steps in Requirements Gathering

### 1. Understand Existing Systems and Pain Points
- Learn what systems are currently in place.
- Identify inefficiencies, delays, manual steps, and risks.
- In this case:
  - Only daily data dumps from the production database.
  - Data is messy and frequently breaks scripts due to schema changes.

### 2. Ask What Actions Stakeholders Will Take with the Data
- Don't just ask what data they want — ask what decisions or operations they will perform.
- Helps clarify how critical the data is and what latency is acceptable.
- Example:
  - Dashboards are used to optimize marketing campaigns.
  - Recommendation engine is used to influence customer purchases during browsing.

### 3. Confirm Your Understanding
- Summarize and repeat back what you've heard.
- Ensure alignment before building any solution.
- In this case:
  - Automating ingestion and transformation would be a huge help.
  - Serving clean, near real-time data would improve dashboard and ML performance.

### 4. Identify Other Stakeholders
- Your current stakeholder may not have all the information.
- Additional conversations are needed with:
  - **Software Engineers** (who manage the source system)
  - **Marketing Team** (to clarify what "real-time" means and what they’ll do with the data)

---

## 🧩 Requirements Identified

### Functional Requirements (What the system should do)
- Ingest data from the sales platform.
- Transform and clean the data automatically.
- Serve data to:
  - Dashboards (marketing use case)
  - Recommendation engine (ML use case)

### Non-Functional Requirements (How the system should perform)
- Define and meet latency thresholds (e.g., hourly vs real-time).
- Handle schema changes without failure.
- Ensure reliability and scalability of pipelines.

---

## 💡 Important Tactic

### Ask: “What action will you take based on the data?”
- Avoid letting stakeholders define the system themselves.
- Focus on understanding their **business use case** first.
- Then derive the appropriate technical implementation.

---

### Key Takeaway:
> The foundation of a successful data engineering project is built during **requirements gathering**, where listening, translating, and aligning business goals with technical execution are critical.

---

# Thinking Like a Data Engineer

In any data engineering project, success depends on more than just technical implementation. It starts with understanding the business context and ends with continuous iteration based on stakeholder needs. Below is a four-stage framework for thinking like a data engineer.

---

## 🧭 Framework Overview

This framework consists of four main stages:

1. **Identify business goals & stakeholder needs**
2. **Define system requirements**
3. **Choose tools & technologies**
4. **Build, evaluate, iterate & evolve**

This is not a strict linear process — in real-world scenarios, you'll often revisit earlier stages as goals and technologies evolve.

---

## 📌 Stage 1: Identify Business Goals & Stakeholder Needs

- Identify business goals and the stakeholders you are serving.
- Explore existing systems and what pain points stakeholders currently face.
- Ask stakeholders: **"What actions will you take using the data product?"**

---

## 📌 Stage 2: Define System Requirements

- Translate stakeholder needs into **functional requirements** (what the system should do).
- Define **non-functional requirements** (how the system should perform).
- Document and confirm requirements with stakeholders to ensure alignment.

---

## 📌 Stage 3: Choose Tools & Technologies

- Identify tools and technologies that meet the non-functional requirements.
- Conduct a **cost-benefit analysis** to compare different options.
- **Prototype and test** your design to check if it meets stakeholder expectations.

---

## 📌 Stage 4: Build, Evaluate, Iterate & Evolve

- Build and deploy the **production-ready system**.
- Continuously **monitor and evaluate** its performance.
- Iterate and evolve the system based on feedback and changing needs.

---

## 📊 Visual Summary
![Thinking Like a Data Engineer](Introduction_to_Data_Engineering/Module1/image/framework.png)


## 🌩️ Introduction to Cloud Computing on AWS

### What is the Cloud?

At **AWS**, the cloud is described as:

> **“The on-demand delivery of IT resources over the Internet with pay-as-you-go pricing.”**

This means:
- You get computing/storage/networking resources instantly, whenever needed.
- You shut them down when not in use.
- You only pay for what you use.

This is very different from traditional on-premise setups, where you:
- Purchase hardware up-front.
- Commit to long-term investments.
- Must manage capacity and scaling yourself.

---

### 🧱 Core AWS Resources

These core resources are the **building blocks** of most cloud systems:

#### 1. **Compute Resources**
Places to run code:
- Virtual Machines (e.g., EC2)
- Container services (e.g., ECS, EKS)
- Serverless functions (e.g., AWS Lambda)

#### 2. **Storage Resources**
Places to store data:
- **Amazon S3** (object storage)
- **Amazon EBS** (block storage)
- **Databases** (RDS, DynamoDB, Neptune, etc.)

#### 3. **Networking Resources**
Connect services internally or to the Internet:
- **Amazon VPC** – Your own private network in the cloud.

---

### 🚀 Benefits of Cloud Computing

#### ✅ **Scalable & Elastic**
- You don't have to estimate or provision storage ahead of time.
- AWS services like **S3** scale automatically.
- This makes it easier to absorb traffic spikes or changes in demand.

#### ✅ **Cost-Efficient**
- No upfront investment.
- Like electricity — you only pay for what you use.

---

### 🌍 AWS Global Infrastructure

AWS resources are not tied to a single data center. They are distributed across:

#### 🔹 **Regions**
- A Region is a **geographic area** (e.g., `us-east-1`, `ap-south-1`).
- Examples:  
  - **US East (N. Virginia)**  
  - **Asia Pacific (Mumbai)**  
  - **Europe (Frankfurt)**

#### 🔹 **Availability Zones (AZs)**
- Each Region has **multiple AZs**; each AZ consists of **one or more isolated data centers**.
- AZs are designed for **fault tolerance**:
  - If one AZ fails (e.g., due to power outage or flood), the others handle the load.

#### 🛠️ Region → AZ → Data Centers  

- One or more data centers → 1 AZ
- Multiple AZs → 1 AWS Region

---

### 🌐 High-Speed Global Network

AWS connects its AZs and data centers using:
- A global network of **fiber cables**.
- High-speed **low-latency** connections.
- This helps services stay **available**, **reliable**, and **fast**.

---

### 🧩 Combining Services as a Data Engineer

As a data engineer, you’ll:
- Combine multiple AWS services like building blocks.
- Build solutions for **data ingestion**, **transformation**, **orchestration**, and **analytics**.

---


# ☁️ Introduction to AWS Core Services & Concepts

## 🌐 What is the AWS Cloud?

- **Cloud computing on AWS** is the on-demand delivery of IT resources over the internet with **pay-as-you-go pricing**.
- No upfront cost, no need to manage physical servers.
- You only pay for what you use — like electricity.

---

## 📦 Core Categories of AWS Services

We will break this down into 5 major categories:

1. **Compute**
2. **Networking**
3. **Storage**
4. **Databases**
5. **Security**

---

### ⚙️ 1. Compute – *Processing Power*

- **Amazon EC2 (Elastic Compute Cloud):**  
  Virtual machines (VMs) in the cloud where you can run your code, host applications, or build pipelines.

  - You can choose the OS (Linux/Windows), storage, and CPU/RAM specs.
  - You have **full control** over the environment.
  - You can **scale horizontally** by launching many EC2 instances.

- **AWS Lambda:**  
  A serverless service to run code in response to events. You **don’t manage servers** at all.

- **Amazon ECS / EKS:**  
  Managed services to run **containers** (Docker/Kubernetes).
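
To give a feel for the serverless option, here is a minimal sketch of a Python Lambda function reacting to an S3 upload notification. `lambda_handler(event, context)` is the conventional handler signature for Python functions; the bucket name, the object key, and the trimmed-down sample event are made up, and the handler only collects object locations instead of doing real processing.

```python
def lambda_handler(event, context):
    """Collect the location of each object named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        objects.append(f"s3://{bucket}/{key}")
    return {"processed": objects}

# Local dry run with a hand-written event containing only the fields used above.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "raw/sales.csv"}}}
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

Because the handler is an ordinary function, it can be exercised locally like this before it is deployed.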

---

### 🌐 2. Networking – *Connectivity & Isolation*

- **Amazon VPC (Virtual Private Cloud):**  
  A private network inside AWS where your resources (like EC2s, RDS) reside.

  - You control subnets, IP addresses, and routing.
  - Ensures **secure and isolated networking**.
  - Resources such as EC2, RDS, and Redshift are launched **inside a VPC**.

- A VPC spans all the **Availability Zones (AZs)** in its **Region**.

---

### 💾 3. Storage – *Saving Data*

- **Amazon S3 (Simple Storage Service):**
  - Object storage for **any kind of file or unstructured data**.
  - Use it for storing logs, images, videos, documents, etc.
  - Automatically scales and is highly durable and elastic.

- **Amazon EBS (Elastic Block Store):**
  - Block storage used as **disks attached to EC2 instances**.
  - High performance, low latency.

- **Amazon EFS (Elastic File System):**
  - Managed file system that can be **shared across multiple EC2s**.
  - Works like your laptop file system (hierarchical).

---

### 🗃️ 4. Databases – *Structured Data*

- **Amazon RDS (Relational Database Service):**
  - Fully managed relational databases (MySQL, PostgreSQL, etc.).
  - Easy to scale, secure, and back up.

- **Amazon Redshift:**
  - A **data warehouse** built for analytics.
  - Used for running **complex queries on large datasets** quickly.
  - Ideal for **business intelligence (BI)** and **dashboards**.

---

### 🔐 5. Security – *Keeping Everything Safe*

- **Shared Responsibility Model:**
  - AWS secures the cloud (hardware, data centers, virtualization).
  - **You secure what you build inside the cloud** (your data, your code, your configurations).

  📌 *Analogy:*  
  Think of AWS like an apartment building.  
  - AWS ensures the building is secure.
  - You must lock your own door and protect your own apartment.

---

## 🧠 What is an Instance in Cloud?

> A **cloud instance** is a **virtual machine** (VM) that runs on hardware managed by the cloud provider.

### 🧰 Features of a Cloud Instance:
- Acts like a **remote computer**
- You can:
  - Install OS (Linux/Windows)
  - Run code and scripts
  - Host applications
- Fully configurable and scalable
- Can be started, stopped, or terminated anytime

### 🔁 Analogy:
> A cloud **instance** is like renting a room in a hotel.  
You don’t own the hotel (cloud servers), but you get **full access** to your own private room (VM).

### ✅ Real Use Case:
- A data engineer can use **EC2 instances** to:
  - Write Python code
  - Ingest data from APIs or S3
  - Process data and store the result in **Redshift** or **RDS**

---

## 🧠 Quick Clarification

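- ✅ S3 is **object storage**, ideal for storing files long-term. It is not a database, although services such as Amazon Athena can run SQL queries over files stored in it.
- ✅ Redshift is a **data warehouse**: it also stores data persistently, but it is optimized for **fast analytical queries**.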

---

# ☁️ Compute – Amazon Elastic Compute Cloud (EC2)

## What is EC2?

- Amazon EC2 (Elastic Compute Cloud) is a service that provides **virtual servers** (also called **virtual machines** or instances) in the cloud.
- Lets you **run applications, code, or pipelines** on a machine you can configure (OS, CPU, memory).
- Highly **scalable, elastic, and pay-as-you-go**.

## What is a Server vs Virtual Server?

| Concept             | Physical Server                             | Virtual Server (VM)                              |
|---------------------|---------------------------------------------|--------------------------------------------------|
| Hardware            | Real CPU, RAM, Storage                      | Emulated (software-based) hardware               |
| OS & Applications   | Installed directly on machine               | Installed on top of virtualized environment      |
| Flexibility         | Limited by physical resources               | Scales as needed using shared resources          |

## 🧠 What is a Hypervisor?

- A **Hypervisor** is software that allows multiple virtual machines to run on a single physical machine.
- It **distributes CPU, memory, and disk** among different VMs.
- Acts as a **bridge between physical hardware and VMs**.

## 🧱 Components of a Virtual Machine

- **Virtual hardware** (emulated CPU, memory, etc.)
- **Operating System** (Linux, Windows, etc.)
- **Applications** (your code or services)

## 💡 Why Virtualization?

- Efficient use of underlying hardware.
- Multiple VMs share the same physical machine.
- Reduces cost and improves scalability.


![rs_Vs_vs](introduction_to_Data_Engineering/Module1/image/rs_vs_vs.png)

---

## 📏 EC2 Instance Types and Pricing

- **Instance Types**:
  - **General Purpose** (e.g., t3)
  - **Compute Optimized** (e.g., c5)
  - **Memory Optimized** (e.g., r5)
  - **Storage Optimized** (e.g., i3)
  - **Accelerated Computing** (e.g., p3)

- **Naming Example**: `t3a.micro`
  - `t` = instance family
  - `3` = generation
  - `a` = optional capabilities (e.g., AMD-based)
  - `micro` = instance size
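
The naming scheme above can be expressed as a small parser. This is a sketch that assumes the common `<family><generation><attributes>.<size>` shape; the `parse_instance_type` helper is hypothetical, and some special instance names do not follow this pattern.

```python
import re

# Assumed pattern: letters (family), digits (generation), optional letters (attributes), dot, size.
INSTANCE_PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z-]*)\.(?P<size>[a-z0-9-]+)$"
)

def parse_instance_type(name: str):
    """Split an instance type name into its parts, or return None if it does not match."""
    match = INSTANCE_PATTERN.match(name)
    return match.groupdict() if match else None

parsed = parse_instance_type("t3a.micro")
print(parsed)
print(parse_instance_type("c5.large"))
```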

### 💰 EC2 Pricing Options

- **On-Demand**: No long-term commitments. Pay per hour or second.
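- **Spot Instances**: Use spare AWS capacity at a lower cost; AWS can reclaim the instance at short notice, so they suit interruptible workloads.
- **Reserved Instances**: Commit to 1 or 3 years in exchange for a discount; best for steady, predictable workloads.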
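
A simple way to compare On-Demand with a commitment is to ask how much of the time the instance has to be running before the commitment pays off. The sketch below uses made-up hourly rates (not AWS prices) and ignores upfront payment options.

```python
def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours an instance must run before the commitment is cheaper.

    On-Demand is paid only for hours used; the reserved rate is paid for every hour.
    """
    return reserved_hourly / on_demand_hourly

# Made-up rates for illustration only, not AWS prices.
ON_DEMAND_RATE = 0.10
RESERVED_RATE = 0.06

threshold = break_even_utilization(ON_DEMAND_RATE, RESERVED_RATE)
print(f"Commitment pays off above {threshold:.0%} utilization")
```

With these assumed rates, an instance that runs less than about 60% of the time would be cheaper On-Demand.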

---

## ✅ Summary

- EC2 is like a **cloud-based computer** where you run pipelines or services.
- You can **scale vertically (size)** or **horizontally (add more instances)**.
- It connects with tools like **S3, Redshift**, and operates inside a **VPC** (Virtual Private Cloud).
- You control everything inside the instance (OS, apps, firewall, etc.).

# 🌐 Networking – VPC and Subnets

## 🧱 What is a VPC (Virtual Private Cloud)?

A **VPC** is an isolated private network where you can launch your AWS resources.

- Exists **within a region**, and spans **multiple availability zones**.
- Helps isolate and **secure your EC2 instances, databases**, and more.
- Think of it as a **private floor in an office building** just for your team.

### 🔐 Analogy

Imagine an **office building** (the AWS Region). Your company rents an **entire private floor** (VPC).  
You install security doors, choose who gets access, and control whether rooms connect to the internet.

### 💡 Real Example

- You launch EC2 instances in a VPC.
- Unless you configure internet access, they're **not reachable from the outside**.
- You can add Internet Gateways or NAT Gateways to allow selected traffic.

---

## 🪟 What is a Subnet?

A **subnet** is a smaller network within your VPC — a **room inside your floor**.

- Can be **public** (has a route to an Internet Gateway, so its resources can be reached from the internet) or **private** (no direct route from the internet).
- Helps you group resources by **function** and **access level**.
- Each subnet exists in **one availability zone**.
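
Subnets are carved out of the VPC's IP address range. The sketch below uses Python's standard `ipaddress` module to split an example range; the `10.0.0.0/16` block and the choice of `/24` subnets are assumptions for illustration, and nothing here talks to AWS.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC address range
subnets = list(vpc.subnets(new_prefix=24))  # split it into /24 subnets

public_subnet = subnets[0]   # e.g. for web servers
private_subnet = subnets[1]  # e.g. for databases

print(public_subnet, private_subnet, len(subnets))

# Check which subnet a given address belongs to.
db_address = ipaddress.ip_address("10.0.1.25")
print(db_address in private_subnet, db_address in public_subnet)
```

Whether a subnet is public or private is decided by its routing, not by its address range; the split here only shows how the ranges nest.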

### 🔐 Analogy

On your private office floor (VPC):
- Some rooms (subnets) like the **reception area** are open to visitors (public).
- Others like the **finance room** are locked and accessible only internally (private).

---

## 🛠️ Use Case: Web App Deployment

| Component         | Placement        | Notes                                               |
|------------------|------------------|-----------------------------------------------------|
| Web Server (EC2) | Public Subnet     | Accepts user requests via the internet              |
| Database (RDS)   | Private Subnet    | Not internet-facing; only web server can access it  |

---

## 🧠 Summary

- **VPC** = your private network on AWS
- **Subnet** = slices of that network for public/private access
- VPCs ensure **isolation**, **security**, and **control** over cloud networking.

# Security – AWS Shared Responsibility Model

When you host your applications and resources in the cloud, you’re offloading the heavy lifting of managing the physical hardware to the cloud provider. The **security of the physical facility** is the **responsibility of the cloud provider**. However, you **still own your data in AWS**, and **you are responsible for managing its security**. This is known as the **Shared Responsibility Model** on AWS.

---

## 🔒 AWS is responsible for **security of the cloud**

This includes:

- Maintaining, protecting, and securing the **physical facilities** (data centers).
- Securing the **global infrastructure**:
  - Fiber optic cables connecting regions
  - Software and hardware running AWS services

---

## 🔐 You are responsible for **security in the cloud**

This includes:

- **Protecting your data** (at rest and in transit)
- Managing **who can access the data**
- Configuring **access control** (IAM roles, policies, encryption)
- Setting up **networking rules** (like security groups and network ACLs in your VPC)
- Ensuring **proper configurations** for the services you use
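
As an example of access control on your side of the model, an IAM policy is a JSON document listing which actions are allowed on which resources. The sketch below builds a read-only policy for a single S3 bucket; the bucket name is hypothetical, and a real policy should be reviewed against your own requirements.

```python
import json

BUCKET = "example-analytics-bucket"  # hypothetical bucket name

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow reading objects inside the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # Allow listing the bucket's contents.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

policy_json = json.dumps(read_only_policy, indent=2)
print(policy_json)
```

Granting only the actions a role needs (least privilege) limits the damage if credentials are leaked or a pipeline misbehaves.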

> 🧠 **Note**: Your responsibilities depend on the services you choose. For example, EC2 requires you to manage and patch the operating system, whereas managed services like S3 or Lambda leave you with fewer responsibilities.

---

## ✅ Why it matters

It’s essential to understand and apply the Shared Responsibility Model because:

- You **control access** to the data you store in the cloud
- You are accountable for **securing pipelines** and workloads
- Misconfiguration on your part can lead to **security breaches**

---

### 🏢 Analogy: Apartment Building

| AWS                        | You                               |
|----------------------------|-----------------------------------|
| Maintains the building     | Lock your door                    |
| Ensures elevators work     | Keep your valuables safe          |
| Protects wiring & plumbing | Grant access to guests (or not!)  |

Both parties need to do their part to ensure full security 🔐.