# 📌 Building a Data Pipeline: Requirements Gathering and System Design

---

## 🧱 Hierarchy of Needs

To design a reliable and effective data pipeline, it is important to understand the hierarchy of needs, starting from business goals to system requirements.
![hierarchy of needs](./images/hierarchy_of_needs.png)

---

## 🎯 Business Goals (CTO Conversation)

The CTO shared high-level priorities for the company:

- Expanding market share through new product offerings and international expansion
- Refactoring legacy systems to avoid outages
- Transitioning from batch-based systems to streaming architectures (Kinesis/Kafka)
- Improving customer retention with a personalized recommendation engine
- Minimizing the software–data divide by ensuring data is generated in an analytics-ready schema

---

## 👥 Stakeholder Needs

### 🔽 Downstream Stakeholders – Marketing Team & Data Scientists

**Dashboards:**
- Current dashboards show data with a 2-day delay.
- The marketing team needs access to near real-time data (within 1 hour) to act on trending products.
- Quick action is required to launch region-specific campaigns during sales spikes that last a few hours.

**Recommendation System:**
- Currently built on weekly popular products for all users.
- Needs to shift toward personalized recommendations based on individual user history and cart behavior.

---

### 📄 Functional Requirements

#### 📊 Dashboards

- Process and deliver product sales data with no more than 1-hour delay.
- Enable filtering by region, category, product, and time (down to hourly granularity).

![Dashboard Functional Requirements](./images/functional_requirements_dashboard.png)
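
The filtering and hourly-granularity requirement above can be illustrated with a small aggregation sketch. This is only a rough sketch under assumptions: the event fields (`ts`, `region`, `category`, `product`, `amount`) and the function name `hourly_sales` are hypothetical and do not come from the source system.

```python
from collections import defaultdict
from datetime import datetime


def hourly_sales(events, region=None, category=None):
    """Sum sales per (hour, region, category, product), with optional filters."""
    totals = defaultdict(float)
    for e in events:
        if region is not None and e["region"] != region:
            continue
        if category is not None and e["category"] != category:
            continue
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0, microsecond=0)
        totals[(hour.isoformat(), e["region"], e["category"], e["product"])] += e["amount"]
    return dict(totals)


# Hypothetical sample events, for illustration only.
events = [
    {"ts": "2024-05-01T10:05:00", "region": "EU", "category": "shoes", "product": "A1", "amount": 40.0},
    {"ts": "2024-05-01T10:45:00", "region": "EU", "category": "shoes", "product": "A1", "amount": 60.0},
    {"ts": "2024-05-01T11:10:00", "region": "US", "category": "hats", "product": "B2", "amount": 15.0},
]
eu = hourly_sales(events, region="EU")
print(eu)
```

In a real pipeline this grouping would more likely run in the warehouse or a stream processor; the sketch only shows the shape of the result the dashboard needs.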

#### 🤖 Recommendation System

- Ingest user activity and purchase behavior.
- Serve personalized product recommendations in real time or near real time.

![Recommender Functional Requirements](./images/functional_requirements_recommender.png)

---

### 🔼 Upstream Stakeholder – Software Engineer (Source System Owner)

**Challenges Identified:**
- Direct access to the production database is restricted for safety.
- Data is shared via downloadable files (daily exports), leading to delays.
- Schema changes occur frequently due to new feature rollouts and regional expansions.
- Read replicas and APIs are proposed as alternatives for safer, real-time data access.
- Data delivery may be affected by server outages or maintenance tasks.
- Schema change notifications can be shared one week in advance.

---

### 📄 Non-Functional Requirements

#### 📊 Dashboards

- Low-latency data delivery (<1 hour)
- Scalable to high volume spikes
- Automatic schema validation and data quality checks
- Fault tolerance and graceful fallback

![Dashboard Non-Functional Requirements](./images/non_functional_dashboard.png)
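
Since the source schema changes frequently, automatic schema validation could start as simply as the check below. It is a minimal sketch, assuming a hypothetical expected schema (`EXPECTED_SCHEMA`) and a hypothetical helper name (`validate_record`); a production pipeline would more likely rely on a schema registry or a data quality framework.

```python
# Hypothetical expected schema: field name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "product": str, "region": str, "amount": float}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Flag fields the pipeline does not know about (a likely sign of a schema change).
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"order_id": 7, "product": "A1", "region": "EU", "amount": 9.5}
bad = {"order_id": "7", "product": "A1", "amount": 9.5, "currency": "EUR"}
print(validate_record(good))
print(validate_record(bad))
```

Records that fail the check can be routed to a quarantine location rather than dropped, which supports the graceful-fallback requirement.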

#### 🤖 Recommendation System

- <1 second response time for recommendation queries
- Reliable ingestion of customer activity streams
- Support for both batch (training) and streaming (serving) workflows
- Scalable under increasing user load

![Recommender Non-Functional Requirements](./images/non_functional_recommender.png)

---

## 🧭 System Design Summary

Business goals were clarified through discussion with the CTO. Stakeholder needs were gathered from the marketing team and data scientists. Functional and non-functional requirements were documented based on these conversations. Constraints and upstream realities were confirmed through a conversation with the software engineer, leading to a plan for implementing a resilient, real-time data pipeline.

---

## ⚖️ Iron Triangle Consideration

Design decisions must consider the trade-offs in the **Iron Triangle** of software:

- **Fast**
- **Cheap**
- **Good**

Only two of the three can be fully optimized at once.

![Iron Triangle](./images/iron_triangle.png)

---

# ⚙️ Batch ETL Services on AWS

This section compares **AWS Glue** and **Amazon EMR**, two key services used for **batch data processing** and ETL workflows.

---

## 🧠 What is AWS Glue?

**AWS Glue** is a **fully managed, serverless ETL service** designed for structured and semi-structured data. It is ideal when you want to build pipelines **quickly** without managing infrastructure.

### 🔧 Key Features

- **Serverless**: No need to manage compute clusters.
- **Built-in Crawlers**: Automatically infer schemas and populate the Data Catalog.
- **Jobs in Python or Scala** using Apache Spark under the hood.
- **AWS Glue Studio**: No-code/low-code visual interface for ETL.
- **Supports S3, RDS, Redshift, DynamoDB, and JDBC sources**.
- **Max vCPUs per job**: ~100 vCPUs (approx.)
- **Max memory per job**: ~1600 GB (combined)
- **Max parallel job runs (per region)**: 10,000+
- **Best for**: Up to **1–10 TB** of data per job.

### 💡 When to Use AWS Glue

| Use Case                                     | Recommended? |
|---------------------------------------------|--------------|
| Schema inference & data cataloging          | ✅ Yes        |
| Serverless, fully managed ETL               | ✅ Yes        |
| 1–10 TB batch jobs                          | ✅ Yes        |
| No infra to manage                          | ✅ Yes        |
| Fine-tuned job performance                  | ❌ No         |

---

## 🏗️ What is Amazon EMR?

**Amazon EMR** (Elastic MapReduce) is a **managed big data platform** that allows you to **provision clusters** for frameworks like **Apache Spark, Hadoop, Hive, Presto, HBase**, etc.

### 🔧 Key Features

- **You manage the cluster** size and configuration.
- **Fine-grained control** over nodes, memory, CPU, autoscaling, and pricing.
- **Pay only for what you use**, but you must manage jobs and infrastructure yourself.
- **Use Spot, On-Demand, or Reserved EC2 instances.**
- **Integrates with S3, HDFS, Hive Metastore, etc.**
- **Max vCPUs per cluster**: Thousands (scalable by instance type)
- **Max memory per cluster**: Several TBs (based on EC2 instance types)
- **Best for**: **10 TB to Petabyte-scale** processing.

### 💡 When to Use Amazon EMR

| Use Case                                     | Recommended? |
|---------------------------------------------|--------------|
| Fine-tuned Spark/Hadoop workloads            | ✅ Yes        |
| Petabyte-scale ETL                          | ✅ Yes        |
| Control over compute and cost optimization  | ✅ Yes        |
| Need fully serverless experience            | ❌ No         |
| Small batch jobs with fast setup            | ❌ No         |

---

## 🔍 AWS Glue vs Amazon EMR — Comparison Table

| Feature                     | AWS Glue                              | Amazon EMR                          |
|----------------------------|----------------------------------------|-------------------------------------|
| **Type**                   | Serverless ETL                         | Managed Cluster-based ETL          |
| **Management**             | Fully Managed                          | You manage cluster lifecycle       |
| **Code Support**           | Python, Scala (via Spark)              | Spark, Hadoop, Hive, Presto, etc.  |
| **Infra Control**          | ❌ No                                   | ✅ Yes                              |
| **Startup Time**           | ⏱️ 2–5 minutes                         | ⚙️ 5–15 minutes                    |
| **Data Volume**            | ✅ 1–10 TB                              | ✅ 10 TB – Petabytes               |
| **Use with S3**            | ✅ Native integration                   | ✅ Native integration              |
| **Autoscaling**            | ✅ Handled by AWS                       | ✅ Custom with EMR Auto Scaling   |
| **Cost Visibility**        | Moderate                               | High control                       |
| **Cost Estimation**        | ~$0.44 per DPU-hour                    | Based on EC2, EBS, EMR usage       |

---

## 📌 Final Takeaways

- Use **AWS Glue** if:
  - You want **quick setup**, **no cluster management**, and **1–10 TB** of ETL per job.
  - You prefer a **code-free or low-code** ETL with crawlers and jobs.

- Use **Amazon EMR** if:
  - You need **full control** over compute/storage.
  - Your workloads exceed **10 TB**, or require **fine-tuning**, **custom clusters**, or **frameworks** like Hive or Presto.
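
The takeaways above can be condensed into a small decision helper. This is only a rule-of-thumb sketch: the 10 TB threshold comes from these notes, and the function and parameter names are hypothetical.

```python
def suggest_batch_service(data_tb, needs_cluster_tuning=False, needs_hive_or_presto=False):
    """Rule of thumb from the notes above; not an official AWS sizing guide."""
    if data_tb > 10 or needs_cluster_tuning or needs_hive_or_presto:
        return "Amazon EMR"
    return "AWS Glue"


print(suggest_batch_service(2))
print(suggest_batch_service(50))
print(suggest_batch_service(2, needs_cluster_tuning=True))
```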

---

## 🔗 Resources
- [Coursera](https://www.coursera.org/learn/intro-to-data-engineering/supplement/f0PVn/aws-services-to-meet-your-requirements)
- [AWS Glue Docs](https://docs.aws.amazon.com/glue/)
- [Amazon EMR Docs](https://docs.aws.amazon.com/emr/)

# 🧠 AWS Batch Storage & Warehousing Services

---

## 📦 Amazon S3 (Simple Storage Service)

### ✅ What is Amazon S3?

Amazon S3 is a **scalable object storage service** used to store and retrieve **any amount of data**, at **any time**, from anywhere on the web.

---

### 🔧 Key Features

- **Practically unlimited storage**
- Stores data as **objects** in **buckets**
- Supports a wide variety of file types (CSV, JSON, Parquet, images, videos, etc.)
- Low cost: **starts from $0.023 per GB/month**
- **Highly durable**: 99.999999999% (11 9’s) durability
- **Fine-grained access control** via IAM and bucket policies
- **Lifecycle policies** to transition data to colder storage (Glacier, Deep Archive)
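
A lifecycle policy is expressed as a configuration document. The sketch below builds one in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The prefix, day counts, rule ID, and bucket name are hypothetical examples, and the API call itself is left commented out because it needs AWS credentials and an existing bucket.

```python
def build_lifecycle_rule(prefix, glacier_after_days, deep_archive_after_days):
    """Build a lifecycle configuration dict for objects under the given prefix."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_rule("raw/", glacier_after_days=30, deep_archive_after_days=180)
print(config)

# To apply it (hypothetical bucket name; requires credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket", LifecycleConfiguration=config
# )
```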

---

### 📊 Use Case Fit

| Scenario                          | Suitability    |
|----------------------------------|----------------|
| Raw data lake                    | ✅ Excellent    |
| Storing unstructured data        | ✅ Excellent    |
| Long-term archival               | ✅ Excellent    |
| Querying via Athena              | ✅ Good         |
| Complex analytics (joins, OLAP) | ❌ Not suitable |

---

### 💬 Simple Definition

> Amazon S3 is like your data lake or cold storage — great for storing lots of raw or unstructured data cheaply and reliably.

---

### 🤖 ML & Analytics Perspective

- Ideal for storing **large volumes of unstructured or semi-structured data**
- Frequently used as the **source for training machine learning models** in batches
- Common pattern: **store training data in S3 → load into notebook or model pipeline**
- Works well with Glue, SageMaker, and EMR

---

## 🏢 Amazon Redshift

### ✅ What is Amazon Redshift?

Amazon Redshift is a **fully managed cloud data warehouse** designed for **analytical queries (OLAP)** on structured data.

---

### 🔧 Key Features

- Columnar storage format for faster analytics
- MPP (Massively Parallel Processing) architecture
- Integrated with S3, Glue, Kinesis, SageMaker
- **Fast performance**: optimized for complex SQL joins and aggregations
- **Scalable**: up to petabytes of data
- Cost: starts at **$0.25 per hour per node**, or ~$180/month for 1 DC2.large node

---

### 📊 Use Case Fit

| Scenario                          | Suitability    |
|----------------------------------|----------------|
| BI dashboards                    | ✅ Excellent    |
| Large-scale SQL analytics        | ✅ Excellent    |
| Complex joins, aggregations      | ✅ Excellent    |
| Real-time streaming (insert-heavy) | ❌ Not ideal  |
| Storing images, files, raw logs  | ❌ Not suitable |

---

### 💬 Simple Definition

> Amazon Redshift is like a high-performance, SQL-powered warehouse for crunching large volumes of clean, structured data quickly.

---

### 🤖 ML & Analytics Perspective

- **Ideal for feature engineering and batch model training**
- Can **query terabytes of structured data** in seconds to minutes, depending on cluster size and query complexity, to prepare ML datasets
- Often used after data is cleaned/transformed in Glue or stored in S3
- You can export features from Redshift directly into **SageMaker** or other ML platforms

---

## 🆚 S3 vs Redshift: When to Use What?

| Feature                  | Amazon S3                        | Amazon Redshift                    |
|--------------------------|----------------------------------|------------------------------------|
| Storage Type             | Object storage                   | Columnar, relational database      |
| Use Case                 | Raw/unstructured data lake       | Structured, analytical workloads   |
| Query Engine             | Athena, EMR, Glue                | Native SQL engine (Redshift SQL)   |
| Scalability              | Virtually unlimited              | Up to petabytes                    |
| Pricing (est.)           | ~$23/TB/month                    | ~$180/month/node (DC2.large)       |
| Durability               | 11 9s (99.999999999%)            | 99.9% (availability SLA, not a durability figure) |
| Typical Query Volume     | Moderate (via Athena)            | High-volume, analytical queries    |
| Setup Complexity         | Very low                         | Moderate (needs schema, tuning)    |
| Latency (Read/Query)     | ~100–300 ms (S3 GET latency)     | ~1–2 sec for analytical queries on large datasets |

> 📝 Note:
> - Redshift latency can drop to **~100–300 ms** for small queries and increase to **3–5+ sec** for complex aggregations.
> - S3 is not designed for direct querying, but with Athena, you can expect **~2–5 sec** per query on large datasets.
> - Athena is a serverless query service that lets you run SQL queries directly on data stored in Amazon S3, without needing to load that data into a warehouse like Redshift.

---

## 🧰 Common Patterns

- **Store raw data in S3**, clean/transform with Glue or EMR
- **Load clean data into Redshift** for advanced analytics and ML
- Use **Athena** to query S3 directly for ad-hoc SQL exploration
- Export features from Redshift into **ML pipelines** or BI dashboards

---

## 📎 Resources
- [Coursera](https://www.coursera.org/learn/intro-to-data-engineering/supplement/f0PVn/aws-services-to-meet-your-requirements)
- [Amazon S3 Docs](https://aws.amazon.com/s3/)
- [Amazon Redshift Docs](https://aws.amazon.com/redshift/)


# 🚀 AWS Streaming Services: Kinesis, Firehose & MSK

---

## 1️⃣ What is Amazon Kinesis?

**Amazon Kinesis** is a suite of services for real-time data streaming at scale. It allows you to ingest, process, and analyze data as it arrives.

---

### 🔧 Components of Kinesis

| Service               | Description                                                                 |
|-----------------------|-----------------------------------------------------------------------------|
| **Kinesis Data Streams** | Real-time stream processing service (low-level).                             |
| **Kinesis Data Firehose** | Fully managed service to load streaming data to destinations like S3, Redshift. |
| **Kinesis Data Analytics** | Run SQL queries directly on streaming data (the service has since been renamed Amazon Managed Service for Apache Flink). |

---

## 2️⃣ What is Amazon Kinesis Data Streams?

### ✅ Key Concepts

- **Shard**: A unit of capacity.  
  - 1 shard = **1 MB/sec input** and **2 MB/sec output**
  - Each shard supports up to **1,000 PUT records/sec**
- **Buffer**: Temporary storage of records before processing or delivery
- **Data Retention**: Time the stream stores data (default: 24 hours, extendable up to 365 days)
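
The per-shard limits above determine how many shards a provisioned stream needs: the count is driven by whichever limit is hit first. A minimal sizing sketch follows, with a hypothetical function name and example traffic numbers.

```python
import math


def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_sec / 1.0),   # 1 MB/sec input per shard
        math.ceil(records_per_sec / 1000),   # 1,000 PUT records/sec per shard
        math.ceil(read_mb_per_sec / 2.0),    # 2 MB/sec output per shard
    )


# Hypothetical workload: 3.5 MB/sec in, 5,200 records/sec, 7 MB/sec out.
shards = required_shards(write_mb_per_sec=3.5, records_per_sec=5200, read_mb_per_sec=7.0)
print(shards)
```

Here the record rate is the binding constraint (5,200 / 1,000 rounds up to 6), even though the byte throughput alone would need only 4 shards.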

---

### 🧠 How It Works

- Producers send records to a **Kinesis data stream**
- Consumers (like Lambda, EMR, Glue) read data from the stream
- Use cases: real-time dashboards, anomaly detection, log aggregation
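
On the producer side, each record needs a stream name, a payload in bytes, and a partition key. The sketch below builds the keyword arguments for boto3's Kinesis `put_record` call. The stream name, event fields, and helper name are hypothetical, and the call itself is commented out because it requires AWS credentials and an existing stream.

```python
import json


def build_put_record_kwargs(stream_name, event, partition_key_field="user_id"):
    """Serialize an event and choose its partition key."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        "PartitionKey": str(event[partition_key_field]),
    }


kwargs = build_put_record_kwargs(
    "clickstream-example", {"user_id": 42, "action": "add_to_cart"}
)
print(kwargs)

# To send it (requires credentials and an existing stream):
# import boto3
# boto3.client("kinesis").put_record(**kwargs)
```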

---

## 🔄 ETL with Kinesis

- To transform data read from a stream, **you need an ETL service** such as:
  - **AWS Glue**: For complex data transformation and cataloging
  - **AWS Lambda**: For lightweight real-time processing
  - **Amazon EMR**: For large-scale Spark/MapReduce jobs

---

## 3️⃣ What is Amazon Kinesis Data Firehose?

A **fully managed, no-code option** for streaming data delivery.

---

### 🔧 Key Features

- **No shard management** or provisioning
- **Automatic batching**, compression, and encryption
- **Delivers data** to:
  - Amazon S3
  - Amazon Redshift
  - Amazon OpenSearch
  - HTTP endpoints

---

### 🧠 How It Works

1. You send streaming data to Firehose.
2. It buffers and transforms data (if needed).
3. It then delivers the data to the destination, with no custom code required.

---

### 🔁 Firehose Buffer Size

- Buffer by **size** (1–128 MB) or **time** (60–900 seconds)
- Example: with a 5 MB size limit and a 60-second interval, Firehose delivers as soon as either threshold is reached, whichever comes first.
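
The "whichever comes first" rule can be illustrated with a toy buffer. This is a simplified simulation, not how Firehose is implemented: the class name is hypothetical, timestamps are passed in by the caller, and the time limit is only checked when a record arrives (the real service flushes on its own timer).

```python
class BufferSketch:
    """Flush when buffered bytes reach the size limit or the interval has elapsed."""

    def __init__(self, size_limit_bytes, interval_seconds):
        self.size_limit = size_limit_bytes
        self.interval = interval_seconds
        self.records = []
        self.buffered_bytes = 0
        self.window_start = None
        self.delivered = []  # list of (reason, batch) tuples

    def add(self, record, now):
        if self.window_start is None:
            self.window_start = now  # first record opens a new buffering window
        self.records.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.size_limit:
            self._flush("size")
        elif now - self.window_start >= self.interval:
            self._flush("time")

    def _flush(self, reason):
        self.delivered.append((reason, self.records))
        self.records, self.buffered_bytes, self.window_start = [], 0, None


buf = BufferSketch(size_limit_bytes=10, interval_seconds=60)
buf.add(b"aaaa", now=0)
buf.add(b"bbbbbbb", now=5)   # 11 bytes buffered: size limit reached first
buf.add(b"cc", now=10)
buf.add(b"dd", now=75)       # 65 seconds since the window opened: time limit reached
print([reason for reason, _ in buf.delivered])
```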

---

## 4️⃣ What is Amazon MSK (Managed Streaming for Apache Kafka)?

**MSK** is a **fully managed Apache Kafka** service.

---

### ✅ Why Use MSK?

- Use **Kafka APIs** if you're already familiar with Kafka
- Suitable for **complex event-driven architectures**
- Offers more **customization and flexibility** than Kinesis

---

### 🔧 MSK vs. Kinesis

| Feature             | Kinesis                            | MSK (Kafka)                          |
|---------------------|------------------------------------|--------------------------------------|
| API Compatibility   | AWS proprietary                    | Apache Kafka APIs                    |
| Setup               | Easy (fully managed by AWS)        | More setup required (but managed)    |
| Ecosystem           | AWS-native integrations            | Huge open-source ecosystem           |
| Cost                | Lower for small workloads          | More control, but potentially costlier|
| Use Case Examples   | Simple ingestion, log streaming    | Complex pipelines, microservices     |

---

## 🔀 When to Use What?

| Use Case                                      | Recommended Service        |
|-----------------------------------------------|-----------------------------|
| Real-time ingestion with AWS services         | Kinesis Data Streams        |
| Serverless streaming to destinations          | Kinesis Firehose            |
| Real-time SQL on streams                      | Kinesis Data Analytics      |
| Apache Kafka compatibility required           | Amazon MSK                  |
| Complex data transformation                   | Kinesis + Glue or EMR       |
| Minimal config + auto delivery                | Firehose                    |

---

## ⚙️ Architecture Example (Kinesis Firehose + Glue)

1. Stream raw logs from web app into **Kinesis Firehose**
2. Firehose buffers & sends to **S3**
3. **AWS Glue** picks up from S3 → transforms → stores in **Redshift**

---

## 📌 Notes

- All these services integrate well with **CloudWatch** for monitoring.
- You can trigger **Lambda** directly from Kinesis Streams for real-time reactions.
- For very low latency, **Kinesis Streams + Lambda** is a common pairing.
- For hands-off delivery, go with **Kinesis Firehose**.
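
When Lambda is triggered from a Kinesis stream, each record's payload arrives base64-encoded under `Records[].kinesis.data`. A minimal handler sketch follows, assuming the producer sent JSON; the sample event is hand-built for a local check and contains only the fields the handler reads.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records delivered to Lambda and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


# Hand-built sample event (hypothetical payload) for a local test run.
encoded = base64.b64encode(json.dumps({"product": "A1", "qty": 2}).encode("utf-8")).decode("ascii")
sample_event = {"Records": [{"kinesis": {"partitionKey": "42", "data": encoded}}]}
result = handler(sample_event, None)
print(result)
```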

---

## 📎 AWS Docs

- [Amazon Kinesis Overview](https://aws.amazon.com/kinesis/)
- [Amazon Kinesis Firehose](https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html)
- [Amazon MSK](https://aws.amazon.com/msk/)