## 📚 Table of Contents 
- [Data Ingestion](#data-ingestion)
- [Stakeholder Conversation Summary](#stakeholder-conversation-summary-marketing-analyst)
- [ETL vs ELT: Batch Ingestion Patterns](#etl-vs-elt-batch-ingestion-patterns)





# Data Ingestion 

# 🔗 Resource

To understand more about **Data Ingestion**, refer to this Coursera reading resource:

[Batch and Streaming Tools (Coursera)](https://www.coursera.org/learn/source-systems-data-ingestion-and-pipelines/supplement/YD08f/batch-and-streaming-tools)

---
Nearly all data originates as a **continuous stream of events** (e.g., button clicks, stock price changes, IoT sensor readings).  
To handle and process that data, we use ingestion techniques that fall along a **continuum** of frequency.

---

## 📈 Ingestion Frequencies

![Ingestion Frequencies](./images/ingestion_frequencies.png)

| Ingestion Type | Relative Frequency                      |
|----------------|-----------------------------------------|
| Batch          | Semi-frequent (e.g., hourly or daily)   |
| Micro-batch    | Frequent                                |
| Streaming      | Very frequent (event by event)          |

> The **choice of ingestion frequency** depends on:
> - The **source systems**
> - The **end use case**

---

# 🔌 Ways to Ingest Data

---

## 🗄️ From Databases

![Ingest from DB](./images/ways_to_ingest.png)

### 🔗 Using Connectors (JDBC/ODBC APIs)
- Pulls data using **standard drivers**.
- Ingests:
  - At regular intervals
  - After a threshold of new records

> JDBC (Java Database Connectivity) is the standard database API for Java applications; ODBC (Open Database Connectivity) is a language-independent counterpart. Both let applications query databases through standard drivers.
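
A minimal sketch of the threshold-based pull described above, using Python's built-in `sqlite3` driver as a stand-in for a JDBC/ODBC connection. The `orders` table, the `pull_new_rows` helper, and the ID-based watermark are assumptions made for illustration; in practice a scheduler would call a function like this at each interval against the real source database.

```python
import sqlite3

def pull_new_rows(conn, last_seen_id, min_new_rows=1):
    """Return rows newer than last_seen_id, or [] if fewer than min_new_rows are waiting."""
    (pending,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE id > ?", (last_seen_id,)
    ).fetchone()
    if pending < min_new_rows:
        return []  # threshold not reached yet; try again on the next run
    return conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()

# In-memory database standing in for the source system (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (7.25,)])

# First scheduled run: three new rows meet the threshold of three.
first_pull = pull_new_rows(source, last_seen_id=0, min_new_rows=3)
watermark = first_pull[-1][0] if first_pull else 0

# Second run: only one new row has arrived, so nothing is ingested yet.
source.execute("INSERT INTO orders (amount) VALUES (?)", (99.0,))
second_pull = pull_new_rows(source, last_seen_id=watermark, min_new_rows=3)
```

Tracking the highest ingested ID keeps each pull incremental, so rows are not re-read on every run.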

---

### 🔄 Using Ingestion Tools

- Example: **AWS Glue ETL**
- Automates the pull from databases
- Ingests data **on a regular basis**

---

## 📁 From Files

![Ingest via Files](./images/ingest_via_files.png)

### 🛠️ Manual File Download
- Receive file from external source
- Upload it manually to the system

### 🔐 Secure File Transfer (e.g., AWS Transfer Family)
- Protocols used:
  - **SFTP**: SSH File Transfer Protocol (often expanded as Secure File Transfer Protocol)
  - **SCP**: Secure Copy Protocol

---

## 📡 From Streaming Systems

![Streaming Ingestion](./images/ingest_via_streaming_systems.png)

- For **real-time or near-real-time** event ingestion
- Source: Event Producers like **IoT devices**, apps, etc.
- Sent to: Message Queues or Streaming Platforms (e.g., **Amazon Kinesis**, **Apache Kafka**)
- Consumed by: Downstream **event consumers**
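
The producer, queue, and consumer roles above can be sketched in a single process. This is only an illustration: Python's `queue.Queue` stands in for a platform such as Amazon Kinesis or Apache Kafka, and the event shape (`sensor_id`, `reading`) is made up.

```python
import queue
import threading

def producer(stream, n_events):
    """Emit events one at a time, then a sentinel to signal the end of the demo."""
    for i in range(n_events):
        stream.put({"sensor_id": "s1", "reading": i})
    stream.put(None)

def consumer(stream, sink):
    """Process each event as soon as it arrives, until the sentinel is seen."""
    while True:
        event = stream.get()
        if event is None:
            break
        sink.append(event["reading"])

stream = queue.Queue()   # in-process stand-in for a message queue or streaming platform
received = []

worker = threading.Thread(target=consumer, args=(stream, received))
worker.start()
producer(stream, 5)
worker.join()
```

The queue decouples the two sides: the producer does not wait for the consumer, and the consumer handles events in arrival order.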

---

# 🧠 Batching vs Streaming: Conceptual Continuum

Every event can be ingested either:
- **One-by-one** (→ **Streaming**)
- **Grouped together** (→ **Batch**)

### You can impose batch boundaries using:
- **Size** (e.g., 10 GB chunks)
- **Count** (e.g., every 1,000 events)
- **Time** (e.g., every 24 hours, every hour)

> 🌀 High-frequency batch ingestion eventually approaches real-time streaming.
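
A small sketch of how size, count, and time boundaries might be imposed on a stream of events. The `Batcher` class and its parameter names are hypothetical, not taken from any library. Setting `max_count=1` shows how a batch degenerates into event-by-event (streaming) delivery.

```python
import time

class Batcher:
    """Group events into batches, closing a batch when any configured limit is hit."""

    def __init__(self, max_count=None, max_bytes=None, max_age_s=None, clock=time.monotonic):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self._clock = clock
        self._events = []
        self._bytes = 0
        self._opened_at = None

    def add(self, event: bytes):
        """Add one event; return the finished batch if a boundary was reached, else None."""
        if not self._events:
            self._opened_at = self._clock()
        self._events.append(event)
        self._bytes += len(event)
        if self._boundary_reached():
            return self.flush()
        return None

    def _boundary_reached(self):
        if self.max_count is not None and len(self._events) >= self.max_count:
            return True
        if self.max_bytes is not None and self._bytes >= self.max_bytes:
            return True
        # Age is only checked when an event arrives; a real system would also use a timer.
        if self.max_age_s is not None and self._clock() - self._opened_at >= self.max_age_s:
            return True
        return False

    def flush(self):
        """Return whatever is buffered and start a new, empty batch."""
        batch, self._events, self._bytes = self._events, [], 0
        return batch

events = [f"event-{i}".encode() for i in range(7)]

# Count boundary: every 3 events form a batch; the 7th stays buffered until flushed.
by_count = Batcher(max_count=3)
count_batches = [b for b in map(by_count.add, events) if b is not None]
leftover = by_count.flush()

# A batch size of 1 behaves like streaming: each event is delivered on its own.
one_by_one = Batcher(max_count=1)
stream_like = [b for b in map(one_by_one.add, events) if b is not None]
```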

---

# ⚖️ Choosing the Right Ingestion Pattern

Your choice depends on:
- 🔹 What kind of **source system** you're working with (API, DB, Stream)
- 🔹 What **latency** the business case demands
- 🔹 What the **API or system constraints** are (rate limits, payload size)

---

# 🧪 Practical Use Cases Coming Up

This module covers **two hands-on case studies**:
- **Batch ingestion from an API**
- **Streaming ingestion from Amazon Kinesis**

You'll work with real-world tools like:
- **AWS Glue**
- **Streaming platforms**
- **Secure file transfers**
- **Custom connectors**

---

# Stakeholder Conversation Summary: Marketing Analyst

## 🎯 Stakeholder
**Colleen** — Marketing Analyst

## 👋 Data Engineer
**Joe** — New Data Engineer at the e-commerce company

---

## 🗣️ Conversation Overview

In this discussion, Colleen (Marketing Analyst) and Joe (Data Engineer) explore how the marketing team can gain insights into **external factors** that may influence **customer purchasing behavior**.

---

## 🎯 Business Goal
To analyze **external signals** — such as **music listening trends** — that could correlate with **online shopping behavior**, helping to uncover new **sales insights and patterns**.

---

## 💡 Key Idea from Marketing
- Customer **emotions or moods** (e.g., happy, sad, excited, relaxed) may influence shopping habits.
- Direct emotional data is unavailable, but **music listening patterns** may act as a **proxy** for mood.

---

## 📊 Proposed Data Source
- **Spotify Public API**:
  - Provides access to **trending artists** and **listening trends** over time.
  - Data available across **different regions**.
  
---

## 📥 Data Ingestion Needs
- Pull data from the **Spotify API** (public third-party REST API).
- Compare **regional music trends** with **product sales data**.
- Use this for **marketing analysis and insight generation**.

---

## 🧰 Next Steps for the Data Engineer
- Review the **Spotify API documentation**.
- Identify what **data can be accessed** (e.g., top artists by region, listening time series).
- Clarify what **specific data** the marketing team wants and **how to serve it** (e.g., dashboards, reports).

---

## 🧠 Note from the Course
While using music trends may not seem like a strong marketing strategy, it's common for stakeholders to request **unusual data sources**. The goal here is to learn **requirement gathering** and **API data ingestion** techniques — not to judge the validity of the idea.

---

## 📌 Key Concept Introduced
- The pipeline will require **batch data ingestion** from a **third-party REST API**.
- Next steps involve exploring **ETL vs ELT** strategies before implementing the ingestion.
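
As a rough sketch of what batch ingestion from a paginated REST API can look like, the loop below pulls pages until the source is exhausted and pauses between requests to respect rate limits. The `fetch_page` callable, the `items`/`total` response shape, and the fake catalog are assumptions for illustration only; they are not the Spotify API's actual interface, which should be taken from its documentation.

```python
import time

def ingest_all_pages(fetch_page, page_size=50, min_interval_s=0.0, sleep=time.sleep):
    """Collect every record from a paginated source, pausing between requests."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        records.extend(page["items"])
        offset += len(page["items"])
        if not page["items"] or offset >= page["total"]:
            return records
        sleep(min_interval_s)  # simple spacing between calls to stay under a rate limit

# Hypothetical stand-in for a real HTTP call (e.g., one made with an HTTP client).
FAKE_CATALOG = [{"artist": f"artist_{i}", "region": "US"} for i in range(120)]

def fake_fetch_page(offset, limit):
    return {"items": FAKE_CATALOG[offset:offset + limit], "total": len(FAKE_CATALOG)}

rows = ingest_all_pages(fake_fetch_page, page_size=50)
```

Passing the fetch function in as a parameter keeps the pagination logic testable without network access.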

---

# ETL vs ELT: Batch Ingestion Patterns

## 🔗 Resource

For an official summary of the differences between ETL and ELT, refer to this Coursera resource:  
[Summary of the Differences: ETL vs ELT](https://www.coursera.org/learn/source-systems-data-ingestion-and-pipelines/supplement/FN6ny/summary-of-the-differences-etl-vs-elt)

---

## 🧠 Goals of the Marketing Analyst

The marketing analyst wants to **analyze historical trends** by ingesting **external data** (e.g., from the Spotify API). A **batch** ingestion pipeline is the most suitable choice because:

- There is **no need for real-time analysis**
- API requests can only be made at a **limited frequency**

![Goals of the Marketing Analyst](./images/ma_goals.png)

---

## ⚙️ Batch Ingestion Patterns: ETL vs ELT

![ETL vs ELT](./images/etl-vs_elt.png)

### 🧪 ETL (Extract → Transform → Load)

- **Extract** raw data from source
- **Transform** it in a staging area
- **Load** the transformed data into a data warehouse

📉 *Potential information loss* during early transformation.
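
A minimal ETL sketch under simplified assumptions: the records are made up, the transform runs in plain Python as the "staging area", and an in-memory SQLite database stands in for the data warehouse. Only the aggregated result is loaded, so per-artist detail never reaches the target.

```python
import sqlite3
from collections import defaultdict

# Extract: made-up records standing in for data pulled from a source.
extracted = [
    {"region": "US", "artist": "A", "streams": "120"},
    {"region": "US", "artist": "B", "streams": "80"},
    {"region": "DE", "artist": "A", "streams": "50"},
]

# Transform: aggregate in a staging step, before anything reaches the warehouse.
totals = defaultdict(int)
for row in extracted:
    totals[row["region"]] += int(row["streams"])

# Load: only the transformed result is written to the target.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE region_streams (region TEXT PRIMARY KEY, total_streams INTEGER)"
)
warehouse.executemany("INSERT INTO region_streams VALUES (?, ?)", sorted(totals.items()))

loaded = warehouse.execute(
    "SELECT region, total_streams FROM region_streams ORDER BY region"
).fetchall()
```

If the analyst later asks for totals per artist, this pipeline would have to go back to the source, which is the information loss noted above.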

---

### 🧪 ELT (Extract → Load → Transform)

- **Extract** raw data
- **Load** raw data directly into the target destination
- **Transform** inside the target system (e.g., data warehouse)

📦 *All data is captured*, providing more flexibility.
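
A minimal ELT sketch under simplified assumptions: the records are made up and an in-memory SQLite database stands in for the data warehouse. Raw rows are loaded untouched, and the transformation runs afterwards as SQL inside the target.

```python
import sqlite3

# Extract: made-up raw records, kept exactly as received (streams still a string).
extracted = [("US", "A", "120"), ("US", "B", "80"), ("DE", "A", "50")]

# Load: write the raw data straight into the target.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_streams (region TEXT, artist TEXT, streams TEXT)")
warehouse.executemany("INSERT INTO raw_streams VALUES (?, ?, ?)", extracted)

# Transform: done later, inside the target, using its own SQL engine.
warehouse.execute(
    """
    CREATE TABLE region_streams AS
    SELECT region, SUM(CAST(streams AS INTEGER)) AS total_streams
    FROM raw_streams
    GROUP BY region
    """
)
by_region = warehouse.execute(
    "SELECT region, total_streams FROM region_streams ORDER BY region"
).fetchall()

# Because the raw table is still there, a new question needs no re-ingestion.
by_artist = warehouse.execute(
    "SELECT artist, SUM(CAST(streams AS INTEGER)) FROM raw_streams "
    "GROUP BY artist ORDER BY artist"
).fetchall()
```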

---

## 📈 Advantages of ELT

![Advantages of ELT](./images/adv_elt.png)

1. ✅ **Faster to implement**
2. ⚡ **Data available quickly to end users**
3. 🔄 **Transformations can still be done efficiently**
4. 🔧 **Flexible — transformations can be decided later**

---

## ⚠️ Downsides of ELT

![Disadvantages of ELT](./images/dis_elt.png)

- 💥 If not carefully planned, it becomes just an **EL pipeline**: data is extracted and loaded, but never transformed into something usable
- 🐊 May result in a **Data Swamp**: unorganized, unmanageable, and unusable data
- ❓ *You must ask*: "How will you use the data?"

---

## 👥 Use Case: Conversation with the Marketing Analyst

The marketing analyst wants to explore new patterns from external data (e.g., music listening trends) via exploratory analysis.

Since the transformations required aren't clearly defined upfront, **ELT is the better choice** here.

![ELT for Marketing Analyst](./images/elt_ma.png)

---

# 🧮 Detailed Comparison Table

| Feature | **ETL** | **ELT** |
|--------|--------|--------|
| **History** | Popular in the 1980s and 1990s, when storage was expensive and data volumes were smaller | Became popular with cheap cloud storage and the explosion of big data |
| **Transformation Timing** | Before loading into the warehouse (predefined schema needed) | After loading into the warehouse (can delay schema decisions) |
| **Processing Power Used** | External staging tools or ETL platforms | Modern data warehouse (e.g., BigQuery, Redshift, Snowflake) |
| **Flexibility** | Low — schema must be known early | High — raw data allows flexible querying and transformations |
| **Data Types Supported** | Mainly structured data | Structured, semi-structured (JSON), unstructured (images/text) |
| **Maintenance** | Costly — must re-ingest if transformation was wrong | Easier — raw data already loaded, can re-transform anytime |
| **Load Time** | Longer — needs staging and transformation first | Faster — directly load raw data |
| **Transformation Time** | Depends on tool and complexity | Faster — utilizes scalable DW compute |
| **Scalability** | Scalable but harder to manage multiple sources/targets | Highly scalable with cloud warehouses |
| **Cost** | Depends on ETL tools and compute | Often lower with modern cloud infrastructure, but still depends on data volume |