# 📦 Source Systems in Data Engineering

In data engineering, the **first step** in the lifecycle is obtaining data from various **source systems**. These are the systems where raw data originates and flows into your pipeline for processing and analysis.

---

## 🔍 What Are Source Systems?

Source systems are where your data comes from. As a **data engineer**, you don't typically own these systems — they are created and maintained by other teams like software developers, third-party vendors, or partner platforms.

Your job is to build **pipelines** that consume data from these sources and deliver it to **downstream systems** like dashboards, machine learning models, or data warehouses.

---

## 📊 Common Types of Source Systems

| Type             | Description                                                                 | Real-World Example                                 |
|------------------|-----------------------------------------------------------------------------|----------------------------------------------------|
| **Databases**     | Structured data organized into tables or documents                         | Sales transactions from an e-commerce app          |
| **Files**         | Data stored as files, from text (CSV, JSON) to media (MP3s, images)         | Product catalog stored in `products.csv`           |
| **APIs**          | On-demand data accessed over the web                                       | Twitter API providing trending hashtags            |
| **IoT Devices**   | Real-time data streamed from connected devices                              | GPS trackers on delivery vehicles                  |
| **Data Sharing Platforms** | External datasets provided by other organizations                | AWS Data Exchange sharing market research files    |

---

## 🧠 Why Understanding Source Systems Is Important

- Source systems are **not in your control**.
- They can **fail**, change format/schema, or update without notice.
- If you rely on unstable source systems without planning, your downstream pipelines may **break**, sometimes silently.

> Example: A software team renames or deletes columns in their database without telling you. Your pipeline crashes because it expects columns that no longer exist.
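
One way to catch a change like this early is to check incoming records against the columns the pipeline expects before processing them. The sketch below is a minimal illustration only; the column names and the `validate_columns` helper are hypothetical and not taken from any particular library.

```python
# Hypothetical set of columns this pipeline expects from the source table.
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}


def validate_columns(record: dict, expected: set = EXPECTED_COLUMNS) -> None:
    """Raise a clear error if the record is missing any expected column."""
    missing = expected - set(record)
    if missing:
        raise ValueError(f"Source schema changed; missing columns: {sorted(missing)}")


# A record that still matches the expected schema passes silently.
validate_columns({"order_id": 1, "customer_id": 7, "amount": 19.99})

# A record from a source where `amount` was renamed to `total` fails loudly.
try:
    validate_columns({"order_id": 2, "customer_id": 8, "total": 5.00})
except ValueError as err:
    schema_error = str(err)
```

Failing with an explicit message at the start of the pipeline is usually easier to debug than a crash several steps later.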

---

## 🤝 Best Practices

- **Collaborate** with the owners of source systems
- Understand how the data is **generated and updated**
- Know what can **change** — and when
- Design pipelines to be **resilient** to schema or format changes

---

## 📌 Analogy

> Imagine you're a chef (data engineer). Your ingredients (data) come from various suppliers (source systems). One day, a supplier changes packaging or skips delivery without notice — your kitchen operations (pipelines) are disrupted unless you're prepared and have good communication.

---

## 📷 Diagram: Source Systems → Downstream Systems

![Source Systems Flow](./image/source_systems.png)

In the diagram above:
- The left side shows **data sources** like databases, files, APIs, IoT, and data sharing platforms.
- These sources **deliver data** through pipelines.
- The right side shows **downstream systems** that consume the processed data.

---

## ✅ Summary

- **Source systems** are external systems where your pipeline starts.
- They include databases, APIs, files, IoT devices, and more.
- As a data engineer, your success depends on **understanding, monitoring, and adapting** to these systems.

---

# 🔄 Data Ingestion in Data Engineering

## 1. Source Systems

In the first stage of the data engineering lifecycle, data originates from **source systems**. These are systems that generate or hold the raw data we want to work with.

### Common Source Systems:

- **Databases** (e.g., sales or customer databases)
- **Files** (CSV, JSON, audio, video)
- **APIs** (e.g., Twitter API, product data API)
- **IoT Devices** (e.g., GPS trackers, sensors)
- **Data Sharing Platforms** (e.g., AWS Data Exchange, 3rd-party datasets)

These systems may be maintained by:
- Other internal teams (e.g., backend engineers)
- External vendors
- Partner organizations

As a data engineer, you don’t control these systems—but your pipelines depend on their structure and consistency.

> 🔄 *Analogy:* Think of source systems as different taps (faucets) from which water (data) flows into a treatment plant (your data pipeline). You don’t control how the taps are built, but you must design your plant to handle the water reliably.

### 📌 Image: Source Systems Diagram

![Source Systems](./image/source_systems.png)

---

## 2. Frequency of Ingestion

Once the data source is identified, you need to decide how **frequently** data should be ingested from it:

- **Batch Ingestion**: Collect and move large chunks of data periodically (e.g., every hour/day).
- **Streaming Ingestion**: Capture and process events/data in near real-time.

> 💡 *Example:*  
> - Batch: Moving website logs daily into a warehouse for weekly analysis.  
> - Stream: Capturing user clicks in real-time to recommend products instantly.

> 📦 *Analogy:*  
> - **Batch**: Like collecting mail once a day from a mailbox.  
> - **Streaming**: Like receiving WhatsApp messages instantly as they come in.

### 📌 Image: Frequency of Ingestion

![Frequency of Ingestion](./image/ingestion_frequency.png)

---

## 3. Batch Ingestion

**Batch ingestion** means pulling data in chunks at scheduled times or after a set data size.

- Often used in analytics and model training
- Simple and resource-efficient
- Great when real-time updates are not critical

> 🕒 Example: Pulling transaction records from a POS (Point of Sale) system every night at 2 AM for analysis.
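
A nightly batch job like this often pulls only the rows created since the previous run. The sketch below illustrates the idea with Python's built-in `sqlite3` module; the `transactions` table, its columns, and the `extract_batch` helper are assumptions made for the example, not a real POS schema.

```python
import sqlite3

# In-memory database standing in for the POS source system (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE transactions (id INTEGER, amount REAL, created_at TEXT)")
source.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 12.50, "2024-01-01T09:15:00"),
        (2, 40.00, "2024-01-01T18:30:00"),
        (3, 7.25, "2024-01-02T10:05:00"),
    ],
)


def extract_batch(conn: sqlite3.Connection, since: str, until: str) -> list:
    """Pull only the rows created in the half-open window [since, until)."""
    query = (
        "SELECT id, amount, created_at FROM transactions "
        "WHERE created_at >= ? AND created_at < ? ORDER BY id"
    )
    return conn.execute(query, (since, until)).fetchall()


# The run early on 2 January picks up everything created on 1 January.
batch = extract_batch(source, "2024-01-01T00:00:00", "2024-01-02T00:00:00")
```

Using a half-open time window means consecutive runs neither skip nor double-count a row that lands exactly on the boundary.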

### 📌 Image: Batch Ingestion

![Batch Ingestion](./image/batch_ingestion.png)

---

## 4. Streaming and Batch Together

Most real-world systems use a combination of batch and streaming ingestion.

- **Batch Use Case**: Train ML models on historical data
- **Streaming Use Case**: Detect fraud or anomalies in real time

You can stream data continuously, store it, and periodically run batch processes on the stored data.
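
As a rough sketch of that pattern, the code below checks each event as it arrives, stores it, and later aggregates everything stored. The in-memory list, the fraud threshold, and the function names are all hypothetical stand-ins for real storage and real rules.

```python
from collections import defaultdict

stored_events = []             # stand-in for durable storage
alerts = []

FRAUD_THRESHOLD = 1_000.0      # hypothetical rule, for illustration only


def handle_event(event: dict) -> None:
    """Streaming path: check each event immediately, then store it."""
    if event["amount"] > FRAUD_THRESHOLD:
        alerts.append(f"suspicious payment {event['id']}")
    stored_events.append(event)


def daily_totals(events: list) -> dict:
    """Batch path: aggregate everything stored so far, e.g. once per day."""
    totals = defaultdict(float)
    for event in events:
        totals[event["day"]] += event["amount"]
    return dict(totals)


for e in [
    {"id": 1, "day": "2024-01-01", "amount": 25.0},
    {"id": 2, "day": "2024-01-01", "amount": 5000.0},
    {"id": 3, "day": "2024-01-02", "amount": 80.0},
]:
    handle_event(e)

totals = daily_totals(stored_events)
```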

### 📌 Image: Streaming and Batch Components

![Streaming and Batch Components](./image/streaming_and_batch_ingestion.png)

---

## 5. Streaming Ingestion Internals

Streaming ingestion uses tools like:
- **Apache Kafka**
- **Amazon Kinesis**
- **Google Cloud Pub/Sub**

These tools ingest continuous data and forward it to processing layers with minimal delay (seconds or milliseconds).

> ⏱ *Example:* Streaming GPS signals from a fleet of delivery trucks to track their location live.
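
The producer/consumer idea behind these tools can be sketched with Python's standard `queue` and `threading` modules. This is not the Kafka, Kinesis, or Pub/Sub API; the in-process queue only stands in for a stream, and the truck IDs and coordinates are made up for the example.

```python
import queue
import threading

topic = queue.Queue()      # stand-in for a stream or topic
latest_position = {}       # truck_id -> most recent (lat, lon)
STOP = None                # sentinel that ends the stream


def producer() -> None:
    """Simulates trucks publishing GPS readings."""
    readings = [
        ("truck-1", 52.52, 13.40),
        ("truck-2", 48.85, 2.35),
        ("truck-1", 52.53, 13.41),
    ]
    for reading in readings:
        topic.put(reading)
    topic.put(STOP)


def consumer() -> None:
    """Processes each reading as soon as it is available."""
    while True:
        message = topic.get()
        if message is STOP:
            break
        truck_id, lat, lon = message
        latest_position[truck_id] = (lat, lon)


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer and consumer never call each other directly; the queue decouples them, which is the same role a streaming platform plays between source systems and processing layers.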

### 📌 Image: Streaming Ingestion Architecture

![Streaming Ingestion](./image/streaming_ingestion.png)

---

## ✅ Summary

- Source systems are where data originates.
- Batch is great for periodic, reliable ingestion.
- Streaming is useful when near real-time decisions are needed.
- Most pipelines use a hybrid of both.

# 📦 Understanding Storage in Data Engineering

Data is constantly being created, moved, and stored — whether on your laptop, phone, or in massive cloud systems. As a data engineer, your ability to manage this data depends on how well you understand different **layers of storage**.

---

## 🔧 1. Raw Ingredients of Storage

These are the **fundamental hardware and processes** that make all storage possible.

- **Physical components**:
  - 💽 Hard Disks (HDD) – cheap and large, but slow.
  - ⚡ Solid State Drives (SSD) – faster, more costly.
  - 🧠 RAM – very fast but volatile and expensive.

- **Software-level processes**:
  - Networking
  - CPU operations
  - Serialization
  - Compression
  - Caching
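
Serialization and compression are easy to see in a few lines of Python. The sketch below uses only the standard `json` and `gzip` modules; the sensor records are invented sample data.

```python
import gzip
import json

records = [{"sensor_id": i, "reading": 20.5} for i in range(1000)]

# Serialization: turn in-memory objects into bytes that can be stored or sent.
serialized = json.dumps(records).encode("utf-8")

# Compression: shrink those bytes to save disk space and network bandwidth.
compressed = gzip.compress(serialized)

# Reading the data back reverses both steps.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
```

Comparing `len(serialized)` with `len(compressed)` shows how much repetitive data like this shrinks, at the cost of extra CPU work on every read and write.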

📸 *Raw ingredients of storage:*

![Raw Ingredients](./image/storage.png)

---

## 🗃️ 2. Storage Systems

Built on top of raw ingredients, these systems **organize, store, and manage access** to data.

- **Database Management Systems** – For structured data (e.g., PostgreSQL)
- **Object Storage** – For blobs/files (e.g., Amazon S3)
- **Open Table Formats** – For managing large analytical tables on top of file or object storage (e.g., Apache Iceberg, Apache Hudi)
- **Cache/Memory Systems** – e.g., Redis
- **Streaming Storage** – e.g., Kafka for real-time data streams

📸 *Common storage systems:*

![Storage Systems](./image/storage_systems.png)

---

## 🧱 3. Storage Abstractions

These are **combinations of storage systems** that serve higher-level business needs.

- **Data Warehouse** – Optimized for fast analytical queries on structured data (e.g., Snowflake, BigQuery)
- **Data Lake** – Stores all types of raw data (e.g., on Amazon S3)
- **Data Lakehouse** – Hybrid of the above two (e.g., Databricks Delta Lake)

📸 *Types of storage abstractions:*

![Storage Abstractions](./image/storage_abstractions.png)

---

## 🪜 4. Storage Hierarchy Overview

Storage can be visualized as a 3-layer hierarchy:

1. **Raw Ingredients** – HDDs, SSDs, RAM, Networking
2. **Storage Systems** – Databases, object stores, caches
3. **Storage Abstractions** – Warehouses, Lakes, Lakehouses

As a data engineer, you often work with **abstractions**, but understanding the **lower levels** makes your system faster, cheaper, and more scalable.

📸 *Full storage hierarchy:*

![Storage Hierarchy](./image/storage_hierarchy.png)

---

## 🧠 Summary

- You interact with storage systems constantly, even without realizing it.
- Storage involves hardware, software, and smart architecture.
- **Efficiency and cost** are determined by how well you choose your storage strategy.
- Always understand where your data is going and how it's stored.

# 💾 In-Depth Look: How HDD, SSD, and RAM Actually Work

Understanding the **physical mechanisms** behind how data is stored and accessed helps data engineers design systems that are optimized for performance, reliability, and cost.

---

## 💽 Hard Disk Drives (HDDs)

### 🔧 What Happens Inside?

An HDD consists of:
- **Spinning platters** coated with magnetic material.
- A **read/write head** mounted on an actuator arm.
- A **motor** that spins the platters at 5400–7200 RPM (or higher).
- **Firmware** that controls operations.

### 🧲 How Is Data Stored?

- Data is stored magnetically in **tiny regions** on the platter called **magnetic domains**.
- These regions can be **magnetized in one of two directions**:
  - One direction = binary `1`
  - Opposite direction = binary `0`
- The platter is divided into:
  - **Tracks** (concentric circles)
  - **Sectors** (segments of tracks)
  - **Cylinders** (aligned tracks across platters)

### ✍️ Writing Data (Engraving Analogy):

- The **write head** generates a magnetic field.
- It **flips the magnetic direction** of a region to represent 1s and 0s.
- This is similar to an **engraving tool carving marks** on a rotating disc — the tool (head) needs to be **positioned precisely** to write data.

### 🔍 Reading Data:

- The **read head** senses the magnetic polarity of each domain as the platter spins.
- The magnetic change is translated into a stream of binary data.

### 🧠 Summary:
- ❌ Slower due to mechanical movement.
- ✅ Great for large capacity, cheap storage.
- ⚠️ Fragile — can be damaged by shocks.
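
Part of the mechanical delay can be estimated with simple arithmetic: on average, the platter has to turn half a revolution before the requested sector reaches the head. The helper below is an illustrative calculation only and ignores seek time (moving the arm) and transfer time.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average wait for the target sector: half of one full rotation."""
    ms_per_rotation = 60_000 / rpm   # 60,000 ms in a minute
    return ms_per_rotation / 2


latency_5400 = avg_rotational_latency_ms(5400)   # about 5.56 ms
latency_7200 = avg_rotational_latency_ms(7200)   # about 4.17 ms
```

A few milliseconds per random access is why HDDs suit large sequential reads far better than many small scattered ones.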

---

## ⚡ Solid State Drives (SSDs)

### 🔧 What Happens Inside?

An SSD contains:
- **NAND Flash memory chips** (non-volatile)
- **Controller** chip to manage read/write operations
- No moving parts

### 🧠 How Data Is Stored (Electric Charge):

- Data is stored in **floating-gate transistors**, which **trap electrons**.
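- The amount of trapped charge encodes the bit value. In single-bit NAND cells the usual convention is that an erased cell (no trapped charge) reads as binary `1` and a programmed cell (trapped electrons) reads as binary `0`.
- These transistors are organized into:
  - **Cells** (SLC, MLC, TLC, or QLC, storing 1, 2, 3, or 4 bits per cell)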
  - **Pages** (group of cells)
  - **Blocks** (group of pages)

### ✍️ Writing Data (Charge Manipulation):

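- To write (program), a **voltage is applied** to "trap" electrons in a floating gate.
- To erase, a **voltage of the opposite polarity removes the trapped electrons**.
- Writing is **slower than reading**, and flash cannot be overwritten in place: data is written one page at a time, but a whole **block must be erased before its pages can be rewritten**.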
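
The erase-before-rewrite rule can be illustrated with a toy model. The `FlashBlock` class below is hypothetical and heavily simplified; real SSD controllers typically write the new version to a free page elsewhere and erase old blocks later in the background.

```python
class FlashBlock:
    """Toy model of one NAND block: pages are programmed once, then erased together."""

    def __init__(self, pages: int = 4) -> None:
        self.pages = [None] * pages   # None means "erased"
        self.erase_count = 0

    def program(self, page: int, data: bytes) -> None:
        if self.pages[page] is not None:
            raise RuntimeError("page already programmed; erase the block first")
        self.pages[page] = data

    def erase(self) -> None:
        """Erasing works on the whole block and consumes one wear cycle."""
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


block = FlashBlock()
block.program(0, b"v1")
try:
    block.program(0, b"v2")   # in-place overwrite is rejected
    overwrite_allowed = True
except RuntimeError:
    overwrite_allowed = False

block.erase()                 # wipes every page in the block
block.program(0, b"v2")
```

The `erase_count` attribute hints at why write-heavy workloads wear flash out: every rewrite eventually costs an erase cycle.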

### 🔍 Reading Data:

- A small voltage is applied.
- The presence or absence of current flow tells whether the bit is 0 or 1.

### 📦 Summary:
- ✅ Fast access, no moving parts.
- ⚠️ Limited write cycles — cells wear out over time.
- 💸 More expensive than HDDs.

---

## 🧬 Random Access Memory (RAM)

### 🔧 What Happens Inside?

RAM (here, DRAM, the type used for main memory) consists of:
- Billions of **capacitor-transistor pairs**
- Each **pair stores one bit** of data
- Volatile — **requires constant power** to retain data

### ⚡ How Data Is Stored (Capacitor Charging):

- A **charged capacitor** = binary `1`
- A **discharged capacitor** = binary `0`
- Transistors act like gates that allow read/write access to the capacitor

### 🔍 Reading Data:

- The system checks whether a capacitor holds charge.
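- This check **discharges the capacitor**, so the chip rewrites the value right after each read.
- Capacitors also **leak charge** on their own, so every cell must be **refreshed** periodically (typically about every 64 ms), even if nothing reads it.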

### ✍️ Writing Data:

- A transistor opens a path to the capacitor.
- A voltage is applied to **store charge** (1) or **drain it** (0).

### 🧠 Summary:
- 🚀 Extremely fast (nanosecond-scale latency)
- ❌ Volatile (data is lost when power goes off)
- ✅ Perfect for temporary processing (e.g., active variables in programs)

---

## 📊 Final Comparison (Physically)

| Feature          | HDD                            | SSD                                 | RAM                                 |
|------------------|---------------------------------|--------------------------------------|--------------------------------------|
| Storage Method   | Magnetic domains on platters   | Floating-gate transistors            | Charge in capacitors                 |
| Moving Parts     | Yes                             | No                                   | No                                   |
| Read/Write Speed | Slow (ms)                       | Fast (μs)                            | Very Fast (ns)                       |
| Volatile?        | No                              | No                                   | Yes                                  |
| Use Case         | Archive, backups                | OS, active pipelines                 | In-memory compute, temp storage      |
| Failure Risk     | Higher (mechanical)             | Lower (but finite write endurance)   | Data lost on power loss              |

---

## 🧠 Analogy Recap

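- **HDD** = Like a record player: an arm moves across a spinning disc, except the marks are magnetic and can be rewritten.
- **SSD** = Like a chalkboard split into sections: you can write in any clean spot, but must wipe a whole section before reusing it.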
- **RAM** = Like a whiteboard used for calculations — fast, but wiped clean when power is off.

---

Understanding these differences allows you to pick the right type of storage for your data engineering architecture — whether you're dealing with **hot data**, **cold data**, or **high-speed temporary computation**.

## 🔄 Data Transformation in the Data Engineering Lifecycle

The **transformation stage** is where a data engineer starts to deliver **real business value**. While ingesting and storing raw data is important, it doesn't directly help downstream users like analysts or data scientists.

Transformation is the stage where **raw data is turned into something useful**.

### 👥 Who Benefits from Transformation?

- **Business Analysts**: They might need quick access to clean, structured data like `customer_id`, `product_name`, `quantity`, and `time_of_sale` to generate reports.
- **Data Scientists / ML Engineers**: They rely on you to prepare features and cleaned datasets for model training.

Transformation includes **3 major components**:
- Queries
- Modeling
- Transformation logic

---

### 🧠 Queries

A **query** is simply a request to read records from a database or other storage system. SQL is the most commonly used query language.

> Poorly written queries can slow down performance, overload databases, or even crash your infrastructure (e.g., row explosion from bad joins).
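
Row explosion is easy to reproduce on a tiny dataset. The sketch below uses Python's built-in `sqlite3` module; the `orders` and `payments` tables are hypothetical and exist only to show the effect.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders   (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE payments (payment_id INTEGER, customer_id INTEGER);
    INSERT INTO orders   VALUES (1, 42), (2, 42), (3, 42);
    INSERT INTO payments VALUES (10, 42), (11, 42), (12, 42);
    """
)

# Joining on customer_id alone pairs every order with every payment
# of that customer: 3 orders x 3 payments = 9 rows instead of 3.
exploded = db.execute(
    "SELECT COUNT(*) FROM orders o JOIN payments p ON o.customer_id = p.customer_id"
).fetchone()[0]
```

With millions of rows per customer key, the same mistake multiplies into result sets large enough to exhaust memory or disk.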

#### 📘 SQL Commands in Transformation

- **Data Cleaning**: `DROP`, `TRUNCATE`, `TRIM`, `REPLACE`, `SELECT DISTINCT`
- **Data Joining**: `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, `FULL JOIN`, `UNION`
- **Data Aggregating**: `SUM`, `AVG`, `COUNT`, `MAX`, `MIN`, `GROUP BY`
- **Data Filtering**: `WHERE`, `AND`, `OR`, `IS NULL`, `IS NOT NULL`, `IN`, `LIKE`
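
Several of these commands usually appear together in one query. The sketch below runs such a query through Python's built-in `sqlite3` module; the `sales` table and its messy values are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE sales (product TEXT, quantity INTEGER);
    INSERT INTO sales VALUES
        ('  laptop ', 2), ('laptop', 1), ('phone', 5), (NULL, 3);
    """
)

# TRIM cleans stray whitespace, WHERE ... IS NOT NULL filters bad rows,
# and SUM with GROUP BY aggregates the quantity per product.
rows = db.execute(
    """
    SELECT TRIM(product) AS name, SUM(quantity) AS total
    FROM sales
    WHERE product IS NOT NULL
    GROUP BY TRIM(product)
    ORDER BY name
    """
).fetchall()
```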

![Query Commands](./image/sql_commands.png)

---

### 📐 Data Modeling

Data modeling involves choosing the **right structure** to represent data for business needs.

- If data comes from **normalized relational databases** (separate tables for orders, products, customers), you may need to **denormalize** it for faster analytics.
- Example: A business analyst shouldn't need to join five tables to get product sales data.
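
A minimal sketch of denormalization, assuming three hypothetical normalized tables (`customers`, `products`, `orders`): the joins are done once by the pipeline and the result is stored as one wide table. It uses Python's built-in `sqlite3` module purely for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE products  (product_id INTEGER, product_name TEXT, price REAL);
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER,
                            product_id INTEGER, quantity INTEGER);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO products  VALUES (10, 'Keyboard', 50.0);
    INSERT INTO orders    VALUES (100, 1, 10, 2);

    -- One wide table so analysts can query sales without writing joins.
    CREATE TABLE sales_flat AS
    SELECT o.order_id, c.name AS customer_name, p.product_name,
           o.quantity, o.quantity * p.price AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id;
    """
)

flat_rows = db.execute("SELECT * FROM sales_flat").fetchall()
```

The trade-off is duplicated data and a table that must be rebuilt or updated when the sources change, in exchange for simpler and faster analytical queries.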

Good models reflect:
- Business logic
- Terminology (e.g., how different departments define “customer”)
- Reporting or ML requirements

You'll learn more about **normalization** and **data modeling** later in the specialization.

---

### 🔧 Transformation Logic

Transformation happens **across multiple stages** of the pipeline:

- At the **source system**: timestamps or metadata added
- During **ingestion**: data type mapping, standardization
- In **streaming pipelines**: records enriched or calculated
- Before **machine learning**: features engineered
- Before **reporting**: aggregation, schema reshaping
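
As one example from this list, data type mapping and standardization during ingestion might look like the sketch below. The raw field names, the day/month/year date format, and the choice to treat timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def standardize(raw: dict) -> dict:
    """Map string fields from a raw record to typed, consistently named fields."""
    return {
        "customer_id": int(raw["CustomerID"]),
        "amount": float(raw["Amount"]),
        # Parse the assumed day/month/year format and label it as UTC.
        "sold_at": datetime.strptime(raw["SaleDate"], "%d/%m/%Y").replace(
            tzinfo=timezone.utc
        ),
    }


clean = standardize({"CustomerID": "42", "Amount": "19.90", "SaleDate": "31/01/2024"})
```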

---

### 📊 Examples of Transformation Use Cases

#### 🧑‍💻 Business Analyst

Goal: Generate daily sales reports

![Transformation for Analyst](./image/transformation.png)

---

#### 👩‍🔬 Data Scientist

Goal: Use transformed data for predictive analytics

![Transformation for DS](./image/transformation_ds.png)

# Serving Data in the Data Engineering Lifecycle

Once you've ingested, transformed, and stored your data, you're ready for the final stage of the data engineering lifecycle: **serving**. This is when your work directly creates business value by enabling stakeholders to consume and act on the data.

Serving isn't a one-size-fits-all process—it depends on the use case. Let’s break it down.

---

## 📊 Analytics

Analytics is about identifying patterns and insights from data. There are 3 main types:

### 1. Business Intelligence (BI)

- **Example**: The marketing team wants to see **weekly signup trends** from different cities.
- **You as a Data Engineer**:
  - Ingest user signup logs from the website.
  - Transform the data (extract city names, dates, counts).
  - Store in a clean reporting table.
  - Serve it via a BI dashboard (e.g., Tableau or Looker).
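
The transform step in this example could be sketched as a count per ISO week and city. The signup records below are made-up sample data, and a real pipeline would more likely do this in SQL or a dataframe library; plain Python is used here only to show the logic.

```python
from collections import Counter
from datetime import date

# Hypothetical signup log: (signup date, city).
signups = [
    (date(2024, 1, 1), "Berlin"),
    (date(2024, 1, 3), "Berlin"),
    (date(2024, 1, 4), "Paris"),
    (date(2024, 1, 8), "Berlin"),
]

weekly_counts = Counter()
for day, city in signups:
    iso_year, iso_week, _ = day.isocalendar()
    weekly_counts[(iso_year, iso_week, city)] += 1
```

Each key of `weekly_counts` corresponds to one row of the reporting table that the dashboard reads.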

### 2. Operational Analytics

- **Example**: An e-commerce website wants to **track orders per minute** to detect site crashes or slowdowns.
- **You as a Data Engineer**:
  - Build a real-time streaming pipeline using tools like Apache Kafka + Spark.
  - Transform the incoming order events.
  - Push the metrics to a live dashboard.
  - Set up alerts for low or zero activity.
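
The alerting logic in the last step can be sketched as follows, assuming order timestamps are already available as Python `datetime` objects. The `minutes_without_orders` helper is hypothetical; in production this check would typically live in the streaming job or the monitoring tool.

```python
from collections import Counter
from datetime import datetime, timedelta


def minutes_without_orders(order_times: list, start: datetime, end: datetime) -> list:
    """Return every minute in [start, end) that saw zero orders."""
    per_minute = Counter(t.replace(second=0, microsecond=0) for t in order_times)
    quiet = []
    minute = start
    while minute < end:
        if per_minute[minute] == 0:
            quiet.append(minute)
        minute += timedelta(minutes=1)
    return quiet


orders = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 2, 10),
]
quiet_minutes = minutes_without_orders(
    orders, datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 3)
)
```

Here only 12:01 has no orders, so it is the one minute that would trigger an alert.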

### 3. Embedded Analytics

- **Example**: A food delivery app shows customers their **monthly spend** and **top restaurants**.
- **You as a Data Engineer**:
  - Join order history, prices, and restaurant data.
  - Aggregate spend per month.
  - Expose the results to the app team for embedding, either through an API (real-time) or as a scheduled dataset.

![Analytics](./image/analytics.png)

---

## 🤖 Machine Learning

Machine learning requires serving data for model training, inference, and tracking.

### Example: Product Recommendation System

- **Goal**: Recommend products to users based on their past purchases.
- **You as a Data Engineer**:
  - Ingest user purchase logs.
  - Extract features like categories purchased, price range, time of day.
  - Store in a **feature store** (a clean structured table).
  - Serve this data for:
    - **Training** the ML model.
    - **Real-time inference** when a user visits the site.
    - **Tracking** when the model was trained and with which data (lineage).
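
The feature extraction step might look like the sketch below. The purchase records, the feature names, and the choice of 18:00 as the start of "evening" are assumptions for the example, not a prescribed feature set.

```python
from collections import Counter

# Hypothetical purchase log for a single user.
purchases = [
    {"category": "books", "price": 12.0, "hour": 9},
    {"category": "books", "price": 18.0, "hour": 21},
    {"category": "games", "price": 60.0, "hour": 22},
]


def build_features(rows: list) -> dict:
    """Turn raw purchase rows into one feature row for the user."""
    prices = [r["price"] for r in rows]
    categories = Counter(r["category"] for r in rows)
    return {
        "top_category": categories.most_common(1)[0][0],
        "avg_price": sum(prices) / len(prices),
        "max_price": max(prices),
        # Share of purchases made at 18:00 or later.
        "evening_share": sum(r["hour"] >= 18 for r in rows) / len(rows),
    }


features = build_features(purchases)
```

Storing one such row per user is what lets both training and real-time inference read the same feature definitions.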

![Machine Learning](./image/ml.png)

---

## 🔁 Reverse ETL

Reverse ETL = Sending cleaned and enhanced data **back to source tools** like CRMs or ad platforms.

### Example: Lead Scoring in CRM

- **Goal**: Prioritize customers who are most likely to buy.
- **You as a Data Engineer**:
  - Ingest CRM data (names, interactions).
  - Transform into features (e.g., # of visits, email opens).
  - Data scientist trains a **lead score model**.
  - Serve the lead scores **back into the CRM** so the sales team sees them next to each client’s profile.

### Another Simple Example: Email Targeting

- Your marketing team wants to send emails only to users **who haven’t logged in for 30 days**.
- You write a job that finds those users from your warehouse.
- Then push their emails **back to the email platform** (like Mailchimp or HubSpot).
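
The selection step of that job can be sketched in a few lines. The user rows and the `inactive_emails` helper are hypothetical, and the upload to the email platform is left out because it depends on that platform's own API.

```python
from datetime import date, timedelta

# Hypothetical warehouse rows: (email, last login date).
users = [
    ("ana@example.com", date(2024, 1, 2)),
    ("bo@example.com", date(2024, 3, 1)),
    ("cy@example.com", date(2023, 12, 15)),
]


def inactive_emails(rows: list, today: date, days: int = 30) -> list:
    """Emails of users whose last login is at least `days` days before today."""
    cutoff = today - timedelta(days=days)
    return [email for email, last_login in rows if last_login <= cutoff]


# The list a reverse ETL job would then push to the email platform.
to_sync = inactive_emails(users, today=date(2024, 3, 10))
```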

![Reverse ETL](./image/reverse_etl.png)

---

This final stage—**serving**—is where data becomes **useful** and **visible** to the business. Whether it's through dashboards, apps, machine learning models, or external systems, your job is to **deliver clean, usable, timely data** to wherever it's needed.
