# 📚 Index

- [Enterprise Architecture](#enterprise-architecture)

- [Conway’s Law](#conways-law)

- [Principles of Good Data Architecture](#principles-of-good-data-architecture)


# Enterprise Architecture

Enterprise Architecture is the design of systems to support **change** in an enterprise. It relies on **flexible and reversible decisions**, reached through careful evaluation of trade-offs.

It consists of **four core components**:

![Enterprise Architecture Diagram](./images/enterprice_architecture.png)

---

## 1. 🧩 Business Architecture  
**Purpose:** Defines the **product or service strategy** and **business model** of the enterprise.

**Example:**  
An e-commerce company wants:
- 1-day delivery
- Personalized product recommendations
- Expansion to new regions

**Data Engineer's Role:**  
You need to understand these goals and build pipelines to track:
- Order delivery times  
- Customer behavior  
- Sales by region

---

## 2. 🏗 Application Architecture  
**Purpose:** Describes the **structure and interaction** of key applications that serve business needs.

**Example:**  
Flipkart might use:
- Login Service  
- Product Catalog  
- Recommendation Engine  
- Order Management  

**Data Engineer's Role:**  
You extract data via:
- APIs  
- Event streams (Kafka)  
- Database snapshots  

Then push it into the warehouse/lake for analytics.

---

## 3. 🖥 Technical Architecture  
**Purpose:** Defines the **software and hardware** infrastructure (cloud, compute, network, tools).

**Example:**  
- AWS EC2 for compute  
- S3 for storage  
- Glue & Spark for processing  
- Airflow for orchestration  
- Terraform for IaC  

**Data Engineer's Role:**  
You build and deploy using:
- Scalable, secure, cost-effective cloud resources  
- Tools like Terraform to manage infrastructure as code  

---

## 4. 🗃 Data Architecture  
**Purpose:** Supports the **evolving data needs** of the organization.

**Example Workflow:**
- Extract from MySQL (RDS)  
- Transform via Glue into a star schema  
- Store in S3 as Parquet  
- Query via Athena  
- Visualize via Jupyter or QuickSight  

**Data Engineer's Role:**  
Own the end-to-end data pipeline: from ingestion → transformation → serving.
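
The "transform into a star schema" step in the workflow above can be sketched in plain Python. This is a minimal illustration, not the Glue job itself: the row fields (`order_id`, `customer`, `region`, `amount`) and the helper names are hypothetical, and an in-memory list stands in for the MySQL source.

```python
from collections import defaultdict

# Hypothetical flat order rows, as they might arrive from an OLTP source.
raw_orders = [
    {"order_id": 1, "customer": "asha", "region": "south", "amount": 250.0},
    {"order_id": 2, "customer": "ravi", "region": "north", "amount": 100.0},
    {"order_id": 3, "customer": "asha", "region": "south", "amount": 75.5},
]


def to_star_schema(rows):
    """Split flat rows into a customer dimension and an order fact table."""
    dim_customer = {}  # natural key -> dimension row holding a surrogate key
    fact_orders = []
    for row in rows:
        natural_key = (row["customer"], row["region"])
        if natural_key not in dim_customer:
            dim_customer[natural_key] = {
                "customer_key": len(dim_customer) + 1,  # surrogate key
                "customer": row["customer"],
                "region": row["region"],
            }
        fact_orders.append({
            "order_id": row["order_id"],
            "customer_key": dim_customer[natural_key]["customer_key"],
            "amount": row["amount"],
        })
    return list(dim_customer.values()), fact_orders


def sales_by_region(dim_customer, fact_orders):
    """Join facts to the dimension and aggregate, as a serving query would."""
    region_of = {d["customer_key"]: d["region"] for d in dim_customer}
    totals = defaultdict(float)
    for fact in fact_orders:
        totals[region_of[fact["customer_key"]]] += fact["amount"]
    return dict(totals)


dim_customer, fact_orders = to_star_schema(raw_orders)
print(sales_by_region(dim_customer, fact_orders))
```

The fact table keeps only measures and keys, while descriptive attributes live once in the dimension, which is what makes the "sales by region" question a simple join plus aggregate.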

---

## 🔁 Why It Matters to You as a Data Engineer

- Your pipelines must **support changing business needs**
- You must make **reversible choices** when possible (two-way doors)
- You contribute not just to the tech, but to **how the organization runs**

---

## Conway’s Law

> **"Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure."**  
> — Melvin Conway

### 🔍 What It Means

Conway's Law suggests that the way teams **communicate and organize internally** will directly influence the **structure of the systems** they build.

For example, if your company has separate departments that rarely collaborate (like Sales, Marketing, Finance, and Operations), each team may create their own isolated data systems — leading to **siloed architectures**.

📉 **Siloed Teams = Siloed Systems**

![Siloed Systems](./images/conway_law_1.png)

But if the same departments work in a **collaborative, cross-functional** way — communicating frequently — the systems they build will be **more integrated and unified**.

📈 **Collaborative Teams = Unified Architecture**

![Unified System](./images/conway_law_2.png)

### 🧑‍💻 Why It Matters to You as a Data Engineer

Before designing a data architecture, **understand how your company communicates**:
- Are teams siloed or cross-functional?
- Do they share common goals or work in isolation?

Even if your architecture looks perfect on paper, if it **clashes with your org's communication structure**, it’s likely to fail in practice.

> ✅ Good data architecture reflects how people in the company work together.

## Principles of Good Data Architecture

![Principles of Good Data Architecture](./images/principles_data_archi.png)

Data architecture is not just about tools — it's about making smart, flexible, and impactful decisions that evolve with the organization.

### 🔹 Theme 1: How Architecture Affects Others
- **Choose common components wisely**  
  Use tools that benefit multiple teams (e.g., Git, S3, Spark).
- **Architecture is leadership**  
  Architects lead by enabling others, mentoring, and setting standards.

### 🔹 Theme 2: Architecture Is an Ongoing Process
- **Always be architecting**  
  Keep improving as needs change.
- **Build loosely coupled systems**  
  Make systems modular for flexibility.
- **Make reversible decisions**  
  Favor decisions you can back out of (e.g., changing storage classes).
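
One way to picture loose coupling and reversible decisions together is to make pipeline code depend on a small interface rather than on a specific storage product. The sketch below is an assumption-laden illustration: `BlobStore`, `InMemoryStore`, `LocalDiskStore`, and `publish_report` are hypothetical names, and a local directory stands in for an object store.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """The only contract the pipeline depends on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Backend useful for tests."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalDiskStore:
    """Backend that writes files under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self._root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def publish_report(store: BlobStore, name: str, rows: list[str]) -> None:
    """Pipeline step: knows nothing about which backend it writes to."""
    store.put(name, "\n".join(rows).encode("utf-8"))


# Swapping the backend does not touch the pipeline code.
for store in (InMemoryStore(), LocalDiskStore(Path(tempfile.mkdtemp()))):
    publish_report(store, "daily_report.csv", ["region,sales", "south,325.5"])
    print(type(store).__name__, store.get("daily_report.csv").decode("utf-8"))
```

Because `publish_report` only sees the interface, replacing the storage backend later is a two-way door: the change is local and can be undone.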

### 🔹 Theme 3: Unspoken But Understood Priorities
- **Plan for failure**  
  Assume systems will break — design for recovery.
- **Architect for scalability**  
  Plan ahead for data/user growth.
- **Prioritize security**  
  Build in data protection from day one.
- **Embrace FinOps**  
  Design systems with cost efficiency in mind.

---

## 🔗 Common Components

![Common Components](./images/common_components.png)

Common components are tools and platforms shared across teams to increase efficiency and reduce duplication.

### 🔧 Examples of Common Components
- **Object Storage** – like Amazon S3, shared by all teams  
- **Version Control Systems** – like Git for code collaboration  
- **Monitoring & Observability** – systems to track health/performance  
- **Processing Engines** – e.g., Spark, for distributed data processing

Choosing common components well promotes collaboration, avoids silos, and reduces maintenance overhead.

# Plan for Failure

One of the key responsibilities of a data engineer is to **anticipate system failure** and design resilient, secure, and scalable architectures that minimize its impact. This principle is broken down into the following components:

---

## 📌 Availability

> **Definition**: The percentage of time an IT service or component is expected to be in an operable state.

Availability ensures users can access the system when needed. For example:

- **Amazon S3 One Zone-IA**: designed for 99.5% availability (≈ 44 hours of downtime per year)
- **Amazon S3 Standard**: designed for 99.99% availability (≈ 53 minutes of downtime per year)

![Availability](./images/plf_availability.png)
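
The downtime figures follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be unavailable at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for name, pct in [("S3 One Zone-IA", 99.5), ("S3 Standard", 99.99)]:
    hours = annual_downtime_hours(pct)
    print(f"{name}: {pct}% -> {hours:.2f} h/year ({hours * 60:.0f} min/year)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why higher availability tiers tend to cost more.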

---

## 📌 Reliability

> **Definition**: The probability of a service or component performing its intended function within a specific time interval.

Reliable systems operate predictably and meet defined performance standards, ensuring a smooth user experience.

![Reliability](./images/plf_reliability.png)

---

## 📌 Durability

> **Definition**: The ability of a system to **withstand data loss** from hardware failures, software bugs, or natural disasters.

Durability ensures data is not lost. For instance, **Amazon S3** offers 99.999999999% durability — also called **11 nines**.

![Durability](./images/plf_durability.png)
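
To get a feel for what 11 nines means, the sketch below treats the durability figure as an independent per-object annual survival probability. That interpretation is a simplifying assumption for illustration, and the function name is hypothetical.

```python
def expected_annual_losses(object_count: int, nines: int) -> float:
    """Expected number of objects lost per year at the given durability."""
    annual_loss_probability = 10.0 ** -nines  # 11 nines -> 1e-11 per object
    return object_count * annual_loss_probability


losses = expected_annual_losses(10_000_000, 11)
print(f"expected objects lost per year: {losses:.6f}")
print(f"roughly one object every {1 / losses:,.0f} years")
```

Under this reading, storing ten million objects corresponds to losing about one object every ten thousand years on average.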

---

## 🔐 Prioritize Security

Security is critical to prevent breaches and ensure that failures do not lead to data loss or corruption.

**Best Practices:**

- 🛡️ Build a **Culture of Security**
- 🔑 Apply **Principle of Least Privilege**
- 🚫 Adopt **Zero-Trust Security** (no implicit trust; every request must be authenticated and authorized)

![Security](./images/prioritize_security.png)

---

## 🛠️ Recovery Objectives

Understanding **how quickly** and **how much data** can be recovered helps mitigate risks during failures.

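- **RTO (Recovery Time Objective)**: The maximum acceptable time a system can be down before service is restored.
- **RPO (Recovery Point Objective)**: The maximum acceptable data loss, measured as the time between the last recoverable state and the failure.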

These guide architectural decisions such as storage class or backup frequency.

![RTO and RPO](./images/rto_rpo.png)
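
As a rough sketch of how these objectives guide backup frequency, the check below compares a recovery plan against an RTO and RPO. It assumes the worst case for data loss is one full backup interval (a failure just before the next backup); `RecoveryPlan` and `meets_objectives` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    backup_interval: timedelta   # time between consecutive backups
    restore_duration: timedelta  # estimated time to restore service


def meets_objectives(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> dict:
    """Compare a plan against the recovery objectives."""
    # Worst case: the failure happens just before the next backup,
    # so up to one full interval of data is lost.
    worst_case_data_loss = plan.backup_interval
    return {
        "rto_met": plan.restore_duration <= rto,
        "rpo_met": worst_case_data_loss <= rpo,
    }


nightly = RecoveryPlan(backup_interval=timedelta(hours=24),
                       restore_duration=timedelta(hours=2))
print(meets_objectives(nightly, rto=timedelta(hours=4), rpo=timedelta(hours=1)))
```

Here the nightly plan restores quickly enough but can lose far more than an hour of data, so meeting the RPO would require more frequent backups or continuous replication.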

---

## ⚙️ Architect for Scalability & 💰 Embrace FinOps

Anticipating load spikes or failures also means building systems that can scale cost-effectively.

### Risks:
- 🔺 Unforeseen high cloud costs
- 🔻 Lost revenue due to system crashes during peak demand

**Recommendations:**
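- Choose between **on-demand and spot instances** deliberately: spot capacity is cheaper but can be reclaimed by the provider, so it suits interruptible workloads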
- Optimize for **cost and performance**
- Build elastic systems that **scale up or down as needed**

![Scalability and FinOps](./images/archi_for_scalability.png)
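
The two ideas can be sketched together: a simple target-tracking rule that scales worker count with utilization, and a cost comparison for a mixed on-demand/spot fleet. The function names, bounds, and hourly prices are all hypothetical values chosen for illustration, not real cloud pricing.

```python
def desired_workers(current_workers: int, observed_pct: int, target_pct: int,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Scale so that utilization moves back toward the target percentage."""
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-(current_workers * observed_pct) // target_pct)
    return max(min_workers, min(max_workers, needed))


def hourly_cost(on_demand_workers: int, spot_workers: int,
                on_demand_price: float, spot_price: float) -> float:
    """Hourly cost of a mixed fleet."""
    return on_demand_workers * on_demand_price + spot_workers * spot_price


# Load spike: 4 workers at 90% utilization with a 60% target -> scale up.
peak = desired_workers(4, observed_pct=90, target_pct=60)
# Quiet period: 4 workers at 30% utilization -> scale down.
quiet = desired_workers(4, observed_pct=30, target_pct=60)
print(f"peak: {peak} workers, quiet: {quiet} workers")

# Hypothetical prices: 0.10/hour on-demand, 0.03/hour spot.
all_on_demand = hourly_cost(peak, 0, 0.10, 0.03)
mixed = hourly_cost(2, peak - 2, 0.10, 0.03)
print(f"all on-demand: {all_on_demand:.2f}/h, mixed fleet: {mixed:.2f}/h")
```

Keeping a small on-demand baseline and adding spot workers for bursts is one common compromise, since spot workers may be interrupted.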

---

## ✅ Summary

A good data engineer doesn’t just build systems that work under ideal conditions — they build systems that:

- Remain **available** and **reliable** under stress  
- **Protect data** from loss and corruption  
- Are **secure** by design  
- Are **cost-effective** and **scalable**  
- Are built with **failure recovery** in mind

## 🧱 Batch Data Architectures: ETL vs ELT

Batch data architectures process data in **fixed intervals (batches)** rather than in real time. Two popular patterns are:

---

### 🔄 ETL – Extract, Transform, Load

ETL is the **traditional batch pipeline pattern**, used when real-time analysis isn't required.

**Steps:**

1. **Extract** data from source systems (databases, files, APIs).
2. **Transform** data in a staging area (cleaning, aggregating, standardizing).
3. **Load** transformed data into a data warehouse.

![ETL](./images/etl.png)
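
The three steps can be sketched with the standard library, using an in-memory SQLite database as a stand-in for the data warehouse. The source rows, table name, and function names are hypothetical; the point is that cleaning and validation happen in Python before anything is loaded.

```python
import sqlite3

# Hypothetical source records with inconsistent formatting and one bad row.
source_rows = [
    {"order_id": "1", "region": " South ", "amount": "250.00"},
    {"order_id": "2", "region": "north", "amount": "100.00"},
    {"order_id": "3", "region": "SOUTH", "amount": "not-a-number"},
]


def extract():
    """Extract: read rows from the source system (here, a list)."""
    return list(source_rows)


def transform(rows):
    """Transform: clean and validate outside the warehouse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject invalid rows before they reach the warehouse
        clean.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write only the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
etl_result = warehouse.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(etl_result)
```

The warehouse never sees the malformed row, which is the data-quality control ETL offers at the cost of deciding the schema up front.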

**✅ When to use ETL:**
- Your transformations are **complex** or require **external processing tools**.
- You want **control over data quality** before loading.
- You’re working with **smaller volumes** of data.
- You’re using legacy or on-premise systems.

---

### 🔁 ELT – Extract, Load, Transform

ELT is a **modern pattern** made possible by the power of cloud data warehouses (e.g., BigQuery, Snowflake).

**Steps:**

1. **Extract** data from sources.
2. **Load** raw data directly into the data warehouse.
3. **Transform** inside the warehouse using SQL or built-in tools.

![ELT](./images/elt.png)
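
For contrast, here is the same kind of pipeline in ELT order, again with in-memory SQLite standing in for a cloud warehouse. Raw strings are loaded untouched, and the cleaning happens later in SQL inside the warehouse. Table and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as extracted (all text).
raw_rows = [
    ("1", " South ", "250.00"),
    ("2", "north", "100.00"),
    ("3", "SOUTH", "75.50"),
]

warehouse = sqlite3.connect(":memory:")

# Load: raw data lands in the warehouse without any transformation.
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: model the data inside the warehouse using SQL.
warehouse.execute("""
    CREATE TABLE sales_by_region AS
    SELECT LOWER(TRIM(region))       AS region,
           SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY LOWER(TRIM(region))
""")

elt_result = warehouse.execute(
    "SELECT region, total_amount FROM sales_by_region ORDER BY region"
).fetchall()
print(elt_result)
```

Because `raw_orders` is preserved, a different model can be built later from the same raw data without re-extracting from the source.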

**✅ When to use ELT:**
- Your data warehouse supports **high-performance computation**.
- You’re dealing with **large-scale data** (big data).
- You want to **defer transformation** to be more flexible for analysis.
- You prefer a **schema-on-read** approach (raw data first, model later).

---

### ⚖️ Trade-offs Between ETL and ELT

| Criteria         | ETL                                | ELT                                |
|------------------|-------------------------------------|-------------------------------------|
| **Flexibility**  | Less flexible (schema-on-write)     | More flexible (schema-on-read)      |
| **Speed**        | Slower (transformation before load) | Faster ingestion                    |
| **Complexity**   | Complex transformations outside DB  | Simplified using SQL in warehouse   |
| **Use Case**     | Legacy systems, strict governance   | Modern cloud data platforms         |
| **Cost**         | May require external tools          | Optimized using warehouse compute   |

---

**In summary**:  
- Use **ETL** when you need control over transformation before data reaches your warehouse.  
- Use **ELT** to **leverage warehouse power** and increase flexibility in modeling and analytics.

# ⛓️ Lambda, ⚡ Kappa, and 🔁 Unified Architectures in Data Engineering

## ⚙️ Streaming Frameworks

Before diving into the architectures, it's important to know the tools that enable real-time data processing:

![Streaming Frameworks](./images/streaming_frameworks.png)

- **Apache Kafka**: A distributed event streaming platform that stores and transports events reliably at scale.
- **Apache Storm**: A real-time computation system for processing unbounded streams of data.
- **Apache Samza**: Works with Kafka to process event streams in near real time.

---

## ⛓️ Lambda Architecture

![Lambda Architecture](./images/lambda_archi.png)

### 💡 What It Is:
Lambda uses **two parallel pipelines** — one for batch, one for streaming — to handle both historical and real-time data.

### 🔄 How It Works:
- **Batch Layer**:
  - Processes historical data in large chunks.
  - Uses a data warehouse (e.g., BigQuery, Redshift) for storage and querying.
- **Speed (Streaming) Layer**:
  - Handles real-time data from sources like Kafka.
  - Stores output in a NoSQL database (e.g., Cassandra).
- **Serving Layer**:
  - Combines both outputs to deliver a **complete view** for dashboards or ML models.
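
A toy sketch of the three layers, assuming page-view events and simple counts. The function names and sample events are hypothetical, and plain lists stand in for the warehouse and the stream.

```python
from collections import Counter

# Hypothetical events: older ones already landed in batch storage,
# newer ones have only been seen by the stream so far.
historical_events = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent_events = [{"page": "home"}, {"page": "checkout"}]


def batch_view(events):
    """Batch layer: recompute counts over all historical data."""
    return Counter(e["page"] for e in events)


def speed_view(events):
    """Speed layer: count only events that arrived since the last batch run."""
    counts = Counter()
    for e in events:  # in practice this would update incrementally per event
        counts[e["page"]] += 1
    return counts


def serving_layer(batch, speed):
    """Serving layer: merge both views into one complete answer."""
    return dict(batch + speed)


complete_view = serving_layer(batch_view(historical_events), speed_view(recent_events))
print(complete_view)
```

Note that the counting logic exists twice, once per layer. That duplication is the main maintenance cost of this architecture.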

### ✅ Pros:
- Supports both fresh (stream) and comprehensive (batch) data.
- Can serve accurate analytics by combining periodically recomputed batch results with low-latency streaming results.

### ❌ Cons:
- Requires maintaining **two separate pipelines**.
- Duplicate logic and maintenance effort.
- Possible inconsistency between batch and stream outputs.

---

## ⚡ Kappa Architecture

![Kappa Architecture](./images/kappa_archi.png)

### 💡 What It Is:
Kappa eliminates the batch pipeline and uses only **streaming**. It treats all data as events and enables reprocessing from historical streams.

### 🔄 How It Works:
- Data from **source systems** flows into a **stream processing engine** (e.g., Kafka Streams, Flink).
- The processed data feeds into a **single serving layer** for querying and consumption.
- **Historical replay**: Because Kafka can retain logs, older data can be reprocessed if logic changes.
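
Historical replay can be sketched with a list standing in for a retained Kafka topic. The event fields and function names below are hypothetical. When the processing logic changes, the same log is simply read again from the first offset.

```python
# Hypothetical retained event log (an append-only list standing in for a topic).
event_log = [
    {"offset": 0, "user": "asha", "amount": 250.0},
    {"offset": 1, "user": "ravi", "amount": 100.0},
    {"offset": 2, "user": "asha", "amount": 75.5},
]


def process(log, logic, from_offset=0):
    """Run one stream-processing function over the log from a given offset."""
    state = {}
    for event in log[from_offset:]:
        logic(state, event)
    return state


def total_per_user(state, event):
    """Original logic: running spend per user."""
    state[event["user"]] = state.get(event["user"], 0.0) + event["amount"]


def order_count_per_user(state, event):
    """New logic introduced later: number of orders per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1


totals = process(event_log, total_per_user)
# The logic changed, so replay the whole log from offset 0 with the new function.
counts = process(event_log, order_count_per_user, from_offset=0)
print(totals, counts)
```

There is only one processing path, so historical and live data always go through the same code. The trade-off is that the log must be retained long enough to cover any replay.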

### ✅ Pros:
- **Simpler** than Lambda — only one codebase to maintain.
- Supports real-time and reprocessing use cases.

### ❌ Cons:
- Less suited to very large historical aggregations, since replaying a long event log can be slow and costly.
- Needs long-term stream retention if historical replays are required.

---

## 🔁 Unified Batch & Streaming Architecture

![Unified Architecture](./images/unified_batch_streaming.png)

### 💡 What It Is:
Unified architecture views **batch as a special case of streaming** and uses a **single codebase** for both.

### 🔄 How It Works:
- Treats all data as events:
  - Real-time data = **unbounded event streams**
  - Batch = **bounded slices** of event streams (e.g., hourly/daily windows)
- Frameworks such as **Apache Beam** and **Apache Flink**, and managed services such as **Google Cloud Dataflow**, can apply the same transformations to both batch and streaming data.
- Output goes to warehouses, dashboards, ML pipelines, etc.
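
The "batch is a bounded stream" idea can be illustrated without any framework. The sketch below is not Beam or Flink code: it is a plain Python generator, with hypothetical names, that sums values per tumbling window and assumes events arrive in timestamp order. The same function runs over a finite list and over an endless generator.

```python
import itertools
from typing import Iterable, Iterator


def windowed_sums(events: Iterable[tuple[int, float]],
                  window_seconds: int) -> Iterator[tuple[int, float]]:
    """Sum values per tumbling window of (timestamp, value) events."""
    current_window = None
    total = 0.0
    for timestamp, value in events:
        window = timestamp - timestamp % window_seconds  # window start time
        if current_window is not None and window != current_window:
            yield current_window, total  # a later window started: emit the old one
            total = 0.0
        current_window = window
        total += value
    if current_window is not None:
        yield current_window, total  # bounded input ended: flush the last window


# Batch: a bounded slice of events.
batch = [(0, 1.0), (30, 2.0), (60, 5.0), (90, 1.0)]
batch_results = list(windowed_sums(batch, 60))


def endless_source() -> Iterator[tuple[int, float]]:
    """Streaming: an unbounded source emitting one event every 30 seconds."""
    for timestamp in itertools.count(0, 30):
        yield timestamp, 1.0


# Take only the first three completed windows from the unbounded stream.
stream_results = list(itertools.islice(windowed_sums(endless_source(), 60), 3))
print(batch_results, stream_results)
```

The only difference between the two runs is whether the input ever ends, which is the property unified engines build on.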

### ✅ Pros:
- One codebase = less duplication, easier maintenance.
- Flexible: the same pipeline can run in batch or streaming mode.
- Scales with modern distributed processing engines.

### ❌ Cons:
- Steeper learning curve.
- Needs modern infrastructure and tools to implement correctly.

---

## 🧠 Summary

| Feature            | Lambda                    | Kappa                    | Unified Batch & Streaming    |
|--------------------|----------------------------|--------------------------|-------------------------------|
| Pipelines          | Batch + Stream             | Stream only              | Single (unified)              |
| Codebase           | Two                        | One                      | One                           |
| Complexity         | High                       | Medium                   | Medium                        |
| Historical Replay  | Yes (via batch)            | Yes (via stream replay)  | Yes (rerun on bounded data)   |
| Tools              | Kafka, Hadoop, Cassandra   | Kafka, Flink             | Beam, Flink, Dataflow         |

---

> ⚠️ Tip: In modern data engineering, the **Unified** approach is often preferred. Kappa is a good real-time-first alternative when batch is not a priority. Lambda is older but still appears in enterprises transitioning to real-time processing.