# Table of Contents 

- [Cron Jobs (Before Orchestration)](#cron-jobs-before-orchestration)
- [Evolution of Orchestration Tools](#evolution-of-orchestration-tools)
- [Orchestration Basics](#orchestration-basics)
- [Airflow Core Components](#airflow-core-components)
- [Airflow UI](#video-recommendation-learn-the-airflow-ui)

# Cron Jobs (Before Orchestration)

![Cron Logo](./images/cron.png)

## What is Cron?
**Cron** is a Unix/Linux utility (from the 1970s) that automatically runs commands or scripts on a schedule you define.

---

## How a Cron Job Works
A cron job line has **five timing fields** followed by the command:

    MINUTE(0–59) HOUR(0–23) DAY-OF-MONTH(1–31) MONTH(1–12) WEEKDAY(0–6, Sunday=0)  command

You can use `*` (asterisk) to mean “any value”.

**Examples**
- Run at midnight on Jan 1 every year:

      0 0 1 1 * echo "Happy New Year"

- Run every night at midnight:

      0 0 * * * python ingest_from_rest_api.py

![Cron Fields](./images/cron_work.png)
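
To make the five fields concrete, here is a minimal Python sketch that checks whether a given moment matches a cron expression. The function name `cron_matches` is hypothetical, and the sketch only understands `*` and plain numbers; real cron also accepts ranges, lists, and step values, and treats the two day fields as either/or when both are restricted.

```python
from datetime import datetime


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` matches a cron expression made of `*` and plain numbers."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected exactly five timing fields")
    # cron counts weekdays from Sunday = 0, while datetime.weekday() counts from Monday = 0.
    actual = (
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        (moment.weekday() + 1) % 7,
    )
    # A field matches when it is the wildcard or equals the actual value.
    return all(field == "*" or int(field) == value for field, value in zip(fields, actual))


# The "Happy New Year" schedule matches midnight on January 1.
print(cron_matches("0 0 1 1 *", datetime(2025, 1, 1, 0, 0)))   # True
# The nightly schedule does not match one minute past midnight.
print(cron_matches("0 0 * * *", datetime(2025, 3, 15, 0, 1)))  # False
```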

---

## Before Orchestration: Pure Scheduling with Cron
Teams used to chain data pipeline steps by scheduling **multiple cron jobs** a little apart in time so they’d (hopefully) run in order:

- 12:00 AM → ingest API  
- 01:00 AM → transform  
- 02:00 AM → combine with DB  
- 03:00 AM → load to warehouse

This is a **pure scheduling approach**—no dependency awareness, just timed starts.

---

## Problems with Pure Cron Scheduling
- ❌ No dependency checks (a 1 AM job runs even if the midnight job failed or ran long)  
- ❌ Minimal monitoring/alerting; failures often discovered late  
- ❌ Debugging & observability are DIY (logs, alerts, retries)  
- ❌ Fragile when task durations vary

![Pure Scheduling Drawbacks](./images/pure_scheduiling.png)
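
The fragility is easy to reproduce on paper. The sketch below, which uses hypothetical names (`TimedJob`, `find_overlaps`) and made-up durations, models the hourly schedule from the previous section and reports which jobs cron would start before the preceding step had finished.

```python
from dataclasses import dataclass


@dataclass
class TimedJob:
    name: str
    start_minute: int      # minutes after midnight when cron launches the job
    duration_minutes: int  # how long the job actually takes on this night


def find_overlaps(jobs):
    """Return the names of jobs that start before the previous job has finished."""
    overlaps = []
    for previous, current in zip(jobs, jobs[1:]):
        previous_end = previous.start_minute + previous.duration_minutes
        if current.start_minute < previous_end:
            overlaps.append(current.name)
    return overlaps


# Tonight the ingest step runs long (75 minutes instead of under an hour).
nightly = [
    TimedJob("ingest", 0, 75),
    TimedJob("transform", 60, 30),
    TimedJob("combine", 120, 20),
    TimedJob("load", 180, 15),
]
print(find_overlaps(nightly))  # ['transform']
```

Cron itself has no equivalent of this check: it would simply start `transform` at 01:00 against incomplete input.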

---

## When Cron Is Still a Good Fit
- ✅ Simple, independent, recurring tasks (backups, log cleanup, small data fetches)  
- ✅ Quick prototypes where a full orchestrator is overkill  
- ✅ Environments with very light automation needs

**Rule of thumb:** If tasks depend on other tasks finishing, or you need retries, backfills, SLAs, or rich monitoring—use an **orchestration tool** (e.g., Airflow, Prefect, Dagster). Otherwise, Cron is perfectly fine for small periodic jobs.

# Evolution of Orchestration Tools

Orchestration is the backbone of modern data engineering, ensuring that complex workflows run in sequence, on time, and with reliability. Let’s walk through its evolution and see how the tooling landscape has shifted over the years.

---

## 📜 Early Days – In-House Solutions
Before the past decade, orchestration was largely **limited to big tech companies** because:
- Open-source or managed orchestration tools didn’t exist.
- Building in-house systems was expensive and complicated.

---

## 🕰 Timeline of Orchestration Tools

![Evolution Timeline](./images/evolution.png)

- **Late 2000s**:  
  - **Facebook’s DataSwarm** → built internally to manage their growing data workflows.
  
- **2010s**:  
  - **Apache Oozie** → became popular, but it was tied to Hadoop clusters, making it less flexible in heterogeneous environments.

- **2014**:  
  - **Airbnb created Airflow** (announced publicly as open source in 2015) → inspired by earlier tools like DataSwarm but designed to be **open-source, flexible, and Python-based**.  
  - Quickly became a *de facto industry standard*.

- **2019**:  
  - **Apache Airflow** graduated from the Apache Incubator to a top-level Apache Software Foundation project, solidifying its role as the most widely adopted orchestration framework.

---

## 🌟 Airflow: Advantages & Challenges

![Airflow Advantages](./images/adv_airflow.png)

**Advantages**
- Written in **Python**, making it flexible and widely accessible.  
- Very **active open-source community** with frequent commits and bug fixes.  
- Available as a **managed service** through providers like AWS (MWAA), Google Cloud (Cloud Composer), and Astronomer.  

**Challenges**
- Struggles with **scalability** for very large workflows.  
- Limited built-in support for **data integrity**.  
- Lacks **native support for streaming pipelines**.

---

## 🔮 Other Open-Source Orchestration Tools

![Other Tools](./images/other_orche_tools.png)

Airflow is not the only option. Some tools in this space predate it, and newer ones aim to improve on Airflow’s design while addressing its shortcomings:

- **Luigi**: Early workflow management tool.  
- **Conductor**: Focused on microservice orchestration.  
- **Prefect**: More scalable and developer-friendly orchestration system.  
- **Dagster**: Adds built-in data quality testing and transformation features.  
- **Mage**: Provides integrated data transformation and monitoring.

---

## ⚡ Example Improvements

![Examples of Other Tools](./images/eg_other_orche.png)

- **Prefect** → Positioned as more scalable than Airflow, which can make it a better fit for heavy workloads.  
- **Dagster & Mage** → Bring built-in **data quality testing** and transformation features, helping ensure correctness beyond just scheduling.


# Orchestration Basics

## Why Orchestration?
Orchestration is about managing complex **data pipelines** more reliably than simple Cron jobs. While Cron can schedule tasks, orchestration tools give you advanced control over dependencies, monitoring, alerts, and fallback plans.

**Pros:**
- Set up dependencies between tasks  
- Monitor task execution  
- Get alerts on failures  
- Create fallback plans  

**Cons:**
- Adds more operational overhead compared to simple Cron scheduling  

![Pros and Cons](./images/pro_con.png)

---

## Directed Acyclic Graphs (DAGs)
At the heart of orchestration is the **Directed Acyclic Graph (DAG)** — a structure where:
- Each **task** is a node  
- Each **arrow** (edge) shows a dependency, pointing in the direction the data flows  
- Data flows **only in one direction** along each arrow  
- No loops or cycles are allowed  

This ensures predictable execution order.

![DAG](./images/dag.png)
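
As a rough illustration of why the "no cycles" rule matters, the sketch below uses Python's standard `graphlib` module to derive an execution order from a dependency map and to reject a graph that contains a loop. The task names and the helper `execution_order` are illustrative assumptions, not Airflow code.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(upstream):
    """Return task names in an order that respects every dependency.

    `upstream` maps each task to the set of tasks that must finish first.
    """
    return list(TopologicalSorter(upstream).static_order())


pipeline = {
    "ingest": set(),
    "transform": {"ingest"},
    "combine": {"transform"},
    "load": {"combine"},
}
print(execution_order(pipeline))  # ['ingest', 'transform', 'combine', 'load']

# Making the first task depend on the last one creates a cycle, so no valid order exists.
pipeline_with_loop = dict(pipeline, ingest={"load"})
try:
    execution_order(pipeline_with_loop)
except CycleError:
    print("cycle detected: not a valid DAG")
```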

---

## Dependencies Between Tasks
With Cron, tasks could overlap or break if one runs late.  
With orchestration, you can **define dependencies**: a task won’t start until its upstream tasks are complete.

![Task Dependencies](./images/task_based.png)
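
The difference from cron can be sketched in a few lines of plain Python. The hypothetical `run_pipeline` helper below starts a task only when all of its upstream tasks succeeded; otherwise it marks the task as `upstream_failed` and does not run it. It is a simplified model of dependency-aware execution, not how any particular orchestrator is implemented.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream):
    """Run callables in dependency order and return a state for each task.

    `tasks` maps task name -> callable; `upstream` maps task name -> set of
    task names that must succeed first.
    """
    states = {}
    for name in TopologicalSorter(upstream).static_order():
        # Skip the task if any upstream task did not finish successfully.
        if any(states[dep] != "success" for dep in upstream.get(name, ())):
            states[name] = "upstream_failed"
            continue
        try:
            tasks[name]()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
    return states


def broken_transform():
    raise RuntimeError("transform step crashed")


states = run_pipeline(
    tasks={"ingest": lambda: None, "transform": broken_transform, "load": lambda: None},
    upstream={"ingest": set(), "transform": {"ingest"}, "load": {"transform"}},
)
print(states)  # {'ingest': 'success', 'transform': 'failed', 'load': 'upstream_failed'}
```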

---

## Orchestration in Airflow
In Airflow, DAGs are defined in Python. You programmatically specify tasks and how they depend on each other.

![Code Example](./images/code_for_orchestration_.png)
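
Airflow DAG files typically declare dependencies with the `>>` operator, as in `ingest >> transform >> load`. The toy `Task` class below is a stand-in written for this note (it is not Airflow's operator class) and only shows how such an operator can record "left runs before right" while still allowing chains.

```python
class Task:
    """Toy stand-in for an orchestrator task; not an Airflow class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # ids of tasks that must finish before this one

    def __rshift__(self, other):
        # `a >> b` records that b depends on a.
        other.upstream.add(self.task_id)
        # Returning the right-hand task lets `a >> b >> c` chain left to right.
        return other


ingest = Task("ingest")
transform = Task("transform")
load = Task("load")

ingest >> transform >> load

print(transform.upstream)  # {'ingest'}
print(load.upstream)       # {'transform'}
```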

Airflow then lets you:
- Visualize DAGs  
- Trigger runs manually or on schedule  
- Monitor progress & debug issues  

---

## Time-based vs Event-based Triggers
Airflow DAGs can run on **time-based schedules** (like Cron) or be triggered by **events** (like new data arriving).

![Time or Event Based](./images/time_or_event_based.png)

### Example: Time-based
Run daily at midnight:  
![Example Time-based](./images/eg_time_based.png)
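
A daily-at-midnight schedule corresponds to the cron expression `0 0 * * *` (Airflow also accepts the preset `@daily`). Conceptually, a time-based trigger just needs to know the next moment that matches the schedule; the hypothetical helper below computes that for the midnight case.

```python
from datetime import datetime, timedelta


def next_midnight(now: datetime) -> datetime:
    """Return the first midnight strictly after `now`."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


print(next_midnight(datetime(2025, 3, 15, 13, 45)))  # 2025-03-16 00:00:00
```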

### Example: Event-based
Run when a dataset updates:  
![Example Event-based](./images/eg_event_based.png)

You can even make **part of a DAG** wait for an external event, e.g., a file landing in S3.  

![External Flow](./images/external_flow.png)
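
Waiting for an external event usually comes down to polling a condition until it becomes true or a timeout expires. The sketch below shows that pattern in plain Python; `wait_for`, `poke_interval`, and `timeout` are names chosen for this example, and the condition is any zero-argument callable (for a local file it could be `Path("incoming/data.csv").exists`, where the path is purely illustrative).

```python
import time


def wait_for(condition, poke_interval=1.0, timeout=60.0):
    """Poll `condition()` until it returns True or `timeout` seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Simulate an event that "arrives" on the third check.
checks = []

def data_has_arrived():
    checks.append(1)
    return len(checks) >= 3

print(wait_for(data_has_arrived, poke_interval=0.01, timeout=1.0))  # True
```

Downstream tasks would run only after the wait returns `True`; a `False` result is the point at which to alert or fail the run.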

---

## Data Quality Checks
Another orchestration benefit: embedding **data quality checks** into the DAG.  
For example:
- Counting null values  
- Validating that values fall within expected ranges  
- Verifying the schema  

![Quality Checks](./images/quality_checks.png)
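
The three checks listed above can be sketched as a single function that a pipeline task might call before loading data. Everything here is an assumption made for illustration: the function name `quality_report`, the row format (a list of dictionaries), and the sample columns and thresholds.

```python
def quality_report(rows, expected_columns, column, low, high):
    """Count schema mismatches, nulls, and out-of-range values for one column."""
    report = {"schema_errors": 0, "nulls": 0, "out_of_range": 0}
    for row in rows:
        # Schema check: the row must have exactly the expected columns.
        if set(row) != set(expected_columns):
            report["schema_errors"] += 1
            continue
        value = row[column]
        if value is None:
            report["nulls"] += 1          # null check
        elif not low <= value <= high:
            report["out_of_range"] += 1   # range check
    return report


rows = [
    {"id": 1, "temperature": 20.5},
    {"id": 2, "temperature": None},
    {"id": 3, "temperature": 500.0},
    {"id": 4},  # missing column
]
report = quality_report(rows, {"id", "temperature"}, "temperature", low=-50.0, high=60.0)
print(report)  # {'schema_errors': 1, 'nulls': 1, 'out_of_range': 1}
```

A task could raise an exception when any count is non-zero, so that downstream tasks do not run on bad data.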

---

## Summary
Orchestration tools (like Airflow) provide the structure to:
- Define pipelines as **DAGs**  
- Manage **dependencies** between tasks  
- Run on **time or event conditions**  
- Monitor and alert on failures  
- Enforce **data quality**  

Though more complex than Cron, orchestration is essential for reliable and scalable data engineering workflows.

# Airflow Core Components

Airflow is built around a set of core components that work together to run your DAGs (Directed Acyclic Graphs), monitor dependencies, execute tasks, and display status updates to users.

---

## 🔑 Core Components of Airflow

The main components of Airflow are:

- **Web Server** → Hosts the Airflow **User Interface (UI)**.  
- **Scheduler** → Monitors DAGs and determines when tasks should run.  
- **Workers** → Execute the tasks that are scheduled.  
- **Metadata Database** → Stores the state of DAGs and tasks (success, failure, etc.).  
- **DAG Directory** → Stores Python scripts that define your DAGs.

All these components are essential parts of an Airflow environment, whether you install it directly or use a managed service such as Amazon Managed Workflows for Apache Airflow (MWAA).

![Airflow Components](./images/airflow_components.png)

---

## 🖥️ User Interaction with Airflow

- You (the user) write **Python scripts** to define DAGs and place them in the **DAG Directory**.  
- These DAGs automatically appear in the **Web Server UI**, where you can:
  - Visualize DAGs  
  - Monitor tasks  
  - Trigger DAGs manually  
  - Troubleshoot issues  

Thus, the DAG directory + user interface are the **main interaction points** for users, while the other components work in the background.

![User Interaction](./images/user_interaction.png)

---

## ⏰ Scheduling with the Scheduler

- The **Scheduler** runs continuously and re-checks your DAGs at a short, configurable interval (often described as about once a minute).  
- It checks all DAGs in the DAG Directory and determines:
  - If a task should be triggered by time (schedule-based)  
  - If a task’s dependencies are complete (dependency-based)  
- Once ready, the **Scheduler**:
  1. Pushes tasks into a queue  
  2. Uses an **Executor** to extract tasks from the queue  
  3. Sends the tasks to **Workers**, which run them  

As tasks move through this process, their status transitions from:  
`scheduled → queued → running → success/failed`.

![Scheduler Workflow](./images/scheduler_workflow.png)
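
The hand-off described above can be modelled with a simple in-memory queue. This is a deliberately simplified, single-process sketch with hypothetical names (`run_ready_tasks`); a real deployment spreads these roles across separate processes or machines.

```python
from collections import deque


def run_ready_tasks(ready_tasks):
    """Simulate scheduler -> queue -> executor -> worker for tasks that are ready.

    `ready_tasks` maps task name -> callable. Returns the state history of each task.
    """
    history = {name: ["scheduled"] for name in ready_tasks}
    queue = deque()

    # Scheduler: push every ready task into the queue.
    for name in ready_tasks:
        queue.append(name)
        history[name].append("queued")

    # Executor: take tasks off the queue and hand each one to a worker.
    while queue:
        name = queue.popleft()
        history[name].append("running")
        try:
            ready_tasks[name]()  # the "worker" runs the task
            history[name].append("success")
        except Exception:
            history[name].append("failed")
    return history


def failing_task():
    raise ValueError("bad input")


history = run_ready_tasks({"ingest": lambda: None, "validate": failing_task})
print(history["ingest"])    # ['scheduled', 'queued', 'running', 'success']
print(history["validate"])  # ['scheduled', 'queued', 'running', 'failed']
```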

---

## 📊 Task Status and Metadata Database

- The **Scheduler** and **Workers** update the **Metadata Database** with task status and DAG states.  
- The **Web Server** then queries the metadata database to extract these statuses.  
- Finally, the **UI** displays task states to the user.  

This is why you can see real-time updates of task states (like running, success, or failed) in the Airflow UI.

![Task Status Workflow](./images/status_workflow.png)
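
The write-then-read pattern can be sketched with an in-memory SQLite database. The table name `task_state` and its columns are a simplified, hypothetical schema invented for this example; Airflow's real metadata schema is considerably richer.

```python
import sqlite3


def open_metadata_db():
    """Create an in-memory database with one simplified state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_state ("
        " dag_id TEXT, task_id TEXT, state TEXT,"
        " PRIMARY KEY (dag_id, task_id))"
    )
    return conn


def record_state(conn, dag_id, task_id, state):
    """What the scheduler and workers do: write the latest state of a task."""
    conn.execute(
        "INSERT OR REPLACE INTO task_state VALUES (?, ?, ?)",
        (dag_id, task_id, state),
    )
    conn.commit()


def states_for_dag(conn, dag_id):
    """What the web server does: read the current states to display them."""
    cursor = conn.execute(
        "SELECT task_id, state FROM task_state WHERE dag_id = ? ORDER BY task_id",
        (dag_id,),
    )
    return dict(cursor.fetchall())


conn = open_metadata_db()
record_state(conn, "nightly_pipeline", "ingest", "running")
record_state(conn, "nightly_pipeline", "transform", "queued")
record_state(conn, "nightly_pipeline", "ingest", "success")  # overwrites "running"
print(states_for_dag(conn, "nightly_pipeline"))  # {'ingest': 'success', 'transform': 'queued'}
```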

---

## ☁️ Managed Workflows for Apache Airflow (MWAA)

Amazon provides a managed service called **MWAA** (Managed Workflows for Apache Airflow), which automatically sets up and manages all Airflow components for you.

In MWAA:

- **DAG Directory** → Stored in **Amazon S3**.  
- **Metadata Database** → Hosted on **Amazon Aurora PostgreSQL**.  
- **Schedulers, Workers, Web Server** → Managed as AWS services inside a secure **VPC**.  
- Additional integrations → AWS CloudWatch for logging, SQS for queuing, and ECR for containers.

This allows you to use Airflow at scale without worrying about manually configuring or maintaining its infrastructure.

![MWAA](./images/mwaa.png)

---

# Video Recommendation: Learn the Airflow UI

> Refer to this video to understand the Airflow UI end-to-end:

**[Apache Airflow UI Tour | Airflow UI Walkthrough for Beginners](https://www.youtube.com/watch?v=sMIW8dLjzRU)**

## What you’ll learn
- Navigating DAGs: **Grid**, **Graph**, **Gantt**, **Tree** views (the Tree view was replaced by Grid in Airflow 2.3 and later)  
- Triggering DAG runs, **pausing/unpausing**, filtering & tags  
- Inspecting **Task Instance** details: logs, retries, XComs  
- Monitoring run states and understanding status colors  
- Viewing code, variables, connections, and admin panels
