Beata Sirowy
# Building data pipelines using Apache Airflow
Based on the IBM Data Engineering Professional Certificate, _ETL and Data Pipelines with Shell, Airflow and Kafka_ and Apache Airflow documentation (airflow.apache.org) <br> Images' copyright: IBM Skills Network

## An overview
- Airflow is not a streaming solution; it orchestrates batch-oriented workflows.
- With Apache Airflow, a workflow is represented as a directed acyclic graph (DAG).
The DAG is made of tasks that are arranged in a specific order of execution.


![image.png](attachment:image.png)

__Airflow's architecture:__

![image.png](attachment:image.png)

- Airflow comes with a built-in __scheduler__,
which handles the triggering of all scheduled workflows.
The scheduler is responsible for submitting
individual tasks from
each scheduled workflow to the executor.
- The __executor__ handles the running of
these tasks by assigning them to workers,
which then run the tasks.
- The __web server__ component of Airflow
provides a user-friendly, graphical __user interface__.
- From this __UI__, you can inspect,
trigger, and debug any of your __DAGs__ and their individual tasks.
- The __DAG directory__ contains all of your DAG files,
ready to be accessed by the scheduler,
the executor, and each of the executor's __workers__.
- Finally, Airflow hosts a __metadata database__,
which is used by the scheduler, executor,
and the web server to store
the state of each DAG and its tasks.
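
The division of labour described above can be illustrated with a toy model. This is only a sketch built on Python's standard library, not Airflow's implementation: the `dependencies` mapping, `run_task`, and `run_workflow` are hypothetical names, a thread pool stands in for the executor and its workers, and a plain dictionary stands in for the metadata database.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Stand-in for the metadata database: task name -> current state.
task_state = {name: "scheduled" for name in dependencies}
run_order = []


def run_task(name):
    """Work done by a 'worker'; a real task would execute an operator."""
    run_order.append(name)
    return name


def run_workflow(dependencies, max_workers=2):
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        in_flight = set()
        while sorter.is_active():
            # 'Scheduler' role: submit every task whose upstream tasks are done.
            for name in sorter.get_ready():
                task_state[name] = "queued"
                in_flight.add(executor.submit(run_task, name))
            # Wait until at least one worker finishes, then record the outcome.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                finished = future.result()
                task_state[finished] = "success"
                sorter.done(finished)


run_workflow(dependencies)
print(run_order)
print(task_state)
```

Because this example workflow is a linear chain, each task is only submitted once its single upstream task has finished, so the tasks run strictly one after another.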


__An alternative visualization of Airflow architecture (from airflow.apache.org)__

![diagram_basic_airflow_architecture.png](attachment:diagram_basic_airflow_architecture.png)

The meaning of the different connection types in the diagram above is as follows:

- brown solid lines represent DAG files submission and synchronization

- blue solid lines represent deploying and accessing installed packages and plugins

- black dashed lines represent control flow of workers by the scheduler (via executor)

- black solid lines represent accessing the UI to manage execution of the workflows

- red dashed lines represent accessing the metadata database by all components


### DAG: a directed acyclic graph

![image.png](attachment:image.png)

### Life cycle of a task's state

![image.png](attachment:image.png)

### Advantages

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Key principles of Airflow pipelines

![image.png](attachment:image.png)


Apache Airflow pipelines are
built on four main principles.
- __Scalable__.
Airflow has a modular architecture and uses
a message queue to orchestrate
an arbitrary number of workers,
so capacity can grow by adding workers.
- __Dynamic__. Airflow pipelines are defined in Python,
which allows dynamic pipeline generation:
DAGs and tasks can be created programmatically, for example in a loop.
Pipelines can also contain multiple simultaneous tasks.
- __Extensible__. You can easily define
your own operators and
extend libraries to suit your environment.
- __Lean__. Airflow pipelines are lean and explicit.
Parameterization is built into
its core using the powerful Jinja templating engine. 
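
To make the "dynamic" and "lean" principles more concrete, here is a small sketch that uses the `jinja2` package directly (it must be installed; Airflow depends on it). The table names and the `export_` task-id prefix are made up for illustration. `ds` mirrors Airflow's built-in template variable for the logical date in `YYYY-MM-DD` form. Inside Airflow, templated operator fields are rendered automatically at run time, and custom values are usually passed through `params`; here the rendering is done by hand only to show what the templating engine does.

```python
from jinja2 import Template

# Hypothetical table names; in a real DAG file each entry would become one task.
tables = ["orders", "customers"]

# A command template with Jinja placeholders.
# `ds` imitates Airflow's logical-date variable; `table` is our own parameter.
command_template = Template("echo 'exporting {{ table }} for {{ ds }}'")

# Dynamic generation: one rendered command per table, built in a loop.
commands = {
    f"export_{table}": command_template.render(table=table, ds="2024-01-01")
    for table in tables
}

for task_id, command in commands.items():
    print(task_id, "->", command)
```

Adding a table to the list adds a command without any other change, which is the same idea as generating tasks in a loop inside a DAG file.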

### Companies using Apache Airflow

![image-2.png](attachment:image-2.png)

### Summing up

- Apache Airflow is a platform to programmatically author,
schedule, and monitor workflows.
- The five main features of Airflow are its use of Python,
its intuitive and useful user interface,
extensive plug-and-play integrations,
ease of use, and the fact that it is open source.
- Apache Airflow is scalable,
dynamic, extensible, and lean.
- Defining and organizing machine learning pipeline dependencies
with Apache Airflow is one of the common use cases.

## Representing data pipelines as DAGs in Apache Airflow

__DAG - a Directed Acyclic Graph__

![image.png](attachment:image.png)

![image.png](attachment:image.png)

- The simplest non-trivial DAG has a single directed edge:
a single root node connected to a single terminal node.
- A tree is a commonly used graph for
representing family trees or directory structures.
- All trees are DAGs, but not all DAGs are trees,
because a node in a DAG can have more than one parent.
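
The relationship between trees and DAGs can be checked with a short sketch based on Python's standard `graphlib` module. The helper names `is_dag` and `is_tree` and the example graphs are hypothetical. Each graph maps a node to the set of its direct predecessors (parents), and "tree" here means a rooted tree: acyclic, exactly one root, and exactly one parent for every other node.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(predecessors):
    """Return True if the directed graph (node -> set of parents) has no cycle."""
    try:
        tuple(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


def is_tree(predecessors):
    """Return True for a rooted tree: acyclic, one root, one parent per other node."""
    if not is_dag(predecessors):
        return False
    nodes = set(predecessors) | {p for ps in predecessors.values() for p in ps}
    parent_counts = [len(predecessors.get(node, ())) for node in nodes]
    return parent_counts.count(0) == 1 and all(c <= 1 for c in parent_counts)


# Simplest non-trivial DAG: one root, one terminal node, one directed edge.
single_edge = {"a": set(), "b": {"a"}}

# A diamond: "load" has two parents, so it is a DAG but not a tree.
diamond = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}

# A cycle: neither a DAG nor a tree.
cycle = {"a": {"b"}, "b": {"a"}}

for name, graph in [("single_edge", single_edge), ("diamond", diamond), ("cycle", cycle)]:
    print(name, "is_dag:", is_dag(graph), "is_tree:", is_tree(graph))
```

The diamond shape is common in data pipelines, where two branches start from the same upstream task and are joined again by a downstream task.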

### DAG in Airflow

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Summing up


- In Apache Airflow,
DAGs are workflows defined as Python code.
- Tasks, which are nodes in your DAG, are
created by implementing Airflow's built-in operators.
- Pipelines are specified as dependencies between tasks,
which are the directed edges between nodes in your DAG.
- The Airflow scheduler schedules and deploys your DAGs.
- The key advantage of
Apache Airflow's approach to representing
data pipelines as DAGs is the fact that they are expressed as code.
Accordingly, it makes your data pipelines more
maintainable, testable, and collaborative. 