**Coursebook: Airflow Introduction**

- Part of Data Engineering Airflow Specialization
- Course Length: 3 Hours
- Last Updated: May 2024

---


- Developed by [Algoritma](https://algorit.ma/)'s product division and instructors team

# Background
 
The coursebook is part of the **Data Engineering Airflow Specialization** prepared by Algoritma. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc.

## Training Objectives

This coursebook explains how to set up, manage, and run an Airflow application, with a focus on creating DAGs and tasks in Airflow.

This coursebook focuses on:

- Apache Airflow Introduction
- Directed Acyclic Graph
- Tasks and Their Dependencies



# Apache Airflow

## Introduction to Apache Airflow 

Apache Airflow is an open-source platform designed to create, schedule, and manage data workflows automatically. Initially developed by the team at Airbnb in 2014, Apache Airflow has since become one of the most popular tools in the data analysis and Big Data processing ecosystem.

Using Apache Airflow in data analysis brings several significant benefits, including:

- Operational Efficiency: By automating workflows, the time and effort required to execute routine tasks can be significantly reduced, enhancing overall operational efficiency.

- Consistency: Apache Airflow ensures that workflows are executed consistently according to predefined definitions, reducing the risk of human errors and improving result accuracy.

- Scalability: With the ability to handle complex workflows at large scale, Apache Airflow enables organizations to keep up with their growing data without sacrificing performance.

- Flexibility: This platform allows users to easily customize workflows to meet changing business needs, facilitating quick adaptation to environmental changes.






## Apache Airflow Components

To support workflow management, Apache Airflow is built from several components.

![airflow component.png](https://airflow.apache.org/docs/apache-airflow/2.0.1/_images/arch-diag-basic.png)

Configuring Apache Airflow and building data pipelines are usually the responsibility of a data engineer. Airflow itself is configured through the `airflow.cfg` file, while data pipelines (DAGs) can be monitored and managed through the Airflow user interface. A DAG interacts directly with the following components:

- Scheduler: Responsible for scheduling and triggering the execution of tasks.
- Worker: Responsible for executing tasks.
- Web Server: Allows users to view and manage DAGs.
- Metadata Database: Stores information about the workflows that have been created and the state of their runs.

By understanding these fundamental elements, you will see that Apache Airflow plays a crucial role in organizing scheduled workflows and tasks.
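To make the role of `airflow.cfg` more concrete, here is a minimal sketch that reads a small configuration fragment with Python's standard `configparser` module. The fragment is illustrative only: the values shown are assumptions for this example, and a real `airflow.cfg` contains many more sections and options.

```python
import configparser

# Illustrative fragment only; the values are assumptions for this example.
SAMPLE_CFG = """
[core]
dags_folder = /opt/airflow/dags
executor = LocalExecutor
load_examples = False
"""


def read_core_settings(cfg_text: str) -> dict:
    """Parse the [core] section of an airflow.cfg-style text into a dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    core = parser["core"]
    return {
        "dags_folder": core.get("dags_folder"),
        "executor": core.get("executor"),
        # getboolean converts the text "False" into the boolean False
        "load_examples": core.getboolean("load_examples"),
    }


settings = read_core_settings(SAMPLE_CFG)
print(settings)
```

The `dags_folder` option tells Airflow where to look for DAG files, which is why the examples in this coursebook store files under `/opt/airflow/dags`.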

## Directed Acyclic Graph (DAG)

In Apache Airflow, a workflow of tasks is defined as a DAG (Directed Acyclic Graph). Simply put, a DAG is a collection of tasks to be executed, in which dependencies point in one direction and never loop back to an earlier task. By using a DAG, you can arrange the sequence and dependencies of each task.


![basic dag](https://airflow.apache.org/docs/apache-airflow/stable/_images/basic-dag.png)

Here's a simple DAG example with 4 tasks (a, b, c, d), indicating the sequence and dependencies of each task:

- Task A will be executed first.
- Tasks B and C will be executed after Task A is completed.
- Task D will be executed after both Tasks B and C are completed.
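The ordering rules above can be sketched in plain Python, without Airflow, using the standard library's `graphlib` module (Python 3.9+). This is only an illustration of what "directed" and "acyclic" mean; the names `dependencies`, `execution_order`, and `is_acyclic` are made up for this example and are not part of Airflow.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# This mirrors the four-task example: a -> (b, c) -> d.
dependencies = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"b", "c"},
}


def execution_order(deps: dict) -> list:
    """Return one valid order in which every task comes after its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps: dict) -> bool:
    """Return False if the dependencies loop back on themselves."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)                                      # a first, d last, b and c in between
print(is_acyclic(dependencies))                   # True
print(is_acyclic({"a": {"b"}, "b": {"a"}}))       # False: a and b wait on each other
```

If two tasks depended on each other, no valid order would exist, which is why a workflow must be acyclic.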

Before we jump into DAG creation, note that Apache Airflow automates the scheduling of DAG runs. Schedules in Apache Airflow are commonly written in cron syntax, so we will discuss that first.

## CRON-Syntax

Cron syntax is a format used to define schedules for executing commands or scripts automatically on Unix-like operating systems. It consists of five fields that represent minute, hour, day of the month, month, and day of the week, respectively. Each field is separated by spaces. Here's a simple explanation of each field:

1. Minute (0–59): Defines the minute of the hour when the command will be executed.
2. Hour (0–23): Defines the hour of the day when the command will be executed.
3. Day of the month (1–31): Defines the day of the month when the command will be executed.
4. Month (1–12): Defines the month of the year when the command will be executed.
5. Day of the week (0–6 or 7): Defines the day of the week when the command will be executed. (0 or 7 represents Sunday, and 1 represents Monday, and so on.)

Using these fields, users can specify precise schedules for running tasks at specific times or intervals.

    ┌───────────── minute (0–59)
    │ ┌───────────── hour (0–23)
    │ │ ┌───────────── day of the month (1–31)
    │ │ │ ┌───────────── month (1–12)
    │ │ │ │ ┌───────────── day of the week (0–6) (Sunday to Saturday; 7 is also Sunday on some systems)
    │ │ │ │ │
    │ │ │ │ │
    * * * * *


For example: 

- `0 2 * * *`: Executes the task every day at 2 AM.
- `30 8 * * 1-5`: Executes the task every Monday to Friday at 8:30 AM.
- `0 0 1 * *`: Executes the task at midnight on the first day of every month.
- `0 0 * * 6`: Executes the task every Saturday at midnight.
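To check your reading of the examples above, the following sketch tests whether a given date and time matches a cron expression. It is a simplified teaching aid, not the parser Airflow uses: the helper names are hypothetical, it supports only `*`, single values, ranges (`a-b`), lists (`a,b`), and steps on a wildcard (`*/n`), and it requires all five fields to match (standard cron treats a restricted day of the month and day of the week as alternatives).

```python
from datetime import datetime


def _field_matches(field: str, value: int, minimum: int) -> bool:
    """Check one cron field against a value; `minimum` is the field's lowest legal value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # A step counts from the lowest legal value of the field.
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` falls on the schedule described by `expression`."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    weekday_ok = _field_matches(weekday, cron_weekday, 0) or (
        cron_weekday == 0 and _field_matches(weekday, 7, 0)  # 7 also means Sunday
    )
    return (
        _field_matches(minute, moment.minute, 0)
        and _field_matches(hour, moment.hour, 0)
        and _field_matches(day, moment.day, 1)
        and _field_matches(month, moment.month, 1)
        and weekday_ok
    )


# 6 May 2024 is a Monday; 4 May 2024 is a Saturday.
print(cron_matches("30 8 * * 1-5", datetime(2024, 5, 6, 8, 30)))  # True
print(cron_matches("30 8 * * 1-5", datetime(2024, 5, 4, 8, 30)))  # False
print(cron_matches("0 0 * * 6", datetime(2024, 5, 4, 0, 0)))      # True
```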

Now that we know the scheduling syntax used in Apache Airflow, let's move on to DAG creation.

### DAG instance

There are several ways to create a DAG. The most common way is to use the `DAG` constructor. When you declare a DAG, you can pass `default_args`, a parameter that holds the default settings applied to the tasks in your DAG. You can see the list of arguments you can set in the [official documentation](https://airflow.apache.org/docs/apache-airflow/1.10.12/tutorial.html). Since there are many arguments you can set, the common practice is to store them in a dictionary so that they do not have to be repeated for every task.

---

```python
from airflow import DAG
from airflow.decorators import dag, task  # decorators used in the examples below
from datetime import datetime, timedelta, timezone

default_args = {
    'owner': 'Algoritma',
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}
```

---

The code above creates the `default_args` dictionary, which configures the default settings and behavior of the tasks in a Directed Acyclic Graph (DAG). Let's break down each key-value pair:

1. `'owner': 'Algoritma'`: This specifies the owner of the DAG, providing a way to attribute responsibility for the DAG to a specific person or team.

2. `'retries': 3`: This specifies the number of retries that should be attempted in case of task failures. In this case, it's set to 3, meaning a task will be retried up to three times if it fails.

3. `'retry_delay': timedelta(minutes=5)`: This determines the delay between retries in case of task failures. Here, it's set to 5 minutes, meaning there will be a 5-minute pause between a failed task and its retry.
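The sketch below illustrates two consequences of these settings in plain Python. The helper functions are hypothetical and written only for this example: the first shows that a value set directly on a task takes precedence over the same key in `default_args`, and the other two show what `retries` and `retry_delay` add up to in the worst case, assuming a fixed delay between retries.

```python
from datetime import timedelta

default_args = {
    "owner": "Algoritma",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}


def effective_task_args(defaults: dict, **task_overrides) -> dict:
    """Combine DAG-level defaults with task-level arguments; task-level values win."""
    return {**defaults, **task_overrides}


def max_attempts(task_args: dict) -> int:
    """One initial attempt plus the configured number of retries."""
    return 1 + task_args.get("retries", 0)


def max_retry_wait(task_args: dict) -> timedelta:
    """Total waiting time if every retry is needed (fixed delay assumed)."""
    return task_args["retries"] * task_args["retry_delay"]


print(max_attempts(default_args))      # 4
print(max_retry_wait(default_args))    # 0:15:00

# A single task can override a default without touching the others.
custom = effective_task_args(default_args, retries=1)
print(custom["retries"], custom["owner"])  # 1 Algoritma
```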

Let's continue with declaring the DAG.

There are several ways to declare a DAG. One of them is the `@dag` decorator. Simply put, a decorator is a design pattern in Python that lets you add new functionality to an existing object, such as a function, without modifying its structure. You have already learned about Python functions in the previous module.
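If decorators are new to you, here is a small example in plain Python that has nothing to do with Airflow yet. The decorator `log_call` is a made-up name for illustration: it records every call to the function it wraps, while the function's own code stays unchanged.

```python
import functools


def log_call(func):
    """Decorator that records each call without changing the wrapped function's code."""

    @functools.wraps(func)  # keep the original function name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)

    wrapper.calls = []
    return wrapper


@log_call
def add(a, b):
    return a + b


result = add(2, 3)
print(result)      # 5
print(add.calls)   # [((2, 3), {})]
```

Writing `@log_call` above `add` is shorthand for `add = log_call(add)`. The `@dag` and `@task` decorators are applied the same way, although what they add is specific to Airflow.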

Now let's move on to the code that declares the DAG.

---

```python
@dag(dag_id='trending_youtube_dag_sqlite',
    default_args=default_args,
    description='A pipeline to fetch trending YouTube videos',
    start_date=datetime(2023, 5, 7, tzinfo=timezone(timedelta(hours=7))),
    schedule_interval='10 * * * *',
    catchup=False
)
def trending_youtube_dag():
    '''
    This is youtube trending dag, we will define the task in the next section
    '''


dag = trending_youtube_dag()
```

---

**Description:**

- **`@dag(...)`**: This is a decorator used to define a DAG in Apache Airflow. It takes several arguments:
    - **`dag_id`**: Specifies the identifier for the DAG, which is used to uniquely identify it within Airflow.
    - **`default_args`**: Specifies the default arguments applied to the tasks in the DAG, such as owner, retries, and retry delay. Here it is the `default_args` dictionary defined earlier.
    - **`description`**: Provides a description for the DAG, describing its purpose or functionality.
    - **`start_date`**: Specifies the start date for the DAG. Here, it's set to May 7, 2023, with a timezone offset of 7 hours (UTC+7).
    - **`schedule_interval`**: Defines the schedule for the DAG as a cron expression like the ones we learned earlier. In this case, `'10 * * * *'` runs the DAG at minute 10 of every hour, that is, once per hour. To run every 10 minutes, the expression would be `'*/10 * * * *'`.
    - **`catchup`**: Specifies whether Airflow should backfill DAG runs for the schedule intervals that were missed between the start date and now. Setting it to `False` means Airflow skips those missed intervals and only schedules runs from the most recent interval onward.

- **`def trending_youtube_dag()`**: This defines the function `trending_youtube_dag`, which serves as the entry point for the DAG. Inside this function, you define the tasks and their dependencies.

- **`dag = trending_youtube_dag()`**: This line calls the `trending_youtube_dag` function, creating an instance of the DAG. This instance is stored in the variable `dag`, which can then be used to interact with the DAG within the Airflow environment.
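The `start_date` value deserves a closer look, because it carries a fixed UTC+7 offset. The short example below uses only the standard `datetime` module to show which moment in UTC that date refers to; the variable names are chosen for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset of +7 hours from UTC, as used in the DAG declaration.
utc_plus_7 = timezone(timedelta(hours=7))

start_date = datetime(2023, 5, 7, tzinfo=utc_plus_7)
start_date_utc = start_date.astimezone(timezone.utc)

print(start_date.isoformat())      # 2023-05-07T00:00:00+07:00
print(start_date_utc.isoformat())  # 2023-05-06T17:00:00+00:00
```

Midnight on May 7 at UTC+7 is 17:00 on May 6 in UTC, so a date shown in UTC can look like the previous day even though it is the same moment.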

We have now declared the DAG. But a DAG is nothing without tasks to run.

## Tasks and Their Dependencies

In Airflow, a task serves as the fundamental unit of execution. These tasks are organized into Directed Acyclic Graphs (DAGs), and dependencies are established between them to indicate the desired order of execution. There are three basic types of tasks:

- Operators, predefined task templates that you can string together quickly to build most parts of your DAGs, for example `PythonOperator`, `BashOperator`, and `EmailOperator`.
- Sensors, a special subclass of Operators which are entirely about waiting for an external event to happen.
- A TaskFlow-decorated `@task`, which is a custom Python function packaged up as a Task.

We will declare a task with the TaskFlow `@task` decorator.


---

```python
@task()
def fetch_trending_videos(region_code: str, file_path: str):
    '''
    function to be used for fetching trending videos
    '''
    
```

---

When you declare a task with the `@task` decorator, make sure the DAG is declared with the `@dag` decorator as well, and place the task inside the DAG function. Now let's see the full code of our DAG.

---

```python
@dag(dag_id='trending_youtube_dag_sqlite',
    default_args=default_args,
    description='A pipeline to fetch trending YouTube videos',
    start_date=datetime(2023, 5, 7, tzinfo=timezone(timedelta(hours=7))),
    schedule_interval='0 10 * * *',
    catchup=False
)
def trending_youtube_dag():
    '''
    This is youtube trending dag, we will define the task in the next section
    '''
    @task()
    def fetch_trending_videos(region_code: str, file_path: str):
        '''
        function to be used for fetching trending videos
        '''


dag = trending_youtube_dag()
```

---

> **NOTE**: Pay attention to indentation, since Python uses indentation to determine which block or inner function a line belongs to.

In the code above, we defined the task `fetch_trending_videos` inside `trending_youtube_dag`. When you need more than one task in a DAG, add another `@task`-decorated function at the same indentation level.

---

```python
@dag(dag_id='trending_youtube_dag_sqlite',
    default_args=default_args,
    description='A pipeline to fetch trending YouTube videos',
    start_date=datetime(2023, 5, 7, tzinfo=timezone(timedelta(hours=7))),
    schedule_interval='0 10 * * *',
    catchup=False
)
def trending_youtube_dag():
    '''
    This is youtube trending dag, we will define the task in the next section
    '''
    @task()
    def fetch_trending_videos(region_code: str, file_path: str):
        '''
        function to be used for fetching trending videos
        '''
    
    @task()
    def data_processing(source_file_path: str, target_file_path: str):
        '''
        Function to be used for preprocess the data.
        '''

dag = trending_youtube_dag()
```

---

The next thing to declare inside the DAG is the task flow. You add it inside the `@dag`-decorated function, usually after all tasks have been defined. In our case, the flow we want to create is:

1. Fetch the trending videos.
2. Process the fetched data.

The first thing to do is to call each task function and store the result in a new variable.

---

```python
file_path = '/opt/airflow/dags/tmp_file.json'
# The arguments must match the parameters declared on the task function.
fetch_trending_videos_task = fetch_trending_videos(region_code='ID', file_path=file_path)
processed_file_path = '/opt/airflow/dags/tmp_file_processed.json'
data_processing_task = data_processing(source_file_path=file_path, target_file_path=processed_file_path)
```

---
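It may look as if calling `fetch_trending_videos(...)` runs the function immediately, but inside a DAG the call only describes a task that Airflow executes later. The toy decorator below imitates that idea in plain Python. It is a stand-in written for this explanation, not Airflow's implementation, and the names `toy_task` and `TaskCall` are hypothetical.

```python
class TaskCall:
    """Records a task name and its arguments instead of running the function."""

    def __init__(self, func, kwargs):
        self.task_id = func.__name__
        self.func = func
        self.kwargs = kwargs

    def run(self):
        """Execute the stored function; in Airflow this step happens on a worker."""
        return self.func(**self.kwargs)


def toy_task(func):
    """Toy replacement for @task: calling the function builds a TaskCall."""

    def build(**kwargs):
        return TaskCall(func, kwargs)

    return build


executed = []  # filled only when a task body really runs


@toy_task
def fetch_trending_videos(region_code: str, file_path: str):
    executed.append(region_code)
    return file_path


fetch_trending_videos_task = fetch_trending_videos(
    region_code="ID", file_path="/opt/airflow/dags/tmp_file.json"
)
print(fetch_trending_videos_task.task_id)  # fetch_trending_videos
print(executed)                            # []  (nothing has run yet)

output_path = fetch_trending_videos_task.run()
print(executed)                            # ['ID']
```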

We have set the file paths we want. Now we have to declare the dependency between the tasks. In Airflow, a dependency between tasks is declared with the `>>` operator. In our case, we want `data_processing_task` to be executed after `fetch_trending_videos_task`, so the dependency should be:

```python
fetch_trending_videos_task >> data_processing_task
```
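In ordinary Python, `>>` is the right-shift operator, but a class can give it its own meaning by defining the `__rshift__` method. The toy class below shows how `>>` can be made to record "run this one after that one". It is a simplified illustration with a hypothetical `ToyTask` class, not Airflow's actual code.

```python
class ToyTask:
    """Minimal stand-in for a task that remembers its neighbours."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.upstream = []    # tasks that must finish before this one
        self.downstream = []  # tasks that run after this one

    def __rshift__(self, other: "ToyTask") -> "ToyTask":
        """Called for `self >> other`: `other` runs after `self`."""
        self.downstream.append(other)
        other.upstream.append(self)
        return other  # returning the right-hand task allows chaining


fetch = ToyTask("fetch_trending_videos")
process = ToyTask("data_processing")
report = ToyTask("report")  # extra task, added only to show chaining

fetch >> process >> report

print([t.task_id for t in fetch.downstream])    # ['data_processing']
print([t.task_id for t in process.upstream])    # ['fetch_trending_videos']
print([t.task_id for t in process.downstream])  # ['report']
```

Because each `>>` returns the task on its right, a chain such as `a >> b >> c` reads from left to right in execution order.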

Now our code should look like this.

---

```python
@dag(dag_id='trending_youtube_dag_sqlite',
    default_args=default_args,
    description='A pipeline to fetch trending YouTube videos',
    start_date=datetime(2023, 5, 7, tzinfo=timezone(timedelta(hours=7))),
    schedule_interval='0 10 * * *',
    catchup=False
)
def trending_youtube_dag():
    '''
    This is youtube trending dag, we will define the task in the next section
    '''
    @task()
    def fetch_trending_videos(region_code: str, file_path: str):
        '''
        function to be used for fetching trending videos
        '''
    
    @task()
    def data_processing(source_file_path: str, target_file_path: str):
        '''
        Function to be used for preprocess the data.
        '''
    
    file_path = '/opt/airflow/dags/tmp_file.json'
    # The arguments must match the parameters declared on the task function.
    fetch_trending_videos_task = fetch_trending_videos(region_code='ID', file_path=file_path)
    processed_file_path = '/opt/airflow/dags/tmp_file_processed.json'
    data_processing_task = data_processing(source_file_path=file_path, target_file_path=processed_file_path)
    
    fetch_trending_videos_task >> data_processing_task
    
dag = trending_youtube_dag()
```

---