## Assignment: ETL and Data Pipelines with Shell, Airflow and Kafka (Week 3)

Credit: Coursera

This is my first assignment on building an Apache Airflow DAG. The task is described at [this URL](https://author-ide.skills.network/render?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJtZF9pbnN0cnVjdGlvbnNfdXJsIjoiaHR0cHM6Ly9jZi1jb3Vyc2VzLWRhdGEuczMudXMuY2xvdWQtb2JqZWN0LXN0b3JhZ2UuYXBwZG9tYWluLmNsb3VkL0lCTS1EQjAyNTBFTi1Ta2lsbHNOZXR3b3JrL2xhYnMvQXBhY2hlJTIwQWlyZmxvdy9CdWlsZCUyMGElMjBEQUclMjB1c2luZyUyMEFpcmZsb3cvQnVpbGQlMjBhJTIwREFHJTIwdXNpbmclMjBBaXJmbG93Lm1kIiwidG9vbF90eXBlIjoidGhlaWFkb2NrZXIiLCJhZG1pbiI6ZmFsc2UsImlhdCI6MTY3MjkxOTM5Nn0.eCSPeH0nmI8K2poct5jS0bHiQ2DhxBm_bhtThgqcmVQ). The goal is to create a DAG whose pipeline consists of four tasks: download, extract, transform, and load.

* Check the Python version (optional; you can skip this step)

In [19]:
!python --version

Python 3.10.4


* The very first step is to import the required modules.

In [33]:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

* Set the default arguments for the DAG

In [34]:
default_args = {
  'owner': 'Iron Man',
  'start_date': pendulum.now('Asia/Dhaka'),
  'email': ['ironman@somemail.com'],
  'email_on_failure': False,
  'email_on_retry': False,
  'retries': 1,
  'retry_delay': timedelta(minutes=5),
}

* Instantiate the DAG class

In [22]:
dag = DAG(
  'ETL-server-access-log-processing',
  default_args=default_args,
  description='ETL server access log processing assignment of coursera course',
  schedule=timedelta(days=1)
)

* download - the first task

In [23]:
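# Each BashOperator runs in its own temporary directory by default,
# so the file is saved to an absolute path that the next task can read.
download = BashOperator(
  task_id='download',
  bash_command='wget -O /home/project/airflow/dags/web-server-access-log.txt https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Apache%20Airflow/Build%20a%20DAG%20using%20Airflow/web-server-access-log.txt',
  dag=dag,
)

* extract - second task

In [24]:
extract = BashOperator(
  task_id='extract',
  bash_command='cut -d"#" -f1,4 /home/project/airflow/dags/web-server-access-log.txt > /home/project/airflow/dags/extracted-data.txt',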
  dag=dag,
)
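To make the extract step concrete, here is a minimal Python sketch of what `cut -d"#" -f1,4` does to each line: split on `#` and keep the first and fourth fields. The helper name `cut_fields` and the sample rows are invented for illustration; the real log file's columns may differ.

```python
def cut_fields(line, delimiter="#", fields=(1, 4)):
    """Mimic `cut -d"#" -f1,4` for a single line (fields are 1-based)."""
    if delimiter not in line:
        # Like cut without -s: lines lacking the delimiter pass through unchanged.
        return line
    parts = line.split(delimiter)
    # Keep only the requested fields that actually exist on this line.
    selected = [parts[i - 1] for i in fields if i <= len(parts)]
    return delimiter.join(selected)


# Hypothetical sample rows, not taken from the real log file.
sample = [
    "timestamp#latitude#longitude#visitorid#accessed_from_mobile#browser_code",
    "2020-01-01 10:00:00#23.8#90.4#42#False#7",
]
extracted = [cut_fields(row) for row in sample]
for row in extracted:
    print(row)
```

The selected fields are joined with the same delimiter, which is also how `cut` writes its output.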

* transform - third task

In [25]:
transform = BashOperator(
  task_id='transform',
  bash_command='tr "[:lower:]" "[:upper:]" < /home/project/airflow/dags/extracted-data.txt > /home/project/airflow/dags/transformed-data.txt',
  dag=dag,
)
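The transform step only changes letter case. As a rough Python equivalent, assuming ASCII input, the sketch below maps `a-z` to `A-Z` and leaves digits and punctuation untouched. The helper name `to_upper` and the sample text are invented for illustration.

```python
import string

# Translation table mapping each ASCII lowercase letter to its uppercase form.
UPPER_TABLE = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def to_upper(text):
    """Uppercase ASCII letters only, similar to tr with [:lower:] [:upper:]."""
    return text.translate(UPPER_TABLE)


# Hypothetical extracted rows, not taken from the real log file.
sample = "timestamp#visitorid\n2020-01-01 10:00:00#42\n"
transformed = to_upper(sample)
print(transformed, end="")
```

How `tr` treats non-ASCII letters depends on the locale, so this sketch deliberately stays with ASCII.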

* load - final task

In [26]:
load = BashOperator(
  task_id='load',
  bash_command='tar -czf /home/project/airflow/dags/ETL_server_access_log_processing.tar.gz /home/project/airflow/dags/extracted-data.txt /home/project/airflow/dags/transformed-data.txt',
  dag=dag,
)
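The load step packs both output files into one gzip-compressed archive. A minimal sketch of the same idea with Python's `tarfile` module follows. The helper name `make_archive` and the file contents are placeholders, and unlike the `tar` command above, the sketch stores only the file names rather than their full paths.

```python
import tarfile
import tempfile
from pathlib import Path


def make_archive(archive_path, files):
    """Write the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            # arcname stores just the file name instead of the full path.
            tar.add(f, arcname=Path(f).name)


# Demo in a temporary directory with placeholder contents.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    extracted = tmp / "extracted-data.txt"
    transformed = tmp / "transformed-data.txt"
    extracted.write_text("timestamp#visitorid\n")
    transformed.write_text("TIMESTAMP#VISITORID\n")

    archive = tmp / "ETL_server_access_log_processing.tar.gz"
    make_archive(archive, [extracted, transformed])

    # Read the archive back to confirm what it contains.
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()
    print(members)
```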

* task pipeline

In [27]:
download >> extract >> transform >> load

<Task(BashOperator): load>
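The cell output is the last task in the chain because `>>` returns its right-hand operand, which is what lets several tasks be chained in one expression. The toy class below sketches that idea. `ToyTask` is a hypothetical stand-in written for this note, not Airflow's actual implementation.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; not Airflow's real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records b as downstream of a and returns b,
        # so a >> b >> c keeps chaining from left to right.
        self.downstream.append(other)
        return other

    def __repr__(self):
        return f"<ToyTask: {self.task_id}>"


download, extract, transform, load = (
    ToyTask(name) for name in ("download", "extract", "transform", "load")
)
result = download >> extract >> transform >> load
print(result)
```

Each task ends up with exactly one downstream task, giving the linear order download, extract, transform, load.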

<br>

### Final Code

In [32]:
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

default_args = {
  'owner': 'Iron Man',
  'start_date': pendulum.now('Asia/Dhaka'),
  'email': ['ironman@somemail.com'],
  'email_on_failure': False,
  'email_on_retry': False,
  'retries': 1,
  'retry_delay': timedelta(minutes=5),
}

dag = DAG(
  'ETL-server-access-log-processing',
  default_args=default_args,
  description='ETL server access log processing assignment of coursera course',
  schedule=timedelta(days=1)
)

download = BashOperator(
  task_id='download',
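  # Saved to an absolute path: each BashOperator runs in its own temporary directory by default.
  bash_command='wget -O /home/project/airflow/dags/web-server-access-log.txt https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Apache%20Airflow/Build%20a%20DAG%20using%20Airflow/web-server-access-log.txt',
  dag=dag,
)

extract = BashOperator(
  task_id='extract',
  bash_command='cut -d"#" -f1,4 /home/project/airflow/dags/web-server-access-log.txt > /home/project/airflow/dags/extracted-data.txt',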
  dag=dag,
)

transform = BashOperator(
  task_id='transform',
  bash_command='tr "[:lower:]" "[:upper:]" < /home/project/airflow/dags/extracted-data.txt > /home/project/airflow/dags/transformed-data.txt',
  dag=dag,
)

load = BashOperator(
  task_id='load',
  bash_command='tar -czf /home/project/airflow/dags/ETL_server_access_log_processing.tar.gz /home/project/airflow/dags/extracted-data.txt /home/project/airflow/dags/transformed-data.txt',
  dag=dag,
)

download >> extract >> transform >> load

<Task(BashOperator): load>

<br>

To add the DAG to Airflow, follow the steps below.

1. Save the above file as ETL_Server_Access_Log_Processing.py
2. From a terminal on the machine where Airflow is installed, copy the file into the Airflow DAGs folder (the tasks in this assignment use /home/project/airflow/dags)

The DAG will then appear in the Airflow web UI.
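As one possible way to carry out step 2, the sketch below copies a DAG file into a `dags` folder using Python's `shutil`. The helper name `install_dag` is hypothetical, and the demo runs against a temporary directory so that no real Airflow installation is touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_dag(dag_file, airflow_home):
    """Copy a DAG file into <airflow_home>/dags and return the new path."""
    dags_dir = Path(airflow_home) / "dags"
    dags_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(dag_file, dags_dir))


# Demo against a temporary directory with a placeholder DAG file.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    dag_file = tmp / "ETL_Server_Access_Log_Processing.py"
    dag_file.write_text("# DAG definition goes here\n")

    installed = install_dag(dag_file, tmp / "airflow")
    installed_ok = installed.exists()
    print(installed)
```

For a real installation, the second argument would be the Airflow home directory, presumably `/home/project/airflow` given the paths used in the tasks above.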