
# Airflow and Composer

#### Airflow Summit 2022: https://www.crowdcast.io/e/airflowsummit2022/


### Airflow Ref
1. Airflow official doc: https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
2. Google doc: https://cloud.google.com/composer/docs/composer-2/trigger-dags

## Airflow Components
1. **A scheduler,** which handles both triggering scheduled workflows, and submitting Tasks to the executor to run.

2. **An executor,** which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors actually push task execution out to workers.

3. **A webserver,** which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.

4. **A folder of DAG files,** read by the scheduler and executor (and any workers the executor has)

5. **A metadata database,** used by the scheduler, executor and webserver to store state.



## DAG
> Directed Acyclic Graph
>
1. DAG Ref:
  1. Airflow official doc: https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html

### Dag Run
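1. Dag Run Status (determined when the run finishes, based on its leaf tasks)
  1. success: all leaf tasks ended in `success` or `skipped`
  2. failed: at least one leaf task ended in `failed` or `upstream_failed`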

2. DAG argument:
  1. **schedule_interval**
        1. It can be a cron expression (as a `str`), a cron preset such as `@daily`, or a `datetime.timedelta`, for example `datetime.timedelta(minutes=3)`.

        2. **Data Interval**:
          1. The time range a DAG run covers. If a DAG is scheduled **@daily**, each data interval starts at midnight (00:00) of one day and ends at midnight of the next day.
        3. **execution date/logical date**:
          1. Denotes the start of the data interval, not the time when the DAG is actually executed.
  2. **start_date**:
        1. The start of the first data interval, so it is also the logical date of the first DAG run.
        2. The first DAG run is only scheduled once one full interval after start_date has passed.
  3. **catchup:**
        1. By default, the scheduler kicks off a DAG run for every data interval that has not been run since the last data interval (or since start_date). This concept is called catchup.
        2. If catchup=False is set in the DAG arguments, the scheduler creates a DAG run only for the latest interval.

  4. **depends_on_past:**

  5. **trigger_rule:**
  
  6. **default_args:** a dictionary of default arguments applied to every task. For example, "owner" can be set once here instead of on each task.
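
The relationship between `start_date`, the data interval, and the logical date in the list above can be sketched in plain Python, without Airflow. This is a simplified model for a fixed `timedelta` schedule only; the helper name `nth_run` and the dates are made up for illustration, and the real scheduler also handles cron expressions and time zones.

```python
import datetime as dt


def nth_run(start_date, interval, n):
    """Return (logical_date, data_interval_end) for the n-th run, n starting at 0.

    Simplified model: the logical date is the start of the data interval,
    and the run can only be created once the interval has ended.
    """
    logical_date = start_date + n * interval
    data_interval_end = logical_date + interval
    return logical_date, data_interval_end


start = dt.datetime(2022, 1, 1)  # hypothetical start_date
daily = dt.timedelta(days=1)     # behaves like "@daily" for a midnight start_date

logical_date, earliest_run_time = nth_run(start, daily, 0)
print(logical_date)       # 2022-01-01 00:00:00
print(earliest_run_time)  # 2022-01-02 00:00:00
```

The first run carries the logical date 2022-01-01 but cannot start before 2022-01-02, which is the "one interval after start_date" rule.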


#### Sample DAG code

In [None]:
# from airflow import DAG
from airflow.models.dag import DAG  # DAG object used to define the DAG
from airflow.operators.empty import EmptyOperator  # no-op placeholder task (Airflow 2.3+)

import datetime
import pendulum

dag = DAG(
    "tutorial",  # dag_id: unique identifier of your DAG
    default_args={
        "owner": "Airflow",
        "depends_on_past": True,
        "retries": 1,
        "retry_delay": datetime.timedelta(minutes=3),  # e.g. datetime.timedelta(days=1)
    },
    start_date=pendulum.datetime(2015, 12, 1, tz="UTC"),
    description="a simple tutorial dag",
    schedule_interval="@daily",
    catchup=False,
)

t1 = EmptyOperator(task_id="task", dag=dag)

In the example above, if the DAG is picked up by the scheduler daemon (or triggered from the command line) on 2016-01-02 at 6 AM, a single DAG Run will be created with a data interval between 2016-01-01 and 2016-01-02. The next one will be created just after midnight on the morning of 2016-01-03, with a data interval between 2016-01-02 and 2016-01-03.
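
The scenario above can be reproduced with a small simulation. This is a sketch under simplifying assumptions (a fixed one-day interval and naive datetimes); `runs_to_create` is a hypothetical helper, not part of Airflow.

```python
import datetime as dt


def runs_to_create(start_date, interval, now, catchup):
    """List the (interval_start, interval_end) pairs that have fully elapsed by `now`.

    With catchup=True every elapsed interval gets a run; with catchup=False
    only the most recent elapsed interval does.
    """
    intervals = []
    begin = start_date
    while begin + interval <= now:
        intervals.append((begin, begin + interval))
        begin += interval
    return intervals if catchup else intervals[-1:]


start = dt.datetime(2015, 12, 1)
picked_up = dt.datetime(2016, 1, 2, 6, 0)  # scheduler sees the DAG at 6 AM
one_day = dt.timedelta(days=1)

latest_only = runs_to_create(start, one_day, picked_up, catchup=False)
backfilled = runs_to_create(start, one_day, picked_up, catchup=True)

print(latest_only[0][0].date(), "->", latest_only[0][1].date())  # 2016-01-01 -> 2016-01-02
print(len(backfilled))  # 32
```

With `catchup=True`, the same DAG would instead get one run per day from 2015-12-01 through 2016-01-01.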

In [None]:
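from airflow import DAG
from airflow.operators.bash import BashOperator
import datetime


with DAG(
    "sample_dag",
    description="sample dag for learning",
    default_args={
        "owner": "Airflow",
    },
    # Use a fixed start_date: a moving value such as datetime.datetime.now()
    # keeps shifting the first interval, so the DAG may never be scheduled.
    start_date=datetime.datetime(2022, 1, 1),  # placeholder date
    schedule_interval=datetime.timedelta(days=1),
    catchup=False,  # do not backfill the intervals since the placeholder start_date
) as dag:

    t1 = BashOperator(
        task_id="print_date",
        bash_command="date",
    )

    t2 = BashOperator(
        task_id="sleep",
        depends_on_past=False,
        bash_command="sleep 5",
        retries=3,  # a task-level value overrides default_args
    )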
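
The `sleep` task above passes `retries=3` directly, which takes precedence over anything in `default_args`. The precedence can be sketched as a dictionary merge; `effective_args` is a hypothetical helper and the values are illustrative, not Airflow's internal code.

```python
def effective_args(default_args, task_kwargs):
    """Merge DAG-level defaults with task-level arguments; task-level values win."""
    return {**default_args, **task_kwargs}


default_args = {"owner": "Airflow", "retries": 1}  # illustrative defaults

print(effective_args(default_args, {"task_id": "print_date"}))
# {'owner': 'Airflow', 'retries': 1, 'task_id': 'print_date'}
print(effective_args(default_args, {"task_id": "sleep", "retries": 3}))
# {'owner': 'Airflow', 'retries': 3, 'task_id': 'sleep'}
```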

The actual tasks defined here will run in a different context from the context of this script. Different tasks run on different workers at different points in time, which means that this script cannot be used to cross communicate between tasks. Note that for this purpose we have a more advanced feature called **XComs**.

## XComs
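
A minimal sketch of how two tasks might exchange a small value through XComs. The callables use `ti.xcom_push` and `ti.xcom_pull`, which are methods of Airflow's task instance. The `FakeTaskInstance` class is a hypothetical in-memory stand-in so the callables can be tried without a running Airflow, and the task ids and key are made up.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for an Airflow task instance, for local experiments only."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the metadata database

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key="return_value"):
        return self._store.get((task_ids, key))


def produce(ti, **_):
    # In Airflow this would be the python_callable of a PythonOperator.
    ti.xcom_push(key="row_count", value=42)


def consume(ti, **_):
    # Pull the value pushed by the task whose task_id is "produce".
    return ti.xcom_pull(task_ids="produce", key="row_count")


store = {}
produce(FakeTaskInstance("produce", store))
result = consume(FakeTaskInstance("consume", store))
print(result)  # 42
```

In a real DAG, both functions would be wrapped in operators and Airflow would supply `ti` from the task context.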

## Templating with Jinja
#### Ref:
1. Airflow template ref: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#templates-ref
2. Jinja: https://jinja.palletsprojects.com/en/latest/api/#custom-filters 

### Uses:
1. Jinja templates are rendered only in an operator's templated fields (`template_fields`).
2. Python callables can also receive templated values, for example through `PythonOperator`'s `op_kwargs`.
3. Use values passed in `dag_run.conf` inside a DAG:
```python
var = "{{ dag_run.conf['key'] }}"
```

In [None]:
from textwrap import dedent  # removes common leading whitespace from the template

templated_command = dedent(
    """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
{% endfor %}
"""
)

t3 = BashOperator(
    task_id='templated',
    depends_on_past=False,
    bash_command=templated_command,
)
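
To see what the template above produces, the `ds_add` macro can be mimicked in plain Python. `ds` is the logical date as a `YYYY-MM-DD` string. `ds_add_sketch` is a hypothetical re-implementation written for illustration, not Airflow's own code, and the sample date is made up.

```python
from datetime import datetime, timedelta


def ds_add_sketch(ds, days):
    """Add (or subtract, if negative) a number of days to a YYYY-MM-DD string."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2016-01-01"  # sample logical date
# Each of the 5 loop iterations in the template echoes these two values.
print(ds)                    # 2016-01-01
print(ds_add_sketch(ds, 7))  # 2016-01-08
```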

In [None]:
from airflow.operators.python import PythonOperator


def run_this_func(**context):
    """
    Print the value stored under "key" in the DagRun conf attribute.
    :param context: The execution context
    :type context: dict
    """
    print("context", context)
    conf = context["dag_run"].conf or {}  # conf may be None or empty for runs started without a payload
    print("Remotely received value of {} for key=key".format(conf.get("key")))

# PythonOperator usage (Airflow 2 passes the context automatically, so provide_context is not needed)
run_this = PythonOperator(task_id="run_this", python_callable=run_this_func, dag=dag)


In [None]:
#BashOperator usage
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: \'{{ dag_run.conf["key"] if dag_run else "" }}\'"',
    dag=dag
)

In [None]:
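# SparkSubmitOperator usage (requires the apache-airflow-providers-apache-spark package)
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id="task_id",
    conn_id=spark_conn_id,  # id of an existing Spark connection, assumed to be defined earlier
    name="task_name",
    application="example.py",
    application_args=[
        '--key', '\'{{ dag_run.conf["key"] if dag_run else "" }}\''
    ],
    num_executors=10,
    executor_cores=5,
    executor_memory='30G',
    # driver_memory='2G',
    conf={'spark.yarn.maxAppAttempts': 1},
    dag=dag,
)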

## Sensors
##### ref: https://www.youtube.com/watch?v=fgm3BZ3Ubnw

In [None]:
from airflow.sensors.filesystem import FileSensor

waiting_for_file = FileSensor(
    task_id="waiting_for_file",
    filepath="path/to/file",  # placeholder: the file to wait for, relative to the fs_conn_id base path
    poke_interval=30,  # the sensor checks for the file every 30 seconds
    timeout=60 * 5,  # 5-minute timeout; best practice, otherwise the task can keep waiting for a file that never arrives
    mode="reschedule",  # default is "poke"; "reschedule" releases the worker between checks so other tasks can use it
    soft_fail=True,  # default is False; if True, the task is skipped instead of failed when the timeout is exceeded
)
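
The interplay of `poke_interval`, `timeout`, and `soft_fail` can be sketched as a simple polling loop. This is a simplified model of a sensor in "poke" mode, not Airflow's implementation; `wait_for`, its return strings, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration.

```python
import time


def wait_for(condition, poke_interval, timeout, soft_fail=False,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout` seconds have passed."""
    started = clock()
    while not condition():
        if clock() - started > timeout:
            # On timeout, soft_fail turns a failure into a skip.
            return "skipped" if soft_fail else "failed"
        sleep(poke_interval)
    return "success"


# A fake clock lets the example finish instantly instead of really sleeping.
now = [0.0]

def fake_clock():
    return now[0]

def fake_sleep(seconds):
    now[0] += seconds

checks = iter([False, False, True])  # the "file" appears on the third check
print(wait_for(lambda: next(checks), 30, 300, clock=fake_clock, sleep=fake_sleep))  # success
print(wait_for(lambda: False, 30, 300, soft_fail=True, clock=fake_clock, sleep=fake_sleep))  # skipped
```

In "reschedule" mode the waiting between checks would not hold a worker slot, which this loop does not model.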

### ExternalTaskSensor
##### Ref: https://www.youtube.com/watch?v=Oemg-3aiAiI

## Group tasks inside DAGs