### Airflow

 - Data engineering: Taking any action involving data and turning it into a reliable, repeatable and maintainable process
 - Workflow: A set of steps to accomplish a given data engineering task e.g. copying, downloading files, filtering information, writing to a database
 - Varying levels of complexity for a workflow (from 2-3 steps to 100s)
 - Airflow is a platform to program workflows, including the creation, scheduling and monitoring of tasks. It can handle complex data engineering pipelines in production
 - Airflow adds scheduling, error handling, and reporting to workflows
 - It implements workflows as DAGs (Directed Acyclic Graphs), each of which is a set of tasks and the dependencies between them
 - It is accessed via code, the command line or the web interface
 - Other workflow tools include Luigi, SSIS or Bash scripting

Airflow run command
 - airflow run dag_id task_id start_date
 - airflow run etl_pipeline download_file 2020-01-08

Airflow help
 - " airflow -h " obtains further information about any Airflow command
 - " airflow list_dags " shows a list of the available DAGs

Airflow port
 - " airflow webserver -p PORT " starts the web server on PORT
 - " airflow webserver -p 9880 " starts the web server on port 9880

#### DAG

 - Directed: an inherent flow representing dependencies between components 
 - These dependencies, even implicit ones, provide context on how to order the running of components
 - Acyclic: does not loop or repeat; the individual components are executed only once per run
 - Graph: represents the components and the relationships between them
 - In Airflow, DAGs are written in Python but can use components written in other languages
 - DAGs are made up of components (typically tasks) to be executed, such as operators and sensors
 - Dependencies are defined either explicitly or implicitly so that Airflow knows which components should be run at what point within a workflow
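
To make the "directed" and "acyclic" properties above concrete, here is a minimal sketch that uses Python's standard `graphlib` module (Python 3.9+) rather than Airflow itself. The task names are hypothetical. It derives a run order from the dependencies and shows that adding a loop leaves no valid order.

```python
from graphlib import TopologicalSorter, CycleError

# Map each task to the set of tasks it depends on (its upstream tasks).
# The task names are hypothetical.
dependencies = {
    "download_file": set(),
    "filter_rows": {"download_file"},
    "write_to_db": {"filter_rows"},
    "send_report": {"write_to_db"},
}


def run_order(deps):
    """Return one valid execution order for the given dependency graph."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)

# Making the first task depend on the last one creates a cycle,
# so no valid ordering exists and the graph is no longer a DAG.
cyclic = dict(dependencies, download_file={"send_report"})
try:
    run_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because this graph is a simple chain, only one ordering is possible. With parallel branches, several orderings would be equally valid.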

<img src="assets/airflow/command_line_vs_python.png" style="width: 600px;"/>

#### Simple DAG examples

In [2]:
# Import the DAG object and datetime
from datetime import datetime
from airflow.models import DAG

# Define the default_args dictionary
default_args = {
  'owner': 'dsmith',
  'start_date': datetime(2020, 1, 14),
  'retries': 2,
  'email': 'bla@blabla.com'
}

# Instantiate the DAG object
etl_dag = DAG(dag_id='example_etl', default_args=default_args)

In [None]:
from airflow.models import DAG

default_args = {
  'owner': 'jdoe',
  'email': 'jdoe@datacamp.com'
}
dag = DAG( 'refresh_data', default_args=default_args )

In [None]:
from airflow.models import DAG
default_args = {
  'owner': 'jdoe',
  'start_date': '2019-01-01'
}
dag = DAG( dag_id="etl_update", default_args=default_args )

#### Airflow web interface
 - A web interface that should make it easier to schedule tasks, review processes and correct issues
 - The Tree View lists the tasks and any ordering between them in a tree structure, with the ability to compress / expand the nodes.
 - The Graph View shows any tasks and their dependencies in a graph structure, along with the ability to access further details about task runs.
 - The Code view provides full access to the Python code that makes up the DAG.

#### Operators
 - Represent a single task in a workflow
 - Run independently (usually), meaning that all resources needed to complete the task are contained within the operator
 - Generally do not share information (to simplify the workflows)
 - There are various operators to perform different tasks
 - For instance, DummyOperator(task_id='example', dag=dag) can be used to represent a task for troubleshooting or a task that has not yet been implemented
 - The BashOperator executes a given Bash command or script. It requires three arguments and is defined as BashOperator(task_id='example', bash_command='script.sh', dag=dag)
 - Can specify environment variables for the command
 - The BashOperator allows you to specify any given Shell command or script and add it to an Airflow workflow. This can be a great start to implementing Airflow in your environment

Operator "gotchas":
 - Not guaranteed to run in the same location (directory)
 - May require extensive use of env variables
 - Can be difficult to run tasks with elevated privileges (different user access)

In [None]:
# Import the BashOperator
from airflow.operators.bash_operator import BashOperator

# Define the BashOperator 
cleanup = BashOperator(
    task_id='cleanup_task',
    # Define the bash_command
    bash_command='cleanup.sh',
    # Add the task to the dag
    dag=analytics_dag)

#### Multiple BashOperators

Airflow DAGs can contain many operators, each performing their defined tasks.

In [None]:
# Define a second operator to run the `consolidate_data.sh` script
consolidate = BashOperator(
    task_id='consolidate_task',
    bash_command='consolidate_data.sh',
    dag=analytics_dag)

# Define a final operator to execute the `push_data.sh` script
push_data = BashOperator(
    task_id='pushdata_task',
    bash_command='push_data.sh',
    dag=analytics_dag)

#### Tasks
 - Instances of operators
 - Assigned to a variable in Python
 - Within Airflow tasks are defined by their task id not the variable name
 
 
 - Tasks are either upstream or downstream
  - Upstream tasks are those that must be completed prior to any downstream tasks
 - Dependencies can be defined using the bitshift operators
  - ">>" is the upstream operator
  - "<<" is the downstream operator
 - Upstream means "before", downstream means "after". Upstream tasks must be completed before downstream tasks
 
 
 - Multiple dependencies can be set like this:
  - task1 >> task2 >> task3 >> task4
  
  
 - Task dependencies in the Airflow UI:
 <img src="assets/airflow/task_dependencies.png" style="width: 600px;"/>
 
 
  - Chained and mixed-dependencies:
 <img src="assets/airflow/chained_dependencies.png" style="width: 600px;"/>

In [None]:
# Define the tasks
task1 = BashOperator(task_id='first_task',
                     bash_command='echo 1',
                     dag=example_dag)
task2 = BashOperator(task_id='second_task',
                     bash_command='echo 2',
                     dag=example_dag)
# Set first_task to run before second_task
task1 >> task2   # or task2 << task1
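
The bitshift syntax above works because Python lets a class define what `>>` and `<<` mean. The following is a minimal sketch with a toy `Task` class (a hypothetical stand-in, not Airflow's real operator class) that records upstream and downstream task ids in the same spirit.

```python
class Task:
    """Toy stand-in for an Airflow operator; not the real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def __rshift__(self, other):
        # self >> other: self must finish before other starts.
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)
        return other  # returning the right-hand side allows chaining

    def __lshift__(self, other):
        # self << other: other must finish before self starts.
        other >> self
        return other


t1, t2, t3 = Task("first_task"), Task("second_task"), Task("third_task")
t1 >> t2 >> t3  # a chain: first_task, then second_task, then third_task

a, b = Task("a"), Task("b")
a << b  # b runs before a

print(t2.upstream, t2.downstream)
print(a.upstream)
```

Returning the right-hand operand from `__rshift__` is what makes a chain such as `task1 >> task2 >> task3` read left to right.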

 - Define a BashOperator called pull_sales with a bash command of wget https://salestracking/latestinfo?json.
 - Set the pull_sales operator to run before the cleanup task.
 - Configure consolidate to run next, using the downstream operator.
 - Set push_data to run last using either bitshift operator.

In [None]:
# Define a new pull_sales task
pull_sales = BashOperator(
    task_id='pullsales_task',
    bash_command='wget https://salestracking/latestinfo?json',
    dag=analytics_dag
)

# Set pull_sales to run prior to cleanup
pull_sales >> cleanup

# Configure consolidate to run after cleanup
consolidate << cleanup

# Set push_data to run last
consolidate >> push_data

#### PythonOperator
 - Executes a Python function / callable
 - Operates similarly to BashOperator with more options
 - Can pass in arguments to the Python code
 - Arguments can be positional or keyword
 - Use the op_kwargs dictionary to pass keyword arguments

In [None]:
from airflow.operators.python_operator import PythonOperator

def printme():
    print("This goes in the logs!")
python_task = PythonOperator(
    task_id='simple_print',
    python_callable=printme,
    dag=example_dag
)

In [None]:
import time

def sleep(length_of_time):
    time.sleep(length_of_time)

sleep_task = PythonOperator(
    task_id='sleep',
    python_callable=sleep,
    # Keyword arguments passed to the callable
    op_kwargs={'length_of_time': 5},
    dag=example_dag
)
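
Conceptually, the operator stores the callable together with its arguments and calls it later, when the task runs. Below is a minimal sketch of that idea in plain Python. `run_callable` and `describe` are hypothetical helpers for illustration, not part of Airflow.

```python
def run_callable(python_callable, op_args=None, op_kwargs=None):
    """Call the function with the stored positional and keyword arguments."""
    return python_callable(*(op_args or []), **(op_kwargs or {}))


def describe(name, length_of_time=1):
    return f"{name} sleeps for {length_of_time}s"


result = run_callable(
    describe,
    op_args=["sleep_task"],             # positional arguments
    op_kwargs={"length_of_time": 5},    # keyword arguments
)
print(result)
```

The keys of the keyword dictionary must match the parameter names of the callable, otherwise Python raises a `TypeError` when the function is called.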

#### Additional operators
 - Found in airflow.operators or airflow.contrib.operators libraries

In [None]:
from airflow.operators.email_operator import EmailOperator

email_task = EmailOperator(
    task_id='email_sales_report',
    to='sales_manager@example.com',
    subject='Automated Sales Report',
    html_content='Attached is the latest sales report',
    files=['latest_sales.xlsx'],
    dag=example_dag
)

#### Example with PythonOperator and EmailOperator

In [None]:
# Example
import requests

def pull_file(URL, savepath):
    r = requests.get(URL)
    with open(savepath, 'wb') as f:
        f.write(r.content)   
    # Use the print method for logging
    print(f"File pulled from {URL} and saved to {savepath}")


    
from airflow.operators.python_operator import PythonOperator

# Create the task
pull_file_task = PythonOperator(
    task_id='pull_file',
    # Add the callable
    python_callable=pull_file,
    # Define the arguments
    op_kwargs={'URL':'http://dataserver/sales.json', 'savepath':'latestsales.json'},
    dag=process_sales_dag
)


# Add another Python task
parse_file_task = PythonOperator(
    task_id='parse_file',
    # Set the function to call
    python_callable=parse_file,
    # Add the arguments
    op_kwargs={'inputfile':'latestsales.json', 'outputfile':'parsedfile.json'},
    # Add the DAG
    dag=process_sales_dag
)
    

In [None]:
# Import the Operator
from airflow.operators.email_operator import EmailOperator

# Define the task
email_manager_task = EmailOperator(
    task_id='email_manager',
    to='manager@datacamp.com',
    subject='Latest sales JSON',
    html_content='Attached is the latest sales JSON file as requested.',
    files=['parsedfile.json'],
    dag=process_sales_dag
)

# Set the order of tasks
pull_file_task >> parse_file_task >> email_manager_task

#### DAG Runs

 - A specific instance of a workflow at a point in time
 - Can be run manually or via schedule_interval
 - Maintain state for each workflow and the tasks within
     - running, failed, success
 - In the web interface you can find them under "Browse -> DAG Runs"
 
 - Scheduling details:
     - start_date: datetime Python object for initial schedule of the DAG run
     - end_date: optional attribute to stop running new DAG instances
     - max_tries: optional attribute for how many attempts to make
     - schedule_interval: how often to run / schedule the DAG for execution, it occurs between the start_date and end_date
     - scheduler presets: @once, @hourly, @daily, @weekly or you can use cron format

Example:
 - Set the start date of the DAG to November 1, 2019.
 - Configure the retry_delay to 20 minutes. You will learn more about the timedelta object in Chapter 3. For now, you just need to know it expects an integer value.
 - Use the cron syntax to configure a schedule of every Wednesday at 12:30pm.

In [None]:
# Update the scheduling arguments as defined
from datetime import datetime, timedelta
from airflow.models import DAG

default_args = {
  'owner': 'Engineering',
  'start_date': datetime(2019, 11, 1),
  'email': ['airflowresults@datacamp.com'],
  'email_on_failure': False,
  'email_on_retry': False,
  'retries': 3,
  'retry_delay': timedelta(minutes=20)
}

dag = DAG('update_dataflows',
          default_args=default_args,
          schedule_interval='30 12 * * 3')
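
To check what the cron string `30 12 * * 3` means, the sketch below computes the matching dates with the standard library only. `next_weekly_run` is a hypothetical helper and handles only this weekly pattern, not cron syntax in general.

```python
from datetime import datetime, timedelta


def next_weekly_run(after, weekday, hour, minute):
    """Return the first datetime strictly after `after` that falls on the
    given weekday (Monday=0 ... Sunday=6) at hour:minute."""
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Move forward to the requested weekday.
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)
    return candidate


# Cron '30 12 * * 3': minute 30, hour 12, any day of month, any month, Wednesday.
# Cron counts Sunday as 0, so cron weekday 3 corresponds to Python weekday 2.
start_date = datetime(2019, 11, 1)
first = next_weekly_run(start_date, weekday=2, hour=12, minute=30)
second = next_weekly_run(first, weekday=2, hour=12, minute=30)
print(first, second)
```

These are the schedule points only. Airflow 1.x launches a run once the interval it covers has ended, so the first actual run happens one interval later than the first schedule point.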

#### Full Example
 - Note that this will not be triggered before a month has passed after the start_date of Feb 15, 2020.

In [None]:
import requests
import json
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator


default_args = {
    'owner':'sales_eng',
    'start_date': datetime(2020, 2, 15),
}

process_sales_dag = DAG(dag_id='process_sales', default_args=default_args, schedule_interval='@monthly')


def pull_file(URL, savepath):
    r = requests.get(URL)
    with open(savepath, 'wb') as f:
        f.write(r.content)
    print(f"File pulled from {URL} and saved to {savepath}")
    

pull_file_task = PythonOperator(
    task_id='pull_file',
    # Add the callable
    python_callable=pull_file,
    # Define the arguments
    op_kwargs={'URL':'http://dataserver/sales.json', 'savepath':'latestsales.json'},
    dag=process_sales_dag
)

def parse_file(inputfile, outputfile):
    with open(inputfile) as infile:
        data=json.load(infile)
        with open(outputfile, 'w') as outfile:
            json.dump(data, outfile)
        
parse_file_task = PythonOperator(
    task_id='parse_file',
    # Set the function to call
    python_callable=parse_file,
    # Add the arguments
    op_kwargs={'inputfile':'latestsales.json', 'outputfile':'parsedfile.json'},
    # Add the DAG
    dag=process_sales_dag
)

email_manager_task = EmailOperator(
    task_id='email_manager',
    to='manager@datacamp.com',
    subject='Latest sales JSON',
    html_content='Attached is the latest sales JSON file as requested.',
    files=['parsedfile.json'],
    dag=process_sales_dag
)

pull_file_task >> parse_file_task >> email_manager_task

#### Sensors
 - Use sensors when:
     - uncertain when a condition will be true
     - you don't want to fail the entire DAG immediately but want to continue checking if a condition has been met
     - you want to add task repetition without loops
     
     
 - Sensors are special operators that wait for a certain condition to be true
 - Conditions can include the creation of a file, upload of a database record, certain response from a web request
 - Can define how often to check for the condition to be true
     - mode determines how to check for the condition
     - mode = 'poke' means it checks repeatedly
     - mode = 'reschedule' means it gives up its task slot and tries again later
 - Are assigned to tasks like normal operators
 - Derived from " airflow.sensors.base_sensor_operator "
 - poke_interval refers to how long to wait between checks
 - timeout refers to how long to wait before failing task (timeout must be significantly shorter than the schedule interval)
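
The interplay of poke_interval and timeout can be sketched as a plain polling loop. This is an illustration of the 'poke' idea only, not Airflow's sensor implementation. `wait_for` and the fake condition are hypothetical.

```python
import time


def wait_for(condition, poke_interval, timeout, sleep=time.sleep, clock=time.monotonic):
    """Poll `condition` every `poke_interval` seconds until it returns True.

    Returns the number of checks made; raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    started = clock()
    checks = 0
    while True:
        checks += 1
        if condition():
            return checks
        if clock() - started >= timeout:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        sleep(poke_interval)


# Hypothetical condition that becomes true on the third check.
attempts = iter([False, False, True])
checks_needed = wait_for(lambda: next(attempts), poke_interval=0.01, timeout=5)
print(checks_needed)
```

In 'poke' mode the loop keeps its worker slot busy while it sleeps, which is the cost that 'reschedule' mode avoids.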

#### Useful sensors

File sensor:
 - checks for the existence of a file at a certain location
 - can check if any files exist within a directory

In [4]:
from airflow.contrib.sensors.file_sensor import FileSensor

file_sensor_task = FileSensor(task_id='file_sense',
                              filepath='salesdata.csv',
                              poke_interval=300,
                              dag=sales_report_dag)

init_sales_cleanup >> file_sensor_task >> generate_report

ExternalTaskSensor
 - waits for a task in another DAG to complete

HttpSensor
 - makes requests to a web URL and checks for content

SqlSensor
 - runs a SQL query to check for content

Find more sensor operators under
 - airflow.sensors
 - airflow.contrib.sensors


#### Full sensor example
 - Note that this DAG is waiting for the file salesdata_ready.csv to be present before it can start

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG(
   dag_id = 'update_state',
   default_args={"start_date": "2019-10-01"}
)

precheck = FileSensor(
   task_id='check_for_datafile',
   filepath='salesdata_ready.csv',
   dag=dag)

part1 = BashOperator(
   task_id='generate_random_number',
   bash_command='echo $RANDOM',
   dag=dag
)

import sys
def python_version():
    return sys.version

part2 = PythonOperator(
   task_id='get_python_version',
   python_callable=python_version,
   dag=dag)
   
part3 = SimpleHttpOperator(
   task_id='query_server_for_external_ip',
   endpoint='https://api.ipify.org',
   method='GET',
   dag=dag)
   
precheck >> part3 >> part2

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime

report_dag = DAG(
    dag_id = 'execute_report',
    schedule_interval = "0 0 * * *"
)

precheck = FileSensor(
    task_id='check_for_datafile',
    filepath='salesdata_ready.csv',
    start_date=datetime(2020,2,20),
    mode='reschedule',
    dag=report_dag
)

generate_report_task = BashOperator(
    task_id='generate_report',
    bash_command='generate_report.sh',
    start_date=datetime(2020,2,20),
    dag=report_dag
)

precheck >> generate_report_task

#### Airflow executors
 - An executor is the component that runs the tasks defined in a workflow
 - Different executors handle running the tasks differently, some may run a single task at a time on a local system, while others might split individual tasks among all the systems in a cluster
 - The number of tasks an executor can run at once is often referred to as the number of worker slots available
 - Example executors:
   - SequentialExecutor
   - LocalExecutor
   - CeleryExecutor

SequentialExecutor:
 - default Airflow executor
 - runs one task at a time
 - useful for debugging
 - not recommended for production due to its limitations of task resources

LocalExecutor:
 - runs on a single system
 - treats each task as a process on the local system and can start as many concurrent tasks as desired, requested and permitted by the system resources (CPU cores, memory etc)
 - concurrency allows for parallelism as defined by the user either unlimited or limited to a certain number of simultaneous tasks
 - can utilise all the resources of a given host system
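
The idea of worker slots can be sketched with a thread pool from the standard library. Threads are used here as a simple stand-in for the local processes described above, and the task names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_ids, slots):
    """Run every task with at most `slots` workers.

    Returns the finished task ids and the peak number of tasks running at once.
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def task(task_id):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.05)  # stand-in for real work
        with lock:
            state["running"] -= 1
        return task_id

    with ThreadPoolExecutor(max_workers=slots) as pool:
        finished = list(pool.map(task, task_ids))
    return finished, state["peak"]


task_ids = ["extract", "transform", "load"]  # hypothetical task names
one_slot_result, one_slot_peak = run_tasks(task_ids, slots=1)
three_slot_result, three_slot_peak = run_tasks(task_ids, slots=3)
print(one_slot_peak, three_slot_peak)
```

With one slot the tasks run strictly one after another, as with the SequentialExecutor. With more slots, up to that many tasks can overlap.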

CeleryExecutor
 - uses a Celery backend as task manager
 - Celery is a general queuing system written in Python that allows multiple systems to communicate as a basic cluster
 - using a CeleryExecutor, multiple Airflow systems can be configured as workers for a given set of workflows and tasks
 - is difficult to set up and configure
 - it is a powerful choice, however, when one expects to have a large number of DAGs or expects processing needs to grow

Determining which executor is being used:
 - in the command line use " cat airflow/airflow.cfg | grep 'executor = ' "
 - in the command line use " airflow list_dags " and look for the executor name in the INFO output
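
The same lookup can be done from Python, since airflow.cfg uses an INI-style layout. The sketch below reads a made-up excerpt with the standard `configparser` module. The sample values are assumptions for illustration, and a real file contains many more settings.

```python
from configparser import ConfigParser

# Hypothetical excerpt of an airflow.cfg file.
SAMPLE_CFG = """
[core]
dags_folder = /home/repl/workspace/dags
executor = SequentialExecutor
"""


def read_core_setting(cfg_text, key):
    """Return the value of `key` from the [core] section of the config text."""
    # Interpolation is disabled so that '%' characters in values are kept as-is.
    parser = ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("core", key)


executor = read_core_setting(SAMPLE_CFG, "executor")
dags_folder = read_core_setting(SAMPLE_CFG, "dags_folder")
print(executor, dags_folder)
```

To inspect a real installation, the text of the actual config file would be passed in instead of the sample string.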

#### Troubleshooting

Common issues
 - DAGs won't run on schedule
     - Check if the scheduler is running (the Airflow scheduler handles DAG run and task scheduling, if it is not running no tasks can run)
     - An error that says " the scheduler does not appear to be running " will normally show up
     - Fix by running " airflow scheduler " from the command line
     - Another reason might be that at least one schedule_interval period has not yet passed since the start_date
     - Modify the start_date or schedule_interval accordingly to meet your requirements
     - Lastly, it might be that there are not enough free task slots within the executor to run the tasks
     - If so, change the executor type (or add more resources!)
     
     
 - DAGs won't load
     - DAG not in web UI
     - DAG not in " airflow list_dags "
     - Verify DAG files are in the correct folder
     - Determine the DAG folder by examining the airflow.cfg file
     - Use " head airflow/airflow.cfg ", the dags_folder shows the path
     
     <img src="assets/airflow/airflow_cfg.png" style="width: 600px;"/>


 - Syntax errors
     - Most common reason a DAG file won't appear
     - Errors in a DAG file can be difficult to find
     - Run " airflow list_dags " to see the error output, or run " python3 <dag_file.py> " to have the Python interpreter check the file
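
A syntax check can also be scripted. The sketch below uses the standard `ast` module to parse source text without executing it. `find_syntax_error` is a hypothetical helper, and the broken snippet is made up for illustration.

```python
import ast


def find_syntax_error(source, filename="dag_file.py"):
    """Return 'file:line: message' for the first syntax error, or None if the source parses."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as err:
        return f"{filename}:{err.lineno}: {err.msg}"
    return None


# A missing comma after 'retries': 2 is the kind of slip that keeps a DAG from loading.
broken = """default_args = {
  'owner': 'dsmith',
  'retries': 2
  'email': 'bla@blabla.com'
}
"""
fixed = broken.replace("'retries': 2\n", "'retries': 2,\n")

print(find_syntax_error(broken))
print(find_syntax_error(fixed))
```

Parsing only catches syntax problems. Errors such as a misspelled import or an undefined variable still need the file to be run with python3.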

In [None]:
# In the command line run:
airflow list_dags

# you'll find out what is the dags_folder
cd /home/repl/workspace/dags/

# open the dag to figure out what is going on
nano dag_file.py

#### Missing DAG

Your manager calls you before you're about to leave for the evening and wants to know why a new DAG workflow she's created isn't showing up in the system. She needs this DAG called execute_report to appear in the system so she can properly schedule it for some tests before she leaves on a trip.

 - Airflow is configured using the ~/airflow/airflow.cfg file.

 - Examine the DAG for any errors and fix those.
 - Determine if the DAG has loaded after fixing the errors.
 - If not, determine why the DAG has not loaded and fix the final issue.

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

sample_dag = DAG(
    dag_id = 'sample_dag',
    schedule_interval = "0 0 * * *"
)

sample_task = BashOperator(
    task_id='sample',
    bash_command='generate_sample.sh',
    start_date=datetime(2020,2,20),
    dag=sample_dag
)


from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime

report_dag = DAG(
    dag_id = 'execute_report',
    schedule_interval = "0 0 * * *"
)

precheck = FileSensor(
    task_id='check_for_datafile',
    filepath='salesdata_ready.csv',
    start_date=datetime(2020,2,20),
    mode='poke',
    dag=report_dag)

generate_report_task = BashOperator(
    task_id='generate_report',
    bash_command='generate_report.sh',
    start_date=datetime(2020,2,20),
    dag=report_dag
)

precheck >> generate_report_task


#### SLAs
 - SLA = service level agreement
 - Within Airflow, this is the amount of time a task or a DAG should require to run
 - An SLA miss is any time the task or the DAG does not meet the expected timing
 - If an SLA is missed, an email is sent out and a log is stored
 - SLA misses can be viewed in the web UI (Browse -> SLA Misses)
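
Following the definition above, an SLA check boils down to comparing elapsed time with an allowed timedelta. This is a simplified sketch with a hypothetical `missed_sla` helper. Airflow's own bookkeeping is more involved and measures against the scheduled run rather than a stopwatch.

```python
from datetime import datetime, timedelta


def missed_sla(started, finished, sla):
    """Return True when the elapsed run time exceeds the allowed SLA."""
    return (finished - started) > sla


sla = timedelta(minutes=20)
started = datetime(2020, 2, 20, 0, 0)

quick_run_missed = missed_sla(started, datetime(2020, 2, 20, 0, 15), sla)
slow_run_missed = missed_sla(started, datetime(2020, 2, 20, 0, 45), sla)
print(quick_run_missed, slow_run_missed)
```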

In [None]:
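# How to define an SLA
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

# Option 1: set the SLA for every task in the DAG through default_args
default_args = {
 'sla': timedelta(minutes=20),
 'start_date': datetime(2020,2,20)
}

dag = DAG('sla_dag', default_args=default_args)

# Option 2: set the SLA on an individual task
task1 = BashOperator(task_id='sla_task',
                   bash_command='runcode.sh',
                   sla=timedelta(seconds=30), dag=dag)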

#### SLA example 1

In [None]:
# Import the datetime and timedelta objects
from datetime import datetime, timedelta
from airflow.models import DAG

# Create the dictionary entry
default_args = {
  'start_date': datetime(2020, 2, 20),
  'sla': timedelta(minutes=30)
}

# Add to the DAG (schedule_interval=None means the DAG only runs when triggered)
test_dag = DAG('test_workflow', default_args=default_args, schedule_interval=None)

#### SLA example 2

In [None]:
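# Import the datetime and timedelta objects
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

test_dag = DAG('test_workflow', start_date=datetime(2020,2,20), schedule_interval=None)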

# Create the task with the SLA
task1 = BashOperator(task_id='first_task',
                     sla=timedelta(hours=3),
                     bash_command='initialize_data.sh',
                     dag=test_dag)

#### SLA example 3

In [None]:
# Define the email task
email_report = EmailOperator(
        task_id='email_report',
        to='airflow@datacamp.com',
        subject='Airflow Monthly Report',
        html_content="""Attached is your monthly workflow report - please refer to it for more detail""",
        files=['monthly_report.pdf'],
        dag=report_dag
)

# Set the email task to run after the report is generated
email_report << generate_report

#### SLA example 4

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime

default_args={
    'email': ['airflowalerts@datacamp.com','airflowadmin@datacamp.com'],
    'email_on_failure': True,
    'email_on_success': True,
}

report_dag = DAG(
    dag_id = 'execute_report',
    schedule_interval = "0 0 * * *",
    default_args=default_args
)

precheck = FileSensor(
    task_id='check_for_datafile',
    filepath='salesdata_ready.csv',
    start_date=datetime(2020,2,20),
    mode='reschedule',
    dag=report_dag)

generate_report_task = BashOperator(
    task_id='generate_report',
    bash_command='generate_report.sh',
    start_date=datetime(2020,2,20),
    dag=report_dag
)

precheck >> generate_report_task


### Templates
 - Airflow templates are created using the Jinja template language
 - Allow for substituting information during a DAG run
 - Provide added flexibility when defining tasks

In [None]:
templated_command="""
  echo "Reading {{ params.filename }}"
"""

t1 = BashOperator(task_id='template_task',
       bash_command=templated_command,
       params={'filename': 'file1.txt'},
       dag=example_dag)

# Each task in a DAG needs its own task_id
t2 = BashOperator(task_id='template_task2',
       bash_command=templated_command,
       params={'filename': 'file2.txt'},
       dag=example_dag)
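
To see what substitution does to a templated command, the sketch below fills in `{{ ... }}` placeholders with a small regular expression. This is a deliberately tiny stand-in for Jinja (no loops, filters or macros), and `render` is a hypothetical helper.

```python
import re


def render(template, context):
    """Replace each {{ dotted.name }} placeholder with its value from `context`."""

    def lookup(match):
        value = context
        # Walk the dotted path, e.g. params.filename -> context['params']['filename']
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


templated_command = 'echo "Reading {{ params.filename }}"'
command1 = render(templated_command, {"params": {"filename": "file1.txt"}})
command2 = render(templated_command, {"params": {"filename": "file2.txt"}})
print(command1)
print(command2)
```

One template string serves both tasks, and only the params dictionary differs. That is the flexibility templates add.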

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {
  'start_date': datetime(2020, 4, 15),
}

cleandata_dag = DAG('cleandata',
                    default_args=default_args,
                    schedule_interval='@daily')

# Create a templated command to execute
# 'bash cleandata.sh datestring'
templated_command = """ bash cleandata.sh {{ ds_nodash }} """


# Modify clean_task to use the templated command
clean_task = BashOperator(task_id='cleandata_task',
                          bash_command=templated_command,
                          dag=cleandata_dag)

 - Modify the templated command to handle a second argument called filename
 - Change the first BashOperator to pass the filename salesdata.txt to the command
 - Add a new BashOperator called clean_task2 to use a second filename supportdata.txt
 - Set clean_task2 downstream of clean_task

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

default_args = {
  'start_date': datetime(2020, 4, 15),
}

cleandata_dag = DAG('cleandata',
                    default_args=default_args,
                    schedule_interval='@daily')

# Modify the templated command to handle a
# second argument called filename.
templated_command = """
  bash cleandata.sh {{ ds_nodash }} {{params.filename}}
"""

# Modify clean_task to pass the new argument
clean_task = BashOperator(task_id='cleandata_task',
                          bash_command=templated_command,
                          params={'filename': 'salesdata.txt'},
                          dag=cleandata_dag)

# Create a new BashOperator clean_task2
clean_task2 = BashOperator(task_id='cleandata_task2',
                           bash_command=templated_command,
                           params={'filename': 'supportdata.txt'},
                           dag=cleandata_dag)
                           
# Set the operator dependencies
clean_task >> clean_task2

#### More complex templates

In [None]:
templated_command="""
{% for filename in params.filenames %}
  echo "Reading {{ filename }}"
{% endfor %}
"""
t1 = BashOperator(task_id='template_task',
       bash_command=templated_command,
       params={'filenames': ['file1.txt', 'file2.txt']},
       dag=example_dag)

In [None]:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

filelist = [f'file{x}.txt' for x in range(30)]

default_args = {
  'start_date': datetime(2020, 4, 15),
}

cleandata_dag = DAG('cleandata',
                    default_args=default_args,
                    schedule_interval='@daily')

# Modify the template to handle multiple files in a 
# single run.
templated_command = """
  {% for filename in params.filenames %}
  bash cleandata.sh {{ ds_nodash }} {{ filename }};
  {% endfor %}
"""

# Modify clean_task to use the templated command
clean_task = BashOperator(task_id='cleandata_task',
                          bash_command=templated_command,
                          params={'filenames': filelist},
                          dag=cleandata_dag)


#### Variables
 - Airflow built-in runtime variables
 - Provides assorted information about DAG runs, tasks and even the system configuration
 - Examples:

In [None]:
# YYYY-MM-DD
Execution Date: {{ ds }}

# YYYYMMDD
Execution Date, no dashes: {{ ds_nodash }}

# YYYY-MM-DD
Previous Execution date: {{ prev_ds }}

# YYYYMMDD
Prev Execution date, no dashes: {{ prev_ds_nodash }}  

DAG object: {{ dag }}

Airflow config object: {{ conf }}
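
The date variables above are plain string formats of the execution date. The sketch below reproduces them with `strftime`. `date_variables` is a hypothetical helper, and it assumes a daily schedule when working out the previous execution date.

```python
from datetime import datetime, timedelta


def date_variables(execution_date, interval=timedelta(days=1)):
    """Build the date strings for one run, assuming a fixed schedule interval."""
    previous = execution_date - interval
    return {
        "ds": execution_date.strftime("%Y-%m-%d"),
        "ds_nodash": execution_date.strftime("%Y%m%d"),
        "prev_ds": previous.strftime("%Y-%m-%d"),
        "prev_ds_nodash": previous.strftime("%Y%m%d"),
    }


variables = date_variables(datetime(2020, 4, 15))
print(variables)
```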

#### Macros

<img src="assets/airflow/macros.png" style="width: 600px;"/>

 - Create a Python string that represents the email content you wish to send. Use the substitutions for the current date string (with dashes) and a variable called username.
 - Create the EmailOperator task using the template string for the html_content.
 - Set the subject field to a macro call using macros.uuid.uuid4(). This simply provides a string of a universally unique identifier as the subject field.
 - Assign the params dictionary as appropriate with the username of testemailuser.

In [None]:
from airflow.models import DAG
from airflow.operators.email_operator import EmailOperator
from datetime import datetime

# Create the string representing the html email content
html_email_str = """
Date: {{ ds }}
Username: {{ params.username }}
"""

email_dag = DAG('template_email_test',
                default_args={'start_date': datetime(2020, 4, 15)},
                schedule_interval='@weekly')
                
email_task = EmailOperator(task_id='email_task',
                           to='testuser@datacamp.com',
                           subject="{{ macros.uuid.uuid4() }}",
                           html_content=html_email_str,
                           params={'username': 'testemailuser'},
                           dag=email_dag)

#### Branching
 - Provides conditional logic
 - Uses BranchPythonOperator
 - from airflow.operators.python_operator import BranchPythonOperator

In [None]:
def branch_test(**kwargs):
    if int(kwargs['ds_nodash']) % 2 == 0:
        return 'even_day_task'
    else:
        return 'odd_day_task'
    
branch_task = BranchPythonOperator(task_id='branch_task',dag=dag,
       provide_context=True,
       python_callable=branch_test)

start_task >> branch_task >> even_day_task >> even_day_task2
branch_task >> odd_day_task >> odd_day_task2
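
Because the branch callable is an ordinary Python function that returns a task_id, its logic can be tried out without Airflow by passing the context values as keyword arguments. The dates below are arbitrary examples.

```python
def branch_test(**kwargs):
    """Return the task_id of the branch to follow."""
    if int(kwargs["ds_nodash"]) % 2 == 0:
        return "even_day_task"
    return "odd_day_task"


even_choice = branch_test(ds_nodash="20200416")
odd_choice = branch_test(ds_nodash="20200415")
print(even_choice, odd_choice)
```

The returned strings have to match the task_id values of tasks placed directly downstream of the branch task.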

#### Example

In [None]:
# Create a function to determine if years are different
def year_check(**kwargs):
    current_year = int(kwargs['ds_nodash'][0:4])
    previous_year = int(kwargs['prev_ds_nodash'][0:4])
    if current_year == previous_year:
        return 'current_year_task'
    else:
        return 'new_year_task'

# Define the BranchPythonOperator
branch_task = BranchPythonOperator(task_id='branch_task', dag=branch_dag,
                                   python_callable=year_check, provide_context=True)
# Define the dependencies
branch_task >> current_year_task
branch_task >> new_year_task

<img src="assets/airflow/running_dags.png" style="width: 600px;"/>
<img src="assets/airflow/operators.png" style="width: 600px;"/>
<img src="assets/airflow/templates.png" style="width: 600px;"/>

#### Pipeline Demo
 - Update the DAG in pipeline.py to import the needed operators.
 - Run the sense_file task from the command line and look for any errors. Use the command airflow test and the appropriate arguments to run the command. For the last argument, use a -1 instead of a specific date.
 - Determine why the sense_file task does not complete and remedy this using the editor.
 - Re-test the sense_file task and verify the problem is fixed.

In [None]:
# Run the following to test the etl_update dag
# repl:~/workspace$ airflow test etl_update sense_file -1

# Fix by creating the missing startprocess.txt file

# Then run pipeline.py

from airflow.models import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

# Import the needed operators
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import date, datetime

def process_data(**context):
    # The with block closes the file even if the write fails
    with open('/home/repl/workspace/processed_data.tmp', 'w') as file:
        file.write(f'Data processed on {date.today()}')

    
dag = DAG(dag_id='etl_update', default_args={'start_date': datetime(2020,4,1)})

sensor = FileSensor(task_id='sense_file', 
                    filepath='/home/repl/workspace/startprocess.txt',
                    poke_interval=5,
                    timeout=15,
                    dag=dag)

bash_task = BashOperator(task_id='cleanup_tempfiles', 
                         bash_command='rm -f /home/repl/*.tmp',
                         dag=dag)

python_task = PythonOperator(task_id='run_processing', 
                             python_callable=process_data,
                             dag=dag)

sensor >> bash_task >> python_task


 - Add an SLA of 90 minutes to the DAG.
 - Update the FileSensor object to check for files every 45 seconds.
 - Modify the python_task to pass Airflow's context variables (such as ds) to the callable by setting the provide_context argument. The callable accepts them as keyword arguments.

In [None]:
from airflow.models import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from dags.process import process_data
from datetime import timedelta, datetime

# Update the default arguments and apply them to the DAG
default_args = {
  'start_date': datetime(2019,1,1),
  'sla':timedelta(minutes=90)
}

dag = DAG(dag_id='etl_update', default_args=default_args)

sensor = FileSensor(task_id='sense_file', 
                    filepath='/home/repl/workspace/startprocess.txt',
                    poke_interval=45,
                    dag=dag)

bash_task = BashOperator(task_id='cleanup_tempfiles', 
                         bash_command='rm -f /home/repl/*.tmp',
                         dag=dag)

python_task = PythonOperator(task_id='run_processing', 
                             python_callable=process_data,
                             provide_context=True,
                             dag=dag)

sensor >> bash_task >> python_task
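
The chained >> expression reads left to right: each task is set upstream of the one that follows. A toy sketch of how such chaining can be built with Python's __rshift__ method is below. The Task class here is hypothetical and only records downstream tasks; it is not Airflow's operator class.

```python
class Task:
    """Toy stand-in for an operator that only tracks downstream tasks."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Accept a single task or a list of tasks on the right-hand side
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        # Return the right-hand side so a >> b >> c keeps chaining
        return other


sensor = Task("sense_file")
bash_task = Task("cleanup_tempfiles")
python_task = Task("run_processing")

sensor >> bash_task >> python_task

sensor_downstream = [t.task_id for t in sensor.downstream]
bash_downstream = [t.task_id for t in bash_task.downstream]
```

Returning the right-hand operand is what makes the chain work: sensor >> bash_task evaluates to bash_task, which is then linked to python_task.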

 - Import the necessary operators.
 - Configure the EmailOperator to pass the specific data (via params) to the templated email subject.
 - Complete the branch callable as necessary to point to the email_report_task or no_email_task.
 - Configure the branch operator to properly check for the condition.

In [None]:
# process.py

from datetime import date

def process_data(**kwargs):
    # kwargs['ds'] is the execution date string passed in by Airflow
    with open("/home/repl/workspace/processed_data-" + kwargs['ds'] + ".tmp", "w") as file:
        file.write(f"Data processed on {date.today()}")
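
To see what the callable does with the context, it can be called by hand with a small dictionary in place of the one Airflow supplies. The sketch below is a variant of process_data with an added output_dir parameter (an assumption, so the example can write to a temporary directory) and a hand-built fake_context that holds only the ds key.

```python
import os
import tempfile
from datetime import date


def process_data(output_dir, **kwargs):
    # Build the file name from the ds value, as the original callable does
    path = os.path.join(output_dir, "processed_data-" + kwargs["ds"] + ".tmp")
    with open(path, "w") as file:
        file.write(f"Data processed on {date.today()}")
    return path


with tempfile.TemporaryDirectory() as workdir:
    # Hand-built stand-in for the context; Airflow passes many more keys
    fake_context = {"ds": "2020-01-08"}
    written_path = process_data(workdir, **fake_context)
    written_name = os.path.basename(written_path)
    with open(written_path) as f:
        written_text = f.read()
```

Note that the file name uses the run's ds value while the file contents use today's date, so the two can differ when a past date is processed.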

In [None]:
# pipeline.py

from airflow.models import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.email_operator import EmailOperator
from dags.process import process_data
from datetime import datetime, timedelta

# Update the default arguments and apply them to the DAG.

default_args = {
  'start_date': datetime(2019,1,1),
  'sla': timedelta(minutes=90)
}
    
dag = DAG(dag_id='etl_update', default_args=default_args)

sensor = FileSensor(task_id='sense_file', 
                    filepath='/home/repl/workspace/startprocess.txt',
                    poke_interval=45,
                    dag=dag)

bash_task = BashOperator(task_id='cleanup_tempfiles', 
                         bash_command='rm -f /home/repl/*.tmp',
                         dag=dag)

python_task = PythonOperator(task_id='run_processing', 
                             python_callable=process_data,
                             provide_context=True,
                             dag=dag)

email_subject="""
  Email report for {{ params.department }} on {{ ds_nodash }}
"""

email_report_task = EmailOperator(task_id='email_report_task',
                                  to='sales@mycompany.com',
                                  subject=email_subject,
                                  html_content='',
                                  params={'department': 'Data subscription services'},
                                  dag=dag)

no_email_task = DummyOperator(task_id='no_email_task', dag=dag)

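def check_weekend(**kwargs):
    # kwargs['ds'] is the execution date as a 'YYYY-MM-DD' string;
    # kwargs['execution_date'] is a datetime object, which strptime cannot parse.
    dt = datetime.strptime(kwargs['ds'], "%Y-%m-%d")
    # If dt.weekday() is 0-4, it's Monday - Friday. If 5 or 6, it's Sat / Sun.
    if dt.weekday() < 5:
        return 'email_report_task'
    else:
        return 'no_email_task'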
    
branch_task = BranchPythonOperator(task_id='check_if_weekend',
                                   python_callable=check_weekend,
                                   provide_context=True,
                                   dag=dag)

    
sensor >> bash_task >> python_task

python_task >> branch_task >> [email_report_task, no_email_task]
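
The branch callable can be checked outside Airflow by calling it with known dates. The sketch below assumes the date arrives as the ds string ('YYYY-MM-DD') from the context and returns the task_id that the branch operator should follow.

```python
from datetime import datetime


def check_weekend(**kwargs):
    dt = datetime.strptime(kwargs["ds"], "%Y-%m-%d")
    # weekday() is 0-4 for Monday to Friday, 5 or 6 for Saturday / Sunday
    if dt.weekday() < 5:
        return "email_report_task"
    return "no_email_task"


# 2020-01-08 was a Wednesday, 2020-01-11 a Saturday
wednesday_choice = check_weekend(ds="2020-01-08")
saturday_choice = check_weekend(ds="2020-01-11")
```

The returned strings must match the task_id values of the downstream tasks exactly, since that is how the branch operator decides which path to run.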
