# Introduction to Airflow
Purpose: data engineering work
### What is a workflow?
A workflow is:
* A set of steps to accomplish a given data engineering task
    * Such as an ETL job: downloading, copying, and filtering data, then pushing it to a database
* Of varying levels of complexity (2-3 steps or hundreds)
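
As a rough illustration of the steps above, the sketch below chains three plain Python functions into a tiny ETL workflow. The function names and sample records are hypothetical, and an in-memory list stands in for the database; no Airflow is involved yet.

```python
# Hypothetical three-step ETL workflow in plain Python (no Airflow yet).

def download():
    # Stand-in for fetching raw records from a remote source.
    return [
        {"id": 1, "amount": 120},
        {"id": 2, "amount": -5},
        {"id": 3, "amount": 40},
    ]


def filter_valid(records):
    # Keep only records with a non-negative amount.
    return [r for r in records if r["amount"] >= 0]


def load(records, database):
    # Stand-in for pushing rows to a database table.
    database.extend(records)
    return len(records)


database = []  # in-memory placeholder for a real database
loaded = load(filter_valid(download()), database)
print(f"Loaded {loaded} records")
```

Each function must finish before the next one starts. A workflow tool such as Airflow manages that ordering, along with scheduling and monitoring.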

### What is Airflow?
*Airflow* is a platform to program workflows, including their:
* Creation
* Scheduling
* Monitoring

Airflow also:
* Can use various tools/languages, but the actual workflow code is written in Python
* Implements workflows as DAGs (Directed Acyclic Graphs); for now, think of a DAG as a set of tasks and the dependencies between them
* Is accessed via code, the command line, or a web interface

### Other workflow tools
* Luigi (Spotify's tool)
* SSIS (Microsoft SQL Server Integration Services)
* Bash scripting (we'll use some of this in this tutorial)

In [7]:
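#  DAG code example
#  Simple DAG definition:
from datetime import datetime

from airflow.models import DAG

etl_dag = DAG(
    dag_id='etl_pipeline',
    #  start_date given as a datetime object, consistent with the later example
    default_args={'start_date': datetime(2020, 1, 8)}
)

#  Note: within Python code this DAG is referred to via the variable "etl_dag", but on the command line you need to use the dag_id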

## Running a workflow in Airflow
To run a single Airflow task from a bash terminal:

``` airflow run <dag_id> <task_id> <start_date> ```

Using a DAG named example-etl, a task named download-file, and a start date of '2020-01-10':

``` airflow run example-etl download-file 2020-01-10 ```
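
If you need to launch the same command from a Python script, one possible approach is to assemble the argument list and hand it to `subprocess`. This is a minimal sketch: `build_run_command` is a hypothetical helper, not part of Airflow, and the `subprocess` call is left commented out because it needs Airflow installed and on your PATH.

```python
import shlex


def build_run_command(dag_id, task_id, date):
    # Assemble the argument list for: airflow run <dag_id> <task_id> <start_date>
    return ["airflow", "run", dag_id, task_id, date]


command = build_run_command("example-etl", "download-file", "2020-01-10")
print(shlex.join(command))  # shlex.join requires Python 3.8+

# To actually execute it (requires Airflow to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a DAG or task ID ever contains unusual characters.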

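You can't edit DAGs with ```airflow``` subcommands in the terminal; DAGs are edited in their Python files. Use ```airflow -h``` for more information. Note that the ```airflow run``` syntax shown here comes from Airflow 1.x; in Airflow 2.x the equivalent is ```airflow tasks run```.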

## What is a DAG?
DAG, or Directed Acyclic Graph:
* Directed: there is an inherent flow representing dependencies between components
* Acyclic: does not loop/cycle/repeat
* Graph: the actual set of components
* Seen in Airflow, Apache Spark, Luigi
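
To make the three properties concrete, here is a small sketch using Python's standard-library `graphlib` module (Python 3.9+), not Airflow itself. The task names are hypothetical. It shows that a directed, acyclic set of dependencies always yields a valid execution order, while adding a cycle makes ordering impossible.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "download_file": set(),
    "filter_data": {"download_file"},
    "load_to_database": {"filter_data"},
}

# Directed + acyclic means a valid execution order exists.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making the first task depend on the last creates a cycle: no longer a DAG.
cyclic = dict(dependencies, download_file={"load_to_database"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```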

Within Airflow, DAGs:
* Are written in Python (but can use components written in other languages)
* Are made up of components (typically, tasks) to be executed, such as operators, sensors, etc.
* Contain dependencies defined explicitly or implicitly
    * e.g., copy the file to the server before trying to import it to the database service
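
Airflow operators let you declare an explicit dependency with the `>>` operator (upstream `>>` downstream). The sketch below imitates that idea with a hypothetical `Task` class so that it runs without Airflow installed. It is not Airflow's implementation, and the task names are made up for illustration.

```python
class Task:
    """Hypothetical stand-in for an Airflow task; not Airflow's real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []  # tasks that must finish before this one

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, then returns b so chains work.
        other.upstream.append(self)
        return other


copy_file = Task("copy_file_to_server")
import_data = Task("import_to_database")

# Explicit dependency: copy the file before importing it.
copy_file >> import_data

print([t.task_id for t in import_data.upstream])
```

In a real DAG the same line, `copy_file >> import_data`, would be written with operator instances attached to the DAG object.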

## Define a DAG

In [9]:
from airflow.models import DAG
from datetime import datetime

#  These arguments are optional but provide a lot of power to define the runtime behavior of Airflow
default_arguments = {
    'owner': 'Fausto Rodriguez',
    'email': 'cozyscripts@github.io',
    'start_date': datetime(2020, 8, 5)
}

#  Define DAG object with the first argument using a name for the DAG, etl_workflow
#  Assign the default arguments dictionary to the default_args argument
etl_dag = DAG('etl_workflow', default_args=default_arguments)

Note: The DAG is assigned to variable "etl_dag". This variable will be used when defining components of the DAG. However, "etl_dag" will not appear in the Airflow interfaces.

### DAGs on the command line:
To interact with DAGs outside of Python, you'll generally use the ```airflow``` command line tool.

Using ```airflow```:
* The ```airflow``` command line program contains many subcommands
* ```airflow -h``` for descriptions
* Many subcommands are related to DAGs
* ```airflow list_dags``` to show all recognized DAGs in an installation (in Airflow 2.x, ```airflow dags list```)

## Command line vs. Python
Use the command line tool to:
* Start Airflow processes
* Manually run DAGs/Tasks
* Get logging information from Airflow

Use Python to:
* Create a DAG
* Edit the individual properties of a DAG
* Edit the actual data processing code itself