# ETL principles applied to Airflow

### Load data incrementally

When a table or dataset is small, you can afford to extract it as a whole and write it to the destination. As an organization grows, however, you’ll need to extract data incrementally at regular intervals and load only the data for a given hour, day, week, and so on. Airflow makes it easy to schedule jobs that each process a specific interval, with job parameters that let you select the data for that interval.
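
As a minimal sketch of interval-based extraction, the function below builds a query that selects only the rows belonging to one interval. The table name `orders` and the column `updated_at` are hypothetical, and the interval bounds are passed in as plain arguments; in an Airflow task they would typically come from the run's own parameters (for example the templated `{{ ds }}` value).

```python
from datetime import datetime, timedelta


def build_incremental_query(table, interval_start, interval_end):
    """Return a parameterized query selecting only rows inside one interval."""
    # Half-open interval [start, end) so adjacent runs never overlap.
    # The table name is interpolated directly, so it must come from trusted code.
    sql = (
        f"SELECT * FROM {table} "
        "WHERE updated_at >= ? AND updated_at < ?"
    )
    return sql, (interval_start.isoformat(), interval_end.isoformat())


# Hypothetical table name and a one-day interval.
start = datetime(2024, 1, 1)
sql, params = build_incremental_query("orders", start, start + timedelta(days=1))
```

Using a half-open interval means that a row stamped exactly at midnight belongs to one run only, which matters once runs are repeated or back-filled.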

### Process historic data

There are cases when you have just finished a new workflow and need data that goes back further than the date you push your new code into production. In this situation you can use the start_date parameter of a DAG to specify the start date. Airflow will then back-fill tasks to process the data all the way back to that start date. It may not be appropriate or desirable to have so many execution runs to bring the data up to date, so there are other strategies for processing weeks, months or years of data through better parametrization of the DAGs. That will be documented in a separate story on this site. I’ve personally seen cases where a lot of effort went into historical data loads, with a lot of manual coding and workarounds to achieve them. The idea is to make this effort repeatable and simple.
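
To illustrate what back-filling from a start date amounts to, the sketch below enumerates one daily interval for every day between a start date and an end date. This is only a conceptual illustration under the assumption of a daily schedule, not Airflow's scheduler logic; the function name `daily_intervals` is hypothetical.

```python
from datetime import date, timedelta


def daily_intervals(start_date, end_date):
    """Yield (interval_start, interval_end) pairs, one per day, end exclusive."""
    current = start_date
    while current < end_date:
        yield current, current + timedelta(days=1)
        current += timedelta(days=1)


# A start date three days in the past yields three runs to catch up.
runs = list(daily_intervals(date(2024, 1, 1), date(2024, 1, 4)))
```

Each pair corresponds to one run that processes exactly one day of data, which is why a start date far in the past can produce a large number of runs.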

### Partition ingested data

By partitioning ingested data at the destination, you can parallelize DAG runs, avoid write locks on data being ingested, and optimize performance when that same data is read. The partitions also serve as historical snapshots of what the data looked like at specific moments in time, which is useful for audit purposes. Partitions that are no longer relevant can be archived and removed from the database.
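
One common way to partition is by run date, so that each run writes to its own location. The sketch below derives such a location; the base directory, the dataset name and the `ds=` naming convention are assumptions made for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(base_dir, dataset, run_date):
    """Return the directory that holds the data for a single run date."""
    return PurePosixPath(base_dir) / dataset / f"ds={run_date.isoformat()}"


# Hypothetical base directory and dataset name.
path = partition_path("/data/warehouse", "orders", date(2024, 1, 1))
```

Because two runs for different dates never write to the same path, they can run in parallel without contending for the same files.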

### Enforce the idempotency constraint

The result of a DAG run should always be idempotent. **This means that when you run a process multiple times with the same parameters (even on different days), the outcome is exactly the same**. You do not end up with multiple copies of the same data in your environment or with other undesirable side effects. This obviously only holds when the processing itself has not been modified. If business rules change within the process, the target data will be different. It’s a good idea to be aware of audit or other business requirements on reprocessing historic data, because reprocessing is not always allowed. Also, some processes require anonymization of data after a certain number of days, because it’s not always allowed to keep historical customer data on record forever.
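
A minimal sketch of one way to get this behavior is to delete the rows for the interval being loaded before inserting them again, inside a single transaction. The example uses an in-memory SQLite database, and the `sales` table and its columns are hypothetical.

```python
import sqlite3


def load_partition(conn, ds, amounts):
    """Replace all rows for one date so a rerun leaves the same end state."""
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (ds, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount INTEGER)")

# Run the same load twice with the same parameters.
load_partition(conn, "2024-01-01", [10, 20])
load_partition(conn, "2024-01-01", [10, 20])

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

After the second run the table still holds two rows, not four, because the rerun replaced the earlier copy instead of appending to it.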

### Enforce deterministic properties

A function is said to be deterministic if **for a given input, the output produced is always exactly the same**. Examples of cases where the behavior of a function can be non-deterministic:

* Using external state within the function, like global variables, random values, stored disk data, hardware timers.
* Operating in time-sensitive ways, like multi-threaded programs that are incorrectly sequenced or mutexed.
* Relying on the accidental order of input data instead of ordering it explicitly (this can happen when you write results to a database in order and then select them without an explicit ORDER BY clause)
* Implementation issues with structures inside the function (for example, implicitly relying on the iteration order of Python sets, or of dicts in Python versions before 3.7)
* Improper exception handling and post-exception behavior
* Intermediate commits and unexpected conditions
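
Two of the cases above, external state and accidental ordering, can be sketched as follows. All function and field names here are hypothetical and only serve the illustration.

```python
from datetime import datetime


def label_nondeterministic(record):
    # Depends on a hardware timer: the output changes on every call.
    return {**record, "processed_at": datetime.now().isoformat()}


def label_deterministic(record, run_time):
    # All state arrives through parameters, so equal inputs give equal outputs.
    return {**record, "processed_at": run_time.isoformat()}


def ordered_ids(rows):
    # Sort explicitly rather than trusting the order in which rows arrived.
    return [row["id"] for row in sorted(rows, key=lambda row: row["id"])]


run_time = datetime(2024, 1, 1)
first = label_deterministic({"id": 1}, run_time)
second = label_deterministic({"id": 1}, run_time)
ids = ordered_ids([{"id": 3}, {"id": 1}, {"id": 2}])
```

Passing the run time in as a parameter, rather than reading the clock inside the function, is what lets a rerun for an old interval reproduce the original result.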



### Execute conditionally

Airflow has some options to control how tasks within DAGs are run based on the outcome of earlier runs. For example, **the depends_on_past parameter specifies that a task instance only executes if the previous instance of that same task succeeded**. The recently introduced LatestOnlyOperator allows you to conditionally skip downstream tasks in the DAG when the current run is not the most recent one. There’s also the BranchPythonOperator, which selects which branch of execution proceeds in the DAG based on a decision function.
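
A branch decision function is an ordinary Python callable that returns the task id of the branch to follow. The sketch below shows only that callable, kept free of Airflow imports so it runs on its own; the task ids `weekend_summary` and `weekday_load` are hypothetical, and in a DAG the function would be handed to a BranchPythonOperator through its `python_callable` argument and would take the date from the run's context rather than as a direct argument.

```python
from datetime import date


def choose_branch(run_date):
    """Return the task id of the branch to follow for this run."""
    # weekday() is 0 for Monday through 6 for Sunday.
    if run_date.weekday() >= 5:
        return "weekend_summary"
    return "weekday_load"


saturday_branch = choose_branch(date(2024, 1, 6))
monday_branch = choose_branch(date(2024, 1, 8))
```

Since the decision depends only on the run date, rerunning an old interval selects the same branch as the original run did.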

## Sources


https://gtoonstra.github.io/etl-with-airflow/principles.html