Setting up the sandbox from the start section was easy; working towards a production-grade environment takes a bit more work.
As of August 2015, Airflow has experimental support for Python 3. Any issues should be reported (or fixed!). The only major regression is that HDFSHooks do not work (due to a snakebite dependency).
The airflow PyPI basic package only installs what's needed to get started. Subpackages can be installed depending on what will be useful in your environment. For instance, if you don't need connectivity with Postgres, you won't have to go through the trouble of installing the postgres-devel yum package, or whatever equivalent applies on the distribution you are using.
Behind the scenes, we do conditional imports on operators that require these extra dependencies.
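The conditional-import idea can be sketched as follows. This is a minimal illustration of the general pattern, not Airflow's actual code: the helper `optional_import`, the feature names, and the deliberately nonexistent module name are all hypothetical.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# A standard-library module stands in for an installed extra dependency.
json_module = optional_import("json")

# A hypothetical, deliberately nonexistent name stands in for a missing one.
missing_driver = optional_import("hypothetical_missing_driver_xyz")

# Features are exposed only when their dependency could be imported.
available_features = []
if json_module is not None:
    available_features.append("json_feature")
if missing_driver is not None:
    available_features.append("driver_feature")
```

With this pattern, a missing optional dependency disables only the feature that needs it instead of breaking the whole package at import time.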
Here's the list of the subpackages and what they enable:
| subpackage | install command | enables |
|---|---|---|
|  |  | MySQL operators and hook, support as an Airflow backend |
|  |  | Postgres operators and hook, support as an Airflow backend |
|  |  | Hive2SambaOperator |
|  |  | All Hive related operators |
|  |  | JDBC hooks and operators |
|  |  | HDFS hooks and operators |
| s3 | pip install airflow[s3] | S3KeySensor, S3PrefixSensor |
| druid | pip install airflow[druid] | Druid.io related operators & hooks |
|  |  | Microsoft SQL operators and hook, support as an Airflow backend |
|  |  | Vertica hook support as an Airflow backend |
| slack | pip install airflow[slack] | SlackAPIPostOperator |
| all | pip install airflow[all] | All Airflow features known to man |
| devel | pip install airflow[devel] | All Airflow features + useful dev tools |
| crypto | pip install airflow[crypto] | Encrypt connection passwords in metadata db |
| celery | pip install airflow[celery] | CeleryExecutor |
| async | pip install airflow[async] | Async worker classes for gunicorn |
| ldap | pip install airflow[ldap] | LDAP authentication for users |
| kerberos | pip install airflow[kerberos] | Kerberos integration for kerberized Hadoop |
| password | pip install airflow[password] | Password authentication for users |
The first time you run Airflow, it will create a file called airflow.cfg in your $AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration and you can edit it to change any of the settings. You can also set options with environment variables by using this format: $AIRFLOW__{SECTION}__{KEY} (note the double underscores).
For example, the metadata database connection string can either be set in airflow.cfg like this:
[core]
sql_alchemy_conn = my_conn_string
or by creating a corresponding environment variable:
AIRFLOW__CORE__SQL_ALCHEMY_CONN=my_conn_string
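The naming scheme can be illustrated with a small lookup sketch. It is not Airflow's own configuration code: the helper `get_setting` is hypothetical, and it assumes that an environment variable, when present, takes precedence over the value in the file.

```python
import configparser
import os


def get_setting(parser, section, key, environ=os.environ):
    """Look up a setting, preferring AIRFLOW__{SECTION}__{KEY} if it is set."""
    # Assumption: section and key are upper-cased to build the variable name.
    env_name = "AIRFLOW__{}__{}".format(section.upper(), key.upper())
    if env_name in environ:
        return environ[env_name]
    return parser.get(section, key)


# Stand-in for the contents of airflow.cfg.
parser = configparser.ConfigParser()
parser.read_string("[core]\nsql_alchemy_conn = my_conn_string\n")

# No environment variable set: the value comes from the file.
from_file = get_setting(parser, "core", "sql_alchemy_conn", environ={})

# Environment variable set: it wins over the file in this sketch.
from_env = get_setting(
    parser,
    "core",
    "sql_alchemy_conn",
    environ={"AIRFLOW__CORE__SQL_ALCHEMY_CONN": "env_conn_string"},
)
```

Passing `environ` explicitly keeps the example independent of the real process environment.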
If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor.
As Airflow was built to interact with its metadata using the great SQLAlchemy library, you should be able to use any database backend supported as a SQLAlchemy backend. We recommend using MySQL or Postgres.
Note
If you decide to use Postgres, we recommend using the psycopg2 driver and specifying it in your SQLAlchemy connection string. Also note that since SQLAlchemy does not expose a way to target a specific schema in the Postgres connection URI, you may want to set a default schema for your role with a command similar to ALTER ROLE username SET search_path = airflow, foobar;
Once you've set up your database to host Airflow, you'll need to alter the SQLAlchemy connection string located in your configuration file $AIRFLOW_HOME/airflow.cfg. You should then also change the "executor" setting to use "LocalExecutor", an executor that can parallelize task instances locally.
# initialize the database
airflow initdb
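For the Postgres case, a connection string that names the psycopg2 driver could be assembled as in the sketch below. The helper `postgres_conn_string` and all of the credentials, host, and database names are hypothetical placeholders; the `postgresql+psycopg2://` prefix is SQLAlchemy's dialect+driver URL form.

```python
from urllib.parse import quote_plus


def postgres_conn_string(user, password, host, port, database):
    """Build a SQLAlchemy URL that explicitly selects the psycopg2 driver."""
    # Credentials are percent-encoded so characters such as "@" or "/"
    # cannot be mistaken for URL delimiters.
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Placeholder values for illustration only.
conn = postgres_conn_string("airflow", "p@ss/word", "localhost", 5432, "airflow")
```

The resulting string is what would go into the sql_alchemy_conn setting.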
Airflow needs to know how to connect to your environment. Information such as hostname, port, login and passwords to other systems and services is handled in the Admin->Connection section of the UI. The pipeline code you will author will reference the 'conn_id' of the Connection objects.
By default, Airflow will save the passwords for the connection in plain text within the metadata database. Installing the crypto package is highly recommended so that these passwords are encrypted. The crypto package does require that your operating system have libffi-dev installed.
Connections in Airflow pipelines can also be created using environment variables. The environment variable needs to have the prefix AIRFLOW_CONN_ and a value in URI format for Airflow to use the connection properly. Please see the concepts documentation for more information on environment variables and connections.
CeleryExecutor is the way you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, ...) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery settings.
For more information about setting up a Celery broker, refer to the exhaustive Celery documentation on the topic.
To kick off a worker, you need to set up Airflow and kick off the worker subcommand:
airflow worker
Your worker should start picking up tasks as soon as they get fired in its direction.
Note that you can also run "Celery Flower", a web UI built on top of Celery, to monitor your workers.
Users can specify a logs folder in airflow.cfg. By default, it is in the AIRFLOW_HOME directory.
In addition, users can supply an S3 location for storing log backups. If logs are not found in the local filesystem (for example, if a worker is lost or reset), the S3 logs will be displayed in the Airflow UI. Note that logs are only sent to S3 once a task completes (including failure).
[core]
base_log_folder = {AIRFLOW_HOME}/logs
s3_log_folder = s3://{YOUR S3 LOG PATH}
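The local-first, remote-fallback behaviour described above can be sketched like this. It only illustrates the lookup order, not Airflow's log-serving code: `read_task_log` is a hypothetical helper, and the remote fetch is a plain callable standing in for an S3 read.

```python
import os
import tempfile


def read_task_log(local_path, fetch_remote):
    """Return the local log if it exists, otherwise the remote backup."""
    if os.path.exists(local_path):
        with open(local_path) as handle:
            return handle.read()
    # Local file is gone (for example, the worker was lost or reset).
    return fetch_remote()


def fake_remote_fetch():
    """Stand-in for reading the backed-up log from remote storage."""
    return "remote log contents"


with tempfile.TemporaryDirectory() as log_dir:
    present = os.path.join(log_dir, "present.log")
    with open(present, "w") as handle:
        handle.write("local log contents")

    # The file exists locally, so the remote copy is never consulted.
    local_result = read_task_log(present, fake_remote_fetch)

    # The file does not exist locally, so the remote copy is used.
    absent = os.path.join(log_dir, "absent.log")
    remote_result = read_task_log(absent, fake_remote_fetch)
```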
MesosExecutor allows you to schedule Airflow tasks on a Mesos cluster. For this to work, you need a running Mesos cluster and must perform the following steps:
- Install Airflow on a machine where the webserver and scheduler will run; let's refer to this as the Airflow server.
- On the Airflow server, install the Mesos Python eggs from Mesos downloads.
- On the Airflow server, use a database that can be accessed from the Mesos slave machines, for example MySQL, and configure it in airflow.cfg.
- Change your airflow.cfg to point the executor parameter to MesosExecutor and provide the related Mesos settings.
- On all Mesos slaves, install Airflow. Copy the airflow.cfg from the Airflow server (so that it uses the same SQLAlchemy connection).
- On all Mesos slaves, run airflow serve_logs to serve logs.
- On the Airflow server, run airflow scheduler -p to start processing DAGs and scheduling them on Mesos. The -p parameter is needed to pickle the DAGs.
You can now see the Airflow framework and the corresponding tasks in the Mesos UI. The logs for Airflow tasks can be seen in the Airflow UI as usual.
For more information about Mesos, refer to the Mesos documentation. For any queries or bugs regarding MesosExecutor, please contact @kapil-malik.
Airflow can integrate with systemd-based systems. This makes watching your daemons easy, as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Red Hat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non-Red Hat based system), you will probably need to adjust the unit files.
- Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define other variables here, for example AIRFLOW_HOME or AIRFLOW_CONFIG.