- Source code - GitHub
- Author - Gavin Noronha - gavinln@hotmail.com
This project provides an Ubuntu (16.04) Vagrant Virtual Machine (VM) with Airflow, a data workflow management system from Airbnb.
There are Ansible scripts that automatically install the software when the VM is started.
- To start the virtual machine (VM) type
vagrant up
- Copy the private keys
copy-private-keys.bat
- Connect to the VM
vagrant ssh airflow-master
- Set up the private keys inside the VM
/vagrant/scripts/setup-private-keys.sh
- Initialize the Airflow home directory
export AIRFLOW_HOME=~/airflow
- Set up the SQLite database
airflow initdb
- Change to the Airflow directory
cd $AIRFLOW_HOME
- Reduce the load on the Airflow system by setting these values in airflow.cfg
parallelism = 4
dag_concurrency = 2
celeryd_concurrency = 4
- Start the web server
airflow webserver -p 8080
- Open a web browser to the UI at http://192.168.33.10:8080
- List DAGs
airflow list_dags
- List tasks for the example_bash_operator DAG
airflow list_tasks example_bash_operator
- List tasks for the example_bash_operator DAG in a tree view
airflow list_tasks example_bash_operator -t
- Run the runme_0 task on the example_bash_operator DAG today
airflow run example_bash_operator runme_0 `date +%Y-%m-%d`
- Backfill a DAG
export START_DATE=$(date -d "-2 days" "+%Y-%m-%d")
airflow backfill -s $START_DATE example_bash_operator
- Clear the history of DAG runs
airflow clear example_bash_operator
- Go to the Airflow config directory
cd ~/airflow
- Set the Airflow DAGs directory in airflow.cfg by changing the line:
dags_folder = /vagrant/airflow/dags
- Remove the example DAGs
load_examples = False
- Restart the web server
airflow webserver -p 8080
- List the DAGs to check that the dynamic_dags DAG is loaded; a sketch of such a DAG is shown after these steps
airflow list_dags
- Trigger the DAG
airflow trigger_dag dynamic_dags
- Run the scheduler to actually run the DAG
airflow scheduler
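The dynamic_dags DAG is loaded from the dags_folder configured above (/vagrant/airflow/dags). Its source is not reproduced in this README; the block below is only a minimal sketch, assuming the Airflow 1.x API, of how a DAG file can generate its tasks in a loop (task ids, dates and schedule are made up for illustration).

```python
# Hypothetical sketch of a dynamically generated DAG (Airflow 1.x API).
# The real dynamic_dags definition in /vagrant/airflow/dags may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "airflow", "start_date": datetime(2017, 1, 1)}

dag = DAG("dynamic_dags", default_args=default_args, schedule_interval=None)

# Build one BashOperator per item in a loop; the task list is generated
# in Python when the scheduler parses the file.
for i in range(3):
    BashOperator(
        task_id="echo_{}".format(i),
        bash_command="echo task {}".format(i),
        dag=dag,
    )
```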
- Change to the Airflow directory
cd /vagrant/airflow
- Set the Airflow environment
source set_airflow_env.sh
- Run Airflow without any logging messages
- Edit the file ~/airflow/airflow.cfg
- Set the following:
dags_folder = /vagrant/airflow/dags
load_examples = False
- Start the webserver and scheduler by running the following
tmuxp load /vagrant/scripts/tmux-webserver-scheduler.yaml
Follow the instructions here
- Start RabbitMQ in a Docker container
export RMQ_IMG=rabbitmq:3.6.10-management
docker run -d --rm --hostname airflow-rmq \
    --name airflow-rmq -p 192.168.33.10:15672:15672 -p 5672:5672 $RMQ_IMG
- Display the list of running Docker containers
docker ps
- Go to the RabbitMQ dashboard at http://192.168.33.10:15672/
- Log in using guest/guest
- Get the id of the RabbitMQ Docker container
export RMQ=$(docker ps -aq --filter name=airflow-rmq)
- List queues
docker exec -ti $RMQ rabbitmqctl list_queues
- Stop RabbitMQ
docker stop $RMQ
The RabbitMQ web site demonstrates how to connect using Python and the Pika library.
- List queues
docker exec -ti $RMQ rabbitmqctl list_queues
- Send a message to a RabbitMQ queue called hello
python rmq-send.py
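The rmq-send.py script is not shown in this README. A minimal sender in the style of the RabbitMQ "Hello World" tutorial mentioned above looks like the sketch below; the actual script may differ in detail.

```python
# Minimal sketch of publishing to the "hello" queue with pika.
# Assumes RabbitMQ is reachable on localhost:5672 with the default credentials.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare the queue (idempotent) and publish a single message to it.
channel.queue_declare(queue="hello")
channel.basic_publish(exchange="", routing_key="hello", body="Hello World!")
print("Sent 'Hello World!'")

connection.close()
```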
- Receive a message from the RabbitMQ queue called hello
python rmq-receive.py
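Likewise, a minimal receiver could look like the sketch below (written against the pika 1.x API); the actual rmq-receive.py may be structured differently.

```python
# Minimal sketch of consuming from the "hello" queue with pika (1.x API).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="hello")

def callback(ch, method, properties, body):
    # Print each message received from the queue.
    print("Received {}".format(body))

channel.basic_consume(queue="hello", on_message_callback=callback, auto_ack=True)
print("Waiting for messages. Press CTRL+C to exit.")
channel.start_consuming()
```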
- List queues; the hello queue is displayed
docker exec -ti $RMQ rabbitmqctl list_queues
- Stop the RabbitMQ app
docker exec -ti $RMQ rabbitmqctl stop_app
- Start the RabbitMQ app
docker exec -ti $RMQ rabbitmqctl start_app
- List queues; the hello queue is no longer displayed
docker exec -ti $RMQ rabbitmqctl list_queues
- Start the Celery worker
export PYTHONPATH=/vagrant/scripts
celery -A tasks worker --loglevel=info
- Call the task
export PYTHONPATH=/vagrant/scripts
python -c "from tasks import add; add.delay(2, 3)"
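Both Celery steps above import a tasks module from /vagrant/scripts. A minimal sketch of what such a module could contain is shown below; the broker URL is an assumption (the RabbitMQ container started earlier on the default port).

```python
# Hypothetical sketch of a tasks.py module for the Celery steps above.
# The broker URL is an assumption; the real script may use a different broker.
from celery import Celery

app = Celery("tasks", broker="amqp://guest:guest@localhost:5672//")

@app.task
def add(x, y):
    # Trivial task used to verify that the worker picks up and executes jobs.
    return x + y
```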
- Start Postgres in a Docker container with the name airflow-pg
export PG_IMG=postgres:9.6.3
export PGPASSWORD=airflow_pg_pass
docker run -d --rm --name airflow-pg -p 0.0.0.0:5432:5432 \
    -e POSTGRES_PASSWORD=$PGPASSWORD $PG_IMG
export PG=$(docker ps -aq --filter name=airflow-pg)
- List the Docker container
docker ps --filter id=$PG
- Stop Postgres
docker stop $PG
Start the Postgres database before running these steps.
- Connect to the database using psql and create the database test
docker exec -ti $PG psql -U postgres -c "create database test"
- Create the table test in the database test using only Psycopg2
export PGHOST=localhost
python /vagrant/scripts/pg-psycopg2.py
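A rough sketch of what a Psycopg2 script like this might do is shown below; the real /vagrant/scripts/pg-psycopg2.py may differ, and the column layout is inferred from the select statement used later (id, num, data).

```python
# Sketch of creating and populating the "test" table with psycopg2.
# Assumes PGHOST and PGPASSWORD are set as in the steps above.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ.get("PGHOST", "localhost"),
    dbname="test",
    user="postgres",
    password=os.environ.get("PGPASSWORD", "airflow_pg_pass"),
)
cur = conn.cursor()

# Create the table and insert one row matching the columns queried later.
cur.execute(
    "CREATE TABLE IF NOT EXISTS test "
    "(id serial PRIMARY KEY, num integer, data varchar)"
)
cur.execute("INSERT INTO test (num, data) VALUES (%s, %s)", (100, "abc"))
conn.commit()

cur.close()
conn.close()
```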
- Connect to the database test using Psycopg2 and SQLAlchemy
python pg-sqlalchemy-read.py
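A minimal sketch of reading the table back through SQLAlchemy (with the psycopg2 driver) is shown below; the actual pg-sqlalchemy-read.py may differ, and the connection string mirrors the sql_alchemy_conn value used later in airflow.cfg.

```python
# Sketch of reading the "test" table via SQLAlchemy + psycopg2.
# Connection details are assumptions based on the Postgres container above.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://postgres:airflow_pg_pass@localhost/test"
)

# Run a plain SQL query and print each row from the test table.
with engine.connect() as conn:
    for row in conn.execute(text("select id, num, data from test")):
        print(row)
```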
- Connect to the postgres database again
docker exec -ti $PG psql -U postgres
- List the databases
\l
- Connect to the test database
\c test
- List objects in the test database
\d
- Select all rows from the test table
select id, num, data from test;
- Quit the psql utility
\q
- Drop the database test
docker exec -ti $PG psql -U postgres -c "drop database test"
- Change the executor in the ~/airflow/airflow.cfg file to the following value
executor = LocalExecutor
- Change the sql_alchemy_conn in the ~/airflow/airflow.cfg file to the following value
# Change the meta db configuration
sql_alchemy_conn = postgresql+psycopg2://postgres:airflow_pg_pass@localhost/test
- Change the executor in the ~/airflow/airflow.cfg file to the following value
executor = CeleryExecutor
- Set the following two values in the ~/airflow/airflow.cfg file
broker_url = redis://localhost:6379/0
celery_result_backend = redis://localhost:6379/0
- Initialize the test database as the Airflow database
airflow initdb
Start the Airflow daemons by running the following
```
tmuxp load /vagrant/scripts/tmux-airflow-daemons.yaml
```
5. View the Airflow web server at http://192.168.33.10:8080
6. View the Airflow flower server at http://192.168.33.10:5555
- Copy the Airflow configuration
rsync -zvh airflow/airflow*.cfg worker:~/airflow/
- Initialize the Airflow home
export AIRFLOW_HOME=~/airflow
- Run the Airflow worker
airflow worker
- Stop Redis
sudo systemctl stop redis
- Modify the Redis configuration
sudo vim /etc/redis/redis.conf
- Change the bind line to the following
bind 0.0.0.0
- Start Redis
sudo systemctl start redis
- Create the postgres Docker container
- Create the test database
- Run airflow initdb
- Run sudo supervisord
- Set up logs in airflow-scheduler.conf, airflow-webserver.conf, airflow-worker.conf
9. Set up netdata for monitoring
- Edit the netdata configuration
vim /opt/netdata/etc/netdata/netdata.conf
- Show the status of netdata
sudo systemctl status netdata
- Start netdata
sudo systemctl start netdata
- View the netdata dashboard at http://192.168.33.10:19999/
- Stop netdata
sudo systemctl stop netdata
- Netdata stores data in memory and updates every second. To store hours of data without using up memory, add the following
[global]
    update every = 10
- Main documentation
- Videos on Airflow
- Slides
- Airflow reviews
- Airflow tips and tricks
- https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.i2hu0syug
- Airflow with Postgres + RabbitMQ
- Three tips on using Celery
- Building a Data Pipeline with Airflow
- https://databricks.com/blog/2016/12/08/integrating-apache-airflow-databricks-building-etl-pipelines-apache-spark.html
- http://site.clairvoyantsoft.com/installing-and-configuring-apache-airflow/
- https://gtoonstra.github.io/etl-with-airflow/principles.html
- https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
- http://michal.karzynski.pl/blog/2017/03/19/developing-workflows-with-apache-airflow/
The following software is needed to get the code from GitHub and run Vagrant to set up the Python development environment. The Git environment also provides an SSH client for Windows.
- Oracle VM VirtualBox
- Vagrant version 1.9 or higher
- Git