# Airflow Note2

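#### Recap

* Simplest installation
> pip install apache-airflow  
> pip install "apache-airflow[crypto, password]"  


*  Initialize the database: `airflow initdb` [required step]
*  Start the web server: `airflow webserver -p 8080` [for managing DAGs visually]
*  Start the scheduler: `airflow scheduler` [once the scheduler is running, the DAGs in the DAG folder are triggered at their configured times]
*  A single task of a DAG can also be tested directly, e.g. for the DAG at the end of the article: `airflow test ct1 print_date 2016-05-14`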


---

#### Configuring MySQL to enable LocalExecutor and CeleryExecutor

Install MySQL and set the root password  

> apt-get install mysql-server  

Adjust the MySQL config file

> vim /etc/my.cnf  

~~~
[client]  
default-character-set=utf8  


[mysql]
default-character-set=utf8

[mysqld]
collation-server=utf8_general_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
# Recommended in standard MySQL setup
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES,NO_BACKSLASH_ESCAPES  
explicit_defaults_for_timestamp = 1
~~~

**For MySQL versions newer than 5.6**
> show global variables like '%timestamp%';  

> set global explicit_defaults_for_timestamp = 1;

Start the service:  

> service mysql start


4. Log in to MySQL

> mysql -u root -p

5. Create the database 
> CREATE DATABASE airflow;  
  
6. Create the user `airflow` with password `12345` and grant it full privileges on the `airflow` database
> GRANT all privileges on airflow.* TO 'airflow'@'localhost'  IDENTIFIED BY '12345';  
> FLUSH PRIVILEGES; 

7. Change the character set of the Airflow DB  
> ALTER DATABASE `airflow` CHARACTER SET utf8; 


---




##### Install the Airflow MySQL extra: 
> pip install "apache-airflow[mysql]"   


##### Edit the Airflow config file to use MySQL
The airflow.cfg file is usually located in the ~/airflow directory.

Change the DB connection: `sql_alchemy_conn = mysql://airflow:12345@localhost/airflow`   
The fields follow this pattern: `dialect+driver://username:password@host:port/database` 
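
To make the field layout concrete, here is a minimal sketch that splits such a URL with Python's standard library. The helper name `describe_conn` is made up for illustration, and this is not how SQLAlchemy itself parses the URL; it only shows which part of the string maps to which field.

```python
from urllib.parse import urlsplit


def describe_conn(url):
    """Split a SQLAlchemy-style URL into the fields named above."""
    parts = urlsplit(url)
    # The scheme is either "dialect" or "dialect+driver".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,  # None: no driver given in the URL
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "database": parts.path.lstrip("/"),
    }


fields = describe_conn("mysql://airflow:12345@localhost/airflow")
print(fields)
```

With the URL from this note, `driver` and `port` come back as `None` because the URL leaves both out.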

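Initialize the database: `airflow initdb`  

Once initialization succeeds, log in to MySQL to inspect the newly created tables.  
> mysql -u airflow -p   

Enter the password `12345` when prompted (with a space after `-p`, `12345` would be read as the database name).

> USE airflow;    

> SHOW TABLES;     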

~~~
+-------------------+  
| Tables_in_airflow |  
+-------------------+
| alembic_version   |
| chart             |
| connection        |
| dag               |
| dag_pickle        |
| dag_run           |
| import_error      |
| job               |
| known_event       |
| known_event_type  |
| log               |
| sla_miss          |
| slot_pool         |
| task_instance     |
| users             |
| variable          |
| xcom              |
+-------------------+
~~~

#### Configuring LocalExecutor  

Edit the Airflow config file: `airflow.cfg`  
* Set the executor: `executor = LocalExecutor`  
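
If the change needs to be scripted rather than made by hand, something like the following sketch could work. It assumes `executor` lives in the `[core]` section; `SAMPLE_CFG` is a made-up fragment, not a full airflow.cfg, and `set_executor` is a hypothetical helper name.

```python
import configparser
import io

# Made-up fragment standing in for the real airflow.cfg.
SAMPLE_CFG = """\
[core]
executor = SequentialExecutor
sql_alchemy_conn = mysql://airflow:12345@localhost/airflow
"""


def set_executor(cfg_text, executor="LocalExecutor"):
    """Return cfg_text with [core] executor replaced."""
    # interpolation=None leaves any literal % characters untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    parser["core"]["executor"] = executor
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = set_executor(SAMPLE_CFG)
print(updated)
```

Note that `configparser` drops comments when it writes the file back, so editing by hand is safer if the comments in airflow.cfg matter to you.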

Test  
* `airflow webserver --debug &`

~~~  
# server_only: restart only the webserver

    ps -ef | grep -Ei '(airflow-webserver)' | grep master | awk '{print $2}' | xargs -i kill {}

    cd ~/airflow/
    nohup airflow webserver >webserver.log 2>&1 &

# restart_all: kill every Airflow process, then start the webserver and scheduler

    ps -ef | grep -Ei 'airflow' | grep -v 'grep' | awk '{print $2}' | xargs -i kill {}

    cd ~/airflow/
    nohup airflow webserver -p 8080 >>webserver.log 2>&1 &
    #nohup airflow worker >>worker.log 2>&1 &
    nohup airflow scheduler >>scheduler.log 2>&1 &

# start_all: start the webserver and scheduler

    cd ~/airflow/
    nohup airflow webserver -p 8080 >>webserver.log 2>&1 &
    #nohup airflow worker >>worker.log 2>&1 &
    nohup airflow scheduler >>scheduler.log 2>&1 &  
    
~~~

---
#### Other airflow.cfg settings

* `dags_folder`: this directory supports subdirectories and symbolic links, so different DAGs can be stored by category.  


* Configure the mail (SMTP) service
~~~
smtp_host = smtp.163.com
smtp_starttls = True
smtp_ssl = False
smtp_user = username@163.com
smtp_port = 25
smtp_password = userpasswd
smtp_mail_from = username@163.com
~~~

* Multi-user login settings (apparently only supported with CeleryExecutor)

  - Change the following 3 lines in airflow.cfg  

~~~
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
filter_by_owner = True
~~~   

  - Add a user (run this in Python on the server where Airflow is installed)
~~~
import airflow
from airflow import models,   settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'ehbio'
user.email = 'mail@ehbio.com'
user.password = 'ehbio'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
~~~

---
### TASK  

*Parameter notes*

* depends_on_past  


Airflow assumes idempotent tasks that operate on immutable data chunks.   
It also assumes that all task instances (each task for each schedule) need to run.  

If your tasks need to be executed sequentially, you need to tell Airflow:   
use the `depends_on_past=True` flag on the tasks that require sequential execution.  



If a task did not run when it should have, or the schedule interval is `@once`, `depends_on_past=False` is recommended.   
When running a DAG, it sometimes happens that an upstream task has clearly finished but the downstream task never starts, and the whole DAG gets stuck.   
Setting `depends_on_past=False` can resolve this kind of problem.   

~~~
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 5, 29, 8, 30),
    #'email': ['chentong_biology@163.com'],
    #'email_on_failure': False,
    #'email_on_retry': False,
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    #'queue': 'bash_queue',
    #'pool': 'backfill',
    #'priority_weight': 10,
    #'end_date': datetime(2016, 5, 29, 11, 30),
}
~~~
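
The effect of `depends_on_past` described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: it ignores retries, `"blocked"` is a made-up label rather than a real Airflow state, and `simulate` is a hypothetical helper, not part of Airflow.

```python
def simulate(outcomes, depends_on_past):
    """Return the state of each scheduled instance of one task.

    outcomes[i] says whether instance i would succeed if it were run.
    """
    states = []
    for i, would_succeed in enumerate(outcomes):
        # The first instance has no predecessor, so it may always run.
        previous_ok = i == 0 or states[i - 1] == "success"
        if depends_on_past and not previous_ok:
            states.append("blocked")  # waits until the previous instance succeeds
        else:
            states.append("success" if would_succeed else "failed")
    return states


runs = [True, False, True, True]
print(simulate(runs, depends_on_past=False))
print(simulate(runs, depends_on_past=True))
```

In the second call, one failed instance holds back every later instance of the same task, which matches the "stuck DAG" symptom the note mentions.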




* timestamp in format like 2016-01-01T00:03:00

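* When a command called by a task fails, the task has to be restarted manually by clicking run in the Graph view of the web UI. To let a task run smoothly after it has been fixed, a workable compromise is:

  * Set email_on_retry: True
  * Set a long retry_delay, so there is time to react after the email arrives
  * Then switch back to a shorter retry_delay for a quick restart  
    
   
* In some cases, for example the day after a change, you may want to avoid running the tasks scheduled before the current date. Backfill can fill in the tasks for a specific period and mark them as successful:
> airflow backfill -s START -e END --mark_success DAG_ID    


* To avoid being tripped up by backfill and start_date (prevent Airflow from backfilling DAG runs), set:
> catchup=False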
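
To see what `catchup=False` changes, here is a simplified model of which execution dates get scheduled. It is a sketch, not Airflow's scheduler: it assumes a fixed `timedelta` interval that starts at `start_date`, and `runs_to_schedule` is a hypothetical helper. The dates are printed in the timestamp format mentioned above.

```python
from datetime import datetime, timedelta


def runs_to_schedule(start_date, now, interval, catchup):
    """List the execution dates whose interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    if not catchup:
        dates = dates[-1:]  # keep only the most recent completed interval
    return [d.strftime("%Y-%m-%dT%H:%M:%S") for d in dates]


start = datetime(2016, 5, 29, 8, 30)
now = datetime(2016, 6, 1, 12, 0)
day = timedelta(days=1)
print(runs_to_schedule(start, now, day, catchup=True))
print(runs_to_schedule(start, now, day, catchup=False))
```

With catchup enabled, every missed interval since `start_date` is scheduled; with it disabled, only the latest one is.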

---
####  Airflow: external files

* Keys and values can be added in the web UI under Admin -> Variables,
  and then read in code with `Variable.get(key)`
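
Variables are stored as strings, so structured settings are often saved as JSON and decoded after reading. A minimal sketch, assuming a Variable whose value is a JSON object; the helper names and the sample value are made up for illustration:

```python
import json


def parse_variable(raw_value):
    """Decode a Variable value that holds a JSON document."""
    return json.loads(raw_value)


def read_settings(key):
    """Fetch a Variable by key and decode it as JSON."""
    # Imported lazily so this file also loads where Airflow is not installed.
    from airflow.models import Variable
    return parse_variable(Variable.get(key))


# Offline check of the parsing step with a made-up value:
print(parse_variable('{"input_dir": "/data/in", "batch_size": 100}'))
```

`Variable.get(key, deserialize_json=True)` does the decoding in one call; the split above only makes the two steps visible.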
 

---
#### Airflow account management



> pip install flask-bcrypt  

~~~
$ python
Python 2.7.9 (default, Feb 10 2015, 03:28:08)
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = 'new_user_email@example.com'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()
~~~

Then use the web UI to change the permissions of the newly created account (Admin -> Users).

Finally, check the following in your airflow.cfg file:  
[webserver]  
authenticate = True  
auth_backend = airflow.contrib.auth.backends.password_auth  



---  

### SUBDAG usage  

* The sub-DAG can be written in a separate file:  

~~~
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder: the original note leaves `param` undefined.
param = 'value'


def func(param):
    # Task logic is elided in the original note.
    ...


def subdag(parent_dag_name, child_dag_name, args):
    dag_subdag = DAG(
        # child_dag_name must match the task_id of the SubDagOperator
        # that wraps this DAG in the parent file.
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )

    test1 = PythonOperator(
        task_id='sleep_for_1',
        python_callable=func,
        op_args=[param],
        #op_kwargs={'random_base': float(i) / 10},
        dag=dag_subdag,
    )

    test2 = PythonOperator(
        task_id='sleep_for_2',
        python_callable=func,
        op_args=[param],
        #op_kwargs={'random_base': float(i) / 10},
        dag=dag_subdag,
    )

    # The function must return the DAG.
    return dag_subdag
~~~
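
The naming rule in the comment above is easy to get wrong, so here is a small standalone sketch of the check. The helper names are hypothetical and the code does not import Airflow; it only mirrors the `parent.child` convention used for `dag_id`.

```python
def expected_subdag_id(parent_dag_id, task_id):
    """Build the dag_id a sub-DAG is expected to carry."""
    return "%s.%s" % (parent_dag_id, task_id)


def check_subdag_id(parent_dag_id, task_id, subdag_dag_id):
    """Raise ValueError if the sub-DAG's dag_id breaks the convention."""
    expected = expected_subdag_id(parent_dag_id, task_id)
    if subdag_dag_id != expected:
        raise ValueError(
            "subdag dag_id should be %r, got %r" % (expected, subdag_dag_id)
        )
    return True


print(check_subdag_id(
    "example_subdag_operator",
    "section-1",
    "example_subdag_operator.section-1",
))
```

Passing the same string as both `child_dag_name` and the operator's `task_id`, as the parent file in this note does, keeps the two in sync.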






* The file where the main DAG is built


~~~ 
import airflow
# Import the subdag factory from the file shown above.
from airflow.example_dags.subdags.subdag import subdag
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator


DAG_NAME = 'example_subdag_operator'

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

with DAG(dag_id=DAG_NAME, default_args=args, schedule_interval="@once") as dag:

    start = DummyOperator(
        task_id='start',
        default_args=args,
        dag=dag,
    )

    # Wrap the sub-DAG in a SubDagOperator; task_id matches child_dag_name.
    section_1 = SubDagOperator(
        task_id='section-1',
        subdag=subdag(DAG_NAME, 'section-1', args),
        default_args=args,
        dag=dag,
    )

    some_other_task = DummyOperator(
        task_id='some-other-task',
        default_args=args,
        dag=dag,
    )

    # Second SubDagOperator, built the same way.
    section_2 = SubDagOperator(
        task_id='section-2',
        subdag=subdag(DAG_NAME, 'section-2', args),
        default_args=args,
        dag=dag,
    )

    end = DummyOperator(
        task_id='end',
        default_args=args,
        dag=dag,
    )

    # Chain the tasks in order.
    start >> section_1 >> some_other_task >> section_2 >> end
~~~