Can you add a slightly more realistic example of a data pipeline in the cloud? #55

r39132 · 2015-06-21T23:58:07Z

For example, the Luigi example walks through a case that involves importing data into a DB. It would be cool if there were some examples that read from one location (e.g. S3) and wrote to another (e.g. DB).

artwr · 2015-06-22T03:57:25Z

We have an operator that transfers files from S3 to Hive, as well as a Hive-to-MySQL operator, which is unfortunately missing from the documentation. Since the hooks are already in place, an S3-to-Postgres operator should not be too difficult to write. Here is an example of how our S3-to-Hive operator works:

from collections import OrderedDict
from airflow.operators import S3ToHiveTransfer

# Assumes a DAG object named `dag` has already been defined.
S3ToHiveTransfer(
    task_id='s3_to_hive',
    # Templated: {{ ds }} is filled in with the run's date.
    s3_key='s3://bucket-name/user_list_{{ ds }}.tsv',
    # Column names and Hive types, in file order.
    field_dict=OrderedDict([
        ("user_id", "BIGINT"),
        ("first_name", "STRING"),
        ("last_name", "STRING"),
        ("registered_at", "TIMESTAMP"),
    ]),
    hive_table='{{ params.db_name }}.user_list_{{ ds_nodash }}',
    create=True,
    recreate=True,
    delimiter='\t',
    s3_conn_id='s3_connection_name',
    dag=dag)
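
For readers new to the templated fields in the operator call above, here is a rough sketch of what `s3_key` and `hive_table` would look like once rendered for a single run. It assumes `ds` expands to the run date as `YYYY-MM-DD` and `ds_nodash` to `YYYYMMDD`. The `render` helper and the `staging` database name are hypothetical stand-ins for illustration only; Airflow itself renders these fields with Jinja, not with string replacement.

```python
from datetime import date


def render(template, execution_date, params):
    """Hypothetical stand-in for template rendering.

    Handles only the three placeholders used in the operator call.
    """
    values = {
        "{{ ds }}": execution_date.isoformat(),
        "{{ ds_nodash }}": execution_date.strftime("%Y%m%d"),
        "{{ params.db_name }}": params["db_name"],
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


run_date = date(2015, 6, 21)
params = {"db_name": "staging"}  # assumed value, not from the thread

s3_key = render("s3://bucket-name/user_list_{{ ds }}.tsv", run_date, params)
hive_table = render("{{ params.db_name }}.user_list_{{ ds_nodash }}", run_date, params)

print(s3_key)      # s3://bucket-name/user_list_2015-06-21.tsv
print(hive_table)  # staging.user_list_20150621
```

Each daily run therefore reads a different file and writes a different table, which is what makes the task safe to re-run for a given date.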

Usually, the transfer is a separate operation from any transform and has its own operator.
I hope this helps.
Best,
Arthur

mistercrunch · 2015-06-22T04:27:33Z

Agreed, the current tutorial is really just focused on the mechanics of Airflow, with very foobar-y examples. I didn't want to write a pipeline that was too stack-specific (MySQL / Hive / ...) and wanted to make sure it would work for anyone, regardless of the stack they might have.

Maybe using a SqliteOperator to do some analytics on some data scraped from the Internet would be a good example. It could be interesting to re-write the Luigi example for comparison :)
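
As a rough idea of what such a SQLite-based example could cover, here is a minimal sketch using only Python's standard `sqlite3` module, not Airflow. It keeps the load step separate from the analytics query, the way a pipeline would split them into two tasks. The sample rows, the `user_list` table, and both function names are made up for illustration; a real example would fetch the data and wrap each step in an operator.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for scraped data; a real pipeline would fetch this.
RAW_TSV = (
    "user_id\tfirst_name\tregistered_at\n"
    "1\tAda\t2015-06-20\n"
    "2\tGrace\t2015-06-21\n"
    "3\tAlan\t2015-06-21\n"
)


def load(conn, raw_tsv):
    """Transfer step: copy rows into a table without changing them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_list "
        "(user_id INTEGER, first_name TEXT, registered_at TEXT)"
    )
    rows = csv.DictReader(io.StringIO(raw_tsv), delimiter="\t")
    conn.executemany(
        "INSERT INTO user_list VALUES (:user_id, :first_name, :registered_at)",
        rows,
    )
    conn.commit()


def signups_per_day(conn):
    """Transform step: the kind of query a SQL task might run."""
    cursor = conn.execute(
        "SELECT registered_at, COUNT(*) FROM user_list "
        "GROUP BY registered_at ORDER BY registered_at"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, RAW_TSV)
daily = signups_per_day(conn)
print(daily)
```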

But yeah, it's on the TODO list.

r39132 · 2015-06-22T05:25:14Z

Cool. I'll close this for now.


r39132 closed this as completed Jun 22, 2015

mobuchowski pushed a commit to mobuchowski/airflow that referenced this issue Jan 4, 2022

Bump marquez-python to 0.7.3 (apache#55)

dfc7a89

Signed-off-by: wslulciuc <willy@datakin.com>
