-
Notifications
You must be signed in to change notification settings - Fork 16.7k
Description
This github issue is meant to track the big roadmap items ahead of use and for the community to provide ideas, feedback, steer and comment
Lineage Annotations
Airflow knows a lot about your job dependencies, but how much does it know about your data objects (databases, tables, reports, ...)? Lineage annotations would expose a way for users to define lineage between data objects explicitly and tie that to tasks and/or DAG. The framework may include hooks for programatic inference (HiveOperator could introspect the HQL and guess table names and lineage), and a way to override or complement these. The framework will most likely be fairly agnostic about your data objects, letting you namespace them however you want, and simply consider them as an array of parent/child relationship between strings. It may be nice to use dot notation and reserve the first part of the expression to object_type, allowing for color coding in a graph view and tying actions (links) and the like. This will course ship with nice graph visualization, browsing and navigation features.
Picklin'
stateless/codeless web servers and workers
Backfilll UI
Trigger a backfill from the UI
REST API
Essentially offering features similar to what is available in the CLI through a REST API, there may be some automagic solutions here that can figure out how to build REST specs from argparse, it would also insure consistency between the two
[done!]
Continuous integration with Travis-CI
Systematically run all unit tests against CDH and HDP on Python 2.7 and 3.5
Externally Triggered DAGs
Airflow currently assumes that you run your workflows on a fixed schedule interval. This is perfect for hourly, daily and weekly jobs. When thinking about "analytics as a service" and "analysis automation", many use cases are more of the "on demand" nature. For instance if you have a workflow that processes someone's genome when ordered to, or a workflow that builds a dataset on demand for data scientists based on parameters they provide, .... This requires a new class of DAGs that are triggered externally, not on a fixed schedule interval.