docs: Update whole OpenLineage Provider docs. (#37620)
kacpermuda committed Feb 28, 2024
1 parent 1eb3bfe commit fb65112
Showing 8 changed files with 693 additions and 103 deletions.
37 changes: 20 additions & 17 deletions airflow/providers/openlineage/provider.yaml
@@ -58,65 +58,68 @@ config:
openlineage:
description: |
This section applies settings for OpenLineage integration.
For backwards compatibility with `openlineage-python` one can still use
`openlineage.yml` file or `OPENLINEAGE_` environment variables. However, below
configuration takes precedence over those.
More in documentation - https://openlineage.io/docs/client/python#configuration.
More about configuration and its precedence can be found at
https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#transport-setup
options:
disabled:
description: |
Set this to true if you don't want OpenLineage to emit events.
Disable sending events without uninstalling the OpenLineage Provider by setting this to true.
type: boolean
example: ~
default: "False"
version_added: ~
disabled_for_operators:
description: |
Semicolon separated string of Airflow Operator names to disable
Exclude some Operators from emitting OpenLineage events by passing a string of semicolon separated
full import paths of Operators to disable.
type: string
example: "airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
default: ""
version_added: 1.1.0
namespace:
description: |
OpenLineage namespace
Set namespace that the lineage data belongs to, so that if you use multiple OpenLineage producers,
events coming from them will be logically separated.
version_added: ~
type: string
example: "food_delivery"
example: "my_airflow_instance_1"
default: ~
extractors:
description: |
Semicolon separated paths to custom OpenLineage extractors.
Register custom OpenLineage Extractors by passing a string of semicolon separated full import paths.
type: string
example: full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass
default: ~
version_added: ~
config_path:
description: |
Path to YAML config. This provides backwards compatibility to pass config as
`openlineage.yml` file.
Specify the path to the YAML configuration file.
This ensures backwards compatibility with passing config through the `openlineage.yml` file.
version_added: ~
type: string
example: ~
example: "full/path/to/openlineage.yml"
default: ""
transport:
description: |
OpenLineage Client transport configuration. It should contain type
and additional options per each type.
Pass OpenLineage Client transport configuration as a JSON string. It should contain the type of the
transport and additional options (different for each transport type). For more details see:
https://openlineage.io/docs/client/python/#built-in-transport-types
Currently supported types are:
* HTTP
* Kafka
* Console
* File
type: string
example: '{"type": "http", "url": "http://localhost:5000"}'
example: '{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}'
default: ""
version_added: ~
disable_source_code:
description: |
If disabled, OpenLineage events do not contain source code of particular
operators, like PythonOperator.
Disable the inclusion of source code in OpenLineage events by setting this to `true`.
By default, several Operators (e.g. Python, Bash) will include their source code in the events
unless disabled.
default: ~
example: ~
type: boolean
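
The options above can also be supplied through environment variables, since Airflow maps `AIRFLOW__{SECTION}__{KEY}` variables onto configuration options. Below is a minimal Python sketch, assuming the `transport`, `namespace` and `disabled_for_operators` options described in this file. The helper name `openlineage_env` and the sample values are illustrative only and are not part of the provider.

```python
import json

# Sample values for illustration; they mirror the `example:` fields above.
transport = {
    "type": "http",
    "url": "http://localhost:5000",
    "endpoint": "api/v1/lineage",
}
disabled_operators = [
    "airflow.operators.bash.BashOperator",
    "airflow.operators.python.PythonOperator",
]


def openlineage_env(transport, namespace, disabled_operators):
    """Build AIRFLOW__OPENLINEAGE__* variables (hypothetical helper)."""
    return {
        # `transport` is passed as a JSON string.
        "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
        "AIRFLOW__OPENLINEAGE__NAMESPACE": namespace,
        # Operators are listed as semicolon separated full import paths.
        "AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS": ";".join(disabled_operators),
    }


env = openlineage_env(transport, "my_airflow_instance_1", disabled_operators)
for key, value in env.items():
    print(f"{key}={value}")
```

The printed `KEY=value` pairs can be exported in the environment of the scheduler and workers. Serializing with `json.dumps` keeps the transport value valid JSON, which is easy to get wrong when quoting it by hand.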
@@ -15,4 +15,7 @@
specific language governing permissions and limitations
under the License.
.. _configuration:openlineage:

.. include:: ../exts/includes/providers-configurations-ref.rst
412 changes: 389 additions & 23 deletions docs/apache-airflow-providers-openlineage/guides/developer.rst

Large diffs are not rendered by default.

51 changes: 43 additions & 8 deletions docs/apache-airflow-providers-openlineage/guides/structure.rst
@@ -17,16 +17,51 @@
under the License.
Structure of OpenLineage Airflow integration
OpenLineage Airflow integration
--------------------------------------------

OpenLineage integration implements AirflowPlugin. This allows it to be discovered on Airflow start and
register Airflow Listener.
OpenLineage is an open framework for data lineage collection and analysis.
At its core it is an extensible specification that systems can use to interoperate with lineage metadata.
`Check out OpenLineage docs <https://openlineage.io/docs/>`_.

The listener is then called when certain events happen in Airflow - when DAGs or TaskInstances start, complete or fail.
For DAGs, the listener runs in Airflow Scheduler.
For TaskInstances, the listener runs on Airflow Worker.
Quickstart
==========

To instrument your Airflow instance with OpenLineage, see :ref:`guides/user:openlineage`.

To implement OpenLineage support for Airflow Operators, see :ref:`guides/developer:openlineage`.

What's in it for me?
=====================

The metadata collected can answer questions like:

- Why did a specific data transformation fail?
- What are the upstream sources feeding into a certain dataset?
- What downstream processes rely on this specific dataset?
- Is my data fresh?
- Can I identify the bottleneck in my data processing pipeline?
- How did the latest code change affect data processing times?
- How can I trace the cause of data inaccuracies in my report?
- How are data privacy and compliance requirements being managed through the data's lifecycle?
- Are there redundant data processes that can be optimized or removed?
- What data dependencies exist for this critical report?

Understanding complex inter-DAG dependencies and providing up-to-date runtime visibility into DAG execution can be challenging.
OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained
and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs.

For an OpenLineage backend that will receive events, you can use `Marquez <https://marquezproject.ai/>`_.

How it works under the hood?
=============================

OpenLineage integration implements `AirflowPlugin <https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/plugins.html>`_.
This allows it to be discovered on Airflow start and register
`Airflow Listener <https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/listeners.html>`_.

The ``OpenLineageListener`` is then called by Airflow when certain events happen - when DAGs or TaskInstances start, complete or fail.
For DAGs, the listener runs in Airflow Scheduler. For TaskInstances, the listener runs on Airflow Worker.

When a TaskInstance listener method gets called, the ``OpenLineageListener`` constructs metadata such as the event's unique ``run_id`` and event time.
Then, it tries to find valid Extractor for given operator. The Extractors are a framework
for external extraction of metadata from
Then, it tries to extract metadata from Airflow Operators as described in :ref:`extraction_precedence:openlineage`.
