docs: Update whole OpenLineage Provider docs.
kacpermuda committed Feb 23, 2024
1 parent 185e158 commit 5264af8
Showing 8 changed files with 709 additions and 102 deletions.
34 changes: 18 additions & 16 deletions airflow/providers/openlineage/provider.yaml
@@ -58,65 +58,67 @@ config:
openlineage:
description: |
This section applies settings for OpenLineage integration.
For backwards compatibility with `openlineage-python` one can still use
`openlineage.yml` file or `OPENLINEAGE_` environment variables. However, below
configuration takes precedence over those.
More in documentation - https://openlineage.io/docs/client/python#configuration.
More about configuration and its precedence can be found at
https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#transport-setup
options:
disabled:
description: |
Set this to true if you don't want OpenLineage to emit events.
Disable sending events without uninstalling the OpenLineage Provider by setting this to true.
type: boolean
example: ~
default: "False"
version_added: ~
disabled_for_operators:
description: |
Semicolon separated string of Airflow Operator names to disable
Exclude some Operators from emitting OpenLineage events by passing a string of semicolon separated
full import paths of Operators to disable.
type: string
example: "airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
default: ""
version_added: 1.1.0
namespace:
description: |
OpenLineage namespace
Set namespace that the lineage data belongs to, so that if you use multiple OpenLineage producers,
events coming from them will be logically separated.
version_added: ~
type: string
example: "food_delivery"
example: "my_airflow_instance_1"
default: ~
extractors:
description: |
Semicolon separated paths to custom OpenLineage extractors.
Register custom OpenLineage Extractors by passing a string of semicolon separated full import paths.
type: string
example: full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass
default: ~
version_added: ~
config_path:
description: |
Path to YAML config. This provides backwards compatibility to pass config as
Provide path to YAML config file. This provides backwards compatibility to pass config as
`openlineage.yml` file.
version_added: ~
type: string
example: ~
example: "full/path/to/openlineage.yml"
default: ""
transport:
description: |
OpenLineage Client transport configuration. It should contain type
and additional options per each type.
Pass OpenLineage Client transport configuration as JSON string. It should contain type of the
transport and additional options (different for each transport type). For more details see:
https://openlineage.io/docs/client/python/#built-in-transport-types
Currently supported types are:
* HTTP
* Kafka
* Console
* File
type: string
example: '{"type": "http", "url": "http://localhost:5000"}'
example: '{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}'
default: ""
version_added: ~
disable_source_code:
description: |
If disabled, OpenLineage events do not contain source code of particular
operators, like PythonOperator.
Disable including source code in OpenLineage events by setting this to true. Several Operators (e.g.
Python, Bash) include their source code in their OpenLineage events by default unless this is set.
default: ~
example: ~
type: boolean
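A minimal sketch of how the string-valued options above might be supplied and decoded. It relies on Airflow's general ``AIRFLOW__{SECTION}__{KEY}`` environment variable convention and reuses the example values from this file. The helpers ``split_paths`` and ``load_transport`` are hypothetical and are not part of the provider, whose own parsing may differ.

```python
import json
import os

# Airflow reads any option from AIRFLOW__{SECTION}__{KEY}; the values below
# reuse the examples given in provider.yaml.
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = json.dumps(
    {"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}
)
os.environ["AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS"] = (
    "airflow.operators.bash.BashOperator;airflow.operators.python.PythonOperator"
)


def split_paths(value: str) -> list[str]:
    """Hypothetical helper: split a semicolon separated option into import paths."""
    return [part.strip() for part in value.split(";") if part.strip()]


def load_transport(value: str) -> dict:
    """Hypothetical helper: decode the transport option, which is a JSON string."""
    if not value:
        return {}  # the option defaults to an empty string
    config = json.loads(value)
    # Assumption for this sketch: a missing "type" key is treated as an error.
    if "type" not in config:
        raise ValueError("transport configuration must contain a 'type' key")
    return config


transport = load_transport(os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"])
disabled = split_paths(os.environ["AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS"])
print(transport["type"], len(disabled))
```

The same semicolon separated format applies to the ``extractors`` option, so a helper like ``split_paths`` could be reused for it.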
@@ -15,4 +15,7 @@
specific language governing permissions and limitations
under the License.
.. _configuration:openlineage:

.. include:: ../exts/includes/providers-configurations-ref.rst
421 changes: 398 additions & 23 deletions docs/apache-airflow-providers-openlineage/guides/developer.rst

Large diffs are not rendered by default.

60 changes: 52 additions & 8 deletions docs/apache-airflow-providers-openlineage/guides/structure.rst
@@ -17,16 +17,60 @@
under the License.
Structure of OpenLineage Airflow integration
OpenLineage Airflow integration
--------------------------------------------

OpenLineage integration implements AirflowPlugin. This allows it to be discovered on Airflow start and
register Airflow Listener.
OpenLineage is an open framework for data lineage collection and analysis.
At its core is an extensible specification that systems can use to interoperate with lineage metadata.
`Check out OpenLineage docs <https://openlineage.io/docs/>`_.

The listener is then called when certain events happen in Airflow - when DAGs or TaskInstances start, complete or fail.
For DAGs, the listener runs in Airflow Scheduler.
For TaskInstances, the listener runs on Airflow Worker.
Quickstart
==========

To instrument your Airflow instance with OpenLineage, see :ref:`guides/user:openlineage`.

To implement OpenLineage support for Airflow Operators, see :ref:`guides/developer:openlineage`.

What's in it for me?
=====================

The metadata collected can answer questions like:

- Why did a specific data transformation fail?
- What are the upstream sources feeding into a certain dataset?
- What downstream processes rely on this specific dataset?
- Is my data fresh?
- Can I identify the bottleneck in my data processing pipeline?
- How did the latest code change affect data processing times?
- How can I trace the cause of data inaccuracies in my report?
- How are data privacy and compliance requirements being managed through the data's lifecycle?
- Are there redundant data processes that can be optimized or removed?
- What data dependencies exist for this critical report?

Understanding complex inter-DAG dependencies and providing up-to-date runtime visibility into DAG execution can be challenging.
OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained
and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs.

.. image:: https://openlineage.io/assets/images/af-schematic-ad8c295a182cb32b94ee27b96727fa98.svg
   :alt: airflow_lineage
   :width: 1792

For the OpenLineage backend that will receive events, you can use `Marquez <https://marquezproject.ai/>`_.

.. image:: https://marquezproject.ai/img/screenshot.png
   :alt: marquez_lineage
   :width: 1440
   :align: center

How it works under the hood?
=============================

The OpenLineage integration implements `AirflowPlugin <https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/plugins.html>`_.
This allows it to be discovered when Airflow starts and to register an
`Airflow Listener <https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/listeners.html>`_.

The ``OpenLineageListener`` is then called by Airflow when certain events happen: when DAGs or TaskInstances start, complete or fail.
For DAGs, the listener runs in the Airflow Scheduler. For TaskInstances, the listener runs on the Airflow Worker.

When a TaskInstance listener method gets called, the ``OpenLineageListener`` constructs metadata such as the event's unique ``run_id`` and event time.
Then, it tries to find valid Extractor for given operator. The Extractors are a framework
for external extraction of metadata from
Then, it tries to extract metadata from Airflow Operators as described in :ref:`extraction_precedence:openlineage`.
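The flow described above can be pictured with a toy model. This is only an illustrative sketch and not the provider's code: ``ToyOperator``, ``ToyExtractor``, the ``EXTRACTORS`` registry and ``on_task_instance_running`` are all hypothetical names, and the real listener and extraction precedence are more involved.

```python
import uuid
from datetime import datetime, timezone


class ToyOperator:
    """Hypothetical stand-in for an Airflow Operator."""


class ToyExtractor:
    """Hypothetical extractor that returns fixed metadata for ToyOperator."""

    def extract(self, operator):
        return {"inputs": ["raw_orders"], "outputs": ["clean_orders"]}


# Hypothetical registry mapping operator class names to extractors.
EXTRACTORS = {"ToyOperator": ToyExtractor()}


def on_task_instance_running(operator):
    """Build the event skeleton, then try to extract operator metadata."""
    event = {
        # Random identifier for this sketch; the provider derives its own.
        "run_id": str(uuid.uuid4()),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    extractor = EXTRACTORS.get(type(operator).__name__)
    if extractor is not None:
        event.update(extractor.extract(operator))
    else:
        # No matching extractor: the event carries no inputs or outputs.
        event.update({"inputs": [], "outputs": []})
    return event


event = on_task_instance_running(ToyOperator())
```

The point of the sketch is the ordering: event identity and time are built first, and operator-specific metadata is attached only if something knows how to extract it.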