Creating a new community provider

This document gathers the necessary steps to create a new community provider, as well as guidelines for updating existing ones. Be aware that individual providers may have particularities that are not covered in this guide. The sequence described was designed to follow the most linear flow possible for developing a new provider.

Another helpful recommendation is to look for an existing provider that works similarly to yours; it can serve as a reference when setting up tests and other dependencies.

First, you need to set up your local development environment. See Contribution Quick Start if you have not set up your local environment yet. We recommend using breeze to develop locally: this way you will easily have an environment very similar to the one used by the GitHub CI workflow.

./breeze

Using the command above you will set up Docker containers. These containers mount your local code into internal volumes. In this way, the changes made in your IDE are immediately applied to the code inside the container and tests can be carried out quickly.

In this how-to guide our example provider name will be <NEW_PROVIDER>. When you see this placeholder, replace it with your provider name.

Initial Code and Unit Tests

Most likely you have developed a version of the provider using some local customization and now you need to transfer this code to the Airflow project. The initial code structure that the provider may need is described below. Understand that not all providers will need all the components described in this structure. If you still have doubts about building your provider, we recommend that you read the initial provider guide and open an issue on GitHub so the community can help you.

The following folders are optional: example_dags, hooks, links, logs, notifications, operators, secrets, sensors, transfers, triggers, waiters (and the list changes continuously).

airflow/
├── providers/<NEW_PROVIDER>/
│   ├── __init__.py
│   ├── example_dags/
│   │   ├── __init__.py
│   │   └── example_<NEW_PROVIDER>.py
│   ├── executors/
│   │   ├── __init__.py
│   │   └── <NEW_PROVIDER>.py
│   ├── hooks/
│   │   ├── __init__.py
│   │   └── <NEW_PROVIDER>.py
│   ├── operators/
│   │   ├── __init__.py
│   │   └── <NEW_PROVIDER>.py
....
│   ├── transfers/
│   │   ├── __init__.py
│   │   └── <NEW_PROVIDER>.py
│   └── triggers/
│       ├── __init__.py
│       └── <NEW_PROVIDER>.py
└── tests/providers/<NEW_PROVIDER>/
    ├── __init__.py
    ├── executors/
    │   ├── __init__.py
    │   └── test_<NEW_PROVIDER>.py
    ├── hooks/
    │   ├── __init__.py
    │   └── test_<NEW_PROVIDER>.py
    ├── operators/
    │   ├── __init__.py
    │   ├── test_<NEW_PROVIDER>.py
    │   └── test_<NEW_PROVIDER>_system.py
    ...
    ├── transfers/
    │   ├── __init__.py
    │   └── test_<NEW_PROVIDER>.py
    └── triggers/
        ├── __init__.py
        └── test_<NEW_PROVIDER>.py

Considering that you have already transferred your provider's code to the above structure, it will now be necessary to create unit tests for each component you created. In the example below, an environment has already been set up using breeze, and unit tests are run for the Hook.

root@fafd8d630e46:/opt/airflow# python -m pytest tests/providers/<NEW_PROVIDER>/hooks/test_<NEW_PROVIDER>.py
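
A typical unit test mocks the external service instead of calling it. Below is a minimal sketch, assuming the hypothetical NewProviderHook from the earlier example; the module path with the <NEW_PROVIDER> placeholder must be replaced with your real provider path.

from unittest import mock

from airflow.models import Connection
from airflow.providers.<NEW_PROVIDER>.hooks.<NEW_PROVIDER> import NewProviderHook


class TestNewProviderHook:
    @mock.patch.object(NewProviderHook, "get_connection")
    def test_get_conn_uses_connection(self, mock_get_connection):
        # Return a fake Airflow connection instead of reaching out to the metadata DB.
        mock_get_connection.return_value = Connection(
            conn_id="new_provider_default",
            conn_type="new_provider",
            host="example.io",
            login="user",
            password="secret",
        )

        hook = NewProviderHook()
        client = hook.get_conn()

        assert client["host"] == "example.io"
        mock_get_connection.assert_called_once_with("new_provider_default")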

Integration tests

See Airflow Integration Tests

Documentation

An important part of building a new provider is the documentation. Some documentation steps are performed automatically by pre-commit; see the Installing pre-commit guide.

├── INSTALL
├── CONTRIBUTING.rst
├── setup.py
├── airflow/
│   └── providers/
│       └── <NEW_PROVIDER>/
│           ├── provider.yaml
│           └── CHANGELOG.rst
│
└── docs/
    ├── spelling_wordlist.txt
    ├── apache-airflow/
    │   └── extra-packages-ref.rst
    ├── integration-logos/<NEW_PROVIDER>/
    │   └── <NEW_PROVIDER>.png
    └── apache-airflow-providers-<NEW_PROVIDER>/
        ├── index.rst
        ├── commits.rst
        ├── connections.rst
        └── operators/
            └── <NEW_PROVIDER>.rst

Files automatically updated by pre-commit:

  • INSTALL in provider

Files automatically created when the provider is released:

  • docs/apache-airflow-providers-<NEW_PROVIDER>/commits.rst
  • airflow/providers/<NEW_PROVIDER>/CHANGELOG.rst

There is a chance that your provider's name is not a common English word. In this case it is necessary to add it to the file docs/spelling_wordlist.txt. This file begins with a block of capitalized words, followed by a block of lowercase words.

Namespace
Neo4j
Nextdoor
<NEW_PROVIDER> (new line)
Nones
NotFound
Nullable
...
neo4j
neq
networkUri
<NEW_PROVIDER> (new line)
nginx
nobr
nodash

Add your provider dependencies into provider.yaml under the dependencies key. If your provider doesn't have any dependencies, add an empty list.
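
For illustration only (the minimum Airflow version and the client library name are placeholders), the dependencies section could look like this:

dependencies:
  - apache-airflow>=2.4.0
  - new-provider-python-client>=1.0.0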

In the docs/apache-airflow-providers-<NEW_PROVIDER>/connections.rst:

  • add information on how to configure a connection for your provider.

In the docs/apache-airflow-providers-<NEW_PROVIDER>/operators/<NEW_PROVIDER>.rst:

  • add information on how to use the Operator. It is important to add examples and additional information if your Operator has extra parameters.

    .. _howto/operator:NewProviderOperator:
    
    NewProviderOperator
    ===================
    
    Use the :class:`~airflow.providers.<NEW_PROVIDER>.operators.NewProviderOperator` to do something
    amazing with Airflow!
    
    Using the Operator
    ^^^^^^^^^^^^^^^^^^
    
    The NewProviderOperator requires a ``connection_id`` and this other awesome parameter.
    You can see an example below:
    
    .. exampleinclude:: /../../airflow/providers/<NEW_PROVIDER>/example_dags/example_<NEW_PROVIDER>.py
        :language: python
        :start-after: [START howto_operator_<NEW_PROVIDER>]
        :end-before: [END howto_operator_<NEW_PROVIDER>]
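
The exampleinclude directive above pulls in the code between the START and END markers from the example DAG file. A minimal, hypothetical example_dags/example_<NEW_PROVIDER>.py with those markers could look like the sketch below (the operator class and its parameters are placeholders):

from __future__ import annotations

from datetime import datetime

from airflow import DAG

# Placeholder import: adjust the module path and class name to your provider.
from airflow.providers.<NEW_PROVIDER>.operators.<NEW_PROVIDER> import NewProviderOperator

with DAG(
    dag_id="example_<NEW_PROVIDER>",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    tags=["example"],
) as dag:
    # [START howto_operator_<NEW_PROVIDER>]
    do_something_amazing = NewProviderOperator(
        task_id="do_something_amazing",
        connection_id="new_provider_default",
    )
    # [END howto_operator_<NEW_PROVIDER>]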

Copy the docs from another, similar provider: docs/apache-airflow-providers-<NEW_PROVIDER>/*.rst.

At least these docs should be present:

  • security.rst
  • changelog.rst
  • commits.rst
  • index.rst
  • installing-providers-from-sources.rst
  • configurations-ref.rst - if your provider has a config element in provider.yaml with configuration options specific to your provider

Make sure to update/add all information that is specific to the new provider.

In airflow/providers/<NEW_PROVIDER>/provider.yaml, add the information about your provider:

package-name: apache-airflow-providers-<NEW_PROVIDER>
name: <NEW_PROVIDER>
description: |
  `<NEW_PROVIDER> <https://example.io/>`__
versions:
  - 1.0.0

integrations:
  - integration-name: <NEW_PROVIDER>
    external-doc-url: https://www.example.io/
    logo: /integration-logos/<NEW_PROVIDER>/<NEW_PROVIDER>.png
    how-to-guide:
      - /docs/apache-airflow-providers-<NEW_PROVIDER>/operators/<NEW_PROVIDER>.rst
    tags: [service]

operators:
  - integration-name: <NEW_PROVIDER>
    python-modules:
      - airflow.providers.<NEW_PROVIDER>.operators.<NEW_PROVIDER>

hooks:
  - integration-name: <NEW_PROVIDER>
    python-modules:
      - airflow.providers.<NEW_PROVIDER>.hooks.<NEW_PROVIDER>

sensors:
  - integration-name: <NEW_PROVIDER>
    python-modules:
      - airflow.providers.<NEW_PROVIDER>.sensors.<NEW_PROVIDER>

connection-types:
  - hook-class-name: airflow.providers.<NEW_PROVIDER>.hooks.<NEW_PROVIDER>.NewProviderHook
    connection-type: provider-connection-type

hook-class-names:  # deprecated in Airflow 2.2.0
  - airflow.providers.<NEW_PROVIDER>.hooks.<NEW_PROVIDER>.NewProviderHook

Note

Defining your own connection types

You only need to add connection-types in case you have hooks with customized UI behavior. However, it is only supported as of Airflow 2.2.0. If your provider also targets Airflow versions below 2.2.0, you should additionally provide the deprecated hook-class-names array. The connection-types array allows for optimized importing of individual connections, and while Airflow 2.2.0 is able to handle both definitions, connection-types is recommended.

For more information see Custom connection types
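
As a hedged illustration of what "customized UI behavior" means, a hook can expose a get_ui_field_behaviour classmethod that the Airflow UI consults when rendering the connection form for the hook's connection type. The concrete field names below are placeholders:

from __future__ import annotations

from typing import Any

from airflow.hooks.base import BaseHook


class NewProviderHook(BaseHook):
    # Must match the connection-type declared in provider.yaml.
    conn_type = "provider-connection-type"
    hook_name = "NewProvider"

    @classmethod
    def get_ui_field_behaviour(cls) -> dict[str, Any]:
        """Customize how the connection form is rendered for this connection type."""
        return {
            "hidden_fields": ["schema", "port"],
            "relabeling": {"login": "Access Key", "password": "Secret Key"},
            "placeholders": {"host": "https://example.io"},
        }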

After changing and creating these files you can build the documentation locally with the two commands below. The first builds your provider's documentation. The second ensures that the main Airflow documentation, which involves some steps with the providers, still builds correctly.

breeze build-docs --package-filter apache-airflow-providers-<NEW_PROVIDER>
breeze build-docs --package-filter apache-airflow

Suspending providers

As of April 2023, we have the possibility to suspend individual providers, so that they do not hold back dependencies for Airflow and other providers. The process of suspending providers is described in the description of the process.

Technically, suspending a provider is done by setting suspended: true in the provider.yaml of the provider. This should be followed by committing the change and running the pre-commit checks (automatically or manually), which will either update derived configuration files or ask you to update them manually. Note that you might need to run pre-commit several times until all the static checks pass, because a modification from one pre-commit hook might impact other pre-commit hooks.
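
For illustration, the whole change in the provider's provider.yaml amounts to adding a single flag:

suspended: true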

If you have pre-commit installed, it will run automatically on commit. If you want to run it manually after a commit, you can run it via breeze static-checks --last-commit. Some of the checks might fail because suspension of the provider might cause changes in the dependencies, so if you see errors about missing dependency imports, non-usable classes etc., you will need to build the CI image locally via breeze build-image --python 3.8 --upgrade-to-newer-dependencies after the first pre-commit run and then run the static checks again.

If you want to be absolutely sure to run all static checks you can always do this via pre-commit run --all-files or breeze static-checks --all-files.

Some of the manual modifications you will have to do (in both cases pre-commit will guide you on what to do):

  • You will have to run breeze setup regenerate-command-images to regenerate breeze help files
  • You will need to update extra-packages-ref.rst and, in some cases - when mentioned there explicitly - setup.py, to remove the provider from the list of dependencies.

What happens under the hood as a result is that the generated/providers.json file is updated with the information about available providers and their dependencies, and it is used by our tooling to exclude suspended providers from all relevant parts of the build and CI system (such as building the CI image with dependencies, building documentation, running tests, etc.).

Additional changes needed for cross-dependent providers

The steps above are usually enough for most providers that are "standalone" and not imported or used by other providers (in most cases we will not suspend such providers). However, some extra steps might be needed for providers that are used by other providers, or that are part of the default PROD Dockerfile:

  • Most of the tests for the suspended provider will be automatically excluded by pytest collection. However, in case a provider is depended on by another provider, the relevant tests might fail to be collected or run by pytest. In such cases you should skip the whole test module that fails to be collected by adding pytest.importorskip at the top of the test module. For example, if your tests fail because they need to import apache.airflow.providers.google and you have suspended it, you should add this line at the top of the test module that fails.

Example failing collection after google provider has been suspended:

_____ ERROR collecting tests/providers/apache/beam/operators/test_beam.py ______
ImportError while importing test module '/opt/airflow/tests/providers/apache/beam/operators/test_beam.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/providers/apache/beam/operators/test_beam.py:25: in <module>
    from airflow.providers.apache.beam.operators.beam import (
airflow/providers/apache/beam/operators/beam.py:35: in <module>
    from airflow.providers.google.cloud.hooks.dataflow import (
airflow/providers/google/cloud/hooks/dataflow.py:32: in <module>
    from google.cloud.dataflow_v1beta3 import GetJobRequest, Job, JobState, JobsV1Beta3AsyncClient, JobView
E   ModuleNotFoundError: No module named 'google.cloud.dataflow_v1beta3'
_ ERROR collecting tests/providers/microsoft/azure/transfers/test_azure_blob_to_gcs.py _

The fix is to add this line at the top of the tests/providers/apache/beam/operators/test_beam.py module:

pytest.importorskip("apache.airflow.providers.google")
  • Some of the other providers might also unconditionally import the suspended provider and will fail during the provider verification step in CI. In this case you should turn the provider imports into conditional imports. For example, when an import fails after the amazon provider has been suspended:

    Traceback (most recent call last):
      File "/opt/airflow/scripts/in_container/verify_providers.py", line 266, in import_all_classes
        _module = importlib.import_module(modinfo.name)
      File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name, package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/usr/local/lib/python3.8/site-packages/airflow/providers/mysql/transfers/s3_to_mysql.py", line 23, in <module>
        from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    ModuleNotFoundError: No module named 'airflow.providers.amazon'

or:

Error: The airflow.providers.microsoft.azure.transfers.azure_blob_to_gcs object in transfers list in airflow/providers/microsoft/azure/provider.yaml does not exist or is not a module: No module named 'gcloud.aio.storage'

The fix for that is to turn the feature into an optional provider feature (in the place where the excluded airflow.providers import happens):

try:
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
except ImportError as e:
    from airflow.exceptions import AirflowOptionalProviderFeatureException

    raise AirflowOptionalProviderFeatureException(e)
  • In case we suspend an important provider which is part of the default Dockerfile, you might want to update the tests for the PROD docker image in docker_tests/test_prod_image.py.
  • Some of the suspended providers might also fail breeze unit tests that expect a fixed set of providers. Those tests should be adjusted (but this is not very likely to happen, because the tests are using only the most common providers that we will not be likely to suspend).

Resuming providers

Resuming a provider is done by reverting the original change that suspended it. In case changes are needed to fix problems in the resumed provider, our CI will detect them and you will have to fix them as part of the PR reverting the suspension.

Removing providers

When removing providers from the Airflow code, we need to make one last release where we mark the provider as removed - in documentation and in the description of the PyPI package. In order to do that, the release manager has to add the "removed: true" flag in the provider yaml file and include the provider in the next wave of providers (and then remove all the code and documentation related to the provider).

The "removed: true" flag will cause the provider to be available for the following commands (note that such provider has to be explicitly added as selected to the package - such provider will not be included in the available list of providers):

  • breeze build-docs
  • breeze release-management prepare-provider-documentation
  • breeze release-management prepare-provider-packages
  • breeze release-management publish-docs

For all those commands, the release manager needs to specify such a to-be-removed provider explicitly as an extra argument during the release process. Except for the changelog, which needs to be maintained manually, all other documentation (the main page of the provider documentation, the PyPI README) will be automatically updated to include the removal notice.