Add airflow package catalog connector #2437

Merged
merged 7 commits into from Feb 7, 2022

ptitzler
Member

@ptitzler ptitzler commented Feb 2, 2022

This PR adds a catalog connector for Apache Airflow packages. Connector instances (of which there should typically be only one) require the user to configure a download URL for the Apache Airflow package that is used in the cluster.

Requires #2418

What changes were proposed in this pull request?

  • Add a new catalog-connectors directory to the repository root; the newly introduced connector lives in its airflow subdirectory
  • Update the Makefile to include a new lint-connectors target, which was also added as a dependency of the lint task
  • Add a new lint-connectors task to GitHub's build.yaml
  • Update installation topic in documentation
  • Add connector to extras install dependency in setup.py (The connector is also installed if one runs pip install elyra[all])

Notes:

  • The connector is not included in the Elyra release process and needs to be published independently, as necessary.

  • The connector declares the file name (e.g. operators/bash_operator.py) as the hash key, which is used internally to identify operators in the palette. The archive name, e.g. apache_airflow-1.10.15-py2.py3-none-any.whl, is currently not part of the key, to avoid potential versioning issues. For example, assume user A adds operators from archive apache_airflow-1.10.15-py2.py3-none-any.whl to the Elyra deployment and creates a pipeline using some of the operators. User B adds operators from an older archive, such as apache_airflow-1.10.12-py2.py3-none-any.whl. If we were to include the archive name as-is in the key, user B would not be able to run pipelines that user A created (and vice versa) because (pseudo code)

    "apache_airflow-1.10.15-py2.py3-none-any.whl:operators/bash_operator.py:BashOperator" != "apache_airflow-1.10.12-py2.py3-none-any.whl:operators/bash_operator.py:BashOperator"
    

    We need to decide whether the implemented behavior is sufficient or whether semver support is required. With the implemented behavior, archive version numbers are ignored completely, which might lead to issues if the operator signatures loaded in Elyra's pipeline editor differ significantly from those of the operators installed in the Airflow cluster. Semver support could be accomplished by using only parts of the archive name as the key, e.g. by omitting or masking the minor and patch version numbers. That approach does require archive names to follow a consistent naming pattern so that version strings can be extracted.
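
The two keying options discussed above can be sketched roughly as follows. This is an illustration only, not the connector's code: `operator_key`, `masked_archive_name`, the `:` separator, and the `.x.x` masking format are all hypothetical, and the regular expression assumes archive names of the form `<distribution>-<major>.<minor>.<patch>-<tags>.whl`.

```python
import re
from typing import Optional

# Assumption: archive names start with "<distribution>-<major>.<minor>.<patch>".
_ARCHIVE_RE = re.compile(
    r"^(?P<dist>[A-Za-z0-9_.]+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)


def masked_archive_name(archive_name: str) -> str:
    """Return the archive name with minor and patch versions masked out."""
    match = _ARCHIVE_RE.match(archive_name)
    if match is None:
        # The masking approach only works if the naming pattern is followed.
        raise ValueError(f"Unexpected archive name: {archive_name!r}")
    return f"{match['dist']}-{match['major']}.x.x"


def operator_key(file_name: str, archive_name: Optional[str] = None) -> str:
    """Build a (hypothetical) hash key for an operator source file.

    Without an archive name the key is the file name alone, so the archive
    version is ignored. With an archive name, only the distribution name and
    major version become part of the key.
    """
    if archive_name is None:
        return file_name
    return f"{masked_archive_name(archive_name)}:{file_name}"


# Users A and B load different patch releases of the same major version.
key_a = operator_key(
    "operators/bash_operator.py", "apache_airflow-1.10.15-py2.py3-none-any.whl"
)
key_b = operator_key(
    "operators/bash_operator.py", "apache_airflow-1.10.12-py2.py3-none-any.whl"
)
print(key_a == key_b)
```

With masking, archives that differ only in minor or patch version map to the same key, while an archive with a different major version would produce a distinct key.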

How was this pull request tested?

  • Install connector from source, as documented in the connector's README
  • Enable the connector as documented in the connector's README
  • Review the installation topic in the 'getting started' guide

Notes:

  • There are unresolved Elyra Airflow component parser issues that need to be addressed before the Airflow 1.10.15 package can be used.
  • The connector should already support Airflow 2.x packages but they have not been tested because Elyra does not support Airflow 2.x.

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

@ptitzler ptitzler added the kind:enhancement and platform: pipeline-Airflow labels Feb 2, 2022
@ptitzler ptitzler added this to the 3.6.0 milestone Feb 2, 2022
@elyra-bot
elyra-bot bot commented Feb 2, 2022

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@ptitzler
Member Author

ptitzler commented Feb 2, 2022

This PR replaces #2409, which had issues.

@ptitzler ptitzler added the area:documentation label Feb 2, 2022
Member

@kiersten-stokes kiersten-stokes left a comment

Some NITs below, but otherwise this is looking good and working for me!

That being said, I'm also wondering what would be the best way to handle adding the relevant strings to the available_airflow_operators configurable trait for both the Airflow packages and the provider packages. (As a reminder, we use that list to render the appropriate import statement in the DAG template during processing).

I'll start looking into some options. @kevin-bates might have some ideas here as well
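
One possible direction, sketched under assumptions: if the entries needed for the import statement are dotted module paths, they could be derived from the file names the connector already uses as keys. The helper names `module_path_for` and `import_statement` and the `airflow` root package default are hypothetical and not part of Elyra.

```python
from pathlib import PurePosixPath


def module_path_for(file_name: str, root_package: str = "airflow") -> str:
    """Turn an archive-relative file name into a dotted module path.

    Assumption: the file name is relative to the root package directory,
    e.g. "operators/bash_operator.py".
    """
    path = PurePosixPath(file_name).with_suffix("")  # drop the ".py" suffix
    return ".".join((root_package, *path.parts))


def import_statement(file_name: str, class_name: str) -> str:
    """Render the import line a DAG template might need for an operator."""
    return f"from {module_path_for(file_name)} import {class_name}"


print(import_statement("operators/bash_operator.py", "BashOperator"))
```

Provider packages would presumably need a different `root_package` value, which is why it is a parameter here.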

@ptitzler ptitzler changed the title [HOLD] Add airflow package catalog connector Add airflow package catalog connector Feb 4, 2022
Member

@kiersten-stokes kiersten-stokes left a comment

Working well for me! I think the only thing left is to remove the catalog-connectors directory and its files.

{
"$schema": "https://raw.githubusercontent.com/elyra-ai/elyra/master/elyra/metadata/schemas/meta-schema.json",
"$id": "https://raw.githubusercontent.com/elyra-ai/elyra/master/catalog-connectors/airflow/airflow-package-catalog-connector/airflow-package-catalog.json",
"title": "Apache Airflow package operator catalog",
Member

same as above comment! Though I guess this won't really apply until we have the related documentation

Member Author

I added a README in the directory. Can you please take a look?

Member

@kiersten-stokes kiersten-stokes left a comment

LGTM!

Member

@kevin-bates kevin-bates left a comment

This looks good. My one comment concerns whether 'package' is useful in all the naming; dropping it might shorten things. If the comment is accepted, it affects naming in several other locations of this PR.

@akchinSTC akchinSTC merged commit 9de4439 into elyra-ai:master Feb 7, 2022
@ptitzler ptitzler deleted the add-package-connector branch February 8, 2022 00:04
Labels
area:documentation Improvements or additions to documentation kind:enhancement New feature or request platform: pipeline-Airflow Related to usage of Apache Airflow as pipeline runtime

4 participants