
Add Looker PDT operators #20882

Merged
12 commits merged on Mar 7, 2022
Conversation

alekseiloginov
Contributor

@alekseiloginov alekseiloginov commented Jan 14, 2022

Description

This PR adds new Airflow operators that use Looker's new Start/Stop API for Persistent Derived Tables (PDTs). These operators allow managing PDT materialization via the Looker API.

Currently supported operations are:

  • Start PDT build
  • Check PDT build status
  • Stop PDT build

Under the hood, these new operators use the Looker SDK, a Python client for the Looker API.

Note: Looker PDT operators are being released as part of the Google Cloud operators. While Looker is now part of GCP, it is still migrating to GCP infrastructure. For that reason the Google Cloud connection cannot be used yet, and a custom Looker-specific Airflow connection needs to be created.
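
For illustration, a minimal DAG using the new operators might look like the sketch below. Treat it as a hedged sketch, not the shipped example DAG: the connection id, model, and view names are placeholders, and parameter names should be checked against the merged code.

```python
# Sketch only: "your_looker_conn_id", "your_model", and "your_view" are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.looker import LookerStartPdtBuildOperator
from airflow.providers.google.cloud.sensors.looker import LookerCheckPdtBuildSensor

with DAG(
    dag_id="looker_pdt_example",
    start_date=datetime(2022, 3, 1),
    schedule_interval=None,
) as dag:
    # Synchronous mode: one task submits the build and waits until it finishes.
    build_sync = LookerStartPdtBuildOperator(
        task_id="build_pdt_sync",
        looker_conn_id="your_looker_conn_id",
        model="your_model",
        view="your_view",
    )

    # Asynchronous mode: submit the build, then poll its status with the sensor.
    build_async = LookerStartPdtBuildOperator(
        task_id="build_pdt_async",
        looker_conn_id="your_looker_conn_id",
        model="your_model",
        view="your_view",
        asynchronous=True,
    )
    check_build = LookerCheckPdtBuildSensor(
        task_id="check_pdt_build",
        looker_conn_id="your_looker_conn_id",
        materialization_id=build_async.output,
    )
    build_async >> check_build
```

In synchronous mode a single task covers start and status; in asynchronous mode the sensor polls the materialization id that the start task pushes to XCom.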

Changes

This PR adds a new Looker operator, sensor, hook, example DAGs, tests, docs, and logo.
It also updates provider and setup files.

New classes:

  • LookerStartPdtBuildOperator
    • submits a job to start a PDT build, in either synchronous or asynchronous mode
    • in synchronous mode, also checks the state of the submitted PDT build
  • LookerCheckPdtBuildSensor
    • checks the state of a PDT build submitted in asynchronous mode
  • LookerHook
    • configures a connection to the Looker SDK using Airflow connection settings
    • handles communication between the operator/sensor and the Looker SDK
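
The synchronous wait handled by the hook can be illustrated with a minimal, self-contained polling loop. This is an illustrative stand-in, not the provider's actual implementation; the function and state names here are assumptions:

```python
import time


def wait_for_job(get_status, timeout=600, poll_interval=10):
    """Poll get_status() until a terminal state is reached or the timeout expires.

    get_status is any callable returning one of:
    'running', 'done', 'error', 'cancelled'.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == 'done':
            return status
        if status in ('error', 'cancelled'):
            raise RuntimeError(f"PDT build finished in state: {status}")
        time.sleep(poll_interval)  # build still running: wait, then poll again
    raise TimeoutError("PDT build did not finish before the timeout")


# Example: a stubbed status source that reports 'running' twice, then 'done'.
states = iter(['running', 'running', 'done'])
print(wait_for_job(lambda: next(states), poll_interval=0))  # prints "done"
```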

Tests

This PR adds the following unit tests:

  • TestLookerStartPdtBuildOperator
    • tests synchronous/asynchronous execution of LookerStartPdtBuildOperator
    • tests the cancel_on_kill flag, which specifies whether to cancel the PDT build when a task is manually killed
  • TestLookerCheckPdtBuildSensor
    • tests possible build states: done, error, wait, cancelled
  • TestLookerHook
    • tests wait_for_job used by LookerStartPdtBuildOperator in synchronous mode
    • tests check_pdt_build used to get a build status from Looker SDK
    • tests start_pdt_build used to start a build via Looker SDK
    • tests stop_pdt_build used to stop a build via Looker SDK
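
The sensor behaviour these tests exercise can be sketched as a small state check. This is a simplified stand-in for the sensor's poke logic (the real sensor raises Airflow-specific exceptions; the state names here mirror the test cases above):

```python
def poke(status: str) -> bool:
    """Return True when the build is done, False to keep waiting,
    and raise on terminal failure states."""
    if status == 'done':
        return True
    if status in ('error', 'cancelled'):
        raise RuntimeError(f"PDT build failed with status: {status}")
    # 'wait' (or any other in-progress status): not done yet, poke again later.
    return False


print(poke('done'))  # prints "True"
print(poke('wait'))  # prints "False"
```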

Note: There are no system tests since Looker doesn't have an automated process to create a test environment.

Documentation

The main documentation lives in the looker.rst file, as well as in docstrings.

looker.rst contains:

  • High level information about Looker and Looker API
  • Prerequisite tasks with examples
  • How to start a PDT materialization job

Screenshots

Example of using Looker PDT operator in synchronous mode (start and status are combined in one task)
[screenshot]

Example of using Looker PDT operator in asynchronous mode (start and status are separate tasks)
[screenshot]

@boring-cyborg boring-cyborg bot added area:providers kind:documentation provider:google Google (including GCP) related issues labels Jan 14, 2022
@boring-cyborg

boring-cyborg bot commented Jan 14, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Member

potiuk commented Jan 22, 2022

Can you please rebase to latest main ?

@alekseiloginov alekseiloginov force-pushed the add-looker-pdt-operators branch 2 times, most recently from 41bd24d to ace75c3 Compare January 24, 2022 16:59
@alekseiloginov
Contributor Author

Can you please rebase to latest main ?

Thanks for the reminder @potiuk, rebased!

@potiuk potiuk closed this Jan 29, 2022
@potiuk potiuk reopened this Jan 29, 2022
@potiuk
Member

potiuk commented Jan 29, 2022

I think another rebase (and fixes to the failing checks) is needed

airflow/providers/google/cloud/operators/looker.py
from airflow.providers.google.cloud.hooks.looker import LookerHook


class LookerStartPdtBuildOperator(BaseOperator):
Contributor

We are in the process of adding links to the GCP resources for most of the operators. Maybe this is a good time to add them to the Looker operators as well.

You can find an example of how to do this in the Dataproc Metastore PR:
#21267

Contributor Author

Got it! Is it the same as the extra links?

Contributor

Exactly

@alekseiloginov alekseiloginov force-pushed the add-looker-pdt-operators branch 3 times, most recently from ea4b1ed to a8e65ec Compare February 16, 2022 14:28
@alekseiloginov alekseiloginov marked this pull request as ready for review February 16, 2022 14:57
@alekseiloginov alekseiloginov changed the title [WIP] Add Looker PDT operators Add Looker PDT operators Feb 24, 2022
class LookerHook(BaseHook):
"""Hook for Looker APIs."""

def __init__(
Member

Can you check if Looker is available for selection during connection setup?

Contributor Author

Not sure what you mean by "Looker is available for selection during connection setup". Do you mean the drop-down list with connection types (the Looker connection has the HTTP type)?

Member

Yes, I mean this drop-down list.
I think you should create a new connection type and then add new fields to the connection form (timeout, verify_ssl).

  conn_name_attr = 'snowflake_conn_id'
  default_conn_name = 'snowflake_default'
  conn_type = 'snowflake'
  hook_name = 'Snowflake'
  supports_autocommit = True

  @staticmethod
  def get_connection_form_widgets() -> Dict[str, Any]:
      """Returns connection widgets to add to connection form"""
      from flask_appbuilder.fieldwidgets import BS3TextFieldWidget
      from flask_babel import lazy_gettext
      from wtforms import BooleanField, StringField

      return {
          "extra__snowflake__account": StringField(lazy_gettext('Account'), widget=BS3TextFieldWidget()),
          "extra__snowflake__warehouse": StringField(
              lazy_gettext('Warehouse'), widget=BS3TextFieldWidget()
          ),
          "extra__snowflake__database": StringField(lazy_gettext('Database'), widget=BS3TextFieldWidget()),
          "extra__snowflake__region": StringField(lazy_gettext('Region'), widget=BS3TextFieldWidget()),
          "extra__snowflake__role": StringField(lazy_gettext('Role'), widget=BS3TextFieldWidget()),
          "extra__snowflake__insecure_mode": BooleanField(
              label=lazy_gettext('Insecure mode'), description="Turns off OCSP certificate checks"
          ),
      }

  @staticmethod
  def get_ui_field_behaviour() -> Dict[str, Any]:
      """Returns custom field behaviour"""
      import json

      return {
          "hidden_fields": ['port'],
          "relabeling": {},
          "placeholders": {
              'extra': json.dumps(
                  {
                      "authenticator": "snowflake oauth",
                      "private_key_file": "private key",
                      "session_parameters": "session parameters",
                  },
                  indent=1,
              ),
              'schema': 'snowflake schema',
              'login': 'snowflake username',
              'password': 'snowflake password',
              'extra__snowflake__account': 'snowflake account name',
              'extra__snowflake__warehouse': 'snowflake warehouse name',
              'extra__snowflake__database': 'snowflake db name',
              'extra__snowflake__region': 'snowflake hosted region',
              'extra__snowflake__role': 'snowflake role',
              'extra__snowflake__insecure_mode': 'insecure mode',
          },
      }

This will also allow you to rename the "Login" field to "Client ID" and the "Password" field to "Client Secret".
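
For Looker, the relabeling suggested here could look like the minimal, hypothetical stub below. The conn_type value, hidden fields, and placeholders are assumptions for illustration, not the merged implementation:

```python
class LookerConnectionUIStub:
    """Hypothetical sketch of a Looker connection type; not the provider's actual code."""

    conn_type = 'looker'  # assumed value
    hook_name = 'Looker'

    @staticmethod
    def get_ui_field_behaviour():
        return {
            "hidden_fields": ['schema'],
            "relabeling": {
                # Rename generic credential fields to Looker API terminology.
                'login': 'Client ID',
                'password': 'Client Secret',
            },
            "placeholders": {
                'host': 'https://your.looker.instance',
                'extra': '{"verify_ssl": true, "timeout": 120}',
            },
        }


behaviour = LookerConnectionUIStub.get_ui_field_behaviour()
print(behaviour["relabeling"]['login'])  # prints "Client ID"
```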

Contributor Author

I had the same thought at first, but then decided not to create a new connection type, since Looker is in the process of transitioning to GCP and should ideally use the GCP connection soon. I didn't want users to get used to a temporary Looker connection. Hope that makes sense?

Member

In the future, users will have to consciously migrate to the new authentication method, which is likely to take some time. As part of the migration, we can start accepting GCP connections and add "(Deprecated)" to the name of the old connection type. I don't see why we should make it difficult to configure a connection by not adding this connection type to the UI now.

If you don't want to add the ability to easily create connections via the UI, we can limit ourselves to documentation only, but we should make sure that everything is precisely described.

Contributor Author

I talked to @potiuk and he said that the next Google provider release is happening tomorrow. I will try to implement it by then, but we would really like to launch with this release to avoid waiting another month for the next one.

@alekseiloginov
Contributor Author

@potiuk - just rebased!

@alekseiloginov
Contributor Author

I see a dependency conflict for typing-extensions:

  #33 79.18 ERROR: Cannot install apache-airflow and apache-airflow[devel-ci]==2.3.0.dev0 because these package versions have conflicting dependencies.
  #33 79.18 
  #33 79.18 The conflict is caused by:
  #33 79.18     apache-airflow[devel-ci] 2.3.0.dev0 depends on typing-extensions<4.0.0 and >=3.7.4
  #33 79.18     mypy 0.910 depends on typing-extensions>=3.7.4
  #33 79.18     apache-beam 2.33.0 depends on typing-extensions<4 and >=3.7.0
  #33 79.18     looker-sdk 22.2.1 depends on typing-extensions>=4.1.1
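
The conflict above is mechanical: no single version can satisfy both airflow's `>=3.7.4,<4.0.0` pin and looker-sdk's `>=4.1.1` requirement. A quick stdlib sketch makes that concrete (a naive dotted-number comparison, not a full PEP 440 parser):

```python
def parse(version: str) -> tuple:
    """Naive dotted-version parser; enough for these simple specifiers."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, minimum=None, below=None):
    """Check version against an optional lower bound (inclusive) and upper bound (exclusive)."""
    v = parse(version)
    return (minimum is None or v >= parse(minimum)) and (below is None or v < parse(below))


# airflow pins typing-extensions to >=3.7.4,<4.0.0; looker-sdk needs >=4.1.1
airflow_ok = lambda v: in_range(v, minimum="3.7.4", below="4.0.0")
looker_ok = lambda v: in_range(v, minimum="4.1.1")

print(airflow_ok("3.10.0") and looker_ok("3.10.0"))  # prints "False"
print(airflow_ok("4.1.1") and looker_ok("4.1.1"))    # prints "False"
```

Every candidate version fails one side or the other, so the resolver correctly reports the install as unsatisfiable until one of the pins is relaxed.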

@potiuk
Member

potiuk commented Mar 7, 2022

Then I think you will need to remove the typing-extensions limit from setup.py. Removing it causes Kubernetes test failures, but this would have to be solved.

While the dask limitation was clearly a dask problem, I think solving the typing-extensions problem is something you could help with, since you would like to upgrade it :).

Just removing the limit should trigger the update, and you might be able to see the problems.

@potiuk
Member

potiuk commented Mar 7, 2022

Ah no, I looked closer. It seems that there is still some other reason to hold Beam back from upgrading to the latest version, but what it is I do not know yet - I will look tomorrow.

@potiuk
Member

potiuk commented Mar 7, 2022

Ah, likely the constraints are not refreshed yet :). Let's see tomorrow morning.

@potiuk
Member

potiuk commented Mar 7, 2022

Right. Constraints refreshed, but indeed typing-extensions are too limiting. Attempting to see what the problems are and fix them in #22046

@potiuk
Member

potiuk commented Mar 7, 2022

All right, @alekseiloginov - I rebased your change on top of the latest main (which should already have typing-extensions and constraints fixed).

@potiuk
Member

potiuk commented Mar 7, 2022

🤞 !

@alekseiloginov
Contributor Author

Thanks a lot, @potiuk ! Most of the tests are passing now! 👍
I see only one failing test:

MSSQL2017-latest, Py3.9: Always API Core Other CLI Providers WWW Integration

@potiuk
Member

potiuk commented Mar 7, 2022

That was a flaky test for mssql: ERROR: for mssqlsetup Container "130ef72be1ab" is unhealthy. Merging.

@potiuk potiuk merged commit 6db9b00 into apache:main Mar 7, 2022
@boring-cyborg

boring-cyborg bot commented Mar 7, 2022

Awesome work, congrats on your first merged pull request!

@mik-laj
Member

mik-laj commented Mar 7, 2022

🎉

@potiuk
Member

potiuk commented Mar 7, 2022

Preparing the release now. Good stuff. I just checked that the last Python 3.10 obstacle is removed, and we just need to release all providers with Python 3.10 support to be able to add 3.10 :)

@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Apr 11, 2022