Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-18] Scheduling queries from SQL Lab #7425

Closed
betodealmeida opened this issue May 1, 2019 · 12 comments
Closed

[SIP-18] Scheduling queries from SQL Lab #7425

betodealmeida opened this issue May 1, 2019 · 12 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community sip Superset Improvement Proposal sqllab Namespace | Anything related to the SQL Lab
Projects

Comments

@betodealmeida
Copy link
Member

[SIP] Proposal for scheduling queries from SQL Lab

Motivation

A common approach for building dashboards involves:

  1. User writes a complex query in SQL Lab, often with joins, to get the data they need.
  2. User clicks "visualize", to explore the data.
  3. User builds a visualization. This step is usually slow, since the SQL query has to be recomputed every time.
  4. User builds a dashboard using the visualization. It's often slow, since the SQL query has to be recomputed every time.
  5. In order to get fresh data in the dashboard, user has to either update the underlying SQL query or write expensive queries using macros (scanning 7 days of data, for example).

We want to optimize that process, allowing the user to write one query in SQL Lab that runs periodically. The query should scan data only for its interval (1 day of data for a daily schedule, for example). This way dashboards can be kept up-to-date with cheap queries.

In this SIP, I propose a way of scheduling queries from within SQL Lab. The actual scheduling of the query is left to an external service (like Apache Airflow, for example). Superset will simply enrich saved queries with additional metadata for the external scheduler.

The proposal is scheduler-agnostic, and can be used with Apache Airflow, Luigi or any other scheduler, since the form for collecting the metadata needed is defined in config.py using react-jsonschema-form.

Proposed Change

In this SIP we propose adding a new feature flag called SCHEDULED_QUERIES. Instead of a boolean, the feature flag would be a dictionary with two keys, JSONSCHEMA and UISCHEMA (see discussion here for using a dict as a feature flag). As an example:

FEATURE_FLAGS = {
    # Configuration for scheduling queries from SQL Lab. This information is
    # collected when the user clicks "Schedule query", and saved into the `extra`
    # field of saved queries.
    # See: https://github.com/mozilla-services/react-jsonschema-form
    'SCHEDULED_QUERIES': {
        'JSONSCHEMA': {
            'title': 'Schedule',
            'description': (
                'In order to schedule a query, you need to specify when it '
                'should start running, when it should stop running, and how '
                'often it should run. You can also optionally specify '
                'dependencies that should be met before the query is '
                'executed. Please read the documentation for best practices '
                'and more information on how to specify dependencies.'
            ),
            'type': 'object',
            'properties': {
                'output_table': {
                    'type': 'string',
                    'title': 'Output table name',
                },
                'start_date': {
                    'type': 'string',
                    'format': 'date-time',
                    'title': 'Start date',
                },
                'end_date': {
                    'type': 'string',
                    'format': 'date-time',
                    'title': 'End date',
                },
                'schedule_interval': {
                    'type': 'string',
                    'title': 'Schedule interval',
                },
                'dependencies': {
                    'type': 'array',
                    'title': 'Dependencies',
                    'items': {
                        'type': 'string',
                    },
                },
            },
        },
        'UISCHEMA': {
            'schedule_interval': {
                'ui:placeholder': '@daily, @weekly, etc.',
            },
            'dependencies': {
                'ui:help': (
                    'Check the documentation for the correct format when '
                    'defining dependencies.'
                ),
            },
        },
    },
}

The configuration is used to dynamically generate a form for collecting the extra metadata needed in order to schedule the query. The example above should work for many schedulers, but it can also be easily adapted (or completely changed) depending on the needs.

If this flag is present, SQL Lab will show a button label "Schedule Query":

Screen Shot 2019-05-01 at 11 29 46 AM

Clicking it pops up a modal:

Screen Shot 2019-05-01 at 11 31 05 AM

When the user clicks "Submit" the query is saved (just like a saved query) with the schedule information stored in its JSON metadata. The user can edit the query, like any saved query, and the scheduler can fetch the scheduled queries using the API provided by FAB.

New or Changed Public Interfaces

None.

New dependencies

react-jsonschema-form is an Apache 2 licensed project created by Mozilla. It was last updated 14 days ago, and has ~35k weekly downloads.

Migration Plan and Compatibility

None.

Rejected Alternatives

We consider using celery workers to run the queries, but this would add a lot of complexity for backfills, alerting, etc. The proposed approach leverages existing schedulers, leaving to Superset only the task of annotating queries with extra metadata.

@betodealmeida betodealmeida added the sip Superset Improvement Proposal label May 1, 2019
@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the label #enhancement to this issue, with a confidence of 0.96. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@issue-label-bot issue-label-bot bot added the enhancement:request Enhancement request submitted by anyone from the community label May 1, 2019
@betodealmeida betodealmeida changed the title [SIP] Scheduling queries from SQL Lab [SIP-18] Scheduling queries from SQL Lab May 1, 2019
@betodealmeida betodealmeida added the sqllab Namespace | Anything related to the SQL Lab label May 1, 2019
@mistercrunch mistercrunch added this to pre-DISCUSS in SIPs Feb 24, 2020
@mmuru
Copy link
Contributor

mmuru commented Apr 18, 2020

@betodealmeida: I got the message "Your query has been scheduled. To see details of your query, navigate to Saved Queries message". I could see "scheduled_info" stored in saved_query.extra_json column
{"schedule_info":{"output_table":"mm_schedule_query_test","start_date":"2020-04-18T01:00:00.000Z","end_date":"2020-04-18T02:00:00.000Z","schedule_interval":"@hourly"}}
but the query was not running. Please, can you help me to understand how it works? What is the output_table for? Will this table get created automatically or must be exist? Can schedule_interval have a valid cron type syntax? Please, can you share me a sample config file?

@betodealmeida
Copy link
Member Author

@mmuru we have a custom Airflow pipeline that fetches the query information and schedules it. Let me check with Lyft's open source team if we can share it.

@mistercrunch
Copy link
Member

There was an ongoing conversation about this here #7416 with @ArgentFalcon

@mmuru
Copy link
Contributor

mmuru commented Apr 22, 2020

@betodealmeida and @mistercrunch: Thanks for the response. The sample config file is sorta misleading because its Hive specify use case. I tried to use /savedqueryviewapi/api/read endpoint, it returns everything in saved_query. Is there way to specify url parameters?

@mistercrunch
Copy link
Member

mistercrunch commented Jul 4, 2020

@mmuru , there's a mini language for all ModelView from FAB (/savedqueryviewapi/api/read is one of them). The easiest way is to reverse engineer it by using the UI in the list view at /savedqueryview/list/, getting the filter you want, and copy pasting the querystring from the URL, applying it to /savedqueryviewapi/api/read?{HERE!}

Screen Shot 2020-07-04 at 10 48 09 AM

Short term we're working on a state of the art REST API under /api/v1/ that will be fully documented and type safe.

@ktmud
Copy link
Member

ktmud commented Oct 14, 2020

Re: adding objects to FEATURE_FLAGS

I think it's one thing to manage feature configs, it's another to manage whether a feature is turned on or not. If you look at LaunchDarkly's API, their feature flags are created in a very structured manner. And the information stored is only metadata about the variants, nothing related to how the feature is implemented.

It's one thing to store information about variants in feature configs, it's another to store objects required to run the feature. It'd be unscalable and quite dangerous to blur the line here.

Ideally you'd want schema/type enforcement on everything in your program, allowing people to store arbitrary stuff in FEATURE_FLAGS seems to be the opposite of that direction.

@edvinas-maciulis
Copy link

@betodealmeida is it possible to edit saved query's scheduling details? How could I disable/stop query from being run in the future?

@srinify srinify moved this from pre-DISCUSS to IMPLEMENTED / DONE in SIPs Feb 4, 2022
@MateusCastello
Copy link

@mmuru we have a custom Airflow pipeline that fetches the query information and schedules it. Let me check with Lyft's open source team if we can share it.

Can you share this pipeline definition? I'm having a little trouble with this part

@betodealmeida
Copy link
Member Author

@mmuru we have a custom Airflow pipeline that fetches the query information and schedules it. Let me check with Lyft's open source team if we can share it.

Can you share this pipeline definition? I'm having a little trouble with this part

I've tried to do it during the years, but unfortunately I left Lyft in 2020 and was never able to get the pipeline source code.

Let me revisit this again and see if I can remember how we did it.

@MateusCastello
Copy link

@mmuru we have a custom Airflow pipeline that fetches the query information and schedules it. Let me check with Lyft's open source team if we can share it.

Can you share this pipeline definition? I'm having a little trouble with this part

I've tried to do it during the years, but unfortunately I left Lyft in 2020 and was never able to get the pipeline source code.

Let me revisit this again and see if I can remember how we did it.

Thank you Beto! I'm thinking of a way to get these queries in a Python script and build separate Airflow Dag's for each one, and then call Trino to dispatch the queries, as I don't already find a way to schedule Celery Async Queries via the Superset API or mething like that.

@kongsyuen
Copy link

kongsyuen commented Feb 24, 2023

Have you found any solution to fix this issue? @mmuru

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement:request Enhancement request submitted by anyone from the community sip Superset Improvement Proposal sqllab Namespace | Anything related to the SQL Lab
Projects
Status: Implemented / Done
SIPs
IMPLEMENTED / DONE
Development

No branches or pull requests

7 participants