
feat(explore): dataset macro: dttm filter context #25950

Merged
merged 1 commit into apache:master on Nov 14, 2023

Conversation

@giftig (Contributor) commented Nov 10, 2023

SUMMARY

Pass from_dttm and to_dttm as additional context to the dataset macro, allowing any jinja templating in the underlying dataset query text to refer to this context.

This is useful in situations where virtual dataset A incorporates from_dttm and/or to_dttm into the query, in order to make the underlying query aware of chart filter contexts, and then virtual dataset B is built on top of dataset A and used in charts.
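
Conceptually, the change works something like the following (a simplified sketch only; the actual change is in superset/jinja_context.py, and the names here other than from_dttm/to_dttm are illustrative, not the real code):

    from datetime import datetime
    from typing import Optional

    from jinja2 import Template

    def render_dataset_sql(
        dataset_sql: str,
        from_dttm: Optional[datetime] = None,
        to_dttm: Optional[datetime] = None,
    ) -> str:
        # Any {{ from_dttm }} / {{ to_dttm }} in the underlying virtual
        # dataset's query text now resolves against the chart's filter
        # context instead of rendering as empty.
        return Template(dataset_sql).render(from_dttm=from_dttm, to_dttm=to_dttm)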

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before

[screenshot: before-no-dttm-ref]

After

[screenshot: after-dttm-ref]

TESTING INSTRUCTIONS

Make sure ENABLE_TEMPLATE_PROCESSING is turned on, and use a standard Superset dev environment with the samples database available. You can also adapt these steps to other data or backends, as long as you apply a time filter in your dashboard.
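
If the flag isn't enabled yet, it can be turned on in superset_config.py:

    # superset_config.py: Jinja templating must be enabled for these steps
    FEATURE_FLAGS = {
        "ENABLE_TEMPLATE_PROCESSING": True,
    }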

  1. Using the samples database, run the following query in SQL Lab:
SELECT ds, gender, name, num_boys, num_girls, '{{ from_dttm or "empty" }}' from_dttm FROM public.birth_names

and save it as a virtual dataset called from_dttm. For now the from_dttm field will show "empty", because it won't be populated until a time filter is applied on a chart in a later step.

  2. Create a table chart from this dataset, select all fields, and add a time filter on ds. Add this chart to a new dashboard, and configure a time filter for ds on that dashboard. Set that filter to the past 30 years so we can see some data. Notice that from_dttm now displays a datetime value.

  3. In SQL Lab again, run the following query:

SELECT * FROM {{ dataset(25) }};

You'll need to make sure 25 here is the ID of your virtual dataset from_dttm from step 1. Create a chart exactly like the first one, and add it to the same dashboard so you can see the two charts side by side.

  4. Compare the two tables and you will see from_dttm is correctly populated in both charts. Prior to this change, the second chart, which uses the dataset macro, showed "empty" instead, because it did not propagate sufficient context from the chart through the macro.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags: ENABLE_TEMPLATE_PROCESSING
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@giftig (Contributor, Author) commented Nov 10, 2023

Note that I tried to capture this scenario in a test case, but since it's quite complex, with one dataset referring to another and jinja needing to render in both, I couldn't see a sensible way of doing it in a unit test: other unit tests for the dataset macro mock the very method which would render the jinja in the underlying dataset. I also found that the integration tests for jinja were much simpler and didn't cover the dataset macro.

The existing test coverage for dataset_macro is sufficient to show that there isn't a regression in that macro, though; all I've done is add some extra context which may be used. I think that's OK in terms of coverage, but I'm happy for feedback on that.

Edit: I also initially broke a whole slew of integration tests by getting the time format wrong, so general coverage seems quite good here.

@giftig force-pushed the dataset-macro-context branch 2 times, most recently from 1e5f843 to f0b719b on November 10, 2023 at 15:32
@pull-request-size bot added the size/M label and removed the size/S label Nov 10, 2023
@giftig (Contributor, Author) commented Nov 10, 2023

There's a slight wrinkle here: the datetime format actually varies depending on the engine, which the integration tests highlighted. Unfortunately, to pass the information through to the underlying dataset we need to parse it back into a datetime regardless, since at this point we only have a stringified version in the context, whereas when going direct we have a datetime.

That means I've had to be a little flexible in how the datetime is parsed: I've handled the most common formats and done it in a safe way, falling back to omitting the value if it can't be parsed.

I'm not sure we can preserve this information better; I think this is the best we can do without some major changes to the way this works.
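
For illustration, the fallback parsing works roughly like this (a sketch only; the helper name and the exact format list are illustrative, not the real code):

    from datetime import datetime
    from typing import Optional

    # Datetime formats vary across engines; try the common ones in order.
    _DTTM_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")

    def parse_dttm(value: Optional[str]) -> Optional[datetime]:
        if not value:
            return None  # no filter applied: omit the bound entirely
        for fmt in _DTTM_FORMATS:
            try:
                return datetime.strptime(value, fmt)
            except ValueError:
                continue
        return None  # unrecognised engine format: safer to omit than to guess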

@john-bodley (Member) left a comment

Thanks @giftig for the PR. One question I had is why restrict this to only from_dttm and to_dttm, i.e., should the entire context be exposed?

Additionally would you mind sharing in your PR description a screenshot of how your dataset is configured with Jinja2 templating? I sense this would help to provide more context for people to review.

(Two inline review comments on superset/jinja_context.py; both outdated and resolved.)
@giftig (Contributor, Author) commented Nov 13, 2023

@john-bodley

Thanks for taking a look at this so quickly.

Re exposing the entire context: I'll take a look at what else is available that would make sense to pass in. I'm not sure how easy it would be to expose and test the whole context, since for these two parameters I had to parse the dates back into their original form so they worked seamlessly, but if there are a couple of extra obvious parameters to include, I'll add those too. If I pass the whole context through without knowing what's in those fields, we might see the wrong types for other fields as well, which could be more confusing than not having the context at all.

Re dataset configuration: the testing instructions give a (fairly contrived) example dataset that shows when and why this causes issues; hopefully that provides enough context. Our actual use case is a custom macro for querying a table that applies a sensible default filter to prevent accidentally querying too much data, using the dashboard filter instead of a short default when one is available. I think the simpler example illustrates what's currently missing more clearly, though, and also makes the change easy to reproduce and test.
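
For a rough idea of that use case (hypothetical code, purely to illustrate; default_time_bound is not a real Superset helper):

    from datetime import datetime, timedelta
    from typing import Optional

    def default_time_bound(from_dttm: Optional[datetime]) -> datetime:
        # Prefer the dashboard's time filter when present; otherwise keep
        # the scanned window short to avoid accidentally querying too much.
        return from_dttm or (datetime.utcnow() - timedelta(days=7))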

@giftig (Contributor, Author) commented Nov 14, 2023

@john-bodley I tried passing the rest of the context alongside these parameters, but this data is actually added to a query_obj which is then used to call this method:

    def get_sqla_query(  # pylint: disable=too-many-arguments,too-many-locals,too-many-branches,too-many-statements
        self,
        apply_fetch_values_predicate: bool = False,
        columns: Optional[list[Column]] = None,
        extras: Optional[dict[str, Any]] = None,
        filter: Optional[  # pylint: disable=redefined-builtin
            list[utils.QueryObjectFilterClause]
        ] = None,
        from_dttm: Optional[datetime] = None,
        granularity: Optional[str] = None,
        groupby: Optional[list[Column]] = None,
        inner_from_dttm: Optional[datetime] = None,
        inner_to_dttm: Optional[datetime] = None,
        is_rowcount: bool = False,
        is_timeseries: bool = True,
        metrics: Optional[list[Metric]] = None,
        orderby: Optional[list[OrderBy]] = None,
        order_desc: bool = True,
        to_dttm: Optional[datetime] = None,
        series_columns: Optional[list[Column]] = None,
        series_limit: Optional[int] = None,
        series_limit_metric: Optional[Metric] = None,
        row_limit: Optional[int] = None,
        row_offset: Optional[int] = None,
        timeseries_limit: Optional[int] = None,
        timeseries_limit_metric: Optional[Metric] = None,
        time_shift: Optional[str] = None,
    ) -> SqlaQuery:

so we can't pass everything; in my case the context contained time_column, for example, which isn't a valid argument to this method. What we really need is to intelligently pass through only the arguments of this method which we have in the context, and that might also involve fixing some types, as it did in my case. I captured some context from my chart and saw:

{'columns': ['ds', 'gender', 'name', 'num_boys', 'num_girls', 'from_dttm'],
 'filter': [{'col': 'ds',
             'op': 'TEMPORAL_RANGE',
             'val': 'DATEADD(DATETIME("now"), -30, year) : now'}],
 'from_dttm': '1993-11-14T09:16:20',
 'groupby': None,
 'metrics': None,
 'row_limit': 1000,
 'row_offset': 0,
 'table_columns': ['ds',
                   'gender',
                   'name',
                   'num_boys',
                   'num_girls',
                   'from_dttm'],
 'time_column': None,
 'time_grain': 'P1D',
 'to_dttm': '2023-11-14T09:16:20'}

So at a glance, basic properties of the dataset are present, e.g. columns, filter, groupby, but in a serialised form, so we'd need to recover them somehow. I don't want to creep too far beyond my original scope here: I don't have good test cases for the expected behaviour if I pass in some of these fields, and this is a huge method, several hundred lines long. I'd suggest we raise this as a possible future improvement.
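
If someone picks this up later, one option would be to filter the serialised context down to the parameters the method actually accepts before forwarding it, along these lines (a sketch of the idea only; type recovery would still be needed on top):

    import inspect
    from typing import Any, Callable

    def kwargs_accepted_by(
        func: Callable[..., Any], context: dict[str, Any]
    ) -> dict[str, Any]:
        # Keep only the keys that the target method accepts; e.g. this drops
        # time_column, which appears in the context but isn't a valid
        # argument to get_sqla_query.
        accepted = set(inspect.signature(func).parameters)
        return {k: v for k, v in context.items() if k in accepted}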

@john-bodley john-bodley merged commit f6ba75a into apache:master Nov 14, 2023
29 checks passed
josedev-union pushed a commit to Ortege-xyz/studio that referenced this pull request Jan 22, 2024
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch added the 🏷️ bot label (used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) and the 🚢 3.1.0 label Mar 8, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024