Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add support for partition elimination in BigQuery #712

Closed
3 tasks done
Jstein77 opened this issue Aug 7, 2023 · 5 comments · Fixed by #724
Closed
3 tasks done

[Feature] Add support for partition elimination in BigQuery #712

Jstein77 opened this issue Aug 7, 2023 · 5 comments · Fixed by #724
Assignees
Labels
bug Something isn't working

Comments

@Jstein77
Copy link
Contributor

Jstein77 commented Aug 7, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, MetricFlow casts time filters in --start-time and --end-time to a DATETIME data type. For example a query like this:
mf query --metrics revenue --dimensions time --start-time 2023-08-06 --end-time 2023-08-06

Will render the following SQL in BigQuery:
WHERE time BETWEEN CAST('2023-08-06' AS DATETIME) AND CAST('2023-08-06' AS DATETIME)

This clashes with BigQuery's time partition table requirements which require a timestamp. For example, It's common to require a time filter on a large, time partitioned table. If a user tries to query without a time partition filter the query will fail. Therefore, if a user tries to specify a time filter query with MetricFlow it will fail because we are casting time filters as DATETIME when timestamp was expected. The error message is below.

Cannot query over table 'X' without a filter over column(s) 'time' that can be used for partition elimination

Additional BQ documentation is here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables

Proposed solution

For the purposes of this specific issue we will alter the literal time comparison expression rendering to avoid casting the time dimension wherever possible. i.e., for BigQuery and any other engine supporting implicit literal type coercion we'd shift from something like this:

WHERE CAST(input_time_colum AS DATETIME) = CAST('2021-01-01' AS DATETIME)

To something like this:

WHERE input_time_column = '2021-01-01'

The specific rendering behavior in these scenarios will depend on the engine, but for BigQuery specifically it appears any valid string-literal representation of a date or time can be compared with a date, timestamp, or datetime column type without issue.

Describe alternatives you've considered

A community member tried a workaround of casting the time dimensions to a date:

dimensions:
      - name: date
        type: time
        expr: DATE(time)
        type_params:
          time_granularity: day 

However, we're still casting the start and end time as DATETIME which doesn't work for partition pruning.

Who will this benefit?

BigQuery users.

Are you interested in contributing this feature?

No response

Anything else?

No response

@Jstein77 Jstein77 added enhancement New feature or request triage Tasks that need to be triaged labels Aug 7, 2023
@tlento
Copy link
Contributor

tlento commented Aug 7, 2023

I think we have a few options.

  1. Global timezone-aware configuration, where users set a project-level time zone and we use that in all of our tz conversions. There's an open question here of whether we would coerce everything to UTC by default, or be timezone-agnostic by default (the latter is probably better, except Snowflake does some odd stuff there).
  2. Engine-specific timezone-aware type handling configuration. This is a mess but may end up being necessary
  3. Dimension-by-dimension type specification. I think that's super confusing and best avoided, as users can do that coercion in dbt models (even via views without changing their on disk formats), but again, might be necessary.
  4. Treat time partitions as special (similar to 3, but only applicable to partition dimensions).

I'm sure there are other approaches I'm missing.

None of these feel great to me, although I freely admit I would strongly prefer to live in a world free of time zones, but regardless I think we'll need a bit of a brainstorming/design session

@Prokos
Copy link

Prokos commented Aug 8, 2023

Without having any knowledge on the internals of this codebase, I'd like to add (as the person currently running into this problem), that for me simply removing the casting to DATETIME would solve the issue.

As far as I understand (timezones still hurt my brain sometimes though), doing the following:

WHERE DATE(time) = 'YYYY-MM-DD' -- or BETWEEN, same principle

Simply assumes that the YYYY-MM-DD is in the same timezone as the time. As the timezone is effectively removed from time casting it to a DATE.

For my problem to be solved, I do not need support for timezones. I simply need to have my date dimension as a DATE not as a DATETIME. I do realise that means I would lose any potential drilling into for example hourly data.

But there might be low hanging fruit to allow date-level querying for timestamp fields, without having to go into how to deal with timezones.

@tlento tlento added bug Something isn't working and removed enhancement New feature or request triage Tasks that need to be triaged labels Aug 8, 2023
@tlento
Copy link
Contributor

tlento commented Aug 8, 2023

@Prokos I have a hypothesis.

We currently render a CAST expression around the right hand side of that filter. The thing about CAST is it returns an anything. My guess is BigQuery isn't inspecting the inputs to CAST in their filter pruning check, and is falling back to a full table scan.

Their documentation suggests that doing TIMESTAMP('2021-01-01') will work. I'm wondering if we can do DATETIME(start_date) instead of CAST(start_date AS DATETIME) and trigger the partition pruning. If so, this becomes a very easy change for us to make, as we can simply update our cast as timestamp expressions. You'll still need to do some casting in your model until #714 is resolved, but it could unstick you.

@tlento
Copy link
Contributor

tlento commented Aug 9, 2023

Well, so much for that hypothesis. This filter expression works on a table partitioned by TIMESTAMP_TRUNC(ordered_at, DAY):

WHERE DATE(ordered_at) BETWEEN DATE('2016-11-15') AND DATE('2017-01-30');

This also works:

WHERE ordered_at BETWEEN TIMESTAMP('2016-11-15') AND TIMESTAMP('2017-01-30');

Amazingly enough, this also works, both on timestamp_trunc(x, day) and timestamp_trunc(x, hour):

WHERE ordered_at BETWEEN '2016-11-15' AND '2017-01-30';

So the issue is with the TIMESTAMP <-> DATETIME conversion. The mechanism we use doesn't matter. This makes sense to me, in as much as anything involving date types and BigQuery makes sense to me, but it's a bit of a bummer, because the rendering change we require to unblock you is now considerably more complicated.

This was a useful detour, though, because In going through this further I think this is an issue more specific to partition pruning - if your partitions are not date/time-based you can't use filter pruning at all except by accident, and that's a problem as well. So we have two issues:

  1. We probably need to support time zones in a more robust manner, which may solve the immediate issue but will not address the full scope of the partition filter problem
  2. We need to ensure partition pruning filters are applied correctly, which is a query-time issue (since partition labels are supported in the spec)

I think the second of these allows for a faster path to a working solution for you while we figure out the right approach to the first. I'll look into this more on our side and see what we can come up with. More soon!

@Jstein77 Jstein77 changed the title [Feature] Add support for timestamp data sources in BigQuery [Feature] Add support for partition elimination in BigQuery Aug 10, 2023
@tlento
Copy link
Contributor

tlento commented Aug 11, 2023

We've decided on the following:

  1. TimeZone support is out of scope here. We are discussing it separately as a longer term feature connected with formal support for sub-daily granularities.
  2. We are working on some mechanisms to handle more expansive predicate push down hinting for end users and partition filter rendering based on semantic model configuration specifications, which will address the issue you're running into here along with the broader problem of applying partition filters in a reasonable manner.
  3. We will also update our time dimension partition filter rendering to resolve the immediate problem here, at least for certain query types.

I'm going to update this issue to be narrowly focused on the last piece regarding time dimension partition filter rendering, which we can build and release independently as a first step towards providing the more robust filtering changes described in item 2 above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants