[Feature] Add a warehouse validation to detect if time granularity is configured correctly #758

Open
3 tasks done
tlento opened this issue Sep 5, 2023 · 1 comment
Labels
backlog enhancement New feature or request

Comments

Contributor

tlento commented Sep 5, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, it is possible to configure a time dimension with a granularity that does not match the granularity of the data stored in the warehouse. For example, you could have a column with DAILY granularity while labeling the dimension as YEARLY.

Eventually we'd like to allow users to conform their time dimension inputs from, say, DAILY to YEARLY. Ideally we'd have some validation to catch when this is necessary. Note - this might be impossible to do well, so we might need to think hard about whether or not to build this at all.

Some notes from Slack discussion:

  1. Sampling is a challenge. If we scan everything, or do a row-based sample, we might scan far more data than this validator justifies. If, on the other hand, we do a simple LIMIT X or block-based sampling (where supported), we run the risk of picking a row block that happens to match the target granularity (in our example, one where all rows are from January 1st).
  2. The truncation point cannot be assumed. If a user has a fiscal year that starts on January 30th, and they pre-normalize their YEARLY data to January 30th, their data might be valid, but if we check against January 1st we'll say it is not. Allowing for this divergence in the validator by default means a more complicated query: we need to find two different types of granularity miss in our result set, which isn't trivial to generalize.
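
To make the two pitfalls above concrete, here is a minimal Python sketch over in-memory dates. Everything in it is hypothetical and for illustration only: `is_year_start`, the sample rows, and the January 30th fiscal anchor are assumptions, not MetricFlow code, and a real validator would run as a warehouse query.

```python
from datetime import date


def is_year_start(d: date, anchor_month: int = 1, anchor_day: int = 1) -> bool:
    """Return True when d sits exactly on the assumed yearly truncation point."""
    return d.month == anchor_month and d.day == anchor_day


# Hypothetical DAILY column mislabeled as YEARLY. The first three rows
# happen to fall on January 1st, like an unlucky row block.
rows = [
    date(2020, 1, 1),
    date(2021, 1, 1),
    date(2022, 1, 1),
    date(2022, 3, 15),
    date(2022, 7, 4),
]

# Pitfall 1: a LIMIT-style sample sees only the first block and passes.
limit_sample = rows[:3]
limit_passes = all(is_year_start(d) for d in limit_sample)
full_scan_passes = all(is_year_start(d) for d in rows)

# Pitfall 2: valid YEARLY data normalized to a January 30th fiscal year
# fails a check that assumes January 1st.
fiscal_rows = [date(2021, 1, 30), date(2022, 1, 30)]
calendar_check_passes = all(is_year_start(d) for d in fiscal_rows)
fiscal_check_passes = all(is_year_start(d, 1, 30) for d in fiscal_rows)
```

Here the limited sample passes while the full scan fails, and the fiscal data fails the calendar-year check while passing once the anchor is supplied.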

There's also the opposite scenario - YEARLY data stored as DAILY - which seems like kind of a nightmare to deal with.

In any case, this validator will likely have to be implemented with a tolerance for false negatives, and maybe should be a separate option with parameters so users can override the truncation point and sampling configuration. By default we would do a complete scan and assert that every row's date_trunc value matches the input value.
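
A rough sketch of what such a parameterized check could look like, written in plain Python over in-memory dates rather than as a warehouse query. All names here (`GranularityCheckConfig`, `truncate_to_year`, `find_yearly_mismatches`) are hypothetical and not part of MetricFlow. The sketch handles only YEARLY granularity and assumes the anchor is not February 29th.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class GranularityCheckConfig:
    """Hypothetical user overrides for the validator."""

    anchor_month: int = 1  # truncation point; January 1st by default
    anchor_day: int = 1
    max_rows: Optional[int] = None  # None means a complete scan


def truncate_to_year(d: date, config: GranularityCheckConfig) -> date:
    """Emulate date_trunc('year', d), shifted to the configured anchor."""
    anchor = date(d.year, config.anchor_month, config.anchor_day)
    if d >= anchor:
        return anchor
    # d falls before this year's anchor, so it belongs to the prior period.
    return date(d.year - 1, config.anchor_month, config.anchor_day)


def find_yearly_mismatches(
    values: Iterable[date],
    config: GranularityCheckConfig = GranularityCheckConfig(),
) -> List[date]:
    """Return every scanned value whose truncated form differs from itself."""
    mismatches = []
    for index, value in enumerate(values):
        if config.max_rows is not None and index >= config.max_rows:
            break  # sampling limit reached; later rows go unchecked
        if truncate_to_year(value, config) != value:
            mismatches.append(value)
    return mismatches
```

With the defaults, `find_yearly_mismatches([date(2022, 1, 1), date(2022, 3, 15)])` reports the March row. Passing `GranularityCheckConfig(max_rows=1)` reports nothing for the same input, which is the kind of miss the validator would have to tolerate once sampling is enabled.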

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

@tlento tlento added enhancement New feature or request backlog labels Sep 5, 2023
Contributor Author

tlento commented Sep 5, 2023

I marked this as backlog for now, but it may be worth adding alongside the fix for #714.
