Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified #606

Open
3 tasks done
jaypeedevlin opened this issue May 12, 2023 · 6 comments · May be fixed by #917
Open
3 tasks done

[ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified #606

jaypeedevlin opened this issue May 12, 2023 · 6 comments · May be fixed by #917
Labels
enhancement New feature or request triage

Comments

@jaypeedevlin
Copy link
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When we add cluster keys to an incremental model, there is an order by added to the model SQL that sorts the model before the merge.

With extremely large tables (ie those most likely to require clustering), this 'front-loads' some of the clustering into the model run itself. I find this undesirable for two reasons:

  1. It increases the length of our dbt runs as opposed to letting the entire cluster process happen async
  2. It 'conceals' some of the cost associated with clustering — this sort is not part of the clustering credits consumed and it's harder to recognise that cost when understanding the true cost of your clustering on an ongoing basis.

I am proposing to add a flag, where the default value is to keep things as is, and when the flag is True that the order by is not added to the model. I do actually think that a change in default behaviour would be desirable for most folks, but I feel like I'm more likely to get a greenlight on a version of this that doesn't alter behavior.

Describe alternatives you've considered

Overriding the macro or creating a custom materialisation.

Who will this benefit?

Folks with big data™.

Are you interested in contributing this feature?

Yes!

Anything else?

No response

@jaypeedevlin jaypeedevlin added enhancement New feature or request triage labels May 12, 2023
@github-actions github-actions bot changed the title [Feature] Flag to opt out of order by when cluster keys are specified [ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified May 12, 2023
@dbeatty10
Copy link
Contributor

The order by topic came up recently in #585, so very timely feature request @jaypeedevlin 🤩 !

💡 While we wouldn't take it lightly, we're open to changing the current behavior that adds order by when cluster keys are specified.

For context, discussion surrounding the original clustering implementation for dbt SQL models is in these pull requests:

A key thing for us to do would be to read those earlier discussions to thoroughly understand their rationale. In particular, I think we'd need to understand the performance implications of the (8) combinations of these scenarios:

Could you give a read though the comments in those pull requests (dbt-labs/dbt-core#1591 in particular) and let me know how it affects your thinking?

I need to do a more thorough reading over there as well.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 11, 2023
@jaypeedevlin
Copy link
Contributor Author

Commenting to leave this open, I intend to respond soon!

@MichaelGFagan
Copy link

MichaelGFagan commented Oct 13, 2023

Adding a comment here to bump this as we have run into the same issue as @jaypeedevlin.

@dbeatty10, I might be missing something, but why would the performance implications be a blocker if the suggestion is to add an optional flag that allows users to alter the default behavior? I think @jaypeedevlin is correct that removing the order by as the default is likely better for the use-cases that get the most benefit from clustering, but our team would much rather have the option to turn off the default behavior rather than having to carry the tech debt burden of overriding the current create_table_as macro/rolling our own materialization.

Edit: Also, from what I can tell, there's no ordering that's done when the incremental is done with the default temp view.

@Slanman3755
Copy link

Slanman3755 commented Jan 22, 2024

@dbeatty10 I have quickly reviewed the old pull requests and feature requests. I believe the decision to add the "order by" to the cluster key config was an over optimization that was influenced by old snowflake clustering behavior and best practices of other users. I do not see any reason the "order by" should be forced upon a normal DBT user and it should at the very least be optional.

I don't have historical Snowflake documentation in front of me but I believe at some point they recommended "clustering" your own tables by sorting data on insert. Around the time the cluster key feature was added to dbt for Snowflake, a big change was made in how Snowflake clustering works, which was the removal of manual reclustering. Review of the github discussion seems to suggest the "order by" was added in an attempt to supplement the removal of the manual "recluster" command by manually "clustering" the data before it was inserted into the new table by first sorting it.

Snowflake clustering now works through background tasks operated by Snowflake. They call this auto clustering. Snowflake does not expect data to be pre-sorted or clustered when creating or inserting into a clustered table. Snowflake's background process will recluster the table at an indeterminate time later. Theoretically there could be a use case where you could improve query performance between the insert/creation time and the reclustering process execution by pre-sorting. This is very niche though and I do not believe this should be forced upon the user.

The result of the current behavior is that DBT is effectively front-loading the work of clustering by first sorting the data. This can potentially be a very expensive operation. Snowflake auto clustering will automatically recluster and sort this data later (most likely more efficiently and cheaper) so there is no reason for DBT to do so. This is no longer a recommended practice by Snowflake and there is normally no reason for a user to do this unless they are very concerned about query performance before reclustering occurs.

Additionally from a user experience perspective, I believe adding an expensive operation such as an "order by" behind a seemingly unrelated feature is deceptive to the user. Maybe I am biased from my own experience but I am confident that there are many DBT users that have suffered from poor performance due to this "order by" and have no idea it was even added to their query.

@Slanman3755
Copy link

Snowflake automatic clustering documentation: https://docs.snowflake.com/en/user-guide/tables-auto-reclustering
tldr; There is no longer a need for DBT to try and cluster/sort for Snowflake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants