
Implement Snowflake clustering #1591

Closed · wants to merge 17 commits

Conversation

bastienboutonnet (Contributor)

Aim

Brings clustering to Snowflake.

  • Table materialisations will apply an order by by wrapping the model SQL: select * from ( {{sql}} ) order by ( {{cluster_by_keys}} ).
  • The incremental materialisation will create the table as above, followed by an alter statement, alter table {{relation}} cluster by ({{cluster_by_keys}}), to leverage Snowflake's automatic clustering.

This approach was discussed in #634
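As an illustration of the two statements above (a minimal Python sketch; the helper names are assumptions, and the real implementation lives in Jinja macros):

```python
def render_create_table_as(relation, sql, cluster_by_keys):
    # Wrap the model SQL so rows are written in cluster-key order.
    return (
        f"create or replace table {relation} as (\n"
        f"  select * from (\n"
        f"    {sql}\n"
        f"  ) order by ({cluster_by_keys})\n"
        f");"
    )

def render_alter_cluster(relation, cluster_by_keys):
    # Record the clustering keys on the table so Snowflake's automatic
    # clustering can maintain them on subsequent incremental loads.
    return f"alter table {relation} cluster by ({cluster_by_keys});"
```

For a table materialisation only the first statement is needed (a full refresh rewrites the whole table in order); the incremental materialisation issues both.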

bastienboutonnet (Contributor, Author) commented Jul 6, 2019:

Very much WIP. Just wanted to check that this approach is going in a good direction.
Also, I think I messed up the update of my fork 😞

I put it as part of the main materialisations, but I'm also very much open to making explicit separate materialisations for these, like incremental/table_clustered or something like that, but that seems like it would introduce a large amount of code duplication.

drewbanin (Contributor) left a comment:

Really nice PR @bastienboutonnet! Dropped a couple of comments in here - please let me know if you have any questions!

{% macro snowflake__alter_cluster(relation, sql) -%}
{# TODO: Add support for more than one key #}
{%- set cluster_by_keys = config.get('cluster_by') -%}
alter table {{relation}} cluster by ({{cluster_by_keys}})
Contributor:

This was news to me when I learned it: altering the table's clustering keys doesn't actually recluster the table. You'd either need to run alter table ... recluster (which is now deprecated), or you'd need to enable automatic reclustering for these changes to take effect.

bastienboutonnet (Contributor, Author) commented Jul 6, 2019:

Damn, that's awkward. Indeed, I just checked the status of the test table I created and automatic_clustering is off.

The only option is running an alter table foo.bar resume recluster, but that will only work if a cluster by key has been provided. So doing clustering solely via order by at table creation will not engage automatic housekeeping and the table will become stale eventually. Again, I may have gotten this completely wrong.

This kind of leaves us with the following options:

  • mimic clustering on tables that you create with order by;
  • for incremental tables, hope you're using a fairly monotonic cluster by key, because most likely Snowflake won't ever do any housekeeping for you later on.

Is that correct?

Contributor:

I didn't realize that either! I thought automatic clustering needed to be enabled explicitly.

So, that's kind of good news for us - it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed. This should never happen for table models since dbt doesn't run DML against those models.

For incremental models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!

In my experience, automatic reclustering can be pretty expensive! It might be worth adding a config to suspend automatic reclustering as shown in the docs. I'm not exactly sure if there's a use case for setting clustering keys and also suspending automatic reclustering... I think this is an area where I'd need to learn a little bit more, or we'd need to pull in a real expert (like @jtcohen6) :)

Contributor (Author):

No no, you were right, and maybe I expressed myself badly. Automatic clustering does need to be enabled explicitly to have Snowflake monitor and perform housekeeping-like reclustering, from what I understand and have seen in operation on our db.

I have no idea what happens if you give clustering keys but leave automatic reclustering off. Would it just trigger on DMLs? In that case, yeah, we don't need to worry. We could alter the keys and bank on the DMLs to trigger reclustering.

We have a bunch of tables on which automatic clustering is on, clustered by created_at-style timestamps, and into which we load incrementally. At first run, automatic clustering was "expensive", but on subsequent updates it got a lot cheaper since it only had to do incremental housekeeping. Or at least that's what we see. But my feeling is that if autoclustering is off, having a cluster by key isn't going to do anything, which means we'd also have to make sure autoclustering is turned on for the tables we want it on.

But yeah I have no idea if I'm right and I'd love to hear from someone who knows Snowflake in and out a bit more.

Contributor (Author):

To be honest, re-reading both https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html and https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html confuses me. Somehow it's not clear whether you'll ever need automatic clustering, but at the same time it seems strange that Snowflake would keep things clustered without it turned on...

Contributor (Author):

@drewbanin I just checked with my boss, who has been using clustering a bit in Snowflake, and his take is that if you want to benefit from incremental reclustering, you need to specify the clustering keys and turn on reclustering on the table. The order by will take care of loading the data in the right order, but if you do any further loading and you have clustering keys such as user_id, created_at, incremental loads won't take care of placing user_id correctly, so your data would become progressively unclustered and you'd have to do a full-refresh with the order by to rearrange things.

I just made some changes to reflect this logic. Note that we only need to turn on clustering for incremental tables. For simple tables, since they are always fully refreshed, the order by is sufficient.

Let me know if you find conflicting information but, to me, what I did seems to make sense.

jtcohen6 (Contributor) commented Jul 23, 2019:

Sorry for the delay in weighing in! I don't think I'm going to say anything novel, but in the process of saying it, I lend my endorsement to this work, and significant appreciation to @bastienboutonnet.

I may be misremembering—because the last time I dived deeply into this was several months ago when Manual Reclustering was un-deprecated and Automatic Reclustering was still a preview feature—but I believe my memory echoes @drewbanin here:

> it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed.

If Automatic Reclustering is enabled for a Snowflake account, as soon as a table is defined with cluster keys, Snowflake will start keeping tabs on it.

> This should never happen for table models since dbt doesn't run DML against those models.

We found out that both DML (inserting/updating/deleting rows from an existing clustered table) and DDL (creating a new table with cluster keys) could trigger automatic clustering, if Snowflake identified that the newly created table was not optimally clustered, based on the keys defined for it post hoc. This ended up being the primary expense of our experimentation :) and it's avoided, as this PR does, by creating the tables with an equivalent order by. That's this line from the top of the docs:

> Note that, after a clustered table is defined, reclustering does not necessarily start immediately. Snowflake only reclusters a clustered table if it will benefit from the operation.

The correctly ordered table has nothing to benefit from the operation; automatic clustering is still enabled, but finds itself with nothing to do.

> For incremental models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!

Right on. As long as the original table is properly ordered, as are any subsequent temp tables for merging, the subsequent automatic clustering can be a fairly performant, fairly cheap interleaving.

So my only practical thought here is that, per my understanding of late 2018, the automatic_clustering config and alter table {{relation}} resume recluster DDL is not actually needed. When automatic clustering is turned on for the account, any table with cluster keys defined is presumed to have auto-clustering resumed until it is explicitly suspended.

That said, it's a nice idea for dbt to be verbose-and-beyond in its operations, especially ones this subtle, and so I've similarly included the post-hook in past implementations.

bastienboutonnet (Contributor, Author) commented Jul 31, 2019:

Thanks for your reply @jtcohen6 ❤. I think your comments make sense and indeed confirm most of the things we wanted to check. Regarding turning clustering on at the table level (after, of course, having it enabled for the account), we found that if we do not run it, the automatic_clustering column for a table for which we have given clustering keys is marked as OFF, which suggests that automatic clustering may not be performed as you suggest. I'm curious how you had tested it previously.

As you say, we don't risk a lot by being extra explicit, and I actually like to be extra explicit, but I'm curious to really make sure I understand before just hiding behind an "it won't hurt" logic. Always better not to implement stuff that's unnecessary.

Contributor:

@bastienboutonnet That could be! That morsel of knowledge was one I had picked up when this feature was in private release many months ago, and it struck me as odd at the time that automatic_clustering would be enabled on tables, by default, as soon as a cluster key was defined. So I'm not surprised and actually pleased to hear that the switch has flipped.

Snowflake's docs on the matter are still ambiguous IMO, so I think we err well on the side of being extra un-ambiguous.

Contributor (Author):

Awesome! Then I guess it all makes sense. Yes, the documentation around that feature is hella ambiguous. I should raise it with our technical rep or on the Snowflake community, as I think it probably leaves more than us confused.

bastienboutonnet changed the title from Implement Snowflake clustering to Implement Snowflake clustering [WIP] on Jul 6, 2019
drewbanin (Contributor):

I think you'll want to add cluster_by (and also automatic_clustering) to the list of AdapterSpecificConfigs for Snowflake.

dbt uses this list of configs to apply settings from the models: section of dbt_project.yml. After adding the configs to this list, any cluster_by and automatic_clustering configs specified in dbt_project.yml will be applied to the relevant models in the project.
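A minimal sketch of that change (assuming AdapterSpecificConfigs is a simple set of config names in plugins/snowflake/dbt/adapters/snowflake/impl.py; the exact shape in the codebase may differ):

```python
# Config names the Snowflake adapter accepts from dbt_project.yml and
# model config blocks, in addition to the globally supported ones.
AdapterSpecificConfigs = frozenset({
    "transient",             # already supported by snowflake__create_table_as
    "cluster_by",            # new: column(s) to cluster the table by
    "automatic_clustering",  # new: whether to resume automatic reclustering
})
```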

@@ -1,14 +1,33 @@
{% macro snowflake__create_table_as(temporary, relation, sql) -%}
{%- set transient = config.get('transient', default=true) -%}
{# TODO: Add support for more than one key #}
{%- set cluster_by_keys = config.get('cluster_by') -%}
Contributor:

I like using a pattern like this which allows users to either specify:

  1. a string
  2. a list of strings

It's a tiny bit of extra work to marshal a string into a list that contains one string, but it's worth the UX benefit IMO!
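In Python terms (the real code is a Jinja macro; the function name here is illustrative), the marshaling looks like this. Note the join happens after the string check, so list input works too:

```python
def normalize_cluster_by(cluster_by):
    # Accept either a single column name ("user_id") or a list of names
    # (["user_id", "created_at"]) and return a comma-separated string
    # suitable for `cluster by (...)` / `order by (...)`.
    if cluster_by is None:
        return None
    if isinstance(cluster_by, str):
        cluster_by = [cluster_by]
    return ", ".join(cluster_by)
```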

{{ sql }}
{%- endif %}
);
{% if cluster_by_keys is not none and is_incremental -%}
Contributor:

I don't think we need this check on is_incremental here. Instead, I think it might make more sense to add a Snowflake-specific config called automatic_clustering which controls whether or not dbt enables automatic clustering. In particular, I think this will be useful in conjunction with dev/prod environments. It's probably not super desirable to always enable automatic clustering in dev (it can be expensive!). I think it's less important to check that the model is incremental, as automatic clustering simply won't run for non-incremental models (like tables) because dbt never runs DML against them!

You buy this?

I'm not even sure that we need to add this check -- if the user specifies a cluster_by for a non-incremental model (like a table), then there shouldn't be any harm in turning on automatic clustering.

I am open to adding another config for Snowflake, like automatic_clustering which controls specifically when the alter table {{relation}} resume recluster; query is executed.

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

I think I buy it! It makes sense that if there is no DML then no automatic clustering should be triggered, but I'm not sure whether it wouldn't actually try to cluster things when you turn it on, which could be costly. We'd have to have a Snowflake specialist confirm this.

So, yes, I think putting alter table {{relation}} resume recluster; behind a config is a good safety net, and it gives transparency to the user about what is going on.

Curious: how would you implement an override of automatic clustering if you were running your model against --target: dev, for example? You're hinting that that's configurable in dbt_project.yml? Or would you put an override in your profiles.yml?

Contributor:

(we discussed on Slack, but for posterity:)

Yeah - that’s a separate issue that would be really good to address soon. When dbt evaluates the models: block in the dbt_project.yml file, it doesn’t include the specified target in the context. So, it would be great to do:

models:
   automatic_clustering: "{{ target.name == 'prod' }}"

but unfortunately, target is going to evaluate to none in this context currently, so that won’t work at the moment. Unsure if we have an issue for this at the moment, but I can create one

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

@drewbanin I just implemented the changes you suggested. I think they make a lot of sense. Have a look and let me know whether it makes sense.

I just realised my Python formatter (black) caused a few stylistic diffs. If you would rather not have these changes I can go back and undo them.

I'm also gonna go ahead and turn the PR into a "ready for review" state, as I think we're not going to debate the implementation logic much, so it should be a proper review.

Also, the first time I updated my fork I brought master in. Let me know if you'd like me to cherry pick those commits or something like that

bastienboutonnet marked this pull request as ready for review July 20, 2019 10:43
bastienboutonnet changed the title from Implement Snowflake clustering [WIP] to Implement Snowflake clustering on Jul 20, 2019
drewbanin changed the base branch from dev/wilt-chamberlain to dev/0.14.1 July 20, 2019 16:22

drewbanin (Contributor) left a comment:

This is looking great @bastienboutonnet! Really nice work :)

One tiny change requested, but this should be good to merge after that!

@jtcohen6 any chance you'll be able to take a quick peek at this on Monday?

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

Awesome! Thanks for your help with that one too! (gosh where is the fidget spinner emoji when we need it)

drewbanin (Contributor):

Looks like a couple of pep8 tests failed here! You can run make test-unit locally to kick off these pep8 tests - let me know if you have any questions!

plugins/snowflake/dbt/adapters/snowflake/impl.py:13:80: E501 line too long (91 > 79 characters)
plugins/snowflake/dbt/adapters/snowflake/impl.py:23:80: E501 line too long (84 > 79 characters)
plugins/snowflake/dbt/adapters/snowflake/impl.py:24:80: E501 line too long (84 > 79 characters)

drewbanin (Contributor):

@bastienboutonnet the tests are all passing here! I'm going to just do a couple more sanity checks on this over the weekend, but I think we should be a-ok to merge this early next week. If everything looks good, it'll ship for our imminent 0.14.1 release.

Thanks also to @jtcohen6 for imparting his wisdom here!

🎉 ❄️ ✨

{%- set enable_automatic_clustering = config.get('automatic_clustering', default=false) -%}
{%- if cluster_by_keys is not none and cluster_by_keys is string -%}
{%- set cluster_by_keys = [cluster_by_keys] -%}
{%- set cluster_by_string = cluster_by_keys|join(", ")-%}
Contributor:

Just did one last scan here - I think this won't work properly if cluster_by_keys is provided as a list. If that happens, then the code in this branch won't be executed. I think we should move this line outside of the if block here.

bastienboutonnet (Contributor, Author) commented Aug 16, 2019:

Ohh dear... good catch! I can certainly fix that! BUT I just realised I cleaned up my fork, so I don't actually know how I can edit this on my side. I had to fork dbt again as we were experimenting and forgot this was still pending. Do you know what would be the best way to go? I could probably fork again, or do a new PR, but then we'd lose the trace of our discussion (although we could refer to it even if the PR is closed). Let me know.

Contributor:

oh dear indeed! I've never seen this before!

There's a cool thing you can do on GitHub - if you add .patch to the end of a PR url, you'll get a "patch" for the change. I think you should be able to make a clean branch off of dev/0.14.1, then do something like:

curl https://patch-diff.githubusercontent.com/raw/fishtown-analytics/dbt/pull/1591.patch > 1591.patch
git apply 1591.patch

If git says that it can't apply the patch, you can try applying the patch from the commit 9e36ebd.

Let me know how it goes!

bastienboutonnet (Contributor, Author) commented Aug 17, 2019:

Closing this one due to having had to create a new one after deleting my fork... New PR: #1689
