
Implement Snowflake clustering #1591

Closed · wants to merge 17 commits

Conversation

bastienboutonnet (Contributor)

Aim

Brings clustering to Snowflake.

  • Table materialisations will apply an order by by wrapping the model SQL: select * from ( {{sql}} ) order by ( {{cluster_by_keys}} ).
  • The incremental materialisation will create the table as above, followed by an alter statement, alter table {{relation}} cluster by ({{cluster_by_keys}}), to leverage Snowflake's automatic clustering.

This approach was discussed in #634
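As an illustration of the two statements above (a minimal Python sketch; the helper names are assumptions, and the real implementation lives in Jinja macros):

```python
def render_create_table_as(relation, sql, cluster_by_keys):
    # Wrap the model SQL so rows are written in cluster-key order.
    return (
        f"create or replace table {relation} as (\n"
        f"  select * from (\n"
        f"    {sql}\n"
        f"  ) order by ({cluster_by_keys})\n"
        f");"
    )

def render_alter_cluster(relation, cluster_by_keys):
    # Record the clustering keys on the table so Snowflake's automatic
    # clustering can maintain them on subsequent incremental loads.
    return f"alter table {relation} cluster by ({cluster_by_keys});"
```

For a table materialisation only the first statement is needed (a full refresh rewrites the whole table in order); the incremental materialisation issues both.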

bastienboutonnet (Contributor, Author) commented Jul 6, 2019:

Very much WIP. Just wanted to check that this approach is going in a good direction.
Also, I think I messed up the update of my fork 😞

I put it as part of the main materialisations, but I'm also very much open to making explicit separate materialisations for these, like incremental/table_clustered or something like that, but that seems like it would introduce a large amount of code duplication.

drewbanin (Contributor) left a comment:

Really nice PR @bastienboutonnet! Dropped a couple of comments in here - please let me know if you have any questions!

{% macro snowflake__alter_cluster(relation, sql) -%}
{# TODO: Add support for more than one key #}
{%- set cluster_by_keys = config.get('cluster_by') -%}
alter table {{relation}} cluster by ({{cluster_by_keys}})
Contributor:

This was news to me when I learned it: altering the table's clustering keys doesn't actually recluster the table. You'd either need to run alter table ... recluster (which is now deprecated), or you'd need to enable automatic reclustering for these changes to take effect.

bastienboutonnet (Contributor, Author) commented Jul 6, 2019:

Damn, that's awkward. Indeed, I just checked the status of the test table I created and automatic_clustering is off.

The only option is running an alter table foo.bar resume recluster, but that will only work if a cluster by key has been provided. So doing clustering solely via order by at table creation will not engage automatic housekeeping and the table will become stale eventually. Again, I may have gotten this completely wrong.

This kind of leaves us with the following options:

  • mimic clustering on tables that you create with order by;
  • for incremental tables, hope you're using a fairly monotonic cluster by key, because most likely Snowflake won't ever do any housekeeping for you later on.

Is that correct?

Contributor:

I didn't realize that either! I thought automatic clustering needed to be enabled explicitly.

So, that's kind of good news for us - it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed. This should never happen for table models since dbt doesn't run DML against those models.

For incremental models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!

In my experience, automatic reclustering can be pretty expensive! It might be worth adding a config to suspend automatic reclustering as shown in the docs. I'm not exactly sure if there's a use case for setting clustering keys and also suspending automatic reclustering... I think this is an area where I'd need to learn a little bit more, or we'd need to pull in a real expert (like @jtcohen6) :)

Contributor (Author):

No no, you were right, and maybe I expressed myself badly. Automatic clustering does need to be enabled explicitly to have Snowflake monitor and perform housekeeping-like reclustering, from what I understand and have seen in operation on our db.

I have no idea what happens if you give clustering keys but leave automatic reclustering off. Would it just trigger on DMLs? In that case, yeah, we don't need to worry. We could alter the keys and bank on the DMLs to trigger reclustering.

We have a bunch of tables on which automatic clustering is on, clustered by created_at-style timestamps, and into which we load incrementally. At first run, automatic clustering was "expensive", but on subsequent updates it got a lot cheaper since it only had to do incremental housekeeping. Or at least that's what we see. But my feeling is that if autoclustering is off, having a cluster by key isn't going to do anything, which means we'd also have to make sure autoclustering is turned on for the tables we want it on.

But yeah I have no idea if I'm right and I'd love to hear from someone who knows Snowflake in and out a bit more.

Contributor (Author):

To be honest, re-reading both https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html and https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html confuses me. Somehow it's not clear whether you'll ever need automatic clustering, but at the same time it seems strange that Snowflake would keep things clustered without it turned on...

Contributor (Author):

@drewbanin I just checked with my boss, who has been using clustering a bit in Snowflake, and his take is that if you want to benefit from incremental reclustering, you need to specify the clustering keys and turn on reclustering on the table. The order by will take care of loading the data in the right order, but if you do any further loading and you have clustering keys such as user_id, created_at, incremental loads won't take care of placing user_id correctly, so your data would become progressively unclustered and you'd have to do a full-refresh with the order by to rearrange things.

I just made some changes to reflect this logic. Note that we only need to turn on clustering for incremental tables. For simple tables, since they are always fully refreshed, the order by is sufficient.

Let me know if you find conflicting information but, to me, what I did seems to make sense.

jtcohen6 (Contributor) commented Jul 23, 2019:

Sorry for the delay in weighing in! I don't think I'm going to say anything novel, but in the process of saying it, I lend my endorsement to this work, and significant appreciation to @bastienboutonnet.

I may be misremembering—because the last time I dived deeply into this was several months ago when Manual Reclustering was un-deprecated and Automatic Reclustering was still a preview feature—but I believe my memory echoes @drewbanin here:

> it means that as long as we specify the clustering keys for the table, Snowflake will automatically recluster it as needed.

If Automatic Reclustering is enabled for a Snowflake account, as soon as a table is defined with cluster keys, Snowflake will start keeping tabs on it.

> This should never happen for table models since dbt doesn't run DML against those models.

We found out that both DML (inserting/updating/deleting rows from an existing clustered table) and DDL (creating a new table with cluster keys) could trigger automatic clustering, if Snowflake identified that the newly created table was not optimally clustered, based on the keys defined for it post hoc. This ended up being the primary expense of our experimentation :) and it's avoided, as this PR does, by creating the tables with an equivalent order by. That's this line from the top of the docs:

> Note that, after a clustered table is defined, reclustering does not necessarily start immediately. Snowflake only reclusters a clustered table if it will benefit from the operation.

The correctly ordered table has nothing to benefit from the operation; automatic clustering is still enabled, but finds itself with nothing to do.

> For incremental models, each subsequent invocation of dbt should trigger a recluster of the table, which sounds like the right behavior to me!

Right on. As long as the original table is properly ordered, as are any subsequent temp tables for merging, the subsequent automatic clustering can be a fairly performant, fairly cheap interleaving.

So my only practical thought here is that, per my understanding of late 2018, the automatic_clustering config and alter table {{relation}} resume recluster DDL is not actually needed. When automatic clustering is turned on for the account, any table with cluster keys defined is presumed to have auto-clustering resumed until it is explicitly suspended.

That said, it's a nice idea for dbt to be verbose-and-beyond in its operations, especially ones this subtle, and so I've similarly included the post-hook in past implementations.

bastienboutonnet (Contributor, Author) commented Jul 31, 2019:

Thanks for your reply @jtcohen6 ❤. I think your comments make sense and indeed confirm most of the things we wanted to check. Regarding turning clustering on at the table level (after, of course, having it enabled for the account), we found that if we do not run it, the automatic_clustering column for a table for which we have given clustering keys is marked as OFF, which suggests that automatic clustering may not be performed as you suggest. I'm curious how you had tested it previously.

As you say, we don't risk a lot by being extra explicit, and I actually like to be extra explicit, but I'm curious to really make sure I understand before just hiding behind an "it won't hurt" logic. Always better not to implement stuff that's unnecessary.

Contributor:

@bastienboutonnet That could be! That morsel of knowledge was one I had picked up when this feature was in private release many months ago, and it struck me as odd at the time that automatic_clustering would be enabled on tables, by default, as soon as a cluster key was defined. So I'm not surprised and actually pleased to hear that the switch has flipped.

Snowflake's docs on the matter are still ambiguous IMO, so I think we err well on the side of being extra un-ambiguous.

Contributor (Author):

Awesome! Then I guess it all makes sense. Yes, the documentation around that feature is hella ambiguous. I should raise it with our technical rep or on the Snowflake community, as I think it probably leaves more than us confused.

bastienboutonnet changed the title from Implement Snowflake clustering to Implement Snowflake clustering [WIP] on Jul 6, 2019
drewbanin (Contributor):

I think you'll want to add cluster_by (and also automatic_clustering) to the list of AdapterSpecificConfigs for Snowflake.

dbt uses this list of configs to apply settings from the models: section of dbt_project.yml. After adding the configs to this list, any cluster_by and automatic_clustering configs specified in dbt_project.yml will be applied to the relevant models in the project.
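A minimal sketch of that change (assuming AdapterSpecificConfigs is a simple set of config names in plugins/snowflake/dbt/adapters/snowflake/impl.py; the exact shape in the codebase may differ):

```python
# Config names the Snowflake adapter accepts from dbt_project.yml and
# model config blocks, in addition to the globally supported ones.
AdapterSpecificConfigs = frozenset({
    "transient",             # already supported by snowflake__create_table_as
    "cluster_by",            # new: column(s) to cluster the table by
    "automatic_clustering",  # new: whether to resume automatic reclustering
})
```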

@@ -1,14 +1,33 @@
{% macro snowflake__create_table_as(temporary, relation, sql) -%}
{%- set transient = config.get('transient', default=true) -%}
{# TODO: Add support for more than one key #}
{%- set cluster_by_keys = config.get('cluster_by') -%}
Contributor:

I like using a pattern like this which allows users to either specify:

  1. a string
  2. a list of strings

It's a tiny bit of extra work to marshal a string into a list that contains one string, but it's worth the UX benefit IMO!
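In Python terms (the real code is a Jinja macro; the function name here is illustrative), the marshaling looks like this. Note the join happens after the string check, so list input works too:

```python
def normalize_cluster_by(cluster_by):
    # Accept either a single column name ("user_id") or a list of names
    # (["user_id", "created_at"]) and return a comma-separated string
    # suitable for `cluster by (...)` / `order by (...)`.
    if cluster_by is None:
        return None
    if isinstance(cluster_by, str):
        cluster_by = [cluster_by]
    return ", ".join(cluster_by)
```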

{{ sql }}
{%- endif %}
);
{% if cluster_by_keys is not none and is_incremental -%}
Contributor:

I don't think we need this check on is_incremental here. Instead, I think it might make more sense to add a Snowflake-specific config called automatic_clustering which controls whether or not dbt enables automatic clustering. In particular, I think this will be useful in conjunction with dev/prod environments. It's probably not super desirable to always enable automatic clustering in dev (it can be expensive!). I think it's less important to check that the model is incremental, as automatic clustering simply won't run for non-incremental models (like tables) because dbt never runs DML against them!

You buy this?

I'm not even sure that we need to add this check -- if the user specifies a cluster_by for a non-incremental model (like a table), then there shouldn't be any harm in turning on automatic clustering.

I am open to adding another config for Snowflake, like automatic_clustering which controls specifically when the alter table {{relation}} resume recluster; query is executed.

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

I think I buy it! It makes sense that if there is no DML then no automatic clustering should be triggered, but I'm not sure whether it wouldn't actually try to cluster things when you turn it on, which could be costly. We'd have to have a Snowflake specialist confirm this.

So, yes, I think putting alter table {{relation}} resume recluster; behind a config is a good safety net, and it gives transparency to the user about what is going on.

Curious: how would you implement an override of automatic clustering if you were running your model against --target: dev, for example? You're hinting that that's configurable in dbt_project.yml? Or would you put an override in your profiles.yml?

Contributor:

(we discussed on Slack, but for posterity:)

Yeah - that’s a separate issue that would be really good to address soon. When dbt evaluates the models: block in the dbt_project.yml file, it doesn’t include the specified target in the context. So, it would be great to do:

models:
   automatic_clustering: "{{ target.name == 'prod' }}"

but unfortunately, target is going to evaluate to none in this context currently, so that won’t work at the moment. Unsure if we have an issue for this at the moment, but I can create one

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

@drewbanin I just implemented the changes you suggested. I think they make a lot of sense. Have a look and let me know whether it makes sense.

I just realised my Python formatter (black) caused a few stylistic diffs. If you would rather not have these changes I can go back and undo them.

I'm also gonna go ahead and turn the PR into a "ready for review" state, as I think we're not going to debate the implementation logic much, so it should be a proper review.

Also, the first time I updated my fork I brought master in. Let me know if you'd like me to cherry pick those commits or something like that

bastienboutonnet marked this pull request as ready for review July 20, 2019 10:43
bastienboutonnet changed the title from Implement Snowflake clustering [WIP] to Implement Snowflake clustering on Jul 20, 2019
drewbanin changed the base branch from dev/wilt-chamberlain to dev/0.14.1 July 20, 2019 16:22

drewbanin (Contributor) left a comment:

This is looking great @bastienboutonnet! Really nice work :)

One tiny change requested, but this should be good to merge after that!

@jtcohen6 any chance you'll be able to take a quick peek at this on Monday?

bastienboutonnet (Contributor, Author) commented Jul 20, 2019:

Awesome! Thanks for your help with that one too! (gosh where is the fidget spinner emoji when we need it)

drewbanin (Contributor):

Looks like a couple of pep8 tests failed here! You can run make test-unit locally to kick off these pep8 tests - let me know if you have any questions!

plugins/snowflake/dbt/adapters/snowflake/impl.py:13:80: E501 line too long (91 > 79 characters)
plugins/snowflake/dbt/adapters/snowflake/impl.py:23:80: E501 line too long (84 > 79 characters)
plugins/snowflake/dbt/adapters/snowflake/impl.py:24:80: E501 line too long (84 > 79 characters)

drewbanin (Contributor):

@bastienboutonnet the tests are all passing here! I'm going to just do a couple more sanity checks on this over the weekend, but I think we should be a-ok to merge this early next week. If everything looks good, it'll ship for our imminent 0.14.1 release.

Thanks also to @jtcohen6 for imparting his wisdom here!

🎉 ❄️ ✨

{%- set enable_automatic_clustering = config.get('automatic_clustering', default=false) -%}
{%- if cluster_by_keys is not none and cluster_by_keys is string -%}
{%- set cluster_by_keys = [cluster_by_keys] -%}
{%- set cluster_by_string = cluster_by_keys|join(", ")-%}
Contributor:

Just did one last scan here - I think this won't work properly if cluster_by_keys is provided as a list. If that happens, then the code in this branch won't be executed. I think we should move this line outside of the if block here.

bastienboutonnet (Contributor, Author) commented Aug 16, 2019:

Ohh dear... good catch! I can certainly fix that! BUT I just realised I cleaned up my fork, so I don't actually know how I can edit this on my side. I had to fork dbt again as we were experimenting and forgot this was still pending. Do you know what would be the best way to go? I could probably fork again, or do a new PR, but then we'd lose the trace of our discussion (although we could refer to it even if the PR is closed). Let me know.

Contributor:

oh dear indeed! I've never seen this before!

There's a cool thing you can do on GitHub - if you add .patch to the end of a PR url, you'll get a "patch" for the change. I think you should be able to make a clean branch off of dev/0.14.1, then do something like:

curl https://patch-diff.githubusercontent.com/raw/fishtown-analytics/dbt/pull/1591.patch > 1591.patch
git apply 1591.patch

If git says that it can't apply the patch, you can try applying the patch from the commit 9e36ebd.

Let me know how it goes!

bastienboutonnet (Contributor, Author) commented Aug 17, 2019:

Closing this one due to having had to create a new one after deleting my fork... New PR: #1689
