Use md5_binary() instead of md5() for hash function on Snowflake #303

Closed
ghost opened this issue Dec 2, 2020 · 9 comments
Labels: bug, Stale, triage

Comments

ghost commented Dec 2, 2020

Describe the bug

Reportedly, Snowflake cannot micro-partition effectively on the hex string generated by md5(), and performs significantly better when joining on md5_binary() values.
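For reference, a minimal sketch of the two functions (table and column names here are made up for illustration, and columns are assumed to be strings):

```sql
-- md5() returns a 32-character hex string (VARCHAR); md5_binary() returns a 16-byte BINARY.
-- The proposal is for the surrogate key macro to emit the binary form on Snowflake,
-- since joining on the smaller binary value is reportedly friendlier to micro-partitioning.
select
    md5(order_id || '-' || customer_id)        as string_key,  -- 32-char hex VARCHAR
    md5_binary(order_id || '-' || customer_id) as binary_key   -- 16-byte BINARY
from orders;
```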

Steps to reproduce

Join a few large tables on surrogate keys in Snowflake.

Expected results

Performs comparably to a join on an integer id, or at the very least as well as a join across multiple columns (for instance, an id and a timestamp).

Actual results

Performs worse than joining on multiple columns, which give Snowflake information on partitioning.

System information

Which database are you using dbt with?

  • snowflake
ghost added the bug and triage labels Dec 2, 2020

clrcrl commented Dec 3, 2020

Hey hey @grantwinship — do you know if this would be a breaking change? i.e. does md5(my_col) produce a different result to md5_binary(my_col)? I'm assuming it does

ghost commented Dec 3, 2020

Yea - anybody who updated would have to full-refresh all their tables that use surrogate keys

clrcrl commented Dec 3, 2020

dangggg okay that's going to be quite the breaking change, not sure exactly the best way to roll it out :(

ghost commented Dec 4, 2020

Yep yep - I have some ideas around that. I want to test it on a big data set first and make sure it's enough of an improvement to warrant it at all, though. Open to any suggestions on that front if something good comes to mind.

clausherther commented

Why not make this optional via an argument? That way the new implementation could use the binary key and older models wouldn't break. The old string version could be deprecated at some point by switching the default to binary.
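A rough sketch of what that argument could look like, in dbt-style Jinja (the macro shape and the `binary` flag name are illustrative, not an agreed API, and the real dbt_utils macro differs in detail):

```sql
{%- macro surrogate_key(field_list, binary=false) -%}
    {#- Build "coalesce(cast(col as varchar), '')" for each column, joined with a delimiter -#}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {%- do fields.append("coalesce(cast(" ~ field ~ " as varchar), '')") -%}
        {%- if not loop.last -%}{%- do fields.append("'-'") -%}{%- endif -%}
    {%- endfor -%}
    {%- if binary -%}
        md5_binary({{ fields | join(' || ') }})  {#- opt-in binary behaviour -#}
    {%- else -%}
        md5({{ fields | join(' || ') }})         {#- current behaviour, default -#}
    {%- endif -%}
{%- endmacro -%}
```

With that shape, existing models calling `{{ surrogate_key(['order_id', 'customer_id']) }}` would be unchanged; only models that opt in with `binary=true` would need a full refresh.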

ghost commented Dec 11, 2020

yep yep - i have some ideas around that

(that was the main idea) 😄 @clausherther -- we're on the same page 👍🏻

jayziic commented Dec 30, 2021

I think part of the Snowflake performance issue is that the surrogate_key() function generates a random hashed value. Intentionally so.

IIUC, especially for these large tables, using a randomised key for joining is not the way to go: any natural data ordering that exists is lost, leaving Snowflake with little idea of where to find the rows while performing the join. This results in what is known as "shuffling".

Thus I think it's worth thinking about the purpose of this macro.

  • If it's to be a hashed, randomised value for privacy purposes, it currently fulfils this.
  • If it's to be a short-enough unique ID for joining across objects, it's probably not the most performant choice, for the shuffling reason above, at least in Snowflake.

In the second case, it may be better to consider a way of calculating a unique ID that retains any natural ordering that would benefit Snowflake's computation. What that looks like I'm not sure; I can only think of concatenation, but that is not "short-enough" IMO. And I believe md5_binary() is still randomised, so the shuffling issue would likely still degrade performance.
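To make the contrast concrete, a hypothetical sketch (table and column names invented for illustration) of a randomised hash key versus a concatenated key that keeps natural ordering:

```sql
-- 1. Randomised hash key, roughly what surrogate_key() produces today:
--    values are uniformly distributed, so the key carries no ordering that
--    Snowflake's clustering metadata can exploit during a join.
select md5(coalesce(cast(order_id as varchar), '') || '-' ||
           coalesce(cast(order_date as varchar), '')) as hashed_key
from orders;

-- 2. Concatenated key: longer, but it preserves the ordering of the underlying
--    columns, which keeps related rows in nearby micro-partitions.
select coalesce(cast(order_id as varchar), '') || '-' ||
       coalesce(cast(order_date as varchar), '') as concatenated_key
from orders;
```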

This article talks about ordering and distributed id generators:
ref: https://medium.com/geekculture/how-do-tech-giants-design-and-implement-a-distributed-id-generator-bd618803035f

github-actions bot commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label Aug 17, 2023
github-actions bot commented

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned Aug 24, 2023