Use md5_binary() instead of md5() for hash function on Snowflake #303

Closed
ghost opened this issue Dec 2, 2020 · 9 comments
Labels: bug, Stale, triage

Comments

ghost commented Dec 2, 2020

Describe the bug

Reportedly, Snowflake cannot micro-partition effectively on the hex string generated by md5(), and performs significantly better when joining on md5_binary() values.
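For reference, a minimal sketch of the two functions (table and column names here are made up for illustration, and columns are assumed to be strings):

```sql
-- md5() returns a 32-character hex string (VARCHAR); md5_binary() returns a 16-byte BINARY.
-- The proposal is for the surrogate key macro to emit the binary form on Snowflake,
-- since joining on the smaller binary value is reportedly friendlier to micro-partitioning.
select
    md5(order_id || '-' || customer_id)        as string_key,  -- 32-char hex VARCHAR
    md5_binary(order_id || '-' || customer_id) as binary_key   -- 16-byte BINARY
from orders;
```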

Steps to reproduce

Join a few large tables on surrogate keys in Snowflake.

Expected results

Performs comparably to a join on an integer id, or at the very least as well as a join across multiple columns (for instance, an id and a timestamp).

Actual results

Performs worse than joining on multiple columns, which give Snowflake information on partitioning.

System information

Which database are you using dbt with?

  • snowflake
ghost added the bug and triage labels Dec 2, 2020

clrcrl commented Dec 3, 2020

Hey hey @grantwinship — do you know if this would be a breaking change? i.e. does md5(my_col) produce a different result to md5_binary(my_col)? I'm assuming it does

ghost commented Dec 3, 2020

Yea - anybody who updated would have to full-refresh all their tables that use surrogate keys

clrcrl commented Dec 3, 2020

dangggg okay that's going to be quite the breaking change, not sure exactly the best way to roll it out :(

ghost commented Dec 4, 2020

Yep yep - I have some ideas around that. I want to test it on a big data set first and make sure it's enough of an improvement to warrant it at all, though. Open to any suggestions on that front if something good comes to mind.

clausherther commented

Why not make this optional via an argument? That way the new implementation could use the binary key and older models wouldn't break. The old string version could be deprecated at some point by switching the default to binary.
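A rough sketch of what that argument could look like, in dbt-style Jinja (the macro shape and the `binary` flag name are illustrative, not an agreed API, and the real dbt_utils macro differs in detail):

```sql
{%- macro surrogate_key(field_list, binary=false) -%}
    {#- Build "coalesce(cast(col as varchar), '')" for each column, joined with a delimiter -#}
    {%- set fields = [] -%}
    {%- for field in field_list -%}
        {%- do fields.append("coalesce(cast(" ~ field ~ " as varchar), '')") -%}
        {%- if not loop.last -%}{%- do fields.append("'-'") -%}{%- endif -%}
    {%- endfor -%}
    {%- if binary -%}
        md5_binary({{ fields | join(' || ') }})  {#- opt-in binary behaviour -#}
    {%- else -%}
        md5({{ fields | join(' || ') }})         {#- current behaviour, default -#}
    {%- endif -%}
{%- endmacro -%}
```

With that shape, existing models calling `{{ surrogate_key(['order_id', 'customer_id']) }}` would be unchanged; only models that opt in with `binary=true` would need a full refresh.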

ghost commented Dec 11, 2020

yep yep - i have some ideas around that

(that was the main idea) 😄 @clausherther -- we're on the same page 👍🏻

jayziic commented Dec 30, 2021

I think part of the Snowflake performance issue is that the surrogate_key() function generates a random hashed value. Intentionally so.

IIUC, especially for these large tables, using a randomised key for joining is not the way to go: any natural data ordering that exists is lost, leaving Snowflake with little idea of where to find the rows while performing the join. This results in what is known as "shuffling".

Thus I think it's worth thinking about the purpose of this macro.

  • If it's to be a hashed, randomised value for privacy purposes, it currently fulfils this.
  • If it's to be a short-enough unique ID for joining across objects, it's probably not the most performant choice, for the shuffling reason above, at least in Snowflake.

In the second case, it may be better to consider a way of calculating a unique ID that retains any natural ordering that would benefit Snowflake's computation. What that looks like I'm not sure; I can only think of concatenation, but that is not "short-enough" IMO. And I believe md5_binary() is still randomised, so the shuffling issue would likely still degrade performance.
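To make the contrast concrete, a hypothetical sketch (table and column names invented for illustration) of a randomised hash key versus a concatenated key that keeps natural ordering:

```sql
-- 1. Randomised hash key, roughly what surrogate_key() produces today:
--    values are uniformly distributed, so the key carries no ordering that
--    Snowflake's clustering metadata can exploit during a join.
select md5(coalesce(cast(order_id as varchar), '') || '-' ||
           coalesce(cast(order_date as varchar), '')) as hashed_key
from orders;

-- 2. Concatenated key: longer, but it preserves the ordering of the underlying
--    columns, which keeps related rows in nearby micro-partitions.
select coalesce(cast(order_id as varchar), '') || '-' ||
       coalesce(cast(order_date as varchar), '') as concatenated_key
from orders;
```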

This article talks about ordering and distributed id generators:
ref: https://medium.com/geekculture/how-do-tech-giants-design-and-implement-a-distributed-id-generator-bd618803035f

github-actions bot commented

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the Stale label Aug 17, 2023
github-actions bot commented

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions bot closed this as not planned Aug 24, 2023