CTE Query Optimization #45
Comments
@belaidcherfa Wow, this is awesome! Thanks for all the work you've put into researching this ❤️ I have totally overlooked the performance part and I really appreciate your help. Implementing this is straightforward. Do you want to have a crack at it and create a PR? I can also do it, no problem.
I enjoyed doing the tests. I'm going to take some time over the next few weeks to work on it and try to make the most of the optimizations offered by Snowflake in particular.
Your proposed improvement has now been implemented and included in the 0.4.1 release 🎉
I'm excited to see what other Snowflake optimizations you come up with! 💯
Hello,
I use your package for profiling and I find it great! Thanks for your contribution.
I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.
What is currently being done
A "column_profiles" CTE performs a UNION ALL across all columns, with each branch reading from the source table (DB.SCHEMA.TABLE).
When analyzing the Snowflake query profile, it performs as many table scans as there are columns. For large volumes, a small warehouse is not enough and the query fails with a memory error.
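The package's actual SQL is not shown in this issue, but the shape being described is roughly the following (a hedged reconstruction; `DB.SCHEMA.TABLE`, `COL_A`, `COL_B`, and the profiled metric are placeholders):

```sql
-- Assumed shape of the current query: each branch of the UNION ALL
-- reads the base table again, so Snowflake performs one table scan
-- per profiled column.
WITH column_profiles AS (
    SELECT 'COL_A' AS column_name, COUNT(DISTINCT COL_A) AS distinct_count
    FROM DB.SCHEMA.TABLE
    UNION ALL
    SELECT 'COL_B', COUNT(DISTINCT COL_B)
    FROM DB.SCHEMA.TABLE
    -- ... one branch per column, each with its own scan of DB.SCHEMA.TABLE
)
SELECT * FROM column_profiles;
```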
We can optimize this by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, hold it in memory, and reuse it.
I have metrics: the amount of I/O is divided by roughly 3 to 4, and in my case the number of scanned partitions drops as well.
PS: The result cache does not affect the original runtime comparison, since I disabled the session cache with `ALTER SESSION SET USE_CACHED_RESULT = FALSE;`. I also restarted the warehouses between each execution (using several warehouses over different time intervals). I can't fully explain it, but I suspect a shared cache effect related to the UNION ALL.
On execution times there is not a big difference, but in my tests on large tables the first (unoptimized) query does not succeed at all because of the number of columns (1000) and the size of the table.
I remain at your disposal if you need more information.
Code snippet
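The original snippet was not captured in this extract; a hedged sketch of the proposed rewrite, with `DB.SCHEMA.TABLE` and the column names as placeholders, might look like:

```sql
-- Assumed sketch of the proposed rewrite: a first CTE reads the source
-- table once, and every per-column branch references that CTE instead
-- of the base table, so Snowflake scans the table a single time.
WITH source_data AS (
    SELECT * FROM DB.SCHEMA.TABLE
),
column_profiles AS (
    SELECT 'COL_A' AS column_name, COUNT(DISTINCT COL_A) AS distinct_count
    FROM source_data
    UNION ALL
    SELECT 'COL_B', COUNT(DISTINCT COL_B)
    FROM source_data
    -- ... one branch per column, all reading from source_data
)
SELECT * FROM column_profiles;
```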