
CTE Query Optimization #45

Closed
belaidcherfa opened this issue May 17, 2022 · 4 comments · Fixed by #48


@belaidcherfa

Hello,
I use your package for profiling and I find it great! Thanks for your contribution.
I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.

What is currently being done: a "column_profiles" CTE performs a UNION ALL for each column, and each branch reads from the source table (DB.SCHEMA.TABLE).

When analyzing the Snowflake query profile, the query performs as many table scans as there are columns. For large volumes, a small warehouse is not enough and the query fails with a memory error.

We can optimize this by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, keep it in memory, and reuse it.
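To illustrate the idea (a simplified sketch with hypothetical column names, not the actual query the package generates):

```sql
-- Before: each UNION ALL branch scans the source table,
-- so Snowflake performs one table scan per profiled column.
with column_profiles as (
    select 'col_a' as column_name, count(distinct col_a) as distinct_count
    from DB.SCHEMA.TABLE
    union all
    select 'col_b' as column_name, count(distinct col_b) as distinct_count
    from DB.SCHEMA.TABLE
)
select * from column_profiles;

-- After: a first CTE reads the source table once and every
-- branch references it, so the table is scanned a single time.
with source_data as (
    select * from DB.SCHEMA.TABLE
),
column_profiles as (
    select 'col_a' as column_name, count(distinct col_a) as distinct_count
    from source_data
    union all
    select 'col_b' as column_name, count(distinct col_b) as distinct_count
    from source_data
)
select * from column_profiles;
```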

I have metrics: the amount of I/O (and the number of partitions scanned) is divided by roughly 3 to 4. In my case:

Before: [screenshot 2022-05-17_13h21_54]

After: [screenshot 2022-05-17_13h11_07]

PS: the result cache doesn't explain the original timings, since I disabled it for the session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
I also restarted the warehouses between each execution (using several warehouses on different time intervals). I can't fully explain it, but I suspect there is some shared caching across the UNION ALL branches.
On execution times there is not a big difference, but I have tests on large tables where the original query does not succeed at all because of the number of columns (1000) and the size of the table.
I remain at your disposal if you need more information.

Code snippet:
[screenshot 2022-05-17_13h35_23]

@stumelius
Contributor

@belaidcherfa Wow, this is awesome! Thanks for all the work you've put into researching this ❤️ I have totally overlooked the performance part and I really appreciate your help.

Implementing this is straightforward. Do you want to have a crack at it and create a PR? I can also do it, no problem.

@belaidcherfa
Author

I enjoyed doing the tests.
I'm not used to opening PRs, so if you can do it, that would be great.
Here is the modification I made directly in the package; it works on Snowflake. I ran it against several models (different types, such as tables and views, and different sizes in terms of columns and data).
get_profile_opt.txt

I added this here:
[image]

And changed this:
[image]

I'm going to take some time over the next few weeks to work on it, and in particular to try to make the most of the optimizations Snowflake offers.
In terms of caches and metadata: get the row count separately, and retrieve the min and max separately (in separate CTEs Snowflake takes them directly from the services-layer metadata and does not query the storage layer, so less compute).
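That metadata idea could be sketched like this (a hypothetical restructuring with made-up column names, not the package's actual query; Snowflake can answer simple unfiltered COUNT(*)/MIN/MAX aggregates from micro-partition metadata without touching the storage layer):

```sql
-- Hypothetical sketch: isolate metadata-served aggregates
-- in their own CTEs so they avoid the storage layer.
with row_counts as (
    -- an unfiltered COUNT(*) is answered from table metadata
    select count(*) as row_count
    from DB.SCHEMA.TABLE
),
min_max as (
    -- unfiltered MIN/MAX on a plain column can be served
    -- from micro-partition metadata
    select min(col_a) as min_col_a, max(col_a) as max_col_a
    from DB.SCHEMA.TABLE
)
select r.row_count, m.min_col_a, m.max_col_a
from row_counts r
cross join min_max m;
```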

@stumelius stumelius linked a pull request May 17, 2022 that will close this issue
@stumelius
Contributor

Your proposed improvement has now been implemented and included in the 0.4.1 release 🎉

@stumelius
Contributor

I'm excited to see what other Snowflake optimizations you come up with! 💯
