
CTE Query Optimization #45

Closed
belaidcherfa opened this issue May 17, 2022 · 4 comments · Fixed by #48


@belaidcherfa

Hello,
I use your package for profiling and I find it great! Thanks for your contribution.
I use Snowflake for the compute part, and I noticed (and tested) a potential optimization.

What is currently being done: a "column_profiles" CTE performs a UNION ALL for each column, and each branch reads from the source table (DB.SCHEMA.TABLE).

When analyzing the Snowflake query profile, the query performs as many table scans as there are columns. For large volumes, a small warehouse is not enough and the query fails with a memory error.

We can optimize this by adding a first CTE that reads the source table in full, then referencing that CTE for each column. This forces Snowflake to read the table only once, keep it in memory, and reuse it.
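To illustrate the idea (a simplified sketch with hypothetical column names, not the actual query the package generates):

```sql
-- Before: each UNION ALL branch scans the source table,
-- so Snowflake performs one table scan per profiled column.
with column_profiles as (
    select 'col_a' as column_name, count(distinct col_a) as distinct_count
    from DB.SCHEMA.TABLE
    union all
    select 'col_b' as column_name, count(distinct col_b) as distinct_count
    from DB.SCHEMA.TABLE
)
select * from column_profiles;

-- After: a first CTE reads the source table once and every
-- branch references it, so the table is scanned a single time.
with source_data as (
    select * from DB.SCHEMA.TABLE
),
column_profiles as (
    select 'col_a' as column_name, count(distinct col_a) as distinct_count
    from source_data
    union all
    select 'col_b' as column_name, count(distinct col_b) as distinct_count
    from source_data
)
select * from column_profiles;
```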

I have metrics: the amount of I/O (and the number of partitions scanned) is divided by roughly 3 to 4. In my case:

Before: [screenshot 2022-05-17_13h21_54]

After: [screenshot 2022-05-17_13h11_07]

PS: the result cache doesn't explain the original timings, since I disabled it for the session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
I also restarted the warehouses between each execution (using several warehouses on different time intervals). I can't fully explain it, but I suspect there is some shared caching across the UNION ALL branches.
On execution times there is not a big difference, but I have tests on large tables where the original query does not succeed at all because of the number of columns (1000) and the size of the table.
I remain at your disposal if you need more information.

Code snippet:
[screenshot 2022-05-17_13h35_23]

@stumelius
Contributor

@belaidcherfa Wow, this is awesome! Thanks for all the work you've put into researching this ❤️ I have totally overlooked the performance part and I really appreciate your help.

Implementing this is straightforward. Do you want to have a crack at it and create a PR? I can also do it, no problem.

@belaidcherfa
Author

I enjoyed doing the tests.
I'm not used to opening PRs, so if you can do it, that would be great.
Here is the modification I made directly in the package; it works on Snowflake. I ran it against several models (different types, such as tables and views, and different sizes in terms of columns and data).
get_profile_opt.txt

I added this here:
[image]

And changed this:
[image]

I'm going to take some time over the next few weeks to work on it, and in particular to try to make the most of the optimizations Snowflake offers.
In terms of caches and metadata: get the row count separately, and retrieve the min and max separately (in separate CTEs Snowflake takes them directly from the services-layer metadata and does not query the storage layer, so less compute).
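That metadata idea could be sketched like this (a hypothetical restructuring with made-up column names, not the package's actual query; Snowflake can answer simple unfiltered COUNT(*)/MIN/MAX aggregates from micro-partition metadata without touching the storage layer):

```sql
-- Hypothetical sketch: isolate metadata-served aggregates
-- in their own CTEs so they avoid the storage layer.
with row_counts as (
    -- an unfiltered COUNT(*) is answered from table metadata
    select count(*) as row_count
    from DB.SCHEMA.TABLE
),
min_max as (
    -- unfiltered MIN/MAX on a plain column can be served
    -- from micro-partition metadata
    select min(col_a) as min_col_a, max(col_a) as max_col_a
    from DB.SCHEMA.TABLE
)
select r.row_count, m.min_col_a, m.max_col_a
from row_counts r
cross join min_max m;
```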

@stumelius stumelius linked a pull request May 17, 2022 that will close this issue
@stumelius
Contributor

Your proposed improvement has now been implemented and included in the 0.4.1 release 🎉

@stumelius
Contributor

I'm excited to see what other Snowflake optimizations you come up with! 💯
