
[SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth #43854

Open
wants to merge 161 commits into base: master
Conversation

@ahshahid commented Nov 16, 2023

What changes were proposed in this pull request?

This PR attempts to keep the depth of the LogicalPlan tree unchanged when columns are added, dropped, or renamed.
This is done via a new rule called EarlyCollapseProject.
It is applied after analysis has happened, but before the analyzed plan is assigned to the holder variable in QueryExecution / DataFrame.

The EarlyCollapseProject code does the following:

  1. If the incoming plan is Project1 -> Project2 -> X, it collapses the two projects into one, so that the final plan looks like Project -> X.

  2. If the incoming plan is of the form Project1 -> Filter1 -> Filter2 ... FilterN -> Project2 -> X, it collapses Project1 and Project2 as
    Project -> Filter1 -> Filter2 ... FilterN -> X.
    Please note that in this case it is as if Project2 is pulled up for collapse, rather than vice versa.
    The reason is that Project1 may contain behaviour (such as a UDF) which cannot handle certain data that the filter chain would otherwise remove. If Project1 were pushed below the filter chain, the unfiltered rows could cause failures. Existing Spark UDF tests are sensitive to this.

  3. EarlyCollapseProject is NOT applied if:
    a) either of the incoming Project nodes (Project1 or Project2) has the tag LogicalPlan.PLAN_ID_TAG set, which implies the plan comes from a Spark client. This case is not handled as of now, because for clients the subsequent resolutions are tied to a direct mapping of the tag ID associated with each Project, and removing any Project via collapse breaks that code;
    b) Project2 contains any UserDefinedExpression or non-deterministic expression, or Project2's child is a Window node. Non-deterministic expressions are excluded to keep the functionality intact, since collapsing the projects would replicate them. Similarly, for a UserDefinedExpression, or a Window node below Project2, the collapse is avoided because replication would cause re-evaluation of expensive code.
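The collapse in steps 1 and 2 amounts to substituting the lower project's alias definitions into the upper project's expressions. Below is a minimal, hypothetical sketch in Python; the node classes (Attr, Alias, Add, Project) are illustrative stand-ins for Catalyst's expression and plan classes, not Spark's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:          # reference to a named column
    name: str

@dataclass(frozen=True)
class Alias:         # expr AS name
    expr: object
    name: str

@dataclass(frozen=True)
class Add:           # left + right
    left: object
    right: object

@dataclass
class Project:
    exprs: list      # list of Attr / Alias
    child: object    # child plan node ("X" used as a placeholder below)

def substitute(expr, defs):
    """Replace attribute references with the lower project's defining expressions."""
    if isinstance(expr, Attr):
        return defs.get(expr.name, expr)
    if isinstance(expr, Alias):
        return Alias(substitute(expr.expr, defs), expr.name)
    if isinstance(expr, Add):
        return Add(substitute(expr.left, defs), substitute(expr.right, defs))
    return expr      # literals etc. pass through unchanged

def collapse(upper: Project, lower: Project) -> Project:
    # map each output name of the lower project to the expression producing it
    defs = {e.name: (e.expr if isinstance(e, Alias) else e) for e in lower.exprs}
    out = []
    for e in upper.exprs:
        if isinstance(e, Attr):
            d = defs.get(e.name, e)
            # keep the output name stable by re-aliasing inlined expressions
            out.append(d if isinstance(d, Attr) else Alias(d, e.name))
        else:
            out.append(substitute(e, defs))
    return Project(out, lower.child)

# Project(b, a+1 AS c) over Project(a, a+a AS b) collapses into a single Project over X.
lower = Project([Attr("a"), Alias(Add(Attr("a"), Attr("a")), "b")], "X")
upper = Project([Attr("b"), Alias(Add(Attr("a"), 1), "c")], lower)
merged = collapse(upper, lower)
```

Note how this also shows why non-deterministic expressions are excluded: if "b" were defined by a non-deterministic expression, inlining it into every use would re-evaluate it.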

  4. The other important thing EarlyCollapseProject does is to store, in the collapsed Project node, the attributes which get dropped by the collapse and thereby lose their presence in the plan completely.
    This is needed so that dropped attributes can be resurrected, if the need arises, in order to do resolution correctly.
    The dropped attributes are stored in a Seq under the tag LogicalPlan.DROPPED_NAMED_EXPRESSIONS.
    The need for this arises in situations like the following:
    say we start with a DataFrame df1 whose plan is Project2(a, b, c) -> X.
    Then we create a new DataFrame df2 = df1.select(b, c); here attribute a is dropped.
    Because of the EarlyCollapseProject rule, df2 will have the logical plan Project(b, c) -> X.
    Now Spark allows a DataFrame df3 = df2.filter(a > 7),
    which results in the logical plan Filter(a > 7) -> Project(b, c) -> X.
    But because "a" has been dropped, its resolution is no longer possible.
    To retain the existing behaviour and resolution, Project(b, c) carries the dropped NamedExpression "a", which can be revived as a last resort for resolution.
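The dropped-attribute bookkeeping above can be sketched as follows. The dict-based plan representation, helper names, and the string tag are purely illustrative, not Spark's API:

```python
# A hypothetical sketch: a narrowing select collapses in place and records the
# dropped columns under a tag, so a later filter on a dropped column can still
# resolve it as a last resort.

def select(plan, names):
    """Narrow the projection in place of stacking a new Project node."""
    kept = [c for c in plan["output"] if c in names]
    dropped = [c for c in plan["output"] if c not in names]
    return {"op": "Project",
            "output": kept,
            "DROPPED_NAMED_EXPRESSIONS": dropped + plan.get("DROPPED_NAMED_EXPRESSIONS", []),
            "child": plan.get("child")}

def resolve(plan, name):
    """Normal resolution first; fall back to the dropped-attribute tag."""
    if name in plan["output"]:
        return name, False           # resolved normally
    if name in plan.get("DROPPED_NAMED_EXPRESSIONS", []):
        return name, True            # revived from the tag as a last resort
    raise ValueError(f"cannot resolve column {name}")

df1 = {"op": "Project", "output": ["a", "b", "c"], "child": "X"}
df2 = select(df1, ["b", "c"])        # Project(b, c) -> X; 'a' is dropped
col, revived = resolve(df2, "a")     # df2.filter(a > 7) can still resolve 'a'
```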

ColumnResolutionHelper code change:
The revival of dropped attributes for plan resolution is done via the code change in ColumnResolutionHelper.resolveExprsAndAddMissingAttrs, where the dropped attributes stored in the tag are brought back for resolution.

Code changes in CacheManager
The major code change is in CacheManager.
Previously, any change in projection (addition, drop, rename, reordering) resulted in a new Project node, so looking up the cache for a matching logical-plan fragment was straightforward: the cached plan would always match a query subtree, as the subtree was used in an immutable form to build the complete query tree. But since this PR collapses a new Project into the existing Project, the subtree is no longer the same as what was cached. Apart from that, the presence of filters between two projects further complicates the situation.

Case 1: using InMemoryRelation in a plan resulting from collapse of 2 consecutive Projects.

We start with a DataFrame df1 with plan Project2 -> X,
and then we cache df1, so that the CachedRepresentation holds the IMR and the logical plan Project2 -> X.

Now we create a new DataFrame df2 = df1.select(some projection), which due to early collapse looks like
Project -> X.
Project may no longer be the same as Project2, so a direct check with CacheManager will not find a matching IMR.
But X is clearly the same in both.
So the criterion is: the IMR can be used iff the following conditions are met:

  1. X is the same for both (i.e. the incoming Project's child and the cached plan's Project2's child are the same).
  2. All NamedExpressions of the incoming Project are expressible in terms of the output of Project2 (which is what the IMR's output is).

To perform check 2, we use the following logic. Given that X is the same in both plans, their outputs are equivalent, so we remap the cached plan's Project2 in terms of the output attributes (expression IDs) of the incoming plan's X. This lets us classify the NamedExpressions of the incoming Project into four types:

  1. those which are directly the same as NamedExpressions of Project2;
  2. those which are functions of the output of Project2;
  3. those which are literal constants, independent of the output of Project2;
  4. those which are functions of attributes that are unavailable in the output of Project2.

As long as the 4th type is empty, the InMemoryRelation of the cached plan is usable.
This logic is coded in CacheManager, and involves rewriting the NamedExpressions of the incoming Project in terms of the Seq[Attribute] which will be forced on the IMR.
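The four-way classification can be sketched like this, with expressions modeled as (output name, referenced attributes) pairs; all names here are illustrative, not CacheManager's actual code:

```python
# A rough sketch of the Case-1 check: decide whether an incoming Project over
# the same child X can reuse the IMR cached for Project2, by classifying each
# incoming named expression against Project2's output.

def classify(incoming_exprs, cached_output):
    """Split incoming (name, referenced-attrs) pairs into the four types above."""
    same, derived, literal, unresolvable = [], [], [], []
    for name, refs in incoming_exprs:
        if not refs:
            literal.append(name)                    # constant, independent of Project2
        elif set(refs) <= set(cached_output):
            # covered by the IMR's output: either a direct pass-through or a function of it
            (same if refs == [name] else derived).append(name)
        else:
            unresolvable.append(name)               # needs attributes the IMR cannot supply
    return same, derived, literal, unresolvable

cached_output = ["a", "b"]                          # output of cached Project2 / the IMR
incoming = [("a", ["a"]),                           # same as a Project2 output
            ("d", ["a", "b"]),                      # function of Project2's output
            ("k", [])]                              # literal constant
same, derived, literal, bad = classify(incoming, cached_output)
usable = not bad                                    # IMR usable iff the 4th type is empty
```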

Case 2: using InMemoryRelation in a plan resulting from collapse of Projects interspersed with Filters.

We start with a DataFrame df1 with plan Filter3 -> Filter4 -> Project2 -> X,
and then we cache df1, so that the CachedRepresentation holds the IMR and the logical plan
Filter3 -> Filter4 -> Project2 -> X.

Now we create a new DataFrame df2 = df1.filter(f2).filter(f1).select(some projection), which due to early collapse looks like
Project -> Filter1 -> Filter2 -> Filter3 -> Filter4 -> X
(this is because, in the presence of filters, Project2 is pulled up for collapse).

Clearly the cached plan chain
Filter3 -> Filter4 -> Project2 -> X
is no longer directly similar to
Project -> Filter1 -> Filter2 -> Filter3 -> Filter4 -> X.
But it is still possible to use the IMR, because the cached plan's logical plan can effectively serve as a subtree of the incoming plan.

The logic for this check is partly the same as for two consecutive Projects, with some extra handling for filters.
The algorithm is as follows:

  1. Identify the "child" X, for the similarity check, from both the incoming plan and the cached plan's logical plan.
    For the incoming plan, we walk down to X, storing the consecutive filter chain along the way.
    For the cached plan, we identify the first Project encountered, which is Project2, and its child, which is X.
    Now that we have X from both plans, along with the incoming "Project" and the cached plan's "Project2",
    we can apply the rule of Case 1 for two consecutive Projects, and correctly rewrite the NamedExpressions of the incoming Project in terms of the Seq[Attribute] which will be enforced upon the IMR.

  2. We also need to ensure that the filter chain present in the cached plan, i.e. Filter3 -> Filter4, is a subset of the filter chain in the incoming plan, i.e. Filter1 -> Filter2 -> Filter3 -> Filter4.
    Note that in the incoming plan,
    Project -> Filter1 -> Filter2 -> Filter3 -> Filter4 -> X,
    the filters are expressed in terms of the output of X,
    whereas in the cached plan,
    Filter3 -> Filter4 -> Project2 -> X,
    the filters are expressed in terms of the output of Project2.
    So, for the comparison, we express the cached plan's filter chain in terms of X by pulling Project2 above the filters, so that it becomes
    Project2 -> Filter3' -> Filter4' -> X.
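This pull-up can be sketched as substituting the Project's alias definitions into each filter condition. Plain strings stand in for Catalyst expressions here, so `str.replace` plays the role of expression substitution; all names are illustrative:

```python
# A hedged sketch of rewriting the cached plan's filters in terms of X's
# output, so they become comparable with the incoming plan's filters.

def pull_project_above_filters(filters, alias_defs):
    """Rewrite each filter condition through the Project's alias definitions."""
    rewritten = []
    for cond in filters:
        for alias, definition in alias_defs.items():
            cond = cond.replace(alias, definition)   # textual stand-in for expr substitution
        rewritten.append(cond)
    return rewritten

# Cached plan: Filter3 -> Filter4 -> Project2(a+1 AS b) -> X.
alias_defs = {"b": "(a + 1)"}
filters = ["b > 10", "b < 100"]                      # expressed over Project2's output
rewritten = pull_project_above_filters(filters, alias_defs)   # now expressed over X
```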

  3. Now, comparing against
    Project -> Filter1 -> Filter2 -> Filter3 -> Filter4 -> X,
    we find that Filter3' -> Filter4' is a subset of Filter1 -> Filter2 -> Filter3 -> Filter4,
    and since Project and Project2 are already compatible (by step 1),
    we can use the cached IMR with a modified Project and a partial filter chain,
    i.e. we should be able to get a plan like

Project -> Filter1 -> Filter2 -> IMR.
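One simple way to realize the subset check is to require the cached chain (rewritten over X) to line up as the trailing segment of the incoming chain, which matches the example above; this is an assumption of the sketch, not necessarily the exact check in the PR. Plain strings stand in for filter conditions:

```python
# An illustrative sketch: the cached filter chain must match the tail of the
# incoming filter chain; filters not covered by the cache stay above the IMR.

def reuse_with_partial_filters(incoming_filters, cached_filters):
    """Return the filters to keep above the IMR, or None if the cache is unusable."""
    n = len(cached_filters)
    if n == 0:
        return list(incoming_filters)     # no cached filters: everything stays on top
    if n > len(incoming_filters) or incoming_filters[-n:] != cached_filters:
        return None                       # cached chain is not a tail segment: no reuse
    return incoming_filters[:-n]          # remaining filters applied on top of the IMR

incoming = ["f1", "f2", "f3", "f4"]       # Project -> f1 -> f2 -> f3 -> f4 -> X
cached = ["f3", "f4"]                     # cached plan's f3 -> f4, rewritten over X
remaining = reuse_with_partial_filters(incoming, cached)
# yields the plan shape: Project -> f1 -> f2 -> IMR
```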

Why are the changes needed?

Due to the addition/modification of rules since Spark 3.0, clients are seeing an extremely large increase in query compilation time when the client code adds one column at a time in a loop. Even though the API docs do not recommend this practice, it happens, and clients are reluctant to change their code. So this PR attempts to handle the situation where columns are added not in a single shot but one at a time. This helps the Analyzer/Resolver rules complete faster.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new tests and relying on existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

@ahshahid ahshahid marked this pull request as draft November 16, 2023 23:54
@github-actions github-actions bot added the SQL label Nov 16, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-45959][SQL]: Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth Nov 20, 2023
@ahshahid ahshahid marked this pull request as draft January 5, 2024 20:16
…yRelation not being used due to filter chain case not handled. added new tests
@ahshahid ahshahid marked this pull request as ready for review January 6, 2024 03:20
@ahshahid (Author)

ahshahid commented Jan 8, 2024

@attilapiros @peter-toth please review; I have added details to the PR.

@cloud-fan (Contributor)

What's the target use case of this improvement? A super long SQL statement, or a super long DataFrame transformation chain?

@ahshahid (Author)

ahshahid commented Apr 8, 2024

@cloud-fan
A super long DataFrame transformation chain.
An additional beneficial side effect is better lookup of cached plans for InMemoryRelation (please see
SPARK-47609).

@cloud-fan (Contributor)

This is a well-known issue. The suggested fix is to ask users not to chain transformations too much, and to use "batch"-style APIs such as Dataset#withColumns.

How does this PR fix the issue without the problem mentioned in 23d9822 ?

@ahshahid (Author)

ahshahid commented Apr 8, 2024 via email

@ahshahid (Author)

ahshahid commented Apr 8, 2024 via email

@cloud-fan (Contributor)

Oh, the idea to make cache lookup smarter looks promising. Shall we create an individual PR for it? It's useful for SQL queries as well, as we can hit the cache if one SELECT query only has a few more columns than another cached SELECT query.

@ahshahid (Author)

ahshahid commented Apr 8, 2024 via email

@ahshahid (Author)

ahshahid commented Apr 8, 2024 via email
