[SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` #45141

LuciferYang · 2024-02-16T15:09:40Z

What changes were proposed in this pull request?

This PR adds a new function, normalizeDataframeId, that sets the dataframeId of CollectMetrics to the constant 0 before comparing LogicalPlan instances in the test cases of SparkConnectProtoSuite.
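
As a rough illustration of the idea, here is a minimal sketch of normalizing an id field before comparing plan trees. The `Plan`, `LocalRelation`, `Project`, and `CollectMetrics` types below are toy stand-ins written for this example, not Spark's Catalyst classes, and `samePlanIgnoringDataframeId` is a hypothetical helper name; the actual patch operates on `LogicalPlan` inside the test suite.

```scala
// Toy stand-ins for plan nodes; these are NOT Spark's real Catalyst classes.
object NormalizeSketch {
  sealed trait Plan
  final case class LocalRelation(output: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class CollectMetrics(
      name: String,
      metrics: Seq[String],
      child: Plan,
      dataframeId: Long) extends Plan

  // Rewrite every CollectMetrics node so that its dataframeId is the constant 0.
  def normalizeDataframeId(plan: Plan): Plan = plan match {
    case cm: CollectMetrics =>
      cm.copy(child = normalizeDataframeId(cm.child), dataframeId = 0L)
    case p: Project =>
      p.copy(child = normalizeDataframeId(p.child))
    case leaf: LocalRelation =>
      leaf
  }

  // Hypothetical helper: structural comparison that ignores dataframeId.
  def samePlanIgnoringDataframeId(a: Plan, b: Plan): Boolean =
    normalizeDataframeId(a) == normalizeDataframeId(b)
}
```

With this sketch, two `CollectMetrics` nodes that differ only in `dataframeId` (for example 0 and 53, as in the failure output below) compare as equal once both sides are normalized.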

Why are the changes needed?

The test scenarios in SparkConnectProtoSuite do not need to compare the dataframeId in CollectMetrics.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually check

run

build/mvn clean install -pl connector/connect/server -am -DskipTests
build/mvn test -pl connector/connect/server

Before

- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53
   +- LocalRelation <empty>, [id#0, name#0]                                                                 +- LocalRelation <empty>, [id#0, name#0] (PlanTest.scala:179)

After

Run completed in 41 seconds, 631 milliseconds.
Total number of tests run: 882
Suites: completed 24, aborted 0
Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Was this patch authored or co-authored using generative AI tooling?

No

…`CollectMetrics` in `SparkConnectProtoSuite`

Closes apache#43155 from LuciferYang/SPARK-45357.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

LuciferYang · 2024-02-16T15:45:41Z

cc @HeartSaVioR @srowen @amaliujia

Actually, I couldn't reproduce this issue locally: when I use Maven to test the connect module on branch-3.5, the test suites run in the order SparkConnectStreamingQueryCacheSuite, ExecuteEventsManagerSuite, SparkConnectProtoSuite..., and neither SparkConnectStreamingQueryCacheSuite nor ExecuteEventsManagerSuite creates any DataFrame instances. Therefore, sparkTestRelation in SparkConnectProtoSuite is still the first DataFrame to be initialized and its id is 0, which bypasses this issue.

However, we can use some tricky methods to reproduce the failure. For example, change

spark/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala

Lines 90 to 94 in 1c1c5fa

    
test("Basic select") {
  val connectPlan = connectTestRelation.select("id".protoAttr)
  val sparkPlan = sparkTestRelation.select("id")
  comparePlans(connectPlan, sparkPlan)
}

to

test("Basic select") {
  val connectPlan = connectTestRelation.select("id".protoAttr)
  val sparkPlan = sparkTestRelation2.select("id")
  comparePlans(connectPlan, sparkPlan)
}

With this change, sparkTestRelation is guaranteed not to be the first DataFrame to be initialized, so the "Test observe" failure can be reproduced, while "Basic select" itself still passes.
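
To make the ordering effect concrete, here is a minimal sketch of why the id depends on initialization order. Everything in it is an assumption for illustration: `IdGenerator`, `ToyDataFrame`, and `ToySuite` are hypothetical names, the counter is a simplified model of how ids are handed out, and the relations are modeled as lazily initialized, which is what the reproduction above implies.

```scala
import java.util.concurrent.atomic.AtomicLong

object DataFrameIdSketch {
  // Hypothetical id generator: the first instance created gets id 0, the next 1, and so on.
  final class IdGenerator {
    private val next = new AtomicLong(0L)
    def nextId(): Long = next.getAndIncrement()
  }

  // Toy DataFrame that takes its id at construction time.
  final class ToyDataFrame(val name: String, gen: IdGenerator) {
    val id: Long = gen.nextId()
  }

  // Toy suite with lazily initialized relations: ids depend on which one is touched first.
  final class ToySuite(gen: IdGenerator) {
    lazy val sparkTestRelation = new ToyDataFrame("sparkTestRelation", gen)
    lazy val sparkTestRelation2 = new ToyDataFrame("sparkTestRelation2", gen)
  }
}
```

In this model, a suite that touches `sparkTestRelation` first sees id 0 for it, whereas a suite that touches `sparkTestRelation2` first (as the modified "Basic select" does) leaves `sparkTestRelation` with a non-zero id, so a plan comparison that expects 0 fails unless the id is normalized.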

dongjoon-hyun

+1, LGTM.

Do we need this to branch-3.4, @LuciferYang ?

…ring `CollectMetrics` in `SparkConnectProtoSuite`

Closes #45141 from LuciferYang/SPARK-45357-35.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2024-02-16T17:20:45Z

Merged to branch-3.5.

HeartSaVioR · 2024-02-16T23:44:15Z

Late +1.

LuciferYang · 2024-02-19T11:06:58Z

> +1, LGTM.
>
> Do we need this to branch-3.4, @LuciferYang ?

branch-3.4 does not need this patch

LuciferYang · 2024-02-19T11:07:26Z

Thanks @srowen @dongjoon-hyun @HeartSaVioR

github-actions bot added SQL CONNECT labels Feb 16, 2024

srowen approved these changes Feb 16, 2024

dongjoon-hyun approved these changes Feb 16, 2024

dongjoon-hyun closed this Feb 16, 2024
