[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API #47341
Conversation
BTW, SparkR does not have the RDD API, so it is guaranteed that a Spark session is already running.
```diff
@@ -129,7 +129,9 @@ private[r] object AFTSurvivalRegressionWrapper extends MLReadable[AFTSurvivalReg
     val rMetadata = ("class" -> instance.getClass.getName) ~
       ("features" -> instance.features.toImmutableArraySeq)
     val rMetadataJson: String = compact(render(rMetadata))
-    sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)
+    // Note that we should write single file. If there are more than one row …
```
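For context, here is a minimal, hypothetical sketch of the DataFrame-based pattern this hunk moves to (the added write call is truncated above; `rMetadataJson` and `rMetadataPath` mirror the names in the diff, and the values are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: write the one-row metadata JSON through the DataFrame
// writer instead of sc.parallelize(...).saveAsTextFile(...).
// A single-row local relation plans as a LocalTableScan with one
// partition, so exactly one text file is written.
val spark = SparkSession.builder().getOrCreate()
val rMetadataJson: String = """{"class":"...","features":[]}""" // placeholder
val rMetadataPath: String = "/tmp/rMetadata"                    // placeholder

spark.createDataFrame(Seq(Tuple1(rMetadataJson)))
  .write.text(rMetadataPath)
```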
BTW, does it make sense to make `spark.createDataFrame` support `numPartitions: Int` like `spark.range`?
We had a discussion about this somewhere and ended up not adding it (because we generally want to hide the concept of partitions in DataFrames). But thinking about it again, I think it's probably good to have. SparkR has it, FWIW.
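For illustration, a sketch of the asymmetry under discussion, assuming a running `spark` session (`spark.createDataFrame` has no `numPartitions` parameter today, so the workaround goes through `repartition`):

```scala
// spark.range takes a partition count directly -- no shuffle involved:
val r = spark.range(0L, 100L, 1L, numPartitions = 4)

// spark.createDataFrame does not, so controlling the partition count
// requires an explicit repartition, which inserts a shuffle exchange:
val df = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b"))).repartition(4)
```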
Thank you, @HyukjinKwon and @zhengruifeng. In the PR description, could you add specific JIRA issue links for the following?
Addressed all 👍
Separated into PR #47347.
LGTM
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also #47341.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master.
…h Dataframe read / write API

### What changes were proposed in this pull request?

PysparkML: Replace RDD read / write API invocation with Dataframe read / write API

### Why are the changes needed?

Follow-up of #47341.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47411 from WeichenXu123/SPARK-48909-follow-up.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces an RDD with a Dataset to write SparkR metadata; this PR additionally removes `repartition(1)`. We don't actually need it when the input is a single row, since that creates only a single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57
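As a quick sanity check of that claim (hypothetical snippet; any single-row local DataFrame will do):

```scala
val df = spark.createDataFrame(Seq(Tuple1("metadata")))
// LocalTableScanExec caps its parallelism at the number of rows,
// so a one-row local relation yields exactly one partition:
assert(df.rdd.getNumPartitions == 1)
```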
### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, we now leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```
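The two plans above can be reproduced with a throwaway one-row DataFrame, e.g. (assuming AQE is enabled, as it is by default):

```scala
spark.createDataFrame(Seq(Tuple1("x"))).repartition(1).explain() // Exchange SinglePartition
spark.createDataFrame(Seq(Tuple1("x"))).explain()                // bare LocalTableScan
```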
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.