[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API #47341
Conversation
BTW, SparkR does not have the RDD API, so it is guaranteed that a Spark session is already running.
```diff
@@ -129,7 +129,9 @@ private[r] object AFTSurvivalRegressionWrapper extends MLReadable[AFTSurvivalReg
     val rMetadata = ("class" -> instance.getClass.getName) ~
       ("features" -> instance.features.toImmutableArraySeq)
     val rMetadataJson: String = compact(render(rMetadata))
-    sc.parallelize(Seq(rMetadataJson), 1).saveAsTextFile(rMetadataPath)
+    // Note that we should write single file. If there are more than one row …
```
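For context, here is a minimal, hypothetical sketch of the DataFrame-based pattern this hunk moves to (the added write call is truncated above; `rMetadataJson` and `rMetadataPath` mirror the names in the diff, and the values are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: write the one-row metadata JSON through the DataFrame
// writer instead of sc.parallelize(...).saveAsTextFile(...).
// A single-row local relation plans as a LocalTableScan with one
// partition, so exactly one text file is written.
val spark = SparkSession.builder().getOrCreate()
val rMetadataJson: String = """{"class":"...","features":[]}""" // placeholder
val rMetadataPath: String = "/tmp/rMetadata"                    // placeholder

spark.createDataFrame(Seq(Tuple1(rMetadataJson)))
  .write.text(rMetadataPath)
```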
BTW, does it make sense to make `spark.createDataFrame` support `numPartitions: Int` like `spark.range`?
We had a discussion about this somewhere and ended up not adding it (because we generally want to hide the concept of partitions in DataFrames). But thinking about it again, I think it's probably good to have. SparkR has it, FWIW.
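For illustration, a sketch of the asymmetry under discussion, assuming a running `spark` session (`spark.createDataFrame` has no `numPartitions` parameter today, so the workaround goes through `repartition`):

```scala
// spark.range takes a partition count directly -- no shuffle involved:
val r = spark.range(0L, 100L, 1L, numPartitions = 4)

// spark.createDataFrame does not, so controlling the partition count
// requires an explicit repartition, which inserts a shuffle exchange:
val df = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b"))).repartition(4)
```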
Thank you, @HyukjinKwon and @zhengruifeng. In the PR description, could you add specific JIRA issue links for the following?
Addressed all 👍
Separated into PR #47347.
LGTM
### What changes were proposed in this pull request?

This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.

### Why are the changes needed?

In order to remove unnecessary shuffle, see also #47341.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47347 from HyukjinKwon/SPARK-48896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master.
…h Dataframe read / write API

### What changes were proposed in this pull request?

PysparkML: Replace RDD read / write API invocation with Dataframe read / write API

### Why are the changes needed?

Follow-up of #47341.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47411 from WeichenXu123/SPARK-48909-follow-up.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces an RDD with a Dataset to write SparkR metadata; this PR additionally removes `repartition(1)`. We don't actually need it when the input is a single row, since that creates only a single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57
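As a quick sanity check of that claim (hypothetical snippet; any single-row local DataFrame will do):

```scala
val df = spark.createDataFrame(Seq(Tuple1("metadata")))
// LocalTableScanExec caps its parallelism at the number of rows,
// so a one-row local relation yields exactly one partition:
assert(df.rdd.getNumPartitions == 1)
```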
### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, we now leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```
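The two plans above can be reproduced with a throwaway one-row DataFrame, e.g. (assuming AQE is enabled, as it is by default):

```scala
spark.createDataFrame(Seq(Tuple1("x"))).repartition(1).explain() // Exchange SinglePartition
spark.createDataFrame(Seq(Tuple1("x"))).explain()                // bare LocalTableScan
```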
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.