[SPARK-22152][SPARK-18855][SQL] Added flatten functions for RDD and Dataset #19454
sohum2002 wants to merge 3 commits into apache:master from
Conversation
mapPartitions(_.flatMap(func))

/**
 * Returns a new Dataset by flattening a traversable collection into a collection itself.
Could you please add @since 2.3.0?
Could you please add test cases?

ok to test

Test build #82541 has finished for PR 19454 at commit
This is missing from Python and Java. It also doesn't bother to implement this more efficiently than flatMap(identity). I am not sure this is worthwhile?
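The flatMap(identity) equivalence mentioned above is the crux of the whole discussion: flatten removes exactly one level of nesting and nothing more. A minimal pure-Python sketch of those semantics (flatten_once is an illustrative helper, not a Spark API):

```python
from itertools import chain

def flatten_once(nested):
    """Flatten exactly one level of nesting, matching the semantics of
    rdd.flatMap(identity) in Scala. Illustrative helper, not a Spark API."""
    return list(chain.from_iterable(nested))

# Only the outermost level is flattened; deeper nesting is preserved.
print(flatten_once([[1, 2], [3, 4]]))   # [1, 2, 3, 4]
print(flatten_once([[[1], 2], [3]]))    # [[1], 2, 3]
```

This is also why reviewers call the workaround "quite easy": the one-liner already gives exactly the proposed behavior.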
Added unit tests in both RDDSuite.scala and DatasetSuite.scala

Test build #82542 has finished for PR 19454 at commit

fixed style error
Would appreciate some help in the Python implementation of the

Let's fix up the PR title from

Test build #82550 has finished for PR 19454 at commit
I think @srowen requested to fix it in a more performant way as well, for example, referring to #16276, if I understood correctly, and otherwise closing it. I don't feel strongly about adding this, but I was thinking that we might have to go ahead, given this API has been requested multiple times without explicit objection IIUC, and it looks consistent with Scala's. I'd suggest closing this if we (you and other reviewers here) have to spend a lot of time. The workaround is quite easy anyway.
BTW, for the answer to #19454 (comment), I think you should take a look at, for example,
@HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try to improve the
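Regarding the missing Python side of the change, a hedged sketch of how flatten could be layered on top of flatMap, mirroring the Scala mapPartitions(_.flatMap(func)) approach in the diff. ToyRDD is a made-up stand-in so the idea can be shown without a Spark cluster; it is not pyspark's RDD class:

```python
# Hypothetical sketch: ToyRDD is a toy stand-in, not the real pyspark RDD.
class ToyRDD:
    def __init__(self, data):
        self._data = list(data)

    def flatMap(self, f):
        # Apply f to each element and concatenate the resulting iterables.
        return ToyRDD(x for item in self._data for x in f(item))

    def flatten(self):
        # flatten is simply flatMap with the identity function.
        return self.flatMap(lambda x: x)

    def collect(self):
        return list(self._data)

print(ToyRDD([[1, 2], [3, 4]]).flatten().collect())  # [1, 2, 3, 4]
```

Defining flatten in terms of flatMap keeps the two operations consistent by construction, which is the same design choice the Scala diff makes.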
Is this worth doing?
I actually think this can be confusing on Dataset[T], when the Dataset is just untyped and a DataFrame. Do we throw a runtime exception there? |
}

/**
 * Return a new RDD by flattening a traversable collection into a collection itself.
Please follow existing comment style like line 392.
assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
assert(nums.filter(_ > 2).collect().toList === List(3, 4))
assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
Same opinion as other reviewers, we can easily go for the workaround. Whether this is worth doing is a question. Btw, I'm not sure if #16276 is a more performant way, its
Honestly I don't think it is worth doing this.
Thank you all for your comments. I hope to improve in my future PRs. Cheers! |
What changes were proposed in this pull request?
This PR adds a flatten function in two places: the RDD and Dataset classes. It resolves SPARK-22152 and SPARK-18855.
Author: Sohum Sachdev sohum2002@hotmail.com