
[SPARK-22152][SPARK-18855][SQL] Added flatten functions for RDD and Dataset #19454

Closed
sohum2002 wants to merge 3 commits into apache:master from sohum2002:SPARK-18855_SPARK-18855

Conversation

@sohum2002

What changes were proposed in this pull request?

This PR adds a flatten function in two places: the RDD and Dataset classes. It resolves the following issues: SPARK-22152 and SPARK-18855.
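For readers unfamiliar with the proposal, a minimal sketch of the intended semantics, illustrated with plain Scala collections rather than Spark (the object name here is illustrative, not the PR's actual code): flatten on an RDD or Dataset of collections would behave like the Scala collection method of the same name, i.e. like flatMap(identity).

```scala
object FlattenSemantics extends App {
  // Plain-collections model of what rdd.flatten / ds.flatten would return.
  // No Spark dependency; Seq stands in for the distributed collection.
  val nested = Seq(Seq(1, 2), Seq(3, 4))

  // The existing workaround on RDD/Dataset today:
  val viaFlatMap = nested.flatMap(identity)

  // What the proposed API would make a one-word call:
  val viaFlatten = nested.flatten

  assert(viaFlatMap == Seq(1, 2, 3, 4))
  assert(viaFlatten == viaFlatMap)
  println(viaFlatten)
}
```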

Author: Sohum Sachdev sohum2002@hotmail.com

@sohum2002 sohum2002 changed the title Added flatten functions for RDD and Dataset [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions for RDD and Dataset Oct 8, 2017
mapPartitions(_.flatMap(func))

/**
* Returns a new Dataset by by flattening a traversable collection into a collection itself.
Member

Could you please add @since 2.3.0?

@kiszk
Member

kiszk commented Oct 8, 2017

Could you please add test cases?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Oct 8, 2017

Test build #82541 has finished for PR 19454 at commit 075e7ef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 8, 2017

This is missing from Python and Java. It also doesn't bother to implement this more efficiently than flatMap(identity). I'm not sure this is worthwhile?
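For context, the quoted diff (mapPartitions(_.flatMap(func))) shows the PR delegating to flatMap, so it is indeed no faster than the workaround. A standalone sketch of that shape, modeled on plain Scala iterators (a hypothetical helper, not Spark code):

```scala
object FlattenViaFlatMap extends App {
  // Hypothetical model of the PR's approach: flatten is just flatMap with
  // an implicit element-to-collection conversion, applied per partition.
  // This is exactly the flatMap(identity) pattern, not an optimization.
  def flatten[T, U](partition: Iterator[T])(implicit asIterable: T => IterableOnce[U]): Iterator[U] =
    partition.flatMap(asIterable(_))

  // One "partition" of nested collections, as flatMap would see it.
  val partition = Iterator(Seq(1, 2), Seq(3, 4))
  assert(flatten(partition).toList == List(1, 2, 3, 4))
  println("flattened: List(1, 2, 3, 4)")
}
```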

Added unit test in both RDDSuite.scala and DatasetSuite.scala
@SparkQA

SparkQA commented Oct 8, 2017

Test build #82542 has finished for PR 19454 at commit 261e45a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fixed style error
@sohum2002
Author

Would appreciate some help with the Python implementation of the flatten function, as I have never used PySpark. Could someone help me out?

@HyukjinKwon
Member

Let's fix up the PR title from [SPARK-18855 ][SQL] to [SPARK-18855][SQL] BTW.

@SparkQA

SparkQA commented Oct 9, 2017

Test build #82550 has finished for PR 19454 at commit cc08623.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 9, 2017

I think @srowen also requested implementing this in a more performant way, for example by referring to #16276, if I understood correctly, and otherwise closing it.

I don't feel strongly about adding this, but I was thinking we might have to go ahead, given this API has been requested multiple times without explicit objection IIUC, and it looks consistent with Scala's flatten. However, IMHO, it is only worthwhile if this PR gives a clean shot.

I'd suggest closing this if we (you and the other reviewers here) have to spend a lot of time on it. The workaround is quite easy anyway.

@HyukjinKwon
Member

BTW, to answer #19454 (comment): I think you should take a look at flatMap as a reference in rdd.py and its related tests, for example, cd ./python/pyspark && grep -r "flatMap" tests.py, and the Python doctests.

@sohum2002 sohum2002 changed the title [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions for RDD and Dataset [SPARK-22152][SPARK-18855][SQL] Added flatten functions for RDD and Dataset Oct 9, 2017
@sohum2002
Author

@HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try to improve on flatMap(identity) as mentioned by @srowen, and will add a Python implementation.

@rxin
Contributor

rxin commented Oct 9, 2017

Is this worth doing?

@rxin
Contributor

rxin commented Oct 9, 2017

I actually think this can be confusing on Dataset[T] when the Dataset is untyped, i.e. a DataFrame. Do we throw a runtime exception there?

mapPartitions(_.flatMap(func))

/**
* Returns a new Dataset by by flattening a traversable collection into a collection itself.
Member

@group typedrel?

Member

(and "by by" -> "by", I guess)

}

/**
* Return a new RDD by flattening a traversable collection into a collection itself.
Member

Please follow existing comment style like line 392.

assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
assert(nums.filter(_ > 2).collect().toList === List(3, 4))
assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
Member

.flatten.collect().toList.

@viirya
Member

viirya commented Oct 10, 2017

Same opinion as the other reviewers: we can easily go with the workaround. Whether this is worth doing is the question.

Btw, I'm not sure #16276 is a more performant way; its flatten implementation seems to consume all elements of the source iterator first in order to construct the destination iterator. It may not be more performant than simply calling iter.flatMap, IMO.
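The laziness difference is easy to demonstrate with plain Scala iterators (a sketch; eagerFlatten below is a stand-in for the materialize-first approach described, not the actual #16276 code):

```scala
object LazyVsEagerFlatten extends App {
  var produced = 0
  // A source iterator that counts how many chunks have been pulled.
  def source(): Iterator[Seq[Int]] =
    Iterator.tabulate(1000) { i => produced += 1; Seq(i, i) }

  // Lazy: flatMap pulls chunks only as downstream requests elements.
  source().flatMap(identity).take(4).toList
  val afterLazy = produced

  // Eager stand-in: materialize everything before returning an iterator.
  produced = 0
  def eagerFlatten[A](it: Iterator[Seq[A]]): Iterator[A] =
    it.toList.flatten.iterator
  eagerFlatten(source()).take(4).toList
  val afterEager = produced

  assert(afterLazy <= 4)       // only a few chunks were produced
  assert(afterEager == 1000)   // the whole source was consumed
  println(s"lazy pulled $afterLazy chunks; eager pulled $afterEager")
}
```

With an eager flatten, downstream operators that stop early (e.g. take or limit) still pay for the whole partition, which is the performance concern raised here.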

@rxin
Contributor

rxin commented Oct 10, 2017

Honestly I don't think it is worth doing this.

@sohum2002
Author

Thank you all for your comments. I hope to improve in my future PRs. Cheers!

@sohum2002 sohum2002 closed this Oct 10, 2017


7 participants