
[SPARK-22152][SPARK-18855][SQL] Added flatten functions for RDD and Dataset #19454

Closed
sohum2002 wants to merge 3 commits into apache:master from sohum2002:SPARK-18855_SPARK-18855

Conversation

@sohum2002

What changes were proposed in this pull request?

This PR adds a flatten function in two places: the RDD and Dataset classes. It resolves the following issues: SPARK-22152 and SPARK-18855.
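For readers unfamiliar with the proposal, a minimal sketch of the intended semantics, illustrated with plain Scala collections rather than Spark (the object name here is illustrative, not the PR's actual code): flatten on an RDD or Dataset of collections would behave like the Scala collection method of the same name, i.e. like flatMap(identity).

```scala
object FlattenSemantics extends App {
  // Plain-collections model of what rdd.flatten / ds.flatten would return.
  // No Spark dependency; Seq stands in for the distributed collection.
  val nested = Seq(Seq(1, 2), Seq(3, 4))

  // The existing workaround on RDD/Dataset today:
  val viaFlatMap = nested.flatMap(identity)

  // What the proposed API would make a one-word call:
  val viaFlatten = nested.flatten

  assert(viaFlatMap == Seq(1, 2, 3, 4))
  assert(viaFlatten == viaFlatMap)
  println(viaFlatten)
}
```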

Author: Sohum Sachdev sohum2002@hotmail.com

@sohum2002 sohum2002 changed the title Added flatten functions for RDD and Dataset [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions for RDD and Dataset Oct 8, 2017
mapPartitions(_.flatMap(func))

/**
* Returns a new Dataset by by flattening a traversable collection into a collection itself.
Member

Could you please add @since 2.3.0?

@kiszk
Member

kiszk commented Oct 8, 2017

Could you please add test cases?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Oct 8, 2017

Test build #82541 has finished for PR 19454 at commit 075e7ef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 8, 2017

This is missing from Python and Java. It also doesn't bother to implement this more efficiently than flatMap(identity). I'm not sure this is worthwhile?
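For context, the quoted diff (mapPartitions(_.flatMap(func))) shows the PR delegating to flatMap, so it is indeed no faster than the workaround. A standalone sketch of that shape, modeled on plain Scala iterators (a hypothetical helper, not Spark code):

```scala
object FlattenViaFlatMap extends App {
  // Hypothetical model of the PR's approach: flatten is just flatMap with
  // an implicit element-to-collection conversion, applied per partition.
  // This is exactly the flatMap(identity) pattern, not an optimization.
  def flatten[T, U](partition: Iterator[T])(implicit asIterable: T => IterableOnce[U]): Iterator[U] =
    partition.flatMap(asIterable(_))

  // One "partition" of nested collections, as flatMap would see it.
  val partition = Iterator(Seq(1, 2), Seq(3, 4))
  assert(flatten(partition).toList == List(1, 2, 3, 4))
  println("flattened: List(1, 2, 3, 4)")
}
```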

Added unit test in both RDDSuite.scala and DatasetSuite.scala
@SparkQA

SparkQA commented Oct 8, 2017

Test build #82542 has finished for PR 19454 at commit 261e45a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fixed style error
@sohum2002
Author

Would appreciate some help with the Python implementation of the flatten function, as I have never used PySpark. Could someone help me out?

@HyukjinKwon
Member

Let's fix up the PR title from [SPARK-18855 ][SQL] to [SPARK-18855][SQL] BTW.

@SparkQA

SparkQA commented Oct 9, 2017

Test build #82550 has finished for PR 19454 at commit cc08623.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 9, 2017

I think @srowen also requested implementing this in a more performant way, for example by referring to #16276, if I understood correctly, and otherwise closing it.

I don't feel strongly about adding this, but I was thinking we might have to go ahead, given this API has been requested multiple times without explicit objection IIUC, and it looks consistent with Scala's flatten. However, IMHO, it is only worthwhile if this PR gives a clean shot.

I'd suggest closing this if we (you and the other reviewers here) have to spend a lot of time on it. The workaround is quite easy anyway.

@HyukjinKwon
Member

BTW, to answer #19454 (comment): I think you should take a look at flatMap as a reference in rdd.py and its related tests, for example, cd ./python/pyspark && grep -r "flatMap" tests.py, and the Python doctests.

@sohum2002 sohum2002 changed the title [SPARK-22152][SPARK-18855 ][SQL] Added flatten functions for RDD and Dataset [SPARK-22152][SPARK-18855][SQL] Added flatten functions for RDD and Dataset Oct 9, 2017
@sohum2002
Author

@HyukjinKwon - Thank you for your comments and analysis of this PR. I will also try to improve on flatMap(identity) as mentioned by @srowen, and will add a Python implementation.

@rxin
Contributor

rxin commented Oct 9, 2017

Is this worth doing?

@rxin
Contributor

rxin commented Oct 9, 2017

I actually think this can be confusing on Dataset[T] when the Dataset is untyped, i.e. a DataFrame. Do we throw a runtime exception there?

mapPartitions(_.flatMap(func))

/**
* Returns a new Dataset by by flattening a traversable collection into a collection itself.
Member

@group typedrel?

Member

(and "by by" -> "by", I guess)

}

/**
* Return a new RDD by flattening a traversable collection into a collection itself.
Member

Please follow existing comment style like line 392.

assert(nums.map(_.toString).collect().toList === List("1", "2", "3", "4"))
assert(nums.filter(_ > 2).collect().toList === List(3, 4))
assert(nums.flatMap(x => 1 to x).collect().toList === List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
assert(sc.makeRDD(Array(Array(1,2,3,4), Array(1,2,3,4))).flatten == List(1,2,3,4,1,2,3,4))
Member

.flatten.collect().toList.

@viirya
Member

viirya commented Oct 10, 2017

Same opinion as the other reviewers: we can easily go with the workaround. Whether this is worth doing is the question.

Btw, I'm not sure #16276 is a more performant way; its flatten implementation seems to consume all elements of the source iterator first in order to construct the destination iterator. It may not be more performant than simply calling iter.flatMap, IMO.
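The laziness difference is easy to demonstrate with plain Scala iterators (a sketch; eagerFlatten below is a stand-in for the materialize-first approach described, not the actual #16276 code):

```scala
object LazyVsEagerFlatten extends App {
  var produced = 0
  // A source iterator that counts how many chunks have been pulled.
  def source(): Iterator[Seq[Int]] =
    Iterator.tabulate(1000) { i => produced += 1; Seq(i, i) }

  // Lazy: flatMap pulls chunks only as downstream requests elements.
  source().flatMap(identity).take(4).toList
  val afterLazy = produced

  // Eager stand-in: materialize everything before returning an iterator.
  produced = 0
  def eagerFlatten[A](it: Iterator[Seq[A]]): Iterator[A] =
    it.toList.flatten.iterator
  eagerFlatten(source()).take(4).toList
  val afterEager = produced

  assert(afterLazy <= 4)       // only a few chunks were produced
  assert(afterEager == 1000)   // the whole source was consumed
  println(s"lazy pulled $afterLazy chunks; eager pulled $afterEager")
}
```

With an eager flatten, downstream operators that stop early (e.g. take or limit) still pay for the whole partition, which is the performance concern raised here.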

@rxin
Contributor

rxin commented Oct 10, 2017

Honestly I don't think it is worth doing this.

@sohum2002
Author

Thank you all for your comments. I hope to improve in my future PRs. Cheers!

@sohum2002 sohum2002 closed this Oct 10, 2017


7 participants