[SPARK-2373] RDD add span function (split an RDD into two RDDs based on a user's function) #1306

YanjieGao · 2014-07-05T07:51:25Z

Hi all,
span is a basic function on Scala collections. This PR adds it to RDD:
def span(p: T => Boolean): (RDD[T], RDD[T])
It splits this RDD into a prefix/suffix pair according to a predicate.
It returns a pair consisting of the longest prefix of this RDD whose elements all satisfy p, and the rest of this RDD.
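For readers unfamiliar with span, here is a minimal sketch of the semantics described above, using a plain Scala List rather than an RDD. The object name SpanDemo and the sample values are illustrative only and are not part of this PR.

```scala
object SpanDemo {
  // span takes elements from the front only while the predicate holds,
  // then returns everything else (in order) as the second half.
  val (prefix, rest) = List(1, 2, 5, 1, 6).span(_ < 3)

  def main(args: Array[String]): Unit = {
    println(prefix) // List(1, 2)
    println(rest)   // List(5, 1, 6): the later 1 stays here even though 1 < 3
  }
}
```

Note that the result depends on element order: an element that satisfies p but appears after the first failing element ends up in the second half.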

JIRA: https://issues.apache.org/jira/browse/SPARK-2373

Hi all, I want to submit a basic operator, Intersect. For example, in SQL: select * from table1 intersect select * from table2. I want to use this operator to support that query in Spark SQL. The operator returns the intersection of the child table RDDs of the SparkPlan.

YanjieGao · 2014-07-06T01:11:04Z

Thanks. I optimized the code so that it only evaluates the function once. Other comments are on JIRA.

YanjieGao · 2014-07-07T07:44:12Z

This function is useful in some cases. For example, when I implement a skew join in another PR, I need to split an RDD into two RDDs: one that contains the skewed keys and one that does not.
val (maxKeySkewedTable, mainSkewedTable) = skewedTable.span(row => {
  skewSideKeyGenerator(row).toString().equals(maxrowKey.toString())
})

andrewor14 · 2014-08-25T21:52:42Z

test this please

SparkQA · 2014-08-25T21:55:53Z

QA tests have started for PR 1306 at commit fab7ed9.

This patch merges cleanly.

SparkQA · 2014-08-25T22:49:36Z

QA tests have finished for PR 1306 at commit fab7ed9.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2014-08-28T23:04:14Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -1270,4 +1270,20 @@ abstract class RDD[T: ClassTag](
  def toJavaRDD() : JavaRDD[T] = {
    new JavaRDD(this)(elementClassTag)
  }
+
+  def span(p: T => Boolean) : (RDD[T], RDD[T]) = {


This needs scaladoc, since this method's meaning won't be clear to users unless they're familiar with the Scala span function.

JoshRosen · 2014-08-28T23:11:47Z

@mateiz Thoughts on this? Do you think this API will be broadly useful enough to include?

mateiz · 2014-08-29T05:16:09Z

IMO this is too specialized to include. It's small enough that applications can do it themselves, but also fairly confusing unless your RDD is already sorted in some way. I think we should just leave it for applications to do it. If you are doing a skewed join operator for example, you can do it within the implementation of that but not show it to the user.
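As a rough illustration of "applications can do it themselves", the sketch below splits an RDD with two filter calls. It assumes Spark is on the classpath; splitBy, the object name, and the sample data are hypothetical and are not part of Spark or of this PR. It is also not a true span: it tests every element (partition-like semantics) instead of taking a prefix, which avoids the ordering confusion mentioned above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SplitByPredicate {
  // Hypothetical helper, not a Spark API: returns (elements matching p, the others).
  // Every element is tested, so the result does not depend on element order.
  def splitBy[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
    rdd.cache() // keep the parent around so the two filters do not recompute it
    (rdd.filter(p), rdd.filter(x => !p(x)))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("split-by-predicate")
    val sc = new SparkContext(conf)
    try {
      // Illustrative keyed rows; "a" plays the role of the skewed key.
      val rows = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4))
      val (skewed, others) = splitBy(rows)(_._1 == "a")
      println(skewed.collect().toList)
      println(others.collect().toList)
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off in this sketch is that the predicate runs twice per element, once for each filter; caching the parent only avoids recomputing the input itself.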

YanjieGao · 2014-08-29T11:11:09Z

OK, got it. I will close this PR.

crakjie · 2015-10-05T13:44:40Z

It's a nice start toward implementing Scala's def partition(p: (A) ⇒ Boolean): (List[A], List[A]), thanks.
But it would be better to name it fork to avoid confusion.
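To make the naming concern concrete, here is a small sketch comparing the two standard Scala collection methods on the same input. It uses a plain List; the object name and the sample keys are illustrative only.

```scala
object PartitionVsSpan {
  val keys = List("a", "b", "a", "c")

  // partition tests every element, regardless of its position.
  val (matching, nonMatching) = keys.partition(_ == "a")

  // span stops at the first element that fails the predicate.
  val (prefix, rest) = keys.span(_ == "a")

  def main(args: Array[String]): Unit = {
    println(matching)    // List(a, a)
    println(nonMatching) // List(b, c)
    println(prefix)      // List(a)
    println(rest)        // List(b, a, c)
  }
}
```

For a use case such as separating one skewed key from all other rows, the partition behavior is usually the intended one unless the data is already ordered by that key.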

k0ala · 2016-07-08T14:21:49Z

FWIW, I frequently have this use case and would love a span on RDD.

YanjieGao and others added 24 commits June 20, 2014 15:20

Update basicOperators.scala

469f099

Update SqlParser.scala

61e88e7

Update HiveQl.scala

d4ac5e5

Update basicOperators.scala

ac73e60

Update SparkStrategies.scala

790765d

Update SQLQuerySuite.scala

4dd453e

Update basicOperators.scala

e2b64be

delete annotation

f1288b4

delete the annotation

0b49837

Update basicOperators.scala

bdc4a05

update the line less than

f7961f6

resolve conflict in SparkStrategies and basicOperator

5e374c7

Merge remote branch 'upstream/master' into patch-5

a802ca8

modify format problem

0c7cca5

resolve conflict and add annotation on basicOperator and remove HiveQl

ea78f33

refomat some files

1cfbfe6

Merge remote branch 'upstream/master' into rdd_span

b1a641c

RDD add span function

8c4eafe

resolve the other branch modify file in this branch

ea27e14

reformat other files

1f27fe0

reformat SqlParser blank

8615dcd

remove blank on line 275 in SparkStrategies

f4df130

reformat the span function in RDD

81bb9c5

YanjieGao added 2 commits July 6, 2014 09:18

Merge remote branch 'upstream/master' into rdd_span

4448781

make the code only scan data once

fab7ed9

JoshRosen reviewed Aug 28, 2014

YanjieGao closed this Aug 29, 2014
