[SPARK-2373] RDD add span function (split an RDD into two RDDs based on a user's function) #1306

Closed
wants to merge 26 commits into from

Conversation

YanjieGao
Contributor

Hi all,
This PR adds span, a basic function from the Scala collections, to RDD.
def span(p: T => Boolean): (RDD[T], RDD[T])
Splits this RDD into a prefix/suffix pair according to a predicate.
Returns a pair consisting of the longest prefix of this RDD whose elements all satisfy p, and the rest of this RDD.
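
For readers unfamiliar with span, here is a minimal sketch of its behavior on a plain Scala List. It is not the RDD implementation from this PR. The object name SpanDemo and the sample data are made up for illustration.

```scala
object SpanDemo {
  // Hypothetical sample data, chosen only for illustration.
  val xs: List[Int] = List(1, 2, 3, 10, 1, 2)

  // span stops at the first element that fails the predicate;
  // later elements that satisfy it still end up in the suffix.
  val (prefix, suffix) = xs.span(_ < 5)

  def main(args: Array[String]): Unit = {
    println(prefix) // List(1, 2, 3)
    println(suffix) // List(10, 1, 2)
  }
}
```

Because the result depends on element order, the same call on an unordered collection would give a less predictable split.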

JIRA: https://issues.apache.org/jira/browse/SPARK-2373

@YanjieGao
Contributor Author

Thanks. I optimized the code so that it only evaluates the function once. Other comments are on JIRA.

@YanjieGao
Contributor Author

This function is useful in some cases. For example, when implementing a skew join in another PR, I need to split an RDD into two RDDs: one that contains the skewed keys and one that contains the rest.
val (maxKeySkewedTable, mainSkewedTable) = skewedTable.span(row => {
  skewSideKeyGenerator(row).toString().equals(maxrowKey.toString())
})

@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 1306 at commit fab7ed9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 1306 at commit fab7ed9.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1270,4 +1270,20 @@ abstract class RDD[T: ClassTag](
  def toJavaRDD() : JavaRDD[T] = {
    new JavaRDD(this)(elementClassTag)
  }

  def span(p: T => Boolean) : (RDD[T], RDD[T]) = {
Contributor

This needs scaladoc, since this method's meaning won't be clear to users unless they're familiar with the Scala span function.

@JoshRosen
Contributor

@mateiz Thoughts on this? Do you think this API will be broadly useful enough to include?

@mateiz
Contributor

mateiz commented Aug 29, 2014

IMO this is too specialized to include. It's small enough that applications can do it themselves, but it is also fairly confusing unless your RDD is already sorted in some way. I think we should just leave it to applications. If you are writing a skewed join operator, for example, you can do it within the implementation of that operator without exposing it to the user.
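
As a rough illustration of how an application might do this itself, here is a minimal sketch that splits an RDD with two filter passes. It is not the code from this PR. The names SplitRdd and splitByPredicate are hypothetical, and the split is order-independent (partition-style) rather than a true prefix/suffix span.

```scala
import org.apache.spark.rdd.RDD

object SplitRdd {
  // Hypothetical helper, not part of the Spark API.
  // Returns (elements satisfying p, elements not satisfying p).
  // The predicate is evaluated once per element for each side, and the
  // parent RDD is recomputed for each side unless the caller persists it.
  def splitByPredicate[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
    val matching = rdd.filter(p)
    val rest = rdd.filter(x => !p(x))
    (matching, rest)
  }
}
```

Calling rdd.persist() before the split and unpersist() once both sides have been consumed is one way to avoid recomputing an expensive parent.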

@YanjieGao
Contributor Author

OK, got it. I will close this PR.

@YanjieGao YanjieGao closed this Aug 29, 2014
@crakjie

crakjie commented Oct 5, 2015

It's a nice start toward implementing Scala's def partition(p: (A) ⇒ Boolean): (List[A], List[A]), thanks.
But it would be better to name it fork to avoid confusion.
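
To make the naming concern concrete, here is a small sketch contrasting span and partition on a plain Scala List. The object name SpanVsPartition and the sample data are made up for illustration.

```scala
object SpanVsPartition {
  // Hypothetical sample data: 3 satisfies the predicate but comes after 10.
  val xs: List[Int] = List(1, 2, 10, 3)

  // span cuts at the first failing element, so 3 stays in the suffix.
  val spanned: (List[Int], List[Int]) = xs.span(_ < 5)

  // partition tests every element independently of its position.
  val partitioned: (List[Int], List[Int]) = xs.partition(_ < 5)

  def main(args: Array[String]): Unit = {
    println(spanned)     // (List(1, 2),List(10, 3))
    println(partitioned) // (List(1, 2, 3),List(10))
  }
}
```

The skewed-key split described earlier in this thread selects rows by key regardless of position, which matches partition semantics rather than span semantics.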

@k0ala

k0ala commented Jul 8, 2016

FWIW, I frequently have this use case and would love a span on RDD.

7 participants