[SPARK-7385][Core] Add RDD.foreachPartitionWithIndex #5927

tdas · 2015-05-05T23:33:43Z

Spark Streaming apps often update external stores transactionally, which requires an id that uniquely identifies the partition of data to be inserted. This can be (batch time, partition index).
The current workaround is to use mapPartitionsWithIndex().count(), which is quite hacky. This PR adds foreachPartitionWithIndex().
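
For readers unfamiliar with the workaround, here is a minimal sketch of what it might look like. The object name, the `describe` helper, the local master setting, and the sample data are illustrative assumptions and not part of this PR; the `println` stands in for a transactional write.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WorkaroundSketch {
  // Hypothetical helper: formats what would be written for one partition.
  def describe(index: Int, records: Seq[Int]): String =
    s"partition $index -> ${records.mkString(",")}"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("workaround-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 8, numSlices = 4)
      // mapPartitionsWithIndex is a lazy transformation, so the side effect
      // only runs once an action (here count()) forces evaluation.
      rdd.mapPartitionsWithIndex[Unit] { (index, iter) =>
        println(describe(index, iter.toSeq)) // stand-in for the external write
        Iterator.empty                       // nothing is actually returned
      }.count()
    } finally {
      sc.stop()
    }
  }
}
```

The transformation exists only for its side effect and the count() result is discarded, which is why the workaround is described as hacky.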

rxin · 2015-05-05T23:39:15Z

Why not just use TaskContext?

SparkQA · 2015-05-05T23:41:31Z

Test build #31920 has finished for PR 5927 at commit 8520748.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-05-05T23:45:19Z

It is easier for normal users. The alternative is:

rdd.foreachPartition { iter =>
    val partitionId = TaskContext.get.partitionId
    ....
}

This is ugly and non-intuitive.
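
For comparison, a self-contained version of the TaskContext alternative might look like the sketch below. The object name, the `transactionId` helper, and the use of the wall clock as a stand-in for the streaming batch time are assumptions made for illustration; only `TaskContext.get().partitionId()` comes from the snippet above.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object TaskContextSketch {
  // Hypothetical helper: combines batch time and partition index into one id.
  def transactionId(batchTimeMs: Long, partitionIndex: Int): String =
    s"$batchTimeMs-$partitionIndex"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("taskcontext-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 8, numSlices = 4)
      // Assumption: in a streaming job this would be the batch time.
      val batchTimeMs = System.currentTimeMillis()
      rdd.foreachPartition { iter =>
        // TaskContext.get() returns the context of the task running this partition.
        val partitionId = TaskContext.get().partitionId()
        val id = transactionId(batchTimeMs, partitionId)
        println(s"txn $id writes ${iter.size} records") // stand-in for the write
      }
    } finally {
      sc.stop()
    }
  }
}
```

Unlike the mapPartitionsWithIndex approach, foreachPartition is itself an action, so no dummy count() is needed.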

SparkQA · 2015-05-06T00:02:06Z

Test build #31921 has finished for PR 5927 at commit b98a91f.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-05-06T00:13:05Z

@koeninger Isn't this going to make it easier to do transactional output operations?

SparkQA · 2015-05-06T00:14:54Z

Test build #31923 has finished for PR 5927 at commit 69bcc61.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

koeninger · 2015-05-06T00:25:00Z

@tdas Yeah, Kafka transactional output was why I originally wanted to add it.

That said, the TaskContext usage shown above is better than my alternative of mapPartitionsWithIndex plus an empty foreach. If I had thought of TaskContext first, I probably wouldn't have bothered.

SparkQA · 2015-05-06T01:36:30Z

Test build #31927 has finished for PR 5927 at commit 37f1c37.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-06T09:16:32Z

Test build #31964 has finished for PR 5927 at commit 33fecb2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-05-07T01:45:59Z

@rxin Any objections?
@JoshRosen We discussed this offline. Please take a look.

JoshRosen · 2015-05-07T03:17:44Z

This seems fine to me. If we were to design this all over again, I'd consider just dropping the withIndex methods in favor of either a withContext method or just requiring users to use TaskContext.get. Since we already have mapPartitionsWithIndex, though, I suppose it's fine to add this for completeness. Note that I wouldn't recommend that we add *WithIndex variants for most methods, but I think that mapPartitions and foreachPartition are somewhat low-level special cases compared to most RDD operations.

SparkQA · 2015-05-07T03:55:01Z

Test build #32065 has finished for PR 5927 at commit 83d0e00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2015-05-07T07:18:24Z

@JoshRosen I agree that we should not use this as a precedent for adding XYZWithIndex. However, it is also true that TaskContext.get is not very easy to find out about. Maybe we should cover that in the Spark website documentation (maybe it already is, and I am just not aware).

If there are no other objections, I will merge it.


pwendell · 2015-05-07T13:11:44Z

Hm, I think it might be better to document TaskContext and ask users to use that; this is why we exposed it. From what I can tell, @rxin and @JoshRosen also prefer that approach.

tdas · 2015-05-08T07:41:49Z

But as @koeninger pointed out, TaskContext.get.partitionId() is not obvious to find. And for consistency with mapPartitionsWithIndex, it may be okay to have RDD.foreachPartitionWithIndex(), which is strictly easier to find than TaskContext.get.partitionId(). So I think this is a nice-to-have feature.

rxin · 2015-05-08T07:43:43Z

Why not just add a note to the javadoc of mapPartitions suggesting how to get the partition id and task context?

tdas · 2015-05-08T08:11:00Z

Still ugly IMO.


koeninger · 2015-05-08T13:42:53Z

I think if you're going to decide you really don't like withContext/withIndex etc., they should be marked as deprecated, in addition to having a scaladoc reference to TaskContext.get.

Either that or foreachPartitionWithIndex seems ok to me.

pwendell · 2015-05-08T14:31:35Z

Marking them deprecated sounds like a good idea. The static getter method was specifically designed to replace them.

tdas · 2015-05-11T17:17:59Z

All right. Since adding foreachPartitionWithIndex is not the best idea, I will close this PR. Additionally, I will document the use of TaskContext.getPartitionId() in the programming guide.

deusaquilus · 2018-05-27T09:31:42Z

What do you do if you are in local mode and TaskContext.getPartitionId always returns zero?

Added RDD.foreachPartitionWithIndex

8520748

tdas added 2 commits May 5, 2015 16:46

Added missing function

b98a91f

Added unit test

69bcc61

tdas changed the title ~~[SPARK-7385][Core] Added RDD.foreachPartitionWithIndex~~ [SPARK-7385][Core] Add RDD.foreachPartitionWithIndex May 6, 2015

Adding MIMA excludes

37f1c37

Fixed unit test

33fecb2

Revert unnecessary space change.

83d0e00

tdas closed this May 11, 2015
