
SPARK-3604. Replace the map call in UnionRDD#getPartitions method to avoid creating an additional Seq. #2463

Closed
wants to merge 1 commit

Conversation

harishreedharan
Contributor

Replace the map call in UnionRDD#getPartitions to avoid creating an additional Seq.

@SparkQA

SparkQA commented Sep 19, 2014

QA tests have started for PR 2463 at commit c3f476c.

  • This patch merges cleanly.

@srowen
Member

srowen commented Sep 19, 2014

Is the goal here just to make the recursive calls take fewer stack frames, making it harder to overflow? I got the impression there was an infinite recursion lurking here, but I don't see that this fixes it; maybe I misunderstood the JIRA.

@harishreedharan
Contributor Author

Yes. The issue is that there could be UnionRDDs inside the rdds array, so the recursion may be unavoidable, but we can make it take fewer frames. I can't think of a real fix for this, though.

@SparkQA

SparkQA commented Sep 19, 2014

QA tests have finished for PR 2463 at commit c3f476c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericdf

ericdf commented Sep 19, 2014

Fundamentally the way union works is flawed because it forces a caller to create a recursive structure.

In my case, I have:

files = []  # some list
rdd = sc.createAnRDDInTheUsualWay(files[0])
for afile in files[1:]:
    rdd = rdd.union(sc.createAnRDDInTheUsualWay(afile))

At each point in the loop, I'm creating a UnionRDD whose collection of RDDs contains exactly one RDD (also a UnionRDD). You've coded for a tree, but really have a linked list that will blow up the stack.

It should be possible for me to get a broad, flat structure instead, ideally by doing something like this:

rddgen = (sc.createAnRDDInTheUsualWay(x) for x in files)
rdd = sc.union(rddgen)

The proposed patch does not do that, but it should.
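The shape difference described above can be sketched with a toy model. This is NOT real Spark code; ToyRDD, ToyUnionRDD, and flat_union are hypothetical stand-ins that only mimic how RDD#union wraps two RDDs (one level deeper per call) while a SparkContext#union-style flat union wraps all inputs at once, so the recursion depth of a getPartitions-style traversal differs:

```python
class ToyRDD:
    """Stand-in for a plain RDD with a fixed partition list."""
    def __init__(self, partitions):
        self._partitions = list(partitions)

    def get_partitions(self):
        return self._partitions

    def union(self, other):
        # Mimics RDD#union: wraps exactly two RDDs, nesting one level deeper.
        return ToyUnionRDD([self, other])

    def depth(self):
        return 1


class ToyUnionRDD(ToyRDD):
    """Stand-in for UnionRDD: recurses into every child RDD."""
    def __init__(self, rdds):
        self.rdds = list(rdds)

    def get_partitions(self):
        # Like UnionRDD#getPartitions: visits children recursively.
        return [p for rdd in self.rdds for p in rdd.get_partitions()]

    def depth(self):
        return 1 + max(r.depth() for r in self.rdds)


def flat_union(rdds):
    # Mimics SparkContext#union: one UnionRDD over all inputs at once.
    return ToyUnionRDD(rdds)


inputs = [ToyRDD([i]) for i in range(100)]

# Chained: each union() nests the previous result one level deeper,
# producing the linked-list shape described above.
chained = inputs[0]
for r in inputs[1:]:
    chained = chained.union(r)

flat = flat_union(inputs)

print(chained.depth())  # 100: traversal depth grows with the number of inputs
print(flat.depth())     # 2: one level of nesting regardless of input count
```

Both shapes yield the same partitions in the same order; only the nesting depth (and hence the stack usage of the recursive traversal) differs.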

@markhamstra
Contributor

@ericdf What is the type of rddgen in your pseudocode? I'm not understanding why the existing SparkContext#union[T](Seq[RDD[T]]) doesn't already do what you want.

@ericdf

ericdf commented Sep 19, 2014

Ah! I was not aware that there was an API on SparkContext for taking the union of a list -- I had only seen the one on RDD itself, which takes a single `other` RDD.

Yes, the SparkContext#union is exactly what I want. Thank you!

@pwendell
Contributor

@ericdf is your original issue fixed by using the union utility function? I read it as a bug report at first, but I think the issue is just that you were chaining unions together instead of composing them with the utility.

@pwendell
Contributor

@harishreedharan I think the fix is that for people chaining many unions together they should use SparkContext#union - if that's the case we might want to just leave it as-is.

@harishreedharan
Contributor Author

Agreed. This patch simply makes it more difficult to overflow, so it is not really a fix. Will close this.

Thanks,
Hari


@pwendell
Contributor

Gotcha - sounds good!

@pwendell
Contributor

Let's close this issue then

@harishreedharan
Contributor Author

Done
