
SPARK-3604. Replace the map call in UnionRDD#getPartitions method to avoid creating an additional Seq. #2463

Closed
wants to merge 1 commit

Conversation

harishreedharan
Contributor

Replace the map call in UnionRDD#getPartitions to avoid creating an additional Seq.

@SparkQA

SparkQA commented Sep 19, 2014

QA tests have started for PR 2463 at commit c3f476c.

  • This patch merges cleanly.

@srowen
Member

srowen commented Sep 19, 2014

Is the goal here just to make the recursive calls take fewer stack frames, making it harder to overflow? I got the impression there was an infinite recursion lurking here, but I don't see that this fixes it; maybe I misunderstood the JIRA.

@harishreedharan
Contributor Author

Yes. The issue is that there could be UnionRDDs inside the rdds array, so the recursion may be unavoidable, but we can make it take fewer frames. I can't think of a real fix for this, though.

@SparkQA

SparkQA commented Sep 19, 2014

QA tests have finished for PR 2463 at commit c3f476c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericdf

ericdf commented Sep 19, 2014

Fundamentally the way union works is flawed because it forces a caller to create a recursive structure.

In my case, I have:

files = []  # some list
rdd = sc.createAnRDDInTheUsualWay(files[0])
for afile in files[1:]:
    rdd = rdd.union(sc.createAnRDDInTheUsualWay(afile))

At each point in the loop, I'm creating a UnionRDD whose collection of RDDs contains exactly one RDD (also a UnionRDD). You've coded for a tree, but really have a linked list that will blow up the stack.

It should be possible for me to get a broad, flat structure instead, ideally by doing something like this:

rddgen = (sc.createAnRDDInTheUsualWay(x) for x in files)
rdd = sc.union(rddgen)

The proposed patch does not do that, but it should.
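The shape difference described above can be sketched with a toy model. This is NOT real Spark code; ToyRDD, ToyUnionRDD, and flat_union are hypothetical stand-ins that only mimic how RDD#union wraps two RDDs (one level deeper per call) while a SparkContext#union-style flat union wraps all inputs at once, so the recursion depth of a getPartitions-style traversal differs:

```python
class ToyRDD:
    """Stand-in for a plain RDD with a fixed partition list."""
    def __init__(self, partitions):
        self._partitions = list(partitions)

    def get_partitions(self):
        return self._partitions

    def union(self, other):
        # Mimics RDD#union: wraps exactly two RDDs, nesting one level deeper.
        return ToyUnionRDD([self, other])

    def depth(self):
        return 1


class ToyUnionRDD(ToyRDD):
    """Stand-in for UnionRDD: recurses into every child RDD."""
    def __init__(self, rdds):
        self.rdds = list(rdds)

    def get_partitions(self):
        # Like UnionRDD#getPartitions: visits children recursively.
        return [p for rdd in self.rdds for p in rdd.get_partitions()]

    def depth(self):
        return 1 + max(r.depth() for r in self.rdds)


def flat_union(rdds):
    # Mimics SparkContext#union: one UnionRDD over all inputs at once.
    return ToyUnionRDD(rdds)


inputs = [ToyRDD([i]) for i in range(100)]

# Chained: each union() nests the previous result one level deeper,
# producing the linked-list shape described above.
chained = inputs[0]
for r in inputs[1:]:
    chained = chained.union(r)

flat = flat_union(inputs)

print(chained.depth())  # 100: traversal depth grows with the number of inputs
print(flat.depth())     # 2: one level of nesting regardless of input count
```

Both shapes yield the same partitions in the same order; only the nesting depth (and hence the stack usage of the recursive traversal) differs.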

@markhamstra
Contributor

@ericdf What is the type of rddgen in your pseudocode? I'm not understanding why the existing SparkContext#union[T](Seq[RDD[T]]) doesn't already do what you want.

@ericdf

ericdf commented Sep 19, 2014

Ah! I was not aware that there was an API on SparkContext for taking the union of a list -- I had only seen the one on RDD itself, which takes a single `other` RDD.

Yes, the SparkContext#union is exactly what I want. Thank you!

@pwendell
Contributor

@ericdf is your original issue fixed by using the union utility function? I read it as a bug report at first, but I think the issue is just that you were chaining unions together instead of composing them with the utility.

@pwendell
Contributor

@harishreedharan I think the fix is that for people chaining many unions together they should use SparkContext#union - if that's the case we might want to just leave it as-is.

@harishreedharan
Contributor Author

Agreed. This patch simply makes it more difficult to overflow, so it is not really a fix. Will close this.

Thanks,
Hari


@pwendell
Contributor

Gotcha - sounds good!

@pwendell
Contributor

Let's close this issue then

@harishreedharan
Contributor Author

Done
