Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27041][PySpark] Use imap() for python 2.x to resolve oom issue #23954

Closed
wants to merge 4 commits into from

Conversation

TigerYang414
Copy link
Contributor

What changes were proposed in this pull request?

With large partition, pyspark may exceeds executor memory limit and trigger out of memory for python 2.7.
This is because map() is used. Unlike in python3.x, python 2.7 map() will generate a list and need to read all data into memory.

The proposed fix will use imap in python 2.7 and it has been verified.

How was this patch tested?

Manual test.
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

With large partition, pyspark may exceeds executor memory limit and trigger out of memory for python 2.7.
This is because map() is used and python 2.7 map() will need to read all data into memory.
@@ -45,6 +45,8 @@

if sys.version >= '3':
basestring = str
else:
from itertools import imap as map # use iterator map by default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two spaces before inlined comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I fix it and update the pull request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you need to fix the style to proceed the jenkins test.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Mar 4, 2019

Test build #102975 has finished for PR 23954 at commit c018f19.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Looks reasonable. Let me take a closer look soon to be doubly sure. It's quite a core path.

@SparkQA
Copy link

SparkQA commented Mar 4, 2019

Test build #102977 has finished for PR 23954 at commit c7d2eb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -45,6 +45,8 @@

if sys.version >= '3':
basestring = str
else:
from itertools import imap as map # use iterator map by default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It deserves a comment, at least. I think this is a relatively safe change as this is already how Python 3 works.
I can only find one usage in this file, which is the part that applies a UDF to data. That's the source of the issue? just checking. Surprisingly that seems to be the only call in non-test code where it seems to matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the only call is the source the trigger the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not going to potentially change the behavior for python UDF?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's a good point. I don't know enough Python to be sure. @holdenk do you know how Python would work in this regard? Is it safer to push this check to the one site below and call one or the other map function? it looks like itertools doesn't have imap in Python 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we worrying about the case that global map inside the pickled function is overridden by existing global imap? That's not going to happen per https://github.com/cloudpipe/cloudpickle/pull/240.

Copy link
Member

@viirya viirya Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a test to verify it in this PR too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, sounds good to have a test.

Copy link
Member

@srowen srowen Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we test it here -- make a UDF that checks the value of map.__module__? if it's itertools, then fail, as it would mean this import 'leaked' into the UDF right? Otherwise in Python 2/3 it should return builtins or __builtin__

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can test like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we pickle UDFs is a little weird, so I wouldn't be too surprised if we did end up doing something silly by accident here, in that case we can also invert the imports (e.g. import map as imap in py3)

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine to me. would be great if this is double checked by @srowen and @viirya while you guys are here.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine. Agreed that this should be safe change as this is what Python3 does if I understand it correctly.

@HyukjinKwon
Copy link
Member

@TigerYang414, are you able to add a test? if you're not, I can add it for you.

@holdenk
Copy link
Contributor

holdenk commented Mar 9, 2019

Thanks for this change, really appreciate catching the problem :)

@TigerYang414
Copy link
Contributor Author

@TigerYang414, are you able to add a test? if you're not, I can add it for you.

I'm not very familiar with spark test framework yet. I'll appreciate if you could mentor me on this.

@srowen
Copy link
Member

srowen commented Mar 11, 2019

@TigerYang414 see #23954 (comment) ; maybe a short test that defines a UDF that checks what map is? in pyspark/sql/tests/test_functions.py

@HyukjinKwon
Copy link
Member

It's okie. I'll add a test on @TigerYang414's branch.

@HyukjinKwon
Copy link
Member

Hey @TigerYang414, I opened a PR TigerYang414#1 to add a test. please merge that after review into your branch.

@SparkQA
Copy link

SparkQA commented Mar 12, 2019

Test build #103356 has finished for PR 23954 at commit 7a404a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Mar 12, 2019

Merged to master

@srowen srowen closed this in 60a899b Mar 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants