[SPARK-27041][PySpark] Use imap() for python 2.x to resolve oom issue #23954

TigerYang414 · 2019-03-04T09:04:03Z

What changes were proposed in this pull request?

With a large partition, PySpark may exceed the executor memory limit and trigger an out-of-memory error on Python 2.7.
This is because map() is used. Unlike in Python 3.x, map() in Python 2.7 builds a list, so it has to read all the data into memory.

The proposed fix uses imap in Python 2.7, and it has been verified.
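
The difference described above can be illustrated with a small sketch. It is written for Python 3, where the built-in map() is already lazy; the eager behavior of Python 2's map() is imitated by a hypothetical `eager_map` helper, and `records` is a made-up stand-in for a partition iterator, not anything from worker.py.

```python
def records(n, log):
    """Hypothetical stand-in for a partition iterator; logs each record it yields."""
    for i in range(n):
        log.append(i)
        yield i


def eager_map(func, iterable):
    # Imitates Python 2's built-in map(): the whole result list is built up front.
    return [func(x) for x in iterable]


lazy_log, eager_log = [], []

# Lazy: behaves like Python 3's map() or Python 2's itertools.imap().
lazy = map(lambda x: x * 2, records(5, lazy_log))
first_lazy = next(lazy)

# Eager: every record is read before the first result is available.
eager = eager_map(lambda x: x * 2, records(5, eager_log))
first_eager = eager[0]

print(len(lazy_log))   # 1
print(len(eager_log))  # 5
```

Only one record is pulled from the source in the lazy case, which is why switching to imap keeps memory use independent of partition size in this simplified picture.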

How was this patch tested?

Manual test.

HyukjinKwon · 2019-03-04T09:07:32Z

python/pyspark/worker.py

@@ -45,6 +45,8 @@

 if sys.version >= '3':
    basestring = str
+else:
+    from itertools import imap as map # use iterator map by default


Two spaces are needed before an inline comment.

Should I fix it and update the pull request?

Yes, you need to fix the style for the Jenkins test to proceed.

HyukjinKwon · 2019-03-04T09:08:01Z

ok to test

SparkQA · 2019-03-04T09:16:44Z

Test build #102975 has finished for PR 23954 at commit c018f19.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-03-04T09:35:09Z

Looks reasonable. Let me take a closer look soon to be doubly sure. It's quite a core path.

SparkQA · 2019-03-04T09:56:43Z

Test build #102977 has finished for PR 23954 at commit c7d2eb8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-03-04T13:23:51Z

python/pyspark/worker.py

@@ -45,6 +45,8 @@

 if sys.version >= '3':
    basestring = str
+else:
+    from itertools import imap as map  # use iterator map by default


It deserves a comment, at least. I think this is a relatively safe change as this is already how Python 3 works.
I can only find one usage in this file, which is the part that applies a UDF to data. Is that the source of the issue? Just checking. Surprisingly, that seems to be the only call in non-test code where it seems to matter.

Yes, that one call is what triggers the issue.

Is this not going to potentially change the behavior for Python UDFs?

Hm, that's a good point. I don't know enough Python to be sure. @holdenk do you know how Python would work in this regard? Is it safer to push this check to the one site below and call one or the other map function? It looks like itertools doesn't have imap in Python 3.

Are we worried about the case where the global map inside the pickled function is overridden by the existing global imap? That's not going to happen per https://github.com/cloudpipe/cloudpickle/pull/240.

Shall we add a test to verify it in this PR too?

Yea, sounds good to have a test.

How can we test it here? Make a UDF that checks the value of map.__module__? If it's itertools, then fail, as it would mean this import 'leaked' into the UDF, right? Otherwise it should return __builtin__ in Python 2 or builtins in Python 3.

Yes, I think we can test like that.
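
The check proposed above can be sketched without Spark. This is only an illustration of the idea, not the test that was added: `map_module_seen_from` is a hypothetical helper, and `itertools.starmap` stands in for `imap` (which does not exist in Python 3) to imitate the import leaking into a function's globals.

```python
import itertools


def map_module_seen_from(namespace):
    """Return map.__module__ as resolved inside the given global namespace."""
    return eval("map.__module__", namespace)


# A clean namespace resolves `map` to the built-in.
clean_module = map_module_seen_from({})

# A namespace where `map` was rebound to an itertools callable, imitating
# `from itertools import imap as map` leaking into the UDF's globals.
leaked_module = map_module_seen_from({"map": itertools.starmap})

assert clean_module in ("builtins", "__builtin__")
assert leaked_module == "itertools"
```

In a real test the same `map.__module__` lookup would run inside a UDF on the worker, so that a leaked import shows up as "itertools" in the result.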

The way we pickle UDFs is a little weird, so I wouldn't be too surprised if we did end up doing something silly by accident here. In that case we can also invert the imports (e.g. import map as imap in py3).
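
A minimal sketch of the inverted-import alternative mentioned above, assuming the call site is changed to use an explicitly lazy name instead of rebinding `map`. The function `apply_lazily` is hypothetical and only stands in for the one call site in worker.py.

```python
import sys

if sys.version_info[0] >= 3:
    imap = map  # Python 3: the built-in map is already lazy
else:
    from itertools import imap  # Python 2: lazy counterpart of map


def apply_lazily(func, iterator):
    """Hypothetical call site: always goes through the explicitly lazy name."""
    return imap(func, iterator)


result = apply_lazily(lambda x: x + 1, iter([1, 2, 3]))
print(list(result))  # [2, 3, 4]
```

With this shape the global name `map` is never rebound, so user code that refers to `map` sees the built-in on either Python version.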

HyukjinKwon

Looks fine to me. It would be great if this is double-checked by @srowen and @viirya while you guys are here.

viirya

I think it is fine. Agreed that this should be a safe change, as this is what Python 3 does, if I understand it correctly.

HyukjinKwon · 2019-03-09T00:17:43Z

@TigerYang414, are you able to add a test? if you're not, I can add it for you.

holdenk · 2019-03-09T02:08:41Z

Thanks for this change, really appreciate catching the problem :)

TigerYang414 · 2019-03-11T01:42:47Z

@TigerYang414, are you able to add a test? if you're not, I can add it for you.

I'm not very familiar with the Spark test framework yet. I'd appreciate it if you could mentor me on this.

srowen · 2019-03-11T02:01:04Z

@TigerYang414 see #23954 (comment); maybe a short test in pyspark/sql/tests/test_functions.py that defines a UDF that checks what map is?

HyukjinKwon · 2019-03-11T02:26:06Z

It's okie. I'll add a test on @TigerYang414's branch.

HyukjinKwon · 2019-03-12T00:00:56Z

Hey @TigerYang414, I opened a PR TigerYang414#1 to add a test. Please merge it into your branch after review.

Add a test for PR 23954

SparkQA · 2019-03-12T02:22:52Z

Test build #103356 has finished for PR 23954 at commit 7a404a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-03-12T15:24:33Z

Merged to master

Use imap for python 2.x to resolve oom issue

c018f19

HyukjinKwon reviewed Mar 4, 2019

Update worker.py

c7d2eb8

srowen reviewed Mar 4, 2019

HyukjinKwon approved these changes Mar 5, 2019

viirya approved these changes Mar 5, 2019

Add test

b982763

Merge pull request #1 from HyukjinKwon/imap-test

7a404a3

Add a test for PR 23954

HyukjinKwon approved these changes Mar 12, 2019

srowen approved these changes Mar 12, 2019

srowen closed this in 60a899b Mar 12, 2019
