Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32435][PYTHON] Remove heapq3 port from Python 3 #29229

Closed
wants to merge 2 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jul 25, 2020

What changes were proposed in this pull request?

This PR removes the manual port of heapq3.py introduced from SPARK-3073. The main reason of this was to support Python 2.6 and 2.7 because Python 2's heapq.merge() doesn't not support key and reverse.

See

Since we dropped the Python 2 at SPARK-32138, we can remove this away.

Why are the changes needed?

To remove unnecessary codes. Also, we can leverage bug fixes made in Python 3.x at heapq.

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

Existing tests should cover. I locally ran and verified:

./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle"
./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter"
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key"

@HyukjinKwon
Copy link
Member Author

@srowen, @JoshRosen @viirya can you take a look when you're available please?

@@ -498,7 +498,7 @@ def load(f):
if current_chunk:
chunks.append(iter(current_chunk))

return heapq.merge(chunks, key=key, reverse=reverse)
return heapq.merge(*chunks, key=key, reverse=reverse)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a bit of modification when this file was ported from Python 3 because heapq has to be able to compile with Python 2 as well. The diffs made are:

< def merge(iterables, key=None, reverse=False):
---
> def merge(*iterables, key=None, reverse=False):
216c218,219
<                 h_append([next(it), order * direction, it])
---
>                 next = it.__next__
>                 h_append([next(), order * direction, next])
223c226
<                     value, order, it = s = h[0]
---
>                     value, order, next = s = h[0]
225c228
<                     s[0] = next(it)           # raises StopIteration when exhausted
---
>                     s[0] = next()           # raises StopIteration when exhausted
231c234
<             value, order, it = h[0]
---
>             value, order, next = h[0]
233,234c236
<             for value in it:
<                 yield value
---
>             yield from next.__self__
239,240c241,243
<             value = next(it)
<             h_append([key(value), order * direction, value, it])
---
>             next = it.__next__
>             value = next()
>             h_append([key(value), order * direction, value, next])
247c250
<                 key_value, order, value, it = s = h[0]
---
>                 key_value, order, value, next = s = h[0]
249c252
<                 value = next(it)
---
>                 value = next()
256c259
<         key_value, order, value, it = h[0]
---
>         key_value, order, value, next = h[0]
258,259c261
<         for value in it:
<             yield value
---
>         yield from next.__self__

These differences don't look affecting any behaviours.

I think it was ported from Python 3.5: https://github.com/python/cpython/blob/3.5/Lib/heapq.py

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for doing this.

@@ -25,7 +25,7 @@
import random
import sys

import pyspark.heapq3 as heapq
import heapq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put it with other python built-in imports above?

@SparkQA
Copy link

SparkQA commented Jul 25, 2020

Test build #126529 has finished for PR 29229 at commit d15a556.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jul 25, 2020

Test build #126537 has finished for PR 29229 at commit d15a556.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Thank you guys. Let me take a look for the test failure tomorrow. Should be trivial to handle I believe.

@@ -796,7 +796,7 @@ def load_partition(j):

if self._sorted:
# all the partitions are already sorted
sorted_items = heapq.merge(disk_items, key=operator.itemgetter(0))
sorted_items = heapq.merge(*disk_items, key=operator.itemgetter(0))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed one place here which caused the test failure. I double checked this is the last one I missed.

@SparkQA
Copy link

SparkQA commented Jul 27, 2020

Test build #126612 has finished for PR 29229 at commit d6ac35d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@HyukjinKwon HyukjinKwon deleted the SPARK-32435 branch July 27, 2020 07:42
@HyukjinKwon HyukjinKwon restored the SPARK-32435 branch July 27, 2020 07:45
@HyukjinKwon HyukjinKwon reopened this Jul 27, 2020
@HyukjinKwon
Copy link
Member Author

Simply rebased to resolve conflicts.

@SparkQA
Copy link

SparkQA commented Jul 27, 2020

Test build #126625 has finished for PR 29229 at commit d6ac35d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Merged to master.

Thanks @dongjoon-hyun and @viirya

@SparkQA
Copy link

SparkQA commented Jul 27, 2020

Test build #126631 has finished for PR 29229 at commit 007c2a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-32435 branch December 7, 2020 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants