Wrong partitions distribution logic for MP Consumer #335

Merged: 3 commits merged into dpkp:master from vshlapakov:fix-mp-consumer-distribution on Mar 12, 2015
Conversation

vshlapakov (Contributor)

I could be wrong, but it seems the partition distribution logic for MultiProcessConsumer is incorrect. The distribution is currently not even at all, which can be checked with the following function (cut from the code and wrapped):

>>> def get_chunks(partitions, num_procs, partitions_per_proc=0):
...     if not partitions_per_proc:
...         partitions_per_proc = round(len(partitions) * 1.0 / num_procs)
...         if partitions_per_proc < num_procs * 0.5:
...             partitions_per_proc += 1
...     chunker = lambda *x: [] + list(x)
...     return map(chunker, *[iter(partitions)] * int(partitions_per_proc))
...

Examples of the wrong behavior (e.g. for 16 partitions and 3 procs, round(16 * 1.0 / 3) gives 5 partitions per proc, which yields 4 chunks instead of 3):

>>> get_chunks(range(16), 3)
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, None, None, None, None]]
# there should be 3 chunks!
>>> get_chunks(range(16), 8)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, None, None]]
# there should be 8 chunks of 2 partitions each

This PR solves the issue by fixing the condition that determines when partitions_per_proc should be increased by 1:

>>> get_chunks(range(16), 1)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
>>> get_chunks(range(16), 2)
[[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
>>> get_chunks(range(16), 3)
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, None, None]]
>>> get_chunks(range(16), 4)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
>>> get_chunks(range(16), 8)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
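
For readers following along: the fixed output above is consistent with computing the chunk size by ceiling division. A minimal sketch of that idea (the helper name get_chunks_fixed is hypothetical, not the actual diff; Python 2 semantics, where map() pads the last chunk with None):

>>> def get_chunks_fixed(partitions, num_procs):
...     # Ceiling division: the smallest chunk size that covers all
...     # partitions with at most num_procs chunks.
...     partitions_per_proc = len(partitions) // num_procs
...     if len(partitions) % num_procs:
...         partitions_per_proc += 1
...     chunker = lambda *x: list(x)
...     return map(chunker, *[iter(partitions)] * partitions_per_proc)
...
>>> get_chunks_fixed(range(16), 3)
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, None, None]]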

dpkp added the consumer label Mar 8, 2015
dpkp (Owner) commented Mar 8, 2015

hmm, hadn't looked at this code before. wouldn't a list comprehension like this be a lot cleaner:

[partitions[proc::num_procs] for proc in range(num_procs)]

this stripes partitions via step-wise iteration rather than chunking them sequentially, but it seems to work a lot better. Try get_chunks(range(16), 7), for example. The current approach yields

In []: get_chunks(range(16), 7)
Out[]: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, None, None]]

which is only 6 chunks, with the last holding only 1 partition, so really it's 5 processes doing most of the work. Compare with the step-wise list comprehension:

In []: partitions = range(16)
In []: num_procs=7
In []: [partitions[proc::num_procs] for proc in range(num_procs)]
Out[]: [[0, 7, 14], [1, 8, 15], [2, 9], [3, 10], [4, 11], [5, 12], [6, 13]]

Now we get 7 chunks, and the partition distribution is more even.
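
A quick illustrative check (not part of the PR): striping never pads with None, covers every partition exactly once, and always yields exactly num_procs chunks:

>>> partitions = range(16)
>>> for num_procs in range(1, 9):
...     chunks = [partitions[proc::num_procs] for proc in range(num_procs)]
...     assert len(chunks) == num_procs
...     assert sorted(p for chunk in chunks for p in chunk) == list(partitions)
...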

vshlapakov (Contributor, Author)

Great idea, I used it to improve the solution.

dpkp (Owner) commented Mar 11, 2015

this is only run during init, so the extra cost of dict.copy().keys() is probably worth it. Can you make that change and add a comment with a link to http://blog.labix.org/2008/06/27/watch-out-for-listdictkeys-in-python-3 ?
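
For context, the linked post describes how dict.keys() changed in Python 3: it returns a live view, so mutating the dict while iterating over it raises a RuntimeError, whereas iterating over a copy is safe. A minimal illustration (the offsets dict here is hypothetical, not kafka-python's actual structure):

>>> offsets = {0: 10, 1: 20, 2: 30, 3: 40}
>>> for partition in offsets.copy().keys():  # iterate a snapshot, not a live view
...     if partition % 2:
...         del offsets[partition]  # safe: the dict being mutated is not the one being iterated
...
>>> offsets
{0: 10, 2: 30}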

vshlapakov (Contributor, Author)

Done!

dpkp added a commit that referenced this pull request Mar 12, 2015
Wrong partitions distribution logic for MP Consumer
dpkp merged commit a5b1c8d into dpkp:master Mar 12, 2015
dpkp (Owner) commented Mar 12, 2015

thanks!

vshlapakov deleted the fix-mp-consumer-distribution branch Mar 24, 2015
vshlapakov (Contributor, Author)

Thank you :)
