This repository has been archived by the owner on May 8, 2024. It is now read-only.

MultiProcessScheduler breaks GlommedRDD #22

Closed
mikegoodspeed opened this issue Apr 26, 2013 · 2 comments

@mikegoodspeed (Contributor)

See this test case:

mike@EDEN:~/Development/dpark$ cat t.py 
import dpark
rdd = dpark.makeRDD(range(9), 3).flatMap(lambda i: (i, i)).glom()
print list(map(list, rdd.collect()))
mike@EDEN:~/Development/dpark$ python t.py
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]
mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:24:30,239 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:24:30,249 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[], [], []]

Take a look at Spark's implementation of glom, specifically line 11: they pull the array out of the iterator and build a new array from it. The Python equivalent is list(). Take this patch:

diff --git a/dpark/rdd.py b/dpark/rdd.py
index c04a5e9..1efb3f8 100644
--- a/dpark/rdd.py
+++ b/dpark/rdd.py
@@ -511,7 +511,7 @@ class FilteredRDD(MappedRDD):

 class GlommedRDD(DerivedRDD):
     def compute(self, split):
-        yield self.prev.iterator(split)
+        yield list(self.prev.iterator(split))

 class MapPartitionsRDD(MappedRDD):
     def compute(self, split):

Once I apply the patch, here is the result:

mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:25:27,891 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:25:27,902 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]

I believe this is caused by the pickling that happens when chained iterables are passed through the multiprocessing pool. If you have a better solution that doesn't involve pulling everything into memory, I'd be happy to hear it!
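For anyone hitting this later, a minimal sketch (independent of dpark, plain Python) of the underlying problem: generator objects cannot be pickled, so a compute() that yields a lazy iterator loses its contents when the result crosses a multiprocessing boundary, while a list() materialized from the same iterator round-trips fine, which is what the patch above relies on.

```python
import pickle

def numbers():
    """A generator, standing in for the lazy iterator compute() yields."""
    for i in range(3):
        yield i

# Generators are not picklable, so one sent through a
# multiprocessing pool cannot survive serialization.
try:
    pickle.dumps(numbers())
except TypeError as exc:
    print("generator not picklable:", exc)

# Materializing with list() gives a picklable object that
# round-trips through pickle with its contents intact.
data = pickle.loads(pickle.dumps(list(numbers())))
print("list round-trips:", data)
```

The trade-off the reporter notes is real: list() pulls the whole partition into memory, whereas the generator streamed it lazily.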

@mikegoodspeed (Contributor, Author)

Also, if you could set all the scheduler loggers to debug level, that would be awesome.

@davies davies closed this as completed in e7bccf8 Apr 27, 2013
@davies (Contributor)

davies commented Apr 27, 2013

Hi Mike,

Sorry, I don't quite follow what you mean by "put all the scheduler loggers as debug".

If the job and task progress messages are too noisy, you can use -q to silence them.

Davies

windreamer added a commit that referenced this issue Dec 16, 2013
windreamer added a commit to windreamer/dpark that referenced this issue May 25, 2016
Warn about non-empty hidden files in dpark's output directory