This repository has been archived by the owner on May 8, 2024. It is now read-only.

MultiProcessScheduler breaks GlommedRDD #22

Closed
mikegoodspeed opened this issue Apr 26, 2013 · 2 comments

@mikegoodspeed (Contributor)

See this test case:

mike@EDEN:~/Development/dpark$ cat t.py 
import dpark
rdd = dpark.makeRDD(range(9), 3).flatMap(lambda i: (i, i)).glom()
print list(map(list, rdd.collect()))
mike@EDEN:~/Development/dpark$ python t.py
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]
mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:24:30,239 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:24:30,249 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[], [], []]

Take a look at Spark's implementation of glom, specifically line 11: they pull the array out of the iterator and build a new array from it. The Python equivalent is list(). Take this patch:

diff --git a/dpark/rdd.py b/dpark/rdd.py
index c04a5e9..1efb3f8 100644
--- a/dpark/rdd.py
+++ b/dpark/rdd.py
@@ -511,7 +511,7 @@ class FilteredRDD(MappedRDD):

 class GlommedRDD(DerivedRDD):
     def compute(self, split):
-        yield self.prev.iterator(split)
+        yield list(self.prev.iterator(split))

 class MapPartitionsRDD(MappedRDD):
     def compute(self, split):

Once I apply the patch, here is the result:

mike@EDEN:~/Development/dpark$ python t.py -m process
2013-04-26 17:25:27,891 [INFO] [scheduler] Got a job with 3 tasks
2013-04-26 17:25:27,902 [INFO] [scheduler] Job finished in 0.0 seconds                    
[[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5], [6, 6, 7, 7, 8, 8]]

I believe this is caused by the pickling that happens when chained iterables are passed through the multiprocessing pool. If you have a better solution that doesn't involve pulling everything into memory, I'd be happy to hear it!
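For anyone hitting this later, a minimal sketch (independent of dpark, plain Python) of the underlying problem: generator objects cannot be pickled, so a compute() that yields a lazy iterator loses its contents when the result crosses a multiprocessing boundary, while a list() materialized from the same iterator round-trips fine, which is what the patch above relies on.

```python
import pickle

def numbers():
    """A generator, standing in for the lazy iterator compute() yields."""
    for i in range(3):
        yield i

# Generators are not picklable, so one sent through a
# multiprocessing pool cannot survive serialization.
try:
    pickle.dumps(numbers())
except TypeError as exc:
    print("generator not picklable:", exc)

# Materializing with list() gives a picklable object that
# round-trips through pickle with its contents intact.
data = pickle.loads(pickle.dumps(list(numbers())))
print("list round-trips:", data)
```

The trade-off the reporter notes is real: list() pulls the whole partition into memory, whereas the generator streamed it lazily.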

@mikegoodspeed (Contributor, Author)

Also, if you could set all the scheduler loggers to debug level, that would be awesome.

@davies davies closed this as completed in e7bccf8 Apr 27, 2013
@davies (Contributor)

davies commented Apr 27, 2013

Hi Mike,

Sorry, I don't quite follow what you mean by "put all the scheduler loggers as debug".

If the job and task progress messages are too noisy, you can use -q to silence them.

Davies

windreamer added a commit that referenced this issue Dec 16, 2013
windreamer added a commit to windreamer/dpark that referenced this issue May 25, 2016
Warn about non-empty hidden files in dpark's output directory