Take a look at Spark's implementation of glom, specifically line 11: it drains the iterator and builds a new array from its elements. The Python equivalent is list(). Take this patch:
diff --git a/dpark/rdd.py b/dpark/rdd.py
index c04a5e9..1efb3f8 100644
--- a/dpark/rdd.py
+++ b/dpark/rdd.py
@@ -511,7 +511,7 @@ class FilteredRDD(MappedRDD):
 
 class GlommedRDD(DerivedRDD):
     def compute(self, split):
-        yield self.prev.iterator(split)
+        yield list(self.prev.iterator(split))
 
 class MapPartitionsRDD(MappedRDD):
     def compute(self, split):
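To make the intent concrete, here is a rough sketch in plain Python (not dpark's API; the partition data is made up) of what glom should produce once each partition's iterator is materialized with list():

# Hypothetical partitions, each exposed as a lazy iterator, the way
# self.prev.iterator(split) hands data to GlommedRDD.compute.
partitions = [iter([1, 2, 3]), iter([4, 5])]

# glom yields one concrete list per partition -- the same thing Spark does
# when it drains the iterator into a new array.
glommed = [list(part) for part in partitions]
print(glommed)  # [[1, 2, 3], [4, 5]]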
I believe this is caused by the pickling that happens when chained iterables are passed through the multiprocessing pool. If you have a better solution that doesn't involve pulling everything into memory, I'd be happy to hear it!
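As a minimal illustration of the failure mode (plain Python, assuming the value yielded by compute has to be pickled before crossing the pool boundary): generator objects cannot be pickled, while the materialized list round-trips fine.

import pickle

def partition_iterator(data):
    # Stand-in for self.prev.iterator(split): a lazy generator over the split.
    return (x for x in data)

gen = partition_iterator([1, 2, 3])
try:
    pickle.dumps(gen)  # what the multiprocessing pool effectively does with results
except TypeError as e:
    print("unpicklable:", e)  # cannot pickle 'generator' object

# Materializing first, as the patch does with list(), pickles without issue.
blob = pickle.dumps(list(partition_iterator([1, 2, 3])))
print(pickle.loads(blob))  # [1, 2, 3]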