[SPARK-21985][PySpark] PairDeserializer is broken for double-zipped RDDs #19226
Conversation
Test build #81746 has finished for PR 19226 at commit
Test build #81748 has finished for PR 19226 at commit
So quick testing locally suggests the exception type may have changed from the explicit ValueError to a Py4J error.
It's actually this one that is failing: https://github.com/aray/spark/blob/0d64a6d11237383c2a6ea21275dc9daa5cc8d634/python/pyspark/tests.py#L964
Yup, that's part of
But that does mean an exception isn't being raised at all, so it's probably not the exception type then. Hmm.
@holdenk I'm not going to be able to solve this tonight (short of just removing the failing test).
Sure, no worries. I think we should keep the test for now, and we can hope this goes into RC2 (I assume something will be missing from RC1, or I'll screw up its packaging in some way). Otherwise the fix can go out in 2.2.1 if somehow RC1 magically passes :)
python/pyspark/serializers.py (outdated)

```diff
@@ -343,9 +346,6 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
-            if len(key_batch) != len(val_batch):
-                raise ValueError("Can not deserialize PairRDD with different number of items"
-                                 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
             # for correctness with repeated cartesian/zip this must be returned as one batch
             yield zip(key_batch, val_batch)
```
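For context, here is a minimal plain-Python illustration (not Spark code) of why the length check above broke for double-zipped RDDs: when the key and value serializers are themselves `PairDeserializer`s, each batch arrives as a `zip` object, which in Python 3 is a lazy iterator with no `__len__`.

```python
# Sketch of the failure mode: a batch produced by a nested PairDeserializer
# is a zip object, and zip objects (Python 3) have no __len__, so calling
# len() on them raises TypeError instead of returning a batch size.
key_batch = zip([1, 2, 3], ["a", "b", "c"])  # what a nested pair deserializer yields

try:
    len(key_batch)
except TypeError:
    print("len() is not supported on a zip object")
```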
How about returning this batch as a list (and as described in the doc)?
python/pyspark/tests.py (outdated)

```diff
@@ -644,6 +644,18 @@ def test_cartesian_chaining(self):
             set([(x, (y, y)) for x in range(10) for y in range(10)])
         )
 
+    def test_zip_chaining(self):
+        # Tests for SPARK-21985
+        rdd = self.sc.parallelize(range(10), 2)
```
This test case already passes without this change, doesn't it?
python/pyspark/serializers.py (outdated)

```diff
@@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            key_batch = list(key_batch)
+            val_batch = list(val_batch)
```
Should we fix the doc in `Serializer._load_stream_without_unbatching` to say it returns an iterator of iterables?
fixed in 66477f8
Ah, I should have been clearer. Actually, I meant that if `Serializer._load_stream_without_unbatching` worked as documented, returning an iterator of deserialized batches (lists), everything should have worked fine. So I think the reverse is actually more correct, because `PairDeserializer` and `CartesianDeserializer` do not follow this.

I am okay with the current change too, but I believe the reverse is better because I think we could prevent such issues in the future and make things simpler. WDYT @aray and @holdenk?
Test build #81760 has finished for PR 19226 at commit
Test build #81785 has finished for PR 19226 at commit
python/pyspark/tests.py (outdated)

```diff
@@ -644,6 +644,18 @@ def test_cartesian_chaining(self):
             set([(x, (y, y)) for x in range(10) for y in range(10)])
         )
 
+    def test_zip_chaining(self):
+        # Tests for SPARK-21985
+        rdd = self.sc.parallelize('abc')
```
I'd set an explicit number of partitions, because `zip` reserializes the RDD depending on it.
Test build #81810 has finished for PR 19226 at commit
Test build #81811 has finished for PR 19226 at commit
OK, this way should resolve the issue too. I assume you strongly prefer this way and I am okay with it. LGTM. Let me leave it to @holdenk.
python/pyspark/serializers.py (outdated)

```diff
@@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            key_batch = key_batch if hasattr(key_batch, '__len__') else list(key_batch)
```
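The conditional above can be read as a tiny helper. This is an illustrative sketch (the helper name is mine, not Spark's): already-materialized batches pass through untouched, so the common list case pays no copying cost, while `len()`-less iterators get converted.

```python
def materialize_batch(batch):
    # Lists (the common case) pass through unchanged, so there is no
    # per-batch copy on the normal path; only len()-less iterators such as
    # zip objects from a nested PairDeserializer are converted to lists.
    return batch if hasattr(batch, '__len__') else list(batch)

print(materialize_batch([1, 2, 3]))        # [1, 2, 3] (same list, unchanged)
print(materialize_batch(zip("ab", "cd")))  # [('a', 'c'), ('b', 'd')]
```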
Could we add a small comment that this is required because `_load_stream_without_unbatching` could return an iterator of iterators in this case?
Test build #81824 has finished for PR 19226 at commit
python/pyspark/serializers.py (outdated)

```diff
@@ -343,6 +343,9 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            # the batch is an iterable, we need to check lengths so we convert to list if needed.
```
nit: For double-zipped RDDs, the batches can be iterators from another `PairDeserializer` instead of lists. We need to convert them to lists if needed.
LGTM
Test build #81856 has finished for PR 19226 at commit
## What changes were proposed in this pull request?

Fixes a bug introduced in #16121. In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they already are lists, so this should not have a performance impact, but this is needed when repeated `zip`s are done.

## How was this patch tested?

Additional unit test.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67d)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Merged to master, branch-2.2 and branch-2.1.
## What changes were proposed in this pull request?

Fixes a bug introduced in #16121.

In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they already are lists, so this should not have a performance impact, but this is needed when repeated `zip`s are done.

## How was this patch tested?

Additional unit test.
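To see the whole flow without a Spark cluster, here is a pure-Python sketch of the patched generator. Names and structure are simplified from `PairDeserializer._load_stream_without_unbatching`; this is an illustration, not Spark's actual code.

```python
def pair_batches(key_batches, val_batches):
    """Mimics the patched pair-deserializer loop over batch streams."""
    for key_batch, val_batch in zip(key_batches, val_batches):
        # The fix: materialize len()-less iterators (e.g. zip objects coming
        # from a nested pair deserializer) so the size check cannot raise
        # TypeError. Plain lists pass through unchanged.
        key_batch = key_batch if hasattr(key_batch, '__len__') else list(key_batch)
        val_batch = val_batch if hasattr(val_batch, '__len__') else list(val_batch)
        if len(key_batch) != len(val_batch):
            raise ValueError("mismatched batch sizes: (%d, %d)"
                             % (len(key_batch), len(val_batch)))
        # for correctness with repeated cartesian/zip, return one batch
        yield zip(key_batch, val_batch)

# Single zip: the batches are plain lists.
inner = pair_batches([[1, 2]], [["a", "b"]])
# Double zip: the key stream is now itself an iterator of zip objects,
# which is exactly the case that used to crash the length check.
outer = pair_batches(inner, [[10, 20]])
print([list(b) for b in outer])  # [[((1, 'a'), 10), ((2, 'b'), 20)]]
```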