[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records #16121
Conversation
Test build #69571 has finished for PR 16121 at commit

Test build #69573 has finished for PR 16121 at commit
It's pretty tricky to make the chained CartesianDeserializer work; maybe it's easier to have a workaround in RDD.cartesian() that adds a _reserialize() between chained cartesian (or zipped) calls. It will be less performant, but given that cartesian() is already super slow, I would not worry about it. The current patch may still be wrong in the case of a chained CartesianDeserializer and PairSerializer, for example a.cartesian(b.zip(c)) (have not verified yet).
@davies I was trying to make minimal changes to
@davies I suggested a workaround before, but I remember that @holdenk had some reservations. Moreover, it would have to be done proactively for all (?) calls. For example, SPARK-17756 seems to hit a similar problem.
Test build #69587 has finished for PR 16121 at commit
I was hesitant with the previous PR since it seemed like we didn't fully understand why we were changing what we were at the time. I can try to take a closer look at this over the next few days if it is in a good place for that to happen.
Test build #69674 has finished for PR 16121 at commit
Thanks for working on this. Just doing a quick first pass, it looks like really good work - but I'd encourage you to add a few more comments in some places (since we had this bug before, it seems the code wasn't sufficiently self-explanatory). I'll do a deeper look later this week.
```
@@ -96,7 +96,7 @@ def load_stream(self, stream):
        raise NotImplementedError

    def _load_stream_without_unbatching(self, stream):
```
Even though this is internal, it might make sense to have a docstring for this since we're changing its behaviour.
```diff
@@ -278,50 +278,51 @@ def __repr__(self):
         return "AutoBatchedSerializer(%s)" % self.serializer


-class CartesianDeserializer(FramedSerializer):
+class CartesianDeserializer(Serializer):

     """
     Deserializes the JavaRDD cartesian() of two PythonRDDs.
```
Maybe we should document this a bit, given that we had problems with the implementation (e.g. expand on the "Due to batching, we can't use the Java cartesian method." comment from rdd.py to explain how this is intended to function).
```python
key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
    yield product(key_batch, val_batch)
```
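The per-batch product here can be illustrated with a toy pure-Python sketch. The batch contents below are made-up stand-ins, not the actual PySpark stream framing; the point is that zipping the two batch streams in lock-step and taking the cartesian product within each aligned batch pair yields exactly the expected records:

```python
from itertools import chain, product

# Hypothetical stand-ins for the two batch streams: each serializer
# yields its items in batches, and both sides read from the same
# underlying stream, so batches must be consumed in lock-step.
key_batches = [[1, 2], [3]]        # keys arrived in batches of 2, then 1
val_batches = [['a', 'b'], ['c']]  # matching value batches

# Mirror of the loop above: zip the batch streams, take the product
# within each aligned pair, then flatten the per-batch products.
pairs = list(chain.from_iterable(
    product(kb, vb) for kb, vb in zip(key_batches, val_batches)
))
# pairs is [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'c')]
```

Consuming both streams through a single `zip` is what prevents the two iterators from pulling out of order, which was the root cause of the original bug.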
Maybe consider adding a comment here explaining why the interaction of batching & product is handled this way.
Test build #69865 has finished for PR 16121 at commit
LGTM, merging into master and the 2.1/2.0 branches, thanks!
… records

## What changes were proposed in this pull request?

Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted, a new common super class could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
This PR seems to have introduced a bug, which I have reported here: Any thoughts, @aray? Can the check in question simply be removed, or is there a better solution to consider?

I'll take a look, sorry about that.
## What changes were proposed in this pull request?

(edited) Fixes a bug introduced in #16121. In PairDeserializer, convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists, so this should not have a performance impact, but this is needed when repeated `zip`'s are done.

## How was this patch tested?

Additional unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67d)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
What changes were proposed in this pull request?

Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted, a new common super class could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.
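As a rough illustration of the structure described above, here is a self-contained toy sketch. The `Serializer` and `CartesianDeserializer` classes below are simplified stand-ins, not the real pyspark.serializers code (which frames batches with length prefixes on a byte stream rather than iterating a plain Python iterator), but they show the two key ideas: the default `_load_stream_without_unbatching` always returns an iterable of batches, and `load_stream` is just a flatten over it.

```python
from itertools import chain, product


class Serializer(object):
    """Minimal stand-in for the base serializer (sketch only)."""

    def _load_stream_without_unbatching(self, stream):
        # Default: every deserialized item is a batch of one, so the
        # result is always an iterable of batches (consistent results,
        # no extra isinstance checks needed by callers).
        return ([item] for item in stream)


class CartesianDeserializer(Serializer):
    """Sketch of the fixed structure: deserialization only, per-batch product."""

    def __init__(self, key_ser, val_ser):
        self.key_ser = key_ser
        self.val_ser = val_ser

    def _load_stream_without_unbatching(self, stream):
        # Both sub-serializers read from the same stream; zipping their
        # batch streams consumes key and value batches in lock-step.
        key_batches = self.key_ser._load_stream_without_unbatching(stream)
        val_batches = self.val_ser._load_stream_without_unbatching(stream)
        for key_batch, val_batch in zip(key_batches, val_batches):
            yield product(key_batch, val_batch)

    def load_stream(self, stream):
        # The one-line implementation borrowed from BatchedSerializer:
        # flatten the stream of batches into a stream of records.
        return chain.from_iterable(self._load_stream_without_unbatching(stream))


# Toy "stream" with interleaved key/value items; each default batch is one
# item, so the deserializer pairs them up as (key, value) records.
out = list(CartesianDeserializer(Serializer(), Serializer())
           .load_stream(iter(['a', 'x', 'b', 'y'])))
# out is [('a', 'x'), ('b', 'y')]
```

Because chained deserializers all pull batches through the same `_load_stream_without_unbatching` contract, nesting one `CartesianDeserializer` inside another no longer mixes up the read order of the shared stream.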