
[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records #16121

Closed
wants to merge 7 commits

Conversation

aray (Contributor) commented Dec 2, 2016

What changes were proposed in this pull request?

Fixes a bug in the Python implementation of RDD cartesian product, related to batching, that showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

CartesianDeserializer and PairDeserializer were changed to implement _load_stream_without_unbatching and borrow the one-line implementation of load_stream from BatchedSerializer. The default implementation of _load_stream_without_unbatching was changed to give consistent results (always an iterable) so that it can be used without additional checks.

PairDeserializer no longer extends CartesianDeserializer, as that inheritance was not really proper. If desired, a new common superclass could be added.

Both CartesianDeserializer and PairDeserializer now extend only Serializer (which has no dump_stream implementation), since they are only meant for deserialization.
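To illustrate the approach, here is a simplified pure-Python sketch (not the actual Spark code; the `Toy*` names are made up for this example, and it models only the deserialization step, where the JVM side has already produced aligned key/value batches):

```python
from itertools import chain, product

class ToyBatchedSerializer:
    """Toy stand-in for BatchedSerializer: groups a stream into batches."""
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def _load_stream_without_unbatching(self, stream):
        # Contract established by the patch: always yield an iterable
        # of batches, so callers need no special cases.
        batch = []
        for item in stream:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

class ToyCartesianDeserializer:
    """Pairs aligned key/value batches and yields their product."""
    def __init__(self, key_ser, val_ser):
        self.key_ser, self.val_ser = key_ser, val_ser

    def _load_stream_without_unbatching(self, key_stream, val_stream):
        key_batches = self.key_ser._load_stream_without_unbatching(key_stream)
        val_batches = self.val_ser._load_stream_without_unbatching(val_stream)
        for key_batch, val_batch in zip(key_batches, val_batches):
            # Each output batch is the cartesian product of one
            # aligned (key batch, value batch) pair.
            yield product(key_batch, val_batch)

    def load_stream(self, key_stream, val_stream):
        # The borrowed one-line load_stream: flatten the batches.
        return chain.from_iterable(
            self._load_stream_without_unbatching(key_stream, val_stream))

ser = ToyCartesianDeserializer(ToyBatchedSerializer(2), ToyBatchedSerializer(2))
pairs = list(ser.load_stream(iter([1, 2, 3]), iter(['a', 'b', 'c'])))
# pairs == [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'c')]
```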

How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

SparkQA commented Dec 2, 2016

Test build #69571 has finished for PR 16121 at commit a0e3652.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 2, 2016

Test build #69573 has finished for PR 16121 at commit ad43e31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies (Contributor) commented Dec 2, 2016

It's pretty tricky to make the chained CartesianDeserializer work; maybe it's easier to have a workaround in RDD.cartesian() that adds a _reserialize() between chained cartesian (or zipped) calls. It will be less performant, but given that cartesian() is already super slow, I would not worry about it.

The current patch may still be wrong in the case of chained CartesianDeserializer and PairDeserializer, for example a.cartesian(b.zip(c)) (not verified yet).
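The workaround described above amounts to re-batching: flatten whatever composite batching an upstream deserializer produced and re-emit plain uniform batches before the next cartesian. A hypothetical pure-Python sketch of that idea (the `rebatch` name and behaviour are assumptions for illustration, not PySpark's actual `_reserialize`):

```python
from itertools import chain, islice

def rebatch(batch_stream, batch_size=2):
    # Flatten the upstream batches (whatever scheme produced them)
    # and re-emit fixed-size batches, so a chained cartesian only
    # ever sees simple, uniform batching.
    flat = chain.from_iterable(batch_stream)
    while True:
        batch = list(islice(flat, batch_size))
        if not batch:
            return
        yield batch

rebatched = list(rebatch([[1], [2, 3, 4], [5]], 2))
# rebatched == [[1, 2], [3, 4], [5]]
```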

aray (Contributor, Author) commented Dec 2, 2016

@davies I was trying to make minimal changes to PairDeserializer, but you are right, it needs to be changed as well. I'll update the PR shortly.

zero323 (Member) commented Dec 2, 2016

@davies I suggested a workaround before, but I remember that @holdenk had some reservations.

Moreover, it would have to be done proactively for all (?) calls. For example, SPARK-17756 seems to hit a similar problem.

SparkQA commented Dec 2, 2016

Test build #69587 has finished for PR 16121 at commit 6e3d9d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk (Contributor) commented Dec 3, 2016

I was hesitant about the previous PR since it seemed like we didn't fully understand why we were changing what we were changing at the time. I can try to take a closer look at this over the next few days if it is in a good place for that to happen.

SparkQA commented Dec 5, 2016

Test build #69674 has finished for PR 16121 at commit 36e3876.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

aray (Contributor, Author) commented Dec 5, 2016

@davies, @zero323, and @holdenk this is in a good place for review if you want to take a look.

holdenk (Contributor) left a comment

Thanks for working on this. Just doing a quick first pass, it looks like really good work, but I'd encourage you to add a few more comments in some places (since we had this bug before, it seems the code wasn't sufficiently self-explanatory). I'll take a deeper look later this week.

@@ -96,7 +96,7 @@ def load_stream(self, stream):
raise NotImplementedError

def _load_stream_without_unbatching(self, stream):

Even though this is internal, it might make sense to have a docstring for this since we're changing its behaviour.

@@ -278,50 +278,51 @@ def __repr__(self):
return "AutoBatchedSerializer(%s)" % self.serializer


-class CartesianDeserializer(FramedSerializer):
+class CartesianDeserializer(Serializer):

"""
Deserializes the JavaRDD cartesian() of two PythonRDDs.

Maybe we should document this a bit, given that we had problems with the implementation (e.g. expand on the "Due to batching, we can't use the Java cartesian method." comment from rdd.py to explain how this is intended to function).

key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
yield product(key_batch, val_batch)

Maybe consider adding a comment here explaining the interaction of batching & product.
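For illustration, here is why that interaction matters: pairing aligned batches with `zip` before taking the `product` keeps the record count right, while a naive reader that ignores batch boundaries pairs the wrong elements. (A deliberately simplified demo, not the pre-fix Spark code.)

```python
from itertools import chain, product

# One interleaved stream of batches, as the JVM side writes them:
# key batch, its matching value batch, next key batch, and so on.
stream = [[1, 2], ['a', 'b'], [3], ['c']]

# Respecting batch boundaries: combine each key batch only with its
# own value batch, then flatten.
good = list(chain.from_iterable(
    product(kb, vb) for kb, vb in zip(stream[0::2], stream[1::2])))
# good == [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'c')]  (5 records)

# Ignoring batching: flatten everything first and pair alternating
# elements -- keys and values get shuffled across batches and the
# record count comes out wrong.
flat = list(chain.from_iterable(stream))
bad = list(zip(flat[0::2], flat[1::2]))
# bad == [(1, 2), ('a', 'b'), (3, 'c')]  (3 records, and nonsense pairs)
```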

SparkQA commented Dec 8, 2016

Test build #69865 has finished for PR 16121 at commit 12f3ab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies (Contributor) commented Dec 8, 2016

LGTM, merging into master and 2.1/2.0 branch, thanks!

asfgit pushed a commit that referenced this pull request Dec 8, 2016
… records

## What changes were proposed in this pull request?

Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 3c68944 Dec 8, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
stuarteberg commented

This PR seems to have introduced a bug, which I have reported here:
https://issues.apache.org/jira/browse/SPARK-21985

Any thoughts, @aray? Can the check in question simply be removed, or is there a better solution to consider?

aray (Contributor, Author) commented Sep 13, 2017

I'll take a look, sorry about that.

asfgit pushed a commit that referenced this pull request Sep 17, 2017
## What changes were proposed in this pull request?
Fixes a bug introduced in #16121

In PairDeserializer convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists so this should not have a performance impact, but this is needed when repeated `zip`'s are done.

## How was this patch tested?

Additional unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67d)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
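The fix described in the commit above can be sketched as a stand-alone function (a simplified illustration, not the actual Spark code; the real logic lives in PairDeserializer and reads both batch streams from a single underlying stream):

```python
def load_pair_batches(key_batches, val_batches):
    """Zip aligned key/value batches, checking that sizes match."""
    for key_batch, val_batch in zip(key_batches, val_batches):
        # Batches may be lazy iterators (e.g. produced by a nested
        # CartesianDeserializer), so materialize them before len().
        if not hasattr(key_batch, '__len__'):
            key_batch = list(key_batch)
        if not hasattr(val_batch, '__len__'):
            val_batch = list(val_batch)
        if len(key_batch) != len(val_batch):
            raise ValueError("mismatched batch sizes: %d != %d"
                             % (len(key_batch), len(val_batch)))
        yield list(zip(key_batch, val_batch))

# Iterator batches (no __len__) are materialized and checked:
out = list(load_pair_batches([iter([1, 2])], [iter(['a', 'b'])]))
# out == [[(1, 'a'), (2, 'b')]]
```

Normally the batches are already lists, so the `hasattr` check makes the materialization a no-op in the common case.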
asfgit pushed a commit that referenced this pull request Sep 17, 2017
asfgit pushed a commit that referenced this pull request Sep 17, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018