
[SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. #1894

Closed
wants to merge 4 commits into from

Conversation

@davies (Contributor) commented Aug 11, 2014

If two RDDs have different batch size in serializers, then it will try to re-serialize the one with smaller batch size, then call RDD.zip() in Spark.
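The re-serialization idea can be sketched in plain Python (no Spark involved; `batch` and `rebatch` are illustrative helpers, not PySpark APIs): flatten the side with the smaller batch size and re-group it to match the other side, so the two batched streams line up before zipping.

```python
def batch(items, size):
    """Group a flat list into batches of `size` (the serializer's batch size)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def rebatch(batches, size):
    """Flatten existing batches and re-group them into batches of `size`."""
    flat = [x for b in batches for x in b]
    return batch(flat, size)

left = batch(list(range(6)), 2)   # batch size 2 -> [[0, 1], [2, 3], [4, 5]]
right = batch(list(range(6)), 3)  # batch size 3 -> [[0, 1, 2], [3, 4, 5]]

# Re-serialize the smaller-batch side to the larger batch size, then zip
# batch-by-batch: every pair of batches now has the same length.
left = rebatch(left, 3)
pairs = [list(zip(a, b)) for a, b in zip(left, right)]
```

With aligned batch sizes, no elements are dropped when the batches are paired up.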

@SparkQA commented Aug 11, 2014

QA tests have started for PR 1894. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18317/consoleFull

@SparkQA commented Aug 11, 2014

QA results for PR 1894:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18317/consoleFull

@davies (Contributor, Author) commented Aug 11, 2014

The failure is not related to this PR. How can I re-test it?

@davies davies changed the title [SPARK-2790] [PySPark] fix zip with serializers which have different batch sizes. [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. Aug 11, 2014
@JoshRosen (Contributor)

Jenkins, retest this please.

@SparkQA commented Aug 12, 2014

QA tests have started for PR 1894. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18345/consoleFull

@SparkQA commented Aug 12, 2014

QA results for PR 1894:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18345/consoleFull

@JoshRosen (Contributor)

Sorry to drop the ball on reviewing this.

On StackOverflow, someone has reported an issue where zip() silently returns less output than it should:

http://stackoverflow.com/questions/25364380/why-does-zip-truncate-the-data-in-pyspark

Any thoughts on what's happening there? My hunch is that the PythonRDDs have the same number of partitions, same number of batches, and same batch sizes, but a different grouping of objects across batched elements. As long as you're working on zip() issues, could you take a look at this and maybe port their test case?
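That hunch can be demonstrated in plain Python (a hypothetical toy model, not the actual PythonRDD code): both sides hold the same six elements in total, but they are grouped into batches differently, and a naive batch-by-batch pairing silently loses data because the builtin `zip` truncates each pair of batches to the shorter one.

```python
# Same total count on both sides, but different per-batch grouping:
keys = [[0, 1, 2], [3, 4, 5]]                   # two batches of 3
values = [['a', 'b'], ['c', 'd'], ['e', 'f']]   # three batches of 2

# A naive pair-deserializer zips batch-by-batch. The outer zip drops the
# third values batch entirely, and each inner zip truncates to the shorter
# batch, so only 4 of the 6 pairs survive.
dropped = [p for kb, vb in zip(keys, values) for p in zip(kb, vb)]
```

Six pairs go in, four come out, and no error is raised anywhere along the way.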

@davies (Contributor, Author) commented Aug 18, 2014

The builtin zip truncates to the length of the shortest input, for example:

zip(range(3), range(1))
[(0, 0)]

I will look deeper into it.


@davies (Contributor, Author) commented Aug 19, 2014

@JoshRosen In the issue reported on StackOverflow, the two RDDs have different numbers of elements, but PairDeserializer did not check that the lengths of the keys and values matched, so part of the data was silently dropped by zip().

This PR fixes that issue; I have added unit tests for these cases.
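The length check can be sketched like this (`zip_batches` is a hypothetical stand-in for the loop inside PairDeserializer, not the actual Spark code): instead of letting `zip` silently truncate, each pair of batches is compared and a mismatch raises an error.

```python
def zip_batches(key_batches, val_batches):
    """Pair up corresponding batches of keys and values,
    refusing to silently drop items when batch lengths differ."""
    for kb, vb in zip(key_batches, val_batches):
        if len(kb) != len(vb):
            raise ValueError(
                "Can not deserialize pair RDD with different number of "
                "items in batches: (%d, %d)" % (len(kb), len(vb)))
        yield from zip(kb, vb)
```

With this guard, the truncation case from the StackOverflow report fails loudly instead of returning a wrong answer.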

@JoshRosen (Contributor)

My impression from the StackOverflow example was that both RDDs had the same number of items, since f.count() and ind.count() were both 52. It doesn't look like the new test cases address this.

@davies (Contributor, Author) commented Aug 19, 2014

They have the same total number of items in the RDD but different numbers of items per partition; I will add another test case to address this.


@JoshRosen (Contributor)

It seems like it could be really confusing to users if we fail on a zip() where the RDDs have the same total number of items and same number of partitions; is there any cheap way to detect this case and re-balance / serialize or to otherwise work around this limitation? If not, it's still better to fail loudly with an error than to return a wrong answer, but it would be nice if we supported this case.

@davies (Contributor, Author) commented Aug 19, 2014

I have not found a way to solve this problem; we cannot control exactly how many items end up in each partition.

In this PR, it will raise an exception, and we have documented this limitation in the API doc.

For this user's case, zipWithIndex() is what they wanted.
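The workaround keys each element by its global index and joins on that key, which does not depend on how items are grouped into partitions or batches. A plain-Python sketch of the idea (using `enumerate` and dicts to emulate `rdd.zipWithIndex()` followed by a join; the real PySpark calls would differ):

```python
a = ['x', 'y', 'z']
b = [10, 20, 30]

# Emulate a.zipWithIndex() / b.zipWithIndex() keyed by index:
ia = dict(enumerate(a))  # index -> item
ib = dict(enumerate(b))

# Join on the shared index; partitioning/batching no longer matters.
pairs = [(ia[i], ib[i]) for i in sorted(ia) if i in ib]
```

Because the pairing is by explicit index rather than by position within a batch, mismatched batch boundaries cannot drop elements.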

@SparkQA commented Aug 19, 2014

QA tests have started for PR 1894 at commit c4652ea.

  • This patch merges cleanly.

@SparkQA commented Aug 19, 2014

QA tests have finished for PR 1894 at commit c4652ea.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

@JoshRosen (Contributor)

This looks great. I think it's better to crash early and loudly rather than to silently return a bad result, so I'm going to merge it into master and branch-1.1.

asfgit pushed a commit that referenced this pull request Aug 19, 2014
…batch sizes.

If two RDDs have different batch size in serializers, then it will try to re-serialize the one with smaller batch size, then call RDD.zip() in Spark.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1894 from davies/zip and squashes the following commits:

c4652ea [Davies Liu] add more test cases
6d05fc8 [Davies Liu] Merge branch 'master' into zip
813b1e4 [Davies Liu] add more tests for failed cases
a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.

(cherry picked from commit d7e80c2)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
@asfgit asfgit closed this in d7e80c2 Aug 19, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@davies davies deleted the zip branch September 15, 2014 22:19