SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter #1722

mateiz · 2014-08-01T22:40:25Z

All these changes are from @mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
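
To illustrate the trailing-byte problem described above, here is a minimal sketch that is not taken from the patch. It uses only the JDK's `ObjectOutputStream` and an in-memory buffer in place of a spill file; `TrailingResetDemo` and `measure` are hypothetical names chosen for this example. Calling `reset()` after the last object appends one `TC_RESET` byte, so a reader that stops right after the last object has not consumed the whole batch.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream, ObjectStreamConstants}

// Hypothetical demo object; not part of Spark.
object TrailingResetDemo {
  /** Returns (extra bytes written by reset(), whether the last byte is TC_RESET). */
  def measure(): (Int, Boolean) = {
    val bytes = new ByteArrayOutputStream() // stands in for a spill file
    val out = new ObjectOutputStream(bytes)

    out.writeObject("last record of the batch")
    out.flush()
    val sizeAfterLastObject = bytes.size()

    // A reset after the final object still writes a marker byte to the stream.
    out.reset()
    out.flush()
    val sizeAfterReset = bytes.size()

    val lastIsReset = bytes.toByteArray.last == ObjectStreamConstants.TC_RESET
    (sizeAfterReset - sizeAfterLastObject, lastIsReset)
  }
}
```

If batches are simply concatenated in one file, that extra byte sits between the last object of one batch and the stream header of the next, which is why the reader needs the exact byte range of each batch.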

SparkQA · 2014-08-01T22:44:47Z

QA tests have started for PR 1722. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17708/consoleFull

mateiz · 2014-08-01T22:45:00Z

core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala

+
+        val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start))
+        val compressedStream = blockManager.wrapForCompression(blockId, bufferedStream)
+        ser.deserializeStream(compressedStream)


One delta w.r.t. your patch, @mridulm: you used to do ser = serializer.newInstance before this, but this should not be necessary; our serializers support reading even multiple streams concurrently (though confusingly not writing them as far as I see; they can share an output buffer there). I removed that because creating a new instance is actually kind of expensive for Kryo.

So that is something I was not sure of, particularly with Kryo (not Java).
We were seeing the input buffer getting stepped on from various threads. This was specifically in the context of the 2G fixes though, where we had to modify the way the buffer was created anyway. I don't know if the initialization changes something else.
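
The bounded read in the diff above can be sketched roughly as follows, under some loud assumptions: this uses only JDK streams, replaces the spill file with a byte array, omits compression, and bounds each batch with `new ByteArrayInputStream(data, offset, length)` instead of Guava's `ByteStreams.limit`. `BoundedBatchDemo`, `writeBatches` and `readBatches` are hypothetical names, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical demo object; not part of Spark.
object BoundedBatchDemo {
  /** Writes each batch with its own ObjectOutputStream and records its byte length. */
  def writeBatches(batches: Seq[Seq[String]]): (Array[Byte], Seq[Int]) = {
    val file = new ByteArrayOutputStream() // stands in for a spill file
    val sizes = ArrayBuffer.empty[Int]
    for (batch <- batches) {
      val start = file.size()
      val out = new ObjectOutputStream(file)
      // Resetting after every object leaves a trailing TC_RESET at the end of the batch.
      batch.foreach { item => out.writeObject(item); out.reset() }
      out.flush()
      sizes += (file.size() - start)
    }
    (file.toByteArray, sizes.toSeq)
  }

  /** Reads each batch from a stream limited to exactly that batch's byte range. */
  def readBatches(data: Array[Byte], sizes: Seq[Int], counts: Seq[Int]): Seq[Seq[String]] = {
    var start = 0
    sizes.zip(counts).map { case (size, count) =>
      // The reader cannot run past the end of the batch, and the next batch
      // starts at a known offset even if trailing bytes were left unread.
      val in = new ObjectInputStream(new ByteArrayInputStream(data, start, size))
      start += size
      (0 until count).map(_ => in.readObject().asInstanceOf[String])
    }
  }
}
```

Because every batch is opened at `start` and limited to `size` bytes, any bytes the serializer wrote after the last object stay inside that batch's range instead of being read as the beginning of the next one.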

mateiz · 2014-08-01T22:46:08Z

Jenkins, test this please

mateiz · 2014-08-01T22:48:43Z

core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala

-    private val fileStream = new FileInputStream(file)
-    private val bufferedStream = new BufferedInputStream(fileStream, fileBufferSize)
+    extends Iterator[(K, C)]
+  {


Here I also removed some of the more paranoid asserts about batchSizes

Those asserts caught the bugs :-) But yeah, some of them might have been expensive.

SparkQA · 2014-08-02T01:59:16Z

QA tests have started for PR 1722. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17734/consoleFull

SparkQA · 2014-08-02T03:06:06Z

QA results for PR 1722:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17734/consoleFull

mateiz · 2014-08-03T06:42:56Z

@aarondav / @mridulm any other comments on this, or is it okay to merge?

mridulm · 2014-08-03T14:41:04Z

LGTM, thanks Matei !

mridulm · 2014-08-03T14:45:39Z

Oh wait, is the Java serializer change also ported?
Otherwise the tests won't do what we want them to do.

aarondav · 2014-08-03T17:19:05Z

+0, I have not actually reviewed this, I only did a cursory pass-through. When it LGTM to @mridulm, we can merge.

mateiz · 2014-08-03T18:08:00Z

Ah good point.. I've now pushed the JavaSerializer change.

mateiz · 2014-08-03T18:09:56Z

Jenkins, test this please

SparkQA · 2014-08-03T18:14:24Z

QA tests have started for PR 1722. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17817/consoleFull

SparkQA · 2014-08-03T19:21:31Z

QA results for PR 1722:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17817/consoleFull

mridulm · 2014-08-03T21:14:28Z

LGTM !
Though I would prefer if @aarondav also took a look at it - since this is based on my earlier work, I might be too close to it to see potential issues ...

mateiz · 2014-08-03T23:48:01Z

I just fixed objectStreamReset slightly so that 1 means "reset after every object" (that's what it was intended to be originally)

mateiz · 2014-08-04T00:21:26Z

Jenkins, test this please

SparkQA · 2014-08-04T00:24:20Z

QA tests have started for PR 1722. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17835/consoleFull

SparkQA · 2014-08-04T01:15:29Z

QA results for PR 1722:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17835/consoleFull

mateiz · 2014-08-04T02:58:48Z

Jenkins, test this please

SparkQA · 2014-08-04T03:04:11Z

QA tests have started for PR 1722. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17837/consoleFull

SparkQA · 2014-08-04T03:57:03Z

QA results for PR 1722:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17837/consoleFull

mridulm · 2014-08-04T11:16:03Z

core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala

   * the stream 'resets' object class descriptions have to be re-written)
   */
  def writeObject[T: ClassTag](t: T): SerializationStream = {
    objOut.writeObject(t)
+    counter += 1
    if (counterReset > 0 && counter >= counterReset) {


This is the right behavior, but it is a slight change ... I don't think anyone is expecting the earlier behavior though!
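
The effect of moving the increment ahead of the check can be sketched with a small simulation. This is an illustration, not the JavaSerializer code: `ResetCounterDemo` and `resetPoints` are hypothetical names, and the "increment only when no reset fires" branch is an assumption about the earlier behavior, chosen to match the "reset after (reset + 1) writes" description in this thread.

```scala
// Hypothetical demo object; not part of Spark.
object ResetCounterDemo {
  /**
   * Simulates `writes` calls to writeObject and returns the 1-based indices
   * of the writes after which the stream would be reset.
   */
  def resetPoints(counterReset: Int, writes: Int, incrementBeforeCheck: Boolean): Seq[Int] = {
    var counter = 0
    val points = Seq.newBuilder[Int]
    for (i <- 1 to writes) {
      // New behavior: the object just written is counted before the check.
      if (incrementBeforeCheck) counter += 1
      if (counterReset > 0 && counter >= counterReset) {
        points += i
        counter = 0
      } else if (!incrementBeforeCheck) {
        // Assumed earlier behavior: count only when no reset happened.
        counter += 1
      }
    }
    points.result()
  }
}
```

With `counterReset = 2`, the new ordering resets after writes 2, 4 and 6, while the assumed earlier ordering resets after writes 3 and 6. With `counterReset = 1` the new ordering resets after every object, and a value of 0 never resets.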

mridulm · 2014-08-04T11:16:35Z

LGTM !

mateiz · 2014-08-04T19:59:42Z

Alright, I've merged this in. Thanks for looking over it!

@mridulm

…in ExternalMap / Sorter

All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <matei@databricks.com>

Closes #1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes


mateiz added 3 commits August 1, 2014 15:02

Added Mridul's test changes for ExternalAppendOnlyMap

0d6dad7

Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too

bda37bb

Modified ExternalSorterSuite to also set a low object stream reset and batch size, and verified that it failed before the changes and succeeded after.

mateiz reviewed Aug 1, 2014

mateiz mentioned this pull request Aug 1, 2014

[SPARK-2532] Consolidated shuffle fixes #1609

Closed

5 tasks

mateiz reviewed Aug 1, 2014

Remove super paranoid code to close file handles

0374217

Allow objectStreamReset to be 0

576ee83

Update docs on objectStreamReset

18fe865

Make objectStreamReset counter count the last object written too

5d4bfb5

This makes it precise -- before we'd only reset after (reset + 1) writes

mateiz mentioned this pull request Aug 3, 2014

SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections #1707

Closed

mridulm reviewed Aug 4, 2014

asfgit closed this in 8e7d5ba Aug 4, 2014
