
[SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore #11791

Closed
wants to merge 13 commits

Conversation

JoshRosen
Contributor

When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code unrolls the entire iterator as Java objects in memory, then serializes an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements.

Instead, I think that we should incrementally serialize blocks while unrolling them.

A downside of incremental serialization is that we will need to deserialize the partially-unrolled data if there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory-efficiency improvements will outweigh any performance loss from the extra serialization in that hopefully-rare case.
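The idea can be sketched as follows. This is a minimal, hypothetical sketch using plain Java serialization into a byte buffer; the real patch goes through Spark's serializer and a chunked output stream, and `serializeIncrementally` / `memoryLimit` are illustrative names, not the actual API:

```scala
// Hypothetical sketch of incremental serialization during unroll. Plain Java
// serialization into a byte buffer stands in for Spark's serializer and
// chunked output stream; serializeIncrementally / memoryLimit are made-up names.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def serializeIncrementally(values: Iterator[AnyRef], memoryLimit: Long): Option[Array[Byte]] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  var keepUnrolling = true
  while (values.hasNext && keepUnrolling) {
    out.writeObject(values.next())
    out.flush()
    // In the real code this is a periodic check that requests more unroll
    // memory from the MemoryManager; here we just compare against a fixed cap.
    if (bytes.size() > memoryLimit) {
      keepUnrolling = false
    }
  }
  out.close()
  // On failure the caller must fall back (e.g. deserialize the partial data
  // or drop the block to disk), which is the downside discussed above.
  if (keepUnrolling) Some(bytes.toByteArray) else None
}
```

The key property is that only serialized bytes are retained while the iterator is consumed, so the peak footprint is roughly the serialized size rather than serialized plus deserialized.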


SparkQA commented Mar 17, 2016

Test build #53456 has finished for PR 11791 at commit 7dc3623.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 18, 2016

Test build #53496 has finished for PR 11791 at commit 5489748.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

/cc @rxin @andrewor14, this is the next most important patch to review on the way towards off-heap caching. After these changes get in, we'll be able to use off-heap memory for the unroll memory in off-heap caching, greatly simplifying things. Without this change, the on-heap unroll array needs to be accounted for properly even if the final cache destination is off-heap, making caching more OOM-prone and complicating the accounting logic (since it then differs between the two modes).

@rxin
Contributor

rxin commented Mar 19, 2016

Still WIP?

@JoshRosen JoshRosen changed the title [SPARK-13980][WIP] Incrementally serialize blocks while unrolling them in MemoryStore [SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore Mar 21, 2016
@JoshRosen
Contributor Author

This is no longer WIP and should be ready for review now.

@@ -129,10 +136,9 @@ private[spark] class MemoryStore(
* iterator or call `close()` on it in order to free the storage memory consumed by the
* partially-unrolled block.
*/
-  private[storage] def putIterator(
+  private[storage] def putIteratorAsValues(
Contributor Author

In case it isn't obvious from the diff, the main change in this file is to split putIterator into two separate methods, putIteratorAsValues and putIteratorAsBytes.

It's possible that there's some opportunity to reduce code duplication here, but unless we can come up with an obvious and simple approach I would prefer to defer that cleanup to follow-up patches.
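Schematically, the split looks something like the sketch below. Note this is a simplification: the `Either[Iterator[T], Long]` result type here is illustrative only; in the actual patch the failure side carries richer state (partially-unrolled values or partially-serialized bytes) so the caller can recover when the block doesn't fit.

```scala
import scala.reflect.ClassTag

// Sketch of the split interface. Either[Iterator[T], Long] is a stand-in for
// the real result types, which wrap partially-unrolled / partially-serialized
// state so callers can recover on failure.
trait MemoryStoreSketch {
  type BlockId
  // Old putIterator() path: unroll as deserialized Java objects.
  def putIteratorAsValues[T](blockId: BlockId, values: Iterator[T]): Either[Iterator[T], Long]
  // New path: serialize records into bytes while unrolling.
  def putIteratorAsBytes[T](blockId: BlockId, values: Iterator[T],
      classTag: ClassTag[T]): Either[Iterator[T], Long]
}
```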


SparkQA commented Mar 21, 2016

Test build #53700 has finished for PR 11791 at commit a336c17.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.


SparkQA commented Mar 21, 2016

Test build #53704 has finished for PR 11791 at commit a336c17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (keepUnrolling) {
  unrollMemoryUsedByThisBlock += amountToRequest
}
unrollMemoryUsedByThisBlock += amountToRequest
Contributor

I don't understand why you add this twice in some cases.

Contributor Author

Ah, I think this case is a mistake which might have been introduced while repairing a merge conflict. We should only increment this if keepUnrolling == true.

Contributor Author

This has now been fixed.
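For reference, the fixed pattern only counts the requested memory when the acquisition succeeds. A minimal stand-alone sketch (`UnrollAccounting` and `acquireUnrollMemory` are stand-ins for the real MemoryStore/MemoryManager interaction, not Spark classes):

```scala
// Stand-alone sketch of the corrected accounting: unrollMemoryUsedByThisBlock
// is bumped only when the memory request succeeds. UnrollAccounting and
// acquireUnrollMemory are illustrative stand-ins.
class UnrollAccounting(private var freeMemory: Long) {
  var unrollMemoryUsedByThisBlock: Long = 0L

  private def acquireUnrollMemory(bytes: Long): Boolean =
    if (bytes <= freeMemory) { freeMemory -= bytes; true } else false

  def requestMore(amountToRequest: Long): Boolean = {
    val keepUnrolling = acquireUnrollMemory(amountToRequest)
    if (keepUnrolling) {
      // Incrementing unconditionally (the bug discussed above) would over-count
      // the memory owned by this block and leak unroll memory on release.
      unrollMemoryUsedByThisBlock += amountToRequest
    }
    keepUnrolling
  }
}
```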

@rxin
Contributor

rxin commented Mar 22, 2016

cc @sameeragarwal


SparkQA commented Mar 22, 2016

Test build #53742 has finished for PR 11791 at commit 4976b74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 22, 2016

Test build #53741 has finished for PR 11791 at commit cec1f02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Doing a bit of refactoring now in order to make it easier to write proper unit tests for this. Therefore I'd hold off on the final review pass here for a little bit and review my SPARK-3000 patch instead.


SparkQA commented Mar 23, 2016

Test build #53957 has finished for PR 11791 at commit 768a8d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class NaiveBayesModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class KMeansModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class Binarizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class CountVectorizerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class DCT(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class ElementwiseProduct(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable,
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, JavaMLReadable,
    • class IDF(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class IDFModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class MaxAbsScalerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class MinMaxScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class MinMaxScalerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class NGram(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class Normalizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable,
    • class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, HasSeed, JavaMLReadable,
    • class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class SQLTransformer(JavaTransformer, JavaMLReadable, JavaMLWritable):
    • class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class StandardScalerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
    • class StringIndexerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class IndexToString(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class VectorAssembler(JavaTransformer, HasInputCols, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class VectorIndexerModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class Word2VecModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class PCA(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class PCAModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaMLWritable):
    • class RFormulaModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, JavaMLReadable,
    • class ChiSqSelectorModel(JavaModel, JavaMLReadable, JavaMLWritable):
    • class PipelineMLWriter(JavaMLWriter):
    • class Pipeline(Estimator, MLReadable, MLWritable):
    • class PipelineModelMLWriter(JavaMLWriter):
    • class PipelineModel(Model, MLReadable, MLWritable):
    • class ALSModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class LinearRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class IsotonicRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class AFTSurvivalRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):
    • class MLWriter(object):
    • class JavaMLWriter(MLWriter):
    • class JavaMLWritable(MLWritable):
    • class MLReader(object):
    • class JavaMLReader(MLReader):
    • class JavaMLReadable(MLReadable):
    • implicit class StringToColumn(val sc: StringContext)
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return
    • class HDFSMetadataLog[T: ClassTag](sqlContext: SQLContext, path: String)
    • class StreamProgress(

@JoshRosen
Contributor Author

Alright, just added a few more tests to MemoryStoreSuite to bump up the coverage of putIteratorAsBytes() and fixed a problem leading to leaked unroll memory in PartiallySerializedResult.finishWritingToStream.


SparkQA commented Mar 24, 2016

Test build #54057 has finished for PR 11791 at commit 749df73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.


SparkQA commented Mar 24, 2016

Test build #54070 has finished for PR 11791 at commit 749df73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Make sure that we have enough memory to store the block. By this point, it is possible that
// the block's actual memory usage has exceeded the unroll memory by a small amount, so we
// perform one final call to attempt to allocate additional memory if necessary.
Contributor

This is because of the call to close? That can use more memory?

Contributor Author

Yes.
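In other words, close() can flush buffered serializer state and grow the serialized size slightly past the unroll memory reserved during the loop, so one last top-up request covers the shortfall. A hedged sketch of that reconciliation (`reserveFinalMemory` and `tryAcquire` are illustrative names, not the actual Spark API):

```scala
// Sketch of the final reconciliation after close(): the flushed serializer
// output may be slightly larger than the unroll memory reserved so far, so we
// request just the shortfall. Names are illustrative.
def reserveFinalMemory(actualSize: Long, reservedSoFar: Long,
    tryAcquire: Long => Boolean): Boolean = {
  if (actualSize <= reservedSoFar) {
    true // the unroll memory acquired while unrolling already covers the block
  } else {
    tryAcquire(actualSize - reservedSoFar) // top up by the difference only
  }
}
```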

@JoshRosen
Contributor Author

Jenkins, retest this please.

@nongli
Contributor

nongli commented Mar 24, 2016

LGTM

} else {
iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
}
}
Contributor

About my previous comment on duplicate code: never mind, it can't actually be abstracted cleanly.

@andrewor14
Contributor

Looks good.


SparkQA commented Mar 24, 2016

Test build #54094 has finished for PR 11791 at commit 749df73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Merging to master.

@asfgit asfgit closed this in fdd460f Mar 25, 2016
@JoshRosen JoshRosen deleted the serialize-incrementally branch August 29, 2016 19:20