
[SPARK-22599][SQL] In-Memory Table Pruning without Extra Reading #19810

Closed
wants to merge 47 commits

Conversation

CodingCat
Contributor

@CodingCat CodingCat commented Nov 24, 2017

What changes were proposed in this pull request?

In the current implementation of Spark, InMemoryTableExec reads all data in a cached table, filters CachedBatches according to stats, and passes the data to the downstream operators. This makes it inefficient to keep a whole table resident in memory to serve various queries against different partitions of the table, which covers a fair portion of our users' scenarios.

design doc: https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing

The following is an example of such a use case:

store_sales is a 1TB-sized table in cloud storage, which is partitioned by 'location'. The first query, Q1, wants to output several metrics A, B, C for all stores in all locations. After that, a small team of 3 data scientists wants to do some causal analysis for the sales in different locations. To avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole table in memory in Q1.

With the current implementation, even if any one of the data scientists is only interested in one out of three locations, the queries they submit to the Spark cluster still read the full 1TB of data.

The reason behind the extra reading operation is that we implement CachedBatch as

case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)

where the stats are part of every CachedBatch, so we can only filter batches for the output of the InMemoryTableExec operator by reading all data in the in-memory table as input. The extra reading becomes even more costly when some of the table's data has been evicted to disk.
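For context, the existing batch-level pruning looks roughly like the following (a simplified sketch assuming the CachedBatch definition above, not the exact InMemoryTableScanExec code): because the stats row travels inside each CachedBatch, a batch can only be skipped after it has already been read from the cache.

import org.apache.spark.sql.catalyst.InternalRow

// Simplified sketch: `partitionFilter` stands for the generated predicate over
// the stats schema. The key point is that `cachedBatchIter` has already been
// materialized from the BlockManager before any batch can be dropped.
def pruneBatches(
    cachedBatchIter: Iterator[CachedBatch],
    partitionFilter: InternalRow => Boolean): Iterator[CachedBatch] = {
  cachedBatchIter.filter(cachedBatch => partitionFilter(cachedBatch.stats))
}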

We propose to introduce a new type of block, the metadata block, for the partitions of the RDD representing the data in the cached table. Every metadata block contains stats for all columns in a partition and is saved to the BlockManager when the compute() method for the partition is executed. To minimize the number of bytes to read, the partition filters are evaluated against these metadata blocks first, so that only the partitions whose stats pass the filters need to have their data read.
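A rough sketch of the proposed flow (hypothetical names such as fetchMetadataBlock, partitionFilter and numPartitions are used for illustration; they are not the PR's actual API):

// Hypothetical sketch: evaluate the partition filters against the stats stored
// in metadata blocks, and only read the CachedBatch blocks of partitions that
// survive pruning.
val survivingPartitions: Seq[Int] = (0 until numPartitions).filter { partitionIndex =>
  fetchMetadataBlock(partitionIndex) match {
    case Some(statsRow) => partitionFilter.eval(statsRow) // decide from stats alone
    case None           => true // stats not materialized yet: keep the partition
  }
}
// Pruned partitions never have their data blocks fetched or recomputed.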

How was this patch tested?

  1. unit tests: added 3 new unit tests

  2. performance test:

Environment: 6 executors, each with 16 cores and 90 GB of memory

dataset: 1 TB of TPC-DS data

queries: tested 4 queries (Q19, Q46, Q34, Q27) in https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala

results: https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing

@CodingCat CodingCat changed the title Partition level pruning 2 [SQL][SPARK-22599] Partition level pruning 2 Nov 24, 2017
@CodingCat CodingCat changed the title [SQL][SPARK-22599] Partition level pruning 2 [SQL][SPARK-22599] In-memory table pruning without extra reading Nov 24, 2017
@CodingCat CodingCat changed the title [SQL][SPARK-22599] In-memory table pruning without extra reading [SQL][SPARK-22599] In-Memory Table Pruning without Extra Reading Nov 24, 2017
@CodingCat CodingCat changed the title [SQL][SPARK-22599] In-Memory Table Pruning without Extra Reading [SPARK-22599][SQL] In-Memory Table Pruning without Extra Reading Nov 24, 2017
@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 24, 2017

Test build #84178 has finished for PR 19810 at commit a853ce6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2017

Test build #84181 has finished for PR 19810 at commit 9d450ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

@dongjoon-hyun
Member

Retest this please.

@cloud-fan
Contributor

Are you trying to optimize the case where the data is too large to fit in memory? Spark's RDD cache doesn't work well for that case.

@CodingCat
Contributor Author

CodingCat commented Nov 28, 2017

Hi @cloud-fan, this PR is not only for the case where the data size is larger than the memory size. Even when all data is in memory, I observed a 10-40% speedup because the implementation here

(1) reads less data

(2) starts fewer tasks

You can think of this PR as implementing the equivalent of Parquet's footer for the in-memory table.

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84240 has finished for PR 19810 at commit 9d450ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

When all data is in memory, what do you mean by reading less data? Starting fewer tasks makes sense.

@CodingCat
Contributor Author

Reading less data is an observation from the input metrics in the Spark UI, which include both local and remote reads in the BlockManagers, as well as the overhead in the BlockManager layer itself (especially when the user chooses to cache in a serialized format).

However, I didn't measure how much it contributes to the speedup (and a small portion of the data is on disk in my perf test).

@CodingCat
Contributor Author

@cloud-fan would you mind continuing the review?

Contributor

@sadikovi sadikovi left a comment


@CodingCat Thanks for working on this - looks great! I left a couple of comments, mainly to understand the code. I would appreciate it if you could have a look. Thanks!

var rowCount = 0
var totalSize = 0L

val terminateLoop = (singleBatch: Boolean, rowIter: Iterator[InternalRow],
Contributor


@CodingCat Could you explain to me what singleBatch means here? I can't get my head around it :) Thanks!

Contributor Author


To make getting partition stats easier, we construct only one CachedBatch for each partition when the functionality proposed in this PR is enabled. singleBatch distinguishes between the enabled and disabled scenarios by introducing different while-loop termination conditions, so the rest of the code stays reusable.
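Concretely, the loop condition has roughly the following shape (paraphrasing the diff snippet quoted later in this thread; batchSize and the byte cap are passed in here only to keep the sketch self-contained):

// With singleBatch enabled, only exhaustion of the row iterator stops the
// loop, so each partition produces exactly one CachedBatch whose stats
// describe the whole partition.
def shouldContinue(
    singleBatch: Boolean,
    rowIter: Iterator[InternalRow],
    rowCount: Int,
    totalSize: Long,
    batchSize: Int,      // spark.sql.inMemoryColumnarStorage.batchSize
    maxBatchBytes: Long  // ColumnBuilder.MAX_BATCH_SIZE_IN_BYTE in the diff
  ): Boolean = {
  if (!singleBatch) {
    // default path: cap each CachedBatch by row count and byte size
    rowIter.hasNext && rowCount < batchSize && totalSize < maxBatchBytes
  } else {
    // single-batch path: keep consuming until the partition is exhausted
    rowIter.hasNext
  }
}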

partitionFilters.reduceOption(And).getOrElse(Literal(true)),
partitionStatsSchema)
partitionFilter.initialize(partitionIndex)
if (!partitionFilter.eval(cachedBatch.stats)) {
Contributor


Are there any issues with discarding a partition based on statistics that could be partially computed (e.g. when the total size in bytes of a partition iterator is larger than the configurable batch size) as per https://github.com/apache/spark/pull/19810/files#diff-5fc188468d3066580ea9a766114b8f1dR74?

Would it be beneficial to record such a situation by logging it and still include such a partition when statistics are partially computed and the filters evaluate to false, or to discard all statistics when some of the partitions hit this situation? Thanks!

Contributor Author


I might not understand your proposal well... are you trying to simplify the logic in https://github.com/apache/spark/pull/19810/files#diff-5fc188468d3066580ea9a766114b8f1dR74? It would make the code simpler but degrade the pruning effect here.

Contributor


All good, no need to change. I was trying to understand the code, so my question refers to statistics collection overall, not to the changes in this PR. The link points to a condition (which exists before this PR) that could potentially result in exiting the iterator before exhausting all records in it, so statistics would be partially collected, which might affect any filtering that uses such statistics - though it is quite possibly handled later, or only a theoretical case.

Contributor Author


@sadikovi this while loop is building CachedBatches; it just decides when to seal the current CachedBatch and start the next one... so, either way, you need to go through all records in the partition.

@CodingCat
Contributor Author

@sadikovi thanks for the review, I replied in comments

@cloud-fan
Contributor

store_sales is a 1TB-sized table in cloud storage, which is partitioned by 'location'. The first query, Q1, wants to output several metrics A, B, C for all stores in all locations. After that, a small team of 3 data scientists wants to do some causal analysis for the sales in different locations. To avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole table in memory in Q1.

Reading your use case, it sounds like you are trying to optimize the case where the data is too large to fit in memory. In that case, if a partition is on disk, Spark needs to load the entire partition into memory before filtering blocks.

It sounds like something can be done better in 3rd party data sources, or we need to change the Spark core just for a better table cache, which seems risky.

@CodingCat
Contributor Author

CodingCat commented Dec 6, 2017

@cloud-fan in this case, if the data has been dumped to disk or some non-local tasks are started, I/O is involved in addition to the overhead of starting extra tasks. If all data is in memory, only the task-launching overhead remains.

It sounds like something can be done better in 3rd party data sources, or we need to change the Spark core just for a better table cache, which seems risky.

Yes, some work can be done in 3rd-party data sources, e.g. to avoid the parsing overhead in Parquet.

Regarding the risk: in the current implementation, I directly modify the core to add a new type of block and make it recognizable by the BlockManager. The new RDD and dependency implementations are in the SQL module. An alternative is to implement this new block type in SQL as well (but that needs some small refactoring to open the BlockManager up to code outside of Spark Core).

I personally think it's a good feature to add: it benefits users without any real threat to the existing code.

if (!singleBatch) {
rowIter.hasNext && rowCount < batchSize && totalSize < ColumnBuilder.MAX_BATCH_SIZE_IN_BYTE
} else {
rowIter.hasNext


doesn't this run the risk of OOM for large partitions?

case (partitionIndex, cachedBatches) =>
if (inMemoryPartitionPruningEnabled) {
cachedBatches.filter { cachedBatch =>
val partitionFilter = newPredicate(


Can this be pulled up out of usePartitionLevelMetadata? It seems like you're constructing the predicate per record.
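A sketch of the hoisted form (a fragment reusing the identifiers from the diff snippets quoted in this thread, not a drop-in patch): the predicate is built and initialized once per partition, outside the per-batch filter.

case (partitionIndex, cachedBatches) =>
  if (inMemoryPartitionPruningEnabled) {
    // Build and initialize the predicate once per partition, not once per batch.
    val partitionFilter = newPredicate(
      partitionFilters.reduceOption(And).getOrElse(Literal(true)),
      partitionStatsSchema)
    partitionFilter.initialize(partitionIndex)
    cachedBatches.filter(cachedBatch => partitionFilter.eval(cachedBatch.stats))
  } else {
    cachedBatches
  }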


private[columnar] object CachedColumnarRDD {

private val rddIdToMetadata = new ConcurrentHashMap[Int, mutable.ArraySeq[Option[InternalRow]]]()


Could these be moved to become a member of the RDD class? It seems like a map of this -> some.property, which in this case could be made an instance member.

override protected def getPartitions: Array[Partition] = dataRDD.partitions

override private[spark] def getOrCompute(split: Partition, context: TaskContext):
Iterator[CachedBatch] = {


Can this be avoided by maintaining two (zipped) RDDs, one of CachedBatches and the other holding only the stats?
Can this approach avoid the need for a specialized block type for managing metadata?
Correct me if I'm wrong, but the first time the stats are accessed, your approach performs a full scan to extract the stats (this happens in the SQL code), so having a second RDD which is something like batches.map(_.stats).persist should give the same behavior, right?
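A sketch of the zipped-RDD alternative described above (hypothetical: it assumes batches: RDD[CachedBatch] is the cached data RDD and partitionFilter is a serializable predicate over the stats row):

import org.apache.spark.storage.StorageLevel

// Keep the stats in a sibling RDD instead of introducing a new block type.
val statsRDD = batches.map(_.stats).persist(StorageLevel.MEMORY_AND_DISK)

// A cheap job over the small stats RDD decides which partitions survive pruning;
// only those partitions then need to be read from the batches RDD.
val survivingPartitions: Array[Int] = statsRDD
  .mapPartitionsWithIndex { (index, statsIter) =>
    if (statsIter.exists(partitionFilter.eval)) Iterator.single(index) else Iterator.empty
  }
  .collect()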

@maropu
Member

maropu commented Sep 19, 2018

@CodingCat Is this still active?

).getOrElse {
val batchIter = superGetOrCompute(split, context)
if (containsPartitionMetadata && getStorageLevel != StorageLevel.NONE && batchIter.hasNext) {
val cachedBatch = batchIter.next()

@eyalfa eyalfa Sep 19, 2018


Assert the post-condition !batchIter.hasNext; you expect this partition to contain a single batch.
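For illustration, the suggested check could look like this (hypothetical assertion message):

// With partition metadata enabled, the whole partition is built as one batch,
// so anything left in the iterator would indicate a bug.
assert(!batchIter.hasNext, "expected exactly one CachedBatch per partition")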

@eyalfa

eyalfa commented Sep 19, 2018

@maropu, this looks rather cold 😎, but extremely interesting and relevant.

@maropu
Member

maropu commented Sep 19, 2018

If the author is inactive, it's OK for someone else to take this over. But we should first discuss this more to get consensus on the feature. I'm not sure we would get enough performance gain to justify giving up the current simple cache logic.

@CodingCat
Contributor Author

When I contributed this back, the community was looking at something else, so I didn't spend too much time convincing people to review it... but if there is renewed interest now, I am happy to pick it up again.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97714 has finished for PR 19810 at commit 9d450ad.

  • This patch fails to generate documentation.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 14, 2020
@github-actions github-actions bot closed this Jan 15, 2020