[SPARK-22673][SQL] InMemoryRelation should utilize existing stats whenever possible #19864
Changes from 5 commits
```diff
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
+        inMemoryRelation.setStatsFromCachedPlan(planToCache)
+      }
+      cachedData.add(CachedData(planToCache, inMemoryRelation))
     }
   }
```

Review comments on this hunk:

- I have to make `InMemoryRelation` stateful to avoid breaking APIs.
- I think a more ideal change is to put the original plan stats into the constructor of `InMemoryRelation`.
- Looks like I have no way to access `InMemoryRelation` from outside of the spark package, though it is not a package-private class... how is that achieved? If this is the case, I can modify the constructor. Thanks @cloud-fan
- +1
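The conditional stats inheritance above can be modeled in plain Scala. This is a toy sketch, not Spark code: `ToyRelation`, `ToyCacheManager`, and the simplified `Statistics` case class are hypothetical stand-ins for the real classes, but the guard mirrors the PR's `cboEnabled && rowCount.isDefined` check.

```scala
// Toy model of the PR's caching flow: build the relation first, then copy the
// source plan's statistics into it only when they are trustworthy.
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

class ToyRelation(defaultSize: BigInt) {
  private var inherited: Option[Statistics] = None

  def setStatsFromCachedPlan(stats: Statistics): Unit = inherited = Some(stats)

  // Before materialization (size == 0) fall back to inherited stats, if any;
  // otherwise report the measured in-memory size.
  def computeStats(materializedSize: Long): Statistics =
    if (materializedSize == 0L) inherited.getOrElse(Statistics(defaultSize))
    else Statistics(BigInt(materializedSize))
}

object ToyCacheManager {
  def cache(planStats: Statistics, cboEnabled: Boolean, defaultSize: BigInt): ToyRelation = {
    val relation = new ToyRelation(defaultSize)
    // Mirror of the PR's guard: only inherit when CBO is on and a row count exists.
    if (cboEnabled && planStats.rowCount.isDefined) {
      relation.setStatsFromCachedPlan(planStats)
    }
    relation
  }
}
```

With CBO stats available, an unmaterialized relation reports the inherited size instead of the conservative default; once materialized, the measured size wins either way.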
```diff
@@ -25,13 +25,15 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical
-import org.apache.spark.sql.catalyst.plans.logical.Statistics
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics}
 import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.datasources.LogicalRelation
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.LongAccumulator

 object InMemoryRelation {

   def apply(
       useCompression: Boolean,
       batchSize: Int,
```

Review comments on this hunk:

- On the added `LogicalRelation` import: Unused import.
- On a whitespace-only change near `object InMemoryRelation`: Unnecessary change.
```diff
@@ -71,14 +73,20 @@ case class InMemoryRelation(

   override def computeStats(): Statistics = {
     if (batchStats.value == 0L) {
       // Underlying columnar RDD hasn't been materialized, no useful statistics information
       // available, return the default statistics.
-      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
+      inheritedStats.getOrElse(Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes))
     } else {
       Statistics(sizeInBytes = batchStats.value.longValue)
     }
   }

+  private var inheritedStats: Option[Statistics] = _
+
+  private[execution] def setStatsFromCachedPlan(planToCache: LogicalPlan): Unit = {
+    require(planToCache.conf.cboEnabled, "you cannot use the stats of cached plan in" +
+      " InMemoryRelation without cbo enabled")
+    inheritedStats = Some(planToCache.stats)
+  }
+
   // If the cached column buffers were not passed in, we calculate them in the constructor.
   // As in Spark, the actual work of caching is lazy.
   if (_cachedColumnBuffers == null) {
```
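One reviewer suggested passing the original plan's stats through the constructor instead of making `InMemoryRelation` stateful. A plain-Scala sketch of that alternative (all names here are hypothetical, not Spark's):

```scala
// Hypothetical immutable variant: the statistics of the plan being cached are
// supplied at construction time, so no mutable field or setter is needed.
case class Stats(sizeInBytes: BigInt)

case class ImmutableRelation(
    defaultSize: BigInt,
    statsOfPlanToCache: Option[Stats] = None) {

  // Same fallback as the patched computeStats, but without mutation:
  // unmaterialized => inherited stats or default; materialized => measured size.
  def computeStats(batchStatsValue: Long): Stats =
    if (batchStatsValue == 0L) statsOfPlanToCache.getOrElse(Stats(defaultSize))
    else Stats(BigInt(batchStatsValue))
}
```

Being a `case class` field rather than a `private var`, the inherited stats would also survive `copy`, which a mutable field set after construction does not.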
Discussion on the `cboEnabled && rowCount.isDefined` condition:

- Do we need to limit to those conditions? I think we can pass the stats into the created `InMemoryRelation` even if the two conditions don't match.

- The reason I put it here is that when we did not enable CBO, the stats in the underlying plan might be much smaller than the actual size in memory, leading to a potential risk of an OOM error. The underlying cause is that without CBO enabled, the size of the plan is calculated with `BaseRelation`'s `sizeInBytes`, but with CBO we can have a more accurate estimation. (See `spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala`, lines 42 to 46 in 03fdc92, and `spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala`, lines 370 to 381 in 03fdc92.)

- When CBO is disabled, don't we just set the `sizeInBytes` to `defaultSizeInBytes`? Is it different from the current statistics of the first run?

- No, if CBO is disabled, the relation's `sizeInBytes` is the file size. (See `spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala`, line 85 in 5c3a1f3.)

- `LogicalRelation` uses the statistics from the `relation` only when there is no given `catalogTable`. In this case, it doesn't consider whether CBO is enabled or not. Only `catalogTable` considers CBO when computing its statistics in `toPlanStats`. It doesn't refer to the relation's statistics, IIUC.

- If a catalog table doesn't have statistics in its metadata, we will fill it with `defaultSizeInBytes`. (See `spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala`, lines 121 to 134 in 326f1d6.)

- @viirya you're right! Thanks for clearing up the confusion. However, to prevent using the relation's stats, which can be much smaller than the in-memory size and lead to a potential OOM error, we should still have this condition here (we can remove `cboEnabled`, though), right?

- The statistics from the relation are based on file size; will that easily cause an OOM issue? I think in cases other than cached queries we still use the relation's statistics. If this is an issue, doesn't it also affect the other cases?

- That's true, it affects them I believe... there is a similar discussion in #19743.
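For context, the new code path only triggers when CBO is enabled and the plan's row count is defined. A usage sketch of how a user would arrange that (assuming a live `SparkSession` named `spark` and an existing table `t`; this is a fragment, not runnable standalone):

```scala
// Enable cost-based optimization so plan statistics carry row counts.
spark.conf.set("spark.sql.cbo.enabled", "true")

// Collect table-level statistics so rowCount is defined on the logical plan.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")

// Caching now lets InMemoryRelation inherit the analyzed stats until the
// columnar buffers are actually materialized.
spark.table("t").cache()
```

Without the `ANALYZE TABLE` step, `rowCount` stays undefined and the relation falls back to `defaultSizeInBytes` as before.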