[SPARK-47804] Add Dataframe cache debug log #45990
@@ -204,6 +215,8 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
cachedData = cachedData.filterNot(cd => plansToUncache.exists(_ eq cd))
}
plansToUncache.foreach { _.cachedRepresentation.cacheBuilder.clearCache(blocking) }
CacheManager.logCacheOperation(s"Removed ${plansToUncache.size} Dataframe cache " +
@@ -1609,6 +1609,19 @@ object SQLConf {
.checkValues(StorageLevelMapper.values.map(_.name()).toSet)
.createWithDefault(StorageLevelMapper.MEMORY_AND_DISK.name())

val DATAFRAME_CACHE_LOG_LEVEL = buildConf("spark.sql.dataframeCache.logLevel")
Will we need to debug cache table as well? Shall we rename the config to spark.sql.cache.logLevel?
Also, let's make it an internal conf since it is for developers.
I kept the Dataframe cache naming to differentiate it from the RDD cache.
RDD is a Spark core concept. Anyway I respect your choice here.
Thanks, merging to master
What changes were proposed in this pull request?
This PR adds a debug log for the Dataframe cache that is turned on through a SQL conf. It logs necessary information on
Because every query applies the cache, this log could be huge. It should be turned on only while debugging and should not be enabled by default in production.
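To illustrate the idea of a conf-selected log level, here is a minimal, Spark-free sketch. It is not the code from this PR: the object name `CacheLogSketch`, the `emitted` buffer, the `loggerThreshold` parameter, and the set of accepted level names are all assumptions made for illustration. Only the helper name `logCacheOperation` and the conf key mentioned in the comment come from the diff above.

```scala
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch only; this is NOT Spark's CacheManager implementation.
object CacheLogSketch {
  // Assumed level names, ordered from least to most severe.
  private val levels = Vector("TRACE", "DEBUG", "INFO", "WARN", "ERROR")

  // Stand-in for a logging backend: records what would have been printed.
  val emitted: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Emits `msg` at `confLevel`, the kind of value a conf such as
  // spark.sql.dataframeCache.logLevel might carry. `loggerThreshold` is an
  // assumed stand-in for the logger's own minimum level. The by-name `msg`
  // is only evaluated when the message is actually recorded.
  def logCacheOperation(confLevel: String, loggerThreshold: String)(msg: => String): Unit = {
    val level = confLevel.toUpperCase(Locale.ROOT)
    val threshold = loggerThreshold.toUpperCase(Locale.ROOT)
    require(levels.contains(level), s"unknown log level: $confLevel")
    require(levels.contains(threshold), s"unknown log level: $loggerThreshold")
    if (levels.indexOf(level) >= levels.indexOf(threshold)) {
      emitted += s"[$level] $msg"
    }
  }
}
```

In this sketch, calling `CacheLogSketch.logCacheOperation("trace", "INFO")(...)` records nothing, while the same call with `"warn"` records the message. That mirrors the intent described above: with a low level the cache messages stay hidden, and raising the level makes them visible only for the duration of a debugging session.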
Example:
Why are the changes needed?
Easier debugging.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Ran a local Spark shell.
Was this patch authored or co-authored using generative AI tooling?
No.