[WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics #50157
Closed
zecookiez wants to merge 3 commits into apache:master
Conversation
Contributor
Author
@cloud-fan I was recommended to get your advice on this change, as it is related to #49816, which was merged in recently.
cloud-fan reviewed on Mar 5, 2025
```scala
/**
 * @return All instance metrics of this SparkPlan.
 */
def instanceMetrics: Map[String, SQLMetric] = Map.empty
```
Contributor
Why do we add this method to the base class? It looks specific to the StateStoreWriter operator.
Contributor
Author
You're right, it looks like that part wasn't needed. Reduced the scope, thanks!
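The scoping change agreed on above can be sketched with minimal stand-in types (`SQLMetric` here is a simplified placeholder, not Spark's real `org.apache.spark.sql.execution.metric.SQLMetric`): instance metrics live on the stateful-operator trait, not the SparkPlan base class, so generic plan nodes expose none of them.

```scala
// Minimal stand-ins for illustration only; these are not Spark's real classes.
case class SQLMetric(desc: String, var value: Long = -1L)

abstract class SparkPlan {
  // Regular per-node metrics: this map is what feeds the SQL tab on the UI.
  def metrics: Map[String, SQLMetric] = Map.empty
}

// Per-store instance metrics are specific to stateful writers, so after the
// review comment they live on the operator trait rather than the base class.
trait StateStoreWriter extends SparkPlan {
  def instanceMetrics: Map[String, SQLMetric] =
    Map("SnapshotLastUploaded.partition_0_default" ->
      SQLMetric("last uploaded snapshot version for partition 0"))
}
```

A generic plan node then carries no instance metrics at all, while a stateful writer still does.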
ericm-db reviewed on Mar 5, 2025
```diff
 protected def setStoreInstanceMetrics(
-    instanceMetrics: Map[StateStoreInstanceMetric, Long]): Unit = {
+    newInstanceMetrics: Map[StateStoreInstanceMetric, Long]): Unit = {
   instanceMetrics.foreach {
```
Contributor
I guess this is because it's conflicting with the var name?
Contributor
Author
Yeah, it's a conflict with the variable names. I'll rename this to `otherStoreInstanceMetrics` to clear this up.
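The shadowing problem behind this rename can be illustrated with a simplified stand-in (names and merge policy here are illustrative, not Spark's actual code):

```scala
import scala.collection.mutable

// Simplified stand-in class; not Spark's real StateStoreWriter.
class StateStoreWriterLike {
  // Member map that the setter below must read and update.
  val instanceMetrics: mutable.Map[String, Long] = mutable.Map.empty

  // Had this parameter also been called `instanceMetrics`, it would shadow
  // the member above inside the method body; renaming keeps both reachable.
  def setStoreInstanceMetrics(newInstanceMetrics: Map[String, Long]): Unit = {
    newInstanceMetrics.foreach { case (name, value) =>
      // As an example policy, keep the largest value seen per metric.
      val prev = instanceMetrics.getOrElse(name, Long.MinValue)
      instanceMetrics(name) = math.max(prev, value)
    }
  }
}
```

With the rename, `instanceMetrics.foreach` inside the body unambiguously refers to the member, not the argument.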
HeartSaVioR added a commit that referenced this pull request on Mar 14, 2025
…apshot version instance metrics

### What changes were proposed in this pull request?

SPARK-51097

#50161 recently had to revert the changes in #49816 due to instance metrics showing up on the Spark UI, causing excessive clutter. This PR aims to reintroduce the new instance metrics while incorporating the fix in #50157. The main differences between this and the original PR are:

- Distinguishing metrics and instance metrics as a whole, so that SparkPlan does not accidentally pick up hundreds of instance metrics.
- Because of this change, some of the logic behind processing instance metrics was simplified, so those parts were cleaned up. More detail in the section below.
- Adding a test to the suite to verify that no nodes in the execution plan contain the instance metrics we just introduced. Since SparkPlan nodes store metrics as strings, distinguishing between metric types isn't possible in this context, so this test only checks for the specific metric we just introduced.
- A small edit to another test to verify instance metric behavior as well (see [here](https://github.com/apache/spark/pull/50195/files#diff-c7a07d8e111bfd75e5579874c6be9ed5afcbfae7411d663352dbbbb51427b084R127)).

Line-by-line:

- [statefulOperators.scala lines 206-238](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R206-R238) is one of the main changes. Instead of adding instance metrics to all metrics, we lazily initialize the instance metrics alongside but keep the two SQLMetric maps separate. The instance metrics map also changed its mapping to use the metric objects instead, since we want to hold the configuration information as well and don't need an additional map for it.
- [statefulOperators.scala lines 349-379](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R349-R379) to use the new instance metric map instead of switching between string metric names and metric objects.
- [statefulOperators.scala lines 474-486](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R474-R486) to initialize all metrics in `stateStoreInstanceMetrics` and remove the `stateStoreInstanceMetricObjects` method, as separate map indexing for instance metrics is no longer needed.
- [RocksDBStateStoreIntegrationSuite lines 523-561](https://github.com/apache/spark/pull/50195/files#diff-c7a07d8e111bfd75e5579874c6be9ed5afcbfae7411d663352dbbbb51427b084R523-R561) to verify instance metrics don't appear in SparkPlan.
- A small edit to a pre-existing test at RocksDBStateStoreIntegrationSuite line 127 to instead verify instance metrics still show up in progress reports.

Before the fix (note that metrics are sorted lexicographically):

After including the fix:

### Why are the changes needed?

From #49816: there is currently a lack of observability into state store specific maintenance information, notably metrics of the last snapshot version uploaded. This affects the ability to identify performance degradation issues behind maintenance tasks and more, as described in [SPARK-51097](https://issues.apache.org/jira/browse/SPARK-51097).

### Does this PR introduce _any_ user-facing change?

From #49816: there will be some new metrics displayed from StreamingQueryProgress:

```
Streaming query made progress: {
  ...
  "stateOperators" : [ {
    ...
    "customMetrics" : {
      ...
      "SnapshotLastUploaded.partition_0_default" : 2,
      "SnapshotLastUploaded.partition_12_default" : 10,
      "SnapshotLastUploaded.partition_8_default" : 10,
      ...
    }
  } ],
  "sources" : ...,
  "sink" : ...
}
```

### How was this patch tested?

Five tests are added to RocksDBStateStoreIntegrationSuite:

- The first four are identical to the tests from #49816, which verify metrics are properly updated using custom faulty stores and different types of stateful queries (deduplicate and join).
- The fifth test goes through the generated execution plan for stateful queries to make sure no instance metric shows up.

I additionally manually verified these metrics on the Spark UI (refer to the screenshots above, with and without the fix).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50195 from zecookiez/SPARK-51097-rocksdb-with-fix.

Lead-authored-by: Zeyu Chen <zycm03@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
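The core idea described in the commit message, keeping instance metrics in a lazily initialized map separate from the regular metrics map so that only the latter feeds the UI, can be sketched like this (simplified stand-ins, not the real statefulOperators.scala code):

```scala
// Simplified stand-ins for illustration; not Spark's actual classes.
case class SQLMetric(desc: String, var value: Long = -1L)
case class StateStoreInstanceMetric(name: String) // also carries config in Spark

class StatefulOperatorSketch(numPartitions: Int) {
  // Regular operator metrics: the only map a SparkPlan node reports to the UI.
  lazy val metrics: Map[String, SQLMetric] =
    Map("numRowsUpdated" -> SQLMetric("number of updated state rows"))

  // Instance metrics are initialized lazily alongside, but kept in their own
  // map, keyed by the metric object so its configuration travels with it.
  lazy val instanceMetrics: Map[StateStoreInstanceMetric, SQLMetric] =
    (0 until numPartitions).map { p =>
      val m = StateStoreInstanceMetric(s"SnapshotLastUploaded.partition_${p}_default")
      m -> SQLMetric(m.name)
    }.toMap
}
```

Because `metrics` never includes the per-partition entries, a plan-walking test like the one added to RocksDBStateStoreIntegrationSuite can assert that no node exposes a metric name starting with `SnapshotLastUploaded`.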
What changes were proposed in this pull request?
SPARK-51097
This PR aims to fix the issue of instance metrics being propagated all the way to the Spark UI, even when uninitialized.
Marked as WIP since the original RocksDB PR's changes were reverted, so I'll be making a separate PR to reintroduce both changes.
Why are the changes needed?
Previously, the Spark UI would show every possible instance metric that we pre-allocated for every partition, cluttering the SQL tab with many unused metrics.
Does this PR introduce any user-facing change?
This should eliminate instance metrics being shown on Spark UI, leaving only the instance metrics visible from Streaming query progress reports.
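The instance metric names that remain in progress reports (as in the StreamingQueryProgress sample earlier) follow a `SnapshotLastUploaded.partition_<id>_<storeName>` pattern. A hypothetical formatter sketching that pattern (the real name construction in Spark may differ):

```scala
// Hypothetical helper matching the names seen in the progress-report
// sample; illustrative only, not Spark's actual builder.
def instanceMetricName(partitionId: Int, storeName: String): String =
  s"SnapshotLastUploaded.partition_${partitionId}_$storeName"

// e.g. instanceMetricName(0, "default") == "SnapshotLastUploaded.partition_0_default"
```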
How was this patch tested?
Tests were reverified in RocksDBStateStoreIntegrationSuite.
I ran some stateful queries and checked the Spark UI to make sure these metrics aren't showing up. Will upload screenshots soon.
Was this patch authored or co-authored using generative AI tooling?
No.