Skip to content

[WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics#50157

Closed
zecookiez wants to merge 3 commits intoapache:masterfrom
zecookiez:SPARK-51097-fix-metric-prop
Closed

[WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics#50157
zecookiez wants to merge 3 commits intoapache:masterfrom
zecookiez:SPARK-51097-fix-metric-prop

Conversation

@zecookiez
Copy link
Copy Markdown
Contributor

@zecookiez zecookiez commented Mar 4, 2025

What changes were proposed in this pull request?

SPARK-51097

This PR aims to fix the issue of instance metrics being propagated all the way to Spark UI, even when uninitialized.

Marked as WIP since the original RocksDB PR's changes were reverted, so I'll be making a separate PR to reintroduce both changes.

Why are the changes needed?

Previously, Spark UI would show every possible instance metric that we pre-allocated for every partition. This led to the SQL tab showing a lot of metrics that are unused.

Does this PR introduce any user-facing change?

This should eliminate instance metrics being shown on Spark UI, leaving only the instance metrics visible from Streaming query progress reports.

How was this patch tested?

Tests were reverified in RocksDBStateStoreIntegrationSuite

I ran some stateful queries and checked on SparkUI to make sure these metrics aren't showing up. Will upload screenshots soon.

Was this patch authored or co-authored using generative AI tooling?

No

@zecookiez
Copy link
Copy Markdown
Contributor Author

@cloud-fan I was recommended to get your advice on this change, as this is related to #49816 and was merged in recently

/**
* @return All instance metrics of this SparkPlan.
*/
def instanceMetrics: Map[String, SQLMetric] = Map.empty
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we add this method to the base class? It looks specific to the StateStoreWriter operartor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it looks like that part wasn't needed. Reduced the scope, thanks!

protected def setStoreInstanceMetrics(
instanceMetrics: Map[StateStoreInstanceMetric, Long]): Unit = {
instanceMetrics.foreach {
newInstanceMetrics: Map[StateStoreInstanceMetric, Long]): Unit = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is because its conflicting with the var name ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a conflict with the variable names, I'll rename this to otherStoreInstanceMetrics to clear this up

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@zecookiez zecookiez changed the title [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics [WIP] [SPARK-51097] [SS] Split apart SparkPlan metrics and instance metrics Mar 6, 2025
@zecookiez zecookiez closed this Mar 12, 2025
HeartSaVioR added a commit that referenced this pull request Mar 14, 2025
…apshot version instance metrics

### What changes were proposed in this pull request?

SPARK-51097

#50161 recently had to revert the changes in #49816 due to instance metrics showing up on SparkUI, causing excessive clutter. This PR aims to re-introduce the new instance metrics while incorporating the fix in #50157.

The main difference between this and the original PR are:
- Distinguishing metrics and instance metrics as a whole, so that SparkPlan does not accidentally pick up hundreds of instance metrics
  - Because of this change, some of the logic behind processing instance metrics were simplified so I cleaned those up. More detail in the section below.
- Adding a test to the suite to verify that no nodes in the execution plan contain the instance metrics we just introduced. Since SparkPlan nodes store metrics as strings, distinguishing between metric types isn't possible in this context. Thus, this test will only check for the specific metric we just introduced.
- Small edit to another test to verify instance metric behavior as well (see [here](https://github.com/apache/spark/pull/50195/files#diff-c7a07d8e111bfd75e5579874c6be9ed5afcbfae7411d663352dbbbb51427b084R127))

Line-by-line:
- [statefulOperators.scala line 206-238](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R206-R238) is one of the main changes. Instead of adding instance metrics to all metrics, we lazy initialize the instance metrics alongside but keep the two SQLMetric maps separate. The instance metrics map also changed its mapping to use the metric objects instead, since we want to hold the configuration information as well and don't need an additional map for it.
- [statefulOperators.scala line 349-379](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R349-R379) to use the new instance metric map instead of switching between string metric names and metric objects.
- [statefulOperators.scala line 474-486](https://github.com/apache/spark/pull/50195/files#diff-da6ad0bc819dce994a16436fa0797bfc8484644b475227a04c2a3eb5927515f7R474-R486) to initialize all metrics in `stateStoreInstanceMetrics` and remove the `stateStoreInstanceMetricObjects` method as we no longer needed to use separate map indexing for instance metrics.
- [RocksDBStateStoreIntegrationSuite line 523-561](https://github.com/apache/spark/pull/50195/files#diff-c7a07d8e111bfd75e5579874c6be9ed5afcbfae7411d663352dbbbb51427b084R523-R561) to verify instance metrics don't appear in SparkPlan
- Small edit to a pre-existing test in RocksDBStateStoreIntegrationSuite line 127 to instead verify instance metrics still show up in progress reports

Before the fix (note that metrics are sorted lexicographically):
![image](https://github.com/user-attachments/assets/f6ed3e44-7a38-403f-bb14-8e30ad835498)

After including the fix:
![image](https://github.com/user-attachments/assets/12e16b39-094b-4e29-9fe5-15a1b4c4fac2)

### Why are the changes needed?

From #49816:
There's currently a lack of observability into state store specific maintenance information, notably metrics of the last snapshot version uploaded. This affects the ability to identify performance degradation issues behind maintenance tasks and more as described in [SPARK-51097](https://issues.apache.org/jira/browse/SPARK-51097).

### Does this PR introduce _any_ user-facing change?

From #49816:
There will be some new metrics displayed from StreamingQueryProgress:
```
Streaming query made progress: {
  ...
  "stateOperators" : [ {
    ...
    "customMetrics" : {
      ...
      "SnapshotLastUploaded.partition_0_default" : 2,
      "SnapshotLastUploaded.partition_12_default" : 10,
      "SnapshotLastUploaded.partition_8_default" : 10,
      ...
    }
  } ],
  "sources" : ...,
  "sink" : ...
}
```

### How was this patch tested?

Five tests are added to RocksDBStateStoreIntegrationSuite:
- The first four are identical to the tests from #49816, which verify metrics are properly updating using custom faulty stores and different types of stateful queries (deduplicate and join).
- The fifth test goes through the generated execution plan for stateful queries to make sure no instance metric shows up.

I additionally manually verified these metrics on SparkUI (refer to screenshots above with and without the fix)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50195 from zecookiez/SPARK-51097-rocksdb-with-fix.

Lead-authored-by: Zeyu Chen <zycm03@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants