[SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units#33116

Closed
xkrogen wants to merge 5 commits into apache:master from xkrogen:xkrogen-SPARK-35259-ess-fix-metric-unit-prefix

Contributor

xkrogen commented Jun 28, 2021

What changes were proposed in this pull request?

ExternalBlockHandler exposes 4 metrics which are Dropwizard Timer metrics, and are named with a millis suffix:

    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
    private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
    private final Timer finalizeShuffleMergeLatencyMillis = new Timer();

However, these Dropwizard Timers use nanoseconds by default (documentation: https://metrics.dropwizard.io/3.2.3/getting-started.html#timers).

This causes YarnShuffleServiceMetrics to expose confusingly-named metrics like openBlockRequestLatencyMillis_nanos_max (the actual values are currently in nanos).

This PR adds a new Timer subclass, TimerWithCustomTimeUnit, which accepts a TimeUnit at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The Timer metrics within ExternalBlockHandler are updated to use the new class with milliseconds as the unit. The logic to include the nanos suffix in the metric name within YarnShuffleServiceMetrics has also been removed, with the assumption that the metric name itself includes the units.
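
To make the idea concrete, here is a minimal, dependency-free sketch of storing durations in nanoseconds while reporting them in a unit chosen at construction time. It is not the TimerWithCustomTimeUnit added by this PR and does not use the Dropwizard API; the class name UnitAwareDurations and its methods are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in, not Spark's TimerWithCustomTimeUnit: records durations
// in nanoseconds and converts only when values are read.
class UnitAwareDurations {
  private final TimeUnit reportUnit;
  private final double nanosPerUnit;
  private final List<Long> samplesNanos = new ArrayList<>();

  UnitAwareDurations(TimeUnit reportUnit) {
    this.reportUnit = reportUnit;
    this.nanosPerUnit = reportUnit.toNanos(1);
  }

  // Accept a duration in any unit, but store it with nanosecond precision.
  void update(long duration, TimeUnit unit) {
    samplesNanos.add(unit.toNanos(duration));
  }

  // The count is unitless, so it is unaffected by the reporting unit.
  long getCount() {
    return samplesNanos.size();
  }

  // Largest recorded duration, expressed in the reporting unit.
  double getMax() {
    long maxNanos = 0L;
    for (long sample : samplesNanos) {
      maxNanos = Math.max(maxNanos, sample);
    }
    return maxNanos / nanosPerUnit;
  }

  // Arithmetic mean of the recorded durations, expressed in the reporting unit.
  double getMean() {
    if (samplesNanos.isEmpty()) {
      return 0.0;
    }
    double sumNanos = 0.0;
    for (long sample : samplesNanos) {
      sumNanos += sample;
    }
    return (sumNanos / samplesNanos.size()) / nanosPerUnit;
  }

  TimeUnit getReportUnit() {
    return reportUnit;
  }
}
```

Converting at read time keeps sub-millisecond samples intact: a 1.5 ms request still contributes 1,500,000 ns to the mean instead of being rounded when it is recorded.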

Does this PR introduce any user-facing change?

Yes, there are two changes.
First, the names for metrics exposed by ExternalBlockHandler via YarnShuffleServiceMetrics such as openBlockRequestLatencyMillis_nanos_max and openBlockRequestLatencyMillis_nanos_50thPercentile have been changed to remove the _nanos suffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are like openBlockRequestLatencyMillis_max and openBlockRequestLatencyMillis_50thPercentile.
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as openBlockRequestLatencyMillis_count or openBlockRequestLatencyMillis_rate1, only the Snapshot-related metrics (max, median, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.
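
As a rough illustration of the renaming, the sketch below builds snapshot metric names with and without the _nanos infix. The helper names are hypothetical and this is not the code in YarnShuffleServiceMetrics; only the resulting name strings are taken from the description above.

```java
// Hypothetical helpers; only the output strings mirror names quoted in this PR.
class MetricNames {
  // Name shape before this PR: <timer>_nanos_<stat>
  static String legacySnapshotName(String timerName, String stat) {
    return timerName + "_nanos_" + stat;
  }

  // Name shape after this PR: <timer>_<stat>; the unit is carried by the timer name.
  static String snapshotName(String timerName, String stat) {
    return timerName + "_" + stat;
  }

  public static void main(String[] args) {
    String timer = "openBlockRequestLatencyMillis";
    System.out.println(legacySnapshotName(timer, "max"));
    System.out.println(snapshotName(timer, "max"));
    System.out.println(snapshotName(timer, "50thPercentile"));
  }
}
```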

How was this patch tested?

Unit tests have been updated.

Contributor Author

xkrogen commented Jun 28, 2021

cc @jaceklaskowski @HyukjinKwon @Ngone51 @mridulm @otterc

My only concern here is whether this will be considered backwards-incompatible due to the metrics name changes. I'm not sure what the contract is for metric names, so any input here would be appreciated.

Member

dongjoon-hyun left a comment

Yes, this is a breaking change, @xkrogen .

  1. +1 for changing the internal code variables.
  2. -1 for removing the existing metrics like openBlockRequestLatencyMillis. Instead, we need to expose correct metric values and a migration doc.
  3. -0 for additionally adding new metrics like openBlockRequestLatency, because the duplicated values increase the traffic and storage load on metric collection systems.

Contributor Author

xkrogen commented Jun 28, 2021

Got it, thanks for chiming in @dongjoon-hyun ! I'm fine with just renaming the variables for now and neutral on adding correct names alongside the old ones. I will wait for other folks to chime in before updating.

SparkQA commented Jun 29, 2021

Test build #140361 has finished for PR 33116 at commit faa0925.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44889/

SparkQA commented Jun 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44889/

xkrogen force-pushed the xkrogen-SPARK-35259-ess-fix-metric-unit-prefix branch from faa0925 to 9681c36 on July 1, 2021 22:47
xkrogen changed the title from "[SPARK-35259][SHUFFLE] Rename ExternalBlockHandler Timer metrics without incorrect millis suffix" to "[SPARK-35259][SHUFFLE] Rename ExternalBlockHandler Timer variables to remove incorrect millis suffix" on Jul 1, 2021
Contributor Author

xkrogen commented Jul 1, 2021

@dongjoon-hyun updated per your comments, PTAL. I also updated the documentation to address the discrepancy.

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45057/

@dongjoon-hyun
Member

Thank you for updates, @xkrogen .

Member

dongjoon-hyun Jul 2, 2021

Uh, this is not what I asked for. I should have been clearer. I asked to expose the correct metric values, which means converting the values to millis and exposing them here, @xkrogen.

> -1 for removing the existing metrics like openBlockRequestLatencyMillis. Instead, we need to expose correct metric values and a migration doc.

Member

@dongjoon-hyun It requires a Metric here rather than a long value.

The Timer doesn't seem to provide any APIs to get the milliseconds. The only way I see now is to implement Spark's own Timer by extending Dropwizard's one.

Contributor Author

Ohh, I understand now @dongjoon-hyun. Thanks for the clarification.

For the YARN case specifically, we can modify YarnShuffleServiceMetrics to do the conversion. This would be quite straightforward.

However I'm not sure of how the metrics are used for the non-YARN case, so I'm not sure what would be appropriate. Can either of you comment on that?

Contributor Author

I thought on this a bit more and came to agree with @Ngone51 that the better way to do this is to implement a custom Timer instead of trying to perform the conversion at some other layer. Just put up a new diff for that.

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45057/

SparkQA commented Jul 2, 2021

Test build #140544 has finished for PR 33116 at commit 9681c36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

xkrogen force-pushed the xkrogen-SPARK-35259-ess-fix-metric-unit-prefix branch from 9681c36 to 8e8f6b3 on July 12, 2021 18:56
Contributor Author

xkrogen commented Jul 12, 2021

Just put up a new diff which implements a custom Timer subclass, TimerWithMillisecondSnapshots, which acts the same as a normal Timer and stores nanoseconds internally, but exposes all values as milliseconds. It works pretty cleanly.

For YarnShuffleServiceMetrics, I removed the nanos suffix that the reporter was adding, so it is now unit-agnostic. Note that this suffix only appeared in 3.2.0 (via SPARK-25358 / #32388), so it hasn't been released yet and we can safely remove it if we get this PR into branch-3.2.

My only concern with this approach is if some other metrics reporter (besides YarnShuffleService) may try to use these custom timers as if they still had nanosecond units. I'm not familiar enough with how the metrics are used in the non-YARN context to know if this will be an issue, so input is welcomed here.

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45445/

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45445/

SparkQA commented Jul 12, 2021

Test build #140933 has finished for PR 33116 at commit 8e8f6b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • static class TimerWithMillisecondSnapshots extends Timer

Member

Ngone51 commented Jul 14, 2021

> My only concern with this approach is if some other metrics reporter (besides YarnShuffleService) may try to use these custom timers as if they still had nanosecond units. I'm not familiar enough with how the metrics are used in the non-YARN context to know if this will be an issue, so input is welcomed here.

Ideally, one should learn the usage before using the timer. That being said, I think we can force the TimerWithMillisecondSnapshots to receive a time-unit parameter to indicate which time-unit it exposes.

WDYT?

Contributor Author

xkrogen commented Jul 14, 2021

> I think we can force the TimerWithMillisecondSnapshots to receive a time-unit parameter to indicate which time-unit it exposes.

I don't think it makes sense to force callers to supply a time unit, since (a) we should adhere to the Timer interface, which already exposes the unitless methods, and (b) the whole issue here is that these metrics themselves are trying to define their unit, and if the caller supplies a unit, then the caller has to somehow know that this metric wants to be in milliseconds, which seems to defeat the purpose.

If we want to go down a route like this, it's basically what I was proposing initially where YarnShuffleServiceMetrics does the conversion. The Timer interface doesn't provide any way to extract unit information, but we could either check if the metric name ends in "millis" (and if so, treat the unit as millis instead of nanos) or create a custom extension which has a method defining the units, like:

  static class TimerWithUnits extends Timer {
    private final TimeUnit timeUnit;
    TimerWithUnits(TimeUnit timeUnit) {
      this.timeUnit = timeUnit;
    }
    public TimeUnit getTimeUnit() {
      return timeUnit;
    }
  }

Then within YarnShuffleServiceMetrics#collectMetric:

      TimeUnit timeUnit;
      if (metric instanceof TimerWithUnits) {
        timeUnit = ((TimerWithUnits) metric).getTimeUnit();
      } else {
        timeUnit = TimeUnit.NANOSECONDS;
      }
      // do some conversion on the snapshot based on the unit above
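
The conversion step left as a comment above could look roughly like the following. This is a self-contained sketch rather than the actual YarnShuffleServiceMetrics code: HasTimeUnit is a hypothetical interface standing in for the TimerWithUnits check, and snapshot values are assumed to be in nanoseconds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical marker for metrics that declare the unit they want to be reported in.
interface HasTimeUnit {
  TimeUnit getTimeUnit();
}

class SnapshotConversion {
  // Metrics that declare no unit are treated as nanoseconds, as in the snippet above.
  static TimeUnit unitOf(Object metric) {
    if (metric instanceof HasTimeUnit) {
      return ((HasTimeUnit) metric).getTimeUnit();
    }
    return TimeUnit.NANOSECONDS;
  }

  // Convert a nanosecond snapshot value into the target unit, keeping fractions.
  static double fromNanos(double valueNanos, TimeUnit target) {
    return valueNanos / target.toNanos(1);
  }
}
```

Dividing by target.toNanos(1) rather than calling TimeUnit.convert keeps sub-unit precision, since convert truncates to a long.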

Member

Ngone51 commented Jul 15, 2021

> (a) we should adhere to the Timer interface, which already exposes the unitless methods,

Although the Timer exposes unitless methods, that doesn't mean the methods have no time unit. Actually, I think this is the key reason why we misused the Timer in the first place: the Timer hides the time unit. So, wouldn't it be clearer if we required a unit when creating our custom Timer?

> (b) the whole issue here is that these metrics themselves are trying to define their unit, and if the caller supplies a unit, then the caller has to somehow know that this metric wants to be in milliseconds, which seems to defeat the purpose.

TBH, I don't understand this well... The use case I imagine is like this, e.g.,

    allMetrics.put("metricA", new OurTimer(TimeUnit.SECONDS));
    allMetrics.put("metricBMs", new OurTimer(TimeUnit.MILLISECONDS));

It's fine for the metric to omit the unit suffix. But if the metric has the unit suffix, then it's the developer's responsibility to ensure the time unit is consistent between the metric name and the Timer.

> Then within YarnShuffleServiceMetrics#collectMetric:
>
>       TimeUnit timeUnit;
>       if (metric instanceof TimerWithUnits) {
>         timeUnit = ((TimerWithUnits) metric).getTimeUnit();
>       } else {
>         timeUnit = TimeUnit.NANOSECONDS;
>       }
>       // do some conversion on the snapshot based on the unit above

And, in my understanding, the time unit should be used at all related places inside the timer (Context, Histogram, etc.), so all APIs exposed by the Timer would return values in the specified time unit directly. Thus, we wouldn't have to convert manually after getting the return value.
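
A small sketch of that idea, under the assumption that the unit is fixed when the timer is created: the stopwatch below measures in nanoseconds but reports elapsed time in its configured unit, so callers never convert by hand. UnitStopwatch and the injected clock are hypothetical and only illustrate the shape; they are not the Dropwizard Timer.Context API.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical illustration; not Dropwizard's Timer.Context.
class UnitStopwatch {
  private final TimeUnit unit;
  private final LongSupplier nanoClock;
  private final long startNanos;

  // The clock is injectable so the behavior can be checked deterministically.
  UnitStopwatch(TimeUnit unit, LongSupplier nanoClock) {
    this.unit = unit;
    this.nanoClock = nanoClock;
    this.startNanos = nanoClock.getAsLong();
  }

  // Convenience constructor backed by the JVM's monotonic clock.
  UnitStopwatch(TimeUnit unit) {
    this(unit, System::nanoTime);
  }

  // Elapsed time in the configured unit, truncated to a whole number of units.
  long elapsed() {
    return unit.convert(nanoClock.getAsLong() - startNanos, TimeUnit.NANOSECONDS);
  }
}
```

With this shape, a metric registered as "metricBMs" would be paired with new UnitStopwatch(TimeUnit.MILLISECONDS), and keeping the name suffix and the unit in sync stays the responsibility of the code that creates the timer.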

Contributor Author

xkrogen commented Jul 15, 2021

Ah, I see, I misunderstood what you meant. I thought you meant that the place where the values are extracted from the Timer (e.g. YarnShuffleServiceMetrics) would supply a unit, but you are saying that it would be where the Timer is created (e.g. ExternalBlockHandler). I agree with you.

Contributor Author

xkrogen commented Jul 15, 2021

Pushed up a new commit changing TimerWithMillisecondSnapshots to TimerWithCustomTimeUnit, which will accept a configurable time unit as a constructor parameter. I also moved it into the org.apache.spark.network.util package within network-common since it is becoming more general. @Ngone51 let me know what you think.

SparkQA commented Jul 15, 2021

Test build #141082 has started for PR 33116 at commit ca1c396.

SparkQA commented Jul 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 141
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45596/

Contributor Author

xkrogen commented Jul 20, 2021

@dongjoon-hyun @Ngone51 gentle ping, any thoughts on the latest diff?

xkrogen force-pushed the xkrogen-SPARK-35259-ess-fix-metric-unit-prefix branch from ca1c396 to d62dd57 on July 21, 2021 20:44
Contributor Author

xkrogen commented Jul 21, 2021

Pushed up a new commit fixing a few old references to milliseconds and adding testing for the Timer.Context API. Thanks for the review @Ngone51 !

SparkQA commented Jul 21, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45964/

SparkQA commented Jul 21, 2021

Test build #141445 has finished for PR 33116 at commit d62dd57.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  // Time latency for processing fetch merged blocks meta request latency in ms
- private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
+ private final Timer fetchMergedBlocksMetaLatencyMillis =
+     new TimerWithCustomTimeUnit(TimeUnit.MILLISECONDS);
Member

Ngone51 Jul 22, 2021

Member

Ngone51 commented Jul 22, 2021

LGTM if tests pass.

Member

Ngone51 commented Jul 22, 2021

@dongjoon-hyun it'd be great if you could take a look too.

Member

Ngone51 commented Jul 22, 2021

@xkrogen Could you update the PR description?

Contributor Author

xkrogen commented Jul 22, 2021

Pushed up new commit fixing the indentation.

Looks like the GitHub Actions failures are due to the issues being worked on in #33475.

The Jenkins build also failed with one test:

org.apache.spark.sql.execution.DataSourceV2ScanExecRedactionSuite.SPARK-30362: test input metrics for DSV2

It looks unrelated and passes locally.

xkrogen changed the title from "[SPARK-35259][SHUFFLE] Rename ExternalBlockHandler Timer variables to remove incorrect millis suffix" to "[SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units" on Jul 22, 2021
Contributor Author

xkrogen commented Jul 22, 2021

Also updated description, thanks for the reminder @Ngone51 ! It had gotten pretty outdated :)

Ngone51 closed this in 70a1586 on Jul 24, 2021
Ngone51 pushed a commit that referenced this pull request Jul 24, 2021
… expose correct units

### What changes were proposed in this pull request?
`ExternalBlockHandler` exposes 4 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
    private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
    private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).

This causes `YarnShuffleServiceMetrics` to expose confusingly-named metrics like `openBlockRequestLatencyMillis_nanos_max` (the actual values are currently in nanos).

This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit. The logic to include the `nanos` suffix in the metric name within `YarnShuffleServiceMetrics` has also been removed, with the assumption that the metric name itself includes the units.

### Does this PR introduce _any_ user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by `ExternalBlockHandler` via `YarnShuffleServiceMetrics` such as `openBlockRequestLatencyMillis_nanos_max` and `openBlockRequestLatencyMillis_nanos_50thPercentile` have been changed to remove the `_nanos` suffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are like `openBlockRequestLatencyMillis_max` and `openBlockRequestLatencyMillis_50thPercentile`
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.

### How was this patch tested?
Unit tests have been updated.

Closes #33116 from xkrogen/xkrogen-SPARK-35259-ess-fix-metric-unit-prefix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 70a1586)
Signed-off-by: yi.wu <yi.wu@databricks.com>
Member

Ngone51 commented Jul 24, 2021

Thanks, merged to master/3.2.

Member

Ngone51 commented Jul 24, 2021

@xkrogen could you also backport this to 3.1/3.0?

@dongjoon-hyun
Member

+1, late LGTM. Sorry for missing the pings here. Thank you, @xkrogen and @Ngone51.

xkrogen deleted the xkrogen-SPARK-35259-ess-fix-metric-unit-prefix branch on July 26, 2021 15:48
xkrogen added a commit to xkrogen/spark that referenced this pull request Jul 26, 2021
…es to expose correct units

`ExternalBlockHandler` exposes 3 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
    private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).

This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit.

This introduces a user-facing change as the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics are not exposed prior to 3.2, so there is no change. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages, so this PR is primarily code cleanup and to make branch 3.1 match with 3.2/master.

Note that this differs from the master/3.2 version (apache#33116) primarily because there are no changes needed in `YarnShuffleServiceMetrics`.

Unit tests have been updated.
Contributor Author

xkrogen commented Jul 26, 2021

Awesome, many thanks @Ngone51 and @dongjoon-hyun !

Backport PR for 3.1: #33523
Backport PR for 3.0: #33524

xkrogen added a commit to xkrogen/spark that referenced this pull request Jul 26, 2021
…es to expose correct units

`ExternalBlockHandler` exposes 2 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).

This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit.

This introduces a user-facing change as the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics are not exposed prior to 3.2, so there is no change. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages, so this PR is primarily code cleanup and to make branch 3.0 match with 3.2/master.

Note that this differs from the master/3.2 version (apache#33116) primarily because there are no changes needed in `YarnShuffleServiceMetrics`.

Unit tests have been updated.
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022