[SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units#33116
[SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units#33116xkrogen wants to merge 5 commits intoapache:masterfrom
Conversation
|
cc @jaceklaskowski @HyukjinKwon @Ngone51 @mridulm @otterc My only concern here is whether this will be considered backwards-incompatible due to the metrics name changes. I'm not sure what the contract is for metric names, so any input here would be appreciated. |
There was a problem hiding this comment.
Yes, this is a breaking change, @xkrogen .
- +1 for changing the internal code variables.
- -1 for removing the existing metrics like
openBlockRequestLatencyMillis. Instead, we need to expose correct metric values and a migration doc. - -0 for adding new metrics like
openBlockRequestLatencyadditionally because it causes the increase of the traffic and storage of metric collection systems with the duplicated values.
|
Got it, thanks for chiming in @dongjoon-hyun ! I'm fine with just renaming the variables for now and neutral on adding correct names alongside the old ones. I will wait for other folks to chime in before updating. |
|
Test build #140361 has finished for PR 33116 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
faa0925 to
9681c36
Compare
|
@dongjoon-hyun updated per your comments, PTAL. I also updated the documentation to address the discrepancy. |
|
Kubernetes integration test starting |
|
Thank you for updates, @xkrogen . |
There was a problem hiding this comment.
Ur, this is not what I asked. I must be clearer. I asked to expose the correct metric values which mean converting the value to mills and put here, @xkrogen .
-1 for removing the existing metrics like openBlockRequestLatencyMillis. Instead, we need to expose correct metric values and a migration doc.
There was a problem hiding this comment.
@dongjoon-hyun It requires a Metric here rather than a long value.
The Timer doesn't seem to provide any APIs to get the milliseconds. The only way I see now is to implements Spark's own Timer by extending Dropwizard's one.
There was a problem hiding this comment.
Ohh, I understand now @dongjoon-hyun. Thanks for the clarification.
For the YARN case specifically, we can modify YarnShuffleServiceMetrics to do the conversion. This would be quite straightforward.
However I'm not sure of how the metrics are used for the non-YARN case, so I'm not sure what would be appropriate. Can either of you comment on that?
There was a problem hiding this comment.
I thought on this a bit more and came to agree with @Ngone51 that the better way to do this is to implement a custom Timer instead of trying to perform the conversion at some other layer. Just put up a new diff for that.
|
Kubernetes integration test status success |
|
Test build #140544 has finished for PR 33116 at commit
|
9681c36 to
8e8f6b3
Compare
|
Just put up a new diff which implements a custom For My only concern with this approach is if some other metrics reporter (besides |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #140933 has finished for PR 33116 at commit
|
Ideally, one should learn the usage before using the timer. That being said, I think we can force the WDYT? |
I don't it makes sense to force callers to supply a time unit, since (a) we should adhere to the If we want to go down a route like this, it's basically what I was proposing initially where Then within |
Although the Timer exposes the unitless methods, it doesn't mean the methods don't have the time unit. Actually, I think this the key point that why we misused the Timer in the first place. The timer has hidden the time unit. So, wouldn't it be clearer if we force a unit requirement while creating the (our custom) Timer?
TBH, I don't understand this well...The use case I image is like this, e.g., allMetrics.put("metricA", new OurTimer(timeUnit = TimeUnit.SECONDS))
allMetrics.put("metricBMs", new OurTimer(timeUnit = TimeUnit.MILLISECONDS))It's fine for the metric to omit the unit suffix. But, if the metric has the unit suffix, then, it's the developer's responsibility to ensure the time unit consistent between the metric and Timer.
And, with my understanding, the time unit should be used at all related places inside the timer (Context, Histogram, etc). So, all APIs exposed by the |
|
Ah, I see, I misunderstood what you meant. I thought you meant that the place where the values are extracted from the Timer (e.g. |
|
Pushed up a new commit changing |
|
Test build #141082 has started for PR 33116 at commit |
|
Kubernetes integration test unable to build dist. exiting with code: 141 |
|
@dongjoon-hyun @Ngone51 gentle ping, any thoughts on the latest diff? |
…. Also update documentation to reflect the discrepancy.
…s in ms instead of nanos.
…'util' package within network-common.
ca1c396 to
d62dd57
Compare
|
Pushed up a new commit fixing a few old references to milliseconds and adding testing for the |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #141445 has finished for PR 33116 at commit
|
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java
Show resolved
Hide resolved
| // Time latency for processing fetch merged blocks meta request latency in ms | ||
| private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer(); | ||
| private final Timer fetchMergedBlocksMetaLatencyMillis = | ||
| new TimerWithCustomTimeUnit(TimeUnit.MILLISECONDS); |
|
LGTM if tests pass. |
|
@dongjoon-hyun it'd be great if you could take a look too. |
|
@xkrogen Could you update the PR description? |
|
Pushed up new commit fixing the indentation. Looks like the GitHub Actions failures are due to the issues being worked on in #33475. The Jenkins build also failed with one test: It looks unrelated and passes locally. |
|
Also updated description, thanks for the reminder @Ngone51 ! It had gotten pretty outdated :) |
… expose correct units
### What changes were proposed in this pull request?
`ExternalBlockHandler` exposes 4 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
private final Timer openBlockRequestLatencyMillis = new Timer();
private final Timer registerExecutorRequestLatencyMillis = new Timer();
private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).
This causes `YarnShuffleServiceMetrics` to expose confusingly-named metrics like `openBlockRequestLatencyMillis_nanos_max` (the actual values are currently in nanos).
This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit. The logic to include the `nanos` suffix in the metric name within `YarnShuffleServiceMetrics` has also been removed, with the assumption that the metric name itself includes the units.
### Does this PR introduce _any_ user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by `ExternalBlockHandler` via `YarnShuffleServiceMetrics` such as `openBlockRequestLatencyMillis_nanos_max` and `openBlockRequestLatencyMillis_nanos_50thPercentile` have been changed to remove the `_nanos` suffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are like `openBlockRequestLatencyMillis_max` and `openBlockRequestLatencyMillis_50thPercentile`
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.
### How was this patch tested?
Unit tests have been updated.
Closes #33116 from xkrogen/xkrogen-SPARK-35259-ess-fix-metric-unit-prefix.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 70a1586)
Signed-off-by: yi.wu <yi.wu@databricks.com>
|
Thanks, merged to master/3.2. |
|
@xkrogen could you also backport this to 3.1/3.0? |
…es to expose correct units
`ExternalBlockHandler` exposes 3 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
private final Timer openBlockRequestLatencyMillis = new Timer();
private final Timer registerExecutorRequestLatencyMillis = new Timer();
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).
This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit.
This introduces a user-facing change as the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics are not exposed prior to 3.2, so there is no change. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages, so this PR is primarily code cleanup and to make branch 3.1 match with 3.2/master.
Note that this differs from the master/3.2 version (apache#33116) primarily because there are no changes needed in `YarnShuffleServiceMetrics`.
Unit tests have been updated.
|
Awesome, many thanks @Ngone51 and @dongjoon-hyun ! |
…es to expose correct units
`ExternalBlockHandler` exposes 2 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
private final Timer openBlockRequestLatencyMillis = new Timer();
private final Timer registerExecutorRequestLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).
This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit.
This introduces a user-facing change as the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics are not exposed prior to 3.2, so there is no change. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages, so this PR is primarily code cleanup and to make branch 3.0 match with 3.2/master.
Note that this differs from the master/3.2 version (apache#33116) primarily because there are no changes needed in `YarnShuffleServiceMetrics`.
Unit tests have been updated.
… expose correct units
### What changes were proposed in this pull request?
`ExternalBlockHandler` exposes 4 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
private final Timer openBlockRequestLatencyMillis = new Timer();
private final Timer registerExecutorRequestLatencyMillis = new Timer();
private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).
This causes `YarnShuffleServiceMetrics` to expose confusingly-named metrics like `openBlockRequestLatencyMillis_nanos_max` (the actual values are currently in nanos).
This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit. The logic to include the `nanos` suffix in the metric name within `YarnShuffleServiceMetrics` has also been removed, with the assumption that the metric name itself includes the units.
### Does this PR introduce _any_ user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by `ExternalBlockHandler` via `YarnShuffleServiceMetrics` such as `openBlockRequestLatencyMillis_nanos_max` and `openBlockRequestLatencyMillis_nanos_50thPercentile` have been changed to remove the `_nanos` suffix. This would be considered a breaking change, but these names were only exposed as part of apache#32388, which has not yet been released (slated for 3.2.0). New names are like `openBlockRequestLatencyMillis_max` and `openBlockRequestLatencyMillis_50thPercentile`
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics were also introduced by apache#32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.
### How was this patch tested?
Unit tests have been updated.
Closes apache#33116 from xkrogen/xkrogen-SPARK-35259-ess-fix-metric-unit-prefix.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 70a1586)
Signed-off-by: yi.wu <yi.wu@databricks.com>
What changes were proposed in this pull request?
ExternalBlockHandlerexposes 4 metrics which are DropwizardTimermetrics, and are named with amillissuffix:However these Dropwizard Timers by default use nanoseconds (documentation).
This causes
YarnShuffleServiceMetricsto expose confusingly-named metrics likeopenBlockRequestLatencyMillis_nanos_max(the actual values are currently in nanos).This PR adds a new
Timersubclass,TimerWithCustomTimeUnit, which accepts aTimeUnitat creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. TheTimermetrics withinExternalBlockHandlerare updated to use the new class with milliseconds as the unit. The logic to include thenanossuffix in the metric name withinYarnShuffleServiceMetricshas also been removed, with the assumption that the metric name itself includes the units.Does this PR introduce any user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by
ExternalBlockHandlerviaYarnShuffleServiceMetricssuch asopenBlockRequestLatencyMillis_nanos_maxandopenBlockRequestLatencyMillis_nanos_50thPercentilehave been changed to remove the_nanossuffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are likeopenBlockRequestLatencyMillis_maxandopenBlockRequestLatencyMillis_50thPercentileSecond, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as
openBlockRequestLatencyMillis_countoropenBlockRequestLatencyMillis_rate1, only theSnapshot-related metrics (max,median, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.How was this patch tested?
Unit tests have been updated.