Skip to content

[CELEBORN-1485][0.4] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization#2928

Closed
cfmcgrady wants to merge 2 commits into
apache:branch-0.4from
cfmcgrady:CELEBORN-1485-0.4
Closed

[CELEBORN-1485][0.4] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization#2928
cfmcgrady wants to merge 2 commits into
apache:branch-0.4from
cfmcgrady:CELEBORN-1485-0.4

Conversation

@cfmcgrady
Copy link
Copy Markdown
Contributor

backport #2593 to branch-0.4

What changes were proposed in this pull request?

Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization.

Why are the changes needed?

addCounter, addGauge and addTimer of AbstractSource checks whether the metric key exist via MetricRegistry#getMetrics which iterates all metrics and put into map at present. It causes that adding counter of active connection count metric for application dimension would increase high CPU utilization when there are many active connections:

image

The implementation of MetricRegistry#getMetrics is as follows:

private <T extends Metric> SortedMap<String, T> getMetrics(Class<T> klass, MetricFilter filter) {
    final TreeMap<String, T> timers = new TreeMap<>();
    for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
        if (klass.isInstance(entry.getValue()) && filter.matches(entry.getKey(), entry.getValue())) {
            timers.put(entry.getKey(), (T) entry.getValue());
        }
    }
    return Collections.unmodifiableSortedMap(timers);
}

Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Cluster test.

image

Closes #2593 from SteNicholas/CELEBORN-1485.

Authored-by: SteNicholas programgeek@163.com

@cfmcgrady cfmcgrady changed the title [CELEBORN-1485] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization [CELEBORN-1485][0.4] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization Nov 19, 2024
SteNicholas and others added 2 commits November 20, 2024 11:05
…tSource to reduce CPU utilization

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

`addCounter`, `addGauge` and `addTimer` of `AbstractSource` checks whether the metric key exist via `MetricRegistry#getMetrics` which iterates all metrics and put into map at present. It causes that adding counter of active connection count metric for application dimension would increase high CPU utilization when there are many active connections:

<img width="1350" alt="image" src="https://github.com/apache/celeborn/assets/10048174/cc882fac-eec1-417b-ba17-f3012053c6c7">

The implementation of `MetricRegistry#getMetrics` is as follows:

```
private <T extends Metric> SortedMap<String, T> getMetrics(Class<T> klass, MetricFilter filter) {
    final TreeMap<String, T> timers = new TreeMap<>();
    for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
        if (klass.isInstance(entry.getValue()) && filter.matches(entry.getKey(), entry.getValue())) {
            timers.put(entry.getKey(), (T) entry.getValue());
        }
    }
    return Collections.unmodifiableSortedMap(timers);
}
```

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

No.

Cluster test.

<img width="1345" alt="image" src="https://github.com/apache/celeborn/assets/10048174/4c0a7f92-3cc5-45f8-941f-e1d0166043e1">

Closes apache#2593 from SteNicholas/CELEBORN-1485.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
@cfmcgrady
Copy link
Copy Markdown
Contributor Author

cc @SteNicholas @FMX

@cfmcgrady
Copy link
Copy Markdown
Contributor Author

thanks, merging to branch-0.4(v0.4.3).

cfmcgrady added a commit that referenced this pull request Nov 20, 2024
…stractSource to reduce CPU utilization

backport #2593 to branch-0.4

### What changes were proposed in this pull request?

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

### Why are the changes needed?

`addCounter`, `addGauge` and `addTimer` of `AbstractSource` checks whether the metric key exist via `MetricRegistry#getMetrics` which iterates all metrics and put into map at present. It causes that adding counter of active connection count metric for application dimension would increase high CPU utilization when there are many active connections:

<img width="1350" alt="image" src="https://github.com/apache/celeborn/assets/10048174/cc882fac-eec1-417b-ba17-f3012053c6c7">

The implementation of `MetricRegistry#getMetrics` is as follows:

```
private <T extends Metric> SortedMap<String, T> getMetrics(Class<T> klass, MetricFilter filter) {
    final TreeMap<String, T> timers = new TreeMap<>();
    for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
        if (klass.isInstance(entry.getValue()) && filter.matches(entry.getKey(), entry.getValue())) {
            timers.put(entry.getKey(), (T) entry.getValue());
        }
    }
    return Collections.unmodifiableSortedMap(timers);
}
```

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

<img width="1345" alt="image" src="https://github.com/apache/celeborn/assets/10048174/4c0a7f92-3cc5-45f8-941f-e1d0166043e1">

Closes #2593 from SteNicholas/CELEBORN-1485.

Authored-by: SteNicholas <programgeek163.com>

Closes #2928 from cfmcgrady/CELEBORN-1485-0.4.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
@cfmcgrady cfmcgrady closed this Nov 20, 2024
@cfmcgrady cfmcgrady deleted the CELEBORN-1485-0.4 branch November 20, 2024 03:52
cfmcgrady added a commit to cfmcgrady/incubator-celeborn that referenced this pull request Aug 21, 2025
…stractSource to reduce CPU utilization

backport apache#2593 to branch-0.4

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

`addCounter`, `addGauge` and `addTimer` of `AbstractSource` checks whether the metric key exist via `MetricRegistry#getMetrics` which iterates all metrics and put into map at present. It causes that adding counter of active connection count metric for application dimension would increase high CPU utilization when there are many active connections:

<img width="1350" alt="image" src="https://github.com/apache/celeborn/assets/10048174/cc882fac-eec1-417b-ba17-f3012053c6c7">

The implementation of `MetricRegistry#getMetrics` is as follows:

```
private <T extends Metric> SortedMap<String, T> getMetrics(Class<T> klass, MetricFilter filter) {
    final TreeMap<String, T> timers = new TreeMap<>();
    for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
        if (klass.isInstance(entry.getValue()) && filter.matches(entry.getKey(), entry.getValue())) {
            timers.put(entry.getKey(), (T) entry.getValue());
        }
    }
    return Collections.unmodifiableSortedMap(timers);
}
```

Refactor `addCounter`, `addGauge` and `addTimer` of `AbstractSource` to reduce CPU utilization.

No.

Cluster test.

<img width="1345" alt="image" src="https://github.com/apache/celeborn/assets/10048174/4c0a7f92-3cc5-45f8-941f-e1d0166043e1">

Closes apache#2593 from SteNicholas/CELEBORN-1485.

Authored-by: SteNicholas <programgeek163.com>

Closes apache#2928 from cfmcgrady/CELEBORN-1485-0.4.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants