
Breakdown metrics #564

Merged — axw merged 2 commits into elastic:master from breakdown-metrics on Jul 17, 2019

Conversation

@axw (Member) commented Jun 17, 2019

Implementation of agent breakdown metrics (elastic/apm#78)

Checklist/TL;DR version of the design:

Closes #528

@codecov-io commented Jun 17, 2019

Codecov Report

Merging #564 into master will increase coverage by 0.48%.
The diff coverage is 90.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
+ Coverage   84.16%   84.64%   +0.48%     
==========================================
  Files         116      120       +4     
  Lines        6894     7190     +296     
==========================================
+ Hits         5802     6086     +284     
- Misses        783      789       +6     
- Partials      309      315       +6
| Impacted Files | Coverage Δ |
|---|---|
| env.go | 77.77% <ø> (ø) ⬆️ |
| internal/apmlog/logger.go | 82.66% <0%> (-2.27%) ⬇️ |
| transaction.go | 95% <100%> (+0.21%) ⬆️ |
| builtin_metrics.go | 83.33% <100%> (+0.2%) ⬆️ |
| fnv.go | 100% <100%> (ø) |
| metrics.go | 90.47% <100%> (+0.15%) ⬆️ |
| modelwriter.go | 96.15% <100%> (+0.18%) ⬆️ |
| span.go | 89.82% <100%> (+2.6%) ⬆️ |
| utils.go | 80.85% <100%> (+0.85%) ⬆️ |
| model/marshal.go | 78.37% <100%> (+0.21%) ⬆️ |
| ... and 10 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14219ba...98eaf6d. Read the comment docs.

@axw (Member, Author) commented Jun 18, 2019

Here's a description/explanation of some of the more interesting parts.

  1. The Tracer has a "breakdownMetrics" field which has an active/inactive pair of metrics to avoid contention with the metrics gatherer.
  2. Each Transaction and Span has a "childrenTimer" which tracks the duration of its direct children, used for computing self-time (a sketch of such a timer follows this list). Transactions also have a "spanTimings" map, much like in the Java agent.
  3. When a span ends, it notifies its parent's childrenTimer and adds itself to its transaction's spanTimings if the transaction hasn't already ended.
  4. When a transaction ends, it is enqueued to the tracer, and the tracer updates the "active" breakdown metrics. If the queue is full, the goroutine that ended the transaction updates the breakdown metrics directly. In either case a lock is held, like the writer critical section in the Java agent implementation. The space for span timings is part of the recyclable object, and so can be reused to minimise allocations.
  5. Just before the metrics gatherers run, the tracer swaps the active/inactive breakdown metrics. The metrics gatherer then translates the breakdown metrics into the metrics structures for serialisation.
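
To make (2) and (3) concrete, here is a minimal, self-contained sketch of how a children timer can avoid double-counting overlapping direct children when computing self-time. The method names (childStarted, childEnded, selfTime) are illustrative rather than the agent's exact API:

```go
package main

import (
	"fmt"
	"time"
)

// childrenTimer tracks how long a transaction or span has at least one
// direct child in flight, so overlapping children are not counted twice
// when computing self-time.
type childrenTimer struct {
	active        int           // number of direct children currently running
	start         time.Time     // when active last went from 0 to 1
	totalDuration time.Duration // accumulated time with >= 1 child running
}

func (t *childrenTimer) childStarted(now time.Time) {
	t.active++
	if t.active == 1 {
		t.start = now
	}
}

func (t *childrenTimer) childEnded(now time.Time) {
	t.active--
	if t.active == 0 {
		t.totalDuration += now.Sub(t.start)
	}
}

// selfTime is the parent's duration minus the time spent with at least one
// direct child running.
func selfTime(parentDuration time.Duration, t *childrenTimer) time.Duration {
	return parentDuration - t.totalDuration
}

func main() {
	var timer childrenTimer
	t0 := time.Now()
	// Two overlapping children, [0ms, 30ms] and [10ms, 40ms], cover 40ms in total.
	timer.childStarted(t0)
	timer.childStarted(t0.Add(10 * time.Millisecond))
	timer.childEnded(t0.Add(30 * time.Millisecond))
	timer.childEnded(t0.Add(40 * time.Millisecond))
	// A 100ms parent therefore has 60ms of self-time.
	fmt.Println(selfTime(100*time.Millisecond, &timer)) // 60ms
}
```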

Due to the last two points, we do not have long-lived timer objects: we still limit the breakdown metrics to 1000 {transactionType, transactionName, spanType, spanSubtype} tuples, but they may differ between metrics reporting periods.
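
As a rough illustration of that limit (all names here, e.g. groupKey and maxBreakdownGroups, are hypothetical, not the agent's identifiers), the cardinality guard amounts to refusing to create new groups once the cap is hit and discarding the map each reporting period:

```go
package breakdownsketch

import "time"

// Hypothetical sketch of the kind of key and cardinality guard a
// "1000 tuples per reporting period" limit implies; not the agent's code.

const maxBreakdownGroups = 1000

type groupKey struct {
	transactionType string
	transactionName string
	spanType        string
	spanSubtype     string
}

type groupTiming struct {
	count    uint64
	duration time.Duration
}

// add folds one timing into the map, refusing to create new groups once the
// cap is reached so the map (and agent memory) stays bounded. The map is
// discarded after each reporting period, so the set of groups may differ
// between periods.
func add(groups map[groupKey]groupTiming, k groupKey, d time.Duration) bool {
	t, ok := groups[k]
	if !ok && len(groups) >= maxBreakdownGroups {
		return false // dropped: over the cardinality limit for this period
	}
	t.count++
	t.duration += d
	groups[k] = t
	return true
}
```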

The locking involved in (4) concerns me. Under the "happy path" where transactions get enqueued to the tracer there will be no contention between transactions, but when the load is high enough that the tracer cannot process transactions quickly enough, we'll introduce contention between transaction goroutines.
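
A loose sketch of that enqueue-or-update-directly path, with hypothetical type and method names (tracer, transactionEnded, recordTransaction), to show where the contention arises:

```go
package tracersketch

import "sync"

// Illustrative types only; fields and methods are assumptions, not the
// agent's actual structures.

type transactionData struct {
	// ... span timings, transaction type/name, etc. ...
}

type breakdownMetrics struct {
	mu sync.Mutex
	// ... active metrics keyed by {transactionType, transactionName, spanType, spanSubtype} ...
}

func (b *breakdownMetrics) recordTransaction(tx *transactionData) {
	b.mu.Lock()
	defer b.mu.Unlock()
	// ... fold tx's span timings into the active metrics ...
}

type tracer struct {
	events    chan *transactionData // drained by the tracer's event loop
	breakdown breakdownMetrics
}

// transactionEnded shows the enqueue-or-update-directly behaviour: on the
// happy path the ended transaction is handed to the tracer loop and ending
// goroutines do not contend with one another; only when the queue is full
// does the ending goroutine take the breakdown lock itself, which is where
// contention between transaction goroutines appears under high load.
func (t *tracer) transactionEnded(tx *transactionData) {
	select {
	case t.events <- tx: // happy path: the tracer loop records it for us
	default:
		t.breakdown.recordTransaction(tx) // queue full: record directly
	}
}
```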

@felixbarny (Member) commented:
> we still limit the breakdown metrics to 1000

The idea was to limit the number of unique metricsets to 1000.

> The locking involved in (4) concerns me.

Maybe you should consider the writer-reader phaser the Java agent uses as well. It does not block writers at any time.
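
For context, one way a phase-swapping scheme along those lines might look in Go. This is a loose illustration built on assumptions (names like half, record, and gather are made up), not the Java agent's phaser, and it stands in for per-entry atomics with a writer-writer mutex; the point is only that the gatherer never blocks writers:

```go
package phasersketch

// Writers pin the current half with a counter and retry if a swap races
// with them; the gatherer swaps halves and then waits for in-flight writers
// on the old half to drain before reading it exclusively.

import (
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

type key struct{ spanType, spanSubtype string }

type half struct {
	writers atomic.Int64 // writers currently recording into this half
	mu      sync.Mutex   // writer-writer only; stands in for per-entry atomics
	timings map[key]time.Duration
}

type breakdown struct {
	active atomic.Pointer[half]
	spare  *half // owned by the single gatherer goroutine between gathers
}

func newBreakdown() *breakdown {
	b := &breakdown{spare: &half{timings: map[key]time.Duration{}}}
	b.active.Store(&half{timings: map[key]time.Duration{}})
	return b
}

// record never waits for the gatherer; at worst it retries after a swap.
func (b *breakdown) record(k key, d time.Duration) {
	for {
		h := b.active.Load()
		h.writers.Add(1)
		if b.active.Load() == h {
			h.mu.Lock()
			h.timings[k] += d
			h.mu.Unlock()
			h.writers.Add(-1)
			return
		}
		h.writers.Add(-1) // a swap raced with us; record into the new half
	}
}

// gather (called from a single goroutine) swaps the halves, waits for the
// old half's in-flight writers to finish, then reads it exclusively.
func (b *breakdown) gather() map[key]time.Duration {
	old := b.active.Swap(b.spare)
	for old.writers.Load() != 0 {
		runtime.Gosched() // short grace period while writers drain
	}
	out := old.timings
	old.timings = map[key]time.Duration{}
	b.spare = old
	return out
}
```

The property being illustrated is that record only ever waits on other writers (or retries once after a swap), never on the gatherer.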

@axw (Member, Author) commented Jun 18, 2019

> The idea was to limit the number of unique metricsets to 1000.

The important thing is to have a bound on agent memory? Does it matter if it's 1000 unique metrics for the process's lifetime vs. at any one time?

> Maybe you should consider the writer-reader phaser the Java agent uses as well. It does not block writers at any time.

Indeed. I misunderstood the guarantees of the phaser. I'll take a closer look at that.

@axw axw force-pushed the breakdown-metrics branch 2 times, most recently from 5e1d0cc to 5c067d7 Compare June 19, 2019 06:42
@axw axw marked this pull request as ready for review July 1, 2019 08:29
@axw axw merged commit be583fe into elastic:master Jul 17, 2019
@axw axw deleted the breakdown-metrics branch July 17, 2019 01:44