Schema for metrics #474

Open
axw opened this issue May 27, 2019 · 21 comments

Comments


axw commented May 27, 2019

In elastic/beats#11836 we identified the need to maintain some kind of central definition of metrics that are implemented by both Beats and Elastic APM. The APM agents now produce a subset of system.cpu.*, system.memory.*, and system.process.* metrics defined by Metricbeat. The APM agents are now starting to define language/runtime-specific metrics, and Metricbeat will also start producing at least some of these.

We would like to propose extending ECS to cover metrics.

Organisation

I propose we introduce new field sets for:

  • JVM (jvm.*)
  • Go (golang.*)
  • Node.js (nodejs.*)
  • .NET (dotnet.*)
  • Python (python.*)
  • Ruby (ruby.*)

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Existing system metrics should be renamed to fit into the existing ECS field sets, e.g.

  • system.cpu.* -> host.cpu.*
  • system.memory.* -> host.memory.*
  • system.process.* -> process.*
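
To make this concrete, a single metrics document under the proposed organisation might look roughly like the sketch below. All of the leaf field names and nesting are illustrative only (settling them is exactly what this issue is about); it is shown as a Python dict for brevity.

```python
# Illustrative only: a metrics document using the proposed ECS-style names.
# None of the leaf fields below are settled; they mirror the renames and
# runtime field sets proposed above.
metrics_doc = {
    "@timestamp": "2019-05-27T00:00:00.000Z",
    "host": {
        "cpu": {"usage": {"pct": 0.37}},               # was system.cpu.*
        "memory": {"used": {"bytes": 4_294_967_296}},  # was system.memory.*
    },
    "process": {
        "pid": 1234,
        "cpu": {"user": {"ticks": 9130}},              # was system.process.cpu.*
    },
    "nodejs": {                                        # runtime-specific field set
        "eventloop": {"delay": {"ms": 1.2}},
    },
}
```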

Naming conventions

Ideally, we would extend the conventions documentation to cover metric naming. Specifically, we should ensure consistent use of units in metric names, so that metrics are discoverable and self-documenting, both of which are important in Metrics Explorer.

One challenge here is that some existing metrics have inconsistent naming. For example:

  • system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count)
  • system.process.cpu.total.value (.value isn't meaningful, and there's no unit)
  • system.memory.total vs. system.memory.used.bytes (both are bytes, but only one says so in the name)

We should review https://github.com/elastic/beats/blob/master/docs/devguide/event-conventions.asciidoc#standardised-names, revising it for inclusion in the ECS conventions.

Key points:

  • all metrics must specify a unit. This means finding an alternative to the following rule from the Beats conventions, since "value" is not a unit:

    If a field name matches the namespace used for nested fields, add .value to the field name.

  • units are not necessarily SI; they may be domain-specific, e.g. "objects" and "mallocs" are fine

As an alternative to including units in names, we could wait for elastic/elasticsearch#31244 or elastic/elasticsearch#33267. However, there are existing fields in ECS that already include units (network.bytes, http.request.bytes, ...), so it may be best not to wait, and make that change across the board in a future revision.
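
As a rough sketch of how a units-in-names convention could be checked mechanically, assuming a purely hypothetical allow-list of unit suffixes (the real list would come out of the conventions work above):

```python
# Hypothetical allow-list of unit suffixes; not a proposed standard, and it
# would include domain-specific units as noted above.
UNIT_SUFFIXES = {"bytes", "ms", "sec", "pct", "ticks", "count", "objects", "mallocs"}

def has_unit_suffix(field_name: str) -> bool:
    """Return True if the metric field name ends in a recognised unit."""
    return field_name.rsplit(".", 1)[-1] in UNIT_SUFFIXES

# Examples from the naming issues listed above:
assert has_unit_suffix("system.memory.used.bytes")
assert not has_unit_suffix("system.memory.total")             # no unit in the name
assert not has_unit_suffix("system.process.cpu.total.value")  # "value" is not a unit
```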

Open questions:

  • What do we do about existing non-compliant naming? A few options:
    • Rename in Elastic Stack 8.0. This may mean breaking existing dashboards, or introducing field aliases. This option is implied by the proposal above to split the system metrics into host.* and process.* field sets.
    • Include existing metrics as-is, with a TODO to make their naming consistent in the future.
    • Don't include these existing metrics in ECS.

Metrics implementation guide

The definition of metrics is sometimes subtle and platform-specific; without providing a guide to this, it is easy for inconsistencies to arise in implementations. We should provide a detailed guide to calculating these metrics either in ECS, or in a companion document linked from ECS.
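
For example, something as familiar as process CPU utilisation hides several platform-specific choices (tick resolution, which counters are summed, whether to normalise by core count). A minimal sketch of one possible calculation, purely to show the kind of detail such a guide would need to pin down:

```python
def cpu_norm_pct(prev_ticks: int, curr_ticks: int,
                 interval_ms: float, num_cores: int,
                 ticks_per_second: float = 100.0) -> float:
    """CPU usage over one sampling interval, normalised to the range [0, 1].

    Illustrative assumptions: ticks are cumulative, tick resolution is
    100 Hz (e.g. Linux USER_HZ), and usage is normalised by core count.
    Each of these is exactly the kind of detail an implementation guide
    would have to specify.
    """
    cpu_seconds = (curr_ticks - prev_ticks) / ticks_per_second
    wall_seconds = interval_ms / 1000.0
    return cpu_seconds / wall_seconds / num_cores

# e.g. 50 ticks over a 1-second interval on a 4-core host -> 0.125
print(cpu_norm_pct(prev_ticks=1000, curr_ticks=1050, interval_ms=1000, num_cores=4))
```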


watson commented May 27, 2019

Great write-up @axw 😃

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

  • system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count)

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

I also just want to mention that where a metric an APM agent wants to collect exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of repeating the same metric under each language namespace. E.g. in the Node.js agent we want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.


axw commented May 27, 2019

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

I just meant that the field sets (e.g. host.*, jvm.*) would not be restricted to only metrics or only non-metrics. For example, https://www.elastic.co/guide/en/ecs/current/ecs-network.html has network.bytes and network.name.

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

I also just want to mention that where a metric an APM agent wants to collect exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of repeating the same metric under each language namespace. E.g. in the Node.js agent we want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.

👍 In this specific example, I'd say those metric fields belong in the process field set.


ruflin commented May 27, 2019

Thanks for this great writeup @axw. I think it's a good time to challenge the Metricbeat naming conventions and improve them. I would really like to have elastic/elasticsearch#31244, but I agree we should not be blocked by it, and it is a reason some fields in ECS already have .bytes today.

For the migration: We have the fields alias feature so I would directly start with the names we want in the future and potentially alias old fields (if possible).

To move this effort forward we could start with a PR for one specific environment like nodejs or golang (or both). Having specific fields to discuss will also indirectly drive the discussion about building / updating the conventions. @axw @watson Would you be willing to open PRs against https://github.com/elastic/ecs/tree/ecs-metrics ?


axw commented May 28, 2019

For Go it's a little tricky, since we've got "golang" metrics coming from Metricbeat already. There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)


watson commented May 28, 2019

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

Gotcha, it's just that you wrote .ticks.count, which confused me. But I guess that was a typo, right?

There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)

If you could take it, it would be great 👍


webmat commented May 28, 2019

Thanks for this detailed writeup, @axw! I love the attention given to improving the consistency of the naming of the various metrics.

One thing I'd like to challenge is the creation of metric namespaces per language. In ECS, I would much prefer having one place to nest runtime metrics, which we fill as best we can depending on the runtime/language. The goal here is not to mix/consume the metrics of different runtimes together (although that opens up the possibility). In most cases, you'd of course filter for JVM stuff for a JVM-centric dashboard, for example (e.g. runtime.name:jvm). But having all runtimes fill the same common schema would enable the creation of common content or pipelines.

This may sound like a 180 degree change from this comment I made in April. But back then, I didn't expect this to be proposed to be added to ECS right away.

Right now I'm seeing a proposal to add what's essentially an unbounded number of field sets to ECS, for the metrics of all the runtimes. I'd much rather have the common part of these metricsets added to ECS in one common place, like runtime, and have each runtime that has more metrics to expose than these add them as custom fields in solutions.
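
As a sketch of the shape of that suggestion (all field names below are hypothetical, just to show two runtimes filling one common namespace and keeping their extras as custom fields):

```python
# Hypothetical events: a shared runtime.* field set filled by every runtime,
# with anything runtime-specific kept as custom fields outside ECS.
jvm_event = {
    "runtime": {
        "name": "jvm",
        "version": "11.0.3",
        "heap": {"used": {"bytes": 268_435_456}},
        "gc": {"pause": {"ms": 42}},
    },
}
nodejs_event = {
    "runtime": {
        "name": "nodejs",
        "version": "12.3.1",
        "heap": {"used": {"bytes": 50_331_648}},
    },
    # runtime-specific extras, e.g. event-loop delay, would live here as
    # custom fields rather than in ECS
}
```

A JVM-centric dashboard would then filter on runtime.name:jvm, while shared content could be built once against the common fields.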

One thing we have steered away from in the Elastic Common Schema is including brand or technology names (e.g. so it's container.*, not docker.*). There's no commonality in brand names :-)


axw commented May 29, 2019

I'd much rather have the common part of these metricsets added to ECS in one common place, like runtime, and have each runtime that has more metrics to expose than these add them as custom fields in solutions.

If we do that, we're kind of back to square one on elastic/beats#11836. What we're trying to do here is define a common dictionary of metrics, many of which will be runtime-specific, e.g. "number of goroutines". Even "event loop delay" - while it may be common to a subset of languages, it's definitely not common to all, so it probably doesn't belong directly on "runtime".

If it were just APM implementing these metrics I'd be inclined to agree that we could just document the runtime-specific metrics separately. In fact, that's what we started out doing. But Metricbeat will start gathering some of these metrics too.

@webmat Is the issue you have with "[adding] what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes" about the effect it'll have on the readability of the docs? (If so is it a matter of reorganising things somehow, to be nested under a common "runtime" namespace?) Or something else?


ruflin commented May 29, 2019

I think we should stick to the plan for now to have one prefix for each runtime and we will figure out the generalisation later on. I still think we should do the same for kubernetes etc.

But there is the problem @webmat mentioned about ending up with too many fields in the ECS template. I think it's time to come up with a solution here, for example introducing different parts of ECS and potentially having one template for each, so someone can use only the basic fields if they want, or also get metrics.

In any case, I think this discussion should not block progress on the metrics branch, as having the fields already there will make the discussion more concrete.

roncohen commented:

To summarize: we'll stage a proposal in the ECS metrics branch while we sort out the metrics for each runtime. For now we're going with individual namespaces for each runtime because the metrics are going to be quite different between runtimes.
When there's something concrete for each runtime we can look at deriving common things and we'll discuss inclusion into ECS proper.


webmat commented Jun 12, 2019

We've started a publicly viewable spreadsheet here


ruflin commented Jun 13, 2019

One additional interesting note that came out of #477: Prometheus adds the units to its metric names (https://prometheus.io/docs/practices/naming/#metric-names). Perhaps to prevent future conflicts between object and keyword mappings we could also use an underscore notation like _seconds for the fields instead of the dot notation. Mainly thinking out loud; I still prefer what we have come up with so far.
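
For context on the objects-vs-keyword concern: in an Elasticsearch mapping a field cannot be both a leaf value and an object, so dotted metric names can conflict when one name is a prefix of another; a Prometheus-style underscore suffix avoids the nesting. A hypothetical illustration:

```python
# Hypothetical illustration: with dot notation, once "process.cpu.ticks" is
# mapped as a number, "process.cpu.ticks.count" cannot be added, because
# "ticks" would have to be both a leaf field and an object.
conflicting = ["process.cpu.ticks", "process.cpu.ticks.count"]

# With an underscore suffix for the unit, the names no longer nest into
# each other, so no mapping conflict can arise.
non_conflicting = ["process.cpu_ticks", "process.cpu_ticks_count"]
```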


axw commented Jun 28, 2019

@ruflin pointed out that we have more JVM metrics recorded for Logstash via its Node Stats API: https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#jvm-stats. All very similar to what the APM Java agent is reporting, or will report in the future.


ruflin commented Oct 29, 2019

Elasticsearch has a new histogram field in the works. This could become relevant for some of the metrics: elastic/elasticsearch#48580


webmat commented Nov 21, 2019

@axw @roncohen @watson Hey folks, I'd love to see this move forward again. Would be happy to help you on this :-)


ruflin commented May 2, 2022


ruflin commented Sep 8, 2022

I think it is really time we move forward here and come to a conclusion. We now also use some system metrics in the upcoming host view: elastic/kibana#138173

I know this issue was initially opened in the context of APM and system metrics, but to reduce scope I propose to scale this down to only a minimal set of system metrics. What @neptunian uses in elastic/kibana#138173 seems like a good initial set.

@lalit-satapathy This would also help us standardise on some host metrics from OTel.

ruflin self-assigned this Sep 8, 2022

kgeller commented Sep 20, 2022

Hi @ruflin! I think that introducing a minimal set of system metrics into ECS would be nice.

While not exactly the same, there's currently a stage 1 RFC for adding cgroup metrics into ECS.


ruflin commented Jan 5, 2023

I opened a proposal PR for host metrics here: #2129. It is still in draft mode, but I keep adding content and will convert it to a full RFC soonish.

dijkstrajesse commented:

@ruflin Any updates on this? I'm working on a couple of projects for infra (network/file/system) monitoring and it would be very helpful for customers to be able to refer to ECS metric field best practices instead of looking up system metrics here: https://docs.elastic.co/en/integrations/system.


ruflin commented Aug 2, 2023

@ChrsMark Can you chime in here? I think there are now some moving parts related to the merger with SemConv.


ChrsMark commented Aug 2, 2023

Hey! There is a WG taking place these weeks in Otel SemConv: open-telemetry/opentelemetry-specification#3556
As part of this, we keep track of the System metrics alignment with Otel SemConv. I guess that's the best place to keep track of any changes.
