Schema for metrics #474

Open
axw opened this issue May 27, 2019 · 21 comments

Comments


axw commented May 27, 2019

In elastic/beats#11836 we identified the need to maintain some kind of central definition of metrics that are implemented by both Beats and Elastic APM. The APM agents now produce a subset of system.cpu.*, system.memory.*, and system.process.* metrics defined by Metricbeat. The APM agents are now starting to define language/runtime-specific metrics, and Metricbeat will also start producing at least some of these.

We would like to propose extending ECS to cover metrics.

Organisation

I propose we introduce new field sets for:

  • JVM (jvm.*)
  • Go (golang.*)
  • Node.js (nodejs.*)
  • .NET (dotnet.*)
  • Python (python.*)
  • Ruby (ruby.*)

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Existing system metrics should be renamed to fit into the existing ECS field sets, e.g.

  • system.cpu.* -> host.cpu.*
  • system.memory.* -> host.memory.*
  • system.process.* -> process.*
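
To make this concrete, a single metrics document under the proposed organisation might look roughly like the sketch below. All of the leaf field names and nesting are illustrative only (settling them is exactly what this issue is about); it is shown as a Python dict for brevity.

```python
# Illustrative only: a metrics document using the proposed ECS-style names.
# None of the leaf fields below are settled; they mirror the renames and
# runtime field sets proposed above.
metrics_doc = {
    "@timestamp": "2019-05-27T00:00:00.000Z",
    "host": {
        "cpu": {"usage": {"pct": 0.37}},               # was system.cpu.*
        "memory": {"used": {"bytes": 4_294_967_296}},  # was system.memory.*
    },
    "process": {
        "pid": 1234,
        "cpu": {"user": {"ticks": 9130}},              # was system.process.cpu.*
    },
    "nodejs": {                                        # runtime-specific field set
        "eventloop": {"delay": {"ms": 1.2}},
    },
}
```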

Naming conventions

Ideally, we would extend the conventions documentation to cover metric naming. Specifically, we should ensure consistent use of units in metric names, so that metrics are discoverable and self-documenting, both of which are important in Metrics Explorer.

One challenge here is that some existing metrics have inconsistent naming. For example:

  • system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count)
  • system.process.cpu.total.value (.value isn't meaningful, and there's no unit)
  • system.memory.total vs. system.memory.used.bytes (both are bytes, but only one says so in the name)

We should review https://github.com/elastic/beats/blob/master/docs/devguide/event-conventions.asciidoc#standardised-names, revising it for inclusion in the ECS conventions.

Key points:

  • all metrics must specify a unit. This means finding an alternative to the following rule from the Beats conventions, since "value" is not a unit:

    If a field name matches the namespace used for nested fields, add .value to the field name.

  • units are not necessarily SI; they may be domain-specific, e.g. "objects" and "mallocs" are fine

As an alternative to including units in names, we could wait for elastic/elasticsearch#31244 or elastic/elasticsearch#33267. However, there are existing fields in ECS that already include units (network.bytes, http.request.bytes, ...), so it may be best not to wait, and make that change across the board in a future revision.
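
As a rough sketch of how a units-in-names convention could be checked mechanically, assuming a purely hypothetical allow-list of unit suffixes (the real list would come out of the conventions work above):

```python
# Hypothetical allow-list of unit suffixes; not a proposed standard, and it
# would include domain-specific units as noted above.
UNIT_SUFFIXES = {"bytes", "ms", "sec", "pct", "ticks", "count", "objects", "mallocs"}

def has_unit_suffix(field_name: str) -> bool:
    """Return True if the metric field name ends in a recognised unit."""
    return field_name.rsplit(".", 1)[-1] in UNIT_SUFFIXES

# Examples from the naming issues listed above:
assert has_unit_suffix("system.memory.used.bytes")
assert not has_unit_suffix("system.memory.total")             # no unit in the name
assert not has_unit_suffix("system.process.cpu.total.value")  # "value" is not a unit
```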

Open questions:

  • What do we do about existing non-compliant naming? A few options:
    • Rename in Elastic Stack 8.0. This may mean breaking existing dashboards, or introducing field aliases. This option is implied by the proposal above to split the system metrics into host.* and process.* field sets.
    • Include existing metrics as-is, with a TODO to make their naming consistent in the future.
    • Don't include these existing metrics in ECS.

Metrics implementation guide

The definition of metrics is sometimes subtle and platform-specific; without providing a guide to this, it is easy for inconsistencies to arise in implementations. We should provide a detailed guide to calculating these metrics either in ECS, or in a companion document linked from ECS.
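
For example, something as familiar as process CPU utilisation hides several platform-specific choices (tick resolution, which counters are summed, whether to normalise by core count). A minimal sketch of one possible calculation, purely to show the kind of detail such a guide would need to pin down:

```python
def cpu_norm_pct(prev_ticks: int, curr_ticks: int,
                 interval_ms: float, num_cores: int,
                 ticks_per_second: float = 100.0) -> float:
    """CPU usage over one sampling interval, normalised to the range [0, 1].

    Illustrative assumptions: ticks are cumulative, tick resolution is
    100 Hz (e.g. Linux USER_HZ), and usage is normalised by core count.
    Each of these is exactly the kind of detail an implementation guide
    would have to specify.
    """
    cpu_seconds = (curr_ticks - prev_ticks) / ticks_per_second
    wall_seconds = interval_ms / 1000.0
    return cpu_seconds / wall_seconds / num_cores

# e.g. 50 ticks over a 1-second interval on a 4-core host -> 0.125
print(cpu_norm_pct(prev_ticks=1000, curr_ticks=1050, interval_ms=1000, num_cores=4))
```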


watson commented May 27, 2019

Great write-up @axw 😃

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

  • system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count)

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

I also just want to mention that where a metric an APM agent wants to collect exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of repeating the same metric under each language namespace. E.g. in the Node.js agent we want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.


axw commented May 27, 2019

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

I just meant that the field sets (e.g. host.*, jvm.*) would not be restricted to only metrics or only non-metrics. For example, https://www.elastic.co/guide/en/ecs/current/ecs-network.html has network.bytes and network.name.

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

I also just want to mention that where a metric an APM agent wants to collect exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of repeating the same metric under each language namespace. E.g. in the Node.js agent we want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.

👍 In this specific example, I'd say those metric fields belong in the process field set.


ruflin commented May 27, 2019

Thanks for this great writeup @axw. I think it's a good time to challenge the Metricbeat naming conventions and improve them. I would really like to have elastic/elasticsearch#31244, but I agree we should not be blocked by it, and it is a reason some fields in ECS already have .bytes today.

For the migration: We have the fields alias feature so I would directly start with the names we want in the future and potentially alias old fields (if possible).

To move this effort forward we could start with a PR for one specific environment like nodejs or golang (or both). Having specific fields to discuss will also indirectly drive the discussion about building / updating the conventions. @axw @watson Would you be willing to open PRs against https://github.com/elastic/ecs/tree/ecs-metrics ?


axw commented May 28, 2019

For Go it's a little tricky, since we've got "golang" metrics coming from Metricbeat already. There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)


watson commented May 28, 2019

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

Gotcha, it's just that you wrote .ticks.count, which confused me. But I guess that was a typo, right?

There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)

If you could take it, it would be great 👍


webmat commented May 28, 2019

Thanks for this detailed writeup, @axw! I love the attention given to improving the consistency of the naming of the various metrics.

One thing I'd like to challenge is the creation of metric namespaces per language. In ECS, I would much prefer having one place to nest runtime metrics, which we fill as best we can depending on the runtime/language. The goal here is not to mix/consume the metrics of different runtimes together (although that opens up the possibility). In most cases, you'd of course filter for JVM stuff for a JVM-centric dashboard, for example (e.g. runtime.name:jvm). But having all runtimes fill the same common schema would enable the creation of common content or pipelines.

This may sound like a 180 degree change from this comment I made in April. But back then, I didn't expect this to be proposed to be added to ECS right away.

Right now I'm seeing a proposal to add what's essentially an unbounded number of field sets to ECS, for the metrics of all the runtimes. I'd much rather have the common part of these metricsets added to ECS in one common place, like runtime, and have each runtime that has more metrics to expose than these add them as custom fields in solutions.
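
As a sketch of the shape of that suggestion (all field names below are hypothetical, just to show two runtimes filling one common namespace and keeping their extras as custom fields):

```python
# Hypothetical events: a shared runtime.* field set filled by every runtime,
# with anything runtime-specific kept as custom fields outside ECS.
jvm_event = {
    "runtime": {
        "name": "jvm",
        "version": "11.0.3",
        "heap": {"used": {"bytes": 268_435_456}},
        "gc": {"pause": {"ms": 42}},
    },
}
nodejs_event = {
    "runtime": {
        "name": "nodejs",
        "version": "12.3.1",
        "heap": {"used": {"bytes": 50_331_648}},
    },
    # runtime-specific extras, e.g. event-loop delay, would live here as
    # custom fields rather than in ECS
}
```

A JVM-centric dashboard would then filter on runtime.name:jvm, while shared content could be built once against the common fields.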

One thing we have steered away from in the Elastic Common Schema is including brand or technology names (e.g. so it's container.*, not docker.*). There's no commonality in brand names :-)


axw commented May 29, 2019

I'd much rather have the common part of these metricsets added to ECS in one common place, like runtime, and have each runtime that has more metrics to expose than these add them as custom fields in solutions.

If we do that, we're kind of back to square one on elastic/beats#11836. What we're trying to do here is define a common dictionary of metrics, many of which will be runtime-specific, e.g. "number of goroutines". Even "event loop delay" - while it may be common to a subset of languages, it's definitely not common to all, so it probably doesn't belong directly on "runtime".

If it were just APM implementing these metrics I'd be inclined to agree that we could just document the runtime-specific metrics separately. In fact, that's what we started out doing. But Metricbeat will start gathering some of these metrics too.

@webmat Is the issue you have with "[adding] what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes" about the effect it'll have on the readability of the docs? (If so is it a matter of reorganising things somehow, to be nested under a common "runtime" namespace?) Or something else?


ruflin commented May 29, 2019

I think we should stick to the plan for now to have one prefix for each runtime and we will figure out the generalisation later on. I still think we should do the same for kubernetes etc.

But there is the problem @webmat mentioned about ending up with too many fields in the ECS template. I think it's time to come up with a solution here, for example introducing different parts of ECS and potentially having one template for each, so someone can use only the basic fields if they want, or also get metrics.

In any case, I think this discussion should not block progress on the metrics branch, as having the fields already there will make the discussion more concrete.

roncohen commented:

To summarize: we'll stage a proposal in the ECS metrics branch while we sort out the metrics for each runtime. For now we're going with individual namespaces for each runtime because the metrics are going to be quite different between runtimes.
When there's something concrete for each runtime we can look at deriving common things and we'll discuss inclusion into ECS proper.


webmat commented Jun 12, 2019

We've started a publicly viewable spreadsheet here


ruflin commented Jun 13, 2019

One additional interesting note that came out of #477: Prometheus adds the units to its metric names (https://prometheus.io/docs/practices/naming/#metric-names). Perhaps to prevent future conflicts between object and keyword mappings we could also use an underscore notation like _seconds for the fields instead of the dot notation. Mainly thinking out loud; I still prefer what we have come up with so far.
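
For context on the objects-vs-keyword concern: in an Elasticsearch mapping a field cannot be both a leaf value and an object, so dotted metric names can conflict when one name is a prefix of another; a Prometheus-style underscore suffix avoids the nesting. A hypothetical illustration:

```python
# Hypothetical illustration: with dot notation, once "process.cpu.ticks" is
# mapped as a number, "process.cpu.ticks.count" cannot be added, because
# "ticks" would have to be both a leaf field and an object.
conflicting = ["process.cpu.ticks", "process.cpu.ticks.count"]

# With an underscore suffix for the unit, the names no longer nest into
# each other, so no mapping conflict can arise.
non_conflicting = ["process.cpu_ticks", "process.cpu_ticks_count"]
```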


axw commented Jun 28, 2019

@ruflin pointed out that we have more JVM metrics recorded for Logstash via its Node Stats API: https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#jvm-stats. All very similar to what the APM Java agent is reporting, or will report in the future.


ruflin commented Oct 29, 2019

Elasticsearch has a new histogram field in the works. This could become relevant for some of the metrics: elastic/elasticsearch#48580


webmat commented Nov 21, 2019

@axw @roncohen @watson Hey folks, I'd love to see this move forward again. Would be happy to help you on this :-)


ruflin commented May 2, 2022


ruflin commented Sep 8, 2022

I think it is really time we move forward here and come to a conclusion. We now also use some system metrics in the upcoming host view: elastic/kibana#138173

I know this issue was initially opened in the context of APM and system metrics, but to reduce scope I propose to scale this down to only a minimal set of system metrics. What @neptunian uses in elastic/kibana#138173 seems like a good initial set.

@lalit-satapathy This would also help us standardise on some host metrics from OTel.

ruflin self-assigned this Sep 8, 2022

kgeller commented Sep 20, 2022

Hi @ruflin! I think that introducing a minimal set of system metrics into ECS would be nice.

While not exactly the same, there's currently a stage 1 RFC for adding cgroup metrics into ECS.


ruflin commented Jan 5, 2023

I opened a proposal PR for host metrics here: #2129. It is still in draft mode, but I keep adding content and will convert it to a full RFC soonish.

dijkstrajesse commented:

@ruflin Any updates on this? I'm working on a couple of projects for infra (network/file/system) monitoring and it would be very helpful for customers to be able to refer to ECS metric field best practices instead of looking up system metrics here: https://docs.elastic.co/en/integrations/system.


ruflin commented Aug 2, 2023

@ChrsMark Can you chime in here? I think there are now some moving parts related to the merger with SemConv.


ChrsMark commented Aug 2, 2023

Hey! There is a WG taking place these weeks in Otel SemConv: open-telemetry/opentelemetry-specification#3556
As part of this, we keep track of the System metrics alignment with Otel SemConv. I guess that's the best place to keep track of any changes.
