Add readme for APM metrics (#103400)
this commit adds documentation about the APM metrics usage in Elasticsearch
pgomulka committed Dec 14, 2023
1 parent 9d27c2f commit c39a595
Showing 1 changed file with 171 additions and 0 deletions.
171 changes: 171 additions & 0 deletions modules/apm/METERING.md
# Metrics in Elasticsearch

Elasticsearch exposes a metrics API in the server's `org.elasticsearch.telemetry.metric` package
(perhaps we should move it to lib?).
This package contains the base classes and interfaces for creating and working with metrics.
Please refer to the javadocs of the classes in that package for more details.
The entry point for working with metrics is `MeterRegistry`.

## Implementation
We use Elastic's apm-java-agent as the implementation of the API we expose.
The implementation can be found in `:modules:apm`.
The apm-java-agent buffers metrics and sends them to the APM server once every `metrics_interval`.
The interval is configured via the `tracing.apm.agent.metrics_interval` setting.
The agent also collects a number of JVM metrics,
see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm


## How to choose an instrument

We support various instruments and might add more as we go.
Choosing the right instrument is not always easy, as the differences are often subtle.
A simplified decision procedure could look as follows:

1. You want to measure something (an absolute value)
    1. Values are non-additive
        1. Use a gauge
        2. Example: a CPU temperature
    2. Values are additive
        1. Use an asynchronous counter
        2. Example: total number of requests
2. You want to count something
    1. Values are monotonically increasing
        1. Use a counter
        2. Example: recording a failed authentication count
    2. Values can decrease
        1. Use an UpDownCounter
        2. Example: number of orders in a queue
3. You want to record statistics
    1. Use a histogram
    2. Example: statistics about how long it took to access a value from a cache

Refer to https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/#instrument-selection
for more details.
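
The selection steps above can be sketched as a small decision function. This is only an illustration: `InstrumentChooser`, its enums and its parameters are hypothetical names invented for this example and are not part of the Elasticsearch metrics API.

```java
/** Hypothetical helper mirroring the selection steps above; not part of Elasticsearch. */
final class InstrumentChooser {
    enum Goal { MEASURE_ABSOLUTE_VALUE, COUNT_EVENTS, RECORD_STATISTICS }

    enum Instrument { GAUGE, ASYNC_COUNTER, COUNTER, UP_DOWN_COUNTER, HISTOGRAM }

    /**
     * @param goal        what you are trying to do
     * @param additive    only used when measuring: do values from different sources sum meaningfully?
     * @param canDecrease only used when counting: can the count go down?
     */
    static Instrument choose(Goal goal, boolean additive, boolean canDecrease) {
        switch (goal) {
            case MEASURE_ABSOLUTE_VALUE:
                // e.g. total number of requests (additive) vs. a CPU temperature (non-additive)
                return additive ? Instrument.ASYNC_COUNTER : Instrument.GAUGE;
            case COUNT_EVENTS:
                // e.g. orders in a queue (can decrease) vs. failed authentications (only grows)
                return canDecrease ? Instrument.UP_DOWN_COUNTER : Instrument.COUNTER;
            case RECORD_STATISTICS:
                // e.g. how long a cache lookup took
                return Instrument.HISTOGRAM;
            default:
                throw new IllegalArgumentException("unknown goal: " + goal);
        }
    }
}
```

For instance, `choose(Goal.COUNT_EVENTS, false, true)` yields `UP_DOWN_COUNTER`, matching the "number of orders in a queue" example.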

## How to name an instrument
See the naming guidelines for metrics:
https://docs.google.com/document/d/1jKxuaZi7QAMIRD_Eq3nonkYswVlQXW5TWllJEicoOtM/edit#heading=h.jxn90hx2ayic

### Restarts and overflows
If the instrument is chosen correctly, the APM server will be able to determine whether the metrics
were reset (i.e. the node was restarted) or a counter overflowed
(the metric in ES might use an int internally, while the APM backend might use a long).
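
To see why the choice of instrument matters here, consider a minimal sketch of a convention commonly used for cumulative, monotonic counters: a sample that is lower than the previous one can only mean a reset. This is an assumption made for illustration and not necessarily how the APM server handles it; `CounterDeltas` is a hypothetical name.

```java
/** Hypothetical sketch of reset detection for a cumulative, monotonic counter. */
final class CounterDeltas {
    /**
     * Returns how much the counter grew between two consecutive cumulative samples.
     * A monotonic counter never decreases, so a drop is interpreted as a restart from zero.
     */
    static long delta(long previous, long current) {
        if (current >= previous) {
            return current - previous;
        }
        // The value went down: assume the counter was reset and has since counted up to 'current'.
        return current;
    }
}
```

The same reasoning would not work for a gauge or an UpDownCounter, where a lower value is perfectly legitimate.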

## How to use an instrument
There are two ways of using an instrument, depending on its type.
- For a synchronous instrument (counter/UpDownCounter) we register the instrument with
`MeterRegistry` and use the returned object to increment the value of that instrument
```java
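MeterRegistry registry; // assume a MeterRegistry instance is available here
// register once: name, description, unit
LongCounter longCounter = registry.registerLongCounter("es.test.requests.count", "a test counter", "count");
// increment by one, without attributes
longCounter.increment();
// increment with attributes attached to the reported value
longCounter.incrementBy(1, Map.of("name", "Alice"));
longCounter.incrementBy(1, Map.of("name", "Bob"));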
```

- For an asynchronous instrument (gauge/AsynchronousCounter) we register the instrument
and provide a callback that reports the absolute measured value.
This callback has to be provided upon registration and cannot be changed later.
```java
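MeterRegistry registry; // assume a MeterRegistry instance is available here
long someValue = 1;
// the callback is invoked on every metric event and reports the current absolute value
registry.registerLongGauge("es.test.cpu.temperature", "a test gauge", "celsius",
    () -> new LongWithAttributes(someValue, Map.of("cpuNumber", 1)));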
```

If we don't have access to the state that would be fetched on a metric event (when the callback is executed),
we can use the `LongGaugeMetric` utility and set the value explicitly:
```java
MeterRegistry meterRegistry;
LongGaugeMetric longGaugeMetric = LongGaugeMetric.create(meterRegistry, "es.test.gauge", "a test gauge", "total value");
longGaugeMetric.set(123L);
```
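
Conceptually, such a utility bridges the two styles: it keeps the latest value in a holder and hands the registry a callback that reads from it. The following self-contained sketch illustrates the idea only; `SettableGaugeSketch` is a hypothetical class, and the actual `LongGaugeMetric` implementation may differ.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a settable gauge built on top of a callback; not the real LongGaugeMetric. */
final class SettableGaugeSketch {
    // Holds the most recently set value; safe to update from any thread.
    private final AtomicLong value = new AtomicLong();

    /** The callback that would be handed to the registry once, at registration time. */
    LongSupplier callback() {
        return value::get;
    }

    /** Called by application code whenever a new measurement is available. */
    void set(long newValue) {
        value.set(newValue);
    }
}
```

Each time the callback is executed it reports whatever was last passed to `set`, so the code producing the value does not need to be reachable from the callback itself.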
### The use of attributes aka dimensions
Each instrument can attach attributes to a reported value. This helps when drilling down into the details
of the value that was reported during the metric event.
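
As a rough mental model, each distinct set of attributes behaves like its own series that can be inspected separately or summed into a total. The sketch below is self-contained and purely illustrative; `AttributedCounterSketch` and its methods are hypothetical and do not reflect how the agent stores values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: a counter that keeps one running sum per attribute set. */
final class AttributedCounterSketch {
    // Key: an immutable copy of the attributes; value: the running sum for that attribute set.
    private final Map<Map<String, Object>, LongAdder> series = new ConcurrentHashMap<>();

    void incrementBy(long inc, Map<String, Object> attributes) {
        series.computeIfAbsent(Map.copyOf(attributes), k -> new LongAdder()).add(inc);
    }

    /** Drill-down: the value reported for one specific attribute set. */
    long valueFor(Map<String, Object> attributes) {
        LongAdder adder = series.get(attributes);
        return adder == null ? 0L : adder.sum();
    }

    /** Aggregate view: the sum over all attribute sets. */
    long total() {
        return series.values().stream().mapToLong(LongAdder::sum).sum();
    }
}
```

With the increments for `"name" -> "Alice"` and `"name" -> "Bob"` from the counter example, `valueFor` would return the per-name counts while `total` would return their sum.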


## Development

### Fake http server

The quickest way to verify that your metrics are working is to run `./gradlew run --with-apm-server`.
This runs an ES node (or nodes in serverless) and also starts a fake HTTP server that acts
as an APM server. The fake HTTP server logs all the HTTP messages it receives from the apm-agent.

### With APM server in cloud
You can also run a local ES node against an APM server in cloud.
Create a new deployment in cloud, then click the 'hamburger' menu on the left, scroll to Observability and click APM under it.
In the upper right corner there is an `Add data` link; click it, then scroll down to the `ApmAgents` section and pick Java.
There you should see `elastic.apm.secret_token` and `elastic.apm.server_url`. You will use them in the next step.

Next, create a file `apm_server_ess.gradle`
in a directory outside your elasticsearch checkout (so that branch changes don't remove it).
The content of the file:
```groovy
rootProject {
    if (project.name == 'elasticsearch') {
        afterEvaluate {
            testClusters.matching { it.name == "runTask" }.configureEach {
                setting 'xpack.security.audit.enabled', 'true'
                keystore 'tracing.apm.secret_token', 'REDACTED'
                setting 'telemetry.metrics.enabled', 'true'
                setting 'tracing.apm.agent.server_url', 'https://REDACTED:443'
            }
        }
    }
}
```
Replace the REDACTED values with the secret_token and server_url from the previous step.

You can then run your local ES node with APM in ESS by passing that file as an init script
(adjust the path to wherever you saved it):
`./gradlew run -I ../apm_server_ess.gradle`

#### An init.d gradle setup

Alternatively, you can put the configuration in your `~/.gradle/init.d/apm.gradle`:
```groovy
rootProject {
    if (project.name == 'elasticsearch' && Boolean.getBoolean('metrics.enabled')) {
        afterEvaluate {
            testClusters.matching { it.name == "runTask" }.configureEach {
                setting 'xpack.security.audit.enabled', 'true'
                keystore 'tracing.apm.secret_token', 'TODO-REPLACE'
                setting 'telemetry.metrics.enabled', 'true'
                setting 'tracing.apm.agent.server_url', 'https://TODO-REPLACE-URL.apm.eastus2.staging.azure.foundit.no:443'
            }
        }
    }
}
```

Example usage:
```
./gradlew :run -Dmetrics.enabled=true
```




#### Logging
Whichever approach you took to run ES with APM, you will find an apm-agent.json file
in the ES logs directory. If there are any problems connecting to APM, you will see WARN/ERROR messages there.
We run the apm-agent with logging at WARN level, so normally you should not see any logs in that file.

When running ES in cloud, the logs are also indexed in a logging cluster, so you will be able to find them
in Kibana. The `datastream.dataset` is `elasticsearch.apm_agent`.


### Testing
We currently provide a base `TestTelemetryPlugin`, which should help you write an integration test.
See `S3BlobStoreRepositoryTests` for an example.




# Links and further reading
https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/

https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html
