-
Notifications
You must be signed in to change notification settings - Fork 24.3k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add readme for APM metrics (#103400)
this commit adds documentation about the APM metrics usage in Elasticsearch
- Loading branch information
Showing
1 changed file
with
171 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
# Metrics in Elasticsearch | ||
|
||
Elasticsearch has the metrics API available in server's (perhaps we should move to lib?) package | ||
`org.elasticsearch.telemetry.metric`. | ||
This package contains base classes/interfaces for creating and working with metrics. | ||
Please refer to the javadocs provided in these classes in that package for more details. | ||
The entry point for working with metrics is `MeterRegistry`. | ||
|
||
## Implementation | ||
We use elastic's apm-java-agent as an implementation of the API we expose. | ||
the implementation can be found in `:modules:apm` | ||
The apm-java-agent is responsible for buffering metrics and upon metrics_interval | ||
send them over to apm server. | ||
Metrics_interval is configured via a `tracing.apm.agent.metrics_interval` setting | ||
The agent also collects a number of JVM metrics. | ||
see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm | ||
|
||
|
||
## How to choose an instrument | ||
|
||
We support various instruments and might be adding more as we go. | ||
The choice of the right instrument is not always easy as often differences are subtle. | ||
The simplified algorithm could be as follows: | ||
|
||
1. You want to measure something (absolute value) | ||
1. values are non-additive | ||
1. use a gauge | ||
2. Example: a cpu temperature | ||
2. values are additive | ||
1. use asynchronous counter | ||
2. Example: total number of requests | ||
2. You want to count something | ||
1. values are monotonously increasing | ||
1. use a counter | ||
2. Example: Recording a failed authentication count | ||
2. values can be decreased | ||
1. use UpDownCounter | ||
2. Example: Number of orders in a queue | ||
3. You want to record a statistics | ||
1. use a histogram | ||
1. Example: A statistics about how long it took to access a value from cache | ||
|
||
refer to https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/#instrument-selection | ||
for more details | ||
|
||
## How to name an instrument | ||
See the naming guidelines for metrics: | ||
https://docs.google.com/document/d/1jKxuaZi7QAMIRD_Eq3nonkYswVlQXW5TWllJEicoOtM/edit#heading=h.jxn90hx2ayic | ||
|
||
### Restarts and overflows | ||
if the instrument is correctly chosen, the apm server will be able to determine if the metrics | ||
were restarted (i.e. node was restarted) or there was a counter overflow | ||
(the metric in ES might use an int internally, but apm backend might have a long ) | ||
|
||
## How to use an instrument | ||
There are 2 types of usages of an instrument depending on a type. | ||
- For synchronous instrument (counter/UpDownCounter) we need to register an instrument with | ||
`MeterRegistry` and use the returned value to increment a value of that instrument | ||
```java | ||
MeterRegistry registry; | ||
LongCounter longCounter = registry.registerLongCounter("es.test.requests.count", "a test counter", "count"); | ||
longCounter.increment(); | ||
longCounter.incrementBy(1, Map.of("name", "Alice")); | ||
longCounter.incrementBy(1, Map.of("name", "Bob")); | ||
``` | ||
|
||
- For asynchronous instrument (gauge/AsynchronousCounter) we register an instrument | ||
and have to provide a callback that will report the absolute measured value. | ||
This callback has to be provided upon registration and cannot be changed. | ||
```java | ||
MeterRegistry registry; | ||
long someValue = 1; | ||
registry.registerLongGauge("es.test.cpu.temperature", "a test gauge", "celcius", | ||
() -> new LongWithAttributes(someValue, Map.of("cpuNumber", 1))); | ||
``` | ||
|
||
If we don’t have access to ‘state’ that will be fetched on metric event (when callback is executed) | ||
we can use a utility LongGaugeMetric or LongGaugeMetric | ||
```java | ||
MeterRegistry meterRegistry ; | ||
LongGaugeMetric longGaugeMetric = LongGaugeMetric.create(meterRegistry, "es.test.gauge", "a test gauge", "total value"); | ||
longGaugeMetric.set(123L); | ||
``` | ||
### The use of attributes aka dimensions | ||
Each instrument can attach attributes to a reported value. This helps drilling down into the details | ||
of value that was reported during the metric event | ||
|
||
|
||
## Development | ||
|
||
### Fake http server | ||
|
||
The quickest way to verify that your metrics are working is to run `./gradlew run --with-apm-server`. | ||
This will run ES node (or nodes in serverless) and also start a fake http server that will act | ||
as an apm server. This fake http server will log all the http messages it receives from apm-agent | ||
|
||
### With APM server in cloud | ||
You can also run local ES node with an apm server in cloud. | ||
Create a new deployment in cloud, then click the 'hamburger :)' on the left, scroll to Observability and click APM under it. | ||
At the upper right corner there is `Add data` link, then scroll down to `ApmAgents` section and pick Java | ||
There you should be able to see `elastic.apm.secret_token` and `elastic.apm.server_url. You will use them in the next step. | ||
|
||
Next you should create a file `apm_server_ess.gradle` | ||
in a different directory than your elasticsearch checkout (so that branch changes don't remove it) | ||
The content of the file: | ||
``` | ||
rootProject { | ||
if (project.name == 'elasticsearch') { | ||
afterEvaluate { | ||
testClusters.matching { it.name == "runTask" }.configureEach { | ||
setting 'xpack.security.audit.enabled', 'true' | ||
keystore 'tracing.apm.secret_token', 'REDACTED' | ||
setting 'telemetry.metrics.enabled', 'true' | ||
setting 'tracing.apm.agent.server_url', 'https://REDACTED:443' | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
Use the secret_token and server_url (REDACTED) from previous step. | ||
|
||
you can run your local ES node with APM in ESS with this command | ||
`./gradlew run -I ../apm_enable_statefull.gradle` | ||
|
||
#### An init.d gradle setup | ||
|
||
Alternatively you can edit your `~/.gradle/init.d/apm.gradle` | ||
```groovy | ||
rootProject { | ||
if (project.name == 'elasticsearch' && Boolean.getBoolean('metrics.enabled')) { | ||
afterEvaluate { | ||
testClusters.matching { it.name == "runTask" }.configureEach { | ||
setting 'xpack.security.audit.enabled', 'true' | ||
keystore 'tracing.apm.secret_token', 'TODO-REPLACE' | ||
setting 'telemetry.metrics.enabled', 'true' | ||
setting 'tracing.apm.agent.server_url', 'https://TODO-REPLACE-URL.apm.eastus2.staging.azure.foundit.no:443' | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
The example use: | ||
``` | ||
./gradlew :run -Dmetrics.enabled=true | ||
``` | ||
|
||
|
||
|
||
|
||
#### Logging | ||
with any approach you took to run your ES with APM you will find apm-agent.json file | ||
in ES's logs directory. If there are any problems with connecting to APM you will see WARN/ERROR messages. | ||
We run apm-agent with logs at WARN level, so normally you should not see any logs there. | ||
|
||
When running ES in cloud, logs are being also indexed in a logging cluster, so you will be able to find them | ||
in kibana. The `datastream.dataset` is `elasticsearch.apm_agent` | ||
|
||
|
||
### Testing | ||
We currently provide a base `TestTelemetryPlugin` which should help you write an integration test. | ||
See an example `S3BlobStoreRepositoryTests` | ||
|
||
|
||
|
||
|
||
# Links and further reading | ||
https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/ | ||
|
||
https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html |