- Transactions
- Transactions in service inventory page
- System metrics
- Span breakdown metrics
- Service destination metrics
- Common filters
Elastic APM agents capture different types of information from within their instrumented applications. These are known as events, and can be spans, transactions, errors, or metrics. You can find more information here.
You can run the example queries on the edge cluster or any other cluster that contains APM data.
Transactions are stored in two different formats:
A single transaction event where transaction.duration.us is the latency.
{
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "transaction",
"transaction.duration.us": 2000,
"event.outcome": "success"
}
or
A pre-aggregated document where _doc_count is the number of transaction events, and transaction.duration.histogram is the latency distribution.
{
"_doc_count": 2,
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "metric",
"metricset.name": "transaction",
"transaction.duration.histogram": {
"counts": [1, 1],
"values": [2000, 3000]
},
"event.outcome": "success"
}
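To make the latency distribution concrete: a histogram can be reduced to an average by weighting each value by its count, which is also how the Elasticsearch avg aggregation treats histogram fields. The following is a minimal Python sketch of that arithmetic only; the function name `histogram_avg` is hypothetical and not part of APM or Kibana.

```python
def histogram_avg(histogram):
    """Weighted average of a {"counts": [...], "values": [...]} histogram."""
    counts = histogram["counts"]
    values = histogram["values"]
    total = sum(counts)
    if total == 0:
        return None  # no transactions recorded in this document
    # Each value contributes as many times as its count says.
    return sum(c * v for c, v in zip(counts, values)) / total


# The histogram from the pre-aggregated example document above.
doc = {"transaction.duration.histogram": {"counts": [1, 1], "values": [2000, 3000]}}
print(histogram_avg(doc["transaction.duration.histogram"]))  # 2500.0
```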
You can find all the APM transaction fields here.
The decision to use aggregated transactions or not is determined in getSearchTransactionsEvents and then used to specify the transaction index and the latency field.
Latency is the duration of a transaction. This can be calculated using transaction events or metric events (aggregated transactions).
Noteworthy fields: transaction.duration.us, transaction.duration.histogram
GET apm-*-transaction-*,traces-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [{ "terms": { "processor.event": ["transaction"] } }]
}
},
"aggs": {
"latency": { "avg": { "field": "transaction.duration.us" } }
}
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "term": { "metricset.name": "transaction" } }
]
}
},
"aggs": {
"latency": { "avg": { "field": "transaction.duration.histogram" } }
}
}
Please note: metricset.name: transaction was only recently introduced. To retain backwards compatibility we still use the old filter { "exists": { "field": "transaction.duration.histogram" }} when filtering for aggregated transactions (see example).
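As an illustration of the two filter variants, here is a small Python sketch that builds the query body for aggregated transactions. The helper name `aggregated_transaction_filter` and its `legacy` flag are hypothetical; only the two filter clauses themselves come from the text above.

```python
def aggregated_transaction_filter(legacy=True):
    """Return the clause that selects aggregated transaction documents."""
    if legacy:
        # Older documents have no metricset.name, so test for the histogram field.
        return {"exists": {"field": "transaction.duration.histogram"}}
    return {"term": {"metricset.name": "transaction"}}


def latency_query(legacy=True):
    """Body of an average-latency search over aggregated transactions."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"processor.event": ["metric"]}},
                    aggregated_transaction_filter(legacy),
                ]
            }
        },
        "aggs": {
            "latency": {"avg": {"field": "transaction.duration.histogram"}}
        },
    }
```

The legacy variant matches both old and new documents, which is why it is the safer default in this sketch.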
Throughput is the number of transactions per minute. This can be calculated using transaction events or metric events (aggregated transactions).
Noteworthy fields: None (based on doc_count)
GET apm-*-transaction-*,traces-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [{ "terms": { "processor.event": ["transaction"] } }]
}
},
"aggs": {
"timeseries": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "60s"
},
"aggs": {
"throughput": {
"rate": {
"unit": "minute"
}
}
}
}
}
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "term": { "metricset.name": "transaction" } }
]
}
},
"aggs": {
"timeseries": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "60s"
},
"aggs": {
"throughput": {
"rate": {
"unit": "minute"
}
}
}
}
}
}
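Both queries above count documents per minute; for metric documents, _doc_count stands in for the number of original transaction events. The Python sketch below mirrors that arithmetic outside Elasticsearch. It assumes the documents have already been filtered to the time range, and the function name `throughput_per_minute` is hypothetical.

```python
def throughput_per_minute(docs, start_ms, end_ms):
    """Transactions per minute over [start_ms, end_ms]."""
    minutes = (end_ms - start_ms) / 60_000
    if minutes <= 0:
        raise ValueError("end_ms must be greater than start_ms")
    # A raw transaction event has no _doc_count and therefore counts as 1.
    total = sum(doc.get("_doc_count", 1) for doc in docs)
    return total / minutes


events = [{}, {}, {}]                            # three transaction events
metrics = [{"_doc_count": 2}, {"_doc_count": 4}]  # two pre-aggregated documents
print(throughput_per_minute(events, 0, 60_000))    # 3.0
print(throughput_per_minute(metrics, 0, 120_000))  # 3.0
```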
Failed transaction rate is the number of transactions with event.outcome=failure per minute.
Noteworthy fields: event.outcome
GET apm-*-transaction-*,traces-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [{ "terms": { "processor.event": ["transaction"] } }]
}
},
"aggs": {
"outcomes": {
"terms": {
"field": "event.outcome",
"include": ["failure", "success"]
}
}
}
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "term": { "metricset.name": "transaction" } }
]
}
},
"aggs": {
"outcomes": {
"terms": {
"field": "event.outcome",
"include": ["failure", "success"]
}
}
}
}
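The terms aggregation above returns one bucket per outcome. One way to turn those buckets into a single number is sketched below in Python. Note the assumption: this sketch expresses the result as the share of failed transactions among those with a known outcome, which is an interpretation and not something the queries above compute. The function name is hypothetical; the bucket shape (`key`, `doc_count`) is the standard terms aggregation response.

```python
def failed_transaction_share(buckets):
    """Share of failures among failure + success buckets, or None if empty."""
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    failed = counts.get("failure", 0)
    total = failed + counts.get("success", 0)
    if total == 0:
        return None  # no transactions with a known outcome
    return failed / total


buckets = [
    {"key": "success", "doc_count": 3},
    {"key": "failure", "doc_count": 1},
]
print(failed_transaction_share(buckets))  # 0.25
```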
Service transaction metrics are aggregated metric documents that hold latency and throughput metrics pivoted by service.name, service.environment and transaction.type. Additionally, agent.name and service.language.name are included as metadata.
We use the response from the GET /internal/apm/time_range_metadata endpoint to determine which data sources are available. A data source is considered available if it has data before the current time range or, when there is no data at all before the current time range, if it has data within the current time range. As a result, existing deployments use transaction metrics right after upgrading (instead of switching to service transaction metrics and showing a mostly blank screen), while new deployments immediately get the benefits of service transaction metrics instead of falling all the way back to transaction events.
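The availability rule can be sketched in Python as follows. This is an illustration under assumptions, not the Kibana implementation: the source names, the `before`/`during` flags, and the reading of "no data at all" as "no data in any source" are all hypothetical, and the real endpoint's response shape is not shown here.

```python
# Most aggregated source first, raw transaction events as the last resort.
PREFERRED_ORDER = [
    "service_transaction_metrics",
    "transaction_metrics",
    "transaction_events",
]


def available_sources(sources):
    """sources maps a source name to {"before": bool, "during": bool}."""
    any_before = any(s["before"] for s in sources.values())
    return {
        name
        for name, s in sources.items()
        if s["before"] or (not any_before and s["during"])
    }


def pick_source(sources):
    """Return the most preferred available source, or None."""
    available = available_sources(sources)
    for name in PREFERRED_ORDER:
        if name in available:
            return name
    return None


# Existing deployment right after an upgrade: the new metrics exist only
# inside the current time range, so the older metrics are still chosen.
upgraded = {
    "service_transaction_metrics": {"before": False, "during": True},
    "transaction_metrics": {"before": True, "during": True},
    "transaction_events": {"before": True, "during": True},
}
print(pick_source(upgraded))  # transaction_metrics
```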
A pre-aggregated document where _doc_count is the number of transaction events:
{
"_doc_count": 4,
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "metric",
"metricset.name": "service_transaction",
"metricset.interval": "1m",
"service": {
"environment": "production",
"name": "web-go"
},
"transaction": {
"duration.summary": {
"sum": 1000,
"value_count": 4
},
"duration.histogram": {
"counts": [ 4 ],
"values": [ 250 ]
},
"type": "request"
},
"event": {
"success_count": {
"sum": 1,
"value_count": 2
}
}
}
- _doc_count is the number of bucket counts
- transaction.duration.summary is an aggregate_metric_double field and holds an aggregated transaction duration summary, for service transaction metrics
- event.success_count holds an aggregate metric double that describes the success rate. E.g., in this example, the success rate is 50% (1/2).
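The arithmetic behind both aggregate_metric_double fields is the same: divide sum by value_count. A minimal Python sketch using the values from the example document; the helper name `aggregate_avg` is hypothetical.

```python
def aggregate_avg(metric):
    """Average of an aggregate metric: sum / value_count, or None if empty."""
    if metric["value_count"] == 0:
        return None
    return metric["sum"] / metric["value_count"]


# Relevant parts of the service_transaction example document.
doc = {
    "transaction": {"duration.summary": {"sum": 1000, "value_count": 4}},
    "event": {"success_count": {"sum": 1, "value_count": 2}},
}

avg_latency_us = aggregate_avg(doc["transaction"]["duration.summary"])
success_rate = aggregate_avg(doc["event"]["success_count"])
print(avg_latency_us)  # 250.0
print(success_rate)    # 0.5
```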
In addition to service_transaction, service_summary metrics are also generated. Every service outputs these, even when it does not record any transaction (that also means there is no transaction data on this metric). This means that we can use service_summary to display services without transactions, i.e. services that only have app/system metrics or errors.
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [{ "term": { "metricset.name": "service_transaction" } }]
}
},
"aggs": {
"latency": { "avg": { "field": "transaction.duration.summary" }}
}
}
System metrics are captured periodically (every 60 seconds by default). You can find all the System Metrics fields here.
Used in: Metrics section
Noteworthy fields: system.cpu.total.norm.pct, system.process.cpu.total.norm.pct
{
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "metric",
"metricset.name": "app",
"system.process.cpu.total.norm.pct": 0.003,
"system.cpu.total.norm.pct": 0.28
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "terms": { "metricset.name": ["app"] } }
]
}
},
"aggs": {
"systemCPUAverage": { "avg": { "field": "system.cpu.total.norm.pct" } },
"processCPUAverage": {
"avg": { "field": "system.process.cpu.total.norm.pct" }
}
}
}
Noteworthy fields: system.memory.actual.free, system.memory.total
{
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "metric",
"metricset.name": "app",
"system.memory.actual.free": 13182939136,
"system.memory.total": 15735697408
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] }},
{ "terms": { "metricset.name": ["app"] }},
{ "exists": { "field": "system.memory.actual.free" }},
{ "exists": { "field": "system.memory.total" }}
]
}
},
"aggs": {
"memoryUsedAvg": {
"avg": {
"script": {
"lang": "expression",
"source": "1 - doc['system.memory.actual.free'] / doc['system.memory.total']"
}
}
}
}
}
The above example is overly simplified. In reality we do a bit more to properly calculate memory usage inside containers. Please note that an Exists Query is used in the filter context in the query to ensure that the memory fields exist.
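For reference, the simplified per-document calculation performed by the script is shown below as a Python sketch. It covers only the simplified formula, not the container-aware logic mentioned above, and the function name is hypothetical.

```python
def memory_used_fraction(free_bytes, total_bytes):
    """Fraction of memory in use: 1 - free / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1 - free_bytes / total_bytes


# Values from the example document above.
used = memory_used_fraction(13182939136, 15735697408)
print(f"{used:.1%}")  # roughly 16% of memory in use
```

The aggregation then averages this per-document value across all matching documents, which is why both fields must exist on every document that passes the filter.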
Pre-aggregations of span documents, where span.self_time.count is the number of original spans. They measure the "self-time" for a span type, and optional subtype, within a transaction group.
Span breakdown metrics are used to power the "Time spent by span type" graph. Agents collect summarized metrics about the timings of spans, broken down by span.type.
Used in: "Time spent by span type" chart
Noteworthy fields: transaction.name, transaction.type, span.type, span.subtype, span.self_time.*
{
"@timestamp": "2021-09-27T21:59:59.828Z",
"processor.event": "metric",
"metricset.name": "span_breakdown",
"transaction.name": "GET /api/products",
"transaction.type": "request",
"span.self_time.sum.us": 1028,
"span.self_time.count": 12,
"span.type": "db",
"span.subtype": "elasticsearch"
}
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "terms": { "metricset.name": ["span_breakdown"] } }
]
}
},
"aggs": {
"total_self_time": { "sum": { "field": "span.self_time.sum.us" } },
"types": {
"terms": { "field": "span.type" },
"aggs": {
"subtypes": {
"terms": { "field": "span.subtype" },
"aggs": {
"self_time_per_subtype": {
"sum": { "field": "span.self_time.sum.us" }
}
}
}
}
}
}
}
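The response to the query above can be turned into a share of time per span type and subtype. The Python sketch below assumes the standard aggregation response shape (`value` for sum, `buckets` for terms) and uses made-up numbers; the function name is hypothetical. Spans without a subtype do not appear in the subtypes buckets, so this sketch does not account for them.

```python
def time_spent_by_type(aggs):
    """Map (span.type, span.subtype) to its share of total self-time."""
    total = aggs["total_self_time"]["value"]
    shares = {}
    for type_bucket in aggs["types"]["buckets"]:
        for sub_bucket in type_bucket["subtypes"]["buckets"]:
            key = (type_bucket["key"], sub_bucket["key"])
            self_time = sub_bucket["self_time_per_subtype"]["value"]
            shares[key] = self_time / total if total else 0.0
    return shares


# Illustrative response fragment, not real data.
aggs = {
    "total_self_time": {"value": 2000.0},
    "types": {"buckets": [
        {"key": "db", "subtypes": {"buckets": [
            {"key": "elasticsearch", "self_time_per_subtype": {"value": 1500.0}},
        ]}},
        {"key": "external", "subtypes": {"buckets": [
            {"key": "http", "self_time_per_subtype": {"value": 500.0}},
        ]}},
    ]},
}
print(time_spent_by_type(aggs))
```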
Pre-aggregations of span documents, where span.destination.service.response_time.count is the number of original spans.
These metrics measure the count and total duration of requests from one service to another service.
Used in: Dependencies (latency), Dependencies (throughput) and Service Map
Noteworthy fields: span.destination.service.*
A pre-aggregated document with 73 span requests from opbeans-ruby to elasticsearch, and a combined latency of 1554ms
{
"@timestamp": "2021-09-01T10:00:00.000Z",
"processor.event": "metric",
"metricset.name": "service_destination",
"service.name": "opbeans-ruby",
"span.destination.service.response_time.count": 73,
"span.destination.service.response_time.sum.us": 1554192,
"span.destination.service.resource": "elasticsearch",
"event.outcome": "success"
}
The latency between a service and an (external) endpoint
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "term": { "metricset.name": "service_destination" } },
{ "term": { "span.destination.service.resource": "elasticsearch" } }
]
}
},
"aggs": {
"latency_sum": {
"sum": { "field": "span.destination.service.response_time.sum.us" }
},
"latency_count": {
"sum": { "field": "span.destination.service.response_time.count" }
}
}
}
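The query returns a sum and a count rather than an average, so the average latency has to be derived by dividing one by the other. A minimal Python sketch using the numbers from the example document; the function name is hypothetical.

```python
def destination_latency_us(latency_sum, latency_count):
    """Average latency in microseconds, or None when there were no requests."""
    if latency_count == 0:
        return None
    return latency_sum / latency_count


# 73 requests with a combined response time of 1554192 microseconds.
avg_us = destination_latency_us(1554192, 73)
print(round(avg_us / 1000, 1), "ms")  # about 21.3 ms per request
```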
Captures the number of requests made from a service to an (external) endpoint
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { "processor.event": ["metric"] } },
{ "term": { "metricset.name": "service_destination" } },
{ "term": { "span.destination.service.resource": "elasticsearch" } }
]
}
},
"aggs": {
"timeseries": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "60s"
},
"aggs": {
"throughput": {
"rate": {
"field": "span.destination.service.response_time.count",
"unit": "minute"
}
}
}
}
}
}
Most Elasticsearch queries will need to have one or more filters. There are a couple of reasons for adding filters:
- correctness: Running an aggregation on unrelated documents will produce incorrect results
- stability: Running an aggregation on unrelated documents could cause the entire query to fail
- performance: Limiting the number of documents will make the query faster
GET apm-*-metric-*,metrics-apm*/_search?terminate_after=1000
{
"query": {
"bool": {
"filter": [
{ "term": { "service.name": "opbeans-go" }},
{ "term": { "service.environment": "testing" }},
{ "term": { "transaction.type": "request" }},
{ "terms": { "processor.event": ["metric"] }},
{ "terms": { "metricset.name": ["transaction"] }},
{
"range": {
"@timestamp": {
"gte": 1633000560000,
"lte": 1633001498988,
"format": "epoch_millis"
}
}
}
]
}
}
}
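When many queries share the same service, environment, transaction type and time range filters, they can be assembled in one place. The Python sketch below builds the same filter list as the example above; the function name `common_filters` and its parameters are hypothetical, not a Kibana API.

```python
import json


def common_filters(service_name, environment, transaction_type, start_ms, end_ms):
    """Filter clauses shared by most queries for one service and time range."""
    return [
        {"term": {"service.name": service_name}},
        {"term": {"service.environment": environment}},
        {"term": {"transaction.type": transaction_type}},
        {
            "range": {
                "@timestamp": {
                    "gte": start_ms,
                    "lte": end_ms,
                    "format": "epoch_millis",
                }
            }
        },
    ]


filters = common_filters("opbeans-go", "testing", "request", 1633000560000, 1633001498988)
# Data-source specific clauses are appended by the caller.
filters.append({"terms": {"processor.event": ["metric"]}})
filters.append({"terms": {"metricset.name": ["transaction"]}})
print(json.dumps({"query": {"bool": {"filter": filters}}}, indent=2))
```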
Possible values for processor.event are: transaction, span, metric, error.
metricset is a subtype of processor.event: metric. Possible values are: transaction, span_breakdown, transaction_breakdown, app, service_destination, agent_config