Skip to content

Commit

Permalink
Clarify audit log failure telemetry docs. (hashicorp#27969)
Browse files Browse the repository at this point in the history
* Clarify audit log failure telemetry docs.

* Add the note about the misleading counts
  • Loading branch information
banks authored Aug 6, 2024
1 parent a17121c commit b276c12
Show file tree
Hide file tree
Showing 2 changed files with 31 additions and 17 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,23 @@

| Metric type | Value | Description |
|-------------|--------|-------------------------------------------------------------------------------------------|
| gauge | number | Average (mean) number of audit log request failures across all devices during time period |
| counter | number | The number of audit log request failures across all devices |

The number of request failures is a **crucial metric**.

A non-zero value for `vault.audit.log_request_failure` indicates that all
the configured audit devices failed to log a request (or response). If Vault cannot
properly audit a request, or the response to a request, the original request
will fail.
When using Prometheus sink use `rate` or `irate` to convert this into the number
of failures over a specific time period.

The `mean` value for this metric should be monitored, not the `count` which could be misleading.
When using Vault's built-in `/metrics` output format, counters are reported
aggregated over the metrics interval which defaults to 10 seconds. Due to
historical reasons, this counter is recorded in a way that makes the `count`
field misleading - it counts every request whether it failed or not. The `mean`
value however will correctly record the normalized per-second rate at which
audit errors have occurred over the interval.

Any increase in this counter indicates that all the configured audit devices
failed to log a request (or response). If Vault cannot properly audit a request,
or the response to a request, the original request will fail.

Refer to the Vault logs and any device-specific metrics to troubleshoot the
failing audit log device.
Original file line number Diff line number Diff line change
@@ -1,17 +1,24 @@
### vault.audit.log_response_failure ((#vault-audit-log_response_failure))

| Metric type | Value | Description |
|-------------|--------|--------------------------------------------------------------------------------------------|
| gauge | number | Average (mean) number of audit log response failures across all devices during time period |
| Metric type | Value | Description |
|-------------|--------|-------------------------------------------------------------------------------------------|
| counter | number | The number of audit log response failures across all devices |

The number of request failures is a **crucial metric**.
The number of response failures is a **crucial metric**.

A non-zero value for `vault.audit.log_response_failure` indicates that all
the configured audit log devices failed to log a response to a request. If Vault cannot
properly audit a request, or the response to a request, the original request
will fail.
When using Prometheus sink use `rate` or `irate` to convert this into the number
of failures over a specific time period.

The `mean` value for this metric should be monitored, not the `count` which could be misleading.
When using Vault's built-in `/metrics` output format, counters are reported
aggregated over the metrics interval which defaults to 10 seconds. Due to
historical reasons, this counter is recorded in a way that makes the `count`
field misleading - it counts every request whether it failed or not. The `mean`
value however will correctly record the normalized per-second rate at which
audit errors have occurred over the interval.

Refer to the device-specific metrics and logs to troubleshoot the failing audit
log device.
Any increase in this counter indicates that all the configured audit devices
failed to log a request (or response). If Vault cannot properly audit a request,
or the response to a request, the original request will fail.

Refer to the Vault logs and any device-specific metrics to troubleshoot the
failing audit log device.

0 comments on commit b276c12

Please sign in to comment.