Add counter to track collections terminated early
Closes #48.
gebn committed Sep 4, 2020
1 parent 3d32c2f commit 1e9e69a
Showing 2 changed files with 12 additions and 1 deletion.
README.md (1 change: 1 addition, 0 deletions)
@@ -203,6 +203,7 @@ These metrics are exposed at `/metrics`, so are an overall view of all scrapes g
| Metric | Description |
|-|-|
| `bmc_collector_initialise_timeouts_total` | If this increases too rapidly, it suggests BMCs have too high latency to complete initialisation before Prometheus times out the scrape. This causes a kind of crash looping behaviour where the BMC never manages to be ready for scraping. The solution is to increase the scrape timeout, or move the exporter closer to the BMC. |
| `bmc_collector_partial_collections_total` | This counts the number of collections where the exporter returned a partial set of metrics to avoid Prometheus timing out the scrape request. If this happens too often, the scrape timeout may be too low, or BMCs may be responding too slowly. |
| `bmc_collector_session_expiries_total` | The specification recommends a timeout of 60s +/- 3s, so if you have deployed the exporter in a pair and scrape every 30s, a high rate of increase indicates a load balancing issue. When the session expires, the exporter will attempt to establish a new one, so this is not a problem in itself; it just results in a few more requests and higher load on BMCs. If your scrape interval is 2m, you would expect every scrape to require a new session. |
| `bmc_provider_credential_failures_total` | Any increase here indicates the credential provider is struggling to fulfil requests, and BMCs cannot be logged into. The only bundled implementation is the file provider, so these errors will not be temporary, and indicate the exporter is being asked to scrape a set of BMCs that has drifted from its secrets config file. |
| `bmc_target_abandoned_requests_total` | A high rate of abandoned requests indicates contention for access to BMCs. This is most likely to be caused by multiple Prometheis scraping a single exporter with a short scrape timeout. These requests did not have time to begin a collection, let alone initialise a session. |
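
The counters in the table above are created with `promauto` (see the collector.go hunk below), which registers them with the default Prometheus registerer, so they appear on `/metrics` via a plain promhttp handler. The following is a minimal, hypothetical sketch of that wiring, not the exporter's actual main function; the listen address is illustrative only.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Counters created via promauto.NewCounter register themselves with
	// prometheus.DefaultRegisterer, so serving the default handler exposes
	// the bmc_collector_* and bmc_provider_* counters on /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil)) // address is hypothetical
}
```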
bmc/collector/collector.go (12 changes: 11 additions, 1 deletion)
@@ -45,6 +45,13 @@ var (
Name: "session_expiries_total",
Help: "The number of sessions that have stopped working.",
})
partialCollections = promauto.NewCounter(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "partial_collections_total",
Help: "The number of collections we ended prematurely to ensure " +
"Prometheus received at least some data.",
})

// "meta" scrape metrics
up = prometheus.NewDesc(
@@ -168,7 +175,10 @@ func (c *Collector) Collect(ch chan<- prometheus.Metric) {
 	// this timestamp is used by GC to determine when this target can be deleted
 	atomic.StoreInt64(&c.lastCollection, start.UnixNano())
 
-	c.collect(ctx, ch) // TODO do something with error?
+	if err := c.collect(ctx, ch); err != nil {
+		// context expiry; partial data
+		partialCollections.Inc()
+	}
 
 	elapsed := time.Since(start)
 	collectDuration.Observe(elapsed.Seconds())
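
For context, here is a self-contained, hypothetical sketch of the pattern the new counter tracks: a collection loop that returns `ctx.Err()` once the scrape deadline passes, so the caller can record a partial collection while still returning whatever was gathered. The names, readings, and timings below are invented for illustration; this is not the exporter's actual `collect` implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectReadings pushes readings to out until ctx expires; it returns
// ctx.Err() if it had to stop early, mirroring how collect signals a
// partial collection to Collect.
func collectReadings(ctx context.Context, readings []float64, out chan<- float64) error {
	for _, r := range readings {
		select {
		case <-ctx.Done():
			// The scrape deadline has passed: stop now so Prometheus still
			// receives the readings sent so far.
			return ctx.Err()
		default:
		}
		out <- r
		time.Sleep(50 * time.Millisecond) // stand-in for a slow BMC round trip
	}
	return nil
}

func main() {
	readings := []float64{1, 2, 3, 4, 5}
	out := make(chan float64, len(readings))

	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()

	if err := collectReadings(ctx, readings, out); err != nil {
		// This is the branch where the exporter increments
		// partial_collections_total.
		fmt.Println("partial collection:", err)
	}
	close(out)
	for r := range out {
		fmt.Println("reading:", r)
	}
}
```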
