Add counter to track collections terminated early
Closes #48.
gebn committed Sep 4, 2020
1 parent 3d32c2f commit 1e9e69a
Showing 2 changed files with 12 additions and 1 deletion.
README.md (1 change: 1 addition, 0 deletions)
@@ -203,6 +203,7 @@ These metrics are exposed at `/metrics`, so are an overall view of all scrapes g
| Metric | Description |
|-|-|
| `bmc_collector_initialise_timeouts_total` | If this increases too rapidly, it suggests BMCs have too high latency to complete initialisation before Prometheus times out the scrape. This causes a kind of crash looping behaviour where the BMC never manages to be ready for scraping. The solution is to increase the scrape timeout, or move the exporter closer to the BMC. |
| `bmc_collector_partial_collections_total` | This counts the number of collections where the exporter returned a partial set of metrics to avoid Prometheus timing out the scrape request. If this happens too often, the scrape timeout may be too low, or BMCs may be responding too slowly. |
| `bmc_collector_session_expiries_total` | The specification recommends a timeout of 60s +/- 3s, so if you have deployed the exporter in a pair and scrape every 30s, a high rate of increase indicates a load balancing issue. When the session expires, the exporter will attempt to establish a new one, so this is not a problem in itself; it just results in a few more requests and higher load on BMCs. If your scrape interval is 2m, you would expect every scrape to require a new session. |
| `bmc_provider_credential_failures_total` | Any increase here indicates the credential provider is struggling to fulfil requests, and BMCs cannot be logged into. The only bundled implementation is the file provider, so these errors will not be temporary, and indicate the exporter is being asked to scrape a set of BMCs that has drifted from its secrets config file. |
| `bmc_target_abandoned_requests_total` | A high rate of abandoned requests indicates contention for access to BMCs. This is most likely to be caused by multiple Prometheis scraping a single exporter with a short scrape timeout. These requests did not have time to begin a collection, let alone initialise a session. |
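
The counters in the table above are created with `promauto` (see the collector.go hunk below), which registers them with the default Prometheus registerer, so they appear on `/metrics` via a plain promhttp handler. The following is a minimal, hypothetical sketch of that wiring, not the exporter's actual main function; the listen address is illustrative only.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Counters created via promauto.NewCounter register themselves with
	// prometheus.DefaultRegisterer, so serving the default handler exposes
	// the bmc_collector_* and bmc_provider_* counters on /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil)) // address is hypothetical
}
```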
bmc/collector/collector.go (12 changes: 11 additions, 1 deletion)
@@ -45,6 +45,13 @@ var (
Name: "session_expiries_total",
Help: "The number of sessions that have stopped working.",
})
partialCollections = promauto.NewCounter(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "partial_collections_total",
Help: "The number of collections we ended prematurely to ensure " +
"Prometheus received at least some data.",
})

// "meta" scrape metrics
up = prometheus.NewDesc(
@@ -168,7 +175,10 @@ func (c *Collector) Collect(ch chan<- prometheus.Metric) {
 	// this timestamp is used by GC to determine when this target can be deleted
 	atomic.StoreInt64(&c.lastCollection, start.UnixNano())
 
-	c.collect(ctx, ch) // TODO do something with error?
+	if err := c.collect(ctx, ch); err != nil {
+		// context expiry; partial data
+		partialCollections.Inc()
+	}
 
 	elapsed := time.Since(start)
 	collectDuration.Observe(elapsed.Seconds())
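
For context, here is a self-contained, hypothetical sketch of the pattern the new counter tracks: a collection loop that returns `ctx.Err()` once the scrape deadline passes, so the caller can record a partial collection while still returning whatever was gathered. The names, readings, and timings below are invented for illustration; this is not the exporter's actual `collect` implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectReadings pushes readings to out until ctx expires; it returns
// ctx.Err() if it had to stop early, mirroring how collect signals a
// partial collection to Collect.
func collectReadings(ctx context.Context, readings []float64, out chan<- float64) error {
	for _, r := range readings {
		select {
		case <-ctx.Done():
			// The scrape deadline has passed: stop now so Prometheus still
			// receives the readings sent so far.
			return ctx.Err()
		default:
		}
		out <- r
		time.Sleep(50 * time.Millisecond) // stand-in for a slow BMC round trip
	}
	return nil
}

func main() {
	readings := []float64{1, 2, 3, 4, 5}
	out := make(chan float64, len(readings))

	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()

	if err := collectReadings(ctx, readings, out); err != nil {
		// This is the branch where the exporter increments
		// partial_collections_total.
		fmt.Println("partial collection:", err)
	}
	close(out)
	for r := range out {
		fmt.Println("reading:", r)
	}
}
```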
