improves monitor tserver view #6329
Made a few major changes in this PR, all in support of providing an improved tserver page on the monitor.

* Gave the RPC thread a consistent name across all server types. This was done to make the thread name findable in a metrics tag using a constant.
* Set up a custom monitor metrics registry. This was done because it may not be safe to read from a registry in another thread (see micrometer-metrics/micrometer#7417) and, more importantly, to get step functionality where metrics like function counters show the delta for the last 30 seconds.
* Refactored the ServersView code to be more flexible. It used to compute data directly from a single metric. Now it's easier to do arbitrary reductions on a collection of metrics for the data in a column.
* Started collecting executor metrics on thread pools and used those to create some of the tserver columns in the monitor. Using the metrics requires looking for specific thread pool names in the tags.
* Added a new metric to track scan errors.
* Fixed some incorrect metric types.
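To make two of the ideas above concrete, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Micrometer: `MonitorMetricsSketch`, `Sample`, `StepDelta`, and `sumWhere` are hypothetical names invented for illustration. `StepDelta` only approximates the step idea (report the increase of a cumulative count since the previous poll), and `sumWhere` shows one possible reduction over a collection of metrics selected by a tag value such as a thread pool name.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types come from Accumulo or Micrometer.
final class MonitorMetricsSketch {

  /** One metric sample: a name, its tags, and a value. */
  record Sample(String name, Map<String, String> tags, double value) {}

  /** Reports the increase of a cumulative count since the previous poll. */
  static final class StepDelta {
    private double lastTotal;

    /** Call once per step (for example every 30 seconds) with the running total. */
    double poll(double currentTotal) {
      double delta = currentTotal - lastTotal;
      lastTotal = currentTotal;
      return delta;
    }
  }

  /** Sums the values of all samples that have the given name and tag value. */
  static double sumWhere(List<Sample> samples, String name, String tagKey, String tagValue) {
    return samples.stream()
        .filter(s -> s.name().equals(name))
        .filter(s -> tagValue.equals(s.tags().get(tagKey)))
        .mapToDouble(Sample::value)
        .sum();
  }
}
```

With this shape, a column value could be computed as `sumWhere(samples, "executor.queued", "name", "pool-a")`, where the metric name, tag key, and pool name are placeholders rather than the names the monitor actually looks for.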
| opts.parseArgs(applicationName, args); | ||
| var siteConfig = opts.getSiteConfiguration(); | ||
| final String newBindParameter = siteConfig.get(Property.RPC_PROCESS_BIND_ADDRESS); | ||
| final String newBindParameter = siteConfig.get(siteConfig |
This is a bug found while testing this PR. It needs to be pulled into a separate PR, and some script changes are also needed.
There are columns for queued and completed scans and RPCs. Is there a metric for the number that are currently in progress?
Yes, I believe the following metric is a Gauge that shows this.
| TSERVER_HOLD("accumulo.ingest.hold", MetricType.GAUGE, | ||
| "Duration for which commits have been held in milliseconds.", MetricDocSection.TABLET_SERVER, | ||
| "Ingest Commit Hold Time", null, NUMBER), | ||
| "Hold Time", null, DURATION), |
The null parameter here is the column description, which is the text displayed in the Monitor UI when hovering over the column header in a table. When it is null, the value defaults to the property description. In some cases we may want to be more descriptive in the monitor, as "Hold Time" may not mean much to a novice (or experienced) user. We could be very descriptive in the Monitor, telling the user what the column value means and what may cause it. For example,
Duration for which the TabletServer has not been accepting new mutations. Acceptance of new mutations is held while the TabletServer waits for some other activity to complete. This is typically a sync of the write-ahead log or a minor compaction. Frequent small hold times are normal, whereas large hold times could indicate a problem that needs to be investigated.
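The fallback described in this comment can be sketched as follows. This is an illustration only: `ColumnTextSketch` and `hoverText` are hypothetical names, and the real enum in the PR carries more fields than the two strings used here.

```java
// Hypothetical sketch of the null fallback; not the code in this PR.
final class ColumnTextSketch {

  /**
   * Returns the hover text for a column header: the explicit column
   * description when one is given, otherwise the metric's own description.
   */
  static String hoverText(String columnDescription, String metricDescription) {
    return columnDescription != null ? columnDescription : metricDescription;
  }
}
```

Under this assumption, passing null keeps the current behavior, while passing a longer string like the example above would replace the hover text for that one column.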
