Add trace and debug log to consistency check #2583

Merged
merged 7 commits into apache:main from md-scan on Mar 24, 2022

Conversation

@milleruntime (Contributor) commented Mar 22, 2022:

  • Closes "Warn user of slow metadata scans" #2577
  • Add a trace span and time measurement around the consistency check so we get an idea of how long metadata scans are taking (a rough sketch of this pattern is shown just after this list)
  • Create a new property, tserver.health.check.interval, to make the check interval configurable
  • Create a new method, watchCriticalFixedDelay(), in ThreadPools
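
A rough, illustrative sketch of the span-plus-timing pattern described above, assuming the OpenTelemetry tracing API (which Accumulo 2.1's tracing is built on); the class, method, and span names here are invented and this is not the PR's actual code:

import java.time.Duration;
import java.time.Instant;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class MetadataScanTiming {
  private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("consistency-check-example");

  // Wraps the metadata scan in a trace span and logs how long it took.
  static void timedConsistencyCheck(Runnable scanMetadataTable) {
    Span span = TRACER.spanBuilder("metadata consistency check").startSpan();
    Instant start = Instant.now();
    try (Scope ignored = span.makeCurrent()) {
      scanMetadataTable.run(); // the actual metadata table scan happens here
    } finally {
      span.end();
      Duration elapsed = Duration.between(start, Instant.now());
      System.out.printf("Metadata scan took %d ms%n", elapsed.toMillis());
    }
  }
}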

@milleruntime (Contributor, Author) commented:

I did some quick testing locally using Uno and continuous ingest (CI), and this seems like a good gauge of general cluster performance. With the cluster sitting idle, scan times were in the range of 7-10 ms. With CI running, the times were around 200-400 ms. For the testing, I modified the thread checker to run every minute.

@EdColeman (Contributor) commented:

Exposing this through metrics rather than tracing would provide better utility for Ops to monitor, alert, and trend.

Tracing would be helpful if the times could be correlated with other activities, but I am not sure that is possible. First, this runs as its own thread, without a parent span, rather than as part of some other activity. Second, this will only be a periodic snapshot and will not have insight into global activity; even if it did, I don't know how that would be expressed.

For example, during a bulk ingest or perhaps a metadata table compaction, it may be "normal" for the scan time to increase from the "idle" baseline.

final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();

Map<KeyExtent,Long> updateCounts = new HashMap<>();
Instant start = Instant.now();
Contributor commented:

This lambda is starting to get long; I wonder if it should be pulled out into a function.
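
For illustration only, the extraction being suggested could look something like the sketch below; the class and method names are invented and this is not the PR's actual code:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ConsistencyCheckScheduling {
  private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

  // The long inline lambda becomes a named private method; the scheduler gets a method reference.
  void start(long intervalMillis) {
    scheduler.scheduleWithFixedDelay(this::checkTabletMetadataConsistency,
        intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  private void checkTabletMetadataConsistency() {
    // former lambda body: snapshot the online tablets, scan the metadata table,
    // compare the two, and log the scan duration
  }
}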

Contributor (author) replied:

I agree. It is very hard to read and decipher. I think the two ThreadPools methods could be consolidated.

@milleruntime (Contributor, Author) commented:

> Exposing this through metrics rather than tracing would provide better utility for Ops to monitor, alert, and trend.

Ops? I don't know the metrics code; any tips on how to do this in the tserver?
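
A minimal sketch of the metrics approach being discussed, assuming Micrometer (which Accumulo 2.1's metrics layer is built on) and an available MeterRegistry; the metric name and class are invented:

import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

class ConsistencyCheckMetrics {
  private final Timer scanTimer;

  ConsistencyCheckMetrics(MeterRegistry registry) {
    scanTimer = Timer.builder("tserver.metadata.consistency.scan")
        .description("Time spent scanning the metadata table during the consistency check")
        .register(registry);
  }

  // Called once per consistency check with the measured scan duration.
  void recordScan(Duration elapsed) {
    scanTimer.record(elapsed);
  }

  public static void main(String[] args) {
    ConsistencyCheckMetrics metrics = new ConsistencyCheckMetrics(new SimpleMeterRegistry());
    metrics.recordScan(Duration.ofMillis(250)); // Ops tooling would alert/trend on this timer
  }
}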

@milleruntime (Contributor, Author) commented Mar 23, 2022:

I was thinking that it might be nice to make the consistency check frequency configurable (something like tserver.health.check.frequency, with a default of 30-60 minutes). Having it check more often on a tserver would be a nice health check when debugging. Thoughts?
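
As a sketch of how a configurable interval could be consumed, assuming AccumuloConfiguration#getTimeInMillis and the TIMEDURATION property that ends up being added later in this PR; the class and scheduling shape here are invented for illustration:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.conf.AccumuloConfiguration;
import org.apache.accumulo.core.conf.Property;

class HealthCheckScheduling {
  // Reads the TIMEDURATION property (e.g. "30m") as milliseconds and schedules the check.
  static void schedule(AccumuloConfiguration conf, Runnable consistencyCheck) {
    long intervalMillis = conf.getTimeInMillis(Property.TSERV_HEALTH_CHECK_FREQ);
    new ScheduledThreadPoolExecutor(1).scheduleWithFixedDelay(consistencyCheck,
        intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }
}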

@keith-turner (Contributor) commented Mar 23, 2022:

This seems like a good short-term way to get a sense of metadata table read performance. Longer term, it may be better to instrument the scanner and batch scanner to support emitting metrics. Maybe the scanner and batch scanner could have a client property whose value is a list of tables they would emit metrics for. Then this could be flipped on, and we could see metadata table read performance from across the cluster in a metrics system.

If this more general solution were ever implemented, it would be good to remove this code.
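
Purely as a hypothetical illustration of that longer-term idea (the opt-in set of table names and the metric name are invented; this is not an existing Accumulo API):

import java.util.Map.Entry;
import java.util.Set;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

class ScanMetricsSketch {
  // Reads every entry from the scanner, timing the read only for opted-in tables.
  static long scan(String table, Scanner scanner, Set<String> metricsTables, MeterRegistry registry) {
    if (!metricsTables.contains(table)) {
      return countEntries(scanner);
    }
    Timer timer = Timer.builder("client.scan.time").tag("table", table).register(registry);
    return timer.record(() -> countEntries(scanner));
  }

  private static long countEntries(Scanner scanner) {
    long count = 0;
    for (Entry<Key,Value> ignored : scanner) {
      count++;
    }
    return count;
  }
}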

@milleruntime (Contributor, Author) commented:

@keith-turner @ctubbsii I think this PR is good to go.

…tServer.java

Co-authored-by: Keith Turner <kturner@apache.org>
@@ -656,6 +656,8 @@
"1.4.0"),
TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
"The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,
Contributor commented:

As follow-on work, it may be good to investigate whether anything else should use this property. There is a check that periodically looks for stuck compactions in the tablet server; maybe it could use this property too.

Contributor (author) replied:

Any idea where this other check is located?

Contributor (author) added:

Is it CompactionWatcher? I noticed this method never gets called:

public static synchronized void startWatching(ServerContext context) {

Contributor replied:

Yeah, that is the code I was thinking about. There is already a property there, the warn-time prop, so it probably would not make sense to use this new prop there. The frequency of the compaction check should probably be 1/2 the warn-time prop.

> I noticed this method never gets called:

That is not good.

Contributor (author) replied:

The run() method does get called, though... I am not sure if startWatching() should be called instead.

Contributor (author) added:

The startWatching() method does get called, once in the constructor of MajorCompactor. And it looks like the other times run() gets called are just single, one-time checks.

// Run the watcher again to clear out the finished compaction and set the
// stuck count to zero.
watcher.run();

@milleruntime milleruntime merged commit dd81d60 into apache:main Mar 24, 2022
@milleruntime milleruntime deleted the md-scan branch March 24, 2022 14:05
@ctubbsii ctubbsii added this to In progress in 2.1.0 via automation Mar 25, 2022
@ctubbsii ctubbsii moved this from In progress to Done in 2.1.0 Mar 25, 2022
.fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
mdScanSpan.end();
duration = Duration.between(start, Instant.now());
log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
Member commented:

Duration probably has a toString that might be better to use. Or use DurationFormat?
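
For comparison, a quick illustration of the two formats being discussed (Duration's ISO-8601 toString() versus the raw millisecond count):

import java.time.Duration;

class DurationFormatting {
  public static void main(String[] args) {
    Duration d = Duration.ofMillis(754);
    System.out.println(d.toMillis()); // prints: 754
    System.out.println(d);            // prints: PT0.754S (Duration's ISO-8601 toString)
  }
}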

Linked issue: Warn user of slow metadata scans (#2577)
Project: 2.1.0 (Done)
4 participants