Add trace and debug log to consistency check #2583
I did some quick testing locally using Uno and CI, and this seems like a good gauge of general cluster performance. With the cluster sitting idle, scan times were in the range of 7-10 ms. With CI running, the times were around 200-400 ms. For this testing I changed the thread checker to run every minute.
Exposing this through metrics rather than tracing would provide better utility for OPs to monitor, alert and trend. Tracing would be helpful if times could be correlated with other activities, but I am not sure that is possible. First, this runs as its own thread, without a parent span from some other activity. The other issue is that this will only be a periodic snapshot with no insight into global activities, and even if it had that insight, I don't know how it would be expressed. For example, during bulk ingest or a metadata table compaction it may be "normal" for the scan time to increase from an "idle" baseline.
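To make the metrics suggestion concrete, here is a minimal sketch of holding the scan timings in a form a metrics system could poll. `ScanTimeStats` and its method names are hypothetical and are not part of Accumulo or of any metrics library; a real change would register these values with whatever metrics framework the tserver already uses.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for metadata scan timings; not an Accumulo class.
class ScanTimeStats {
  private final AtomicLong count = new AtomicLong();
  private final AtomicLong lastMillis = new AtomicLong();
  private final AtomicLong maxMillis = new AtomicLong();

  // Called by the periodic check after each metadata scan completes.
  void record(Duration scanTime) {
    long ms = scanTime.toMillis();
    count.incrementAndGet();
    lastMillis.set(ms);
    // Keep the largest value seen so far.
    maxMillis.accumulateAndGet(ms, Math::max);
  }

  long count() {
    return count.get();
  }

  long lastMillis() {
    return lastMillis.get();
  }

  long maxMillis() {
    return maxMillis.get();
  }
}
```

Keeping both the last and the maximum value lets an operator compare an idle baseline against peaks under load, which is the kind of trending the comment above asks for.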
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();

Map<KeyExtent,Long> updateCounts = new HashMap<>();
Instant start = Instant.now();
This lambda is starting to get long. I wonder if it should be pulled out into a method.
I agree. It is very hard to read and decipher. I think the two ThreadPools methods could be consolidated.
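As a rough illustration of the refactoring suggested above, the sketch below moves the body of a scheduled lambda into a named method and schedules it with a method reference. The class and method names (`ConsistencyCheckSketch`, `checkTabletMetadata`, `schedule`) are made up for this example, and it uses the plain JDK scheduler rather than Accumulo's `ThreadPools` utility.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the tserver code that schedules the check.
class ConsistencyCheckSketch {
  final AtomicInteger runs = new AtomicInteger();

  // Formerly the body of an inline lambda; as a named method it can be
  // read, documented, and tested on its own.
  void checkTabletMetadata() {
    runs.incrementAndGet(); // placeholder for the real consistency check
  }

  // Schedules the named method instead of a long inline lambda.
  ScheduledExecutorService schedule(long periodMillis) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    pool.scheduleWithFixedDelay(this::checkTabletMetadata, 0, periodMillis,
        TimeUnit.MILLISECONDS);
    return pool;
  }
}
```

The scheduling call shrinks to one line, and the check itself can be invoked directly from a unit test without starting a thread pool.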
OPs? I don't know the metrics code. Any tips on how to do this in the tserver?
I was thinking that it might be nice to make the consistency check frequency configurable (something like |
This seems like a good short-term way to get a sense of metadata table read performance. Longer term, it may be better to instrument the scanner and batch scanner to support emitting metrics. Maybe the scanner and batch scanner could have a client property whose value is a list of tables they would emit metrics for. Then this could be flipped on and we could see metadata table read performance from across the cluster in a metrics system. If this more general solution were ever implemented, it would be good to remove this code.
@keith-turner @ctubbsii I think this PR is good to go.
…tServer.java Co-authored-by: Keith Turner <kturner@apache.org>
@@ -656,6 +656,8 @@
"1.4.0"),
TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
"The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,
As follow-on work, it may be good to investigate whether anything else should use this property. There is a check that periodically looks for stuck compactions in the tablet server; maybe it could use this property too.
Any idea where this other check is located?
Is it CompactionWatcher? I noticed this method never gets called:
accumulo/server/base/src/main/java/org/apache/accumulo/server/compaction/CompactionWatcher.java
Line 124 in 7fb1b5d
public static synchronized void startWatching(ServerContext context) {
Yeah, that is the code I was thinking about. There is already a property there, the warn time prop, so it probably would not make sense to use this new prop there. The frequency of the compaction check should probably be 1/2 the warn time prop.
I noticed this method never gets called:
That is not good.
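The "half the warn time" idea above can be sketched with `java.time.Duration`. The helper name `fromWarnTime` is hypothetical, and how the warn time is actually read from configuration is not shown here.

```java
import java.time.Duration;

// Hypothetical helper; not part of CompactionWatcher.
class CompactionCheckInterval {
  // Derive the check interval as half of the configured warn time.
  static Duration fromWarnTime(Duration warnTime) {
    if (warnTime.isZero() || warnTime.isNegative()) {
      throw new IllegalArgumentException("warn time must be positive: " + warnTime);
    }
    return warnTime.dividedBy(2);
  }
}
```

For example, a 10 minute warn time gives a 5 minute check interval, so a compaction that crosses the warn threshold is seen by a check at most half a warn time later.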
The run() method does get called though... I am not sure if startWatching() should be called instead.
The startWatching() method does get called, once in the constructor of MajorCompactor. And it looks like the other times that run() gets called are just for single, one-time checks.
accumulo/server/compactor/src/main/java/org/apache/accumulo/compactor/Compactor.java
Lines 757 to 759 in dca3ae8
// Run the watcher again to clear out the finished compaction and set the
// stuck count to zero.
watcher.run();
.fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
mdScanSpan.end();
duration = Duration.between(start, Instant.now());
log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
Duration probably has a toString that might be better to use. Or use DurationFormat?
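For comparison, this small sketch shows what the two JDK options produce for the same `Duration`. The class and method names are illustrative only; Accumulo's `DurationFormat` is not shown because its API is not described in this thread.

```java
import java.time.Duration;

// Illustrative comparison of two ways to render a Duration in a log line.
class DurationLogging {
  // The approach in the diff: a millisecond count plus a literal unit.
  static String asMillis(Duration d) {
    return d.toMillis() + "ms";
  }

  // Duration.toString() returns an ISO-8601 style string.
  static String asIso(Duration d) {
    return d.toString();
  }

  public static void main(String[] args) {
    Duration scan = Duration.ofMillis(250);
    System.out.println(asMillis(scan)); // 250ms
    System.out.println(asIso(scan)); // PT0.25S
  }
}
```

The ISO-8601 form scales to minutes and hours without changing the log message (90 seconds prints as `PT1M30S`), while the millisecond form is easier to scan by eye for the short scans measured earlier in this thread.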
so we get an idea of how long metadata scans are taking