Add trace and debug log to consistency check #2583

Merged
merged 7 commits into apache:main from md-scan on Mar 24, 2022

Conversation

@milleruntime (Contributor) commented Mar 22, 2022:

  • Closes "Warn user of slow metadata scans" #2577
  • Add a trace span and time measurement around the consistency check so we get an idea of how long metadata scans are taking (a rough sketch of this pattern is shown just after this list)
  • Create a new property, tserver.health.check.interval, to make the check interval configurable
  • Create a new method, watchCriticalFixedDelay(), in ThreadPools
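
A rough, illustrative sketch of the span-plus-timing pattern described above, assuming the OpenTelemetry tracing API (which Accumulo 2.1's tracing is built on); the class, method, and span names here are invented and this is not the PR's actual code:

import java.time.Duration;
import java.time.Instant;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class MetadataScanTiming {
  private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("consistency-check-example");

  // Wraps the metadata scan in a trace span and logs how long it took.
  static void timedConsistencyCheck(Runnable scanMetadataTable) {
    Span span = TRACER.spanBuilder("metadata consistency check").startSpan();
    Instant start = Instant.now();
    try (Scope ignored = span.makeCurrent()) {
      scanMetadataTable.run(); // the actual metadata table scan happens here
    } finally {
      span.end();
      Duration elapsed = Duration.between(start, Instant.now());
      System.out.printf("Metadata scan took %d ms%n", elapsed.toMillis());
    }
  }
}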

@milleruntime (Contributor, Author) commented:

I did some quick testing locally using Uno and continuous ingest (CI), and this seems like a good gauge of general cluster performance. With the cluster sitting idle, scan times were in the range of 7-10 ms. With CI running, the times were around 200-400 ms. For the testing, I modified the thread checker to run every minute.

@EdColeman (Contributor) commented:

Exposing this through metrics rather than tracing would provide better utility for Ops to monitor, alert, and trend.

Tracing would be helpful if the times could be correlated with other activities, but I am not sure that is possible. First, this runs as its own thread, without a parent span, rather than as part of some other activity. Second, this will only be a periodic snapshot and will not have insight into global activity; even if it did, I don't know how that would be expressed.

For example, during a bulk ingest or perhaps a metadata table compaction, it may be "normal" for the scan time to increase from the "idle" baseline.

final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();

Map<KeyExtent,Long> updateCounts = new HashMap<>();
Instant start = Instant.now();
Contributor commented:

This lambda is starting to get long; I wonder if it should be pulled out into a function.
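
For illustration only, the extraction being suggested could look something like the sketch below; the class and method names are invented and this is not the PR's actual code:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ConsistencyCheckScheduling {
  private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

  // The long inline lambda becomes a named private method; the scheduler gets a method reference.
  void start(long intervalMillis) {
    scheduler.scheduleWithFixedDelay(this::checkTabletMetadataConsistency,
        intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  private void checkTabletMetadataConsistency() {
    // former lambda body: snapshot the online tablets, scan the metadata table,
    // compare the two, and log the scan duration
  }
}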

Contributor (author) replied:

I agree. It is very hard to read and decipher. I think the two ThreadPools methods could be consolidated.

@milleruntime (Contributor, Author) commented:

> Exposing this through metrics rather than tracing would provide better utility for Ops to monitor, alert, and trend.

Ops? I don't know the metrics code; any tips on how to do this in the tserver?
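
A minimal sketch of the metrics approach being discussed, assuming Micrometer (which Accumulo 2.1's metrics layer is built on) and an available MeterRegistry; the metric name and class are invented:

import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

class ConsistencyCheckMetrics {
  private final Timer scanTimer;

  ConsistencyCheckMetrics(MeterRegistry registry) {
    scanTimer = Timer.builder("tserver.metadata.consistency.scan")
        .description("Time spent scanning the metadata table during the consistency check")
        .register(registry);
  }

  // Called once per consistency check with the measured scan duration.
  void recordScan(Duration elapsed) {
    scanTimer.record(elapsed);
  }

  public static void main(String[] args) {
    ConsistencyCheckMetrics metrics = new ConsistencyCheckMetrics(new SimpleMeterRegistry());
    metrics.recordScan(Duration.ofMillis(250)); // Ops tooling would alert/trend on this timer
  }
}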

@milleruntime (Contributor, Author) commented Mar 23, 2022:

I was thinking that it might be nice to make the consistency check frequency configurable (something like tserver.health.check.frequency, with a default of 30-60 minutes). Having it check more often on a tserver would be a nice health check when debugging. Thoughts?
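
As a sketch of how a configurable interval could be consumed, assuming AccumuloConfiguration#getTimeInMillis and the TIMEDURATION property that ends up being added later in this PR; the class and scheduling shape here are invented for illustration:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.conf.AccumuloConfiguration;
import org.apache.accumulo.core.conf.Property;

class HealthCheckScheduling {
  // Reads the TIMEDURATION property (e.g. "30m") as milliseconds and schedules the check.
  static void schedule(AccumuloConfiguration conf, Runnable consistencyCheck) {
    long intervalMillis = conf.getTimeInMillis(Property.TSERV_HEALTH_CHECK_FREQ);
    new ScheduledThreadPoolExecutor(1).scheduleWithFixedDelay(consistencyCheck,
        intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }
}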

@keith-turner (Contributor) commented Mar 23, 2022:

This seems like a good short-term way to get a sense of metadata table read performance. Longer term, it may be better to instrument the scanner and batch scanner to support emitting metrics. Maybe the scanner and batch scanner could have a client property whose value is a list of tables they would emit metrics for. Then this could be flipped on, and we could see metadata table read performance from across the cluster in a metrics system.

If this more general solution were ever implemented, it would be good to remove this code.
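
Purely as a hypothetical illustration of that longer-term idea (the opt-in set of table names and the metric name are invented; this is not an existing Accumulo API):

import java.util.Map.Entry;
import java.util.Set;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

class ScanMetricsSketch {
  // Reads every entry from the scanner, timing the read only for opted-in tables.
  static long scan(String table, Scanner scanner, Set<String> metricsTables, MeterRegistry registry) {
    if (!metricsTables.contains(table)) {
      return countEntries(scanner);
    }
    Timer timer = Timer.builder("client.scan.time").tag("table", table).register(registry);
    return timer.record(() -> countEntries(scanner));
  }

  private static long countEntries(Scanner scanner) {
    long count = 0;
    for (Entry<Key,Value> ignored : scanner) {
      count++;
    }
    return count;
  }
}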

@milleruntime (Contributor, Author) commented:

@keith-turner @ctubbsii I think this PR is good to go.

…tServer.java

Co-authored-by: Keith Turner <kturner@apache.org>
@@ -656,6 +656,8 @@
"1.4.0"),
TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
"The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,
Contributor commented:

As follow-on work, it may be good to investigate whether anything else should use this property. There is a check that periodically looks for stuck compactions in the tablet server; maybe it could use this property too.

Contributor (author) replied:

Any idea where this other check is located?

Contributor (author) added:

Is it CompactionWatcher? I noticed this method never gets called:

public static synchronized void startWatching(ServerContext context) {

Contributor replied:

Yeah, that is the code I was thinking about. There is already a property there, the warn-time prop, so it probably would not make sense to use this new prop there. The frequency of the compaction check should probably be 1/2 the warn-time prop.

> I noticed this method never gets called:

That is not good.

Contributor (author) replied:

The run() method does get called, though... I am not sure if startWatching() should be called instead.

Contributor (author) added:

The startWatching() method does get called, once in the constructor of MajorCompactor. And it looks like the other times run() gets called are just single, one-time checks.

// Run the watcher again to clear out the finished compaction and set the
// stuck count to zero.
watcher.run();

@milleruntime milleruntime merged commit dd81d60 into apache:main Mar 24, 2022
@milleruntime milleruntime deleted the md-scan branch March 24, 2022 14:05
@ctubbsii ctubbsii added this to In progress in 2.1.0 via automation Mar 25, 2022
@ctubbsii ctubbsii moved this from In progress to Done in 2.1.0 Mar 25, 2022
.fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
mdScanSpan.end();
duration = Duration.between(start, Instant.now());
log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
Member commented:

Duration probably has a toString that might be better to use. Or use DurationFormat?
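
For comparison, a quick illustration of the two formats being discussed (Duration's ISO-8601 toString() versus the raw millisecond count):

import java.time.Duration;

class DurationFormatting {
  public static void main(String[] args) {
    Duration d = Duration.ofMillis(754);
    System.out.println(d.toMillis()); // prints: 754
    System.out.println(d);            // prints: PT0.754S (Duration's ISO-8601 toString)
  }
}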

Linked issue: Warn user of slow metadata scans (#2577)
Project: 2.1.0 (Done)
4 participants