Allow option for HdrHistogram post-correction for coordinated omission #731
Comments
Another thought for the estimate of the 'expected interval' is to use the median as recorded in the HdrHistogram.
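As a minimal sketch of the suggestion above (class and method names are illustrative, not from the YCSB patch): the "expected interval" can be estimated from the median of the latencies already recorded, so no target throughput is needed up front. With the real library this would be `histogram.getValueAtPercentile(50.0)` fed into `copyCorrectedForCoordinatedOmission(...)`; here a plain array stands in for the histogram.

```java
import java.util.Arrays;

public class MedianExpectedInterval {
    // Estimate the expected interval between samples as the median of the
    // recorded latencies (upper of the two middles for even-length input).
    public static long medianOf(long[] recordedLatencies) {
        long[] sorted = recordedLatencies.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // One large stall among otherwise-normal operations barely moves
        // the median, which is what makes it a robust interval estimate.
        long[] latencies = {9, 10, 11, 10, 500};
        System.out.println(medianOf(latencies)); // 10
    }
}
```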
Put up a WIP patch which does the proposal above, and it does seem to give me more reasonable "corrected" numbers (e.g. when I'm testing configurations that I know cause stalls on the server side, I now see a high p99).
Hi, thanks for pinging :-)
I'm not sure I agree with your point that latency has to grow at max throughput. That seems to be true in an open system, but YCSB is a closed system, right? With the closed system and enough threads, it's likely that YCSB saturates the system and doesn't cause uncorrected latencies approaching infinity. The idea with this issue is to do something that would reasonably approximate the latencies that would be seen if the test were re-run with a target just below the max throughput determined by the closed system, without actually re-running.
Given the above behaviors, I think any "exploratory" mode would have to coordinate with the system under test to reset all the data, re-load everything, and run the benchmark on the order of hours for each target throughput, don't you think?
"I'm not sure I agree with your point that latency has to grow at max throughput. That seems to be true in an open system, but YCSB is a closed system, right? With the closed system and enough threads, it's likely that YCSB saturates the system and doesn't cause uncorrected latencies approaching infinity."

I understand the pressure to avoid repeating runs and calibration, but I don't think you can get a good idea of LUL (latency under load) from a pure throughput run. In fact, it is far from trivial to conclude much about a system's LUL characteristics without several LUL runs at different loads. As you point out, runs need to be "sufficiently" long to be representative; it all adds up to a large electricity/time/AWS bill :-(

"I don't think the "exploratory" mode is so simple" - I agree, my comment was rash and poorly thought out. For a read load (or rather, for a non-steady-state load) this is doable, but it is not clear how long you'll have to run to get an answer (increase load until the SLA is breached, decrease load to meet the SLA with some margin, measure over a long period, increase load by a small amount, measure over a long period, etc.). You can settle for a low-granularity figure (i.e. a coarser minimal load increment) to improve convergence time. I've built something similar for a messaging system once; it was a hoot, and also sort of useful.

I think this makes for an interesting observation: systems have several modes, and the profile for each mode is not necessarily composable. What I mean by that is that the system behaviour while no compaction is going on is one mode, during compaction is another, during node recovery a third, etc. You can't really compose the lot into a single profile without losing a lot of meaning.
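The exploratory loop described above (step the load up until the SLA percentile is breached, then keep the last load that met it) can be sketched roughly as follows. The latency model here is a toy stand-in for what would really be a long measurement run, and all names are illustrative assumptions, not anything in YCSB:

```java
import java.util.function.LongUnaryOperator;

public class SlaLoadSearch {
    /**
     * Step the target load upward and return the highest tested load whose
     * measured p99 met the SLA, or 0 if none did. A coarser step converges
     * faster at the cost of a less precise answer, as noted above.
     *
     * @param p99AtLoad maps a target load (ops/s) to a measured p99 latency
     * @param slaP99    the p99 latency budget
     * @param step      load increment between measurement runs
     * @param maxLoad   upper bound on the load to try
     */
    public static long findMaxLoadUnderSla(LongUnaryOperator p99AtLoad,
                                           long slaP99, long step, long maxLoad) {
        long best = 0;
        for (long load = step; load <= maxLoad; load += step) {
            if (p99AtLoad.applyAsLong(load) > slaP99) {
                break; // SLA breached; stop climbing
            }
            best = load;
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy model: p99 grows quadratically as load approaches saturation.
        LongUnaryOperator model = load -> load * load / 100;
        System.out.println(findMaxLoadUnderSla(model, 100, 10, 1000)); // 100
    }
}
```

This also glosses over the comment's other point: each such measurement run would need the system reset to a known state, and a single-mode search like this says nothing about compaction or recovery modes.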
Currently, the approach taken to account for coordinated omission is to measure each operation's latency from its "scheduled" start point rather than its "actual" one. This means that some target throughput needs to be set ahead of time.
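A minimal sketch of that "scheduled start" measurement (names are illustrative, not YCSB's actual code): with a fixed target interval, operation i is charged latency from `startTime + i * interval`, so time lost waiting behind a stall is counted against the operation rather than hidden.

```java
import java.util.Arrays;

public class IntendedStartLatency {
    /**
     * Given the scheduling interval and the actual completion times of
     * successive operations, return latencies measured from each
     * operation's scheduled start (startNanos + i * intervalNanos).
     */
    public static long[] latenciesFromScheduledStart(
            long startNanos, long intervalNanos, long[] completionNanos) {
        long[] latencies = new long[completionNanos.length];
        for (int i = 0; i < completionNanos.length; i++) {
            long scheduledStart = startNanos + i * intervalNanos;
            latencies[i] = completionNanos[i] - scheduledStart;
        }
        return latencies;
    }

    public static void main(String[] args) {
        // Target: one op every 10 time units. A stall delays op 2; measured
        // from its scheduled start (20), the whole stall is charged to it.
        long[] completions = {5, 15, 60};
        System.out.println(Arrays.toString(
                latenciesFromScheduledStart(0, 10, completions))); // [5, 5, 40]
    }
}
```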
This can be a bit difficult when tuning DB configuration parameters for throughput, since we don't know up front how fast the target system will run, and thus can't set a reasonable target.
For workloads like this, it would be nice to use HdrHistogram's post-correction feature[1] and pass in something like the average throughput (or perhaps the median of per-second throughput samples, to be more robust). This wouldn't be as accurate as setting a target throughput up front, but it would be more accurate than not correcting at all (the current default).
Ping @nitsanw for thoughts (@busbey suggested you might be interested in this one)
[1] http://hdrhistogram.github.io/HdrHistogram/JavaDoc/org/HdrHistogram/AbstractHistogram.html#copyCorrectedForCoordinatedOmission(long)
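To make the mechanics of [1] concrete, here is the back-fill that `copyCorrectedForCoordinatedOmission(expectedInterval)` applies, written out by hand over a plain list rather than a real `Histogram` (a sketch of the documented behavior, not the library's source): for each recorded value larger than the expected interval, synthetic values are added at `value - k * interval` down to the interval itself, standing in for the requests that were never issued while the stall lasted.

```java
import java.util.ArrayList;
import java.util.List;

public class PostCorrectionSketch {
    // Hand-rolled equivalent of HdrHistogram's coordinated-omission
    // correction: keep every recorded value, and for values that exceed
    // the expected interval, back-fill one synthetic sample per "missed"
    // issue slot (value - interval, value - 2*interval, ...).
    public static List<Long> correct(List<Long> recorded, long expectedInterval) {
        List<Long> corrected = new ArrayList<>();
        for (long value : recorded) {
            corrected.add(value);
            for (long v = value - expectedInterval; v >= expectedInterval;
                    v -= expectedInterval) {
                corrected.add(v);
            }
        }
        return corrected;
    }

    public static void main(String[] args) {
        // A single 50-unit stall with a 10-unit expected interval gains
        // synthetic samples at 40, 30, 20, and 10, lifting the high
        // percentiles the way the uncorrected data never would.
        System.out.println(correct(List.of(5L, 50L, 5L), 10L));
        // [5, 50, 40, 30, 20, 10, 5]
    }
}
```

This is why the choice of `expectedInterval` matters: pass in a value derived from average (or median per-second) throughput and the correction approximates what a target-throughput run just below saturation would have measured.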