Allow option for HdrHistogram post-correction for coordinated omission #731

Open
toddlipcon opened this issue Apr 27, 2016 · 5 comments

@toddlipcon (Contributor)

Currently, the approach taken to compensating for coordinated omission is to measure all operations from their "scheduled" start point rather than their "actual" one. This means that a target throughput needs to be set ahead of time.

This can be a bit difficult when tuning DB configuration parameters for throughput, since we don't know up front how fast the target system will run, and thus can't set a reasonable target.

For workloads like this, it would be nice to use HdrHistogram's post-correction feature[1] and pass in something like the average throughput (or perhaps the median of per-second throughput samples, to be more robust). This wouldn't be as accurate as setting a target throughput, but it is more accurate than not correcting at all (the current default).

Ping @nitsanw for thoughts (@busbey suggested you might be interested in this one)

[1] http://hdrhistogram.github.io/HdrHistogram/JavaDoc/org/HdrHistogram/AbstractHistogram.html#copyCorrectedForCoordinatedOmission(long)
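
Something like the following sketch, for illustration only: it assumes per-op latencies are recorded in nanoseconds in an HdrHistogram, and `totalOps` / `runtimeNanos` are stand-in bookkeeping values, not existing YCSB fields.

```java
import org.HdrHistogram.Histogram;

public final class PostCorrectionSketch {
    // Post-correct a raw latency histogram using the run's average throughput
    // as the expected interval between operations.
    public static Histogram correctWithAverageThroughput(Histogram rawHistogram,
                                                         long totalOps,
                                                         long runtimeNanos) {
        // Average throughput over the run, in ops/sec (ideally with the
        // warmup period excluded so it doesn't drag the average down).
        double avgOpsPerSec = totalOps / (runtimeNanos / 1e9);

        // Expected interval between operations at that rate, in the same
        // unit the histogram records (nanoseconds here).
        long expectedIntervalNanos = (long) (1e9 / avgOpsPerSec);

        // HdrHistogram's post-correction: for each recorded value larger than
        // the expected interval, it synthesizes the samples that coordinated
        // omission would have hidden.
        return rawHistogram.copyCorrectedForCoordinatedOmission(expectedIntervalNanos);
    }
}
```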

@toddlipcon (Contributor, Author)

Another thought for estimating the 'expected interval' is to use the median latency as recorded in the HdrHistogram.
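
Roughly like this (same assumptions as the sketch above; in a closed loop a thread's gap between requests is about its typical latency, so the p50 is one plausible stand-in for the expected interval):

```java
import org.HdrHistogram.Histogram;

public final class MedianCorrectionSketch {
    // Use the median recorded latency as the expected interval.
    public static Histogram correctWithMedian(Histogram rawHistogram) {
        long medianNanos = rawHistogram.getValueAtPercentile(50.0);
        return rawHistogram.copyCorrectedForCoordinatedOmission(medianNanos);
    }
}
```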

toddlipcon added a commit to toddlipcon/YCSB that referenced this issue Apr 27, 2016
@toddlipcon (Contributor, Author)

Put up a WIP patch that implements the proposal above, and it does seem to give me more reasonable "corrected" numbers (e.g. when I'm testing configurations that I know cause stalls on the server side, I now see a high p99).

@nitsanw (Contributor) commented Apr 28, 2016

Hi, thanks for pinging :-)
Post-correction for a particular rate is a problematic proposition, for several reasons. Thinking out loud, and in no particular order, here are some issues and observations:

  • If you choose the average throughput, you should discard the warmup period; otherwise it will lower the average and make you under-correct.
  • At max throughput, latency has to grow, because the system is at capacity: every new request adds to the delay of the requests queued behind it. It follows that the expected behaviour at max throughput is latencies on the scale of the test length (once the test has reached a steady 'max throughput' state). If the max latency is not growing as your test progresses, the system is not at capacity.
  • It would be better, IMO, to try to build an exploratory mode into YCSB which, given an SLA, pushes the throughput until the SLA is breached, i.e. trying to answer the question of the max acceptable load (a rough sketch follows below). This is not so hard to do, I think.
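
Something along these lines, purely as an illustration (`runAtTarget` is a hypothetical hook, not an existing YCSB interface):

```java
import org.HdrHistogram.Histogram;

public final class SlaExplorer {
    // Hypothetical hook: run a timed, rate-limited workload at a target
    // throughput and return its latency histogram (values in nanoseconds).
    interface WorkloadRunner {
        Histogram runAtTarget(long targetOpsPerSec);
    }

    // Push the target throughput up in fixed steps until the SLA percentile
    // is breached, then report the last target that still met the SLA.
    static long findMaxLoadUnderSla(WorkloadRunner runner,
                                    double slaPercentile,
                                    long slaLimitNanos,
                                    long startOpsPerSec,
                                    long stepOpsPerSec) {
        long lastGoodTarget = 0;
        for (long target = startOpsPerSec; ; target += stepOpsPerSec) {
            Histogram h = runner.runAtTarget(target);
            if (h.getValueAtPercentile(slaPercentile) > slaLimitNanos) {
                return lastGoodTarget; // SLA breached; the previous target is the max acceptable load
            }
            lastGoodTarget = target;
        }
    }
}
```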

@toddlipcon (Contributor, Author)

  • Regarding average throughput: from my testing, choosing the median latency as the expected interval seems to be fairly robust to warmup.

I'm not sure I agree with your point that latency has to grow at max throughput. That seems to be true in an open system, but YCSB is a closed system, right? With the closed system and enough threads, it's likely that YCSB saturates the system and doesn't cause uncorrected latencies approaching infinity. The idea with this issue is to do something that would reasonably approximate the latencies that would be seen if the test were re-run with a target just below the max throughput determined by the closed system, without actually re-running.

  • I don't think the "exploratory" mode is so simple, actually, since a lot of the databases under test with YCSB are heavily dependent on deferred work. For example, a system might hum along at 90K QPS with 1ms latencies for 20 minutes and appear stable, while the daemon's memory usage is actually climbing slowly. Then it hits some wall and basically stops all requests for a minute or two while it catches its breath.

    Even on read-only workloads, behavior like the above can be seen, though often in an "improving" direction instead of a performance crash. For example, if you initiate a read-only workload immediately after finishing the load, performance may be slow as compaction is going on in the background for tens of minutes. Once it completes, a dramatic drop in latencies can be observed.

Given the above behaviors, I think any "exploratory" mode would have to coordinate with the system under test to reset all the data, re-load everything, and run the benchmark on the order of hours for each target throughput, don't you think?

@nitsanw
Copy link
Contributor

nitsanw commented Apr 28, 2016

"I'm not sure I agree with your point that latency has to grow at max throughput. That seems to be true in an open system, but YCSB is a closed system, right? With the closed system and enough threads, it's likely that YCSB saturates the system and doesn't cause uncorrected latencies approaching infinity."
Let's assume a system has an average capacity of 1M ops/sec. Any momentary dip in performance is expected to lead to a setback in latency for all ops going forward. The effect of every dip compounds the effect of previous dips and can only be corrected by a matching surge in performance. E.g. (rt is response time):
0:01 1M ops/sec
0:02 0.1M ops/sec
0:03 1M ops/sec (0.9M requests trailing, so expect normal rt + 0.9s)
0:04 1M ops/sec (0.9M requests trailing, so expect normal rt + 0.9s)
0:05 1.9M ops/sec (the 0.9M surplus catches up with the previous dip)
0:06 1M ops/sec (expect return to normal rt)
In my experience, dips/hiccups of a large scale are far more common than surges of a large scale. While it all averages out, the tail of corrected latencies will grow to infinity (given an infinite time to run the system + a pony).
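
As a toy model of the timeline above (my own illustration: constant arrivals of 1M ops/sec against the per-second capacities listed, with each dip leaving a backlog that only a matching surge drains):

```java
public final class BacklogToy {
    public static void main(String[] args) {
        double arrivalsPerSec = 1.0;   // millions of ops arriving per second
        double capacityPerSec = 1.0;   // nominal capacity, millions of ops per second
        double[] servedPerSec = {1.0, 0.1, 1.0, 1.0, 1.9, 1.0}; // ops actually served each second (millions)
        double backlog = 0;            // millions of ops still waiting
        for (int sec = 1; sec <= servedPerSec.length; sec++) {
            backlog = Math.max(0, backlog + arrivalsPerSec - servedPerSec[sec - 1]);
            // Extra response time is roughly backlog / capacity: 0.9M trailing ops
            // behind a 1M ops/sec server add about 0.9s to every later request.
            System.out.printf("0:%02d backlog=%.1fM ops, extra rt ~ %.1fs%n",
                    sec, backlog, backlog / capacityPerSec);
        }
    }
}
```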

I understand the pressure to avoid repeating runs and calibration, but I don't think you can get a good idea of LUL (latency under load) from a pure throughput run. In fact, it is far from trivial to conclude much about a system's LUL characteristics without several LUL runs at different loads. As you point out, runs need to be "sufficiently" long to be representative; it all adds up to a large electricity/time/AWS bill :-(

"I don't think the "exploratory" mode is so simple" - I agree, my comment was rash and poorly thought out. For a read load (or rather for a non-steady-state load) this is doable, but it is not clear how long you'll have to run to get an answer (increase load until the SLA is breached, decrease load to meet the SLA with some margin, measure over a long period, increase load by a small amount, measure over a long period, etc.). You can settle for a lower-granularity figure (i.e. a larger minimal load increment) to improve convergence time. I've built something similar for a messaging system once; it was a hoot, and also sort of useful.

I think this makes for an interesting observation: systems have several modes, and the profile of each mode is not necessarily composable. What I mean by that is that the system's behaviour while no compaction is going on is one mode, during compaction is another, during node recovery a third, etc. You can't really compose the lot into a single profile without losing a lot of meaning.
Having said all this, maybe I am wrong and there's some way to post-process the histogram logs and come up with an estimated "Max Tolerable Throughput" number. The way to validate this claim is by comparing your projected profile with an LUL profile. I am curious to hear the results :-)
