latency vs u_latency #54

Closed
ckstettler opened this issue Mar 24, 2018 · 6 comments

@ckstettler

I don't understand the reasoning behind the corrected and uncorrected response times. I created a simple case where the server delays 1 second.

I started wrk with 1 thread, 1 connection, and a target rate of 5 tps.

Obviously the rate can't be achieved, but the response time should be 1 second. When I run for 30 seconds, I find that the average response time is reported as 16 seconds, and the rate is indeed only 1 tps.

I don't understand why it would be 16 seconds. The server is guaranteed to return in 1 second, and if I print out the u_latency it is indeed 1 second. What is the reasoning behind the corrected values? We have a fork of this repo that dumps stats on intervals, and the behavior is that the average keeps increasing forever when the server can't keep up with the client's rate. I find that value not useful and misleading; the rate not meeting the target is enough to tell me the server can't keep up.

I made a local change to the code to use u_latency instead of latency for the summary stats, and it shows what I would expect to see. I would like an option to select which one appears in the summary stats. Maybe an option for --u_latency and --latency and another for --details, or something along those lines. Is that something that would be of interest to you?

@giltene
Owner

giltene commented Mar 24, 2018

The service time is 1 second. The response time is 1 second plus whatever amount of time your request was waiting between when it was scheduled to start (according to your 5 tps schedule) and when it actually got started. The uncorrected latency measures service time, while the corrected latency measures response time.

Response time is the time it takes for a request to get served, as seen by the client. In your setup, you made sure only one request can be served at a time (only 1 connection), and you scheduled client requests to start every 0.2 seconds (5 tps), making sure that the rate of incoming requests is higher than the server can possibly serve. This means that the longer the test runs, the longer the list of "client is waiting but we haven't even started on their request" gets, and from the client's point of view, the [response time] clock for each such client request is already ticking. The service time for each of those requests (in your setup) will be 1 second. But the response time (the time from when the request was scheduled to start according to your 5 tps test schedule until a response occurs) will grow over time: the first request will be served 1 second from its scheduled start time. The second will be served 1.8 seconds from its scheduled start time ((N * 1.0) - ((N-1) * 0.2)). The 31st request [issued 30 seconds into the timing, right at the end of your 30-second run] will be served 25 seconds from its scheduled start time. The average response time of the 31 requests will be (N*(N+1)/2)/N seconds, which is exactly 16 seconds....
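To make the arithmetic concrete, here is a minimal sketch (plain C, not wrk2's code) that prints the service time and the response time for each of the 31 requests under that schedule:

```c
#include <stdio.h>

/* Sketch of the arithmetic above: one connection, a constant 1.0 s
 * service time, and requests scheduled every 0.2 s (5 tps). */
int main(void) {
    const double service_time = 1.0; /* server always answers in 1 second */
    const double interval     = 0.2; /* scheduled gap between requests    */
    const int    requests     = 31;  /* requests served in a 30 s run     */

    for (int n = 1; n <= requests; n++) {
        double scheduled_start = (n - 1) * interval; /* per the 5 tps plan */
        double completion      = n * service_time;   /* serial 1 s service */
        printf("req %2d: service = %.1f s, response = %.1f s\n",
               n, service_time, completion - scheduled_start);
    }
    return 0;
}
```

The service column stays at 1 second while the response column grows linearly toward 25 seconds, which is why the reported average keeps climbing the longer the run goes.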

To see why this is obviously the right way to measure response time (as observed by clients), imagine the following:

When you go to a coffee shop that has one (single-threaded) barista who takes 1 minute to make a cup of coffee, and 5 people show up and start waiting in line every minute, and you stop "starting new coffees" after 30 minutes (you shut the door after 30 minutes even though there are people waiting in line, and only serve the people you started serving in those 30 minutes), then the 31st person in line (the last person to be served, who came in 6 minutes after you opened) will end up waiting 25 minutes for their cup of coffee. When asked "how long did it take you to get a cup of coffee here?", that last person would say "25 minutes", while the first person would say "1 minute". On average (across all 31 people whose coffee got started in the 30-minute window), the answer would be "16 minutes" ((31*32/2) / 31).

@giltene
Owner

giltene commented Mar 24, 2018

The value of this measurement is much more apparent when you load test servers that can "on average" keep up with the incoming request rate, but have temporary periods during the run where they don't. E.g., when a service slows down temporarily for some housekeeping work like re-indexing, re-loading catalogs, pausing for garbage collection, or yielding the CPU to another service on the same machine, the rate of incoming requests can be temporarily higher than the service rate, and a queue of started (from the client's point of view) but not-yet-served requests will build up. If the server is able to (on average) serve at a higher-than-incoming rate, this queue will be drawn down over time, but each request that a client started during that time will experience a longer response time.

The simplest way to demonstrate the effect is to run a different version of your test: drop your server delay from 1 second to 0.1 seconds, so that your server can handle 2x the incoming 5 tps request rate. Now run your 30-second test, but 10 seconds into the run, ^Z the server, and then let it keep running (with "fg") after 5 seconds (or use "kill -STOP ; sleep 5; kill -CONT " for reliable timing). Now compare the response time reported with the service time reported, and think about which one represents what happened from the client's point of view... Then imagine the same thing happening in the real world, but with occasional 50 msec or 150 msec blips instead of 5-second blips (such that none of your normal tools observe them happening), and ask which number tells you how your service's response time actually behaved....
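If you want to see the effect without touching a real server, here is a rough, self-contained simulation of that experiment (plain C, not wrk2's code; the 0.1 s service time, 5 tps schedule, and 5-second pause starting at the 10-second mark are the values described above):

```c
#include <stdio.h>

/* Toy model: single connection, 0.1 s service time, requests scheduled
 * every 0.2 s (5 tps), and the server frozen from t = 10 s to t = 15 s.
 * Prints average and max for the uncorrected (service time) and
 * corrected (response time) latencies. */
int main(void) {
    const double interval = 0.2, service = 0.1;
    const double pause_begin = 10.0, pause_end = 15.0;
    const int    requests = 150; /* 30 s at 5 tps */

    double prev_done = 0.0;
    double sum_u = 0, sum_c = 0, max_u = 0, max_c = 0;

    for (int n = 0; n < requests; n++) {
        double scheduled = n * interval;
        double send = scheduled > prev_done ? scheduled : prev_done;
        double done;
        if (send < pause_end && send + service > pause_begin)
            done = pause_end + service; /* overlaps the pause: finishes after CONT (approx.) */
        else
            done = send + service;

        double uncorrected = done - send;      /* service time, as in u_latency    */
        double corrected   = done - scheduled; /* response time, as in the summary */

        sum_u += uncorrected; sum_c += corrected;
        if (uncorrected > max_u) max_u = uncorrected;
        if (corrected   > max_c) max_c = corrected;
        prev_done = done;
    }
    printf("uncorrected: avg %.2f s, max %.2f s\n", sum_u / requests, max_u);
    printf("corrected:   avg %.2f s, max %.2f s\n", sum_c / requests, max_c);
    return 0;
}
```

Only the single request that was in flight when the pause hit shows the stall in the uncorrected numbers; every request scheduled during the pause and the drain afterwards shows it in the corrected numbers, which is the difference a real client would feel.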

@ckstettler
Author

Thank you for the response. I get what you are saying. It's kind of like the wind chill factor in a weather report: you are reporting what it will feel like to the client due to the queuing.

I was having trouble understanding why it reported an ever-increasing response time, but I think I get it now. In other tools I have used, the virtual users are slowed by the server's delay. So 5 virtual users attempting 1 tps would be throttled, and I would generally see something like 5 seconds per request from each instead of 1 second. It was more of a multiplier.

In your model you compute response times as if the rate never slows. In my case, if the incoming rate is constant at 5 tps and I am servicing at 1 tps, then 5 more requests show up before I can service one, and by the time the first 5 are processed I have another 25+ in line, hence the ever-growing latency.

I still think it would be nice to have a choice of which metric to display in the reporting. You provide a means to see the u_latency details, but you don't allow it in the summary data. I see why you have implemented it your way, but I still think about it the other way.

Again, thank you for your time and for the tool. It has been fun dabbling in C again!!

@ckstettler
Author

Just one more note and I will let this one go. The other thing that confused me was that even though the response times were being recorded as 200 seconds, there were never any timeouts, which told me that the round trip was actually pretty fast. So I assume timeouts are computed on the uncorrected data.

@bendizen

bendizen commented Apr 3, 2018

@ckstettler I came across the situation you described in a 1-hour test using wrk2, caused by a pause for garbage collection. Gil's explanation makes sense, yet I also agree it would be helpful to see the u_latency in the summary data. I still don't think I understand the reasoning for why no timeouts are recorded in this situation.

@giltene
Owner

giltene commented Apr 3, 2018

Internal timeout counts in the load generator have no relation to the timeouts that clients of the tested service would experience.

It would certainly be possible to report on the timeouts that actual clients would experience under the load scenario, assuming those clients time out when not receiving a response within X time after starting a request. But that would simply amount to reporting "what % of (or how many total) requests had a response time greater than X". Since high-fidelity percentile output already exists for the response time metric, and those percentiles effectively include all the timeout information you could possibly want (without forcing you to choose a specific timeout level for the test), I don't see much point in adding correct time modeling for a specific timeout level.
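To illustrate that last point, here is a tiny sketch (plain C, with made-up sample values) of how a "timeout count" for any threshold X falls out of the response-time distribution itself:

```c
#include <stdio.h>
#include <stddef.h>

/* Given recorded response times, the "timeout rate" at threshold x is
 * just the fraction of samples above x -- i.e., the tail that the
 * percentile output already describes. Sample values are made up. */
static double timeout_fraction(const double *samples, size_t n, double x) {
    size_t over = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] > x)
            over++;
    return (double)over / (double)n;
}

int main(void) {
    double response_times[] = { 0.1, 0.1, 0.2, 0.1, 5.1, 4.9, 0.1, 0.1 };
    size_t n = sizeof response_times / sizeof response_times[0];
    printf("%.0f%% of requests would have timed out at X = 1.0 s\n",
           100.0 * timeout_fraction(response_times, n, 1.0));
    return 0;
}
```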

giltene closed this as completed on Aug 18, 2019