HTTP: response time measurement omits time to connect and send request #2077

Closed
ceeaspb opened this issue Aug 12, 2014 · 19 comments

@ceeaspb ceeaspb commented Aug 12, 2014

In 2.0 RC2 (likely the same for all previous versions): running @giltene's simple coordinated omission test, Gatling does not capture the server freeze in the response time percentiles.

E.g. I run the following wrk test:

$ wrk --latency -d 10s -t 1 -c 1 "<url>"
Running 10s test @ <url>
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.67s     2.93s    6.01s    61.09%
    Req/Sec   592.26    746.02     1.67k    62.88%
  Latency Distribution
     50%    6.01s 
     75%    6.01s 
     90%    6.01s 
     99%    6.01s 
  5760 requests in 10.00s, 2.04MB read
  Socket errors: connect 0, read 0, write 0, timeout 2

and freeze the server for 6 seconds:

kill -TSTP 7090;sleep 6;kill -CONT 7090

We can see from the above that the median is calculated correctly.

When running the following Gatling workload, similar to the wrk test above:

constantUsersPerSec(500) during(10 seconds)

the measurements do not capture the server freeze:

---- Global Information --------------------------------------------------------
> request count                                       5000 (OK=4981   KO=19    )
> min response time                                      1 (OK=1      KO=554   )
> max response time                                   6017 (OK=6017   KO=995   )
> mean response time                                   135 (OK=132    KO=715   )
> std deviation                                        829 (OK=830    KO=147   )
> response time 50th percentile                         10 (OK=10     KO=631   )
> response time 90th percentile                         38 (OK=37     KO=958   )
> mean requests/sec                                  299.7 (OK=298.5  KO=1.14  )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                          4881 ( 98%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                          100 (  2%)
> failed                                                19 (  0%)
---- Errors --------------------------------------------------------------------
> java.net.ConnectException: Connection reset by peer to http://     19 (100.0%)
<url>

I suspect this may be because the measurement is not started before the connection is initiated? I haven't looked at the code yet.
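For reference, here is a minimal sketch of the kind of simulation that drives this injection profile; the base URL, class name and request name are placeholders rather than the exact test I ran:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Minimal sketch; adjust the base URL to your server under test.
class FreezeSimulation extends Simulation {

  val httpProtocol = http.baseURL("http://localhost:7090")

  // one request per virtual user
  val scn = scenario("open model").exec(http("request").get("/"))

  setUp(
    scn.inject(constantUsersPerSec(500) during (10 seconds))
  ).protocols(httpProtocol)
}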

@pdalpra pdalpra commented Aug 12, 2014

It seems that, as you suspected, the measurements do not start until the connection is initiated. As this would have to do with Async HTTP Client, it'll have to wait until @slandelle can have a look at it ;)

@ceeaspb ceeaspb commented Aug 12, 2014

Thanks @pdalpra - just for my learning, could you point out the affected line(s)?
You're doing a good job of holding the fort whilst he's away.

@pdalpra pdalpra commented Aug 12, 2014

@ceeaspb: Ah, my bad; after checking a bit further, the culprit is ResponseBuilder and how it is updated.
Gatling doesn't update those timestamps until it has at least received the response's status.

However, changing that would change the whole meaning of those timestamps: it would no longer be the time when the first byte is received, but the time when we started waiting for a response.
This is already expressed in the reports: the request's latency, which corresponds to firstByteReceived - lastByteSent.
And this is also what shows up in wrk's output, which otherwise does not display the response time.

However, this does not show up in the console output.
So, from a possible bug in how Gatling handles request/response timings, we're moving to a feature request to extend the console output ;)
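For clarity, an illustrative sketch of how the two metrics being discussed relate; the field names are illustrative, not the actual ResponseBuilder internals:

// Illustrative only: how the two metrics relate, given timestamps in milliseconds.
case class RequestTimings(firstByteSent: Long, lastByteSent: Long,
                          firstByteReceived: Long, lastByteReceived: Long) {
  // "latency" as reported: time spent waiting once the request has been fully sent
  def latency: Long = firstByteReceived - lastByteSent
  // "response time": from the start of sending to the last byte received
  def responseTime: Long = lastByteReceived - firstByteSent
}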

And thanks for your kind words, I'm doing my best ;)

@giltene giltene commented Aug 13, 2014

I'd still call it a bug in how percentiles and other latency distribution information are reported. This is a case of a known server behavior (an induced 6 second freeze) that is simply not reported in the measurements. E.g. the 90% number is clearly off by 2+ orders of magnitude, and the reported percentage of t < 800 ms (98%) is also clearly very wrong. Whatever the mechanism that causes this bug, the measurements for all latencies and percentiles are carving out that 6 second event and acting as if it never happened.
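A back-of-the-envelope check (my own arithmetic, assuming a constant open-model arrival rate and that nothing is served while the server is frozen) shows why: with a 6 second freeze in a 10 second test, roughly 60% of the requests arrive during the freeze and must wait for it to end, so every percentile above roughly the 40th should be measured in seconds:

// Back-of-the-envelope sketch; assumes constant arrival rate and a fully frozen server.
val testDurationSec = 10.0
val freezeSec = 6.0
val affectedFraction = freezeSec / testDurationSec          // ~0.6 of all requests
val firstAffectedPercentile = (1 - affectedFraction) * 100  // ~40th percentile and above
println(s"~${(affectedFraction * 100).toInt}% of requests should see multi-second latency")
println(s"so percentiles above ~${firstAffectedPercentile.toInt} should be in seconds")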

@ceeaspb ceeaspb commented Aug 13, 2014

I agree this is a bug, and it certainly needs to be fixed before 2.0 is released.
Measurement is at the core of perf test tools; without confidence in the numbers, all the excellent work on the DSL, etc. is for nothing.

I disagree that latency (or whatever we want to call the time the client had to wait) is firstByteReceived - lastByteSent.

This would omit exactly the time at issue in the test above, which is the time to connect.
It is also not a standard measurement - for example, web page measurement goes back as far as the unload of the previous page, precisely because if it takes time to unload the previous page, it will impact the user's experience.

https://dvcs.w3.org/hg/webperf/raw-file/tip/specs/NavigationTiming/Overview.html#processing-model

Where to start the timing is not obvious; see for example:
http://lists.w3.org/Archives/Public/public-web-perf/2014Jun/0048.html

Therefore, I would strongly advise changing the latency to

firstByteReceived - firstByteSent

which should start (I am currently assuming) before the HTTP request is sent and before the TCP connection is initiated.

This is a critical bug in 2.0.

@ceeaspb ceeaspb commented Aug 13, 2014

Regarding updateFirstByteSent: I'm also not sure why we would want to update the timestamp of the first byte. That seems fairly risky.

@ceeaspb ceeaspb commented Aug 13, 2014

I have opened a separate ticket for documentation in #2079

@ceeaspb ceeaspb commented Aug 13, 2014

So I have been trying a simple change to eliminate this risk, i.e. the latency ignoring the impact on user experience when the connect time is omitted:

firstByteReceived - beforeRequest
gatling-http/src/main/scala/io/gatling/http/response/ResponseBuilder.scala
-  var firstByteSent = nowMillis
+  var beforeRequest = nowMillis

where beforeRequest is measured before the request is passed to AHC (the async HTTP client atop Netty).
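A rough sketch of the intent, not the actual Gatling code path: the timer starts before the request is handed to AHC, so connection establishment and request write time are included. Names here are hypothetical:

// Hypothetical sketch only; method and variable names are illustrative.
object ProposedTiming {
  def nowMillis: Long = System.currentTimeMillis()

  // sendRequest hands the request to the HTTP client and returns the
  // timestamp at which the first response byte was received
  def responseTime(sendRequest: () => Long): Long = {
    val beforeRequest = nowMillis      // captured before the request reaches AHC
    val firstByteReceived = sendRequest()
    firstByteReceived - beforeRequest  // includes connect + send + server + network time
  }
}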

Good news and not-so-good news:

  1. The good news is that under moderate load (i.e. 50 requests per second) on my MBP the test roughly passes
    (slightly amended to make the maths easier...).

The ^Z freeze is for 5 seconds out of the 10 second test.
The percentiles reported are the 75th and 62nd, which should report around 2500ms and 1250ms respectively.

---- Global Information --------------------------------------------------------
> request count                                        500 (OK=500    KO=0     )
> min response time                                      2 (OK=2      KO=-     )
> max response time                                   4996 (OK=4996   KO=-     )
> mean response time                                  1393 (OK=1393   KO=-     )
> std deviation                                       1721 (OK=1721   KO=-     )
> response time 62nd percentile                       1311 (OK=1311   KO=-     )
> response time 75th percentile                       2725 (OK=2725   KO=-     )
> mean requests/sec                                  49.98 (OK=49.98  KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           284 ( 57%)
> 800 ms < t < 1200 ms                                  12 (  2%)
> t > 1200 ms                                          204 ( 41%)
> failed                                                 0 (  0%)

We can see some bunching in the response time distribution, though. This could be caused by my MBP; better to do the next test on a Linux machine.

[screenshot: response time distribution, 2014-08-13 at 16:25]

  2. The not-so-good news, which could just be my test rig setup (not a production one), is that the bunching gets more pronounced as the arrival rate increases - here at 500 ups (users per second, 1 request per user):
> response time 62nd percentile                         37 (OK=37     KO=-     )
> response time 75th percentile                       2526 (OK=2526   KO=-     )

[screenshot: response time distribution, 2014-08-13 at 16:30]

It's a first stab, in any case.

ceeaspb pushed a commit to ceeaspb/gatling that referenced this issue Aug 13, 2014
@ceeaspb ceeaspb commented Aug 14, 2014

I will change the title of this as I think this is not a CO (coordinated omission) defect, just a question of where to sample the start/end timestamps for the latency measurements.

@ceeaspb ceeaspb changed the title from "gatling fails CO test compared with wrk" to "HTTP: response time measurement omits time to connect and send request" Aug 14, 2014
@ceeaspb ceeaspb commented Aug 14, 2014

Thanks @pdalpra.
A minor note, since wrk is mentioned here as being "correct": looking again at the wrk output, it looks wrong also, but likely for other reasons. I raised a defect on that project too.

@slandelle slandelle commented Aug 23, 2014

@ceeaspb #2125 makes Gatling account for connect time in the response time.

@ceeaspb ceeaspb commented Aug 24, 2014

I'll retest, but the source code needs cleaning up per my pull request, e.g. at least remove the update of the first byte sent.
I'll send another pull request.

@slandelle slandelle commented Aug 27, 2014

@ceeaspb Did you get any chance to test?

@ceeaspb ceeaspb commented Aug 27, 2014

@slandelle yes looks good.

> response time 62nd percentile                       1296 (OK=1296   KO=-     )
> response time 75th percentile                       2715 (OK=2715   KO=-     )

I think there's more work to do, like the comment above about the bunching in the percentile distribution, but as a first step it does what it is supposed to do. LGTM.

@ceeaspb ceeaspb closed this Aug 27, 2014
@slandelle slandelle commented Aug 27, 2014

I'll have to have a proper second look at CO, but it looks to me that in our case it's more a matter of accuracy caused by the temporal bucket width: events much shorter than the bucket width might go unnoticed because they will be outweighed. If the motto is "percentiles are not a silver bullet", I agree.

@glennmartinez glennmartinez commented Jul 22, 2015

Is Gatling's response time equal to responseEnd, or is it timeToLastByte?

@tacoben tacoben commented Mar 31, 2016

Question: does this mean that CO is now resolved in Gatling for high load? Or does it mean that in high-load scenarios CO is still an issue? I'm having some trouble figuring this out from the comments.

@slandelle slandelle commented Apr 1, 2016

@tacoben Maybe @giltene will disagree on this and will want to correct me, but here's my take:

CO is a (load test design / load test tool implementation) issue.
The root cause is that you might be introducing an artificial coordination between the client (the load test run) and the server (the system under test).

This happens when you try to execute a bunch of unrelated requests sequentially in a loop (or multiple concurrent loops). If the system under test starts lagging, the next request won't be started in due time.

IMO, the root cause here is: "why were those unrelated requests executed sequentially in the first place???". Those should have been scheduled independently!

Beware that it's your job to figure out if your requests are unrelated or not. For example:

  • the tons of orders executed in a HFT system are probably unrelated
  • requests from different virtual users on a website are usually unrelated
  • requests from a given virtual user on a website are definitely related: a human won't proceed to the next page before receiving the previous one!

This CO issue looks very similar to me to the difference between closed and open workload models.

In a closed model, you only have a given number of users in your system, and a new user can't enter the system until another one exits.
This model is unrealistic in most cases. It's only valid when you have a queue in front of your system that allows your users to actually enter.
So far, I've only seen this kind of application in some ticketing platforms: when there are too many customers, you can't buy until your number is called.
The consequence of this model is that throughput drops when latency increases, so the test "self heals".

In most systems, you want an open workload, where new users keep on arriving whatever your latency, possibly causing your system under test to crash.

Back to Gatling: Gatling emphasizes the open workload model: you schedule the virtual users' arrival rate.

If you really want a closed workload model, you can do what most other tools do: wrap your scenario in a loop, and reset your virtual users on each iteration.
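To make the difference concrete, an illustrative sketch of both models in the Gatling 2.x DSL; the URL, numbers and scenario names are placeholders:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class WorkloadModels extends Simulation {

  val httpProtocol = http.baseURL("http://localhost:8080")

  // Open model: virtual users arrive at a scheduled rate, whatever the latency.
  val openScn = scenario("open").exec(http("request").get("/"))

  // Closed model: a fixed pool of users, each looping over the scenario.
  val closedScn = scenario("closed").forever(exec(http("request").get("/")))

  setUp(
    openScn.inject(constantUsersPerSec(100) during (10 minutes))
    // closed alternative: closedScn.inject(atOnceUsers(50))
  ).protocols(httpProtocol)
}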
