This issue was moved to a discussion.
Does Artillery prevent coordinated omission? #721
In a word, 'no', it doesn't suffer from coordinated omission. Artillery has a concept of virtual users which arrive and make requests independently of other virtual users. A given user will indeed wait for request 1 to complete before sending request 2, but another virtual user will be trying to send its own requests at the same time too.
@hassy I stand to be corrected, but I don't believe that is enough to avoid CO. Specifically, how is the response time for user 1 measured when they make their next request? Or does the response time get measured elsewhere?
@hassy, on reflection you might sidestep CO if each virtual user only ever makes one request - but then how realistic are the test results as a representation of a user's experience? Granted, Artillery seems to focus on generating load for testing. But at least this use case is currently ruled out:
And this feature is not one:
@bbros-dev - a single virtual user sends requests in a sequence, just like most real-world clients would, so a single VU is a "closed loop" and would indeed "coordinate" with an overloaded/slow server. However, other VUs are completely independent, so new VUs will continue to arrive, and existing VUs will continue trying to send their next request regardless. Re smoke testing - this official plugin enables smoke testing with Artillery: This doc describes how you can track custom metrics:
@hassy, appreciate the clarification.
Again, happy to be corrected: while VU arrivals may be independent of each other, I don't see that eliminating the effects of CO. Correcting myself here: single-request, independent VUs don't mean that:
Your infrastructure is knotted/slow/stressed. VU A has its response data subject to CO. VU B arrives independently of VU A... say it arrives after the cause of VU A's CO has passed... surely this is a counterexample to your claim that multiple VUs protect against CO?
I don't believe this has the effect you hope - the counterexample above showed that - but it should make clear what would need to be true instead: the data you report from these one-shot requests is likely less subject to CO, but as soon as you relax the one-request-per-VU constraint you'll be back to reporting noise. The dependent VU arrival suggests a possible quick fix for the one-request-per-VU scenario. Anyway, this issue is open, so it seems to be acknowledged as an open issue.
Artillery's arrival rates are an open system. CO does not affect open-loop systems, because new work arrives regardless of what's being processed. CO can happen only when new work is not submitted until work already in the queue has been completed - this is why it affects closed-loop tools. Consider what would happen if you ran Artillery with arrival rate = 50 on a service which locks up completely for 5s at some point, but responds in 1ms at all other times. You'd have up to 50 * 5 = 250 VUs recording and reporting outsized response times in that time period.
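To illustrate the arithmetic above, here is a toy discrete-event sketch (my own simplified model, not Artillery's actual engine) comparing what a single closed-loop client records versus an open-loop arrival process, when the server stalls for 5 s during an otherwise 1 ms-latency test:

```python
# Toy model: server answers in 1 ms, except it locks up completely
# between t=30s and t=35s. Latencies are in seconds.

def closed_loop_samples(duration_s=60, normal_ms=1, stall_at=30.0, stall_len=5.0):
    """One client that sends the next request only after the previous completes."""
    samples, t = [], 0.0
    while t < duration_s:
        if stall_at <= t < stall_at + stall_len:
            # A request landing in the stall waits until the stall ends.
            latency = (stall_at + stall_len - t) + normal_ms / 1000
        else:
            latency = normal_ms / 1000
        samples.append(latency)
        t += latency  # coordination: no new work is sent while waiting
    return samples

def open_loop_samples(duration_s=60, rate=50, normal_ms=1, stall_at=30.0, stall_len=5.0):
    """New virtual users arrive at a fixed rate regardless of in-flight requests."""
    samples = []
    for i in range(int(duration_s * rate)):
        t = i / rate  # arrival times are fixed up front
        if stall_at <= t < stall_at + stall_len:
            latency = (stall_at + stall_len - t) + normal_ms / 1000
        else:
            latency = normal_ms / 1000
        samples.append(latency)
    return samples

closed = closed_loop_samples()
opened = open_loop_samples()
bad_closed = sum(1 for s in closed if s > 1.0)
bad_open = sum(1 for s in opened if s > 1.0)
# The closed loop records the stall as a single outlier; the open loop
# records hundreds of stall-affected samples (up to 50 * 5 = 250 arrivals).
print(bad_closed, bad_open)
```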
Not really, the issue status does not necessarily mean acknowledgement (especially before GitHub had support for a separate discussions area). Just for the sake of clarity for anyone else reading this discussion - Artillery does not suffer from coordinated omission. I will write this up for Artillery's docs, and we can then close the issue.
Something else worth clarifying - if you need to test a single endpoint with constant RPS, don't use Artillery, use wrk2 or Vegeta or autocannon instead. Artillery is designed for testing transactional scenarios with dependent steps, e.g. mimicking a large number of clients using an e-commerce API where each client:
Such real-world scenarios by definition cannot be tested at pre-set constant RPS, because requests depend on each other. But just like in the real world, Artillery lets you model a large number of those clients arriving to use the service independently of each other. And on another note, for anyone looking to understand the difference between open and closed models, and why Artillery's hybrid open-closed model is the best model for most real-world scenarios, this paper is a good read: https://www.usenix.org/legacy/event/nsdi06/tech/full_papers/schroeder/schroeder.pdf
@hassy thanks for taking the time to clarify. I believe we differ by degree: I think if you carefully configure and operate Artillery you can likely mitigate CO. If I understand correctly, you suggest that regardless of configuration and use, Artillery is immune from CO?

Restating the problem: we're trying to uncover the unknown distribution of the latency of some system - possibly made up of many sub-systems, steps or processes that run sequentially and/or in parallel - from the PoV of an end user. The better solution is to strip out the randomness of the arrival process. You can do that by ensuring requests are wholly predictable. Any deterministic process would be fine, but the easiest is to use a fixed interval at which requests are started. With this you recover the empirical distribution of the system latency, and leave aside the question of whether this is the true distribution - this is why we use RPS in the 100s or 1000s: tail estimation is hard, and you need very, very large samples before detecting changes in your tail behaviour becomes reliable.

That solution has nothing to do with thread counts. It has nothing to do with open or closed anything. Higher thread counts, conditional on how many requests they queue, can mitigate the effects of CO - but can that mitigation be enough to consider the effect eliminated? It depends on what is happening on each thread in terms of the sample sizes being generated and the distribution of the CO event(s). Perhaps to help users who may be unaware of the problem, you'd consider adding to the docs Gil's succinct and practical description of what is wrong with just throwing more threads at the problem (notice his premise, in the first quote, is that you've accepted the solution is a particular deterministic arrival process):
Yes, systems have many components, some sequential, some parallel.
That clarification would help, but from experience not all testers have internalized Gil's insights, or similar ones, and they don't understand that "constant RPS" means stripping out the randomness of the arrivals process (especially the effect of queuing arrivals).
If that scenario makes business sense in your testing case, using Artillery is (should mostly be?) fine for your purposes.
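The fixed-interval idea described above can be sketched in a few lines. This is an illustrative toy (assumptions: a synchronous `send_request` callable stands in for a real HTTP client; real tools like wrk2 do this against a high-resolution clock). The key point is that latency is measured from the *intended* send time, so a stalled request cannot hide the queueing delay it imposes on its successors:

```python
import time

def fixed_interval_load(send_request, rate_hz, duration_s):
    """Send at deterministic, pre-scheduled times; measure from the intended
    send time so a slow predecessor cannot hide queueing delay (CO correction)."""
    interval = 1.0 / rate_hz
    start = time.monotonic()
    samples = []
    for i in range(int(duration_s * rate_hz)):
        intended = start + i * interval  # schedule fixed before the test runs
        delay = intended - time.monotonic()
        if delay > 0:
            time.sleep(delay)  # wait for the scheduled slot; never skip a slot
        send_request()
        # Latency relative to the intended start includes time spent queued
        # behind a stalled predecessor - exactly the samples CO would omit.
        samples.append(time.monotonic() - intended)
    return samples

# Demo: a service that is instant except for one 0.3 s stall on the 4th call.
calls = {"n": 0}
def fake_request():
    calls["n"] += 1
    if calls["n"] == 4:
        time.sleep(0.3)

samples = fixed_interval_load(fake_request, rate_hz=100, duration_s=0.1)
# The stall shows up in the stalled request *and* in every request queued
# behind it, rather than as a single outlier.
```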
What I struggled to find an answer for is: does Artillery allow me to guarantee those "clients" are one-shot clients?
@bbros-dev Thank you for the discussion! It's an interesting subject, and the thread will help us make our docs better!
That's not how Artillery works - new VUs continue arriving and sending requests regardless of whether other VUs are waiting for some request to complete or not. This characterisation of workload generation describes a closed-loop system - Artillery uses a hybrid model (as described in the paper I linked above).
No, because that does not make sense in the context of systems that Artillery is designed to test. Take Github for example:
Say you want to see the effects of 1,000 users arriving every minute for an hour - that's where you reach for Artillery. Constant RPS makes no sense in this scenario, as there's implicit back-pressure from the server (local to each user) because requests are dependent on each other. It's impossible to impose a constant rate of requests in a scenario like this (but you can set a constant rate of arrivals). This describes every real-world system which supports anything resembling transactions (i.e. do thing A, then do thing B depending on the result of A). Does this mean that latency outliers due to a temporary server stall will get hidden? No - because whilst VU A is waiting for a response, a number of other VUs will arrive, send their initial requests, and record outsized response times. Artillery also outputs latency metrics at a configurable interval (10s by default) rather than a single aggregate report at the very end, so stalls like that are visible immediately and don't get smoothed over by smaller measurements from the rest of the test run. There is of course another class of systems, where all requests are idempotent and commutative.
I'd argue the opposite - the load generation model is everything, in general and in the context of CO. To restate my points in a slightly different way: any closed-loop load generator will suffer from CO; a fully open-loop load generator will not. An example of open-loop load generation is sending requests at a fixed rate - Gil Tene pioneered that approach. The problem you run into with fully open-loop load generators is that the type of systems you can test with them is extremely narrow. Your system must satisfy two requirements to be testable at constant RPS:
With that in mind, what do you do when you want to test a system which does not satisfy those requirements? Well, you end up with something like this:
This is Artillery's hybrid model, which also maps exactly onto how such a system would be used in the real world.
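The hybrid model described above - open-loop arrivals, closed-loop scenarios - can be sketched in a few lines of asyncio. This is my own toy illustration, not Artillery's implementation; the `scenario` coroutine and its step names are hypothetical stand-ins for a transactional flow:

```python
import asyncio

async def scenario(user_id, log):
    """Closed loop *per user*: each step waits on the previous response,
    because step N depends on the result of step N-1."""
    for step in ("login", "browse", "checkout"):
        t0 = asyncio.get_running_loop().time()
        await asyncio.sleep(0.001)  # stand-in for a dependent request
        log.append((user_id, step, asyncio.get_running_loop().time() - t0))

async def run(arrival_rate, duration_s):
    """Open loop *across users*: new users arrive on a fixed schedule
    regardless of whether earlier users are still waiting on responses."""
    log, tasks = [], []
    for i in range(int(arrival_rate * duration_s)):
        tasks.append(asyncio.create_task(scenario(i, log)))
        await asyncio.sleep(1 / arrival_rate)  # next arrival is unconditional
    await asyncio.gather(*tasks)
    return log

# 50 arrivals/s for 0.2 s -> 10 users, each running a 3-step scenario.
log = asyncio.run(run(arrival_rate=50, duration_s=0.2))
```

If one user's `checkout` stalls, its own later steps are delayed (closed loop), but the arrival schedule in `run` is untouched, so other users still show up and record the stall independently.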
@hassy thanks for the clarifications and additional insights. I agree much of this can be distilled into something generally useful for a novice tester/test-team. Perhaps it is worth framing the guidance in two categories:
Like most situations/configurations, one is faced with trade-offs here, and the difficulty is knowing what the impact of those trade-offs is. Something like this illustration would let users assess how they want to set up and configure their testing infrastructure. In these scenarios, absent CO, you have a continuous curve starting from when the system was frozen. For CO-vulnerable configurations or tools you get anything but a smooth curve - usually disjoint segments of lines and curves, and in worse cases some sort of jump behaviour. Of course, the premise of this thread is extremely narrow: someone wants to reliably recover the latency distribution of some system(s). There are many use cases where that is not the primary interest - I'm not suggesting those be treated as less important - and there are situations where it makes sense to trade off some CO for other data/insights you get.
I think the term CO is not particularly useful if no model of the traffic and of the interaction between the server and the client is given. CO is really the gap between the traffic/interaction model you use to measure and what happens in the real world. So it's not quite fair to say a test tool will suffer from CO, as if it were inherently flawed and omitting things. Take a second to think about it: what has really been omitted by a fully closed-loop load generator? The answer is, surprisingly, nothing. The tool faithfully records the latency it observed, without modification or omission. But why do people feel they're vulnerable to the so-called CO, when no data is omitted? Because the fully closed-loop model is far away from what happens in the real world, where individual visitors don't wait or block for other visitors. So when considering a load generator, you need to think about both the model used to test and the model of the real world. There could still be a niche for a fully closed-loop load generator, where the real-world model has back pressure.
This is a brilliant model that I think applies to a majority of real-world systems, and I don't think it has the CO issue in most cases. Note that there's a closed-loop generator for each visitor. But it's fine to have back pressure for them if the interaction model is that they need the response to decide what to do next, rather than visitors sending requests at a fixed rate. But again, it really depends on what you expect of the real world. In conclusion, it's better to ask "Does the load model that Artillery uses fit my system?" instead of "Does Artillery prevent coordinated omission?" - and both answers are "Probably, depending on what you expect of the real world".
Coordinated omission is a term coined by Gil Tene to describe the phenomenon when the measuring system inadvertently coordinates with the system being measured in a way that avoids measuring outliers.
One example of how this can happen would be a load tester that waits to send a request until the previous one has completed. If the load tester is testing 10 req/s and a request normally takes 50ms, each request will return before the next one is due to be sent. However, if the whole system occasionally pauses for 5 seconds, the load tester would not send any requests during this 5-second period. The load test would record a single bad outlier that took 5 seconds.
If the load tester had kept firing requests consistently, it would have made 50 requests during the 5-second pause; these requests were omitted. If these requests had been made during the pause, the latency percentiles would look very different and more accurately capture the system's behaviour under load.
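A rough sanity check on the scenario above (my own toy numbers: a 60 s test at 10 req/s, 50 ms normal latency, one 5 s pause, and a simple nearest-rank percentile) shows how much the tail shifts once the omitted requests are filled in:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (seconds)."""
    s = sorted(samples)
    return s[max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))]

# What a coordinating tester records: it stopped sending during the pause,
# so the stall appears as one 5 s outlier among ~550 normal 50 ms samples.
coordinated = [0.05] * 550 + [5.0]

# With the ~50 omitted requests filled in: each would have waited until the
# stall ended, so their latencies ramp down from 5.0 s toward 0.1 s.
corrected = [0.05] * 550 + [5.0 - i * 0.1 for i in range(50)]

# The coordinated p99.9 still looks like a healthy 50 ms, while the
# corrected p99.9 is close to the full 5 s stall.
print(percentile(coordinated, 99.9), percentile(corrected, 99.9))
```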
There are a number of good videos and blog posts which discuss this more. I was evaluating artillery and wanted to see if it accounted for coordinated omission, but couldn't see any discussion of it in issues or code. Is this something that artillery tries to prevent?