Add Traffic Ops read-while-writer and small cache #6017
rob05c wants to merge 4 commits into apache:master
Conversation
rawlinp left a comment
With any kind of timed cache, isn't there a chance that a cache will end up using stale data to generate its config and clear its update status?
docs/source/admin/traffic_ops.rst
How do these settings work together? If the cache is disabled, does TO still do RWR?
They're independent, yes. Though RWR acts like a cache for most integrity purposes, so users should be aware of that.
Ok, so new features like this should generally be disabled by default (if e.g. the cdn.conf is left untouched after upgrading), right?
I'm pretty sure we'll be able to have users on the same tenant with different capabilities.
I asked @mitchell852 , our resident tenant/roles/permissions expert, when I wrote this. He said Roles&Permissions will be per-Tenant. I just forgot to remove the TODO comment.
Which I would vote they are. Not just for this feature but for a lot of components and endpoints: if we make the set of Permissions per-Tenant, and require users to make a new sub-tenant if they want to give someone in their organization fewer permissions, that makes a lot of things more scalable and avoids M*N data.
Maybe I'm a little bit confused. What I envision is that two users in the same tenant can have different roles (and hence, different capabilities). Even if roles (and their associated capabilities) are per-tenant, we should be able to have multiple different roles within the same tenant.
traffic_ops/traffic_ops_golang/routing/middleware/readwhilewriter.go
In theory. In practice, with a 1s cache, it's extremely unlikely. The operator would have to make a change, then queue updates, and then …

We could handle that race by adding the cache time to the Update Status endpoint, and then adding code to …

That said - we need to modify the …
I disagree. I think the race condition is extremely likely to occur, even with a 1 second cache, and especially once …
Different TOs could all have different …

I appreciate the intent of making TO more scalable by adding things like timed caches and RWR, but I'm not really sure it's worth the risk of sacrificing our data consistency. It seems far more safe and scalable to implement something like Cache Config Snapshots, where we can cache that data in-memory with 100% data consistency, without the possibility of race conditions. Rather than complicate the entire API by adding a new layer that every endpoint has to go through, we'd have a single endpoint with an in-memory cache (that isn't timed and is easily invalidated whenever a new snapshot is taken). It just seems easier to make a single endpoint really scalable (special-built for …

We can also improve the endpoints that … We also have the IMS changes you recently added to …

We're also running TODB on relatively puny VMs, so I'm curious how much performance we'll actually get out of baremetal.

All that is to say: I think there are other avenues we should pursue and evaluate before we start introducing data consistency risk.
That problem is out of scope, because it's the job of external tooling to ensure that doesn't happen. If it's something that should be in scope, we could just leverage our existing profile/parameter system to take it out of config files entirely and therefore always be consistent.
Both approaches have merit and precedent. In this case, though, the risk vs. reward seems to me very much in favor of moving ahead with it. Another important piece here is that it's done, not on a roadmap or blueprint. Whenever Cache Config Snapshots comes about, there's nothing that says this extra caching layer has to stay (although even CCS would benefit too). The other improvements that have been suggested are valid too, and some are even in progress.

When it comes to these TODB/API performance issues, I haven't seen any reproducible case that can be measured or evaluated reliably. This is why a multifaceted approach isn't a bad thing. It's acknowledged that cache configs are some of the heaviest hitters, and at large scale, how your postgres instance is configured, on what hardware, and where in the network all play a big part in the overall success rates. This is a PR which significantly helps alleviate that scalability concern without requiring any postgres or infrastructure changes.
But these things are configurable. If 1s is too long for some users, perhaps it's just right for others. TC doesn't have a strict requirement on polling frequency; different users will have different needs. Someone with a slow TO box will really appreciate having a cache and can adjust queue times appropriately.

The race here is not "how long after queuing does a cache poll", though; it's "how long after data modification are updates queued". If you change data, wait 1 second, then queue updates, a cache checking for updates after 1ms will still get the updated data. This is practically only a concern in scripted changes, but scripts can work in concert with the cache duration to ensure no race. If an operator turns on the cache, they might need to add a slight delay between changing data and queuing updates. Alternately, they may need to configure caches to wait a short time after observing an update flag before requesting updates. (As a practical matter, operators currently have challenges with automatic queues because of a huge number of other potential races with data changing on the CDN, but those aren't implicated here.) The operator has all the control they need to optimize the system for their situation, so it's a clear win. Maybe some additional docs on how to use the feature and its implications can make a difference, though.

If we wanted to bust the race, though, all we'd have to do is invalidate the cache when updates are queued. That's probably feasible if we want to avoid operators needing to dance around the cache. @rob05c, does that seem feasible?

This seems like a clear performance win. There are operational concerns, but it seems to have appropriate controls that allow people to make the right decision for their hardware and operational situation. It supports no-cache requests that allow scripts to ensure no caching.
Even a moderately large CDN might see multiple orders of magnitude of performance improvement with very conservative values like just a few milliseconds.
When we change /update_status to be a timestamp (which is on the short roadmap, to solve a much bigger race that occurs frequently today), t3c can request the data it needs with a …

By sending max-age, the server's cache_ms is irrelevant: no TO will serve its cache if it's older than the queue time. Does that address your concern?
We're talking about 100x more requests/second. It's the difference between having 2000 caches get their config every 15 minutes -- which frequently fails, so config deployment actually takes an hour or more -- and being able to get config in under 1 minute. That's the goal here: making TO able to handle fast cache config. I am sure it's worth being able to deploy cache config in under a minute instead of an hour. But with the above, it's not sacrificing any consistency.
It's been an ATC goal for years to remove logic from the Traffic Ops monolith, in order to make ATC more scalable and more horizontal, and to make it safer and possible to canary-deploy changes. Adding more logic to Traffic Ops instead of removing it is fundamentally opposed to that goal. Not to mention the data consistency problems with snapshot blobs: there are far more consistency problems, problems which are often unfixable, with putting denormalized blobs in the database. Problems such as client and server compatibility. Problems which are either unfixable, or very slow, dangerous, and bug-prone to "fix" by modifying the blob before serving it.
Middleware is a fundamental part of most services: things like gzip, auth, adding headers. Traffic Ops has a pretty solid system to easily do these things, which this PR uses. That's pretty standard, and doesn't really complicate things.
IMS lets us avoid unnecessarily transferring data when changes haven't occurred. But it actually amplifies the scalability problem when changes have occurred. Because when a change occurs, thousands of caches will all be requesting at once. Likewise, only having Read-While-Writer without a small cache would put constant load on the DB, every time a request finishes another will start, constantly saturating the DB. And only having a small cache without RWR would result in thousands of caches requesting at once, especially for a large slow endpoint, creating thousands of backend DB requests until the endpoint can be served and the cache established. In order to significantly reduce cache config time, we really need all three: IMS, RWR, and a very small cache.
I'd also note, all of these concerns - theoretical races, consistency issues, etc. - are all symptoms of the Snapshot-Queue Automation problem (a.k.a. "Kill the Chicken"). Adding more snapshot blobs exacerbates the problem rather than fixing it. Once we have Server Snapshots which are timestamps, and can request the real non-blob data up to that timestamp, these problems cease to exist. Caches will check the server snapshot time, see that it's changed, and request the real data up to that new time.

We need to fix the Snapshot-Queue Automation problem. It's the source of this and countless other operational problems. At this point, we've spent more work adding workarounds to the problem - Delivery Service Requests, Chicken Locks, more denormalized blobs - than it would've taken to just fix the problem.
It's actually all four. It's how long after data modifications are made, an operator queues, t3c runs, t3c requests /update_status, t3c determines that it needs to update, and t3c requests the data endpoint. Unless all of that happens in less time than the cache (which defaults to 1s, but can be made lower), the update will get the most recent changes. Which is why it's extremely unlikely. But regardless, there's a simple solution, as above: queue time and max-age.
Correct.
Invalidation on the server side doesn't perfectly solve the problem, because there can be multiple TOs. But max-age does.
+1
@jhg03a I don't think you asked any questions, but …
+1
Right, but "t3c runs, and t3c requests /update_status, and t3c determines that it needs to update, and t3c requests the data endpoint" is expected to be very fast, almost all at once. And given that most deployments have caches polling on an interval, it's very likely that even on fairly short timeframes, one of those will poll immediately after a queue. All of that is genuinely very likely, as @rawlinp said. The only part that is very unlikely is changes within a cache-time of a queue. That's unlikely... unless you are scripting; then it's nearly certain and you have a problem.

Nevertheless, you've identified the correct way to fix the problem. As the PR stands, it has all the appropriate controls for an operator to handle the potential races, but it does have it on by default. It's fine to provide the feature, but thinking about how frustrated someone would be to be debugging this issue after upgrading, I think it should either be off by default in this PR, or this PR should be updated with the information required for t3c to behave correctly, even in the corner cases. The default case shouldn't contain surprises, I think.
The amount of requests going to the DB would be a factor of how many TO instances there are, and if we're running roughly 10 (which is a lot IMO), TODB should be able to handle 10 concurrent requests for the same thing without tipping over. I know …

Queueing updates and t3c processing those changes is just one of the races. It sounds like we can solve that problem, but there is still the problem that if I make a change, I might not be able to read that change back for however long the cache time is. So now anything that makes changes needs to understand that and work around it. I think that is actually why the TO API tests are failing -- they can't read their writes back immediately, which fundamentally breaks the tests unless a 1-second sleep is inserted after all writes (which would be absurd). We definitely want this feature to be tested, so I wouldn't want us to just turn it off for the TO API tests.
That's kind of how HTTP works in the general case. If you want write-read consistency, you can get that by setting …

I don't think I see correct support for …
We have around 3000 caches. The numbers I'm seeing suggest that, depending on the endpoint and the hardware, TO has trouble with as few as 10 requests/second or even less. Our production backs this up: our production caches frequently get failures from TO, for extended periods, enough that the retry completely fails and the …
Both the theory I've explained and the practice I've observed suggest RWR alone helps, yes, but all three are necessary. IMS, RWR, and a small cache all work together to solve distinct problems. But moreover, RWR is a cache; keeping it but getting rid of the small-cache doesn't make clients that aren't following HTTP requirements okay. In fact, with a 1s cache, a great many TO requests take longer than that, so the RWR cache is frequently longer than the small-cache, not shorter.
RWR is a cache, for consistency purposes. Clients always need to do proper HTTP. This is part of the HTTP standard; clients should be doing it today. Any client doing a POST and immediate GET without a …

Moreover, ATC is a CDN. Our developers and operators understand HTTP Caching. It's what we do.
A script needing to do that quickly can send a …
Nobody is suggesting we turn off the tests, or add sleeps. The tests which immediately POST and GET can send a …
I don't agree. Defaults should be sane. This is something that 99% of ATC operators will want enabled, it increases scalability 10-100x, and the concerns are all solved by standard HTTP Caching mechanisms, which are well-known and well-understood; HTTP Caching has been around for decades. Having a default that 1% of people might want to disable, and which decreases scalability 100x, is not sane.

I really, really don't like it, but this is something I'd be willing to compromise on. If @rawlinp and @alficles would be willing to not block this, I can live with a default I don't consider sane. Would that be acceptable? If it defaults to disabled? And I'll fix the tests of course, I just haven't had time; and the caching is and will be tested. Then it won't be enabled for ATC users unless they intentionally enable it.

This is something Comcast needs. I'm willing to compromise, but I'd really like to come to a consensus here, and find a way we can get what we need. It's difficult to overstate the operational value of being able to deploy cache config quickly, having the TP "clocks" clear in a minute instead of an hour. It's a massive operational win, for the ops cost, human cost, safety, and the numerous other issues with how long it takes today and how often it fails.
You're right. I am willing to write the code to add that, if we have consensus on the PR.
That is a fairly large paradigm shift for clients which have historically expected write-read consistency. This is a (mostly) RESTful, HTTP-JSON API, not an HTTP video origin which mostly just provides content without accepting any writes from clients. After updates, TP does an immediate …
I would argue that most RESTful APIs do not expect clients to understand and implement HTTP caching 100% to spec. Just because something is using HTTP doesn't mean it has to also understand and fully implement HTTP caching. That would be quite a large requirement for something that simply needs to get data from an API.
Ah, I understand. Forget what I said about just having RWR without a timed cache then. The data consistency issue applies to both.
As I mention above, that is a massively breaking change to all TO clients.
Again, I don't think it's a requirement for all RESTful/HTTP-based APIs and corresponding clients to fully implement HTTP caching just because they use HTTP. That doesn't make clients "wrong".
99%? Only the largest ATC CDNs probably even have issues with DB load caused by …
It doesn't seem like a large requirement. HTTP only has 7 client Cache-Control directives, and clients only need to send 1 to not be broken, either max-age or no-cache, and only under the very specific scenario when they need to get data they just sent.
That does make clients wrong, and broken. They are ignoring and violating the spec. RFC 7230 §2.4: …

RFC 7231 §4.3.1: …
Therefore, any client which fails to behave correctly after not sending a Cache-Control and receiving a cached response is broken, and any client which requires that an HTTP server MUST NOT send a cached response without a Cache-Control indicating so is violating the HTTP spec.

I strongly oppose any ATC clients and servers failing to implement parts of HTTP, especially HTTP Caching; a partially-implemented protocol is fundamentally broken and extremely bug-prone. HTTP and HTTP Caching are extremely thoroughly thought out by extremely smart people, and the protocol handles very nearly every conceivable scenario. Moreover, as above, it's not particularly large or difficult to implement. Failing to implement parts as required is dangerous and bug-prone. Even aside from this feature, we need to fix any broken clients.

If we happened to not use JSON arrays, and some clients had JSON libraries that didn't implement arrays, and we added an endpoint that uses JSON arrays, surely nobody would say the client was reasonable for not implementing that part of the JSON protocol. In general, you can't just ignore parts of any protocol.

In this particular case, the only thing necessary for clients to not be broken is, in the very specific case that they sent data and need to retrieve that data, sending a …
At the HTTP layer, our clients are handling the protocol just fine, and they would handle it just fine (at the HTTP layer) if they were to receive cached responses. It's the application/business-logic layer that matters. Plain and simple: everything that uses the TO API has depended on write-read consistency since its inception. When you take that away, everything in the application/business-logic layer that depended on that consistency is now broken. In order to fix the brokenness, we'd be requiring every single TO API client to implement cache-control headers.

This is no different than breaking clients by adding new required fields to an existing API. Why is it ok to break clients through data inconsistency but not ok to break clients through adding new required fields to an existing API? It's not ok to force clients to just start sending a new required field, but it's ok to force clients to send cache-control headers now?
So do we need to update the TO clients to send the cache control headers? Would that alone solve the issue since those are the officially supported clients? If we are going to make a breaking change, and I could see Rawlin's argument that it is, we should do it before 6.0 is cut. |
That's a good point @dneuman64. If we decide to give up data consistency, it should really only affect TO API v4+. Otherwise, once 6.0 is cut, it should really only affect TO API v5+, and we should really try to have some kind of stability instead of breaking clients in every ATC release. However, if this behavior were opt-in on the client-side rather than opt-out, we wouldn't have to change our clients at all, and this wouldn't be a breaking change. I know that is counter to the HTTP spec, but this is an API that has been around for quite a long time. When APIs have a certain behavior for a very long time, the ecosystem begins to depend on that behavior (and it definitely does).
I think a major concern is that Traffic Portal definitely does do a PUT and a GET back to back within a second. I understand that without adding a Cache-Control header to the HTTP handler, we could be getting stale data back, and that could definitely cause confusion. Ultimately this is functionality that we need, and that we need to get into the 6.0 release. Hopefully we can come to an agreement (even if it is not ideal) and get this across the finish line.
That sounds agreeable to me.
I checked Traffic Portal, and it looks like it's actually already setting these …

I believe those were added because Chrome started caching responses after the …

We should also be sure that this has an entry in the changelog that states that TO API v4 responses will be cached by TO (if enabled), and that clients should set …
I'm closing this because @SolidWallOfCode convinced me doing this with ATS (or any proxy really) is a better solution. |
Adds read-while-writer and a small cache to Traffic Ops, which significantly increase scalability.
In my testing, the requests/second Traffic Ops was capable of serving increased 10–100x, depending on the size of the endpoint and request volume.
Includes feature flags to disable both.
Includes tests.
Includes changelog.
Includes docs.
Which Traffic Control components are affected by this PR?
What is the best way to verify this PR?
Run tests. Make requests in rapid succession, observe that returned values are correct, and observe the Age header in cached responses. Make requests in rapid succession with a Cache-Control: no-cache header, and verify there is no Age header, indicating responses are all uncached.

If this is a bug fix, what versions of Traffic Control are affected?
Not a bug fix.
The following criteria are ALL met by this PR
Additional Information