This repository was archived by the owner on Nov 24, 2025. It is now read-only.

Add Traffic Ops read-while-writer and small cache#6017

Closed
rob05c wants to merge 4 commits into apache:master from rob05c:add-to-cacherwr

Conversation

@rob05c
Member

@rob05c rob05c commented Jul 13, 2021

Adds read-while-writer and a small cache to Traffic Ops, which significantly increase scalability.

In my testing, the requests/second Traffic Ops was capable of serving increased 10–100x, depending on the size of the endpoint and request volume.

Includes feature flags to disable both.

Includes tests.
Includes changelog.
Includes docs.

  • This PR is not related to any other Issue

Which Traffic Control components are affected by this PR?

  • Traffic Ops

What is the best way to verify this PR?

Run tests. Make requests in rapid succession, observe that the returned values are correct, and observe the Age header in cached responses. Then make requests in rapid succession with a Cache-Control: no-cache header and verify that no Age header is returned, indicating the responses are all uncached.

If this is a bug fix, what versions of Traffic Control are affected?

Not a bug fix.

The following criteria are ALL met by this PR

  • This PR includes tests OR I have explained why tests are unnecessary
  • This PR includes documentation OR I have explained why documentation is unnecessary
  • This PR includes an update to CHANGELOG.md OR such an update is not necessary
  • This PR includes any and all required license headers
  • This PR DOES NOT FIX A SERIOUS SECURITY VULNERABILITY (see the Apache Software Foundation's security guidelines for details)

Additional Information

@rob05c rob05c added new feature A new feature, capability or behavior Traffic Ops related to Traffic Ops labels Jul 13, 2021
@rob05c rob05c force-pushed the add-to-cacherwr branch from aa51aef to d2835b7 on July 13, 2021 04:35
Contributor

@rawlinp rawlinp left a comment


With any kind of timed cache, isn't there a chance that a cache will end up using stale data to generate its config and clear its update status?

Contributor

How do these settings work together? If the cache is disabled, does TO still do RWR?

Member Author

They're independent, yes. Though note that RWR acts like a cache for most integrity purposes, and users should be aware of that.

Contributor

Ok, so new features like this should generally be disabled by default (if e.g. the cdn.conf is left untouched after upgrading), right?

Contributor

I'm pretty sure we'll be able to have users on the same tenant with different capabilities.

Member Author

@rob05c rob05c Jul 13, 2021

I asked @mitchell852, our resident tenant/roles/permissions expert, when I wrote this. He said Roles & Permissions will be per-Tenant. I just forgot to remove the TODO comment.

For what it's worth, I would vote that they are. Not just for this feature but for a lot of components and endpoints: if we make the set of Permissions per-Tenant and require users to create a new sub-tenant when they want to give someone in their organization fewer permissions, that makes a lot of things more scalable and avoids M*N data.

Member Author

Removed the comment.

Contributor

Maybe I'm a little bit confused. What I envision is that two users in the same tenant can have different roles (and hence, different capabilities). Even if roles (and their associated capabilities) are per-tenant, we should be able to have multiple different roles within the same tenant.

@rob05c rob05c force-pushed the add-to-cacherwr branch from d2835b7 to 4547b6e on July 13, 2021 19:21
@rob05c
Member Author

rob05c commented Jul 13, 2021

With any kind of timed cache, isn't there a chance that a cache will end up using stale data to generate its config and clear its update status?

In theory. In practice, with a 1s cache, it's extremely unlikely. The operator would have to make a change, then queue updates, and then t3c would have to get the update status, determine that it needs to update, and start fetching data, all in under a second.

We could handle that race by adding the cache time to the Update Status endpoint, and then adding code to t3c to read that, and wait that long before requesting data. But again, I think the race is extremely unlikely. It's basically impossible with a human clicking "queue," the only way I could see it even being possible would be if an automated script was making the change and queueing, and t3c was also running frequently enough to statistically happen in the same second (which isn't possible today, because TO can't handle those requests/second; but maybe it will be possible for people to run t3c that often with this PR).

That said, we need to modify the /update_status endpoint soon anyway, to fix a different and very common race: we queue updates, get data, someone else queues while we're getting data and generating config, and then we un-set the update flag, losing their queue. We need to add the time queued to that endpoint, so t3c can determine whether another queue happened while config was being applied, and if so not un-set the update or reval flag. I wouldn't object to adding the cache time to the endpoint at the same time, which should be trivially more dev time.

@rawlinp
Contributor

rawlinp commented Jul 13, 2021

In practice, with a 1s cache, it's extremely unlikely

I disagree. I think the race condition is extremely likely to occur, even with a 1 second cache, and especially once t3c is polling TO more frequently (which is the goal).

We could handle that race by adding the cache time to the Update Status endpoint

Different TOs could all have different cache_ms settings, so this doesn't really solve the race. They probably shouldn't have different cache settings in practice, but they can, which means we can't really depend on per-TO-instance settings.

I appreciate the intent of making TO more scalable by adding things like timed caches and RWR, but I'm not really sure it's worth the risk of sacrificing our data consistency. It seems far more safe and scalable to implement something like Cache Config Snapshots, where we can cache that data in-memory with 100% data consistency without the possibility of race conditions. Rather than complicate the entire API by adding a new layer that every endpoint has to go through, we'd have a single endpoint with an in-memory cache (that isn't timed and is easily invalidated whenever a new snapshot is taken). It just seems easier to make a single endpoint really scalable (special-built for t3c) instead of trying to make the entire API scalable by introducing data consistency risk.

We can also improve the endpoints that t3c requests which seem to make the biggest impact on TODB load (e.g. deliveryserviceservers, jobs, etc). I'm pretty sure t3c requests every invalidation job that has ever been submitted every single time it processes a revalidation, so we can definitely improve that (#5674), and t3c doesn't need the entirety of the deliveryserviceservers data, although I'm not sure it's necessary to improve that since those are going away in favor of topologies. However, if the alternative is data consistency risk, it might make sense to at least filter that API by CDN.

We also have the IMS changes you recently added to t3c, and I'm sure that will help improve performance a bit.

We're also running TODB on relatively puny VMs, so I'm curious how much performance we'll actually get out of baremetal.

All that is to say: I think there are other avenues we should pursue and evaluate before we start introducing data consistency risk.

@jhg03a
Contributor

jhg03a commented Jul 13, 2021

Different TOs could all have different cache_ms settings, so this doesn't really solve the race. They probably shouldn't have different cache settings in practice, but they can, which means we can't really depend on per-TO-instance settings.

That problem is out of scope, because it's the job of external tooling to ensure it doesn't happen. If it should be in scope, we could leverage our existing profile/parameter system to take the setting out of config files entirely, and therefore keep it always consistent.

I appreciate the intent of making TO more scalable by adding things like timed caches and RWR, but I'm not really sure it's worth the risk of sacrificing our data consistency. It seems far more safe and scalable to implement something like Cache Config Snapshots, where we can cache that data in-memory with 100% data consistency without the possibility of race conditions. Rather than complicate the entire API by adding a new layer that every endpoint has to go through, we'd have a single endpoint with an in-memory cache (that isn't timed and is easily invalidated whenever a new snapshot is taken). It just seems easier to make a single endpoint really scalable (special-built for t3c) instead of trying to make the entire API scalable by introducing data consistency risk.

Both approaches have merit and precedent. In this case, though, the risk vs. reward seems to me very much in favor of moving ahead with it. Another important piece here is that it's done, not on a roadmap or blueprint. Whenever Cache Config Snapshots comes about, there's nothing that says this extra caching layer has to stay (though even CCS would benefit from it too). The other improvements that have been suggested are valid as well, and some are even in progress. When it comes to these TODB/API performance issues, I haven't seen any reproducible case that can be measured or evaluated reliably, which is why a multifaceted approach isn't a bad thing. It's acknowledged that cache configs are some of the heaviest hitters, and at large scale, how your Postgres instance is configured, on what hardware, and where in the network all play a big part in overall success rates. This PR significantly helps alleviate that scalability concern without requiring any Postgres or infrastructure changes.

@alficles
Contributor

the race condition is extremely likely to occur, even with a 1-second cache

But these things are configurable. If 1s is too long for some users, perhaps it's just right for others. TC doesn't have a strict requirement on polling frequency, different users will have different needs. Someone with a slow TO box will really appreciate having a cache and can adjust queue times appropriately.

The race here is not "how long after queuing does a cache poll", though, it's "how long after data modification are updates queued". If you change data, wait 1 second, then queue updates, a cache checking for updates after 1ms will still get the updated data.

This is practically only a concern in scripted changes, but scripts can work in concert with the cache duration to ensure no race. If an operator turns on the cache, they might need to add a slight delay between changing data and queuing updates. Alternately, they may need to configure caches to wait a short time after observing an update flag and requesting updates. (As a practical matter, operators currently have challenges with automatic queues because of a huge number of other potential races with data changing on the CDN, but those aren't implicated here.) The operator has all the control they need to optimize the system for their situation, so it's a clear win. Maybe some additional docs on how to use the feature and its implications can make a difference, though.

If we wanted to bust the race, though, all we'd have to do is invalidate the cache when updates are queued. That's probably feasible if we want to avoid operators needing to dance around the cache. @rob05c , does that seem feasible?

This seems like a clear performance win. There are operational concerns, but it seems to have appropriate controls that allow people to make the right decision for their hardware and operational situation. It supports no-cache requests that allow scripts to ensure no caching. Even a moderately large CDN might see multiple orders of magnitude of performance improvement with very conservative values like just a few milliseconds.

@rob05c
Member Author

rob05c commented Jul 14, 2021

In practice, with a 1s cache, it's extremely unlikely

I disagree. I think the race condition is extremely likely to occur

We could handle that race by adding the cache time to the Update Status endpoint

Different TOs could all have different cache_ms settings, so this doesn't really solve the race.

When we change /update_status to be a timestamp (which is on the short roadmap to solve a much bigger race that occurs frequently today), t3c can request the data it needs with a Cache-Control: max-age computed from the queue timestamp.

By sending max-age, the server's cache_ms is irrelevant, no TO will serve its cache if it's older than the queue time.

Does that address your concern?

I appreciate the intent of making TO more scalable by adding things like timed caches and RWR, but I'm not really sure it's worth the risk of sacrificing our data consistency.

We're talking about 100x more requests/second. It's the difference between being able to have 2000 caches get their config every 15 minutes -- which frequently fails, so config deployment actually takes an hour or more -- and being able to get config in under 1 minute. That's the goal here: making TO able to handle fast cache config.

I am sure it's worth being able to deploy cache config in under a minute instead of an hour. But with the above, it's not sacrificing any consistency.

It seems far more safe and scalable to implement something like Cache Config Snapshots

It's been an ATC goal for years to remove logic from the Traffic Ops monolith, in order to make ATC more scalable, more horizontal, and safer and possible to canary-deploy changes. Adding more logic to Traffic Ops instead of removing it is fundamentally opposed to that goal.

Not to mention the data consistency problems with snapshot blobs. Putting denormalized blobs in the database creates far more consistency problems, such as client and server compatibility, and those problems are either unfixable, or very slow, dangerous, and bug-prone to "fix" by modifying the blob before serving it.

Rather than complicate the entire API by adding a new layer that every endpoint has to go through

Middleware is a fundamental part of most services, things like gzip, auth, adding headers. Traffic Ops has a pretty solid system to easily do these things, which this PR uses. That's pretty standard, and doesn't really complicate things.

We also have the IMS changes you recently added to t3c, and I'm sure that will help improve performance a bit.

IMS lets us avoid unnecessarily transferring data when changes haven't occurred. But it actually amplifies the scalability problem when changes have occurred. Because when a change occurs, thousands of caches will all be requesting at once.

Likewise, only having Read-While-Writer without a small cache would put constant load on the DB, every time a request finishes another will start, constantly saturating the DB. And only having a small cache without RWR would result in thousands of caches requesting at once, especially for a large slow endpoint, creating thousands of backend DB requests until the endpoint can be served and the cache established.

In order to significantly reduce cache config time, we really need all three: IMS, RWR, and a very small cache.

something like Cache Config Snapshots

I'd also note, all of these concerns - theoretical races, consistency issues, etc - are all symptoms of the Snapshot-Queue Automation problem (a.k.a. "Kill the Chicken"). Adding more snapshot blobs exacerbates the problem, rather than fixing it.

Once we have Server Snapshots which are timestamps, and can request the real non-blob data up to that timestamp, these problems cease to exist. Caches will check the server snapshot time, see that it's changed, and request the real data up to that new time.

We need to fix the Snapshot-Queue Automation problem. It's the source of this and countless other operational problems.

At this point, we've spent more work adding workarounds to the problem - Delivery Service Requests, Chicken Locks, more denormalized blobs - than it would've taken to just fix the problem.

@rob05c
Member Author

rob05c commented Jul 14, 2021

@alficles

The race here is not "how long after queuing does a cache poll", though, it's "how long after data modification are updates queued".

It's actually all four. It's how long after data modifications are made, and an operator queues, and t3c runs, and t3c requests /update_status, and t3c determines that it needs to update, and t3c requests the data endpoint. Unless all of that happens in less time than the cache (which defaults to 1s, but can be made lower), the update will get the most recent changes. Which is why it's extremely unlikely.

But regardless, there's a simple solution, as above, queue time and max-age.

This is practically only a concern in scripted changes

Correct.

If we wanted to bust the race, though, all we'd have to do is invalidate the cache when updates are queued. That's probably feasible if we want to avoid operators needing to dance around the cache. @rob05c , does that seem feasible?

Invalidation on the server side doesn't perfectly solve the problem, because there can be multiple TO's. But max-age does.

This seems like a clear performance win. There are operational concerns, but it seems to have appropriate controls that allow people to make the right decision for their hardware and operational situation. It supports no-cache requests that allow scripts to ensure no caching. Even a moderately large CDN might see multiple orders of magnitude of performance improvement with very conservative values like just a few milliseconds.

+1

@rob05c
Member Author

rob05c commented Jul 14, 2021

@jhg03a I don't think you asked any questions, but

the risk vs reward seems to me at least very much in favor of moving ahead with it

it seems to have appropriate controls that allow people to make the right decision

a multifaceted approach isn't a bad thing

+1

@alficles
Contributor

@rob05c

It's actually all four. It's how long after data modifications are made, and an operator queues, and t3c runs, and t3c requests /update_status, and t3c determines that it needs to update, and t3c requests the data endpoint.

Right, but "t3c runs, and t3c requests /update_status, and t3c determines that it needs to update, and t3c requests the data endpoint" is expected to be very fast, almost all at once. And given that most deployments have caches polling on an interval, it's very likely that even on fairly short timeframes, one of those will poll immediately after a queue. All of that is genuinely very likely, as @rawlinp said. The only part that is very unlikely is changes within a cache-time of a queue. That's unlikely... unless you are scripting, then it's nearly certain and you have a problem.

Nevertheless, you've identified the correct way to fix the problem.

As the PR stands, it has all the appropriate controls for an operator to handle the potential races, but it is enabled by default. It's fine to provide the feature, but thinking about how frustrated someone would be debugging this issue after upgrading, I think it should either be off by default in this PR, or this PR should be updated with the information required for t3c to behave correctly even in the corner cases. The default case shouldn't contain surprises, I think.

@rawlinp
Contributor

rawlinp commented Jul 14, 2021

only having Read-While-Writer without a small cache would put constant load on the DB, every time a request finishes another will start, constantly saturating the DB

The number of requests going to the DB would be a factor of how many TO instances there are, and if we're running roughly 10 (which is a lot IMO), TODB should be able to handle 10 concurrent requests for the same thing without tipping over. I know t3c requests all the data at once, so let's say 10 concurrent requests per route. That still doesn't seem like something that should overload TODB. What I'm getting at is that RWR without a small cache probably does improve performance quite a bit, but it doesn't have the data consistency issues that a timed cache has. Maybe RWR alone is enough?

Queueing updates and t3c processing those changes is just one of the races. It sounds like we can solve that problem, but there is still the problem that if I make a change, I might not be able to read that change back for however long the cache time is. So now anything that makes changes needs to understand that and work around it. I think that is actually why the TO API tests are failing -- they can't read their writes back immediately, which fundamentally breaks the tests unless a 1 second sleep is inserted after all writes (which would be absurd). We definitely want this feature to be tested, so I wouldn't want us to just turn it off for the TO API tests.

@alficles
Contributor

if I make a change, I might not be able to read that change back for however long the cache time is

That's kind of how HTTP works in the general case. If you want write-read consistency, you can get that by setting Cache-Control: max-age to the difference between your write and your read time. And if you're reading immediately after writing, writers should be sending Cache-Control: no-cache. The fact that a naive cache-control happens to work at the moment is mostly an accident and is only true when there aren't caching proxies for those endpoints in front of TO.

I don't think I see correct support for max-age in the PR, though. @rob05c , that's probably an important feature. no-cache is fine for immediate reads, which might be the common case, but it'd be nice to support it completely.

@rob05c
Member Author

rob05c commented Jul 14, 2021

The amount of requests going to the DB would be a factor of how many TO instances there are, and if we're running roughly 10 (which is a lot IMO), TODB should be able to handle 10 concurrent requests for the same thing without tipping over. I know t3c requests all the data at once, so let's say 10 concurrent requests per route. That still doesn't seem like something that should overload TODB. What I'm getting at is that RWR without a small cache probably does improve performance quite a bit, but it doesn't have the data consistency issues that a timed cached has. Maybe RWR alone is enough?

We have around 3000 caches. t3c requests 19 endpoints currently. If we have 10 TO instances, that's 5700 concurrent requests to each, and 57,000 concurrent requests to the database.

The numbers I'm seeing suggest, depending on the endpoint, depending on the hardware, TO has trouble with as few as 10 requests/second or even less.

Our production backs this up. Our production caches frequently get failures from TO, for extended periods, enough that the retry completely fails and the t3c run fails. This is with a 15-minute t3c interval.

RWR without a small cache probably does improve performance quite a bit

Both the theory I've explained and the practice I've observed suggest RWR alone helps, yes, but all three are necessary. IMS, RWR, and a small cache all work together to solve distinct problems. But moreover, RWR is a cache; keeping it while getting rid of the small cache doesn't make clients that aren't following HTTP requirements okay. In fact, a great many TO requests take longer than 1s, so the RWR "cache" is frequently longer-lived than the small cache, not shorter.

but it doesn't have the data consistency issues that a timed cache has

RWR is a cache, for consistency purposes. Clients always need to do proper HTTP. This is part of the HTTP standard, clients should be doing it today. Any client doing a POST and immediate GET without a no-cache or max-age is just lucky it works today, they're ignoring large parts of the HTTP spec.

Moreover, ATC is a CDN. Our developers and operators understand HTTP Caching. It's what we do.

there is still the problem that if I make a change, I might not be able to read that change back for however long the cache time is.

A script needing to do that quickly can send a no-cache in its request, and further an Age is sent indicating if the response was cached. Again, this is part of HTTP, and clients are wrong to not be doing it today.

I think that is actually why the TO API tests are failing -- they can't read their writes back immediately, which fundamentally breaks the tests unless a 1 second sleep is inserted after all writes (which would be absurd). We definitely want this feature to be tested, so I wouldn't want us to just turn it off for the TO API tests.

Nobody is suggesting we turn off the tests, or add sleeps. The tests which immediately POST and GET can send a no-cache, and this PR includes tests for the caching itself.

but it does have it on by default.

The default case shouldn't contain surprises

new features like this should generally be disabled by default

I don't agree. Defaults should be sane. This is something that 99% of ATC operators will want enabled: it increases scalability 10-100x, and the concerns are all solved by standard HTTP Caching mechanisms, which are well-known and well-understood; HTTP Caching has been around for decades. A default that only 1% of operators might want, and which decreases scalability 100x, is not sane.

I really really don't like it, but, this is something I'd be willing to compromise on. If @rawlinp and @alficles would be willing to not block this, I can live with a default I don't consider sane.

Would that be acceptable? If it defaults to disabled? And I'll fix the tests of course, I just haven't had time; and the caching is and will be tested. Then it won't be enabled for ATC users unless they intentionally enable it.

This is something Comcast needs. I'm willing to compromise, but I'd really like to come to a consensus here, and find a way we can get what we need. It's difficult to overstate the operational value of being able to deploy cache config quickly, having the TP "clocks" clear in a minute instead of an hour. It's a massive operational win, for the ops cost, human cost, safety, and numerous other issues with how long it takes today and how often it fails.

@rob05c
Member Author

rob05c commented Jul 14, 2021

I don't think I see correct support for max-age in the PR, though. @rob05c , that's probably an important feature. no-cache is fine for immediate reads, which might be the common case, but it'd be nice to support it completely.

You're right. I am willing to write the code to add that, if we have consensus on the PR.

@rawlinp
Contributor

rawlinp commented Jul 14, 2021

And if you're reading immediately after writing, writers should be sending Cache-Control: no-cache

That is a fairly large paradigm shift for clients which have historically expected write-read consistency. This is a (mostly) RESTful, HTTP-JSON API, not an HTTP video origin which mostly just provides content without accepting any writes from clients. After updates, TP does an immediate GET of the changed resource, and it expects to read the changed resource. If our clients all have to set Cache-Control: no-cache from now on to get the old behavior, that is a massively breaking change. This kind of makes me think that the default behavior, if we even add this timed cache, should be to not use the cache for clients. Then, if cache-smart clients (e.g. t3c) want to use caching and understand the implications that come with it, then they can pass a header to allow cached responses.

Any client doing a POST and immediate GET without a no-cache or max-age is just lucky it works today, they're ignoring large parts of the HTTP spec.

I would argue that most RESTful APIs do not expect clients to understand and implement HTTP caching 100% to spec. Just because something is using HTTP doesn't mean it has to also understand and fully implement HTTP caching. That would be quite a large requirement for something that simply needs to get data from an API.

But moreover, RWR is a cache, keeping it but getting rid of the small-cache doesn't make clients that aren't following HTTP requirements okay. In fact, with a 1s cache, a great many TO requests take longer than that, so the RWR cache is frequently longer than the small-cache, not shorter.

Ah, I understand. Forget what I said about just having RWR without a timed cache then. The data consistency issue applies to both.

Nobody is suggesting we turn off the tests, or add sleeps. The tests which immediately POST and GET can send a no-cache, and this PR includes tests for the caching itself.

As I mention above, that is a massively breaking change to all TO clients.

A script needing to do that quickly can send a no-cache in its request, and further an Age is sent indicating if the response was cached. Again, this is part of HTTP, and clients are wrong to not be doing it today.

Again, I don't think it's a requirement for all RESTful/HTTP-based APIs and corresponding clients to fully implement HTTP caching just because they use HTTP. That doesn't make clients "wrong".

I don't agree. Defaults should be sane. This is something that 99% of ATC operators will want enabled

99%? Only the largest ATC CDNs probably even have issues with DB load caused by t3c. If DB load wasn't a problem, I would strongly suggest preferring the existing data consistency over enabling caching. Caching in general adds a lot of variability and problems into a system, and if it can be avoided, it probably should be IMO. Also, the "sane" default in most cases is whatever the pre-existing behavior was. Systems are built around those behaviors, and surprising everyone by changing the behavior they depend on is not good.

@rob05c
Member Author

rob05c commented Jul 14, 2021

I would argue that most RESTful APIs do not expect clients to understand and implement HTTP caching 100% to spec. Just because something is using HTTP doesn't mean it has to also understand and fully implement HTTP caching. That would be quite a large requirement for something that simply needs to get data from an API.

It doesn't seem like a large requirement. HTTP only has 7 client Cache-Control directives, and clients only need to send 1 to not be broken, either max-age or no-cache, and only under the very specific scenario when they need to get data they just sent.

I don't think it's a requirement for all RESTful/HTTP-based APIs and corresponding clients to fully implement HTTP caching just because they use HTTP. That doesn't make clients "wrong".

That does make clients wrong, and broken. They are ignoring and violating the spec. RFC 7230 §2.4:

Any client or server MAY employ a cache

RFC 7231§4.3.1

The response to a GET request is cacheable; a cache MAY use it to
satisfy subsequent GET and HEAD requests unless otherwise indicated
by the Cache-Control header field

Therefore, any client which fails to behave correctly after not sending a Cache-Control and receiving a cached response is broken, and any client which requires that an HTTP server MUST NOT send a cached response without a Cache-Control indicating so is violating the HTTP spec.

I strongly oppose any ATC clients and servers failing to implement parts of HTTP, especially HTTP caching: a partial protocol implementation is fundamentally broken and extremely bug-prone. HTTP and HTTP caching were extremely thoroughly thought out by extremely smart people, and the protocol handles very nearly every conceivable scenario. Moreover, as above, it's not particularly large or difficult to implement. Failing to implement required parts is dangerous and bug-prone. Even aside from this feature, we need to fix any broken clients.

If we happened not to use JSON arrays, some clients had JSON libraries that didn't implement arrays, and we then added an endpoint that used JSON arrays, surely nobody would say those clients were reasonable for not implementing that part of the JSON spec. In general, you can't just ignore parts of any protocol.

In this particular case, the only thing necessary for clients to not be broken is, in the very specific case that they sent data and need to retrieve that same data, sending a no-cache, or a max-age no greater than the time elapsed since the data was sent. That doesn't seem large, or surprising, or unreasonable.

@rawlinp
Contributor

rawlinp commented Jul 14, 2021

Therefore, any client which fails to behave correctly after not sending a Cache-Control and receiving a cached response is broken, and any client which requires that an HTTP server MUST NOT send a cached response without a Cache-Control indicating so is violating the HTTP spec.

At the HTTP layer, our clients are handling the protocol just fine, and they would handle it just fine (at the HTTP layer) if they were to receive cached responses. It's about breaking the application/business logic layer that matters. Plain and simple, everything that uses the TO API has depended on write-read consistency since its inception. When you take that away, everything in the application/business logic layer that depended on that consistency is now broken. In order to fix the brokenness, we'd be requiring every single TO API client to implement cache-control headers.

This is no different than breaking clients by adding new required fields to an existing API. Why is it ok to break clients through data inconsistency but not ok to break clients through adding new required fields to an existing API? It's not ok to force clients to just start sending the new required field, but it's ok to force clients to send cache-control headers now?

@dneuman64
Contributor

So do we need to update the TO clients to send the cache control headers? Would that alone solve the issue since those are the officially supported clients? If we are going to make a breaking change, and I could see Rawlin's argument that it is, we should do it before 6.0 is cut.

@rawlinp
Contributor

rawlinp commented Jul 14, 2021

That's a good point @dneuman64. If we decide to give up data consistency, it should really only affect TO API v4+. Otherwise, once 6.0 is cut, it should really only affect TO API v5+, and we should really try to have some kind of stability instead of breaking clients in every ATC release.

However, if this behavior were opt-in on the client-side rather than opt-out, we wouldn't have to change our clients at all, and this wouldn't be a breaking change. I know that is counter to the HTTP spec, but this is an API that has been around for quite a long time. When APIs have a certain behavior for a very long time, the ecosystem begins to depend on that behavior (and it definitely does).

t3c is responsible for almost the entire load on TODB, and it would be really nice to make it be cache-smart by having it send a header to TO that says "you can give me cached content", rather than requiring every other client to send cache-control: no-cache. Again, I realize that goes against the HTTP spec @rob05c, but that seems like the best way forward to maintain API stability. Perhaps we can also add a TO config option, e.g. disable_opt_in_for_caching, which we can set to true once all the TO API clients are cache-smart, at which point clients would have to opt-out of caching via cache-control: no-cache. Does that sound agreeable?

@mitchell852 mitchell852 added this to the 6.0.0 milestone Jul 15, 2021
@dneuman64
Contributor

I think a major concern is that Traffic Portal definitely does a PUT and GET back-to-back within a second. I understand that without adding a Cache-Control header (CCH) to the HTTP handler, we could get stale data back, and that could definitely cause confusion.
So, @rawlinp and @rob05c, how about this: First, we keep the agreed-upon idea of making this disabled by default, so a smaller CDN operator doesn't have to worry about it if they don't want to. Next, we make the cache/RWW apply only to API 4.x and higher. This makes it so that API 3.x and older is not "broken", and when clients using the API move to 4.x they can update their client to use CCH if they choose. This also keeps us sticking to the HTTP spec and not adding more code to handle a non-standard thing. Finally, we should probably update the Traffic Portal code that uses API 4.x to also send CCH headers. I'm not sure how much work that is or where TP is using 4.x, but it's something we should probably do. This can be in the same PR or a different one; it doesn't really matter to me.

Ultimately this is functionality that we need and that we need to get into the 6.0 release. Hopefully we can come to an agreement (even if it is not ideal) and get this across the finish line.

@rawlinp
Contributor

rawlinp commented Jul 15, 2021

That sounds agreeable to me.

Finally, we should probably update the Traffic Portal code that uses API 4.x to also send CCH headers

I checked Traffic Portal, and it looks like it's actually already setting these cache-control headers for every request:

app.js:

```js
534:trafficPortal.config(function ($httpProvider) {
535:        $httpProvider.interceptors.push('authInterceptor');
538:        if (!$httpProvider.defaults.headers.get) {
539:                $httpProvider.defaults.headers.get = {};
541:        $httpProvider.defaults.headers.get['Cache-Control'] = 'no-cache, no-store, must-revalidate';
542:        $httpProvider.defaults.headers.get['Pragma'] = 'no-cache';
543:        $httpProvider.defaults.headers.get['Expires'] = 0;
```

I believe those were added because Chrome started caching responses after the last-modified response header was added during the TO IMS project.

We should also be sure that this has an entry in the changelog that states that TO API v4 responses will be cached by TO (if enabled), and that clients should set cache-control headers if they need to bypass it. Also, big breaking changes like this probably deserve a mailing list announcement as well.

@rob05c rob05c force-pushed the add-to-cacherwr branch from 4547b6e to f71ffe9 Compare July 31, 2021 15:59
@rob05c
Member Author

rob05c commented Aug 2, 2021

I'm closing this because @SolidWallOfCode convinced me doing this with ATS (or any proxy really) is a better solution.

@rob05c rob05c closed this Aug 2, 2021