Extend the token cache lifetime #17
The public key cache lifetime controls how quickly we can revoke a signer key entirely; it also controls how long of an outage we can sustain. Based on the discussions in the WLCG authorization working group, we believe 4 days is a better tradeoff than the original 1 day. This is still a finite time -- but allows token issuers to survive longer (e.g., weekend) outages and have the infrastructure continue to operate.
I'm fully in support of this pull request as-is. I think the pull request could be improved by updating the document to explain the rationale for 4 days: surviving a server outage.
I'm not sure I understand how this is going to help survive an outage: the access tokens (hopefully) will not have a lifetime of 4 days (we currently set a 6 hour max.), so within those 4 days, tokens will need to be renewed several times in any case, requiring the issuer to be available.
To phrase it differently: I think the change under discussion includes a suggested 4d max for access tokens (rereading the notes from last Thursday).
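The point made in the two comments above can be put as a small calculation. This is only a sketch under a simplifying assumption that is not stated in the thread: the outage starts just after a client has refreshed the issuer's keys and obtained a fresh access token, which is the best case. The function name `survivable_outage` is hypothetical.

```python
from datetime import timedelta


def survivable_outage(key_cache_lifetime: timedelta,
                      token_lifetime: timedelta) -> timedelta:
    """Best-case issuer outage that a client can ride out.

    Validating a token needs both a cached public key and an unexpired
    access token, so the shorter of the two lifetimes is the limit.
    """
    return min(key_cache_lifetime, token_lifetime)


# 4-day key cache with the current 6-hour access token maximum
print(survivable_outage(timedelta(days=4), timedelta(hours=6)))
# 4-day key cache with a 4-day access token maximum
print(survivable_outage(timedelta(days=4), timedelta(days=4)))
```

With a 6-hour token limit the key cache lifetime is not the limiting factor, which is why the access token lifetime comes up in this discussion.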
Just to mention that, by itself, extending the duration during which the public keys are cacheable carries some risk (longer is worse if the private key of the OP is compromised); however, the specific change in this pull request is (IMHO) relatively inoffensive.

However, you're right: the stated goal (allowing "the infrastructure continue to operate") only works if the AT validity is also increased.

As background, I understand we're (somewhat) caught between a rock and a hard place. @bbockelm can, perhaps, provide a more authoritative statement; however, I believe the problem here is that OSG will drop support for X.509-based software on a timescale in which 24/7 support for IAM will not be ready. The current support level for IAM is something like "office hours". Reliable authentication is a prerequisite during data taking. So the AuthZ-WG is looking at how to have OAuth2 survive a possible weekend outage: a problem on Friday late afternoon that isn't fixed until Monday. Hence extending various lifetime durations.

During the meeting, I tried to argue against this ... pretty much for the reasons you described, @msalle. Also, the risk is greater because the AT is used as a bearer token, unlike a proxy credential. Therefore our experiences with X.509 may not be the best guide. Unfortunately, I felt like a lone voice, with others commending this as a "pragmatic solution".

Incidentally, the part about increasing the AT lifetime was also discussed. It was proposed as a "gentlemen's agreement" that we should just ignore the max duration of the AT in the profile. I also objected to this. I felt that any changes to the document really should be documented, along with the rationale. If we plan to use ATs with a validity longer than 6 hours (already a tad on the long side, IMHO), then we write this down. I really don't like having unwritten verbal agreements that "we" all know: "yeah, that bit of the profile you just ignore".
We can always be faced with situations where short-term work-arounds are needed. However, we should take steps to make sure they really are short-term. The history of the grid is filled with short-term fixes that nobody can get rid of. I also fear that this change will be introduced and nobody will be able to get rid of it.
Hi @paulmillar, I fully agree with you and should have been more active on the calls but simply don't have the time )-:
I, too, have been remiss and have missed many of the AuthZ-WG meetings. It was by chance that I managed to make this particular meeting. I'm perfectly happy with this not going in the profile document, but I think it should be written down somewhere, otherwise people have different recollections and therefore different expectations. Perhaps as an "implementation notes" document? I absolutely agree with having a very clear end-time (as part of our "exit strategy"). This, too, should be documented so people actually enforce it.
I agree that, if approved, it certainly needs to be documented, but not in this standards and best practice document. I'd say not even as implementation notes. It would basically be a policy decision by WLCG, actually probably by the GDB, to temporarily not follow the agreed best practices put forward in the WLCG AuthZ WG common JWT profile document.
I much prefer @bbockelm's alternate suggestion in the meeting, which was to register the public key of a separate non-IAM issuer (which issues tokens directly on the pilot job submission node) in the web service that IAM will use for its public key. If that web service does not have HA and 24/7 support, that could be a reason for increasing the lifetime of key caches, while still using short-lived access tokens.
On the other hand, getting HA 24/7 support for a web service is a lot easier to do than it is for IAM, so maybe that could instead be provided in the short term.
Elsewhere in the profile it says:
I think this pull request could be improved as follows:
I agree with this, in particular point 3, which means we need two separate table rows. It should be made clear that clients MUST be configured to refresh the keys at short intervals, and only fall back on the caching time of 4 days if the server cannot be reached. My worry is that clients will just refresh the keys very infrequently, since it says it is the maximum.
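The behaviour asked for in the comment above (refresh often, and rely on the 4-day cache only when the issuer is unreachable) could look roughly like the following. This is a minimal sketch, not taken from the profile or from any existing client: the class `KeyCache`, the `fetch` callable, and the 10-minute refresh interval are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional

REFRESH_INTERVAL = 10 * 60          # assumed value: try to refresh every 10 minutes
MAX_CACHE_LIFETIME = 4 * 24 * 3600  # 4 days, the maximum discussed in this PR


class KeyCache:
    """Caches an issuer's public keys (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], dict],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch      # downloads the key set; may raise on failure
        self._clock = clock      # injectable so the logic can be tested
        self._keys: Optional[dict] = None
        self._fetched_at = 0.0   # time of the last successful fetch

    def get_keys(self) -> dict:
        now = self._clock()
        age = now - self._fetched_at
        if self._keys is None or age >= REFRESH_INTERVAL:
            try:
                self._keys = self._fetch()
                self._fetched_at = now
                return self._keys
            except Exception:
                # Issuer unreachable: keep using the cached copy, but only
                # while it is no older than the maximum cache lifetime.
                if self._keys is None or age > MAX_CACHE_LIFETIME:
                    raise
        return self._keys
```

The two constants correspond to the two separate table rows suggested above: one for how often a client should try to refresh, and one for how long a stale copy may still be trusted.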
Hi all, For now, this PR provides a simple change that better reflects the current reality and I will thus merge it, thanks!
Martin, I don't understand why you're talking about token lifetimes here. This PR was about caching token issuer key info. The title of the PR may have misled you. I would change the title from "token cache lifetime" to "token issuer public key cache lifetime", but I am not given the option, so I likely don't have the permissions.
Hi Dave, |