Add Vault Cache and Exponential Backoff for Login #236
Add a cache to the Vault credentials manager. The cache is flaggable.
Add an exponential backoff for login and renewal of vault credentials.
As part of the change, refactor so that both the cache and the retry logic can be tested.
This change has not been tested with a full deployment yet (I think the PR will do that for me? ;p)
This is needed because if a user configures secrets to be passed as params to a check resource it
I considered trying to add a "meta" credentials cache, but I'm not sure that a single caching strategy/manager will be able to correctly abstract all possible credential stores' expected caching behaviors. For example, k8s already has a sophisticated caching mechanism available in its standard library.
Adding caching at the application level may be masking the issue. It seems your
Have you tried actually scaling out?
Concourse is doing the "wrong thing" by not caching a secret that is intended to be cached.
One pipeline with a Vault secret configured sent our load from 500 TPS to 1300 TPS. (The lack of exponential backoff and a hard-coded 1s retry is also a distributed-systems no-no.)
/cc @EugenMayer - this PR would probably make you happy
also ref. hashicorp/vault#3651
Kind of a bummer that there's no blessed Go client that implements this kind of functionality; I was hesitant for us to implement it ourselves, because caching credentials sounds like a bad idea if there's no precedent or strong need for it. But it sounds like there's a means for it via Vault's API, and precedent in other popular libraries; still a shame that we have to implement the caching ourselves.
Given that, I want to be super careful about this and make sure we're taking into consideration any "gotchas" with the implementation (e.g. accidentally using an expired credential after it's been rotated), and any security risks involved (e.g. holding all the secrets in the ATC's head). I say this having not yet reviewed the code in detail.
Are there other Go consumers of Vault we can look to as an example? Have other projects run into this pain point?
referenced this pull request
Dec 17, 2017
Agreed that the Go client not having caching is a bit of a bummer. But on the other hand, the nature of the cache is probably use-case specific anyway. Concourse, for example, needs a read-mostly cache: requests for previously un-requested secrets will be rare.
Hence a cache focused on fast reads and low memory overhead makes sense for Concourse. We could use hashicorp's golang-lru, but it doesn't actively purge records and does cache eviction on the main thread.
The cache I implemented is very efficient for this use case, only performing maintenance when keys expire. However, it does not have a strict upper bound on memory size. This is a complicated matter for Vault secrets, as they are of arbitrary size, so capping the number of secrets only partially prevents a large memory footprint.
It's also worth mentioning that the original code re-implemented the renewer and I followed suit. So that should probably be swapped out.
A long while back Vault did in fact issue leases for the
What we usually see is that people will set a lease duration that is significantly less than the actual expiration, but large enough not to cause high load. For example, let's say you rotate a credential every 12 hours and keep the old one active for 2 hours to allow clients to adjust. The writer of the credential may set the lease duration to 90 minutes, which should be enough time for all clients to retrieve the new value (90 minutes is 5,400 seconds, so checking once per lease instead of every second means 5,399 fewer reads from Vault per client in the interim :-D ).
Although the API could do local caching, you'd need to know how long the cache should hold that value anyway, and still be capable of retrieving new values outside the caching period. So a built-in cache doesn't buy you much other than the ability to keep checking every second instead of simply setting your timer to check at the interval given by the secret.
Eventually we hope to add an eventing API, but it's not going to appear in a "soon" timeframe. If your underlying data store supports an eventing API you could do some mapping of Vault storage to data store storage and use an event on the underlying data store path to trigger a re-read from Vault clients, but I would generally avoid that unless absolutely necessary for your use-case.
@jefferai Thanks for helping us figure this out!
I think there's one thing to clarify: Concourse doesn't ever explicitly refresh anything or do anything special with leases, let alone refresh once a second. We fetch a credential any time we need it, and throw it away when we're done with it.
Here's what's happening on Concourse's side:
After running the
So from Concourse's perspective, these credentials are short-lived. They aren't used throughout the duration of the Concourse server running, for example. The stack is stateless, so we don't have any 'refreshing' mechanism; we just don't use the credentials for very long in the first place.
Considering that, and given the existence of lease durations as a way for the credential writer to convey a safe validity window (separate from expiry), this feels like a generic client-side cache. I'm just worried about us being the ones to write it. ;)
...And quite worried that this may result in a single process that contains literally all of the credentials for all of the pipelines and all of the teams in its head. Not as a question of memory usage, but as a question of risk, and to what extent that defeats the point of using Vault in the first place.
referenced this pull request
Dec 19, 2017
Thanks for the explanation, that helps me understand what Concourse is doing here.
Not going in order:
I can't really help you assess whether that risk is acceptable or not, but it's certainly a better stance than persisting to disk. Using Vault helps sort out the "not persisting to disk in plaintext" part of the equation (and a lot of other functionality), so even if you are caching things on the client side you're still getting benefit.
Between this and the other discussion of client side caching, I'm a bit confused -- do you keep your API clients around? If so that's not really stateless, but if not, I don't really see how caching in the clients will help.
We're still getting a lot of benefit from Vault in plenty of ways, I'm just wary of any step back we may make in security. CI systems are an especially lucrative attack target. Either way it's on us to characterize the risk involved and figure out what we can/should do.
Sorry, all I meant by stateless was that the credentials themselves aren't e.g. stashed in an object somewhere in need of periodic refreshing. They're fetched and then thrown away. Though I guess that's how a lot of Vault consumers work anyway. We do probably have a single API client, so caching would be fine.
Thanks for the explanation. If you do work on a caching system I'd love for it to be in the API client since I think others could benefit. I have a couple thoughts around it:
Probably the first option would be the way to go for now. I do think there should be controls around disabling caching, perhaps on a per-secret (or per-path?) basis, in case the duration that comes back can't be trusted (or some secret shouldn't be kept around in memory).
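The per-path opt-out suggested above could be as simple as a prefix match against a configured deny-list. A sketch under assumed names; neither the function nor the config shape is an existing Concourse or Vault option:

```go
package main

import (
	"fmt"
	"strings"
)

// cacheable reports whether a secret path may be cached, given a
// configured list of path prefixes that must never be cached
// (hypothetical configuration, for illustration only).
func cacheable(path string, noCachePrefixes []string) bool {
	for _, p := range noCachePrefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	deny := []string{"concourse/main/deploy-keys/"}
	fmt.Println(cacheable("concourse/main/deploy-keys/prod", deny)) // false
	fmt.Println(cacheable("concourse/main/docker-creds", deny))     // true
}
```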
Some Design Thoughts:
Current Security Model and Risks
Currently Concourse keeps secrets for all teams accessible via a single login managed by the ATC. If an attacker achieves some sort of "escalation" they can pull that token, client key, and client cert and access all secrets for all teams. While the secrets themselves are not kept in memory, the token is, and it would still be readily available.
Escalation for this token is somewhat difficult as it requires attacking the ATC. The "exploit vector" here is largely through the ATC API. The most common methodology here would be to dump the heap or the configuration of the ATC.
Exploits that escalate using containers in pipeline steps are prevented from attacking the ATC if it runs on physically isolated nodes (this is probably a best practice for the future). However, privilege escalations on a worker could be used to attack other pipelines' secrets that happen to be local. If the ATC is local, it could also be attacked.
The Impact of Caching
Maintaining secrets in memory increases the "timeliness" risk of an attack. If the client certificate is IP-address pinned, or the ATC but not the secrets back end is reachable, caching voids some of those benefits: with a cache an attacker need only dump the heap once to access the secrets, as opposed to dumping it repeatedly to catch secrets in flight.
In practice this additional risk is low as most heap dumps are the result of existing remote code execution which can be crafted to attack these secrets directly.
Per-Pipeline Secrets Management
Maintaining a per-team or per-pipeline login to Vault could help reduce the magnitude of any breach. The secrets configuration could be set on the pipeline rather than globally for the ATC. The trade-off is that this information would have to be persisted. However, the secrets back end could be used for this: the ATC would be configured with a secrets back end in which to store and access the pipeline-specific secrets config.
Changing Where and How Secrets Access Works
Another increase in security would come from "splitting" secrets access and decryption. In this model, access to the secret would reside with the ATC or pipeline, but the secret itself would be further [encrypted with the private key of the worker]. The ATC would access the secret and forward it to the worker, which would then decrypt the secret for use. This adds security because it requires a working exploit on both the worker and the ATC.
Another possibility that is (probably) less work is for secrets resolution to happen in an entirely separate process. A "secrets resolver" could be placed between the worker and the ATC. Its sole job would be to resolve secrets before handing them off to the worker. This would mean simply dumping the ATC's heap through some exploit would buy you nothing. Think the TSA, but for secrets (or possibly just as part of the TSA ;p).
Add Build Signing
One thing I would love to see is the addition of a build signing infrastructure. In this infrastructure, tasks could request that Concourse sign a resource with its private keys so traceable builds could be used. Another usage would be requiring deployments to come from the build infrastructure itself, signing deployment requests with the build infrastructure's private keys.
More correctly, the secret would be stored encrypted with a randomly generated AES key, and that key would be encrypted with the worker's public key.
Add the Cache (in the Vault API Client Code)
Compared to today's model the cache is only a minor increase in risk. However, it is nearly required for most secrets back ends: as evidenced by #233 and vault#3651, secrets without caching can cause significant problems.
I very much like jefferai's suggestion to disable caching on a per path basis.
Look towards per-pipeline secrets and build signing first IMO. But that's up to you guys.
A couple of comments:
You may want to think about using
Just in case you haven't seen it,
Be careful with that line of thinking -- Go already prevents such attacks (subject to things working the way they're supposed to) by taking memory management out of your hands. memguard needs to advertise buffer overflow protection because it works around Go's memory management, which opens it up to a host of potential attacks that it then needs to mitigate.
I've been watching memguard for a while, and while I think it's an interesting library, as far as I can tell the main use of memguard is if you are concerned about values living for long periods in memory after you've given them up and don't want to try mitigating by periodically triggering garbage collections in Go. It won't stop someone that has root from getting your memory -- it will just make it harder via obfuscation.
Unless you feel super strongly about needing memguard, I'd just use something that supports expiry and/or running functions on eviction like https://github.com/patrickmn/go-cache or similar.
I'm a k8s user. I'd be happy with a simple caching strategy that applied to all credential managers, with a configurable TTL defaulting to 60 seconds. k8s secrets don't convey anything around expiry, and giving watch or list permissions to implement anything fancier for caching should be avoided:
There is also a PR to fix the k8s secret lookup so that access to all secrets in a namespace isn't assumed: #233.
referenced this pull request
Jan 4, 2018
referenced this pull request
Mar 28, 2018
vito left a comment
Phew, ok, it's been a while since I've looked at this so having a bit of a refresher of where we landed may be necessary.
Here's my understanding:
So tl;dr we're all good with this change being merged in.
Except I totally threw a wrench into things by merging #256 which touches the Vault factory code and changes the login flow a bit. This will need to be rebased and consolidated with that PR.
I also see a few
Thanks again for your patience, this one's just a bit complicated so it takes more time to digest. We're also just now getting back to PRs; the holidays and site revamp took a lot of time.
Seems like this is causing 3 TopGun tests to fail consistently:
Will take a closer look.
Found a couple problems.
First problem: code was blocking on
Second problem: when copying the Vault client and setting a token, I observed a deadlock that happened because copying the client also copies its mutexes (they're non-pointer values stored on the struct), and it happened to copy one while it was locked. I'm going to find a better way to do this, since it already had a TODO with a warning on it. The warning was right!
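This is the classic pitfall of copying a struct that embeds a `sync.Mutex` by value: the copy inherits the mutex's state, including "currently locked". A minimal illustration of the safe alternative, constructing a fresh value instead of copying the struct (the types here are stand-ins, not the Vault client):

```go
package main

import (
	"fmt"
	"sync"
)

// client is a stand-in for an API client guarded by a mutex;
// copying it by value would also copy the mutex and its state.
type client struct {
	mu    sync.Mutex
	token string
}

// cloneWithToken builds a new client with a fresh, unlocked
// zero-value mutex, rather than copying the whole struct. This is
// the same idea as cloning via a constructor instead of `*c2 = *c`.
func cloneWithToken(c *client, token string) *client {
	c.mu.Lock()
	defer c.mu.Unlock()
	return &client{token: token} // fresh mutex, new token
}

func main() {
	c := &client{token: "original"}
	c2 := cloneWithToken(c, "per-team-token")
	fmt.Println(c.token, c2.token)
}
```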
Use https://godoc.org/github.com/hashicorp/vault/api#Client.Clone instead of copying a client.
Can I ask -- is there a reason you wrote ReAuther instead of using https://godoc.org/github.com/hashicorp/vault/api#Renewer ?