Speed up knife vault refresh #178

tfheen · 2015-07-28T06:33:32Z

Use partial search and threads to speed up.

Fixes: #177

Doing a full search on the node index means the entire node object is downloaded for each hit. As we are only interested in the name, only ask for that.

sysbot · 2015-07-28T07:52:33Z

+1

tfheen · 2015-07-28T08:12:15Z

I'm not entirely sure how to get the specs to run properly here, help appreciated. :-)

jf647 · 2015-07-28T19:37:28Z

Thanks for the idea/contribution.

I am however concerned about thread safety here. I don't believe that Chef guarantees that the chef-client code is thread safe, and even if the current implementation is on MRI, there's no guarantee that it will always be that way.

Inside of the thread, you're calling add_client, which in turn calls add on the ChefVault::ItemKeys object. That mutates the ItemKeys object (pushing entries onto the clients array), but without any protection around the update. On MRI that may work because all user code is implicitly single threaded, but if MRI ever improved its threading to match that of JRuby or Rubinius, it would be subject to race collisions.

It seems like the better way to speed things up is to not search for nodes names, then re-issue the search query to get the public key of each node. Is the public key something we can get from the original search query?

Otherwise, I'd want to see race protection and unit tests to exercise the edge cases in order to consider merging this.

sysbot · 2015-07-28T20:44:22Z

I was thinking about batching as well with the suggestion above (search for nodes, then search for keys from list of nodes). I'm concerned over possible race condition as well in between batch updates as there's no guarantee in between batch that the underlaying client info haven't changed.

tfheen · 2015-07-30T16:12:40Z

]] James FitzGibbon

Inside of the thread, you're calling add_client, which in turn calls
add on the ChefVault::ItemKeys object. That mutates the ItemKeys
object (pushing entries onto the clients array), but without any
protection around the update. On MRI that may work because all user
code is implicitly single threaded, but if MRI ever improved its
threading to match that of JRuby or Rubinius, it would be subject to
race collisions.

Aren't pushing items onto an array thread safe in Ruby? Oh well. :-)

It seems like the better way to speed things up is to not search for
nodes names, then re-issue the search query to get the public key of
each node. Is the public key something we can get from the original
search query?

Nodes don't have keys. Clients have keys, so that's basically my other
suggestion, search the client index for

client_name = 'foo' OR client_name = 'bar' OR client_name = 'baz' OR …

This could potentially be even quicker, since fewer roundtrips.

I can whip up a patch to do this and you can see what you think?

Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are

jf647 · 2015-07-30T17:01:25Z

In MRI, only one user thread can run at a time (more than one system thread can run simultaneously), so you get synchroniziation "for free", but it's implementation dependent and not guaranteed. There are also edge cases in which MRI isn't thread safe:

http://www.rubyinside.com/does-the-gil-make-your-ruby-code-thread-safe-6051.html

But even if we put a mutex around the updates to the item keys, there is no guarantee that the chef client code is thread safe or will remain so. The patch as written issues multiple API operations using the same underlying Chef::REST object in multiple threads. I'm not aware of any chef-authored code that does this, and Chef could choose to break this without warning (because it's not guaranteed by the API).

So, to do any of this multithreaded, there would need to be a mutex around the chef API operations and an mutex around the item key mutations, at which point why bother trying to multithread?

So trying to expand the clients into the "NAME or NAME or NAME" style may be a better route, but what happens with multi-thousand node installations? Does SOLR have a limit on how many clauses can be in the query? What's the algorithmic complexity of SOLR searches? I have no idea off the top of my head. Anything up to linear then your approach is worth exploring, but quadratic or worse is probably a show stopper.

I guess what I'm saying is that I don't see a safe, obvious solution to speed this up. Even the new Chef server API v1 doesn't fix this, as you don't get the key back from a client query - you have to pull a sub-resource (e.g. /clients/foo/keys/1) to get it.

I'm open to changes that make chef-vault better, but only if they don't introduce stability problems or require people to use Chef client or server v12 or Ruby 2. There's a huge installed base of people still using chef-client v11, and preventing them from using chef-vault would have to be paired with a major version bump.

Thanks

This will use a partial search for the client names for smaller deployments, but use a full client search for large deployments. It's still a win due to the much lower number of roundtrips.

tfheen · 2015-08-04T11:06:46Z

I've pushed a non-threaded solution. This runs in ~30s for me, as compared with ~15 minutes with master. Again, this will likely require test adjustments.

Happy for input on the approach.

tfheen · 2015-08-11T16:35:01Z

Have you had a chance to think about the approach in the latest patch here? I believe it is correct (but will of course require fixing up the tests so it actually passes still)

sysbot · 2015-08-11T16:41:23Z

@tfheen I'll go ahead and verify the tests with the new patch (is this upstream now or still a branch?). I think it should still work out of the box since partial_search just go through the same chef-zero search mechanism that emulated by test-kitchen databags injecting databags into the local chef-zero environment.

tfheen · 2015-08-11T16:50:42Z

@sysbot it's not merged yet (even if everything else was great, I don't think upstream would merge a patch with failing tests). As you can see from the status on the pull request, the tests fail (but fixing it should be fairly easy, from what I can see).

jf647 · 2015-08-21T16:14:24Z

The API for Chef::Search::Query changed between Chef 11 and Chef 12:

https://github.com/chef/chef/blob/11.18.14/lib/chef/search/query.rb

https://github.com/chef/chef/blob/12.4.1/lib/chef/search/query.rb

Using v11-style args against the v12 args works, but I don't think it works the other way. The hash of options that you're passing will be interpreted as the sort arg in 11, and it will attempt to URL-encode the .to_s representation of the Hash.

Support for chef-client v11 will not be dropped until the next major release of chef-vault.

I appreciate the attempt to solve this problem, and I'm not trying to be purposefully negative, but the speedup you're trying to achive is not trivial to do without breaking existing installations.

Perhaps what is required here is some ancillary library for managing chef-vault data bags, with its own cli. Then you could use threads with ridley or chef-api (not sure if that's threadsafe, so don't quote me) and get the speedup you desire without the change also affecting runtime decryption of secrets, which is the part that has to be compatible with Ruby 1.9 and chef-client v11.

@tfheen

This is based on @tfheen's work in #178, and a suggestion from @rveznaver in #237 Signed-off-by: Thom May <thom@chef.io>

@tfheen

This is based on @tfheen's work in #178, and a suggestion from @rveznaver in #237 Signed-off-by: Thom May <thom@chef.io>

thommay · 2017-01-24T13:49:25Z

Some of this landed in #240, and it's unlikely we'll rewrite to do threads any time soon. Thanks for the PR!

Use partial searches on the node index

f79f326

Doing a full search on the node index means the entire node object is downloaded for each hit. As we are only interested in the name, only ask for that.

tfheen force-pushed the speedups branch 2 times, most recently from 288e941 to efc947c Compare July 28, 2015 06:55

Use a partial search for loading clients as well

0eaf5aa

This will use a partial search for the client names for smaller deployments, but use a full client search for large deployments. It's still a win due to the much lower number of roundtrips.

tfheen force-pushed the speedups branch from efc947c to 0eaf5aa Compare August 4, 2015 11:05

dougireton added enhancement labels Aug 24, 2015

dougireton assigned jf647 Aug 24, 2015

tfheen unassigned jf647 Nov 24, 2015

rveznaver mentioned this pull request Oct 24, 2016

Vault does not scale well with 2000+ nodes #237

Closed

thommay pushed a commit that referenced this pull request Nov 4, 2016

Optimise queries when finding nodes

2ee7880

This is based on @tfheen's work in #178, and a suggestion from @rveznaver in #237 Signed-off-by: Thom May <thom@chef.io>

thommay mentioned this pull request Nov 4, 2016

Optimise queries when finding nodes #240

Merged

thommay pushed a commit that referenced this pull request Nov 7, 2016

Optimise queries when finding nodes

5b3e325

This is based on @tfheen's work in #178, and a suggestion from @rveznaver in #237 Signed-off-by: Thom May <thom@chef.io>

thommay closed this Jan 24, 2017

thommay added Type: Enhancement Adds new functionality. Status: Pending Contributor Response and removed enhancement labels Jan 25, 2017

tas50 added Triage: Needs Information Indicates an issue needs more information in order to work on it. and removed Status: Pending Contributor Response labels Dec 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up knife vault refresh #178

Speed up knife vault refresh #178

tfheen commented Jul 28, 2015

sysbot commented Jul 28, 2015

tfheen commented Jul 28, 2015

jf647 commented Jul 28, 2015

sysbot commented Jul 28, 2015

tfheen commented Jul 30, 2015

jf647 commented Jul 30, 2015

tfheen commented Aug 4, 2015

tfheen commented Aug 11, 2015

sysbot commented Aug 11, 2015

tfheen commented Aug 11, 2015

jf647 commented Aug 21, 2015

thommay commented Jan 24, 2017

Speed up knife vault refresh #178

Speed up knife vault refresh #178

Conversation

tfheen commented Jul 28, 2015

sysbot commented Jul 28, 2015

tfheen commented Jul 28, 2015

jf647 commented Jul 28, 2015

sysbot commented Jul 28, 2015

tfheen commented Jul 30, 2015

jf647 commented Jul 30, 2015

tfheen commented Aug 4, 2015

tfheen commented Aug 11, 2015

sysbot commented Aug 11, 2015

tfheen commented Aug 11, 2015

jf647 commented Aug 21, 2015

thommay commented Jan 24, 2017