Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memoize git shell commands #223

Merged
merged 1 commit into from Sep 11, 2023
Merged

Memoize git shell commands #223

merged 1 commit into from Sep 11, 2023

Conversation

adsr
Copy link
Contributor

@adsr adsr commented May 8, 2023

Description

We have a shell script that invokes chef install and chef export against a setup with ~30 cookbooks. It was annoyingly slow so we dug into why that was the case. We found that these commands call ChefCLI::CookbookProfiler::Git in a loop which ends up shelling out the same handful of git commands repeatedly. For example, one run produced the following duplicate git commands (sorted by dupe count):

 74   execve("/usr/bin/git", ["git", "rev-parse", "HEAD"], <snip>
 46   execve("/usr/bin/git", ["git", "rev-parse", "HEAD"], <snip>
 41   execve("/usr/bin/git", ["git", "rev-parse", "--abbrev-ref", "HEAD"], <snip>
 39   execve("/usr/bin/git", ["git", "diff-files", "--quiet"], <snip>
 25   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"], <snip>
 23   execve("/usr/lib/git-core/git", ["/usr/lib/git-core/git", "status", "--porcelain=2", "-uno"], <snip>
 23   execve("/usr/bin/git", ["git", "config", "--get", "branch.HEAD.remote"], <snip>
 21   execve("/usr/bin/git", ["git", "diff-files", "--quiet"], <snip>
 19   execve("/usr/bin/git", ["git", "rev-parse", "--abbrev-ref", "HEAD"], <snip>
 17   execve("/usr/bin/git", ["git", "config", "--get", "branch.<snip>.remote"], <snip>
 16   execve("/usr/bin/git", ["git", "config", "--get", "remote.origin.url"], <snip>
 14   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"], <snip>
 14   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"], <snip>
 12   execve("/usr/bin/git", ["git", "config", "--get", "remote.origin.url"], <snip>
 11   execve("/usr/bin/git", ["git", "config", "--get", "branch.<snip>.remote"], <snip>
  9   execve("/usr/bin/git", ["git", "config", "--get", "branch.HEAD.remote"], <snip>
  7   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"], <snip>
  5   execve("/usr/lib/git-core/git", ["/usr/lib/git-core/git", "status", "--porcelain=2", "-uno"], <snip>

These all seem to be cacheable at the repo-level. Adding a simple memoization brought down our run from ~45s to ~1s.

To be safer, we could wrap this is a git_memo method to make the memoization behavior more explicit. I thought I'd collect some initial feedback on the general idea first. For example this would break things if each cookbook were its own repo. I'm not sure if it's that an expected / supported use-case.

I modified the code to memoize at the repo-level, so cookbook-per-repo should continue to work.

Related Issue

n/a

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change that does not add functionality or fix an issue)

Checklist:

  • I have read the CONTRIBUTING document.
  • I have run the pre-merge tests locally and they pass.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commits have been signed-off for the Developer Certificate of Origin.

@Stromweld
Copy link
Contributor

For example this would break things if each cookbook were its own repo. I'm not sure if it's that an expected / supported use-case.

That is actually recommended best practice and highly used use case to have each cookbook as it's own repo.

@adsr
Copy link
Contributor Author

adsr commented May 9, 2023

For example this would break things if each cookbook were its own repo. I'm not sure if it's that an expected / supported use-case.

That is actually recommended best practice and highly used use case to have each cookbook as it's own repo.

Understood. Any ideas for how we might reap this perf win without forking Chef?

@Stromweld
Copy link
Contributor

Stromweld commented May 9, 2023

Only thing that comes to mind is by adding to knife config file a property to define cookbook mono repo or multiple repo approach and then use that to make decision to run your optimization or run the old loop.

Another approach I just thought of is to use git initially and then check if all cookbooks live in subdirectory and then use your memorize approach and break out of loop otherwise move on through loop for other cookbooks.

Just Ideas, haven't fully looked at code for viability or complexity of this kind of change.

@Stromweld
Copy link
Contributor

@adsr a question I have also is around how your storing all your cookbooks? Are you using central artifact store like artifactory, internal chef supermarket, or chef-server? I have found dep resolution is a lot faster when using one of these options as it's able to download a single index and depsolve against that vs fetching each cookbook from git and parsing files for dependency information.

@adsr
Copy link
Contributor Author

adsr commented May 9, 2023

Only thing that comes to mind is by adding to knife config file a property to define cookbook mono repo or multiple repo approach and then use that to make decision to run your optimization or run the old loop.

Another approach I just thought of is to use git initially and then check if all cookbooks live in subdirectory and then use your memorize approach and break out of loop otherwise move on through loop for other cookbooks.

Just Ideas, haven't fully looked at code for viability or complexity of this kind of change.

OK, I have a few ideas as well after looking at things more carefully. The simplest move to support repo-per-cookbook would be to include repo path as part of the memo key. The other idea I had was to parallelize the commands. That would be a bigger change.

@adsr a question I have also is around how your storing all your cookbooks? Are you using central artifact store like artifactory, internal chef supermarket, or chef-server? I have found dep resolution is a lot faster when using one of these options as it's able to download a single index and depsolve against that vs fetching each cookbook from git and parsing files for dependency information.

In dev context, which is what I'm concerned with here, the cookbook store is the file system. I profiled the code a bit more carefully and found the dep solving code (starting from PolicyfileCompiler::graph_solution) occupies a small sliver of wall time compared to git-related code.

Here are rbspy flamegraphs of chef install and chef export with git.rb parts highlighted:

flamegraph of chef install

flamegraph of chef export

Alright, I just pushed a modified solution that keys the memo cache by repo path (git rev-parse --show-toplevel) as I mentioned above. It's not as big of a win because we are still calling this rev-parse command un-memoized many times. But it still gives us a big boost (~45s -> 8s) and should work with repo-per-cookbook setups too.

@Stromweld Stromweld added Aspect: Performance Works without negatively affecting the system running it. Type: Enhancement Adds new functionality. labels May 9, 2023
@Stromweld
Copy link
Contributor

nice work

@Stromweld
Copy link
Contributor

@adsr looks like there are a couple of cookstyle fixes needed. See buildkite job.

We have a shell script that invokes `chef install` and `chef export`
against a setup with ~30 cookbooks. It was annoyingly slow so we dug
into why that was the case. We found that these commands call
`ChefCLI::CookbookProfiler::Git` in a loop which ends up shelling out
the same handful of git commands repeatedly. For example, one run
produced the following duplicate git commands (sorted by dupe count):

```
 74   execve("/usr/bin/git", ["git", "rev-parse", "HEAD"],
 46   execve("/usr/bin/git", ["git", "rev-parse", "HEAD"],
 41   execve("/usr/bin/git", ["git", "rev-parse", "--abbrev-ref", "HEAD"],
 39   execve("/usr/bin/git", ["git", "diff-files", "--quiet"],
 25   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"],
 23   execve("/usr/lib/git-core/git", ["/usr/lib/git-core/git", "status", "--porcelain=2", "-uno"],
 23   execve("/usr/bin/git", ["git", "config", "--get", "branch.HEAD.remote"],
 21   execve("/usr/bin/git", ["git", "diff-files", "--quiet"],
 19   execve("/usr/bin/git", ["git", "rev-parse", "--abbrev-ref", "HEAD"],
 17   execve("/usr/bin/git", ["git", "config", "--get", "branch.<snip>.remote"],
 16   execve("/usr/bin/git", ["git", "config", "--get", "remote.origin.url"],
 14   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"],
 14   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"],
 12   execve("/usr/bin/git", ["git", "config", "--get", "remote.origin.url"],
 11   execve("/usr/bin/git", ["git", "config", "--get", "branch.<snip>.remote"],
  9   execve("/usr/bin/git", ["git", "config", "--get", "branch.HEAD.remote"],
  7   execve("/usr/bin/git", ["git", "branch", "-r", "--contains", "<snip>"],
  5   execve("/usr/lib/git-core/git", ["/usr/lib/git-core/git", "status", "--porcelain=2", "-uno"],
```

Adding this memoization brings our run from ~45s down to ~8s.

The memo cache is keyed by repo path (`rev-parse --show-toplevel`)
which should support both cookbook-mono-repo and repo-per-cookbook.

Signed-off-by: Adam Saponara <as@php.net>
@sonarcloud
Copy link

sonarcloud bot commented May 23, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@adsr
Copy link
Contributor Author

adsr commented May 23, 2023

@Stromweld Sorry, missed those. Fixed.

@tpowell-progress
Copy link

@vkarve-chef came across this PR and looks like someone from Chef Workstation needs to weigh in to merge.

@vkarve-chef
Copy link
Collaborator

sure @tpowell-progress, we'll take a look in a week or two. thanks @adsr for creating the PR!

@ericnorris
Copy link

Hey @vkarve-chef, did your team have a chance to look at this? This would make a huge difference in the iteration speed of our local development process.

@vkarve-chef
Copy link
Collaborator

@ericnorris @adsr we intend to get to this by around mid-September. Sorry, it's taking us so long!
cc - @clintoncwolfe

@nikhil2611 nikhil2611 merged commit 70a0cce into chef:main Sep 11, 2023
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Aspect: Performance Works without negatively affecting the system running it. Type: Enhancement Adds new functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants