Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: repo-server - prevent locked state after unclean git checkout #11805

Merged
merged 5 commits into from Feb 1, 2023

Conversation

arturhoo
Copy link
Contributor

@arturhoo arturhoo commented Dec 21, 2022

Fixes #7898 - in busy systems an installation with a large number of apps defined in deep respositories, checkout operations can cross the defined ARGOCD_EXEC_TIMEOUT (default 90s). When that occurs, the git process receives a SIGKILL with no chance of leaving the local repo checkout in a clean state.

As a result, further git operations fail with "fatal: Unable to create '/private..git/index.lock': File exists".

Depends on argoproj/pkg#270 - now that a custom signal can be passed into to argoproj/pkg/exec, we can send SIGTERM instead and wait for git to exit cleanly before continuing with other operations.

Discussion on slack: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1670965919247029

Testing

  • Small unit test added for the util/exec/exec module.
  • I poked around a bit but didn't see a straight forward way to create an e2e test that simulates the behaviour described in Repo server unable to checkout commit #7898 (comment) - open to feedback though test implemented on 57b6c2a probably flaky though

@arturhoo arturhoo changed the title safe git checkout fix: prevent repo-server from getting into a locked state after unclean git checkout kill Dec 21, 2022
@arturhoo arturhoo changed the title fix: prevent repo-server from getting into a locked state after unclean git checkout kill fix: repo-server - prevent locked state after unclean git checkout kill Dec 21, 2022
@arturhoo arturhoo changed the title fix: repo-server - prevent locked state after unclean git checkout kill fix: repo-server - prevent locked state after unclean git checkout Dec 21, 2022
@codecov
Copy link

codecov bot commented Dec 21, 2022

Codecov Report

Base: 47.43% // Head: 47.44% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (b149eb5) compared to base (06b98c5).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #11805   +/-   ##
=======================================
  Coverage   47.43%   47.44%           
=======================================
  Files         246      246           
  Lines       41826    41833    +7     
=======================================
+ Hits        19840    19847    +7     
  Misses      19990    19990           
  Partials     1996     1996           
Impacted Files Coverage Δ
util/exec/exec.go 100.00% <100.00%> (ø)
util/git/client.go 50.44% <100.00%> (+0.56%) ⬆️
util/settings/settings.go 49.20% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@arturhoo arturhoo marked this pull request as ready for review December 21, 2022 22:06
@crenshaw-dev
Copy link
Collaborator

Would it be feasible to write a test that reproduces the exact problem? i.e. getting the cached repo into a bad state

@arturhoo
Copy link
Contributor Author

arturhoo commented Dec 23, 2022

@crenshaw-dev that is possible: 57b6c2a

I have mixed feelings about the dependency of this test on the timing of the git operations, which could lead to flakiness:

  1. A slow git execution might prevent the test from creating the index file in the first place, causing the test to fail
  2. A fast git execution might actually lead to the checkout being complete before the timeout happens, causing the test to fail

Let me know what you think. I can confirm that the changes from this PR (SIGTERM and waiting for the process to finish) make that test pass. If you're happy with the trade-off, I can cherry pick that commit with the test into this PR.

A potential improvement (albeit much bigger) would be to run those git operations in a container with cgroupv2 IO limits in place.

Artur Rodrigues added 3 commits January 4, 2023 10:44
Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>
Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>
Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>
util/exec/exec_test.go Outdated Show resolved Hide resolved
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Copy link
Collaborator

@crenshaw-dev crenshaw-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @arturhoo!

To be extra cautious, I'd like to avoid cherry-picking this back any further than necessary until it's validated in someone's prod environment. I think I'll cherry-pick to 2.5. @harikrongali has an environment where he can validate.

If all looks good, we can cherry-pick it even further back.

@arturhoo
Copy link
Contributor Author

arturhoo commented Feb 1, 2023

Thanks for the approval @crenshaw-dev!

@harikrongali and I actually work together at Lacework now (small world :P ), so I'm happy to get it validated in our prod installation. We're on v2.5, so if you think the cherry pick and a new patch release under 2.5 it is reasonable, getting it rolled out is quite straightforward for us.

@crenshaw-dev
Copy link
Collaborator

Sounds good, will merge and cherry-pick after CI passes. There's one other bugfix currently on release-2.5. I'll see if I can get anything else merged today and plan to cut a release tomorrow.

@harikrongali
Copy link

Thanks @crenshaw-dev for merging the PR. We will validate in our env once 2.5.x is available.

@crenshaw-dev crenshaw-dev merged commit e510a77 into argoproj:master Feb 1, 2023
@crenshaw-dev
Copy link
Collaborator

Cherry-picked onto release-2.6.

The 2.5 cherry-pick bumps a lot of unrelated dependency versions. The downside of abstracting things out to a package with an uncoordinated release schedule I guess.

Would y'all be okay leaving this for 2.6+, considering it goes GA on Monday?

@arturhoo
Copy link
Contributor Author

arturhoo commented Feb 2, 2023

Would y'all be okay leaving this for 2.6+, considering it goes GA on Monday?

Yes, that works too! Thanks

@arturhoo arturhoo deleted the safe-git-checkout branch February 2, 2023 09:22
crenshaw-dev added a commit that referenced this pull request Feb 2, 2023
…7898) (#11805)

* Pull in new version of argoproj/pkg

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Allow timeout behavior to be specified in util/exec/exec

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Git processes receive SIGTERM when timedout

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Update util/exec/exec_test.go

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
schakrad pushed a commit to schakrad/argo-cd that referenced this pull request Mar 14, 2023
…rgoproj#7898) (argoproj#11805)

* Pull in new version of argoproj/pkg

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Allow timeout behavior to be specified in util/exec/exec

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Git processes receive SIGTERM when timedout

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>

* Update util/exec/exec_test.go

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Artur Rodrigues <artur.rodrigues@lacework.net>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: schakrad <chakradari.sindhu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repo server unable to checkout commit
3 participants