stored: fix authentication race condition / deadlock #1732

Merged
BareosBot merged 13 commits into bareos:master from dev/ssura/master/fix-authentication on Mar 18, 2024

Conversation

sebsura
Contributor

@sebsura sebsura commented Mar 12, 2024

Thank you for contributing to the Bareos Project!

Sometimes the file daemon (fd) and the storage daemon (sd) do not agree on the authentication status of the connection, which leads to each of them waiting for the other.
This is caused in part by incorrect use of condition variables: the sd can fail to notice that the authenticated condition changed from false to true.

This PR also adds a timeout check to our systemtests. If a single run_bconsole invocation takes more than 100 seconds, the test runner creates backtraces of the currently running daemons and exits the test with exit code 124.

This should make it easier to debug hangs (like the one above) in our CI pipeline.

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please leave a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
  • Check backport line
  • Required backport PRs have been created
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)
  • Required documentation changes are present and part of the PR
Tests
  • Decision taken that a test is required (if not, then remove this paragraph)
  • The choice of the type of test (unit test or systemtest) is reasonable
  • Testname matches exactly what is being tested
  • On a fail, output of the test leads quickly to the origin of the fault

Member

@pstorz pstorz left a comment

Please see comments

core/CTestScript.cmake.in (outdated, resolved)
core/src/filed/restore.cc (outdated, resolved)
sebsura and others added 13 commits March 18, 2024 13:54
The condition variable is not used correctly:

// reader
1|  while (!unprotected) {
2|      wait(cond_var)
    }

// writer
3|  unprotected = true;
4|  signal(cond_var)

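The execution order 1->3->4->2 will cause a deadlock: the reader sees
the condition as false (1), the writer then sets it and signals (3, 4)
while nobody is waiting yet, and the reader goes to sleep (2) on a
signal that has already been sent. This is why the wait command takes
a mutex: everything that might change the condition to be true needs
to lock the mutex. This way we can ensure that we either see the
updated value or the wait sees the signal.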

Since jcr->authenticate is used all over the place in a lot of
different situations, this problem could not be easily fixed by just
protecting that variable (we do not want weird deadlocks to happen
after all).

Instead, we no longer rely on jcr->authenticate when it comes to
waiting on job start. We now have a single, properly protected
bool `client_available` that we can wait on.
This bool needs to be set by whoever authenticates the FD/SD
connection, otherwise the job will deadlock. But at least that is
easily fixable.
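
A dedicated flag of this kind might be wrapped as sketched below. Only
the name `client_available` comes from the commit message; the class
`ClientGate`, its methods and the timed wait are assumptions made for
the example, not the implementation in this PR.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical wrapper; only the name client_available is from the commit.
class ClientGate {
 public:
  // Called by whoever authenticates the FD/SD connection.
  void MarkAvailable()
  {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      client_available_ = true;
    }
    cond_.notify_all();
  }

  // Called by the job that waits for its client. Returns true if the
  // client became available, false if the timeout expired first.
  bool WaitFor(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mtx_);
    // The predicate overload re-checks the flag under the mutex after
    // every wakeup and returns the final value of the predicate.
    return cond_.wait_for(lock, timeout,
                          [this] { return client_available_; });
  }

 private:
  std::mutex mtx_;
  std::condition_variable cond_;
  bool client_available_ = false;  // protected by mtx_
};
```

Because the flag is private and only touched under its own mutex, a
code path that forgets to call `MarkAvailable()` would surface in this
sketch as a `false` return from `WaitFor()` rather than as a silent
hang.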
once that timeout is reached, we kill the daemons and create a trace.
@BareosBot BareosBot force-pushed the dev/ssura/master/fix-authentication branch from 603b237 to f229099 Compare March 18, 2024 13:54
@BareosBot BareosBot merged commit 61febc7 into bareos:master Mar 18, 2024

3 participants