Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[percona-xtrabackup] prevent High memory usage for no reason (IO_CLOSE) #1724

Conversation

banuchka
Copy link
Contributor

@banuchka banuchka commented Mar 1, 2024

In the case of IO_CLOSE inside the plugin, we don't need to wait for the subprocess to finish, because it trying to get the data from xtrabackup, but the destination for such data to backup is unavailable.
It causes high memory usage and OOM in some cases.

Please check

  • Short description and the purpose of this PR is present above this paragraph

If you have any questions or problems, please give a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)

@sduehr
Copy link
Member

sduehr commented Mar 12, 2024

Hi, why did you close this PR?
It looks good so far and seems to make sense. I just need to find time to review it properly.
Your contribution is welcome. I will reopen it if you don't mind. If you still want to close it,
it would be nice if you drop a sentence why.

@sduehr sduehr reopened this Mar 12, 2024
@banuchka
Copy link
Contributor Author

Hi @sduehr,
I've found an issue with the restore operation. I want to add more logic and checks before killing the subprocess.
Could you please put the review on hold until additional information or changes from my side?
Does it work for you?
Thanks

@sduehr sduehr added the onhold label Mar 12, 2024
@sduehr
Copy link
Member

sduehr commented Mar 12, 2024

Sure, I've set the on-hold label.

We have a systemtest for the plugin, see
https://github.com/bareos/bareos/tree/master/systemtests/tests/py3plug-fd-percona-xtrabackup
which is running automatically, it failed on restore with

xtrabackup: recognized client arguments: --prepare=1 --target-dir=./tmp/bareos-restores/_percona/4/00000000000000000000_00000000000018129147_0000000001 
/usr/bin/xtrabackup version 8.0.26-18 based on MySQL server 8.0.26 Linux (x86_64) (revision id: 4aecf82)
xtrabackup: cd to /var/lib/jenkins/workspace/bareos_PR-1724/build/systemtests/tests/py3plug-fd-percona-xtrabackup/tmp/bareos-restores/_percona/4/00000000000000000000_00000000000018129147_0000000001/
xtrabackup: Error: cannot open ./xtrabackup_checkpoints
xtrabackup: Error: failed to read metadata from './xtrabackup_checkpoints'
exit(1) is called. Set test to failure and end test.

The systemtest is probably too specific to be used outside of the build environment, but
we are planning to allow contributors access to the test results in the future.

Thanks again, waiting for your notification start review.

Regards,
Stephan

@banuchka banuchka force-pushed the xtrabackup_plugin_terminate_fork_in_case_of_IO_CLOSE branch from 3382006 to 8171df0 Compare March 13, 2024 12:03
@banuchka
Copy link
Contributor Author

Hi @sduehr, is there any way to rerun tests?
It works for me in our environment, but I may have missed something. And please feel free to remove the on hold status.
Thanks.

@banuchka banuchka force-pushed the xtrabackup_plugin_terminate_fork_in_case_of_IO_CLOSE branch from 8171df0 to 2e34195 Compare March 13, 2024 12:31
@sduehr sduehr removed the onhold label Mar 13, 2024
@sduehr
Copy link
Member

sduehr commented Mar 13, 2024

Thanks, build and tests are automatically triggered when new commits are pushed.
The systemtest for the percona-xtrabackup plugin did run successful now.
I'll probably have to suggest some changes during review, as it's required to comply with our guidelines to pass checks for PRs to get merged. For example the first line of a commit message must not be longer than 50 characters, see
https://docs.bareos.org/DeveloperGuide/gitworkflow.html

@banuchka banuchka force-pushed the xtrabackup_plugin_terminate_fork_in_case_of_IO_CLOSE branch from 2e34195 to 051ac71 Compare March 14, 2024 10:26
@banuchka
Copy link
Contributor Author

I rewrote the commit message and added details and explanations about the importance of the changes.

Copy link
Member

@sduehr sduehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider trying self.stream.terminate() first, and only if that fails use self.stream.kill()?

@banuchka
Copy link
Contributor Author

Did you consider trying self.stream.terminate() first, and only if that fails use self.stream.kill()?

Good idea. Honestly, I wrote a separate function for the dump process termination with timeouts and TERM signal instead of killing it straightaway.
Can I do that as an additional commit to the current PR or force push(again), which is the preferred way?
I'm asking because you've already started reviewing it, and I don't want to waste your time.

@banuchka banuchka force-pushed the xtrabackup_plugin_terminate_fork_in_case_of_IO_CLOSE branch 2 times, most recently from 6a2b699 to b9d2b5f Compare March 22, 2024 10:43
@banuchka banuchka requested a review from sduehr March 22, 2024 10:46
@sduehr
Copy link
Member

sduehr commented Mar 22, 2024

Thanks, normally an additional commit is ok. But in this case it's only one file, so force push is ok, too.
I'll review again as soon as possible.

@sduehr
Copy link
Member

sduehr commented Mar 28, 2024

Thanks, looks good so far.
I pushed a small change to comply with our coding guidelines.

Copy link
Member

@sduehr sduehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my suggestions. Are you able to reproduce the case where end_dumper() would go in the second timeout, where the process didn't terminate even after sending SIGTERM?

@banuchka
Copy link
Contributor Author

Please check my suggestions. Are you able to reproduce the case where end_dumper() would go in the second timeout, where the process didn't terminate even after sending SIGTERM?

I'm not sure that is 100% reproducible on my side. I saw an empty return code straight after sending SIGTERM which is why I put an additional 30 seconds waiting for the correct termination.

I agree that that is possible not to wait 2nd 30 sec and jump to the else block with setting 9 to the stream's return code, but prefer to wait a bit more seconds.

@arogge arogge added this to the 24.0.0 milestone Apr 2, 2024
banuchka and others added 5 commits April 9, 2024 10:27
Fix non-terminating subprocess behaviour in case of bareos-storage
unavailable during a Backup job.

This commit terminates the subprocess in case bareos-storage is
unavailable for some reason during a Backup job.
This is a bug fix because the previous behaviour causes high memory
usage that may cause OOM in some cases.

Move the logic of terminating the xtrabackup(dumper) subprocess
in case of IO_CLOSE to separate function end_dumper().
Add timeouts and change the kill signal to terminate.
Add convenient debug and job messages.
Code formatted using black and corrected copyright yeyr to comply with
our coding guidelines.
Co-authored-by: sduehr <stephan.duehr@bareos.com>
Agree, however, I'm not sure that the 9 exit code is the right one here.
I've never been there or seen such a scenario, but potentially it may
happen.

Co-authored-by: sduehr <stephan.duehr@bareos.com>
Co-authored-by: sduehr <stephan.duehr@bareos.com>
@sduehr sduehr force-pushed the xtrabackup_plugin_terminate_fork_in_case_of_IO_CLOSE branch from 3c72f85 to c712412 Compare April 9, 2024 08:33
@sduehr
Copy link
Member

sduehr commented Apr 9, 2024

OK, thanks. Yes, setting 9 to the stream's return isn't perfect, but ok for now.
I pushed some changes to make the commit messages compliant with our rules.

@BareosBot BareosBot merged commit 89eab59 into bareos:master Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants