Keepalived 2.2.4 memory leak? #2199
hi,
I have migrated an old keepalived setup to 2.2.4. It has now been running for one week, and I got a notification about the memory usage:

pmap 9294 | tail -n 1
total 4479348K

It is eating more than 4GB of RAM.

OS: CentOS Stream 9
keepalived-2.2.4-2.el9.x86_64

It does VRRP, IPVS and some HTTP health checks.

Comments
Can you please provide the output of … ? Are there any child processes of any of the keepalived processes, and if so, have they been running for long? In order for us to test this ourselves, we will need a full copy of your keepalived configuration and of any scripts that it runs.
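The inline commands in the comment above were lost in extraction; as a minimal sketch, assuming a standard keepalived install (the process selection and the <keepalived_pid> placeholder are assumptions, not from the thread), the requested information could be gathered like this:

    # list the keepalived parent and its vrrp/checker children, with start time and memory
    ps -eo pid,ppid,lstart,rss,cmd | grep '[k]eepalived'

    # check whether any keepalived process has further children (e.g. long-running check scripts)
    ps --ppid <keepalived_pid> -o pid,lstart,cmd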
Thanks for the fast feedback, but it wasn't fast enough: I had to restart the process before the OOM killer did it for me.
hi @pqarmitage, here are the stats after nearly 24 hours:

Now the commands you requested, run against PID 325339:
I will anonymize the config and upload it here.
@pqarmitage here is the config:
@ruben-herold Are you able to provide the pmap outputs for the second keepalived process (i.e. the one with the middle PID) soon after it starts, so that I can see what the differences are? It is the keepalived checker process that is growing; having looked at your config, I suspected it would have to be that one. It would be interesting to see if …
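A sketch of how such before/after snapshots might be captured for comparison; <checker_pid> is a placeholder, and the one-hour interval is an arbitrary choice rather than anything specified in the thread:

    # snapshot the checker's memory map shortly after start, and again later
    pmap -x <checker_pid> > pmap.start
    sleep 3600
    pmap -x <checker_pid> > pmap.later
    diff pmap.start pmap.later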
@ruben-herold Are the TCP_CHECKs always successful, always failing, generally successful, generally failing, or a mixture of each?
@pqarmitage since the restart they have always been successful, as far as I can see.
@pqarmitage over the weekend it grew massively again:
@pqarmitage so we are now at …

So I will restart again this evening. Or do you need any further data?
@ruben-herold The output of …
@pqarmitage it seems not to be the sockets:
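The command whose output ruled out a socket leak is not visible in the extracted page; a plausible sketch of how it could be checked (PID 325339 comes from the earlier comment; everything else here is an assumption):

    # count file descriptors (including sockets) held by the checker process
    ls /proc/325339/fd | wc -l

    # or list TCP sockets owned by keepalived processes
    ss -tanp | grep keepalived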
@ruben-herold Many apologies for taking so long to look at this. I have now been able to set up an environment to run most of your configuration, but I still have to sort out the setup for the SSL_GET and HTTP_GET checkers, which I hadn't previously noticed in your configuration. Running with just the TCP_CHECKs being able to connect doesn't appear to cause a memory leak.

My suspicion is that either the HTTP_GETs or the SSL_GETs are causing the memory leak. Would it be possible for you to run with the HTTP_GETs and the SSL_GETs disabled for a few hours, to see whether not using those two types of checker stops the memory leak? If it does, and you were then able to run with just the SSL_GETs disabled, and then with just the HTTP_GETs disabled, it would identify which of the checkers (or both) is causing the problem.

The HTTP_GET checker uses the EVP digest routines from the OpenSSL/LibreSSL library, and the SSL_GET checker uses the SSL routines. It is quite possible that we are not using the library properly, causing a memory leak; alternatively, there may be a memory leak in the library itself. Do you know whether you are using OpenSSL or LibreSSL? It would also be helpful to know which version of the library you are using, so that we can attempt to run with the same version.
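A sketch of what the suggested isolation test could look like in the configuration; the addresses, port and URL path are placeholders, not values from the real (anonymized) config:

    virtual_server 192.0.2.10 443 {
        delay_loop 10
        real_server 203.0.113.20 443 {
            # SSL_GET {                  # temporarily disabled for the test
            #     url {
            #         path /health
            #         status_code 200
            #     }
            #     connect_timeout 5
            # }
            TCP_CHECK {                  # leave only the TCP connectivity check running
                connect_timeout 5
            }
        }
    }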
@ruben-herold Looking in more detail at the …
This shows that the heap allocation grew by 132kB between the first two commands, and by 276kB between the second and third commands. On my development system, running your configuration with all checkers succeeding, the pmap total size for the checker process is 508kB larger than for the other two keepalived processes. The checker process has an extra heap segment compared to the other two processes, and it is precisely 508kB (i.e. the difference between the process sizes). I am therefore assuming that when keepalived starts, the checker process has a total size of about 25200kB. You indicate above that after nearly 24 hours it had grown to 513676k, a growth of about 488500k, i.e. roughly 20353k per hour or 340k per minute.

I have now been running your configuration on my development system with keepalived version 2.2.4 for one hour, and the size of that heap segment has not increased since keepalived started. It is using OpenSSL 1.1.1n.

I have now run your configuration on a CentOS 9 VM, and I can see the problem occurring, although the rate of increase of the heap size is about 64k per minute. Are you running more than 2 SSL_GET checkers in your live configuration? The version of OpenSSL there is 3.0.1-41. I have also run both keepalived v2.2.4 and the current code from GitHub on Fedora 36, which uses OpenSSL 3.0.5, and that experiences the problem too.

The problem only occurs with the SSL_GET checker, and not with the HTTP_GET or TCP_CHECK checkers. As a workaround for now, can you replace the SSL_GET checkers with HTTP_GET checkers and specify … ?

In the meantime I will check that the way keepalived uses the SSL library code is correct (but note that it works without a problem with OpenSSL 1.1.1n), or see whether I can reproduce the problem with a simple program using just the OpenSSL SSL code.
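A sketch of how the growth rates quoted above (roughly 340k per minute on the live host, about 64k per minute on the CentOS 9 test VM) could be measured; <checker_pid> is a placeholder:

    # log the checker's total mapped size once a minute;
    # the difference between successive lines gives the growth rate
    while sleep 60; do
        date +%T
        pmap <checker_pid> | tail -n 1
    done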
@pqarmitage thanks for your work. Yes, it is CentOS 9 with OpenSSL 3. I'm on vacation at the moment, but I will try to test it without SSL.
@ruben-herold You really shouldn't be looking at work emails while on vacation :), but nevertheless thanks for responding. I hope you enjoy the rest of your vacation without being disturbed by work matters, and I look forward to the results of your testing once you are back from your vacation. |
@ruben-herold I have done some further work, modifying the keepalived malloc/free checker to report on OpenSSL mallocs/frees. This has identified that some memory allocated in ssl3_setup_write_buffer() in ssl/record/ssl_buffer.c with … is being leaked. Purely by chance I discovered that adding … resolves this. I believe that this should resolve your memory leak, and it would be very useful if you could test it and provide feedback.
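For reference, the keepalived malloc/free checker mentioned above is enabled at build time; a sketch, assuming a source build (--enable-mem-check is keepalived's configure switch for this, but the exact flags used in this investigation are not stated in the thread):

    # build keepalived with its internal memory allocation tracking enabled
    ./configure --enable-mem-check
    make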
Update to the previous comment: there were 15 commits between OpenSSL v3.0.1 and v3.0.5 whose descriptions said that they fixed memory leaks.
@pqarmitage thanks for your great work!! I will create a ticket for CentOS with the information and a link to this issue.
This appears to be fully understood and resolved now. |