Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

odd cleanup behavior under valgrind #4363

Closed
jmayclin opened this issue Jan 17, 2024 · 3 comments · Fixed by #4369
Closed

odd cleanup behavior under valgrind #4363

jmayclin opened this issue Jan 17, 2024 · 3 comments · Fixed by #4369

Comments

@jmayclin
Copy link
Contributor

Problem:

An odd failure was seen in s2n_examples_test when developing #4351

The failure was seen using

  • gcc 6
  • OpenSSL 1.1.1
  • x86_64
  • ubuntu 18
  • valgrind ?

This failure was also seen using gcc 9

I was not able to replicate this failure with different libcrypto's, and I was not able to replicate this failure when using clang 15 and openssl 1.1.1 on an ARM AL2023 instance.

The stack trace was

==7201== 528 bytes in 4 blocks are still reachable in loss record 128 of 143
==7201==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7201==    by 0x500368A: CRYPTO_malloc (mem.c:222)
==7201==    by 0x50036BD: CRYPTO_zalloc (mem.c:230)
==7201==    by 0x4F0DB17: bn_expand_internal (bn_lib.c:280)
==7201==    by 0x4F0DBD9: bn_expand2 (bn_lib.c:304)
==7201==    by 0x4F0F172: bn_wexpand (bn_lib.c:1001)
==7201==    by 0x4F0E000: BN_bin2bn (bn_lib.c:454)
==7201==    by 0x4EE1A7E: bn_c2i (x_bignum.c:123)
==7201==    by 0x4EDBE5B: asn1_ex_c2i (tasn_dec.c:805)
==7201==    by 0x4EDBD8E: asn1_d2i_ex_primitive (tasn_dec.c:780)
==7201==    by 0x4EDA510: asn1_item_embed_d2i (tasn_dec.c:181)
==7201==    by 0x4EDB84E: asn1_template_noexp_d2i (tasn_dec.c:643)
==7201==    by 0x4EDB3AF: asn1_template_ex_d2i (tasn_dec.c:518)
==7201==    by 0x4EDAE40: asn1_item_embed_d2i (tasn_dec.c:382)
==7201==    by 0x4EDA2EF: ASN1_item_ex_d2i (tasn_dec.c:124)
==7201==    by 0x4EDA26F: ASN1_item_d2i (tasn_dec.c:114)
==7201==    by 0x503A5B0: d2i_RSAPublicKey (rsa_asn1.c:111)
==7201==    by 0x5037CB3: rsa_pub_decode (rsa_ameth.c:107)
==7201==    by 0x509ECEE: x509_pubkey_decode (x_pubkey.c:124)
==7201==    by 0x509EA47: pubkey_cb (x_pubkey.c:46)
==7201==    by 0x4EDB080: asn1_item_embed_d2i (tasn_dec.c:432)
==7201==    by 0x4EDB84E: asn1_template_noexp_d2i (tasn_dec.c:643)
==7201==    by 0x4EDB3AF: asn1_template_ex_d2i (tasn_dec.c:518)
==7201==    by 0x4EDAE40: asn1_item_embed_d2i (tasn_dec.c:382)
==7201==    by 0x4EDB84E: asn1_template_noexp_d2i (tasn_dec.c:643)
==7201==    by 0x4EDB3AF: asn1_template_ex_d2i (tasn_dec.c:518)
==7201==    by 0x4EDAE40: asn1_item_embed_d2i (tasn_dec.c:382)
==7201==    by 0x4EDA2EF: ASN1_item_ex_d2i (tasn_dec.c:124)
==7201==    by 0x4EDA26F: ASN1_item_d2i (tasn_dec.c:114)
==7201==    by 0x509F8D7: d2i_X509 (x_x509.c:109)
==7201==    by 0x154070: s2n_openssl_x509_parse_impl (s2n_openssl_x509.c:50)
==7201==    by 0x154070: s2n_openssl_x509_parse_impl (s2n_openssl_x509.c:43)
==7201==    by 0x1541B9: s2n_openssl_x509_parse_without_length_validation (s2n_openssl_x509.c:65)
==7201==    by 0x141C16: s2n_x509_validator_read_cert_chain (s2n_x509_validator.c:424)
==7201==    by 0x141C16: s2n_x509_validator_process_cert_chain (s2n_x509_validator.c:461)
==7201==    by 0x141C16: s2n_x509_validator_validate_cert_chain (s2n_x509_validator.c:677)
==7201==    by 0x18D1BC: s2n_server_cert_recv (s2n_server_cert.c:46)
==7201==    by 0x14B502: s2n_handshake_read_io (s2n_handshake_io.c:1551)
==7201==    by 0x14CB51: s2n_negotiate_impl (s2n_handshake_io.c:1667)
==7201==    by 0x14CB51: s2n_negotiate_impl (s2n_handshake_io.c:1613)
==7201==    by 0x14D29A: s2n_negotiate (s2n_handshake_io.c:1710)
==7201==    by 0x126129: s2n_example_negotiate (s2n_negotiate.c:21)
==7201==    by 0x122550: s2n_test_example_negotiate (s2n_examples_test.c:75)
==7201==    by 0x1216EF: s2n_run_self_talk_test (s2n_examples_test.c:248)
==7201== 

Curiously, the actual unit test also failed (in addition to the reported memory leak)

s2n_result_is_ok(s2n_run_self_talk_test(s2n_test_example_negotiate)) is not true  (s2n_examples_test.c:322)
Error Message: 'a safety check failed'
 Debug String: 'Error encountered in s2n_examples_test.c:274'
 System Error: Success (0)

It is common to see that failing unit tests will report false positive memory leaks, and once the actual reason for the failure is fixed the memory leak will go away. My understanding here is that that is the case.

This test forks a client and server process, does a handshake over local_io, and then shuts down. I added the following logs to get a bit more of an idea what is going on

pid: 5961 - doing example negotiate
pid: 5962 - doing example negotiate
pid 5961 - doing example shutdown
pid 5962 - doing example shutdown
pid 5961 - returning success
pid 5962 - returning success
pid 5961: exiting with status code of 0
pid 5962: exiting with status code of 0
client pid is 5961
server pid is 5962
the client process exited with 2304

This is the code that the orchestrating processes uses the reap the childresn

    RESULT_ENSURE_EQ(waitpid(client_pid, &status, 0), client_pid);
    RESULT_ENSURE_EQ(status, EXIT_SUCCESS); <- this is the line failing
    RESULT_ENSURE_EQ(waitpid(server_pid, &status, 0), server_pid);
    RESULT_ENSURE_EQ(status, EXIT_SUCCESS);

This is the logging code in the client.

        printf("pid %i: exiting with status code of 0);
        exit(EXIT_SUCCESS);

While the logs are a little cluttered, we can see that nothing seems to be going wrong in the test, but something goes wrong with the client tries to exit which results in an unsuccessful process exit, causing the test to fail.

We can get more information from the exist status of 2304

    if WIFEXITED(status_f) {
        printf("process exited normally\n");
        int exit_status = WEXITSTATUS(status_f);
        printf("the exit status was %i", exit_status);
    } else {
        printf("process did not exit normally\n");
    }
    if WIFSIGNALED(status_f) {
        printf("process signalled to exit\n");
        int more_info = WTERMSIG(status_f);
        printf("the signal code was %i\n", more_info);
    } else {
        printf("process not signalled to exit\n");
    }
process exited normally
the exit status was 9
process not signalled to exit

It's rather hard to interpret these symbols, but it looks like something is forcibly exiting the process, possibly with a SIGKILL (exit status 9?)

This seems to be related to DEFER_CLEANUP on process exit. This commit fixed the valgrind failure: 3c624d7

Solution:

We should be able to understand exactly why this test was failing under valgrind, and exactly what was happening.

Requirements / Acceptance Criteria:

  • why was this only happening for OpenSSL 1.1.1
  • what is killing the process?
  • why is the process getting killed?
  • why wasn't this happening to the server?
@jmayclin
Copy link
Contributor Author

On one hand, this has to be related to attribute/cleanup, because if I remove the DEFER_CLEANUP and just allow the process to exit normally there is no failure.

But on the other hand, this can't be related to attribute/cleanup because those aren't executed at exit : test case.

Perhaps there are some atexit shenanigans? I'll be spinning up an x86 instance and trying to reproduce the failures with the docker image.

@jmayclin
Copy link
Contributor Author

Some further investigation. I was able to locally reproduce using the CI docker image.

The other thing killing the process is actually valgrind. the exit code is simply the argument that we pass to valgrind: --error-exitcode=9. Valgrind kills the process because it detects a memory leak.

My current understanding is that my PR did not introduce a memory leak, rather it was some non functional change that made valgrind able to detect the memory leak. s2n_examples_test.c has always been "leaking" memory since DEFER_CLEANUP doesn't run when it's in the same scope as an exit function. But for an unknown reason Valgrind didn't care about that memory. This was confirmed by entirely removing the cleanup function from the server_process, and checking that Valgrind does not see a memory error.

Additionally, If the rest of the tests cases are removed then a memory leak is not reported in s2n_test_example_negoatiate. I tried adding some sleep statements in various places to see if this was related to timing, and that does not seem to be the case.

@jmayclin
Copy link
Contributor Author

My earlier understanding was correct. s2n_examples_test.c (and several others) have been leaking memory, but the Valgrind errors were being suppressed by an overly general valgrind suppression:

Memcheck:Leak
match-leak-kinds: reachable
...
fun:main

This is a wildcard suppression that suppresses all reachable leaks where the call stack starts with a function named main. Since we use ctest, this is effectively all possible memory leaks that we would have.

The reason that the errors in my PR weren't suppressed was because the call stack happened to be too long. We run valgrind with --num-callers=40 which "Specifies the maximum number of entries shown in stack traces that identify program locations".

This stack trace was then long enough that it didn't match the

...
fun:main

wildcard suppression, because fun:main had been dropped off of valgrind's tracking information. We can confirm this be looking at valgrind's auto-generated suppression for the error:

{
   <truncated by me>
   fun:s2n_test_example_negotiate
   fun:s2n_run_self_talk_test
}

We see that the suggested suppression starts with fun:s2n_run_self_talk_test rather than fun:main, which confirms that this was call-stack depth funkiness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants