Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][CI] test-conda-cpp-valgrind fails on arrow-dataset-file-parquet-encryption-test #38304

Closed
raulcd opened this issue Oct 17, 2023 · 9 comments · Fixed by #38306
Closed

[C++][CI] test-conda-cpp-valgrind fails on arrow-dataset-file-parquet-encryption-test #38304

raulcd opened this issue Oct 17, 2023 · 9 comments · Fixed by #38306

Comments

@raulcd
Copy link
Member

raulcd commented Oct 17, 2023

Describe the bug, including details regarding any error messages, version, and platform.

The arrow-dataset-file-parquet-encryption-test seems to have an issue when running the valgrind jobs both on the maintenance branch and main:

The error:

66/87 Test #66: arrow-dataset-file-parquet-encryption-test ...***Failed   15.28 sec
==12896== Memcheck, a memory error detector
==12896== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==12896== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==12896== Command: /build/cpp/debug//arrow-dataset-file-parquet-encryption-test
==12896== 
Running main() from /build/cpp/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DatasetEncryptionTest
[ RUN      ] DatasetEncryptionTest.WriteReadDatasetWithEncryption
[       OK ] DatasetEncryptionTest.WriteReadDatasetWithEncryption (1647 ms)
[ RUN      ] DatasetEncryptionTest.ReadSingleFile
[       OK ] DatasetEncryptionTest.ReadSingleFile (79 ms)
[----------] 2 tests from DatasetEncryptionTest (1733 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (8572 ms total)
[  PASSED  ] 2 tests.
==12896== 
==12896== HEAP SUMMARY:
==12896==     in use at exit: 30,776 bytes in 254 blocks
==12896==   total heap usage: 37,143 allocs, 36,889 frees, 6,508,003 bytes allocated
==12896== 
==12896== 904 bytes in 1 blocks are definitely lost in loss record 59 of 70
==12896==    at 0x4849724: malloc (vg_replace_malloc.c:431)
==12896==    by 0x7053989: CRYPTO_zalloc (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x70094A9: ossl_err_get_state_int (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x700A189: ERR_set_mark (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x6F8D9EF: CONF_modules_load_file_ex (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x6F8DDD5: ossl_config_int (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x705282B: ossl_init_config_ossl_ (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x6C0DEE7: __pthread_once_slow (pthread_once.c:116)
==12896==    by 0x7060919: CRYPTO_THREAD_run_once (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x70530B3: OPENSSL_init_crypto (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x700775C: ossl_engine_table_select (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x71389D5: RAND_get_rand_method (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896==    by 0x7139319: RAND_bytes_ex (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==12896== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==12896== 
==12896== For lists of detected and suppressed errors, rerun with: -s
==12896== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

Component(s)

C++, Continuous Integration

@raulcd
Copy link
Member Author

raulcd commented Oct 17, 2023

@pitrou this seems to be a possible leak that should be fixed for the release, right?

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

cc @tolleybot @ggershinsky

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

My understanding is that it's an initialization-related leak that is not easy to avoid properly. The reason it appeared in dataset is probably because dataset loads data from multiple threads, and OpenSSL can need extra care from the application to release per-thread memory
(see the NOTES section in https://www.openssl.org/docs/man3.1/man3/OPENSSL_cleanup.html).

As such, this leak is certainly benign and does not need to be addressed for the release.

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

That said, I wonder which of the situations in that NOTES section is being hit, because it seems to me that none should apply here:

  • this is not on Windows
  • OpenSSL is linked dynamically (see reference to libcrypto.so.3 in stack trace)
  • OpenSSL is not loaded using dlopen

So it's a bit of a mystery why this occurs.

Also cc @westonpace

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

I'll try to play with this a bit more, but to reiterate: this isn't release-blocker.

@raulcd
Copy link
Member Author

raulcd commented Oct 17, 2023

Thanks @pitrou !

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

Looking at the Valgrind output in more detail, it seems there are two different leaks, both of which are reported in a thread pool's worker thread:

  • one leak related to OPENSSL_init_crypto:
==9667== 
==9667== 904 bytes in 1 blocks are definitely lost in loss record 59 of 69
==9667==    at 0x4849724: malloc (vg_replace_malloc.c:431)
==9667==    by 0x705B989: CRYPTO_zalloc (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x70114A9: ossl_err_get_state_int (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x7012189: ERR_set_mark (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x6F959EF: CONF_modules_load_file_ex (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x6F95DD5: ossl_config_int (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x705A82B: ossl_init_config_ossl_ (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x6C15EE7: __pthread_once_slow (pthread_once.c:116)
==9667==    by 0x7068919: CRYPTO_THREAD_run_once (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x705B0B3: OPENSSL_init_crypto (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x700F75C: ossl_engine_table_select (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x71409D5: RAND_get_rand_method (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x7141319: RAND_bytes_ex (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x4CA9D3E: parquet::encryption::RandBytes(unsigned char*, int) (encryption_internal.cc:649)
==9667==    by 0x4CADFBB: parquet::encryption::CryptoFactory::GetFileEncryptionProperties(parquet::encryption::KmsConnectionConfig const&, parquet::encryption::EncryptionConfiguration const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<arrow::fs::FileSystem> const&) (crypto_factory.cc:75)
==9667==    by 0x49A9145: arrow::dataset::ParquetFileFormat::MakeWriter(std::shared_ptr<arrow::io::OutputStream>, std::shared_ptr<arrow::Schema>, std::shared_ptr<arrow::dataset::FileWriteOptions>, arrow::fs::FileLocator) const (file_parquet.cc:707)
==9667==    by 0x48F7FD4: arrow::dataset::internal::(anonymous namespace)::OpenWriter(arrow::dataset::FileSystemDatasetWriteOptions const&, std::shared_ptr<arrow::Schema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (dataset_writer.cc:131)
==9667==    by 0x48F8202: arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const (dataset_writer.cc:149)
==9667==    by 0x48F92E4: std::enable_if<((!std::is_void<arrow::Status>::value)&&(!arrow::detail::is_future<arrow::Status>::value))&&((!arrow::Future<arrow::internal::Empty>::is_empty)||std::is_same<arrow::Status, arrow::Status>::value), void>::type arrow::detail::ContinueFuture::operator()<arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1}&, , arrow::Status, arrow::Future<arrow::internal::Empty> >(arrow::Future<arrow::internal::Empty>, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1}&) const (future.h:150)
==9667==    by 0x48F937C: void std::__invoke_impl<void, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1}&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1}&) (invoke.h:61)
==9667==    by 0x48F93D8: __invoke<arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, const std::string&)::<lambda()>::<lambda()>&> (invoke.h:96)
==9667==    by 0x48F93D8: __call<void, 0, 1> (functional:495)
==9667==    by 0x48F93D8: operator()<> (functional:580)
==9667==    by 0x48F93D8: arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::Start(arrow::util::AsyncTaskScheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda()#1}::operator()() const::{lambda()#1})> >::invoke() (functional.h:152)
==9667==    by 0x56D3C9F: arrow::internal::FnOnce<void ()>::operator()() && (functional.h:140)
==9667==    by 0x56D6524: arrow::internal::WorkerLoop(std::shared_ptr<arrow::internal::ThreadPool::State>, std::_List_iterator<std::thread>) (thread_pool.cc:457)
==9667==    by 0x56D67FD: arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}::operator()() const (thread_pool.cc:618)
==9667==    by 0x56D6852: __invoke_impl<void, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > (invoke.h:61)
==9667==    by 0x56D6852: __invoke<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > (invoke.h:96)
==9667==    by 0x56D6852: _M_invoke<0> (std_thread.h:279)
==9667==    by 0x56D6852: operator() (std_thread.h:286)
==9667==    by 0x56D6852: std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() (std_thread.h:231)
==9667==    by 0x6A41E94: execute_native_thread_routine (thread.cc:104)
==9667==    by 0x6C10AC2: start_thread (pthread_create.c:442)
==9667==    by 0x6CA1BF3: clone (clone.S:100)
==9667== 
  • one leak related to rand_new_drbg:
==9667== 2,464 (40 direct, 2,424 indirect) bytes in 1 blocks are definitely lost in loss record 67 of 69
==9667==    at 0x4849724: malloc (vg_replace_malloc.c:431)
==9667==    by 0x705B989: CRYPTO_zalloc (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x7039036: EVP_RAND_CTX_new (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x71401FC: rand_new_drbg (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x71412E0: RAND_get0_public (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x7141350: RAND_bytes_ex (in /opt/conda/envs/arrow/lib/libcrypto.so.3)
==9667==    by 0x4CAB6BB: parquet::encryption::AesEncryptor::AesEncryptorImpl::Encrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) (encryption_internal.cc:170)
==9667==    by 0x4CAB77E: parquet::encryption::AesEncryptor::Encrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) (encryption_internal.cc:304)
==9667==    by 0x4C2608C: parquet::Encryptor::Encrypt(unsigned char const*, int, unsigned char*) (internal_file_encryptor.cc:37)
==9667==    by 0x4BB9C05: parquet::SerializedPageWriter::WriteDictionaryPage(parquet::DictionaryPage const&) (column_writer.cc:291)
==9667==    by 0x4BD09BE: parquet::TypedColumnWriterImpl<parquet::PhysicalType<(parquet::Type::type)2> >::WriteDictionaryPage() (column_writer.cc:1338)
==9667==    by 0x4BC70E4: parquet::ColumnWriterImpl::Close() (column_writer.cc:1065)
==9667==    by 0x4BC727E: parquet::TypedColumnWriterImpl<parquet::PhysicalType<(parquet::Type::type)2> >::Close() (column_writer.cc:1212)
==9667==    by 0x4B5F519: parquet::arrow::(anonymous namespace)::ArrowColumnWriterV2::Write(parquet::ArrowWriteContext*) (writer.cc:160)
==9667==    by 0x4B63138: parquet::arrow::FileWriterImpl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray> const&, long, long) (writer.cc:346)
==9667==    by 0x4B5EAAD: parquet::arrow::FileWriterImpl::WriteTable(arrow::Table const&, long)::{lambda(long, long)#1}::operator()(long, long) const (writer.cc:374)
==9667==    by 0x4B61581: parquet::arrow::FileWriterImpl::WriteTable(arrow::Table const&, long) (writer.cc:387)
==9667==    by 0x49A71CD: arrow::dataset::ParquetFileWriter::Write(std::shared_ptr<arrow::RecordBatch> const&) (file_parquet.cc:748)
==9667==    by 0x48F926E: arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1}::operator()() const (dataset_writer.cc:236)
==9667==    by 0x48F9404: std::enable_if<((!std::is_void<arrow::Status>::value)&&(!arrow::detail::is_future<arrow::Status>::value))&&((!arrow::Future<arrow::internal::Empty>::is_empty)||std::is_same<arrow::Status, arrow::Status>::value), void>::type arrow::detail::ContinueFuture::operator()<arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1}&, , arrow::Status, arrow::Future<arrow::internal::Empty> >(arrow::Future<arrow::internal::Empty>, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1}&) const (future.h:150)
==9667==    by 0x48F949C: void std::__invoke_impl<void, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1}&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1}&) (invoke.h:61)
==9667==    by 0x48F94F8: __invoke<arrow::detail::ContinueFuture&, arrow::Future<arrow::internal::Empty>&, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::<lambda()>&> (invoke.h:96)
==9667==    by 0x48F94F8: __call<void, 0, 1> (functional:495)
==9667==    by 0x48F94F8: operator()<> (functional:580)
==9667==    by 0x48F94F8: arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, arrow::dataset::internal::(anonymous namespace)::DatasetWriterFileQueue::WriteNext(std::shared_ptr<arrow::RecordBatch>)::{lambda()#1})> >::invoke() (functional.h:152)
==9667==    by 0x56D3C9F: arrow::internal::FnOnce<void ()>::operator()() && (functional.h:140)
==9667==    by 0x56D6524: arrow::internal::WorkerLoop(std::shared_ptr<arrow::internal::ThreadPool::State>, std::_List_iterator<std::thread>) (thread_pool.cc:457)
==9667==    by 0x56D67FD: arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}::operator()() const (thread_pool.cc:618)
==9667==    by 0x56D6852: __invoke_impl<void, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > (invoke.h:61)
==9667==    by 0x56D6852: __invoke<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > (invoke.h:96)
==9667==    by 0x56D6852: _M_invoke<0> (std_thread.h:279)
==9667==    by 0x56D6852: operator() (std_thread.h:286)
==9667==    by 0x56D6852: std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() (std_thread.h:231)
==9667==    by 0x6A41E94: execute_native_thread_routine (thread.cc:104)
==9667==    by 0x6C10AC2: start_thread (pthread_create.c:442)
==9667==    by 0x6CA1BF3: clone (clone.S:100)
==9667==

pitrou added a commit to pitrou/arrow that referenced this issue Oct 17, 2023
…set-file-parquet-encryption-test

If OpenSSL initializes itself from a non-main thread, it can fail deallocating all memory at shutdown.

This is really a benign leak, but we don't want any spurious CI errors.
@mapleFU
Copy link
Member

mapleFU commented Oct 17, 2023

So, this memory is some static memory and forget to free them during shutdown?

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

That's my understanding, yes. This is memory that seems to be allocated once when initializing OpenSSL (though rand_new_drbg might be thread-local, I don't know).

@kou kou changed the title [C++][CI] test--conda-cpp-valgrind fails on arrow-dataset-file-parquet-encryption-test [C++][CI] test-conda-cpp-valgrind fails on arrow-dataset-file-parquet-encryption-test Oct 18, 2023
pitrou added a commit that referenced this issue Oct 18, 2023
…le-parquet-encryption-test (#38306)

If OpenSSL initializes itself from a non-main thread, it can fail deallocating all memory at shutdown.

This is really a benign leak, but we don't want any spurious CI errors.

* Closes: #38304

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 15.0.0 milestone Oct 18, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…set-file-parquet-encryption-test (apache#38306)

If OpenSSL initializes itself from a non-main thread, it can fail deallocating all memory at shutdown.

This is really a benign leak, but we don't want any spurious CI errors.

* Closes: apache#38304

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…set-file-parquet-encryption-test (apache#38306)

If OpenSSL initializes itself from a non-main thread, it can fail deallocating all memory at shutdown.

This is really a benign leak, but we don't want any spurious CI errors.

* Closes: apache#38304

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…set-file-parquet-encryption-test (apache#38306)

If OpenSSL initializes itself from a non-main thread, it can fail deallocating all memory at shutdown.

This is really a benign leak, but we don't want any spurious CI errors.

* Closes: apache#38304

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment