[C++] Enabling CacheOptions::LazyDefault caused Parquet fuzzing failure #38071

Closed
jorisvandenbossche opened this issue Oct 6, 2023 · 5 comments · Fixed by #38073

@jorisvandenbossche
Member

#37854 introduced a failure in the "AMD64 Ubuntu 22.04 C++ ASAN UBSAN" build (https://github.com/apache/arrow/actions/runs/6430392691/job/17462667620?pr=38069#logs), related to the LazyCache coalesced reads. See details below.

I assume this is an existing bug, given that the PR only changed the default of an option that users could already set before. But changing the default of course makes it more visible.

A potential short-term option is to change only pre_buffer and keep the current non-lazy default cache_options (if that fixes it), or to revert the PR entirely until this is resolved (I don't have time today to look into this in more detail).
That said, the R bindings have already been using CacheOptions::LazyDefault by default for a while.

2023-10-06T10:40:14.0622194Z Running: /arrow/testing/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5640198106120192
2023-10-06T10:40:14.0651320Z /arrow/cpp/src/arrow/io/interfaces.cc:457:  Check failed: (left.offset + left.length) <= (right.offset) Some read ranges overlap
2023-10-06T10:40:14.0661169Z /build/cpp/debug/parquet-arrow-fuzz(backtrace+0x5b)[0x55893309d6bb]
2023-10-06T10:40:14.0678721Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x1a5)[0x7fd67d9f5405]
2023-10-06T10:40:14.0694280Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD2Ev+0x1f7)[0x7fd67d9f5177]
2023-10-06T10:40:14.0708313Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD0Ev+0x61)[0x7fd67d9f5251]
2023-10-06T10:40:14.0722939Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util8ArrowLogD1Ev+0x1d0)[0x7fd67d9f4d80]
2023-10-06T10:40:14.0733586Z /usr/local/lib/libarrow.so.1400(+0xb13f151)[0x7fd67d3cc151]
2023-10-06T10:40:14.0746700Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal18CoalesceReadRangesESt6vectorINS0_9ReadRangeESaIS3_EEll+0x4c1)[0x7fd67d3cac81]
2023-10-06T10:40:14.0762388Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache4Impl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x456)[0x7fd67d2c3be6]
2023-10-06T10:40:14.0775666Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache8LazyImpl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x24a)[0x7fd67d2c1cca]
2023-10-06T10:40:14.0790164Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache5CacheESt6vectorINS0_9ReadRangeESaIS4_EE+0x2a2)[0x7fd67d2bfec2]
2023-10-06T10:40:14.0795950Z /usr/local/lib/libparquet.so.1400(_ZN7parquet14SerializedFile9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x1696)[0x7fd69120ef96]
2023-10-06T10:40:14.0801581Z /usr/local/lib/libparquet.so.1400(_ZN7parquet17ParquetFileReader9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x360)[0x7fd69120d7c0]
2023-10-06T10:40:14.0808329Z /usr/local/lib/libparquet.so.1400(+0x15435e5)[0x7fd6904885e5]
2023-10-06T10:40:14.0808759Z /usr/local/lib/libparquet.so.1400(+0x1542728)[0x7fd690487728]
2023-10-06T10:40:14.0815343Z /usr/local/lib/libparquet.so.1400(+0x1542c7c)[0x7fd690487c7c]
2023-10-06T10:40:14.0816050Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderESt10unique_ptrINS0_10FileReaderESt14default_deleteIS3_EE+0x3e2)[0x7fd69046cdf2]
2023-10-06T10:40:14.0822733Z ==14349== ERROR: libFuzzer: deadly signal
2023-10-06T10:40:14.0823311Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderEPKhl+0x1130)[0x7fd69046e950]
2023-10-06T10:40:14.0824114Z /build/cpp/debug/parquet-arrow-fuzz(+0x118e98)[0x558933121e98]
2023-10-06T10:40:14.0825448Z /build/cpp/debug/parquet-arrow-fuzz(+0x3f354)[0x558933048354]
2023-10-06T10:40:14.0826059Z /build/cpp/debug/parquet-arrow-fuzz(+0x290d0)[0x5589330320d0]
2023-10-06T10:40:14.0826543Z /build/cpp/debug/parquet-arrow-fuzz(+0x2ee27)[0x558933037e27]
2023-10-06T10:40:14.0827941Z /build/cpp/debug/parquet-arrow-fuzz(+0x58c43)[0x558933061c43]
2023-10-06T10:40:14.0828405Z /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd6713bfd90]
2023-10-06T10:40:14.0828882Z /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd6713bfe40]
2023-10-06T10:40:14.0829351Z /build/cpp/debug/parquet-arrow-fuzz(+0x23995)[0x55893302c995]
2023-10-06T10:40:15.2094786Z     #0 0x5589330eeab1 in __sanitizer_print_stack_trace (/build/cpp/debug/parquet-arrow-fuzz+0xe5ab1) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2096115Z     #1 0x558933061348 in fuzzer::PrintStackTrace() (/build/cpp/debug/parquet-arrow-fuzz+0x58348) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2097546Z     #2 0x558933046dc3 in fuzzer::Fuzzer::CrashCallback() (/build/cpp/debug/parquet-arrow-fuzz+0x3ddc3) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2098544Z     #3 0x7fd6713d851f  (/lib/x86_64-linux-gnu/libc.so.6+0x4251f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2099481Z     #4 0x7fd67142ca7b in pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x96a7b) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2101878Z     #5 0x7fd6713d8475 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42475) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2102783Z     #6 0x7fd6713be7f2 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f2) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2103486Z     #7 0x7fd67d9f5193 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:72:7
2023-10-06T10:40:15.2104144Z     #8 0x7fd67d9f5250 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:66:22
2023-10-06T10:40:15.2104793Z     #9 0x7fd67d9f4d7f in arrow::util::ArrowLog::~ArrowLog() /arrow/cpp/src/arrow/util/logging.cc:250:5
2023-10-06T10:40:15.2105719Z     #10 0x7fd67d3cc150 in arrow::io::internal::(anonymous namespace)::ReadRangeCombiner::Coalesce(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/interfaces.cc:457:7
2023-10-06T10:40:15.2106830Z     #11 0x7fd67d3cac80 in arrow::io::internal::CoalesceReadRanges(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >, long, long) /arrow/cpp/src/arrow/io/interfaces.cc:518:19
2023-10-06T10:40:15.2107880Z     #12 0x7fd67d2c3be5 in arrow::io::internal::ReadRangeCache::Impl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:177:14
2023-10-06T10:40:15.2108897Z     #13 0x7fd67d2c1cc9 in arrow::io::internal::ReadRangeCache::LazyImpl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:288:34
2023-10-06T10:40:15.2109909Z     #14 0x7fd67d2bfec1 in arrow::io::internal::ReadRangeCache::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:320:17
2023-10-06T10:40:15.2111039Z     #15 0x7fd69120ef95 in parquet::SerializedFile::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:368:5
2023-10-06T10:40:15.2112348Z     #16 0x7fd69120d7bf in parquet::ParquetFileReader::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:862:9
2023-10-06T10:40:15.2113660Z     #17 0x7fd6904885e4 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:1224:23
2023-10-06T10:40:15.2114817Z     #18 0x7fd690487727 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:321:12
2023-10-06T10:40:15.2115872Z     #19 0x7fd690487c7b in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:325:12
2023-10-06T10:40:15.2116737Z     #20 0x7fd69046cdf1 in parquet::arrow::internal::FuzzReader(std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >) /arrow/cpp/src/parquet/arrow/reader.cc:1374:37
2023-10-06T10:40:15.2117736Z     #21 0x7fd69046e94f in parquet::arrow::internal::FuzzReader(unsigned char const*, long) /arrow/cpp/src/parquet/arrow/reader.cc:1399:11
2023-10-06T10:40:15.2118358Z     #22 0x558933121e97 in LLVMFuzzerTestOneInput /arrow/cpp/src/parquet/arrow/fuzz.cc:22:17
2023-10-06T10:40:15.2119357Z     #23 0x558933048353 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x3f353) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2120490Z     #24 0x5589330320cf in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x290cf) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2121762Z     #25 0x558933037e26 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/build/cpp/debug/parquet-arrow-fuzz+0x2ee26) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2122720Z     #26 0x558933061c42 in main (/build/cpp/debug/parquet-arrow-fuzz+0x58c42) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2123509Z     #27 0x7fd6713bfd8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2124294Z     #28 0x7fd6713bfe3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2125121Z     #29 0x55893302c994 in _start (/build/cpp/debug/parquet-arrow-fuzz+0x23994) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2125489Z 
2023-10-06T10:40:15.2126159Z NOTE: libFuzzer has rudimentary signal handlers.
2023-10-06T10:40:15.2127161Z       Combine libFuzzer with AddressSanitizer or similar for better crash reports.
2023-10-06T10:40:15.2127655Z SUMMARY: libFuzzer: deadly signal
2023-10-06T10:40:16.9350640Z 77
2023-10-06T10:40:17.0185097Z Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-cpp-sanitizer` exited with a non-zero exit code 77, see the process log above.

Originally posted by @jorisvandenbossche in #37854 (comment)

@mapleFU
Member

mapleFU commented Oct 6, 2023

I've reproduced this problem on my local PC. It seems that this will not occur in a production environment.

  1. The file is corrupt: it gives wrong column ranges and sizes, which might overlap.
  2. From the overlapping ranges, the reader generates PreBuffer requests, which can cause a DCHECK in the driver to fail.
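
The failing condition in the log, `(left.offset + left.length) <= (right.offset)`, can be illustrated with a small standalone sketch. `ReadRange` below is a simplified stand-in for `arrow::io::ReadRange`, and `HasOverlap` is a hypothetical helper written for illustration, not a function from Arrow:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for arrow::io::ReadRange (assumption: only these two fields matter).
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical helper: returns true if any two ranges overlap.
// After sorting by offset it is enough to compare each range with its successor,
// using the same condition as the failed check in the log.
bool HasOverlap(std::vector<ReadRange> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    const ReadRange& left = ranges[i - 1];
    const ReadRange& right = ranges[i];
    if (left.offset + left.length > right.offset) {
      return true;  // left ends after right begins
    }
  }
  return false;
}
```

A corrupt file whose metadata reports, say, one column chunk at offset 0 with length 10 and another at offset 5 would produce ranges for which this helper returns true.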

I think this could be fixed 🤔 Only DEBUG builds will raise this failure. And I think changing to Default will not improve this 🤔

@jorisvandenbossche
Member Author

Thanks for taking a look!

> And I think changing to Default will not improve this

Indeed, #38072 confirms this. The default CacheOptions show the same issue.

Is the fix to make this an actual error, instead of only a debug check? (because it should still error properly when reading an invalid file?)

@mapleFU
Member

mapleFU commented Oct 6, 2023

https://github.com/apache/arrow/pull/38073/files

I have a basic fix, but I don't know whether putting the check here is OK (maybe there is a better place). Waiting for @pitrou's review.

@mapleFU
Member

mapleFU commented Oct 6, 2023

> Is the fix to make this an actual error, instead of only a debug check? (because it should still error properly when reading an invalid file?)

🤔 Maybe it's OK to change the signature:

-  std::vector<ReadRange> Coalesce(std::vector<ReadRange> ranges)
+  Result<std::vector<ReadRange>> Coalesce(std::vector<ReadRange> ranges)

But currently the check only affects debug builds, and nothing is reported in release mode. So I'm waiting for your advice...

@lidavidm
Member

lidavidm commented Oct 6, 2023

If it's possible for a file with invalid/overlapping ranges to make it this far, then yeah, we should make Coalesce return Result instead of asserting/doing something incorrect silently.
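
As a rough sketch of that idea, the standalone example below validates the ranges at runtime and reports an error instead of asserting. It does not use Arrow: `ReadRange` and `CoalesceResult` are simplified, hypothetical stand-ins for `arrow::io::ReadRange` and `Result<std::vector<ReadRange>>`, and `CoalesceChecked` with its single `hole_size_limit` parameter is an assumption for illustration, not the actual change in #38073:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for arrow::io::ReadRange.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical stand-in for Result<std::vector<ReadRange>>:
// either ok with ranges, or not ok with an error message.
struct CoalesceResult {
  bool ok;
  std::string error;
  std::vector<ReadRange> ranges;
};

// Sorts the ranges, rejects overlapping input in every build type,
// then merges neighbors separated by at most hole_size_limit bytes.
CoalesceResult CoalesceChecked(std::vector<ReadRange> ranges, int64_t hole_size_limit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    const ReadRange& left = ranges[i - 1];
    const ReadRange& right = ranges[i];
    if (left.offset + left.length > right.offset) {
      // Returned as an error value rather than checked with a debug-only assertion.
      return {false, "Some read ranges overlap", {}};
    }
  }
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty() &&
        r.offset - (merged.back().offset + merged.back().length) <= hole_size_limit) {
      // Extend the previous range to cover this one and the hole between them.
      merged.back().length = r.offset + r.length - merged.back().offset;
    } else {
      merged.push_back(r);
    }
  }
  return {true, "", merged};
}
```

With this shape, a caller such as a pre-buffering step could propagate the error to the user for an invalid file, in release builds as well as debug builds.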

@kou kou closed this as completed in #38073 Oct 8, 2023
kou pushed a commit that referenced this issue Oct 8, 2023
…38073)

### Rationale for this change

The C++ Parquet Arrow fuzzer can generate a bad Parquet file with bad row ranges. This patch changes `CoalesceReadRanges` to return `Result<>`.

### What changes are included in this PR?

Just a check: change `CoalesceReadRanges` to return `Result<>`.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.

* Closes: #38071

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>