Got orc::ParseError "bad read in nextBuffer" when using SearchArgument with nested struct #1296

Closed
jnwan opened this issue Oct 26, 2022 · 9 comments

jnwan commented Oct 26, 2022

Below is the code to reproduce the issue. It works if I remove the empty struct column "col2", write a small number of rows, or change the value to "rand() % 100".

Am I doing anything wrong?

This is on version 1.7.2.

Code

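  // Fragment from a test body; it assumes these headers and `using namespace orc;`:
  //   #include <orc/OrcFile.hh>
  //   #include <orc/sargs/SearchArgument.hh>
  //   #include <cstdlib>  // rand()
  WriterOptions options;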
  auto stream = writeLocalFile("orc_file_test");
  MemoryPool* pool = getDefaultPool();
  std::unique_ptr<Type> type(Type::buildTypeFromString(
      "struct<col0:struct<col1:int>,col2:struct<col3:int>>"));

  size_t num = 50000;
  std::unique_ptr<Writer> writer = createWriter(*type, stream.get(), options);

  std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(num);
  StructVectorBatch* structBatch =
      dynamic_cast<StructVectorBatch*>(batch.get());
  StructVectorBatch* structBatch2 =
      dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
  LongVectorBatch* intBatch =
      dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);

  StructVectorBatch* structBatch3 =
      dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);
  LongVectorBatch* intBatch2 =
      dynamic_cast<LongVectorBatch*>(structBatch3->fields[0]);

  structBatch->numElements = num;
  structBatch2->numElements = num;

  structBatch3->numElements = num;
  structBatch3->hasNulls = true;

  for (size_t i = 0; i < num; ++i) {
    // col1: random non-null values
    intBatch->data.data()[i] = rand() % 150000;
    intBatch->notNull[i] = 1;

    // col2 and its child col3: every row is null
    intBatch2->notNull[i] = 0;
    structBatch3->notNull[i] = 0;
  }
  intBatch->hasNulls = false;
  intBatch2->hasNulls = true;

  writer->add(*batch);
  writer->close();

  ReaderOptions readOptions;
  readOptions.setMemoryPool(*getDefaultPool());
  auto reader = createReader(readLocalFile("orc_file_test"), readOptions);
  orc::RowReaderOptions rowOptions;
  rowOptions.searchArgument(
      SearchArgumentFactory::newBuilder()
          ->startAnd()
          .equals(2, PredicateDataType::LONG, Literal((int64_t)5))
          .end()
          .build());
  std::unique_ptr<RowReader> rowReader = reader->createRowReader(rowOptions);

  batch = rowReader->createRowBatch(num);
  structBatch = dynamic_cast<StructVectorBatch*>(batch.get());
  structBatch2 = dynamic_cast<StructVectorBatch*>(structBatch->fields[0]);
  intBatch = dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]);

  structBatch3 = dynamic_cast<StructVectorBatch*>(structBatch->fields[1]);

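  // On 1.7.2, next() throws orc::ParseError here (see the stack trace below).
  while (rowReader->next(*batch)) {
    for (size_t i = 0; i < batch->numElements; i++) {
      // Row contents are irrelevant to the repro.
    }
  }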

Stack trace:

terminate called after throwing an instance of 'orc::ParseError'
  what():  bad read in nextBuffer
*** Aborted at 1666816640 (Unix time, try 'date -d @1666816640') ***
*** Signal 6 (SIGABRT) (0x2035c0002b7ad) received by PID 178093 (pthread TID 0x7ffb12545a80) (linux TID 178093) (maybe from PID 178093, UID 131932) (code: -6), stack trace: ***
    @ 0000000000000000 (unknown)
    @ 000000000009c9d3 __GI___pthread_kill
    @ 00000000000444ec __GI_raise
    @ 000000000002c432 __GI_abort
    @ 00000000000a3fd4 __gnu_cxx::__verbose_terminate_handler()
    @ 00000000000a1b39 __cxxabiv1::__terminate(void (*)())
    @ 00000000000a1ba4 std::terminate()
    @ 00000000000a1e6f __cxa_throw
    @ 0000000001efcd55 __cxa_throw
    @ 00000000075b676c orc::BooleanRleDecoderImpl::seek(orc::PositionProvider&)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ByteRLE.cc:526
    @ 00000000075af711 orc::IntegerColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:120
    @ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
    @ 00000000075af67f orc::StructColumnReader::seekToRowGroup(std::unordered_map<unsigned long, orc::PositionProvider, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, orc::PositionProvider> > >&)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/ColumnReader.cc:965
    @ 0000000007598179 orc::RowReaderImpl::seekToRowGroup(unsigned int)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:440
    @ 000000000759d700 orc::RowReaderImpl::startNextStripe()
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1037
    @ 000000000759daf4 orc::RowReaderImpl::next(orc::ColumnVectorBatch&)
                       /home/engshare/third-party2/apache-orc/1.7.2/src/orc/c++/src/Reader.cc:1055
    @ 0000000002fba9bc main
    @ 000000000002c656 __libc_start_call_main
    @ 000000000002c717 __libc_start_main_alias_2
    @ 0000000002fb2780 _start

@dongjoon-hyun
Member

Thank you for reporting, @jnwan .

@wgtmac
Member

wgtmac commented Oct 27, 2022

I have reproduced the issue. The root cause is that the reader tries to read col3 (columnId = 4), which has no usable stream: its PRESENT and DATA streams both have zero length, as listed below. The parent of col3 is col2 (columnId = 3), whose values are all null, so the reader should stop at col2 without touching col3.

Rows: 50000
Compression: ZLIB
Compression size: 65536
Calendar: Julian/Gregorian
Type: struct<col0:struct<col1:int>,col2:struct<col3:int>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 50000 hasNull: false
    Column 1: count: 50000 hasNull: false
    Column 2: count: 50000 hasNull: false min: 0 max: 149992 sum: 3752012883
    Column 3: count: 0 hasNull: true
    Column 4: count: 0 hasNull: true sum: 0

File Statistics:
  Column 0: count: 50000 hasNull: false
  Column 1: count: 50000 hasNull: false
  Column 2: count: 50000 hasNull: false min: 0 max: 149992 sum: 3752012883
  Column 3: count: 0 hasNull: true
  Column 4: count: 0 hasNull: true sum: 0

Stripes:
  Stripe: offset: 3 data: 129019 rows: 50000 tail: 68 index: 216
    Stream: column 0 section ROW_INDEX start: 3 length 17
    Stream: column 1 section ROW_INDEX start: 20 length 17
    Stream: column 2 section ROW_INDEX start: 37 length 122
    Stream: column 3 section ROW_INDEX start: 159 length 35
    Stream: column 4 section ROW_INDEX start: 194 length 25
    Stream: column 2 section DATA start: 219 length 129007
    Stream: column 3 section PRESENT start: 129226 length 12
    Stream: column 4 section PRESENT start: 129238 length 0
    Stream: column 4 section DATA start: 129238 length 0
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT
    Encoding column 4: DIRECT_V2
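
To make the "stop at col2" point concrete, here is a minimal, self-contained sketch. It is not ORC code and not the actual fix: `SketchColumn`, `isAllNull`, `collectColumnsToSeek` and `makeIssueSchema` are hypothetical names, and the all-null test (count of 0 with hasNull set) is an assumption based on the stripe statistics above. It only illustrates which columns a seek would need to visit if the reader stopped descending at an all-null parent.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of one column in the reader's type tree.
struct SketchColumn {
  uint64_t columnId;
  uint64_t valueCount;  // "count" from the stripe statistics
  bool hasNull;         // "hasNull" from the stripe statistics
  std::vector<SketchColumn> children;
};

// Assumption: no non-null values plus hasNull means every row is null.
bool isAllNull(const SketchColumn& col) {
  return col.hasNull && col.valueCount == 0;
}

// Collect the ids of the columns whose streams a seek would position.
// An all-null parent is still visited (it has a PRESENT stream), but its
// children are skipped because they have nothing to read.
void collectColumnsToSeek(const SketchColumn& col, std::vector<uint64_t>& out) {
  out.push_back(col.columnId);
  if (isAllNull(col)) {
    return;
  }
  for (const auto& child : col.children) {
    collectColumnsToSeek(child, out);
  }
}

// Type tree and statistics copied from the dump above:
// id 0 = root, id 1 = col0, id 2 = col1, id 3 = col2, id 4 = col3.
SketchColumn makeIssueSchema() {
  SketchColumn id2{2, 50000, false, {}};
  SketchColumn id1{1, 50000, false, {id2}};
  SketchColumn id4{4, 0, true, {}};
  SketchColumn id3{3, 0, true, {id4}};
  return SketchColumn{0, 50000, false, {id1, id3}};
}
```

With the schema from this issue, the traversal visits columns 0, 1, 2 and 3 and never reaches column 4, the one with zero-length streams.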

@wgtmac
Member

wgtmac commented Oct 27, 2022

I will file a JIRA and fix it shortly.

@coderex2522
Contributor

ColumnReader needs to fix this bug by handling the case where the data stream does not exist.
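
A rough sketch of that kind of guard, using a toy decoder rather than the real ORC classes. `SketchDecoder`, `seekUnguarded` and `seekIfPresent` are hypothetical, and the error text is reused from this issue only to mirror the symptom; the real change in ColumnReader may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoder over the bytes of one stream.
class SketchDecoder {
 public:
  explicit SketchDecoder(std::vector<uint8_t> bytes) : bytes_(std::move(bytes)) {}

  bool empty() const { return bytes_.empty(); }
  std::size_t position() const { return pos_; }

  // A seek needs a readable byte at the target offset; a zero-length
  // stream has none, so seeking into it always fails.
  void seek(std::size_t offset) {
    if (offset >= bytes_.size()) {
      throw std::runtime_error("bad read in nextBuffer");
    }
    pos_ = offset;
  }

 private:
  std::vector<uint8_t> bytes_;
  std::size_t pos_ = 0;
};

// Unguarded seek: fails on a zero-length stream, as in the stack trace.
void seekUnguarded(SketchDecoder& decoder, std::size_t offset) {
  decoder.seek(offset);
}

// Guarded seek: skip streams that carry no data.
// Returns true if a seek was actually performed.
bool seekIfPresent(SketchDecoder& decoder, std::size_t offset) {
  if (decoder.empty()) {
    return false;
  }
  decoder.seek(offset);
  return true;
}
```

In this model, `seekIfPresent` on a zero-length stream returns false instead of throwing, while a non-empty stream is positioned as usual.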

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 27, 2022

Thank you so much, @wgtmac and @coderex2522 !

@coderex2522
Contributor

I have created a new issue in Jira.

@jnwan
Author

jnwan commented Oct 28, 2022

@wgtmac has explained the root cause well. I just want to reemphasize that the same issue happens with other complex column types such as map: an empty map also gets the "bad read in nextBuffer" error.

@wgtmac
Member

wgtmac commented Nov 2, 2022

> @wgtmac has explained the root cause well. I just want to reemphasize that the same issue happens with other complex column types such as map: an empty map also gets the "bad read in nextBuffer" error.

This issue has been fixed in the main branch. Please give it a try and let us know if there is any problem. Thanks @jnwan !

wgtmac closed this as completed Nov 2, 2022

@jnwan
Author

jnwan commented Nov 10, 2022

Verified the issue got fixed! Thank you!
