Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet complex type handling #9187

Closed
wants to merge 2 commits into from

Conversation

jaystarshot
Copy link
Contributor

@jaystarshot jaystarshot commented Mar 20, 2024

Fixes #7776

Parquet has notion of optional and repeated layers which is needed in arrow calls like DefLevelsToBitmap.

This info is passed using arrow:LevelInfo. We were incorrectly computing repeatedAncestor by ignoring optional fields which is fixed in this PR.

Parquet has 3 level structure for nested types
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

// List<String> (list non-null, elements nullable)
1. required group my_list (LIST) {
2.   repeated group list {
3.    optional binary element (UTF8);
  }
}

However when we read this and convert to ParquetTypeWithId in current velox parquet reader, we ignore the intermediated layer 2. repeated group list (grandfather logic) in https://github.com/facebookincubator/velox/pull/9187/files#diff-64787e76c1b0ad12b5764770a94acd62054896a762ccead8f083a71a060f2f44R325.

@facebook-github-bot
Copy link
Contributor

Hi @jaystarshot!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@jaystarshot jaystarshot marked this pull request as draft March 20, 2024 20:59
Copy link

netlify bot commented Mar 20, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 231caed
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/662c0be7c27b070008f773dc

@jaystarshot jaystarshot changed the title [Do not review] test arrow1 [WIP] Fix parquet complex types Mar 22, 2024
@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 26, 2024
int16_t repeatedAncestor = maxDefine_;
auto node = this;
do {
if (node->isOptional_) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly based on https://github.com/apache/arrow/blob/main/cpp/src/parquet/level_conversion.h#L125.
the do() is to make sure that for nested types we are going inside and checking optional since the child layers optional (which we ignored) is needed by arrow

@jaystarshot jaystarshot changed the title [WIP] Fix parquet complex types [WIP] Fix parquet complex type handling Mar 26, 2024
@jaystarshot jaystarshot changed the title [WIP] Fix parquet complex type handling Fix parquet complex type handling Mar 26, 2024
@jaystarshot jaystarshot marked this pull request as ready for review March 26, 2024 21:04
@jaystarshot
Copy link
Contributor Author

jaystarshot commented Mar 26, 2024

cc: @Yuhta @yingsu00 @pranjalssh @rui-mo

@@ -309,6 +318,9 @@ std::shared_ptr<const ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
VELOX_CHECK_EQ(children.size(), 1);
const auto& child = children[0];
auto grandChildren = child->getChildren();
// This level will not be repeated since parquet will have a child layer with the repeated info which we are ignoring here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you elaborate more? it's a bit confusing that quote "This level will not be repeated" and next line isRepeated = true

@@ -342,7 +341,7 @@ TEST_F(ParquetTableScanTest, DISABLED_optArrayReqEle) {

// Required array with required elements.
// Core dump is fixed, but the result is incorrect.
Copy link
Collaborator

@hitarth hitarth Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment about result being incorrect now that it is fixed.

Copy link
Collaborator

@qqibrow qqibrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure all related DISABLED_ testes are reenabled. Also, in case we want a UT in ParquetReaderTest.cpp as well is a unit test here https://github.com/facebookincubator/velox/pull/8425/files

@jaystarshot
Copy link
Contributor Author

jaystarshot commented Apr 22, 2024

It does pass locally for me with asan enabled.
I added

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fsanitize=address")

to the makefile, and needed to add libasan.so.8 to the ldconfig

@Yuhta
Copy link
Contributor

Yuhta commented Apr 22, 2024

@jaystarshot Try debug build, it seems the OSS one is only running parquet tests with only release build

@jaystarshot
Copy link
Contributor Author

Couldn't repro locally with the debug build as well
ctest -j ${NUM_THREADS} -VV --output-on-failure -R "parquet_table_scan" command used to run

@Yuhta
Copy link
Contributor

Yuhta commented Apr 22, 2024

Maybe add a VELOX_CHECK_LT(type.column(), fileMetaData_->row_groups[rowGroupIndex].columns.size()) before the crash point:

int64_t ReaderBase::rowGroupUncompressedSize(
    int32_t rowGroupIndex,
    const dwio::common::TypeWithId& type) const {
  if (type.column() != ParquetTypeWithId::kNonLeaf) {
    VELOX_CHECK_LT(type.column(), fileMetaData_->row_groups[rowGroupIndex].columns.size());
    return fileMetaData_->row_groups[rowGroupIndex]
        .columns[type.column()]
        .meta_data.total_uncompressed_size;
  }
  int64_t sum = 0;
  for (auto child : type.getChildren()) {
    sum += rowGroupUncompressedSize(rowGroupIndex, *child);
  }
  return sum;
}

Apparently the type has more columns than the metadata has

@jaystarshot
Copy link
Contributor Author

Was the array test the only failure? If so it could be that there could be some issue in the test file and I can remove this test.

@Yuhta
Copy link
Contributor

Yuhta commented Apr 22, 2024

@jaystarshot It's both array and reqArrayLegacy

@jaystarshot
Copy link
Contributor Author

I am wondering if this could be related to the crash #9223 is fixing.

@jaystarshot
Copy link
Contributor Author

jaystarshot commented Apr 22, 2024

We can either keep those tests disabled in this PR or cherry pick that commit in this same PR and try to merge both

@jaystarshot
Copy link
Contributor Author

@Yuhta Can you please try with this?

@qqibrow
Copy link
Collaborator

qqibrow commented Apr 24, 2024

@jaystarshot I was about to add unit test for #9223 . Let me know whether I should proceed.

@jaystarshot
Copy link
Contributor Author

Lets check if this fixes the crashes observed in meta's environment, if they do then we may not need tests

@Yuhta
Copy link
Contributor

Yuhta commented Apr 25, 2024

Can you rebase to latest to fix the build?

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 37f4700.

Copy link

Conbench analyzed the 1 benchmark run on commit 37f4700b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty collection in array or map lead to incorrect results in parquet reader
7 participants