Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38849: [C++][Parquet]: Add support for list view and large list view #38850

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Nov 22, 2023

Rationale for this change

Parquet itself doesn't has a LIST_VIEW type, however, we can read/write a LIST_VIEW inside parquet.

Some notes:

During loading, I think making a arrow::list is ok, and converting a List to ListView is so convenient. So maybe we can first support write
For parquet::arrow::SchemaManifest, maybe it's ok to just store a list as metadata?

What changes are included in this PR?

TODO:

  • For writer, we need support PathInfo. The logic is so similiar to origin list node, however, we need support a new kind of RangeSelector. We don't need to support a new list because the writer here doesn't make full use of the contigous array
  • For schema, we need to handle the parquet/arrow/schema.h well.
  • Testing the schema casting
  • Testing the path util
  • Testing write, and writen data can be read.

Are these changes tested?

Are there any user-facing changes?

Copy link

⚠️ GitHub issue #38849 has been automatically assigned in GitHub to PR creator.

@@ -113,9 +114,50 @@ class MultipathLevelBuilderTest : public testing::Test {
ArrowWriteContext(default_memory_pool(), arrow_properties_.get());
};

TEST_F(MultipathLevelBuilderTest, NonNullableSingleListNonNullableEntries) {
class MultipathLevelListTool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield I've change the original list test to list and list_view test, would you think it's ok?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 30, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Nov 30, 2023

@pitrou @emkornfield I've tried to upload some logic about list-view and large-list-view, but I don't know where should I add test about schema and read/write, would you mind give some example?

# Conflicts:
#	cpp/src/parquet/arrow/path_internal_test.cc
@mapleFU
Copy link
Member Author

mapleFU commented Dec 7, 2023

@pitrou This is first time I'm trying to add a new nested type support here, so maybe need some help for where should I add the testing...

@emkornfield
Copy link
Contributor

Apologies for the delay I will try to take a look this week.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 19, 2024

@emkornfield Do you want revisit this patch?

@mapleFU mapleFU marked this pull request as ready for review March 27, 2024 16:11
@mapleFU mapleFU requested a review from wgtmac as a code owner March 27, 2024 16:11
@ianmcook ianmcook requested a review from pitrou March 27, 2024 16:11
@wgtmac
Copy link
Member

wgtmac commented Mar 28, 2024

Sorry for missing this! I will take a look recently.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests for actually reading and writing Arrow data, are there any?

info_.path.emplace_back(std::move(node));
nullable_in_parent_ = array.list_type()->value_field()->nullable();
// raw_value_offsets() and raw_value_sizes() accounts for any slice offset/size.
if constexpr (std::is_same<::arrow::ListViewArray, T>::value ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use is_list_view_type from arrow/type_traits.h.

@@ -999,6 +999,17 @@ TEST_F(TestConvertArrowSchema, ParquetLists) {
auto arrow_list = ::arrow::list(arrow_element);
arrow_fields.push_back(::arrow::field("my_list", arrow_list, false));
}
// Same as above but using list_view as arrow type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that changes below is the list factory function, this could perhaps be factored out (also with list_cast_fns below), e.g.:

auto list_type_factory = [](const std::shared_ptr<::arrow::Field> field) {
    return ::arrow::list(field->type());
};

...

for (const auto& list_factory : {list_type_factory, list_view_type_factory}) { ... }

}
};

template <typename ListTool>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to template this by type, since all factories have the same signature anyway.

@@ -34,6 +34,7 @@ namespace parquet::arrow {
using ::arrow::default_memory_pool;
using ::arrow::field;
using ::arrow::fixed_size_list;
using ::arrow::list_view;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this? You seem to always write ::arrow::list_view explicitly below?

@@ -870,6 +872,20 @@ std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
return ::arrow::fixed_size_list(std::move(fields[0]), list_size);
};
}
if (origin_type.id() == ::arrow::Type::LIST_VIEW) {
// Reading LIST_VIEW as LIST as a workaround.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the workaround?

// Reading LARGE_LIST_VIEW as LARGE_LIST as a workaround.
return [](FieldVector fields) {
DCHECK_EQ(fields.size(), 1);
return ::arrow::large_list_view(std::move(fields[0]));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts the above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet]: Add support for list view and large list view
4 participants