-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38849: [C++][Parquet]: Add support for list view and large list view #38850
base: main
Are you sure you want to change the base?
Conversation
|
a00c40e
to
05f4c39
Compare
@@ -113,9 +114,50 @@ class MultipathLevelBuilderTest : public testing::Test { | |||
ArrowWriteContext(default_memory_pool(), arrow_properties_.get()); | |||
}; | |||
|
|||
TEST_F(MultipathLevelBuilderTest, NonNullableSingleListNonNullableEntries) { | |||
class MultipathLevelListTool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield I've change the original list
test to list
and list_view
test, would you think it's ok?
@pitrou @emkornfield I've tried to upload some logic about list-view and large-list-view, but I don't know where should I add test about schema and read/write, would you mind give some example? |
# Conflicts: # cpp/src/parquet/arrow/path_internal_test.cc
@pitrou This is first time I'm trying to add a new nested type support here, so maybe need some help for where should I add the testing... |
Apologies for the delay I will try to take a look this week. |
@emkornfield Do you want revisit this patch? |
Sorry for missing this! I will take a look recently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any tests for actually reading and writing Arrow data, are there any?
info_.path.emplace_back(std::move(node)); | ||
nullable_in_parent_ = array.list_type()->value_field()->nullable(); | ||
// raw_value_offsets() and raw_value_sizes() accounts for any slice offset/size. | ||
if constexpr (std::is_same<::arrow::ListViewArray, T>::value || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use is_list_view_type
from arrow/type_traits.h
.
@@ -999,6 +999,17 @@ TEST_F(TestConvertArrowSchema, ParquetLists) { | |||
auto arrow_list = ::arrow::list(arrow_element); | |||
arrow_fields.push_back(::arrow::field("my_list", arrow_list, false)); | |||
} | |||
// Same as above but using list_view as arrow type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing that changes below is the list factory function, this could perhaps be factored out (also with list_cast_fns
below), e.g.:
auto list_type_factory = [](const std::shared_ptr<::arrow::Field> field) {
return ::arrow::list(field->type());
};
...
for (const auto& list_factory : {list_type_factory, list_view_type_factory}) { ... }
} | ||
}; | ||
|
||
template <typename ListTool> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to template this by type, since all factories have the same signature anyway.
@@ -34,6 +34,7 @@ namespace parquet::arrow { | |||
using ::arrow::default_memory_pool; | |||
using ::arrow::field; | |||
using ::arrow::fixed_size_list; | |||
using ::arrow::list_view; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this? You seem to always write ::arrow::list_view
explicitly below?
@@ -870,6 +872,20 @@ std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory( | |||
return ::arrow::fixed_size_list(std::move(fields[0]), list_size); | |||
}; | |||
} | |||
if (origin_type.id() == ::arrow::Type::LIST_VIEW) { | |||
// Reading LIST_VIEW as LIST as a workaround. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the workaround?
// Reading LARGE_LIST_VIEW as LARGE_LIST as a workaround. | ||
return [](FieldVector fields) { | ||
DCHECK_EQ(fields.size(), 1); | ||
return ::arrow::large_list_view(std::move(fields[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This contradicts the above?
Rationale for this change
Parquet itself doesn't has a LIST_VIEW type, however, we can read/write a LIST_VIEW inside parquet.
Some notes:
During loading, I think making a arrow::list is ok, and converting a List to ListView is so convenient. So maybe we can first support write
For parquet::arrow::SchemaManifest, maybe it's ok to just store a list as metadata?
What changes are included in this PR?
TODO:
Are these changes tested?
Are there any user-facing changes?