New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Struct/Map Integration with Arrow #1831
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks good. Some more requests for tests:
def test_struct_roundtrip(self,duckdb_cursor): | ||
if not can_run: | ||
return | ||
compare_results("SELECT a from (SELECT STRUCT_PACK(a := 42, b := 43) as a) as t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add more tests with complex types:
- List of structs
- Lists of structs with lists
- These but with different types of lists internally (
struct(list, list, list)
, one is a large list, the other is a small list, the other is a fixed size list) - Maps embedded in a struct
- List of maps
- Map with list as key and/or value
- Map with struct as key and/or value
- Struct that is NULL entirely (e.g.
case when a%2=0 then null else struct_pack(...)
) - Map that is NULL entirely (same case statement)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes! Some comments:
src/common/types/vector.cpp
Outdated
@@ -838,7 +838,7 @@ void Vector::UTFVerify(idx_t count) { | |||
UTFVerify(FlatVector::INCREMENTAL_SELECTION_VECTOR, count); | |||
} | |||
|
|||
void Vector::Verify(const SelectionVector &sel, idx_t count) { | |||
void Vector::Verify(const SelectionVector &sel, idx_t count, ValidityMask *validity_parent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert this change? This will only work on nested structures with depth=1. I am fixing this differently in the struct storage branch.
@@ -161,6 +161,11 @@ void VectorOperations::Copy(const Vector &source, Vector &target, const Selectio | |||
auto sdata = FlatVector::GetData<list_entry_t>(source); | |||
vector<sel_t> child_rows; | |||
for (idx_t i = 0; i < copy_count; ++i) { | |||
if (parent_validity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, everything checking parent_validity can be reverted.
src/common/types/validity_mask.cpp
Outdated
@@ -46,6 +46,15 @@ string ValidityMask::ToString(idx_t count) const { | |||
return result; | |||
} | |||
|
|||
bool ValidityMask::Equal(ValidityMask &mask, idx_t count) const { | |||
for (idx_t i = 0; i < count; i++) { | |||
if (RowIsValid(i) != mask.RowIsValid(i)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the validity entries directly? That should be significantly faster than checking using RowIsValid
, which does a bunch of bitwise ops per value.
Also can use an early out if both masks are not set (AllValid()
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! Some more comments:
@@ -24,7 +24,7 @@ static void TemplatedCopy(const Vector &source, const SelectionVector &sel, Vect | |||
} | |||
|
|||
void VectorOperations::Copy(const Vector &source, Vector &target, const SelectionVector &sel_p, idx_t source_count, | |||
idx_t source_offset, idx_t target_offset) { | |||
idx_t source_offset, idx_t target_offset, ValidityMask *parent_validity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parent_validity mask can go here, right?
src/common/types/data_chunk.cpp
Outdated
if (is_map && child_idx == 0 && !child_mask.AllValid()) { | ||
//! Need to check if the child mask and the map mask are both equal | ||
if (!child_mask.Equal(*parent_mask, size)) { | ||
throw InvalidInputException("Arrow does not accept MAPs with NULL values on keys"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some tests that trigger this error?
src/common/types/vector.cpp
Outdated
D_ASSERT(vector.GetType().id() == LogicalTypeId::LIST); | ||
if (vector.GetVectorType() == VectorType::DICTIONARY_VECTOR) { | ||
auto &child = DictionaryVector::Child(vector); | ||
return FlatVector::Validity(ListVector::GetEntry(child)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems dangerous to return a validity mask of a child of a dictionary vector as-is: without the dictionary you do not know how to interpret the validity mask. Perhaps use Orrify
instead?
This PR has the integration of both struct and map types from/to Arrow #1742