Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct/Map Integration with Arrow #1831

Merged
merged 24 commits into from Jun 9, 2021
Merged

Struct/Map Integration with Arrow #1831

merged 24 commits into from Jun 9, 2021

Conversation

pdet
Copy link
Member

@pdet pdet commented Jun 1, 2021

This PR has the integration of both struct and map types from/to Arrow #1742

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good. Some more requests for tests:

def test_struct_roundtrip(self,duckdb_cursor):
if not can_run:
return
compare_results("SELECT a from (SELECT STRUCT_PACK(a := 42, b := 43) as a) as t")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more tests with complex types:

  • List of structs
  • Lists of structs with lists
  • These but with different types of lists internally (struct(list, list, list), one is a large list, the other is a small list, the other is a fixed size list)
  • Maps embedded in a struct
  • List of maps
  • Map with list as key and/or value
  • Map with struct as key and/or value
  • Struct that is NULL entirely (e.g. case when a%2=0 then null else struct_pack(...))
  • Map that is NULL entirely (same case statement)

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! Some comments:

@@ -838,7 +838,7 @@ void Vector::UTFVerify(idx_t count) {
UTFVerify(FlatVector::INCREMENTAL_SELECTION_VECTOR, count);
}

void Vector::Verify(const SelectionVector &sel, idx_t count) {
void Vector::Verify(const SelectionVector &sel, idx_t count, ValidityMask *validity_parent) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this change? This will only work on nested structures with depth=1. I am fixing this differently in the struct storage branch.

@@ -161,6 +161,11 @@ void VectorOperations::Copy(const Vector &source, Vector &target, const Selectio
auto sdata = FlatVector::GetData<list_entry_t>(source);
vector<sel_t> child_rows;
for (idx_t i = 0; i < copy_count; ++i) {
if (parent_validity) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, everything checking parent_validity can be reverted.

@@ -46,6 +46,15 @@ string ValidityMask::ToString(idx_t count) const {
return result;
}

bool ValidityMask::Equal(ValidityMask &mask, idx_t count) const {
for (idx_t i = 0; i < count; i++) {
if (RowIsValid(i) != mask.RowIsValid(i)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the validity entries directly? That should be significantly faster than checking using RowIsValid, which does a bunch of bitwise ops per value.

Also can use an early out if both masks are not set (AllValid()).

@pdet pdet requested a review from Mytherin June 5, 2021 13:52
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Some more comments:

src/common/types/validity_mask.cpp Outdated Show resolved Hide resolved
src/common/types/validity_mask.cpp Outdated Show resolved Hide resolved
@@ -24,7 +24,7 @@ static void TemplatedCopy(const Vector &source, const SelectionVector &sel, Vect
}

void VectorOperations::Copy(const Vector &source, Vector &target, const SelectionVector &sel_p, idx_t source_count,
idx_t source_offset, idx_t target_offset) {
idx_t source_offset, idx_t target_offset, ValidityMask *parent_validity) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parent_validity mask can go here, right?

if (is_map && child_idx == 0 && !child_mask.AllValid()) {
//! Need to check if the child mask and the map mask are both equal
if (!child_mask.Equal(*parent_mask, size)) {
throw InvalidInputException("Arrow does not accept MAPs with NULL values on keys");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some tests that trigger this error?

D_ASSERT(vector.GetType().id() == LogicalTypeId::LIST);
if (vector.GetVectorType() == VectorType::DICTIONARY_VECTOR) {
auto &child = DictionaryVector::Child(vector);
return FlatVector::Validity(ListVector::GetEntry(child));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems dangerous to return a validity mask of a child of a dictionary vector as-is: without the dictionary you do not know how to interpret the validity mask. Perhaps use Orrify instead?

@Mytherin Mytherin merged commit 72c2212 into duckdb:master Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants