
ARROW-14615: [C++] Refactor nested field refs and add union support #11641

Closed
lidavidm wants to merge 3 commits

lidavidm
Member

@lidavidm lidavidm commented Nov 8, 2021

No description provided.

@github-actions

github-actions bot commented Nov 8, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm lidavidm requested a review from cpcloud November 9, 2021 17:01
flattened_null_bitmap->mutable_data());
}

auto flattened_data = child_data->Copy();
Member

child_data was already a copy, so is this necessary?

ASSERT_OK_AND_ASSIGN(flattened, sliced->GetFlattenedField(1));
AssertArraysEqual(*ArrayFromJSON(utf8(), R"(["c"])"), *flattened, /*verbose=*/true);
}
}
Member

Also test with an empty union array?

/// Options for struct_field function
class ARROW_EXPORT StructFieldOptions : public FunctionOptions {
public:
explicit StructFieldOptions(std::vector<int> indices);
Member

I wonder whether this should also accept a FieldRef or field resolution should be left to the caller.

Member Author

FieldRef is relative to a schema so we'd want/need a variant of this function that operates on a RecordBatch.

Member

But a plain string name could be useful for a StructArray?

Member Author

We'd still need a type to resolve it to an index, right? Unless you mean storing std::vector<std::string> or std::string directly? (Which might be reasonable.)

Member

Yes, storing it as strings on the options, I meant. Because when actually executing the kernel, the struct array itself can perfectly resolve the name I think?

Member Author

That's fair. Want to file a followup? I think we can support having a FieldRef internally, basically. (Though the interpretation will be a little different - it'll be relative to an array, not a schema.)

Member

Now that I think of it, it's probably better to resolve the field up front (using the schema) than pay the cost for every kernel invocation with the same schema.
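The trade-off discussed in this thread (resolve names once against the type, then hand plain indices to the kernel) can be sketched without Arrow at all. The following is a minimal illustration, not the code from this PR: `ToyField` and `ResolveNames` are hypothetical names, and the toy type tree only stands in for a real nested data type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested data type: a name plus child fields.
struct ToyField {
  std::string name;
  std::vector<ToyField> children;
};

// Resolve a path of field names to child indices once, up front.
// Returns false (leaving *out untouched) if any name on the path is missing.
bool ResolveNames(const ToyField& root, const std::vector<std::string>& names,
                  std::vector<int>* out) {
  assert(out != nullptr);
  std::vector<int> indices;
  const ToyField* current = &root;
  for (const std::string& name : names) {
    int found = -1;
    for (int i = 0; i < static_cast<int>(current->children.size()); ++i) {
      if (current->children[i].name == name) {
        found = i;
        break;
      }
    }
    if (found < 0) return false;
    indices.push_back(found);
    current = &current->children[found];
  }
  *out = indices;
  return true;
}
```

With the indices computed once, every later invocation against the same schema only has to follow integer offsets, which is the cost argument made in the last comment above.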

EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid,
::testing::HasSubstr("out-of-bounds field reference"),
CallFunction("struct_field", {arr}, &invalid2));
EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr("cannot subscript"),
Member

Should it be TypeError in this case?

CallFunction("struct_field", {arr}, &invalid2));
EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr("cannot subscript"),
CallFunction("struct_field", {arr}, &invalid3));
}
Member

What happens with non-nested arr and trivial options? Should it be tested here?
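The two review questions on this test (which error kind to raise, and what happens with a non-nested input and trivial options) can be made concrete with a small model. This is a sketch under stated assumptions, not the kernel from the PR: `ToyType`, `ToyStatus`, and `WalkIndices` are hypothetical, the `TypeError` choice follows the reviewer's suggestion rather than the tested behavior (the test above expects `Invalid` in both cases), and treating an empty index list as "return the input" is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical status codes, loosely modeled on the error kinds discussed above.
enum class ToyCode { OK, Invalid, TypeError };

struct ToyStatus {
  ToyCode code;
  std::string message;
};

// Hypothetical stand-in for a data type; is_nested is true for struct/union-like types.
struct ToyType {
  std::string name;
  bool is_nested;
  std::vector<ToyType> children;
};

// Follow a path of child indices through a type.
// Assumption: an empty path resolves to the root itself, even if it is not nested.
ToyStatus WalkIndices(const ToyType& root, const std::vector<int>& indices,
                      const ToyType** out) {
  assert(out != nullptr);
  const ToyType* current = &root;
  for (int index : indices) {
    if (!current->is_nested) {
      // Subscripting a non-nested type is a type problem, not a value problem.
      return {ToyCode::TypeError, "cannot subscript field of type " + current->name};
    }
    if (index < 0 || index >= static_cast<int>(current->children.size())) {
      return {ToyCode::Invalid, "out-of-bounds field reference"};
    }
    current = &current->children[index];
  }
  *out = current;
  return {ToyCode::OK, ""};
}
```

Separating the two failure modes this way keeps "the index does not exist" distinct from "this type has no children to index", which is the distinction the `TypeError` question is about.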

("Given a series of indices, extract the child array or scalar referenced "
"by the index. For union values, mask the child based on the type codes "
"of the union array. The indices are always the child index and not the "
"type code (for unions) - so the first child is always index 0."),
Member

Mention that the indices are given in StructFieldOptions?
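The masking rule in the docstring above (a union child is selected by child index, then masked by the type codes) is easy to get wrong when type codes differ from child indices. Here is a minimal sketch for the sparse-union case only, assuming every child has the same length as the union; `ToySparseUnion` and `FlattenedValidity` are hypothetical names, not Arrow APIs, and dense unions (which also need offsets) are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a sparse union: every child spans all slots.
struct ToySparseUnion {
  std::vector<std::int8_t> type_codes;            // type code of each child, by child index
  std::vector<std::int8_t> type_ids;              // per-slot code selecting the active child
  std::vector<std::vector<bool>> child_validity;  // per-child, per-slot validity
};

// Validity of child `child_index` after masking: a slot stays valid only if the
// child value is valid AND the union selects this child in that slot.
std::vector<bool> FlattenedValidity(const ToySparseUnion& u, int child_index) {
  assert(child_index >= 0 &&
         child_index < static_cast<int>(u.type_codes.size()));
  // The lookup is by child index; the comparison below uses the child's type code.
  const std::int8_t code = u.type_codes[child_index];
  const std::vector<bool>& child = u.child_validity[child_index];
  assert(child.size() == u.type_ids.size());
  std::vector<bool> out(u.type_ids.size());
  for (std::size_t i = 0; i < u.type_ids.size(); ++i) {
    out[i] = child[i] && u.type_ids[i] == code;
  }
  return out;
}
```

With type codes `{2, 7}`, child index 0 is matched against code 2 and child index 1 against code 7, so the first child is always index 0 regardless of its code. An empty union (no slots) simply yields an empty result, which is the edge case raised earlier in the review.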

Member

@pitrou pitrou left a comment

+1, thank you @lidavidm

@pitrou pitrou closed this in 140b0b2 Nov 10, 2021
@ursabot

ursabot commented Nov 10, 2021

Benchmark runs are scheduled for baseline = a9f2091 and contender = 140b0b2. 140b0b2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.44% ⬆️0.18%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

4 participants