GH-33899: [C++] Add NamedTapRel relation as a Substrait extension #33909

rtpsw · 2023-01-28T13:10:41Z

See #33899. This PR adds NamedTapRel and a simple test case with a no-op tap (i.e., just passing-through).

Closes: [C++] Add NamedTapRel relation as a Substrait extension #33899

github-actions · 2023-01-28T13:11:04Z

Closes: [C++] Add NamedTapRel relation as a Substrait extension #33899

westonpace

Took a quick look. The Substrait side of this makes sense but I'm a little unclear on what the plan is for the C++ side of things.

cpp/src/arrow/engine/substrait/options.cc

cpp/proto/substrait/extension_rels.proto

cpp/src/arrow/engine/substrait/options.cc

westonpace · 2023-01-30T17:06:06Z

cpp/src/arrow/engine/substrait/options.h

+struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions {
+  NamedTapNodeOptions(const std::string& name, std::shared_ptr<Schema> schema)
+      : name(name), schema(std::move(schema)) {}
+
+  std::string name;
+  std::shared_ptr<Schema> schema;
+};


I'm not sure what the plan is here. I thought originally the goal was to mimic the named table. In that case I'd expect to see something like NamedTableProvider (e.g. NamedTapProvider) that translates names into actual nodes (e.g. a TeeNode).

In that case I wouldn't expect there to be any "named tap" equivalent in Acero. I'm not sure what NamedTapNode and AddPassFactory are providing in this PR?

This is a design point to discuss. The current PR implementation takes the kind field (i.e., named_tap_rel.kind()) as the factory name in the declaration. There is no need for a new NamedTapProvider concept when the factory registry is the provider of taps. OTOH, I agree with your mapping idea: adding a mapping from the kind field to the factory name would be an improvement.

So, the plan here is:

NamedTapRel: the kind field selects the tap's kind, the name field configures the tap's action, and the columns field configures the tap's output schema.

NamedTapNodeOptions: the name field configures the tap's action, as in NamedTapRel, and the schema field is the output schema configured for the tap. The selection of the tap kind is not done via options, so it's not included in them.
I could add a description along these lines in docstrings.

The test code shows a vanilla example of this. The AddPassFactory tap is registered in the factory registry. Its configuration is not the focus of the test. In a (future) test case of a specific tap, we would cast the options argument of the tap's Make function to NamedTapNodeOptions and use the fields there to configure it.
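As a rough illustration of the kind/name/columns split described above, here is a minimal sketch. It uses hypothetical stand-in types (`TapOptions`, `TapNode`, `TapRegistry`, `MakePassThrough`); none of these are Arrow or Acero APIs. The kind selects a factory from a registry, while the name and columns are handed to that factory as options.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not Arrow/Acero types.
struct TapOptions {
  std::string name;                  // configures the tap's action
  std::vector<std::string> columns;  // stands in for the output schema
};

struct TapNode {
  std::string description;
};

using TapFactory = std::function<TapNode(const TapOptions&)>;

// A toy factory registry: the tap kind is the lookup key.
class TapRegistry {
 public:
  void AddFactory(const std::string& kind, TapFactory factory) {
    factories_[kind] = std::move(factory);
  }

  TapNode Make(const std::string& kind, const TapOptions& options) const {
    auto it = factories_.find(kind);
    if (it == factories_.end()) {
      throw std::invalid_argument("no tap factory registered for kind: " + kind);
    }
    return it->second(options);
  }

 private:
  std::map<std::string, TapFactory> factories_;
};

// A no-op tap that only records how it was configured.
TapNode MakePassThrough(const TapOptions& options) {
  return TapNode{"pass(" + options.name + ", " +
                 std::to_string(options.columns.size()) + " columns)"};
}

void RunTapRegistryExample() {
  TapRegistry registry;
  registry.AddFactory("pass", MakePassThrough);
  TapNode node = registry.Make("pass", TapOptions{"does_not_matter", {"a", "b"}});
  assert(node.description == "pass(does_not_matter, 2 columns)");
}
```

Calling `RunTapRegistryExample()` from a `main` exercises the lookup. Note that the kind never appears in the options, which mirrors the point that tap selection is not done via options.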

If you return exec node options (similar to named tap provider) then you could bypass the need to encode properties into the name (you wouldn't even really need kind):

    # in python
    def tap_provider(name):
        if name == 'one':
            return TeeNodeOptions('/tmp/dataset_one')
        elif name == 'two':
            return TeeNodeOptions('/tmp/dataset_two')
        else:
            raise Error(...)

or you could move the name encoding / decoding into python

    # in python
    def tap_provider(name):
        path = get_path_from_name(name)
        return TeeNodeOptions(path)

or you could still use the kind mapping

    # in python
    def tap_provider(name):
        kind = get_kind_from_name(name)
        if kind == 'tee':
            return TeeNodeOptions(path)
        ...

However, this feature is still pretty experimental so I don't mind sticking with kind<->factory name mapping for now if that is what you would prefer.

In my mind we should do the following:

(1) Add "NamedTapRel" as a purely abstract relation in Acero. The Acero substrait consumer code itself doesn't care or know how to turn a NamedTapRel message into a Declaration.

(2) Add a C++ extension API that allows downstream users / implementers of the NamedTapRel to tell Acero "hey, if you see a NamedTapRel message, call into the custom substrait extension function that I provide here and I will return a declaration to you" (basically this: #33850)

This way I believe we don't need most of the stuff here (including changes to pass conv_opts, named_tap_mapper, etc)

@westonpace @rtpsw WDYT?

I agree a specific structure can be useful; however, in this case I don't feel it is much more useful than a general one. For example, the specific structure for the NamedTapNode make function creates a NamedTapNode option and does not allow the user to return a declaration with a custom node option (e.g., in our case, a WriteSmoothNodeOption), so we would need to handle that in our code after calling the NamedTapNode make function here (or bypass the NamedTapNode function entirely). This can be done, but I feel it is unnecessary indirection.

I agree with @rtpsw that going too generic (protobuf.Any in and Declaration out) is not a good idea. This is the exact API for the ExtensionProvider and so we already have this. You could use this for WriteSmoothNodeOption and it would end up looking just like the extension that is in place for AsofJoinNode. If that works for you then there is nothing we need to do :)

So if we want to go less generic then we should look at what Acero provides at each step to simplify things:

Generic Node

The next step down in generality would be to remove protobuf from the picture using Substrait/Arrow literals:

    message GenericOptions {
      message Property {
        string key = 1;
        Literal value = 2;
      }
      repeated Property properties = 1;
    }
    message GenericNode {
      string kind = 1;
      GenericOptions options = 2;
    }

The API in Arrow would be:

    using Options = unordered_map<string, Scalar>;
    void RegisterGenericExtensionHandler(string kind, function<Declaration(Options)> handler);

In other words, given properties as a map of arrow scalars, create a declaration

Pros:

No need for extension author to write proto file

Can be used for any extension type

Cons:

Need to create an entire ExecNode for your custom behavior
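To make the "Generic Node" idea concrete, the following is a minimal sketch of a handler registry keyed by kind. `Scalar` and `Declaration` here are hypothetical stand-ins (a `std::variant` and a plain struct), not `arrow::Scalar` or `compute::Declaration`, and `ConsumeGenericNode` is an assumed name for the consumer-side lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>

// Hypothetical stand-ins for arrow::Scalar and compute::Declaration.
using Scalar = std::variant<std::int64_t, double, std::string>;
struct Declaration {
  std::string factory_name;
  std::string summary;
};

using Options = std::unordered_map<std::string, Scalar>;
using Handler = std::function<Declaration(const Options&)>;

// Process-wide registry mapping a node kind to its handler.
std::unordered_map<std::string, Handler>& HandlerRegistry() {
  static std::unordered_map<std::string, Handler> registry;
  return registry;
}

void RegisterGenericExtensionHandler(std::string kind, Handler handler) {
  HandlerRegistry()[std::move(kind)] = std::move(handler);
}

// What a consumer might do on seeing a GenericNode message (assumed name).
Declaration ConsumeGenericNode(const std::string& kind, const Options& options) {
  auto it = HandlerRegistry().find(kind);
  if (it == HandlerRegistry().end()) {
    throw std::invalid_argument("no handler registered for kind: " + kind);
  }
  return it->second(options);
}

void RunGenericNodeExample() {
  RegisterGenericExtensionHandler("tee", [](const Options& options) {
    const auto& path = std::get<std::string>(options.at("path"));
    return Declaration{"tee", "write to " + path};
  });
  Declaration decl = ConsumeGenericNode(
      "tee", {{"path", Scalar{std::string("/tmp/dataset_one")}}});
  assert(decl.factory_name == "tee");
  assert(decl.summary == "write to /tmp/dataset_one");
}
```

The handler still has to name a node factory in the returned declaration, which is the "need to create an entire ExecNode" cost listed under Cons.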

Generic Mapper

An even more focused approach could be to remove the need to create an exec node at all. This would be focused on the 1 input in / 1 input out / no accumulation case (e.g. what MapNode handles). From what I know of your write needs, this would fit.

The Substrait would be pretty much identical to the above.

    message MapNode {
      string kind = 1;
      GenericOptions options = 2;
    }

However, the Arrow API would be:

    using Options = unordered_map<string, Scalar>;
    using Mapper = function<ExecBatch(ExecBatch)>;
    void RegisterGenericMapHandler(string kind, function<Mapper(Options)> handler);

Pros:

No need to write proto file

No need to create an exec node

Cons:

Only handles 1-in relations that don't accumulate
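A minimal sketch of the "Generic Mapper" shape, under simplifying assumptions: a batch is modeled as a `std::vector<double>` and option values as plain doubles, rather than `ExecBatch` and Arrow scalars. `MakeMapper` is an assumed name for the lookup step. The handler returns a batch-to-batch function, so no exec node has to be written.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a batch is a vector of doubles, options are numeric.
using ExecBatchSketch = std::vector<double>;
using MapOptions = std::unordered_map<std::string, double>;
using Mapper = std::function<ExecBatchSketch(ExecBatchSketch)>;
using MapHandler = std::function<Mapper(const MapOptions&)>;

std::unordered_map<std::string, MapHandler>& MapHandlerRegistry() {
  static std::unordered_map<std::string, MapHandler> registry;
  return registry;
}

void RegisterGenericMapHandler(std::string kind, MapHandler handler) {
  MapHandlerRegistry()[std::move(kind)] = std::move(handler);
}

// Looks up the handler for `kind` and asks it to build a mapper.
Mapper MakeMapper(const std::string& kind, const MapOptions& options) {
  auto it = MapHandlerRegistry().find(kind);
  if (it == MapHandlerRegistry().end()) {
    throw std::invalid_argument("no map handler registered for kind: " + kind);
  }
  return it->second(options);
}

void RunGenericMapperExample() {
  RegisterGenericMapHandler("scale", [](const MapOptions& options) -> Mapper {
    const double factor = options.at("factor");
    // One batch in, one batch out, no state carried between batches.
    return [factor](ExecBatchSketch batch) {
      for (double& value : batch) value *= factor;
      return batch;
    };
  });
  Mapper mapper = MakeMapper("scale", {{"factor", 2.0}});
  assert(mapper({1.0, 2.0, 3.0}) == (ExecBatchSketch{2.0, 4.0, 6.0}));
}
```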

Also, sorry, I was starting a bit with a blank slate above. I think, in this PR, the current idea is that instead of using GenericOptions we use string name and encode the options into the name. This is probably less flexible but in many cases users might be more familiar with encoding options into strings (ala URL encoding) instead of dealing with Arrow scalars. I'm fine with either approach.

So sticking with name I think the two approaches are:

Generic Node

void RegisterGenericExtensionHandler(string kind, function<Declaration(string)> handler);

Generic Mapper

    using Mapper = function<ExecBatch(ExecBatch)>;
    void RegisterGenericMapHandler(string kind, function<Mapper(string)> handler);
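When options are encoded into the name string, each handler has to decode them. The thread does not fix an encoding, so the sketch below assumes a simple `key=value&key=value` layout without percent-decoding; `ParseEncodedName` is a hypothetical helper, not part of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Splits "k1=v1&k2=v2" into a map. Assumed format; no percent-decoding.
// A key without '=' maps to an empty value; empty segments are skipped.
std::map<std::string, std::string> ParseEncodedName(const std::string& name) {
  std::map<std::string, std::string> options;
  std::size_t start = 0;
  while (start <= name.size()) {
    std::size_t end = name.find('&', start);
    if (end == std::string::npos) end = name.size();
    const std::string segment = name.substr(start, end - start);
    if (!segment.empty()) {
      const std::size_t eq = segment.find('=');
      if (eq == std::string::npos) {
        options[segment] = "";
      } else {
        options[segment.substr(0, eq)] = segment.substr(eq + 1);
      }
    }
    start = end + 1;
  }
  return options;
}

void RunEncodedNameExample() {
  auto options = ParseEncodedName("path=/tmp/out&format=parquet");
  assert(options.size() == 2);
  assert(options.at("path") == "/tmp/out");
  assert(options.at("format") == "parquet");
}
```

A handler registered with the string-based signature could call such a helper first and then build its declaration or mapper from the decoded values.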

Somehow missed this message - I am leaning towards "Generic Node" since it allows more flexibility to define a node that is not just "map-like" and also we don't need to write a protobuf file.

I have a question - if we go the "Generic Node" route it looks like we can register custom handlers via either RegisterGenericExtensionHandler or set_default_extension_provider (7423f03#diff-8c5c127e8db8e52b40519262ee2b7e0179d4a39e5bcfe8b2165173b2b352d6a9R139), so technically we don't need set_default_extension_provider if we have RegisterGenericExtensionHandler?

rtpsw · 2023-01-31T09:54:29Z

@westonpace, in the recent commit, when coding up kDefaultNamedTapKindMapper, I reused the code structure for kDefaultNamedTableProvider. However, it looks like it is not properly initialized - I observed the error message C++ exception with description "bad_function_call" thrown in the test body when calling its operator(). Is this intended? Or shall I fix the initialization of both?

rtpsw · 2023-02-01T05:31:32Z

Note that the factory registry does not support scoping, which could be useful for cleaning up after and isolating testing (here, due to AddPassFactory) and in other use cases, like with other scoped registries. @westonpace, do you agree? open an issue?

westonpace · 2023-02-01T23:56:57Z

@westonpace, in the recent commit, when coding up kDefaultNamedTapKindMapper, I reused the code structure for kDefaultNamedTableProvider. However, it looks like it is not properly initialized - I observed the error message C++ exception with description "bad_function_call" thrown in the test body when calling its operator(). Is this intended? Or shall I fix the initialization of both?

What would you initialize it to? There is no default named table provider. I'm pretty sure we are using kDefaultNamedTapKindMapper as a sentinel value only, since std::function evaluates to false if default-initialized:

        if (!conversion_options.named_table_provider) {
          return Status::Invalid(
              "plan contained a named table but a NamedTableProvider has not been "
              "configured");
        }
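The behavior behind both the sentinel check and the observed "bad_function_call" can be reproduced with the standard library alone. In this sketch `ProviderSketch` is a hypothetical, simplified provider type (the real NamedTableProvider has a different signature).

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, simplified provider type for illustration.
using ProviderSketch = std::function<std::string(const std::string&)>;

// Returns true if invoking the provider throws std::bad_function_call.
bool ThrowsBadFunctionCall(const ProviderSketch& provider) {
  try {
    provider("some_table");
  } catch (const std::bad_function_call&) {
    return true;
  }
  return false;
}

void RunSentinelExample() {
  ProviderSketch provider;  // default-initialized: holds no callable
  assert(!provider);        // converts to false, so it can act as a sentinel
  assert(ThrowsBadFunctionCall(provider));  // calling it anyway throws

  provider = [](const std::string& name) { return "table:" + name; };
  assert(static_cast<bool>(provider));  // now holds a callable
  assert(!ThrowsBadFunctionCall(provider));
  assert(provider("t") == "table:t");
}
```

So an empty `std::function` is safe as a sentinel as long as every call site checks it before invoking it.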

westonpace · 2023-02-01T23:59:32Z

Note that the factory registry does not support scoping, which could be useful for cleaning up after and isolating testing (here, due to AddPassFactory) and in other use cases, like with other scoped registries. @westonpace, do you agree? open an issue?

I'm not opposed to this but I don't think it is really necessary. The tests all rely on the fact that the initialization methods are set up as "call-once", so calling them repeatedly won't be a problem.

When dealing with the function registry it seemed important to support nested registries because you might have query-specific functions (e.g. if a UDF is embedded this could be one way to handle it). However, the odds of having query-specific nodes seem pretty slim to me.

westonpace

Looking at this further I do worry that kind will be too restrictive. For example, if you are writing to a temporary output directory and the name specifies the path then how do you configure the directory to write to?

Perhaps that might be what you intended with the nesting of node factories? E.g. you would register a node factory that had the output directory hard-coded as a part of the node factory?

cpp/proto/substrait/extension_rels.proto

cpp/src/arrow/engine/substrait/options.cc

cpp/src/arrow/engine/substrait/options.h

cpp/src/arrow/engine/substrait/serde_test.cc

rtpsw · 2023-02-02T15:09:03Z

What would you initialize it to?

I would have the default implementation return the invalid status, like the one you pointed to.
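One way to picture this alternative is a default that is a real callable returning an error, instead of an empty function. The sketch below uses hypothetical stand-ins (`StatusSketch`, `TapProviderSketch`, `ConversionOptionsSketch`) rather than `arrow::Status` or the actual conversion options.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for arrow::Status.
struct StatusSketch {
  bool ok;
  std::string message;
  static StatusSketch OK() { return {true, ""}; }
  static StatusSketch Invalid(std::string msg) { return {false, std::move(msg)}; }
};

using TapProviderSketch = std::function<StatusSketch(const std::string& name)>;

// Default provider: always callable, always reports a configuration error.
StatusSketch UnconfiguredTapProvider(const std::string& name) {
  return StatusSketch::Invalid("plan contained named tap '" + name +
                               "' but no tap provider has been configured");
}

// Hypothetical options struct holding the provider.
struct ConversionOptionsSketch {
  TapProviderSketch named_tap_provider = UnconfiguredTapProvider;
};

void RunDefaultProviderExample() {
  ConversionOptionsSketch options;
  StatusSketch status = options.named_tap_provider("my_tap");
  assert(!status.ok);  // an error value, not a bad_function_call exception
  assert(status.message.find("my_tap") != std::string::npos);

  options.named_tap_provider = [](const std::string&) { return StatusSketch::OK(); };
  assert(options.named_tap_provider("my_tap").ok);
}
```

The trade-off is that callers can no longer test the provider for emptiness; they learn it is unconfigured only from the returned status.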

rtpsw · 2023-02-02T15:29:29Z

I'm not opposed to this but I don't think it is really necessary.

The main purpose I had in mind for this is test isolation. I agree it's not strictly necessary.

rtpsw · 2023-02-02T15:40:53Z

Looking at this further I do worry that kind will be too restrictive. For example, if you are writing to a temporary output directory and the name specifies the path then how do you configure the directory to write to?

Something like kind = "write_to_temp" and name = "/path/to/tmp.XXXXXX". The XXXXXX part has the same meaning as in mktemp(1).

Perhaps that might be what you intended with the nesting of node factories? E.g. you would register a node factory that had the output directory hard-coded as a part of the node factory?

One could register some node factory to implement a very specific custom behavior, and it's good that this option is available to users, but the above example is simple enough that it won't be needed for it. I believe this would be the case with many examples. If we settle on this kind+name structure, we'd want to encourage its application, i.e., the recommendation would be to use kind to specify the functionality and name to specify its configuration whenever possible.
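For the write_to_temp example above, the XXXXXX suffix could be expanded with the POSIX `mkdtemp` call. This is a sketch under the assumption that the tap wants a directory and runs on a POSIX system; `MakeTempDirFromName` is a hypothetical helper, not something in the PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

#include <stdlib.h>  // mkdtemp (POSIX, not standard C++)
#include <unistd.h>  // rmdir (POSIX)

// Expands a mktemp-style template such as "/tmp/tap.XXXXXX" into a freshly
// created, uniquely named directory and returns its path.
std::string MakeTempDirFromName(const std::string& name_template) {
  std::vector<char> buffer(name_template.begin(), name_template.end());
  buffer.push_back('\0');  // mkdtemp rewrites the buffer in place
  if (mkdtemp(buffer.data()) == nullptr) {
    throw std::runtime_error("mkdtemp failed for template: " + name_template);
  }
  return std::string(buffer.data());
}

void RunTempDirExample() {
  const std::string name = "/tmp/tap.XXXXXX";
  const std::string dir = MakeTempDirFromName(name);
  assert(dir.size() == name.size());            // only the suffix is replaced
  assert(dir.compare(0, 9, "/tmp/tap.") == 0);  // the prefix is preserved
  rmdir(dir.c_str());                           // clean up the empty directory
}
```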

cpp/proto/substrait/extension_rels.proto

cpp/src/arrow/engine/substrait/options.cc

cpp/src/arrow/type.h

cpp/src/arrow/engine/substrait/options.h

icexelloss · 2023-02-06T23:47:32Z

@rtpsw With #34042 merged I think we can simplify this PR? I.e. we don't really need to use the MakeNamedTapRel in this PR since it will be implemented in the custom substrait consumer anyway?

rtpsw · 2023-02-07T13:48:33Z

My understanding is that even with the most generic solution Weston described, we would need some message defined since the current design has the extension provider check for the message using google::protobuf::Any::Is<...>(). Is there a different place, external to Arrow rather than in its extension_rels.proto, where we could define the message?

rtpsw · 2023-02-07T20:11:04Z

@westonpace, are you leaning toward accepting the current version of the PR, or something close to it? @icexelloss would be OK with reducing the PR to just adding the message, in case you'd prefer this.

icexelloss · 2023-02-07T21:41:33Z

IMO we should reduce this PR to just the extension protobuf/message. The main reason is that most of the other changes in this PR will not be used by us anyway, because we will be using our custom extension provider to handle the NamedTapMessage (since we will be creating a WriteSmoothNode from that NamedTapMessage), so it will not hit many of the code changes in this PR. Introducing a bunch of code that will not be used therefore doesn't seem necessary.

westonpace · 2023-02-08T15:07:55Z

If we changed NamedTapKindMapper to return a Declaration I would be happy to proceed with something like this.

icexelloss · 2023-02-08T15:36:02Z

@westonpace would u be ok with having just the proto change in
https://github.com/apache/arrow/pull/33909/files#diff-878a08387dc84ac471ccb8f129a6e032154fc82d457f8cc59bfcbe012920339b

I think since we can register a custom substrait extension consumer, we won't really need to call into the implementation for "NamedTapRel" in the default substrait consumer.

If a use case comes up in the future that justifies supporting NamedTapRel in the default consumer, I think we can add that later (but currently, our use case doesn't need it). WDYT?

icexelloss · 2023-02-08T15:40:03Z

One thing I am thinking is for testing maybe we can do the following:

construct a plan that contains a NamedTapRel
register a custom substrait consumer that can handle NamedTapRel (returns some declaration, but not necessarily need to execute it)
Verify the Declaration

This way the test serves both as an example of using NamedTapRel and as a mimic of how we will be using this internally.

westonpace · 2023-02-08T16:18:49Z

I'd rather not just add proto without any support for processing that proto. A mapper that returns a declaration seems simple enough (and this PR is 80% of the way there) and this makes it clear what the intent of the proto message is even if you don't end up using it.

icexelloss · 2023-02-08T16:29:31Z

A mapper that returns a declaration seems simple enough (and this PR is 80% of the way there) and this makes it clear what the intent of the proto message is even if you don't end up using it.

Fair enough. Let's do that @rtpsw

icexelloss · 2023-02-08T22:05:00Z

@rtpsw @westonpace

I realize there might be some confusion so I want to clarify here.

There seem to be two (or three) proposed solutions:
(1) The one that is implemented in the PR now
(2) What Weston wrote 5 days ago here (#33909 (comment)) (the Generic Node / Generic Mapper approaches; sorry, I somehow missed this message and just read it today)

It's not clear to me if we have agreement to go with one vs the other? Both seem workable to me. (2) does seem a bit more generic / future proof and (1) seems pretty much done.

Li

westonpace · 2023-02-09T03:08:08Z

If 2 is acceptable then I would ask that we please move to that. I think the only change is to have the mapper return a declaration instead of a string. If we feel there is still some ambiguity I'd be happy to make a PR / diff real quick to show what I am thinking.

rtpsw · 2023-02-09T06:54:52Z

The recent commit has NamedTapKindMapper returning a Declaration. @westonpace, let me know if anything is missing.

icexelloss · 2023-02-09T15:39:43Z

2 is acceptable to me. @westonpace are u asking for "Generic Node" or "Generic Mapper"?

westonpace

Sorry, I hadn't noticed that commit. I agree this looks close. I have a few more questions / minor suggestions

westonpace · 2023-02-09T18:19:26Z

cpp/src/arrow/engine/substrait/options.h

+using NamedTapKindMapper = std::function<Result<compute::Declaration>(
+    const std::string&, std::vector<compute::Declaration::Input>,
+    std::shared_ptr<compute::ExecNodeOptions>)>;


This looks great. I think we can do one more small tweak.

Suggested change

-using NamedTapKindMapper = std::function<Result<compute::Declaration>(
-    const std::string&, std::vector<compute::Declaration::Input>,
-    std::shared_ptr<compute::ExecNodeOptions>)>;
+using NamedTapKindMapper = std::function<Result<compute::Declaration>(
+    const std::string&, std::vector<compute::Declaration::Input>,
+    std::shared_ptr<Schema>, std::string name)>;

westonpace · 2023-02-09T18:21:51Z

cpp/src/arrow/engine/substrait/options.h

                                       const ExtensionDetails& ext_details,
                                       const ExtensionSet& ext_set) = 0;
 };

 ARROW_ENGINE_EXPORT std::shared_ptr<ExtensionProvider> default_extension_provider();

+struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions {


We can pass the schema and name directly instead of using this class. If we still want to wrap the two of them in some kind of object (to reduce the # of args passed into the mapper?) then we should include kind, do not extend ExecNodeOptions, and call it something like NamedTap or NamedTapOptions.

In other words, we shouldn't try to form the node options from the protobuf. That will be the mapper's job. We just need to pass whatever was in the protobuf along to the mapper and let the mapper figure out what node options are appropriate.
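A minimal sketch of that division of labor: the consumer forwards kind, inputs, schema, and name untouched, and the provider alone decides which node options to build. All types here (`SchemaSketch`, `NodeOptionsSketch`, `DeclarationSketch`) and the `TeeTapProvider` function are hypothetical stand-ins, and the sketch throws instead of returning a `Result<compute::Declaration>`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for arrow::Schema, node options, and a declaration.
struct SchemaSketch {
  std::vector<std::string> field_names;
};
struct NodeOptionsSketch {
  std::string path;
  std::shared_ptr<SchemaSketch> schema;
};
struct DeclarationSketch {
  std::string factory_name;
  std::vector<std::string> inputs;  // names of upstream declarations
  NodeOptionsSketch options;
};

// Mirrors the suggested parameter list: kind, inputs, schema, name.
using NamedTapProviderSketch = std::function<DeclarationSketch(
    const std::string& kind, std::vector<std::string> inputs,
    std::shared_ptr<SchemaSketch> schema, const std::string& name)>;

// The provider, not the consumer, turns the raw fields into node options.
DeclarationSketch TeeTapProvider(const std::string& kind,
                                 std::vector<std::string> inputs,
                                 std::shared_ptr<SchemaSketch> schema,
                                 const std::string& name) {
  if (kind != "tee") {
    throw std::invalid_argument("unsupported tap kind: " + kind);
  }
  return DeclarationSketch{"tee", std::move(inputs),
                           NodeOptionsSketch{"/tmp/" + name, std::move(schema)}};
}

void RunNamedTapProviderExample() {
  NamedTapProviderSketch provider = TeeTapProvider;
  auto schema = std::make_shared<SchemaSketch>(SchemaSketch{{"a", "b"}});
  DeclarationSketch decl = provider("tee", {"source"}, schema, "dataset_one");
  assert(decl.factory_name == "tee");
  assert(decl.inputs.size() == 1);
  assert(decl.options.path == "/tmp/dataset_one");
  assert(decl.options.schema == schema);
}
```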

westonpace · 2023-02-09T18:22:29Z

cpp/src/arrow/engine/substrait/options.h

+  /// \brief A custom strategy to be used for mapping a tap kind to a function name
+  ///
+  /// The default mapper returns a declaration whose factory name is equal to the tap kind
+  NamedTapKindMapper named_tap_mapper = kDefaultNamedTapKindMapper;


How do you feel about renaming this to NamedTapProvider for consistency?

westonpace · 2023-02-09T18:25:21Z

cpp/src/arrow/engine/substrait/options.h

+using NamedTapKindMapper = std::function<Result<compute::Declaration>(
+    const std::string&, std::vector<compute::Declaration::Input>,
+    std::shared_ptr<compute::ExecNodeOptions>)>;
+static NamedTapKindMapper kDefaultNamedTapKindMapper =


Do you think we will want to make the default named tap mapper (or the default named table provider for that matter) configurable?

I made it configurable.

westonpace · 2023-02-09T18:26:44Z

cpp/src/arrow/engine/substrait/options.h

+    [](const std::string& kind, std::vector<compute::Declaration::Input> inputs,
+       std::shared_ptr<compute::ExecNodeOptions> options)
+    -> Result<compute::Declaration> {
+  return compute::Declaration(kind, inputs, options);


If we remove NamedTapNodeOptions like some of my other comments suggest then I suppose there is no more meaningful default mapper. That being said, I don't know of any use case where this default would be correct. Perhaps we just return an error here like we do with named tables?

icexelloss

LGTM

westonpace

I think this works. Thank you for your persistence.

cpp/src/arrow/engine/substrait/options.cc

ursabot · 2023-02-11T08:16:29Z

Benchmark runs are scheduled for baseline = 8fed97f and contender = 24e5a58. 24e5a58 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.95% ⬆️0.12%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 24e5a580 ec2-t3-xlarge-us-east-2
[Failed] 24e5a580 test-mac-arm
[Finished] 24e5a580 ursa-i9-9960x
[Finished] 24e5a580 ursa-thinkcentre-m75q
[Finished] 8fed97fa ec2-t3-xlarge-us-east-2
[Finished] 8fed97fa test-mac-arm
[Finished] 8fed97fa ursa-i9-9960x
[Finished] 8fed97fa ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-02-11T08:19:27Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

…on (apache#33909) See apache#33899. This PR adds `NamedTapRel` and a simple test case with a no-op tap (i.e., just passing-through). * Closes: apache#33899 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>

apacheGH-33899: [C++] Add NamedTapRel relation as a Substrait extension

65bc8e2

rtpsw requested a review from westonpace as a code owner January 28, 2023 13:10

github-actions bot added the Component: C++ label Jan 28, 2023

westonpace requested changes Jan 30, 2023

rtpsw added 2 commits January 31, 2023 02:48

requested fixes

04a4b6c

add named tap mapper

65542c3

rtpsw requested a review from westonpace January 31, 2023 20:39

westonpace requested changes Feb 2, 2023

requested fixes

eed1eb6

rtpsw requested a review from westonpace February 2, 2023 15:57

icexelloss reviewed Feb 2, 2023

cpp/proto/substrait/extension_rels.proto

icexelloss reviewed Feb 2, 2023

cpp/src/arrow/engine/substrait/options.cc

icexelloss reviewed Feb 2, 2023

cpp/src/arrow/type.h

icexelloss reviewed Feb 2, 2023

cpp/src/arrow/engine/substrait/options.h (outdated)

fix doc and conv_opts

6aa420b

named tap mapper to declaration

0b1147b

westonpace requested changes Feb 9, 2023

rtpsw added 2 commits February 9, 2023 15:43

requested changes

2f39fe3

fix var name

07dc843

rtpsw requested a review from westonpace February 10, 2023 14:06

icexelloss approved these changes Feb 10, 2023

westonpace approved these changes Feb 10, 2023

cpp/src/arrow/engine/substrait/options.cc (outdated)

westonpace and others added 2 commits February 10, 2023 07:03

Update cpp/src/arrow/engine/substrait/options.cc

14722af

lint

7a1d916

rtpsw requested a review from westonpace February 10, 2023 16:40

westonpace merged commit 24e5a58 into apache:master Feb 10, 2023

rtpsw deleted the GH-33899 branch February 18, 2023 08:02
