Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33899: [C++] Add NamedTapRel relation as a Substrait extension #33909

Merged
merged 10 commits into from
Feb 10, 2023

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Jan 28, 2023

See #33899. This PR adds NamedTapRel and a simple test case with a no-op tap (i.e., just passing-through).

@github-actions
Copy link

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look. The Substrait side of this makes sense but I'm a little unclear on what the plan is for the C++ side of things.

cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/proto/substrait/extension_rels.proto Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
Comment on lines 87 to 93
struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions {
NamedTapNodeOptions(const std::string& name, std::shared_ptr<Schema> schema)
: name(name), schema(std::move(schema)) {}

std::string name;
std::shared_ptr<Schema> schema;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the plan is here. I thought originally the goal was to mimic the named table. In that case I'd expect to see something like NamedTableProvider (e.g. NamedTapProvider) that translates names into actual nodes (e.g. a TeeNode).

In that case I wouldn't expect there to be any "named tap" equivalent in Acero. I'm not sure what NamedTapNode and AddPassFactory are providing in this PR?

Copy link
Contributor Author

@rtpsw rtpsw Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a design part to discuss. The current PR implementation takes the kind field (i.e., named_tap_rel.kind()) as the factory name in the declaration. There is no need for a new NamedTapProvider concept when the factory registry is the provider of taps. OTOH, I agree with your mapping idea - adding a mapping from the kind field to the factory name would improve.

So, the plan here is:

  • NamedTapRel: the kind field selects the tap's kind, the name field configures the tap's action, and the columns field configures the tap's output schema.
  • NamedTapNodeOptions: the name field configured the tap's action as in NamedTapRel, and the schema field is the output schema configured for the tap. The selection of the tap kind is not done via options, so it's not included in them.
    I could add a description along these lines in docstrings.

The test code shows a vanilla example of this. The AddPassFactory tap is registered in the factory registry. It's configuration is not the focus of the test. In a (future) test case of a specific tap, we would cast the options argument of the tap's Make function to NamedTapNodeOptions and use the fields there to configure it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you return exec node options (similar to named tap provider) then you could bypass the need to encode properties into the name (you wouldn't even really need kind):

# in python
def tap_provider(name):
  if name == 'one':
    return TeeNodeOptions('/tmp/dataset_one')
  elif name == 'two':
    return TeeNodeOptions('/tmp/dataset_two')
  else:
    raise Error(...)

or you could move the name encoding / decoding into python

# in python
def tap_provider(name)
  path = get_path_from_name(name)
  return TeeNodeOptions(path)

or you could still use the kind mapping

# in python
def tap_provider(name)
  kind = get_kind_from_name(name)
  if kind == 'tee':
   return TeeNodeOptions(path)...

However, this feature is still pretty experimental so I don't mind sticking with kind<->factory name mapping for now if that is what you would prefer.

Copy link
Contributor

@icexelloss icexelloss Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind we should do the following:

(1) Add "NamedTapRel" as a purely abstract relation in Acero. The Acero substrait consumer code itself doesn't care or know how to turn a NamedTapRel message into a Declaration.

(2) Add a C++ extension API that allows downstream users / implementer of the NamedTapRel to tell Acero "hey, if you see a NamedTapRel message, call into the custom substrait extension function that I provide here and I will return a declaration to u" (basically this: #33850)

This way I believe we don't need most the stuff here (including changes to pass conv_opts, named_tap_mapper, etc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@icexelloss icexelloss Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a specific structure can be useful however in this case I don't feel it is much more useful than a general one in this case. For example, the specific structure for the NamedTapNode make function creates a NamedTapNode option, and doesn't not allow the user to return a declaration with a custom node option (e.g, in our case, a WriteSmoothNodeOption), so we need to handle that in our code after calling this NamedTapNode make function here (or totally by pass NamedTapNode function here). This an be done but I feel that is unnecessary indirection.

Copy link
Member

@westonpace westonpace Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @rtpsw that going too generic (protobuf.Any in and Declaration out) is not a good idea. This is the exact API for the ExtensionProvider and so we already have this. You could use this for WriteSmoothNodeOption and it would end up looking just like the extension that is in place for AsofJoinNode. If that works for you then there is nothing we need to do :)

So if we want to go less generic then we should look at what Acero provides at each step to simplify things:

Generic Node

The next step down in generality would be to remove protobuf from the picture using Substrait/Arrow literals:

message GenericOptions {
  message Property {
    string key = 0;
    Literal value = 1;
  }
  repeated Property properties = 1;
}

message GenericNode {
  string kind = 0;
  GenericOptions options = 1;
}

The API in Arrow would be:

using Options = unordered_map<string, Scalar>;
void RegisterGenericExtensionHandler(string kind, function<Declaration(Options)> handler);

In other words, given properties as a map of arrow scalars, create a declaration

Pros:

  • No need for extension author to write proto file
  • Can be used for any extension type

Cons:

  • Need to create an entire ExecNode for your custom behavior

Generic Mapper

An even more focused approach could be to remove the need to create an exec node at all. This would be focused on the 1 input in / 1 input out / no accumulation case (e.g. what MapNode handles). From what I know of your write need this would fit this case.

The Substrait would be pretty much identical to the above.

message MapNode {
  string kind = 0;
  GenericOptions options = 1;
}

However, the Arrow API would be:

using Options = unordered_map<string, Scalar>;
using Mapper = function<ExecBatch(ExecBatch)>;
void RegisterGenericMapHandler(string kind, function<Mapper(Options)> handler);

Pros:

  • No need to write proto file
  • No need to create an exec node

Cons:

  • Only handles 1-in relations that don't accumulate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, sorry, I was starting a bit with a blank slate above. I think, in this PR, the current idea is that instead of using GenericOptions we use string name and encode the options into the name. This is probably less flexible but in many cases users might be more familiar with encoding options into strings (ala URL encoding) instead of dealing with Arrow scalars. I'm fine with either approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So sticking with name I think the two approaches are:

Generic Node

void RegisterGenericExtensionHandler(string kind, function<Declaration(string)> handler);

Generic Mapper

using Mapper = function<ExecBatch(ExecBatch)>;
void RegisterGenericMapHandler(string kind, function<Mapper(string)> handler);

Copy link
Contributor

@icexelloss icexelloss Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow missed this message - I am leaning towards "Generic Node" since it allows more flexibility to define node that is not just "map-like" and also we don't need to write protobuf file.

I have a question - if we go the "Generic Node" route it looks like we can achieve register custom handlers via either RegisterGenericExtensionHandler or set_default_extension_provider (7423f03#diff-8c5c127e8db8e52b40519262ee2b7e0179d4a39e5bcfe8b2165173b2b352d6a9R139) so technically we don't need set_default_extension_provider if we have RegisterGenericExtensionHandler?

@rtpsw
Copy link
Contributor Author

rtpsw commented Jan 31, 2023

@westonpace, in the recent commit, when coding up kDefaultNamedTapKindMapper, I reused the code structure for kDefaultNamedTableProvider. However, it looks like it is not properly initialized - I observed the error message C++ exception with description "bad_function_call" thrown in the test body when calling its operator(). Is this intended? Or shall I fix the initialization of both?

@rtpsw rtpsw requested a review from westonpace January 31, 2023 20:39
@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 1, 2023

Note that the factory registry does not support scoping, which could be useful for cleaning up after and isolating testing (here, due to AddPassFactory) and in other use cases, like with other scoped registries. @westonpace, do you agree? open an issue?

@westonpace
Copy link
Member

@westonpace, in the recent commit, when coding up kDefaultNamedTapKindMapper, I reused the code structure for kDefaultNamedTableProvider. However, it looks like it is not properly initialized - I observed the error message C++ exception with description "bad_function_call" thrown in the test body when calling its operator(). Is this intended? Or shall I fix the initialization of both?

What would you initialize it to? There is no default named table provider. I'm pretty sure we are using kDefaultNamedTapKindMapper as sentinel value only. Since std::function evaluates to false if default initialized:

        if (!conversion_options.named_table_provider) {
          return Status::Invalid(
              "plan contained a named table but a NamedTableProvider has not been "
              "configured");
        }

@westonpace
Copy link
Member

Note that the factory registry does not support scoping, which could be useful for cleaning up after and isolating testing (here, due to AddPassFactory) and in other use cases, like with other scoped registries. @westonpace, do you agree? open an issue?

I'm not opposed to this but I don't think it is really necessary. The tests all rely on the fact that the initialization methods are setup as "call-once" so calling them repeatedly won't be a problem.

When dealing with the function registry it seemed important to support nested registries because you might have query-specific functions (e.g. if a UDF is embedded this could be one way to handle it). However, the odds of having query-specific nodes seems pretty slim to me.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this further I do worry that kind will be too restrictive. For example, if you are writing to a temporary output directory and the name specifies the path then how do you configure the directory to write to?

Perhaps that might be what you intended with the nesting of node factories? E.g. you would register a node factory that had the output directory hard-coded as a part of the node factory?

cpp/proto/substrait/extension_rels.proto Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/options.h Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/serde_test.cc Outdated Show resolved Hide resolved
@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 2, 2023

What would you initialize it to?

I would have the default implementation return the invalid status, like the one you pointed to.

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 2, 2023

I'm not opposed to this but I don't think it is really necessary.

The main purpose I had in mind for this is test isolation. I agree it's not strictly necessary.

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 2, 2023

Looking at this further I do worry that kind will be too restrictive. For example, if you are writing to a temporary output directory and the name specifies the path then how do you configure the directory to write to?

Something like kind = "write_to_temp" and name = "/path/to/tmp.XXXXXX". The XXXXXX part has the same meaning as in mktemp(1).

Perhaps that might be what you intended with the nesting of node factories? E.g. you would register a node factory that had the output directory hard-coded as a part of the node factory?

One could register some node factory to implement a very specific custom behavior, and it's good that this option is available to users, but the above example is simple enough that it won't be needed for it. I believe this would be the case with many examples. If we settle on this kind+name structure, we'd want to encourage its application, i.e., the recommendation would be to use kind to specify the functionality and name to specify its configuration whenever possible.

@rtpsw rtpsw requested a review from westonpace February 2, 2023 15:57
@icexelloss
Copy link
Contributor

@rtpsw With #34042 merged I think we can simply this PR? I.e. we don't really need to use the MakeNamedTapRel in this PR since it will be implemented in the custom substrait consumer anyway?

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 7, 2023

My understanding is that even with the most generic solution Weston described, we would need some message defined since the current design has the extension provider check for the message using google::protobuf::Any::Is<...>(). Is there a different place, external to Arrow rather than in its extension_rels.proto, where we could define the message?

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 7, 2023

@westonpace, are you leaning toward accepting the current version of the PR, or something close to it? @icexelloss would be OK with reducing the PR to just adding the message, in case you'd prefer this.

@icexelloss
Copy link
Contributor

icexelloss commented Feb 7, 2023

IMO we should reduce this PR to be just the extension protobuf/message. The main reason is that most of the other change in this PR will not be used by us anyway because we will using our custom extension provider to handle the NamedTapMessage (since we will be creating a WriteSmoothNode from that NamedTapMessage) so it will not be hitting many of the code changes in this PR. Therefore introducing a bunch of code that will not be used doesn't seem necessarily.

@westonpace
Copy link
Member

If we changed NamedTapKindMapper to return a Declaration I would be happy to proceed with something like this.

@icexelloss
Copy link
Contributor

icexelloss commented Feb 8, 2023

@westonpace would u be ok with having just the proto change in
https://github.com/apache/arrow/pull/33909/files#diff-878a08387dc84ac471ccb8f129a6e032154fc82d457f8cc59bfcbe012920339b

I think since the we can register custom substrait extension consumer, we won't really need to call into the implementation for "NamedTapRel" in the default substrait consumer.

If there is a use case can justifies support NamedTapRel in the default consumer in the future I think we can add that later (but currently, our use case doesn't need it). WDTY?

@icexelloss
Copy link
Contributor

One thing I am thinking is for testing maybe we can do the following:

  • construct a plan that contains a NamedTapRel
  • register a custom substrait consumer that can handle NamedTapRel (returns some declaration, but not necessarily need to execute it)
  • Verify the Declaration

This way the test serves and both an example for using NamedTapRel and also mimicing how we will be using this internally.

@westonpace
Copy link
Member

I'd rather not just add proto without any support for processing that proto. A mapper that returns a declaration seems simple enough (and this PR is 80% of the way there) and this makes it clear what the intent of the proto message is even if you don't end up using it.

@icexelloss
Copy link
Contributor

A mapper that returns a declaration seems simple enough (and this PR is 80% of the way there) and this makes it clear what the intent of the proto message is even if you don't end up using it.

Fair enough. Let do that @rtpsw

@icexelloss
Copy link
Contributor

@rtpsw @westonpace

I realize there might be some confusion so I want to clarify here.

There seems to be two (or three) proposed solutions:
(1) The one that is implemented in the PR now
(2) What Weston wrote 5 days ago here (#33909 (comment)) (The Generic Node / Generic Map) (sorry I somehow missed this message and just read that Today)

It's not clear to me if we have agreement to go with one vs the other? Both seem workable to me. (2) does seem a bit more generic / future proof and (1) seems pretty much done. 

Li

@westonpace
Copy link
Member

If 2 is acceptable then I would ask that we please move to that. I think the only change is to have the mapper return a declaration instead of a string. If we feel there is still some ambiguity I'd be happy to make a PR / diff real quick to show what I am thinking.

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 9, 2023

The recent commit has NamedTapKindMapper returning a Declaration. @westonpace, let me know if anything is missing.

@icexelloss
Copy link
Contributor

2 is acceptable to me. @westonpace are u asking for "Generic Node" or "Generic Mapper"?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hadn't noticed that commit. I agree this looks close. I have a few more questions / minor suggestions

Comment on lines 72 to 74
using NamedTapKindMapper = std::function<Result<compute::Declaration>(
const std::string&, std::vector<compute::Declaration::Input>,
std::shared_ptr<compute::ExecNodeOptions>)>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I think we can do one more small tweak.

Suggested change
using NamedTapKindMapper = std::function<Result<compute::Declaration>(
const std::string&, std::vector<compute::Declaration::Input>,
std::shared_ptr<compute::ExecNodeOptions>)>;
using NamedTapKindMapper = std::function<Result<compute::Declaration>(
const std::string&, std::vector<compute::Declaration::Input>,
std::shared_ptr<Schema>, std::string name)>;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

const ExtensionDetails& ext_details,
const ExtensionSet& ext_set) = 0;
};

ARROW_ENGINE_EXPORT std::shared_ptr<ExtensionProvider> default_extension_provider();

struct ARROW_ENGINE_EXPORT NamedTapNodeOptions : public compute::ExecNodeOptions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass the schema and name directly instead of using this class. If we still want to wrap the two of them in some kind of object (to reduce the # of args passed into the mapper?) then we should include kind, do not extend ExecNodeOptions, and call it something like NamedTap or NamedTapOptions.

In other words, we shouldn't try to form the node options from the protobuf. That will be the mapper's job. We just need to pass whatever was in the protobuf along to the mapper and let the mapper figure out what node options are appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/// \brief A custom strategy to be used for mapping a tap kind to a function name
///
/// The default mapper returns a declaration whose factory name is equal to the tap kind
NamedTapKindMapper named_tap_mapper = kDefaultNamedTapKindMapper;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about renaming this to NamedTapProvider for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

using NamedTapKindMapper = std::function<Result<compute::Declaration>(
const std::string&, std::vector<compute::Declaration::Input>,
std::shared_ptr<compute::ExecNodeOptions>)>;
static NamedTapKindMapper kDefaultNamedTapKindMapper =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we will want to make the default named tap mapper (or the default named table provider for that matter) configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it configurable.

[](const std::string& kind, std::vector<compute::Declaration::Input> inputs,
std::shared_ptr<compute::ExecNodeOptions> options)
-> Result<compute::Declaration> {
return compute::Declaration(kind, inputs, options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove NamedTapNodeOptions like some of my other comments suggest then I suppose there is no more meaningful default mapper. That being said, I don't know of any use case where this default would be correct. Perhaps we just return an error here like we do with named tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works. Thank you for your persistence.

cpp/src/arrow/engine/substrait/options.cc Outdated Show resolved Hide resolved
@westonpace westonpace merged commit 24e5a58 into apache:master Feb 10, 2023
@ursabot
Copy link

ursabot commented Feb 11, 2023

Benchmark runs are scheduled for baseline = 8fed97f and contender = 24e5a58. 24e5a58 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.95% ⬆️0.12%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 24e5a580 ec2-t3-xlarge-us-east-2
[Failed] 24e5a580 test-mac-arm
[Finished] 24e5a580 ursa-i9-9960x
[Finished] 24e5a580 ursa-thinkcentre-m75q
[Finished] 8fed97fa ec2-t3-xlarge-us-east-2
[Finished] 8fed97fa test-mac-arm
[Finished] 8fed97fa ursa-i9-9960x
[Finished] 8fed97fa ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Feb 11, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…on (apache#33909)

See apache#33899. This PR adds `NamedTapRel` and a simple test case with a no-op tap (i.e., just passing-through).
* Closes: apache#33899

Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@rtpsw rtpsw deleted the GH-33899 branch February 18, 2023 08:02
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…on (apache#33909)

See apache#33899. This PR adds `NamedTapRel` and a simple test case with a no-op tap (i.e., just passing-through).
* Closes: apache#33899

Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Add NamedTapRel relation as a Substrait extension
4 participants