Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33850: [C++] Allow Substrait's default extension provider to be configured #34042

Merged
merged 3 commits into from
Feb 6, 2023

Conversation

rtpsw
Copy link
Contributor

@rtpsw rtpsw commented Feb 4, 2023

@github-actions
Copy link

github-actions bot commented Feb 4, 2023

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 4, 2023

cc @westonpace @icexelloss

return kDefaultExtensionProvider;
}

void set_default_extension_provider(const std::shared_ptr<ExtensionProvider>& provider) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document these method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - do we need to expose these in public headers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is declared in the public header options.h and documented there.

@icexelloss
Copy link
Contributor

One thing wasn't clear to be is how do we dispatch some of the message to our extension provider vs use the default extension provider. For example, for the asof join message, I would imagine that being passed to the existing Acero extension provider - how would we handle that?

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 5, 2023

One thing wasn't clear to be is how do we dispatch some of the message to our extension provider vs use the default extension provider. For example, for the asof join message, I would imagine that being passed to the existing Acero extension provider - how would we handle that?

The usual pattern would be chaining of an extension provider - something along the lines of

class MyExtensionProvider {
public:
  MyExtensionProvider(const std::shared_ptr<ExtensionProvider>& parent_provider) :
      parent_provider_(parent_provider) {}

  Result<RelationInfo> MakeRel(const std::vector<DeclarationInfo>& inputs,
                               const google::protobuf::Any& rel,
                               const ExtensionSet& ext_set) override {
    if (rel.Is<substrait_ext::MyRel>()) {
      substrait_ext::MyRel my_rel;
      rel.UnpackTo(&my_rel);
      return MakeMyRel(inputs, my_rel, ext_set);
    }
    return parent_provider_->MakeRel(inputs, rel, ext_set);
  }

  // MakeMyRel implementation here

private:
  std::shared_ptr<ExtensionProvider> parent_provider_;
};

and for initialization

set_default_extension_provider(std::make_shared<MyExtensionProvider>(default_extension_provider());

@icexelloss
Copy link
Contributor

IMO we can handle the "fallback" behavior in Acero instead of custom code so not everyone that wants to provide custom extension need to implement this.

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 5, 2023

IMO we can handle the "fallback" behavior in Acero instead of custom code so not everyone that wants to provide custom extension need to implement this.

What do you have in mind for the API? My suggestion would be to encapsulate the chaining pattern via an API like:

chain_extension_provider(const std::shared_ptr<ExtensionProvider>& provider);

Here provider would not need to call a parent because this would be done automatically.

@westonpace
Copy link
Member

We could make the chaining of extension providers more explicit (e.g. Acero could have a stack of extension providers). Or we could create a base class that we offer as a helper / example. However, I'm not sure that this is a good idea. I'd rather stick to something like what we already have as it is very simple. Yes, most providers will want to do something similar but I don't think it is a large burden for them.

@icexelloss
Copy link
Contributor

I'd rather stick to something like what we already have as it is very simple. Yes, most providers will want to do something similar but I don't think it is a large burden for them.

This sounds fine to me. We can implement the fallback logic internally.

Comment on lines 127 to 130
std::shared_ptr<ExtensionProvider> kDefaultExtensionProvider =
std::make_shared<DefaultExtensionProvider>();

std::mutex kDefaultExtensionProviderMutex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::shared_ptr<ExtensionProvider> kDefaultExtensionProvider =
std::make_shared<DefaultExtensionProvider>();
std::mutex kDefaultExtensionProviderMutex;
std::shared_ptr<ExtensionProvider> default_extension_provider =
std::make_shared<DefaultExtensionProvider>();
std::mutex default_extension_provider_mutex;

These aren't constants so we can't use kName. Will this work?

Copy link
Contributor Author

@rtpsw rtpsw Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right we can't use kName. I believe the right way is to use g_name, which I just used in the recent commit.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feature.

@ursabot
Copy link

ursabot commented Feb 7, 2023

Benchmark runs are scheduled for baseline = 2b661ba and contender = 7423f03. 7423f03 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.24% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.03% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7423f033 ec2-t3-xlarge-us-east-2
[Failed] 7423f033 test-mac-arm
[Finished] 7423f033 ursa-i9-9960x
[Finished] 7423f033 ursa-thinkcentre-m75q
[Finished] 2b661bad ec2-t3-xlarge-us-east-2
[Finished] 2b661bad test-mac-arm
[Finished] 2b661bad ursa-i9-9960x
[Finished] 2b661bad ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@rtpsw rtpsw deleted the GH-33850 branch February 7, 2023 13:13
@kou
Copy link
Member

kou commented Feb 8, 2023

@github-actions crossbow submit test-ubuntu-18.04-cpp

@github-actions
Copy link

github-actions bot commented Feb 8, 2023

Revision: aa57153

Submitted crossbow builds: ursacomputing/crossbow @ actions-c920441bc9

Task Status
test-ubuntu-18.04-cpp Github Actions

@kou
Copy link
Member

kou commented Feb 8, 2023

@rtpsw @westonpace This broke test-ubuntu-18.04-cpp job:

https://github.com/ursacomputing/crossbow/actions/runs/4121522608/jobs/7117290841#step:5:1695

FAILED: src/arrow/engine/CMakeFiles/arrow_substrait_shared.dir/substrait/options.cc.o 
/usr/local/bin/sccache /usr/bin/c++  -DARROW_ENGINE_EXPORTING -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_HDFS -DARROW_MIMALLOC -DARROW_NO_DEPRECATED_API -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -DUTF8PROC_STATIC -Darrow_substrait_shared_EXPORTS -Isrc -I/arrow/cpp/src -I/arrow/cpp/src/generated -Isubstrait_ep-generated -isystem /arrow/cpp/thirdparty/flatbuffers/include -isystem /arrow/cpp/thirdparty/hadoop/include -isystem orc_ep-install/include -isystem zstd_ep-install/include -isystem utf8proc_ep-install/include -isystem xsimd_ep/src/xsimd_ep-install/include -isystem jemalloc_ep-prefix/src -isystem mimalloc_ep/src/mimalloc_ep/include/mimalloc-2.0 -Wno-noexcept-type  -fdiagnostics-color=always  -Wall -Wno-conversion -Wno-sign-conversion -Wunused-result -fno-semantic-interposition -msse4.2  -g -Werror -O0 -ggdb -fPIC   -pthread -std=c++1z -MD -MT src/arrow/engine/CMakeFiles/arrow_substrait_shared.dir/substrait/options.cc.o -MF src/arrow/engine/CMakeFiles/arrow_substrait_shared.dir/substrait/options.cc.o.d -o src/arrow/engine/CMakeFiles/arrow_substrait_shared.dir/substrait/options.cc.o -c /arrow/cpp/src/arrow/engine/substrait/options.cc
/arrow/cpp/src/arrow/engine/substrait/options.cc:130:6: error: 'mutex' in namespace 'std' does not name a type
 std::mutex g_default_extension_provider_mutex;
      ^~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc: In function 'std::shared_ptr<arrow::engine::ExtensionProvider> arrow::engine::default_extension_provider()':
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:8: error: 'unique_lock' is not a member of 'std'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
        ^~~~~~~~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:8: note: suggested alternative: 'unique_copy'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
        ^~~~~~~~~~~
        unique_copy
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:25: error: 'mutex' is not a member of 'std'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                         ^~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:37: error: 'g_default_extension_provider_mutex' was not declared in this scope
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:37: note: suggested alternative: 'default_extension_provider'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                     default_extension_provider
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:32: error: 'lock' was not declared in this scope
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                ^~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:135:32: note: suggested alternative: 'clock'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                ^~~~
                                clock
/arrow/cpp/src/arrow/engine/substrait/options.cc: In function 'void arrow::engine::set_default_extension_provider(const std::shared_ptr<arrow::engine::ExtensionProvider>&)':
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:8: error: 'unique_lock' is not a member of 'std'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
        ^~~~~~~~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:8: note: suggested alternative: 'unique_copy'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
        ^~~~~~~~~~~
        unique_copy
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:25: error: 'mutex' is not a member of 'std'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                         ^~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:37: error: 'g_default_extension_provider_mutex' was not declared in this scope
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:37: note: suggested alternative: 'default_extension_provider'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                     default_extension_provider
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:32: error: 'lock' was not declared in this scope
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                ^~~~
/arrow/cpp/src/arrow/engine/substrait/options.cc:140:32: note: suggested alternative: 'clock'
   std::unique_lock<std::mutex> lock(g_default_extension_provider_mutex);
                                ^~~~
                                clock

Could you check this?

@rtpsw rtpsw restored the GH-33850 branch February 8, 2023 09:00
@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 8, 2023

@kou, I'll handle this. I don't have the same platform readily available, but I have a reasonable guess of the cause.

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 8, 2023

@westonpace, could you please reopen this PR? I already pushed a commit to the same branch.

Edit: Or perhaps we must do this in a separate PR?

@icexelloss
Copy link
Contributor

icexelloss commented Feb 8, 2023 via email

@rtpsw
Copy link
Contributor Author

rtpsw commented Feb 8, 2023

Created a fix-PR.

westonpace pushed a commit that referenced this pull request Feb 8, 2023
…nfigured (fix) (#34075)

See [this post](#34042 (comment)) for background.
* Closes: #33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023
… be configured (apache#34042)

See apache#33850
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023
… be configured (fix) (apache#34075)

See [this post](apache#34042 (comment)) for background.
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
… be configured (apache#34042)

See apache#33850
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
… be configured (fix) (apache#34075)

See [this post](apache#34042 (comment)) for background.
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
… be configured (apache#34042)

See apache#33850
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
… be configured (fix) (apache#34075)

See [this post](apache#34042 (comment)) for background.
* Closes: apache#33850

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Allow Substrait's default extension provider to be configured
5 participants