Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Gandiva] Support external function registry #37753

Closed
niyue opened this issue Sep 16, 2023 · 3 comments · Fixed by #38116
Closed

[C++][Gandiva] Support external function registry #37753

niyue opened this issue Sep 16, 2023 · 3 comments · Fixed by #38116

Comments

@niyue
Copy link
Contributor

niyue commented Sep 16, 2023

Describe the enhancement requested

Description

Our team has been leveraging Gandiva in our projects, and its performance and capabilities have been commendable. However, we've identified a constraint concerning the registration of functions. At present, Gandiva necessitates that functions be registered directly within its codebase. This method, while functional, is not the most user-friendly and presents hurdles for those aiming to incorporate third-party functions. Direct modifications to Gandiva's source code for such integrations can inadvertently introduce maintenance challenges and potential versioning conflicts down the line.

Proposal

To address this limitation, I propose the introduction of an external function registry mechanism in Gandiva. This would allow users and developers to register and integrate custom functions without directly modifying Gandiva's core source code.
Two major changes are proposed for supporting this capability:

  • add a new API AddFunction(NativeFunction native_function) for FunctionRegistry, where the given parameter native_function stores the external function metadata, so that developers can register external functions by calling this API. The external function metadata discovery responsibility is outside the scope of this proposal and developers can come up with her own
  • add a new class for storing external functions' LLVM IR buffers (likely a singleton class), and merge the external IRs with the built-in function IR into the LLVM module, so that third party pre-compiled functions can be integrated via LLVM bitcode

I've came up with a PR (#37787) with more details, including the JSON schema and a minimum external C++ external function registered via this approach to demonstrate/test how it works.
The latest PR for implementing this proposal is here (#38116)

Benefits

  • Flexibility: This feature would grant users the autonomy to effortlessly integrate third-party functions or bespoke logic, eliminating the need to navigate the core codebase.
  • Maintainability: Segregating external functions ensures that Gandiva's primary code remains streamlined, facilitating easier updates and maintenance.
  • Polyglot programming: The proposed registry could pave the way for function authoring beyond C++, potentially embracing languages like C, Rust, or Zig. These could then be integrated via a standard C API or even at the LLVM IR level.
  • Community Engagement: Encouraging external function registration could bolster community participation. This would enable developers to not only contribute but also disseminate their unique functions. Consequently, specialized functions could be curated and shared more widely.

While the intricate design specifics are still under consideration, I'm keen to understand the community's perspective on this proposal. Would an external function registry align with Gandiva's future roadmap? Your insights would be invaluable. Thank you.

Component(s)

C++ - Gandiva

UPDATE:

@js8544
Copy link
Collaborator

js8544 commented Sep 17, 2023

IMO this is definitely a useful feature. You are very welcome to submit a PR if you would like to!

@niyue
Copy link
Contributor Author

niyue commented Sep 19, 2023

@js8544 I've crafted a PR to introduce support for an external function registry. For a comprehensive overview, please refer to #37787. I'd appreciate your guidance on the most suitable reviewer for this contribution. Your feedback would be invaluable. Thank you.

@alamb
Copy link
Contributor

alamb commented Sep 25, 2023

Related mailing list discussion: https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod

@kou kou changed the title [C++] [Gandiva] Support external function registry [C++][Gandiva] Support external function registry Oct 4, 2023
@kou kou closed this as completed in #38116 Nov 8, 2023
kou added a commit that referenced this issue Nov 8, 2023
# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See #37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue #38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: #37753

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 15.0.0 milestone Nov 8, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Nov 9, 2023
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment