[C++][Gandiva] Allow registering external C functions #38589

Closed
niyue opened this issue Nov 5, 2023 · 0 comments · Fixed by #38632

niyue commented Nov 5, 2023

Describe the enhancement requested

Description

In issue #37753, Gandiva gained support for registering external functions, so developers can use third-party functions in Gandiva expressions. However, the supported external functions must be compiled to LLVM IR before they can be registered and used. This limitation is troublesome when a third-party function has a non-trivial dependency such as an HTTP library, because all dependent libraries must be compiled into LLVM IR and all of that IR must be compiled at runtime, which is slow.

Proposal

To address this limitation, I propose allowing external C functions to be registered with Gandiva, so that Gandiva expressions can use these functions without compiling third-party code into LLVM IR. Gandiva already has such functions internally, where they are called stub functions, but this capability is not yet exposed to external functions.

The following APIs are proposed to be added to the FunctionRegistry API for this purpose:

  • arrow::Status Register(NativeFunction func, void* c_function_ptr, std::optional<FunctionHolderMaker> function_holder_maker = std::nullopt)
    • register a C function into the function registry
    • @param func the registered function's metadata
    • @param c_function_ptr the function pointer to the registered function's implementation
    • @param function_holder_maker optional; used to create the function holder if the function requires one, where using FunctionHolderMaker = std::function<arrow::Result<std::shared_ptr<gandiva::FunctionHolder>>(const FunctionNode& function_node)>
  • const std::vector<std::pair<NativeFunction, void*>>& GetCFunctions() const
    • get a list of C functions saved in the registry
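
To make the shape of the proposed API concrete, here is a minimal sketch that does not use Gandiva at all. `ToyNativeFunction`, `ToyFunctionRegistry`, `multiply_by_two`, and `CallFirstRegistered` are hypothetical stand-ins invented for illustration; the real `NativeFunction` carries much more metadata, and the real registry returns `arrow::Status` rather than `bool`. The sketch only shows the core idea: the registry stores an untyped `void*` next to the function's metadata, and the caller casts it back to the correct signature before invoking it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for gandiva::NativeFunction (assumption: the real
// type also describes parameter types, return type, null handling, etc.).
struct ToyNativeFunction {
  std::string expression_name;  // name used inside expressions
  std::string symbol_name;      // name of the C symbol implementing it
};

// Hypothetical stand-in for the proposed FunctionRegistry additions.
class ToyFunctionRegistry {
 public:
  // Mirrors the shape of the proposed Register(func, c_function_ptr).
  // Returns false instead of an error status when the pointer is null.
  bool Register(ToyNativeFunction func, void* c_function_ptr) {
    if (c_function_ptr == nullptr) return false;
    c_functions_.emplace_back(std::move(func), c_function_ptr);
    return true;
  }

  // Mirrors the shape of the proposed GetCFunctions().
  const std::vector<std::pair<ToyNativeFunction, void*>>& GetCFunctions() const {
    return c_functions_;
  }

 private:
  std::vector<std::pair<ToyNativeFunction, void*>> c_functions_;
};

// An ordinary precompiled function with C linkage; no LLVM IR is involved.
extern "C" int64_t multiply_by_two(int64_t x) { return x * 2; }

// Casts the stored pointer back to its real signature and calls it.
// The caller is responsible for knowing the right signature.
int64_t CallFirstRegistered(const ToyFunctionRegistry& registry, int64_t arg) {
  assert(!registry.GetCFunctions().empty());
  void* raw = registry.GetCFunctions().front().second;
  auto fn = reinterpret_cast<int64_t (*)(int64_t)>(raw);
  return fn(arg);
}
```

A caller would register the function with `registry.Register({"multiply_by_two", "multiply_by_two"}, reinterpret_cast<void*>(&multiply_by_two))`. In this toy the type information lives only in the cast; in the proposal it would come from the `NativeFunction` metadata.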

Benefits

  • Complex functions that depend on other libraries can be used without a performance penalty. LLVM IR based functions are slow to construct at runtime if the generated LLVM IR is large (more than several MB). Because constructing an LLVM module requires copying all LLVM bitcode into the module, the more functions are implemented in LLVM IR, the slower module construction becomes (unless selective IR loading is supported).
  • LLVM IR does allow users to develop third-party functions in different languages. However, complex external functions may use a language's standard library APIs, which makes it necessary to compile that standard library into LLVM IR as well. This may not be possible in many languages, and the generated LLVM IR would be too large (dead code elimination doesn't help much here, as far as I can tell). Allowing C functions overcomes this issue, since the standard library is typically part of the program that calls Gandiva (statically linked or dynamically loaded).
  • Certain capabilities, such as thread-local variables, are not available in Gandiva's current JIT engine (MCJIT) when JIT-compiling LLVM IR. Supporting them requires upgrading from MCJIT to the ORC v2 engine ([C++][Gandiva] Migration JIT engine from MCJIT to LLJIT #37848). Some libraries use thread-local variables, such as Rust's std::collections::HashMap, which uses a thread-local variable internally, so it is easy to run into this restriction when authoring a third-party function in Rust. Allowing C functions removes this limitation.

Notes

Component(s)

C++ - Gandiva

kou added a commit that referenced this issue Nov 8, 2023
# Rationale for this change

This PR tries to enhance Gandiva by supporting an external function registry, so that developers can author third-party functions without modifying Gandiva's core codebase. See #37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
/// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                       const std::string& bitcode_path);

/// \brief register a set of functions into the function registry from a given bitcode
/// buffer
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                       std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers register the function metadata (`funcs`) for all functions in an LLVM bitcode file, giving either the path to the bitcode file or an `arrow::Buffer` containing the LLVM bitcode.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

This PR adds some new ways to interface with the library:
* The `Configuration` class now accepts a customized function registry, in which developers can register their own external functions and which is then used as the function registry
* The `FunctionRegistry` class has the two new APIs mentioned above
* A newly instantiated `FunctionRegistry` no longer has any built-in functions registered in it. Instead, a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` retrieves the default function registry, which contains all the Gandiva built-in functions.
    * The behavior of libraries that depend on the Gandiva C++ library, such as the `Gandiva::FunctionRegistry` class in Gandiva's Ruby binding, changes accordingly
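
The relationship between an empty registry, the default registry, and a configuration that accepts either can be sketched without Gandiva as follows. Every name here (`ToyRegistry`, `toy_default_registry`, `ToyConfiguration`, and the built-in names `add` and `upper`) is hypothetical, and the sketch assumes the default registry is a single shared instance, which this PR description does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical registry: starts empty, like a freshly constructed
// FunctionRegistry after this change.
class ToyRegistry {
 public:
  void Add(const std::string& name) { names_.insert(name); }
  bool Has(const std::string& name) const { return names_.count(name) > 0; }
  std::size_t Size() const { return names_.size(); }

 private:
  std::set<std::string> names_;
};

// Plays the role of default_function_registry(): returns a registry that
// already contains the built-ins (assumption: one shared instance).
std::shared_ptr<ToyRegistry> toy_default_registry() {
  static const std::shared_ptr<ToyRegistry> instance = [] {
    auto registry = std::make_shared<ToyRegistry>();
    registry->Add("add");
    registry->Add("upper");
    return registry;
  }();
  return instance;
}

// Plays the role of Configuration: uses the default registry unless the
// caller supplies a customized one.
class ToyConfiguration {
 public:
  explicit ToyConfiguration(
      std::shared_ptr<ToyRegistry> registry = toy_default_registry())
      : registry_(std::move(registry)) {
    assert(registry_ != nullptr);
  }

  const ToyRegistry& registry() const { return *registry_; }

 private:
  std::shared_ptr<ToyRegistry> registry_;
};
```

In this sketch a custom registry passed to `ToyConfiguration` contains only what the caller added, so code that relied on a fresh registry being pre-populated would need to switch to the default-registry accessor.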

# Notes
* Performance
    * Code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen), even if an external function is not used in the given expression at all. This is not a new issue, and it applies to built-in functions as well. In my limited testing, this is because `llvm::Linker::linkModule` takes a non-trivial amount of time and runs for every IR loaded, while `RemoveUnusedFunctions` runs afterwards and so does not reduce the time spent in `linkModule`. We may have to selectively load only the necessary IR (primarily by selectively calling `linkModule` for that IR), but more metadata may be needed to tell which functions can be found in which IR. This could be improved in a separate PR; please advise if anyone has ideas on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I have only added an external C++ function to the codebase for unit testing purposes. A Rust-based function is possible, but when I tried it I found another issue (Rust's standard library needs to be processed with a different approach). I will explore other languages such as Zig later.
    * Functions that are not pre-compiled may require a different approach to get the function pointer, and we may discuss and work on that in a separate PR later. Another issue, #38589, was logged for this.
* The discussion thread on the dev mailing list: https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I previously submitted another PR (#37787) that introduced a JSON-based function registry; after discussion, I will close that PR and use this PR instead
* Closes: #37753

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Nov 9, 2023
…apache#38116)

@niyue niyue changed the title [C++][Gandiva] Allow registering external C interface functions [C++][Gandiva] Allow registering external C functions Nov 9, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…apache#38116)

kou pushed a commit that referenced this issue Nov 17, 2023
…8632)

### Rationale for this change
This PR tries to enhance Gandiva by supporting the registration of external C functions in its function registry, so that developers can author third-party functions with complex dependencies and expose them as C functions for use in Gandiva expressions. See more details in GH-38589.

### What changes are included in this PR?
This PR primarily adds a new API to the `FunctionRegistry` so that developers can use it to register external C functions:
```C++
arrow::Status Register(
      NativeFunction func, void* c_function_ptr,
      std::optional<FunctionHolderMaker> function_holder_maker = std::nullopt);
```

### Are these changes tested?
* The changes are tested via unit tests in this PR. The unit tests include several C functions written in C++, and they confirm that such functions can be used by Gandiva after being registered with the new API mentioned above.
* Additionally, I wrote some Rust-based functions locally, integrated them into a C++ program using the new registration API, and verified that this approach works. This piece of work is not included in the PR.

### Are there any user-facing changes?
Several new APIs are added to the `FunctionRegistry` class:
```C++
/// \brief register a C function into the function registry
/// @param func the registered function's metadata
/// @param c_function_ptr the function pointer to the
/// registered function's implementation
/// @param function_holder_maker this will be used as the function holder if the
/// function requires a function holder
arrow::Status Register(
    NativeFunction func, void* c_function_ptr,
    std::optional<FunctionHolderMaker> function_holder_maker = std::nullopt);

/// \brief get a list of C functions saved in the registry
const std::vector<std::pair<NativeFunction, void*>>& GetCFunctions() const;

const FunctionHolderMakerRegistry& GetFunctionHolderMakerRegistry() const;
```
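
The optional `function_holder_maker` argument is the least obvious part of this API, so here is a minimal Gandiva-free sketch of the pattern. All types (`ToyFunctionNode`, `ToyFunctionHolder`, `ToyPrefixHolder`, `ToyHolderMakerRegistry`) and the function names `starts_with` and `plain_fn` are hypothetical; the real maker returns an `arrow::Result` and receives a real `FunctionNode`. The sketch assumes a holder exists to carry per-expression state that is built once from the function node, here a prefix taken from a literal argument.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for FunctionNode: a function name plus one literal.
struct ToyFunctionNode {
  std::string name;
  std::string literal_arg;
};

// Hypothetical stand-in for gandiva::FunctionHolder.
struct ToyFunctionHolder {
  virtual ~ToyFunctionHolder() = default;
};

// Example holder: state computed once per expression, reused per row.
struct ToyPrefixHolder : ToyFunctionHolder {
  explicit ToyPrefixHolder(std::string p) : prefix(std::move(p)) {}
  bool Matches(const std::string& s) const {
    return s.compare(0, prefix.size(), prefix) == 0;
  }
  std::string prefix;
};

using ToyHolderMaker =
    std::function<std::shared_ptr<ToyFunctionHolder>(const ToyFunctionNode&)>;

// Hypothetical stand-in for the holder-maker side of the registry.
class ToyHolderMakerRegistry {
 public:
  // The maker is optional: functions without per-expression state skip it.
  void Register(const std::string& name,
                std::optional<ToyHolderMaker> maker = std::nullopt) {
    assert(!name.empty());
    if (maker.has_value()) makers_[name] = std::move(*maker);
  }

  // Returns nullptr when the function was registered without a maker.
  std::shared_ptr<ToyFunctionHolder> Make(const ToyFunctionNode& node) const {
    auto it = makers_.find(node.name);
    if (it == makers_.end()) return nullptr;
    return it->second(node);
  }

 private:
  std::map<std::string, ToyHolderMaker> makers_;
};
```

A stateless function is registered with the name only, while a function such as the hypothetical `starts_with` passes a lambda that builds a `ToyPrefixHolder` from the node's literal argument.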

* Closes: #38589

### Notes
* This PR is related to #38116, which adds the initial support for registering LLVM IR based external functions in Gandiva.

Authored-by: Yue Ni <niyue.com@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 15.0.0 milestone Nov 17, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…apache#38116)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ns (apache#38632)

lriggs added a commit to lriggs/arrow that referenced this issue Mar 18, 2024