GH-37753: [C++][Gandiva] Add external function registry support (#38116)
# Rationale for this change

This PR enhances Gandiva with support for an external function registry, so that developers can author third-party functions without modifying Gandiva's core codebase. See #37753 for more details. In this PR, an external function must be compiled into LLVM IR to be integrated.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
  /// \brief register a set of functions into the function registry from a given bitcode
  /// file
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, a developer registers the metadata (`funcs`) for all functions in an LLVM bitcode file, passing either the path to the bitcode file or an `arrow::Buffer` containing the bitcode.
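
As a minimal sketch of what an external function might look like before it is compiled to bitcode, the snippet below defines a plain C++ function with C linkage. The function name `multiply_by_two_int32` and the file names in the comment are illustrative assumptions; the compiler flags mirror those used by `gandiva_add_bitcode` in this commit. The `NativeFunction` metadata passed to `Register` would then presumably refer to this symbol by name.

```cpp
#include <cstdint>

// Hypothetical external function. C linkage keeps the symbol unmangled,
// so the name stored in the bitcode is exactly "multiply_by_two_int32".
extern "C" int64_t multiply_by_two_int32(int32_t value) {
  // Widen before multiplying so that large int32 inputs do not overflow.
  return static_cast<int64_t>(value) * 2;
}

// One possible way to produce bitcode from this file (assumed file names):
//   clang++ -std=c++17 -emit-llvm -O3 -c multiply_by_two.cc -o multiply_by_two.bc
```

The resulting `.bc` file could be passed by path to the first `Register` overload, or loaded into an `arrow::Buffer` for the second.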

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Unit tests are added to verify this enhancement.

# Are there any user-facing changes?

This PR adds some new ways to interface with the library:
* The `Configuration` class now accepts a customized function registry, so developers can register their own external functions in a registry and have Gandiva use it.
* The `FunctionRegistry` class has the two new APIs mentioned above.
* A newly instantiated `FunctionRegistry` no longer has any built-in functions registered in it. Use the new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * The behavior of libraries that depend on the Gandiva C++ library changes accordingly, for example the `Gandiva::FunctionRegistry` class in Gandiva's Ruby binding.
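
The behavior change for newly created registries can be illustrated with a small self-contained model. This is not Gandiva's implementation: `ToyRegistry`, `toy_default_registry`, and the registered names are hypothetical stand-ins, used only to show that a freshly constructed registry starts empty while the shared default registry carries the built-ins.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>

// Toy stand-in for a function registry (not the Gandiva class).
class ToyRegistry {
 public:
  void Register(const std::string& name) { names_.insert(name); }
  bool Contains(const std::string& name) const { return names_.count(name) > 0; }
  std::size_t size() const { return names_.size(); }

 private:
  std::set<std::string> names_;
};

// Toy counterpart of a default-registry accessor: built once, shared by all callers.
std::shared_ptr<ToyRegistry> toy_default_registry() {
  static const std::shared_ptr<ToyRegistry> registry = [] {
    auto r = std::make_shared<ToyRegistry>();
    r->Register("not");     // stand-ins for built-in functions
    r->Register("isnull");
    return r;
  }();
  return registry;
}
```

In this model, code that used to rely on a new registry already containing built-ins would switch to the default accessor, much as the Ruby tests in this commit switch from `Gandiva::FunctionRegistry.new` to `Gandiva::FunctionRegistry.default`.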

# Notes
* Performance
    * The code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if an external function is not used in the given expression at all. This is not a new issue; it applies to built-in functions as well (the more built-in functions there are, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes a non-trivial amount of time and runs for every IR loaded, while `RemoveUnusedFunctions` runs only afterwards and so does nothing to reduce the time spent in `linkModule`. We may have to load only the necessary IR selectively (primarily by calling `linkModule` only for that IR), but more metadata may be needed to tell which functions can be found in which IR. This could be improved in a separate PR; please advise if anyone has ideas. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I have only added an external C++ function to the codebase, for unit testing purposes. Rust-based functions are possible, but when I tried one I found another issue (Rust has a std lib that needs to be processed with a different approach). I will explore other languages such as Zig later.
    * Functions that are not pre-compiled may require a different approach to obtain the function pointer; we may discuss and work on that in a separate PR later. Another issue, #38589, was logged for this.
* The discussion thread on the dev mailing list: https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
    * I previously submitted another PR (#37787) that introduced a JSON-based function registry. After discussion, I will close that PR and use this one instead.
* Closes: #37753
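
The selective-loading idea from the performance note could be sketched as follows. This is only an illustration of the kind of metadata that note proposes, not code from this PR: `ModulesToLink`, the function-to-module map, and the file names are all hypothetical. Given a map from function name to the bitcode module that defines it, only the modules needed by the functions an expression actually uses are selected for linking.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical metadata: which bitcode module defines which function.
using FunctionToModule = std::map<std::string, std::string>;

// Returns the distinct modules that would need to be linked for the
// functions used by an expression. Unknown function names are skipped.
std::set<std::string> ModulesToLink(const FunctionToModule& index,
                                    const std::vector<std::string>& used_functions) {
  std::set<std::string> modules;
  for (const auto& name : used_functions) {
    auto it = index.find(name);
    if (it != index.end()) {
      modules.insert(it->second);
    }
  }
  return modules;
}
```

With such an index, an expression that uses only built-in functions would never trigger linking of an external module, at the cost of maintaining the extra metadata.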

Lead-authored-by: Yue Ni <niyue.com@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
niyue and kou committed Nov 8, 2023
1 parent c4db009 commit bbb610e
Showing 43 changed files with 809 additions and 247 deletions.
23 changes: 23 additions & 0 deletions c_glib/arrow-glib/version.h.in
@@ -110,6 +110,15 @@
# define GARROW_UNAVAILABLE(major, minor) G_UNAVAILABLE(major, minor)
#endif

/**
* GARROW_VERSION_15_0:
*
* You can use this macro value for compile time API version check.
*
* Since: 15.0.0
*/
#define GARROW_VERSION_15_0 G_ENCODE_VERSION(15, 0)

/**
* GARROW_VERSION_14_0:
*
@@ -346,6 +355,20 @@

#define GARROW_AVAILABLE_IN_ALL

#if GARROW_VERSION_MIN_REQUIRED >= GARROW_VERSION_15_0
# define GARROW_DEPRECATED_IN_15_0 GARROW_DEPRECATED
# define GARROW_DEPRECATED_IN_15_0_FOR(function) GARROW_DEPRECATED_FOR(function)
#else
# define GARROW_DEPRECATED_IN_15_0
# define GARROW_DEPRECATED_IN_15_0_FOR(function)
#endif

#if GARROW_VERSION_MAX_ALLOWED < GARROW_VERSION_15_0
# define GARROW_AVAILABLE_IN_15_0 GARROW_UNAVAILABLE(15, 0)
#else
# define GARROW_AVAILABLE_IN_15_0
#endif

#if GARROW_VERSION_MIN_REQUIRED >= GARROW_VERSION_14_0
# define GARROW_DEPRECATED_IN_14_0 GARROW_DEPRECATED
# define GARROW_DEPRECATED_IN_14_0_FOR(function) GARROW_DEPRECATED_FOR(function)
4 changes: 4 additions & 0 deletions c_glib/doc/gandiva-glib/gandiva-glib-docs.xml
@@ -100,6 +100,10 @@
<title>Index of deprecated API</title>
<xi:include href="xml/api-index-deprecated.xml"><xi:fallback /></xi:include>
</index>
<index id="api-index-15-0-0" role="15.0.0">
<title>Index of new symbols in 15.0.0</title>
<xi:include href="xml/api-index-15.0.0.xml"><xi:fallback /></xi:include>
</index>
<index id="api-index-4-0-0" role="4.0.0">
<title>Index of new symbols in 4.0.0</title>
<xi:include href="xml/api-index-4.0.0.xml"><xi:fallback /></xi:include>
118 changes: 101 additions & 17 deletions c_glib/gandiva-glib/function-registry.cpp
@@ -18,8 +18,8 @@
*/

#include <gandiva/function_registry.h>
#include <gandiva-glib/function-registry.h>

#include <gandiva-glib/function-registry.hpp>
#include <gandiva-glib/function-signature.hpp>
#include <gandiva-glib/native-function.hpp>

@@ -34,18 +34,86 @@ G_BEGIN_DECLS
* Since: 0.14.0
*/

G_DEFINE_TYPE(GGandivaFunctionRegistry,
ggandiva_function_registry,
G_TYPE_OBJECT)
struct GGandivaFunctionRegistryPrivate {
std::shared_ptr<gandiva::FunctionRegistry> function_registry;
};

enum {
PROP_FUNCTION_REGISTRY = 1,
};

G_DEFINE_TYPE_WITH_PRIVATE(GGandivaFunctionRegistry,
ggandiva_function_registry,
G_TYPE_OBJECT)

#define GGANDIVA_FUNCTION_REGISTRY_GET_PRIVATE(object) \
static_cast<GGandivaFunctionRegistryPrivate *>( \
ggandiva_function_registry_get_instance_private( \
GGANDIVA_FUNCTION_REGISTRY(object)))

static void
ggandiva_function_registry_finalize(GObject *object)
{
auto priv = GGANDIVA_FUNCTION_REGISTRY_GET_PRIVATE(object);
priv->function_registry.~shared_ptr();
G_OBJECT_CLASS(ggandiva_function_registry_parent_class)->finalize(object);
}

static void
ggandiva_function_registry_set_property(GObject *object,
guint prop_id,
const GValue *value,
GParamSpec *pspec)
{
auto priv = GGANDIVA_FUNCTION_REGISTRY_GET_PRIVATE(object);

switch (prop_id) {
case PROP_FUNCTION_REGISTRY:
priv->function_registry =
*static_cast<std::shared_ptr<gandiva::FunctionRegistry> *>(
g_value_get_pointer(value));
break;
default:
G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
break;
}
}

static void
ggandiva_function_registry_init(GGandivaFunctionRegistry *object)
{
auto priv = GGANDIVA_FUNCTION_REGISTRY_GET_PRIVATE(object);
new(&priv->function_registry) std::shared_ptr<gandiva::FunctionRegistry>;
}

static void
ggandiva_function_registry_class_init(GGandivaFunctionRegistryClass *klass)
{
auto gobject_class = G_OBJECT_CLASS(klass);
gobject_class->finalize = ggandiva_function_registry_finalize;
gobject_class->set_property = ggandiva_function_registry_set_property;

GParamSpec *spec;
spec = g_param_spec_pointer("function-registry",
"Function registry",
"The raw std::shared_ptr<gandiva::FunctionRegistry> *",
static_cast<GParamFlags>(G_PARAM_WRITABLE |
G_PARAM_CONSTRUCT_ONLY));
g_object_class_install_property(gobject_class, PROP_FUNCTION_REGISTRY, spec);
}

/**
* ggandiva_function_registry_default:
*
* Returns: (transfer full): The process-wide default function registry.
*
* Since: 15.0.0
*/
GGandivaFunctionRegistry *
ggandiva_function_registry_default(void)
{
auto gandiva_function_registry = gandiva::default_function_registry();
return ggandiva_function_registry_new_raw(&gandiva_function_registry);
}

/**
@@ -58,7 +126,8 @@ ggandiva_function_registry_class_init(GGandivaFunctionRegistryClass *klass)
GGandivaFunctionRegistry *
ggandiva_function_registry_new(void)
{
return GGANDIVA_FUNCTION_REGISTRY(g_object_new(GGANDIVA_TYPE_FUNCTION_REGISTRY, NULL));
auto gandiva_function_registry = std::make_shared<gandiva::FunctionRegistry>();
return ggandiva_function_registry_new_raw(&gandiva_function_registry);
}

/**
@@ -75,15 +144,16 @@ GGandivaNativeFunction *
ggandiva_function_registry_lookup(GGandivaFunctionRegistry *function_registry,
GGandivaFunctionSignature *function_signature)
{
gandiva::FunctionRegistry gandiva_function_registry;
auto gandiva_function_registry =
ggandiva_function_registry_get_raw(function_registry);
auto gandiva_function_signature =
ggandiva_function_signature_get_raw(function_signature);
auto gandiva_native_function =
gandiva_function_registry.LookupSignature(*gandiva_function_signature);
gandiva_function_registry->LookupSignature(*gandiva_function_signature);
if (gandiva_native_function) {
return ggandiva_native_function_new_raw(gandiva_native_function);
} else {
return NULL;
return nullptr;
}
}

@@ -99,18 +169,32 @@ ggandiva_function_registry_lookup(GGandivaFunctionRegistry *function_registry,
GList *
ggandiva_function_registry_get_native_functions(GGandivaFunctionRegistry *function_registry)
{
gandiva::FunctionRegistry gandiva_function_registry;

auto gandiva_function_registry =
ggandiva_function_registry_get_raw(function_registry);
GList *native_functions = nullptr;
for (auto gandiva_native_function = gandiva_function_registry.begin();
gandiva_native_function != gandiva_function_registry.end();
++gandiva_native_function) {
auto native_function = ggandiva_native_function_new_raw(gandiva_native_function);
for (const auto &gandiva_native_function : *gandiva_function_registry) {
auto native_function = ggandiva_native_function_new_raw(&gandiva_native_function);
native_functions = g_list_prepend(native_functions, native_function);
}
native_functions = g_list_reverse(native_functions);

return native_functions;
return g_list_reverse(native_functions);
}

G_END_DECLS

GGandivaFunctionRegistry *
ggandiva_function_registry_new_raw(
std::shared_ptr<gandiva::FunctionRegistry> *gandiva_function_registry)
{
return GGANDIVA_FUNCTION_REGISTRY(
g_object_new(GGANDIVA_TYPE_FUNCTION_REGISTRY,
"function-registry", gandiva_function_registry,
nullptr));
}

std::shared_ptr<gandiva::FunctionRegistry>
ggandiva_function_registry_get_raw(GGandivaFunctionRegistry *function_registry)
{
auto priv = GGANDIVA_FUNCTION_REGISTRY_GET_PRIVATE(function_registry);
return priv->function_registry;
}

2 changes: 2 additions & 0 deletions c_glib/gandiva-glib/function-registry.h
@@ -35,6 +35,8 @@ struct _GGandivaFunctionRegistryClass
GObjectClass parent_class;
};

GARROW_AVAILABLE_IN_15_0
GGandivaFunctionRegistry *ggandiva_function_registry_default(void);
GGandivaFunctionRegistry *ggandiva_function_registry_new(void);
GGandivaNativeFunction *
ggandiva_function_registry_lookup(GGandivaFunctionRegistry *function_registry,
30 changes: 30 additions & 0 deletions c_glib/gandiva-glib/function-registry.hpp
@@ -0,0 +1,30 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

#pragma once

#include <gandiva/function_registry.h>

#include <gandiva-glib/function-registry.h>

GGandivaFunctionRegistry *
ggandiva_function_registry_new_raw(
std::shared_ptr<gandiva::FunctionRegistry> *gandiva_function_registry);
std::shared_ptr<gandiva::FunctionRegistry>
ggandiva_function_registry_get_raw(GGandivaFunctionRegistry *function_registry);
2 changes: 1 addition & 1 deletion c_glib/test/gandiva/test-function-registry.rb
@@ -20,7 +20,7 @@ class TestGandivaFunctionRegistry < Test::Unit::TestCase

def setup
omit("Gandiva is required") unless defined?(::Gandiva)
@registry = Gandiva::FunctionRegistry.new
@registry = Gandiva::FunctionRegistry.default
end

sub_test_case("lookup") do
2 changes: 1 addition & 1 deletion c_glib/test/gandiva/test-native-function.rb
@@ -20,7 +20,7 @@ class TestGandivaNativeFunction < Test::Unit::TestCase

def setup
omit("Gandiva is required") unless defined?(::Gandiva)
@registry = Gandiva::FunctionRegistry.new
@registry = Gandiva::FunctionRegistry.default
@not = lookup("not", [boolean_data_type], boolean_data_type)
@isnull = lookup("isnull", [int8_data_type], boolean_data_type)
end
75 changes: 75 additions & 0 deletions cpp/cmake_modules/GandivaAddBitcode.cmake
@@ -0,0 +1,75 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Create bitcode for the given source file.
function(gandiva_add_bitcode SOURCE)
set(CLANG_OPTIONS -std=c++17)
if(MSVC)
# "19.20" means that it's compatible with Visual Studio 16 2019.
# We can update this to "19.30" when we dropped support for Visual
# Studio 16 2019.
#
# See https://cmake.org/cmake/help/latest/variable/MSVC_VERSION.html
# for MSVC_VERSION and Visual Studio version.
set(FMS_COMPATIBILITY 19.20)
list(APPEND CLANG_OPTIONS -fms-compatibility
-fms-compatibility-version=${FMS_COMPATIBILITY})
endif()

get_filename_component(SOURCE_BASE ${SOURCE} NAME_WE)
get_filename_component(ABSOLUTE_SOURCE ${SOURCE} ABSOLUTE)
set(BC_FILE ${CMAKE_CURRENT_BINARY_DIR}/${SOURCE_BASE}.bc)
set(PRECOMPILE_COMMAND)
if(CMAKE_OSX_SYSROOT)
list(APPEND
PRECOMPILE_COMMAND
${CMAKE_COMMAND}
-E
env
SDKROOT=${CMAKE_OSX_SYSROOT})
endif()
list(APPEND
PRECOMPILE_COMMAND
${CLANG_EXECUTABLE}
${CLANG_OPTIONS}
-DGANDIVA_IR
-DNDEBUG # DCHECK macros not implemented in precompiled code
-DARROW_STATIC # Do not set __declspec(dllimport) on MSVC on Arrow symbols
-DGANDIVA_STATIC # Do not set __declspec(dllimport) on MSVC on Gandiva symbols
-fno-use-cxa-atexit # Workaround for unresolved __dso_handle
-emit-llvm
-O3
-c
${ABSOLUTE_SOURCE}
-o
${BC_FILE}
${ARROW_GANDIVA_PC_CXX_FLAGS})
if(ARROW_BINARY_DIR)
list(APPEND PRECOMPILE_COMMAND -I${ARROW_BINARY_DIR}/src)
endif()
if(ARROW_SOURCE_DIR)
list(APPEND PRECOMPILE_COMMAND -I${ARROW_SOURCE_DIR}/src)
endif()
if(NOT ARROW_USE_NATIVE_INT128)
foreach(boost_include_dir ${Boost_INCLUDE_DIRS})
list(APPEND PRECOMPILE_COMMAND -I${boost_include_dir})
endforeach()
endif()
add_custom_command(OUTPUT ${BC_FILE}
COMMAND ${PRECOMPILE_COMMAND}
DEPENDS ${SOURCE_FILE})
endfunction()
21 changes: 12 additions & 9 deletions cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -230,18 +230,21 @@ macro(build_dependency DEPENDENCY_NAME)
endif()
endmacro()

# Find modules are needed by the consumer in case of a static build, or if the
# linkage is PUBLIC or INTERFACE.
macro(provide_find_module PACKAGE_NAME ARROW_CMAKE_PACKAGE_NAME)
set(module_ "${CMAKE_SOURCE_DIR}/cmake_modules/Find${PACKAGE_NAME}.cmake")
if(EXISTS "${module_}")
message(STATUS "Providing CMake module for ${PACKAGE_NAME} as part of ${ARROW_CMAKE_PACKAGE_NAME} CMake package"
function(provide_cmake_module MODULE_NAME ARROW_CMAKE_PACKAGE_NAME)
set(module "${CMAKE_SOURCE_DIR}/cmake_modules/${MODULE_NAME}.cmake")
if(EXISTS "${module}")
message(STATUS "Providing CMake module for ${MODULE_NAME} as part of ${ARROW_CMAKE_PACKAGE_NAME} CMake package"
)
install(FILES "${module_}"
install(FILES "${module}"
DESTINATION "${ARROW_CMAKE_DIR}/${ARROW_CMAKE_PACKAGE_NAME}")
endif()
unset(module_)
endmacro()
endfunction()

# Find modules are needed by the consumer in case of a static build, or if the
# linkage is PUBLIC or INTERFACE.
function(provide_find_module PACKAGE_NAME ARROW_CMAKE_PACKAGE_NAME)
provide_cmake_module("Find${PACKAGE_NAME}" ${ARROW_CMAKE_PACKAGE_NAME})
endfunction()

macro(resolve_dependency DEPENDENCY_NAME)
set(options)