
feat(query): 14925, support udf wasm #15107

Merged
merged 27 commits into from
Apr 23, 2024

Conversation

shamb0
Contributor

@shamb0 shamb0 commented Mar 27, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Integration of Wasm User-Defined Functions (UDFs)

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Contributor Author

shamb0 commented Mar 27, 2024

Integration of Wasm User-Defined Functions (UDFs): draft code drop for initial review.

  • The major changes are in the following files:
    • Physical layer: src/query/service/src/pipelines/processors/transforms/transform_udf_script.rs
    • it test code: src/query/ee/tests/it/ddl_basic/ddl_basic_01_create_udf.rs

Here is an overview of the strategy used to integrate the Wasm UDF functionality, a two-stage process whose steps are outlined below.

Compilation and Storage

  1. The Wasm UDF module is built as part of the test application.
  2. The compiled Wasm module is compressed using the Zstd compression algorithm.
  3. The compressed data is stored in persistent storage using the DataOperator.
  4. The path to the compressed data is embedded as part of the CREATE FUNCTION SQL statement.
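Step 4 above can be sketched as follows. This is a minimal, hypothetical helper (the function name is mine, not from the codebase); the SQL shape mirrors the test snippet quoted later in this conversation.

```rust
// Hypothetical helper for step 4: embed the storage path of the compressed
// Wasm module into the CREATE FUNCTION statement. The SQL shape follows the
// test code shown elsewhere in this thread.
fn build_create_udf_sql(wasm_module_path: &str) -> String {
    format!(
        "CREATE FUNCTION wasm_gcd (INT, INT) RETURNS BIGINT \
         LANGUAGE wasm HANDLER = 'wasm_gcd(int4,int4)->int4' AS $${wasm_module_path}$$"
    )
}

fn main() {
    let sql = build_create_udf_sql("stage/udf/wasm_gcd.wasm.zst");
    assert!(sql.contains("LANGUAGE wasm"));
    assert!(sql.ends_with("$$stage/udf/wasm_gcd.wasm.zst$$"));
    println!("{sql}");
}
```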

Execution

  1. During the physical layer pipeline transformation execution, the compressed data is retrieved from persistent storage using the DataOperator.
  2. The compressed data is decompressed using the Zstd decompression algorithm.
  3. The Wasm module is loaded into the arrow_udf_wasm::Runtime.
  4. The arrow_udf_wasm::Runtime is invoked to run the UDF functions.
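The decompress-then-load decision in steps 2 and 3 can be sketched with a std-only check on the zstd frame magic number (0x28 0xB5 0x2F 0xFD). The `decompress` body here is a stand-in only, not real zstd decoding; in practice this would call something like zstd::decode_all before handing the bytes to arrow_udf_wasm::Runtime.

```rust
// Sketch: decide whether the retrieved blob needs zstd decompression before
// it is handed to the Wasm runtime. The zstd frame magic is 0x28 B5 2F FD.
fn is_zstd_frame(blob: &[u8]) -> bool {
    blob.starts_with(&[0x28, 0xB5, 0x2F, 0xFD])
}

fn prepare_module(blob: Vec<u8>) -> Vec<u8> {
    if is_zstd_frame(&blob) {
        decompress(&blob) // would be real zstd decompression in practice
    } else {
        blob // already a raw Wasm module
    }
}

fn decompress(blob: &[u8]) -> Vec<u8> {
    // Stand-in only: strips the magic so the sketch is runnable.
    blob[4..].to_vec()
}

fn main() {
    let compressed = vec![0x28, 0xB5, 0x2F, 0xFD, 0x00, 0x61, 0x73, 0x6D];
    assert_eq!(prepare_module(compressed), vec![0x00, 0x61, 0x73, 0x6D]);
}
```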

Testing

The integration includes test code for the Wasm UDF functionality, located at src/query/ee/tests/it/ddl_basic/ddl_basic_01_create_udf.rs.

To run the it tests, use the following CLI commands:

  • Wasm UDF:
RUST_LOG=info \
	cargo test \
	--test it ddl_basic::ddl_basic_01_create_udf::test_udf_wasm_gcd

  • Python UDF:
RUST_LOG=info \
	cargo test \
	--test it ddl_basic::ddl_basic_01_create_udf::test_udf_py_gcd

  • JS UDF:
clear && \
    RUST_LOG=info \
    cargo test \
    --test it ddl_basic::ddl_basic_01_create_udf::test_udf_js_gcd

@shamb0 shamb0 changed the title feat-14925:support udf wasm, draft, code drop feat(query): 14925, support udf wasm, draft, code drop Mar 27, 2024
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Mar 27, 2024
@shamb0 shamb0 changed the title feat(query): 14925, support udf wasm, draft, code drop feat(query): 14925, support udf wasm Mar 27, 2024
…naging script execution

Signed-off-by: shamb0 <r.raajey@gmail.com>
…compressed and uncompressed WASM modules

Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Contributor Author

shamb0 commented Mar 29, 2024

Hi @sundy-li ,

Please find a quick update below:

  • As suggested, I introduced the enum ScriptRuntime to manage the script runtime functionality. I hope the code structure is better now.

  • The injection of the runtime instance now occurs within the TransformUdfScript::try_create() function.

  • The WASM script runtime can now handle both compressed (zstd) and uncompressed WASM modules.

  • To detect the MIME type of Wasm modules, I have implemented a temporary workaround using the infer crate. This was necessary because of a limitation in DataOperator::operator, which could not update the content type as expected in the unit-test context. I still need to verify this functionality in a standalone server test setup.

Here's a code snippet illustrating the issue:

let blocking_operator = DataOperator::instance().operator().blocking();
blocking_operator
    .write_with(&wasm_module_path, code_blob)
    .content_type("application/wasm")
    .call()?;
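For reference, a std-only check equivalent to what the infer crate does for application/wasm: a Wasm module begins with the 4-byte magic "\0asm" followed by a 4-byte version. The helper name here is hypothetical.

```rust
// A Wasm module starts with the magic bytes "\0asm" (0x00 0x61 0x73 0x6D)
// followed by a 4-byte version. This mirrors the magic-byte detection the
// infer crate performs for application/wasm; the function name is mine.
fn looks_like_wasm(blob: &[u8]) -> bool {
    blob.starts_with(b"\0asm")
}

fn main() {
    let wasm_header = [0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00];
    assert!(looks_like_wasm(&wasm_header));
    assert!(!looks_like_wasm(b"not wasm"));
}
```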

  • I believe all the captured requirements have been implemented and updated in the relevant code file (src/query/service/src/pipelines/processors/transforms/transform_udf_script.rs).

  • The code changes are now ready for review.

@sundy-li
Member

Others LGTM.

Let's continue:

  1. make CI happy
  2. change the unit test into a logic test

@shamb0
Contributor Author

shamb0 commented Apr 1, 2024

@sundy-li, thank you for your review feedback. I am currently working on understanding the logic test and will provide an update soon. Before proceeding with the implementation, I need to clarify the approach for preparing the Wasm module. Specifically, should we prepare the Wasm module within the test application itself, or should it be integrated into the physical layer under the ScriptRuntime::create_wasm_runtime function?

@sundy-li
Member

sundy-li commented Apr 1, 2024

Specifically, should we prepare the WASM module within the test application itself, or do we need to integrate it into the Physical layer under the ScriptRuntime::create_wasm_runtime function?

We should read the Wasm binary inside resolve_udf_script in type_checker.rs, and then add the result into UDFType::Script:

#[derive(Clone, Debug, Hash, Eq, PartialEq, serde::Serialize, serde::Deserialize, EnumAsInner)]
pub enum UDFType {
    Server(String),                   // server_addr
    Script((String, String, String)), // Lang, Version, Code | WasmBinary 
}

@shamb0
Contributor Author

shamb0 commented Apr 15, 2024

Hi @sundy-li

I am resuming work on this task after a long break, as I was busy with my bootcamp capstone project which I completed yesterday.

I would like to discuss a few things and get some clarification from you.

Here's the context:

  • In the current implementation, after creating the WebAssembly (Wasm) module, it is compressed using the zstd codec. The resulting compressed binary blob is then uploaded to external storage, and the path to this stored file is passed as an argument in the SQL CREATE statement. This flow is functioning as expected.

    let command = format!(
        r#"CREATE FUNCTION wasm_gcd (INT, INT) RETURNS BIGINT LANGUAGE wasm HANDLER = 'wasm_gcd(int4,int4)->int4' AS $${wasm_module_path}$$"#
    );
    log::info!("Create UDF DDL command: {}", command);
    fixture.execute_command(&command).await?;
  • As an experiment, I tried an alternative approach, since the zstd blob was only around 0.65 MB: I embedded the binary blob directly in the SQL CREATE statement. However, this resulted in a stack overflow error originating in sql/src/planner/planner.rs and propagating through the Abstract Syntax Tree (AST) parser. I have included the complete call stack analysis in the file link for your reference.

    let command = format!(
        r#"CREATE FUNCTION wasm_gcd (INT, INT) RETURNS BIGINT LANGUAGE wasm HANDLER = 'wasm_gcd(int4,int4)->int4' AS $${:#?}$$"#,
        &code_blob
    );
    log::info!("Create UDF DDL command:");
    fixture.execute_command(&command).await?;
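One contributing factor worth noting: formatting a Vec<u8> with {:#?} renders every byte as decimal on its own indented line, so a ~0.65 MB blob expands to several megabytes of SQL text for the planner to walk. A small runnable demonstration of that expansion:

```rust
// {:#?} on Vec<u8> prints each byte on its own line ("    0,\n"), so the
// rendered text is several times larger than the blob itself.
fn main() {
    let blob: Vec<u8> = vec![0u8; 1024];
    let rendered = format!("{:#?}", &blob);
    // Each byte costs several characters of text, far more than one byte.
    assert!(rendered.len() > blob.len() * 5);
    println!("1 KiB blob renders to {} bytes of text", rendered.len());
}
```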

My questions are:

  1. Is there any way to embed the binary blob as part of the SQL statement, perhaps with an extra marker flag, so that the SQL planner and AST parser can skip over the binary blob during parsing?

  2. If not, should I proceed with implementing the SQL logic test case using the existing approach of moving the binary blob to an external location and embedding the path as part of the SQL statement?

I hope I've provided enough details. Please let me know if you need any additional information that I may have missed.

@sundy-li
Member

sundy-li commented Apr 15, 2024

If not, should I proceed with implementing the SQL logic test case using the existing approach of moving the binary blob to an external location and embedding the path as part of the SQL statement?

That will be better.

Maybe we can support a new syntax for WASM, eg:

CREATE FUNCTION wasm_gcd (INT, INT) RETURNS BIGINT LANGUAGE wasm HANDLER = 'wasm_gcd' from location; 

We need to refactor the struct UDFScript into:

#[derive(Clone, Debug, Eq, PartialEq)]
pub struct UDFScript {
    pub code: String,
    pub location: Option<FileLocation>,
    pub handler: String,
    pub language: String,
    pub arg_types: Vec<DataType>,
    pub return_type: DataType,
    pub runtime_version: String,
}

During resolve_udf, if location is some, we override code from the content of this location.

let (stage_info, path) = resolve_file_location();
let op = StageTable::get_op(&stage_info)?;
op.read..
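The override described above can be sketched as follows. This is a hedged, simplified stand-in: the field set is trimmed, and `FileLocation` and `read_from_location` are placeholders for the real resolve_file_location / StageTable::get_op / op.read path.

```rust
// Sketch of the proposed resolve_udf behavior: when `location` is Some, the
// code is overridden by the content read from that location. Types and
// helpers are simplified stand-ins, not the real Databend APIs.
#[derive(Clone, Debug, PartialEq)]
struct FileLocation(String);

struct UDFScript {
    code: String,
    location: Option<FileLocation>,
}

fn read_from_location(loc: &FileLocation) -> String {
    // Stand-in for resolve_file_location + StageTable::get_op + op.read.
    format!("<bytes read from {}>", loc.0)
}

fn resolve_code(script: &UDFScript) -> String {
    match &script.location {
        Some(loc) => read_from_location(loc),
        None => script.code.clone(),
    }
}

fn main() {
    let inline = UDFScript { code: "return 1".into(), location: None };
    assert_eq!(resolve_code(&inline), "return 1");

    let staged = UDFScript {
        code: String::new(),
        location: Some(FileLocation("stage/udf/gcd.wasm".into())),
    };
    assert_eq!(resolve_code(&staged), "<bytes read from stage/udf/gcd.wasm>");
}
```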

@shamb0
Contributor Author

shamb0 commented Apr 15, 2024

If not, should I proceed with implementing the SQL logic test case using the existing approach of moving the binary blob to an external location and embedding the path as part of the SQL statement?

That will be better.

Fine, thank you.

Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Contributor Author

shamb0 commented Apr 21, 2024

Hi @sundy-li, please find an update on the latest changes:

  1. The sqllogictests have been updated to include a test suite for testing UDF (User-Defined Function) scripts.

  2. The UDFType enum has been updated to introduce a new variant called WasmScript. This variant is used to process WebAssembly (Wasm) modules using the Vec<u8> type. The updated UDFType enum now looks like this:

#[derive(Clone, Debug, Hash, Eq, PartialEq, serde::Serialize, serde::Deserialize, EnumAsInner)]
pub enum UDFType {
    Server(String),
    Script((String, String, String)),
    WasmScript((String, String, Vec<u8>)),
}
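To illustrate, here is a sketch of how a dispatcher might select the runtime per variant. The enum mirrors the one above (derives omitted for brevity); the dispatch function and its return values are hypothetical, not the real transform code.

```rust
// Sketch of dispatching on the updated UDFType: the WasmScript variant
// carries the module bytes directly, so the runtime can be built without a
// second fetch. The dispatch function below is hypothetical.
enum UDFType {
    Server(String),                        // server_addr
    Script((String, String, String)),      // lang, version, code
    WasmScript((String, String, Vec<u8>)), // lang, version, module bytes
}

fn runtime_kind(udf: &UDFType) -> &'static str {
    match udf {
        UDFType::Server(_) => "udf-server",
        UDFType::Script(_) => "script-runtime",
        UDFType::WasmScript(_) => "wasm-runtime",
    }
}

fn main() {
    let wasm = UDFType::WasmScript((
        "wasm".into(),
        "1".into(),
        vec![0x00, 0x61, 0x73, 0x6D],
    ));
    assert_eq!(runtime_kind(&wasm), "wasm-runtime");
    assert_eq!(runtime_kind(&UDFType::Server(String::new())), "udf-server");
}
```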

  3. Test cases have been removed from the ee::it module.

@shamb0 shamb0 marked this pull request as ready for review April 21, 2024 05:30
Signed-off-by: shamb0 <r.raajey@gmail.com>
@sundy-li
Member

sundy-li commented Apr 21, 2024

LGTM, and I believe this pull request does not require support for arrow-udf-python, because of a build issue; see risingwavelabs/arrow-udf#15.

Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Contributor Author

shamb0 commented Apr 21, 2024

LGTM, and i believe that this pull request does not require support for arrow-udf-python. This is because there was a building issue, which can be seen at risingwavelabs/arrow-udf#15.

Thank you for the review comments. I value the feedback. I will go through the comments carefully and get back to you soon if I have any questions or updates.

Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@hanxuanliang
Contributor

LGTM, and i believe that this pull request does not require support for arrow-udf-python. This is because there was a building issue, which can be seen at risingwavelabs/arrow-udf#15.

I have posted our current questions to the pyo3 Discord, to see whether the pyo3 team has any targeted suggestions:

PYO3_PYTHON question

@shamb0
Contributor Author

shamb0 commented Apr 23, 2024


@hanxuanliang, as per @sundy-li's suggestion, we will create a separate pull request (PR) for integrating the 'arrow-udf-python' library. I have a backup of the changes, and they are working fine with Python version 3.12.2. However, a few commands in the 'make lint' process are failing, which requires further investigation.

Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
Signed-off-by: shamb0 <r.raajey@gmail.com>
@shamb0
Contributor Author

shamb0 commented Apr 23, 2024

@sundy-li

  • While 99% of the CI jobs pass, two remaining fuzz test cases trigger a stack overflow error, preventing the CI pipeline from completing successfully.
------------
     Summary [ 191.363s] 1623 tests run: 1621 passed, 2 failed, 9 skipped
     SIGABRT [  32.188s] databend-enterprise-query::it aggregating_index::index_scan::test_fuzz
     SIGABRT [  32.737s] databend-enterprise-query::it aggregating_index::index_scan::test_fuzz_with_spill
error: test run failed
Error: Process completed with exit code 100.
  • The stack trace suggests the issue is related to src/query/sql/src/planner/semantic/type_check.rs::resolve_binary_op(). The complete call stack trace is available in the following Gist Link


Could you please provide guidance on how to investigate and resolve this stack overflow issue?

Thank you for your assistance.

Member

@sundy-li sundy-li left a comment

LGTM, thanks @shamb0 for your contribution in this impressive pr.

@sundy-li sundy-li added this pull request to the merge queue Apr 23, 2024
Merged via the queue into datafuselabs:main with commit ba9b107 Apr 23, 2024
72 checks passed

Successfully merging this pull request may close these issues.

Feature: support wasm udf