Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add JavaScript UDF #2221

Merged
merged 2 commits into from Nov 17, 2023
Merged

feat: Add JavaScript UDF #2221

merged 2 commits into from Nov 17, 2023

Conversation

chubei
Copy link
Contributor

@chubei chubei commented Nov 14, 2023

Example:

  • dozer-config.yaml
app_name: js_udf
version: 1

connections:
  - name: input
    config: !JavaScript
      bootstrap_path: ./ingest.js

sql: SELECT square(value) as value INTO output FROM json_records;

endpoints:
  - name: input
    path: /input
    table_name: json_records

  - name: output
    path: /output
    table_name: output

udfs:
  - name: square
    config: !JavaScript
      module: ./square.js
  • ingest.js
(async function () {
    for (const value of [1, 2, 3]) {
        await Deno[Deno.internal].core.ops.ingest({
            typ: 'Insert',
            old_val: null,
            new_val: value,
        });
    }
}());
  • square.js
export default function (input) {
    return input * input;
}

Unit tests is not added yet.

The implementation itself is not complicated, however, we have to pass runtime around and change some functions to async, hence the 1000 lines change.

@Jesse-Bakker
Copy link
Contributor

I really dislike the whole pipeline being infected with async. Can we instead make the javascript UDF processor create its own Current-thread scheduler?

@chubei
Copy link
Contributor Author

chubei commented Nov 14, 2023

I'm not sure if I understand you correctly. If you are talking about the changes to ProcessorFactory, I think making it async makes sense, because it's already in async context. If you are taking about the Arc held in the udf, I don't have preference. It's just that I'm not used to using multiple runtime in the same process.

@Jesse-Bakker
Copy link
Contributor

I don't think whether or not something is in an async context dictates whether it should be async, but rather whether it does async work. As a very crude example, you would not write async fn add(a: usize, b: usize) { a + b }. That's why I think ProcessorFactory should not be async, because semantically, it's not async.

In this case, I prefer the UDF to run on its own runtime, because a UDF will probably be blocking and we don't want it to starve the main runtime. (I don't know if v8 will actually block the tokio runtime if a called js function blocks, but making the UDF create its own runtime also cleans up the API).

@chubei
Copy link
Contributor Author

chubei commented Nov 15, 2023

On the first point, ProcessorFactory is indeed async now, because it loads and executes js modules. Loading is async because it's mostly IO. Execution is async because js code runs in a dedicated thread, and the ProcessorFactory only needs to wait for the result. Note that this execution is module evaluation, not UDF execution.

On the second point, UDF execution also happens in dedicated thread, so runtime will not be blocked.

@Jesse-Bakker
Copy link
Contributor

What I'm trying to say is that the trait method should not be async. If one implementation happens to call an API that requires an async environment, we can use Handle::try_current().expect("JS UDF's require being run in a tokio runtime"), or it can spawn its own runtime at creation time (which is my preference, because implicit dependencies are nasty). This keeps the API for the processors a lot cleaner.

I'm willing to be convinced otherwise, so if you disagree, we can have a short sync discussion.

@chubei chubei added this pull request to the merge queue Nov 17, 2023
Merged via the queue into getdozer:main with commit 3ad02a8 Nov 17, 2023
6 checks passed
@chubei chubei deleted the feat/js_udf branch November 17, 2023 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants