Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added udf/udaf support to th etesting tool. #3252

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hjafarpour
Copy link
Contributor

Description

This PR adds UDF/UDAF support to the testing tool.
I also removed some unused methods!
It also fixes #3094

Testing done

Test with UDF added.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@hjafarpour hjafarpour requested review from JimGalasyn and a team August 22, 2019 17:58
@hjafarpour hjafarpour changed the title Added udf/udaf support to th etesting tool. feat: Added udf/udaf support to th etesting tool. Aug 22, 2019
Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core of this change looks good, but I'm requesting changes because of the usability concern and the test speed concern.

Also, how does this fix #3094?

--------------

The KSQL testing tool supports UDFs and UDAFs the same as KSQL. However, unlike KSQL where the location of the jar file that contains the UDF/UDAF code is configured by ``ksql.extension.dir``,
in the KSQL testing tool, you should put the jar file in the same folder as the test statements file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem like a very good user experience. I should be able to point you to where the JAR is instead of forcing me to place it in a specific place for tests (likely duplicating where I actually store it) by providing a --udf-jar path/to/jar flag or something like that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316900701)

Totally agree - we should not require the jar in the same folder as the statements file.

We should be able to run the test with a directory structure where test files are in sub-directories, and we shouldn't expect people to copy the jar into multiple places.

I would say we just load it from the class path. There is no need to do anything special if we do that.

I'm not sure we need a --udf-jar path/to/jar flag for the application, the same can be achieved by adding it to the class path on the command line. Of course we could have a wrapper script, (we may already!), that took such a parameter and included it in the class path.

Side note: Can we currently run the tool against a directory of files or are we expecting users to script up reading a directory of files and invoking per-file? Surely the former... otherwise just the cost of spinning up the JVM for each test is going to be prohibitive!

new UdfLoader(mutableFR,
statementFileObject.getParentFile(),
KsqlTestingTool.class.getClassLoader(),
value -> false, new UdfCompiler(Optional.empty()), Optional.empty(), true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value -> false, new UdfCompiler(Optional.empty()), Optional.empty(), true
value -> false,
new UdfCompiler(Optional.empty()),
Optional.empty(),
true

@@ -122,14 +117,16 @@ static void runWithTripleFiles(
true
);

final File statementFileObject = new File(statementFile);
final FunctionRegistry functionRegistry
= TestRunnerFunctionRegistryUtil.getFunctionRegistry(statementFileObject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might slow down our tests (in KsqlTestingToolTest) because we are loading the functions every time. If the test does not supply a custom jar, I think we should fallback to TestFunctionRegistry.INSTANCE.get() - can you post how long it takes to run the tests before and after this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316902086)

So TestFunctionRegistry.INSTANCE will also be loading UDFs once per test file if the testing tool doesn't support running a directory structure of tests.

If we do as I suggest and require the UDF jar on the class path then TestFunctionRegistry.INSTANCE will already have loaded the custom UDFs.

We certainly shouldn't be loading UDFs twice. This is really slow!

So our fix should not be to mess around with the logic here - it should always use TestFuncitonRegistry.INSTANCE - and our fix for performance should be to support running the tool against recursively against a directory structure of tests. This can either be with single combined test files, or with the separate files, but matching them using some naming convention.

@@ -0,0 +1,2 @@
CREATE STREAM TEST (ID bigint, NAME varchar, VALUE double) WITH (kafka_topic='test_topic', value_format='DELIMITED', key='ID');
CREATE STREAM S1 as SELECT myudf(name) FROM test;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how was test-udf-1.0-SNAPSHOT.jar created? It's nice to be able to update this jar if necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have test UDFS in the code base, right? We should be either using those, or adding a new test-udf module. We shouldn't be checking in such jars as binaries.

@agavra agavra requested a review from a team August 22, 2019 22:00
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @hjafarpour

I think a much simpler solution to allow supporting of custom UDFs is just to expect them on the class path. This requires no code changes! And is pretty standard. This also does not require the user to copy the jar anywhere.

I'm not sure if we supply a bash script for executing the testing tool, but if we don't then we should. Such a tool could have a --udf switch, as @almog suggests, which would handle adding the jar or directory to the class path.

While reviewing this I've noted that it seems to test tool doesn't support operating on a directory structure of tests - is that the case? If so, then we need to fix this urgently, otherwise the performance of operating over many files is going to be terrible. Spinning up a JVM and loading UDFs for each file is going to be a killer!

Also, we should not be checking in jar binaries. Luckily, if we're just loading off the class path there is nothing to do, so no need for the custom UDF jar.

Finally, this PR does not fix the issue of test jars being available in our release (#3094) - this is cause by the testing tool continuing dependency on test jars.

--------------

The KSQL testing tool supports UDFs and UDAFs the same as KSQL. However, unlike KSQL where the location of the jar file that contains the UDF/UDAF code is configured by ``ksql.extension.dir``,
in the KSQL testing tool, you should put the jar file in the same folder as the test statements file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316900701)

Totally agree - we should not require the jar in the same folder as the statements file.

We should be able to run the test with a directory structure where test files are in sub-directories, and we shouldn't expect people to copy the jar into multiple places.

I would say we just load it from the class path. There is no need to do anything special if we do that.

I'm not sure we need a --udf-jar path/to/jar flag for the application, the same can be achieved by adding it to the class path on the command line. Of course we could have a wrapper script, (we may already!), that took such a parameter and included it in the class path.

Side note: Can we currently run the tool against a directory of files or are we expecting users to script up reading a directory of files and invoking per-file? Surely the former... otherwise just the cost of spinning up the JVM for each test is going to be prohibitive!

@@ -122,14 +117,16 @@ static void runWithTripleFiles(
true
);

final File statementFileObject = new File(statementFile);
final FunctionRegistry functionRegistry
= TestRunnerFunctionRegistryUtil.getFunctionRegistry(statementFileObject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316902086)

So TestFunctionRegistry.INSTANCE will also be loading UDFs once per test file if the testing tool doesn't support running a directory structure of tests.

If we do as I suggest and require the UDF jar on the class path then TestFunctionRegistry.INSTANCE will already have loaded the custom UDFs.

We certainly shouldn't be loading UDFs twice. This is really slow!

So our fix should not be to mess around with the logic here - it should always use TestFuncitonRegistry.INSTANCE - and our fix for performance should be to support running the tool against recursively against a directory structure of tests. This can either be with single combined test files, or with the separate files, but matching them using some naming convention.

}

public static FunctionRegistry getFunctionRegistry(final File statementFileObject) {
Objects.requireNonNull(statementFileObject, "statementFileObject");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary check

@@ -119,7 +119,7 @@ public void load() {
}

// Does not handle customer udfs, i.e the loader is the ParentClassLoader and path is internal
public void loadUdfFromClass(final Class<?> ... udfClass) {
void loadUdfFromClass(final Class<?> ... udfClass) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is not actually used, (except in tests), you could just delete it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is actually necessary to test UDFs whose loading crashes the UdfLoader. Moreover, I think it is useful to be able to specify a subset of UDFs one wants to load. Currently, there is no way to specify which specific UDFs to load since all on the classpath are loaded.

@@ -79,7 +79,7 @@


@SuppressWarnings("OptionalUsedAsFieldOrParameterType")
UdfLoader(
public UdfLoader(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than make this internal constructor public, it would be better to add a new public constructor with just the bits needed. This internal one requires stuff that no client of the class should need to pass - this constructor is for testing only.

I suggest:

public UdfLoader(
      final MutableFunctionRegistry functionRegistry,
      final Optional<File> customUdfDirectory,
      final Optional<Metrics> metrics
  ) {
    this(
        functionRegistry,
        customUdfDirectory.orElse(new File("")),
        Thread.currentThread().getContextClassLoader(),
        customUdfDirectory
            .map(pluginDir -> (Predicate<String>) new Blacklist(
                new File(pluginDir, "resource-blacklist.txt")))
            .orElse(s -> false),
        new UdfCompiler(metrics),
        metrics,
        customUdfDirectory.isPresent()
    );
  }

  @SuppressWarnings("OptionalUsedAsFieldOrParameterType")
  @VisibleForTesting
  UdfLoader(
      final MutableFunctionRegistry functionRegistry,
      final File pluginDir,
      final ClassLoader parentClassLoader,
      final Predicate<String> blacklist,
      final UdfCompiler compiler,
      final Optional<Metrics> metrics,
      final boolean loadCustomerUdfs
  ) {
  ...
 }

This uses the presence of the customPluginDirectory to determine if plugins should be loaded.

The internal use of the constructor can also be switched to use this new constructor, to remove code duplication:

public static UdfLoader newInstance(final KsqlConfig config,
      final MutableFunctionRegistry metaStore,
      final String ksqlInstallDir
  ) {
    final boolean loadCustomerUdfs = config.getBoolean(KsqlConfig.KSQL_ENABLE_UDFS);
    final boolean collectMetrics = config.getBoolean(KsqlConfig.KSQL_COLLECT_UDF_METRICS);
    final String extDirName = config.getString(KsqlConfig.KSQL_EXT_DIR);
    final Optional<File> pluginDir = loadCustomerUdfs
        ? Optional.of(KsqlConfig.DEFAULT_EXT_DIR.equals(extDirName)
        ? new File(ksqlInstallDir, extDirName)
        : new File(extDirName))
        : Optional.empty();

    final Optional<Metrics> metrics = collectMetrics
        ? Optional.of(MetricCollectors.getMetrics())
        : Optional.empty();

    if (config.getBoolean(KsqlConfig.KSQL_UDF_SECURITY_MANAGER_ENABLED)) {
      System.setSecurityManager(ExtensionSecurityManager.INSTANCE);
    }

    return new UdfLoader(metaStore, pluginDir, metrics);
  }

For extra brownie points you can drop the loadCustomerUdfs field completely by switching the pluginDir field to Optional<File>...

@@ -0,0 +1,2 @@
CREATE STREAM TEST (ID bigint, NAME varchar, VALUE double) WITH (kafka_topic='test_topic', value_format='DELIMITED', key='ID');
CREATE STREAM S1 as SELECT myudf(name) FROM test;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have test UDFS in the code base, right? We should be either using those, or adding a new test-udf module. We shouldn't be checking in such jars as binaries.

@big-andy-coates
Copy link
Contributor

Hey @vpapavas

I talked to @apurvam and we thought you might be able to pick up this PR and see it through...?

@big-andy-coates
Copy link
Contributor

Linking to #2804, which is the ticket that #3094 was closed as a duplicate of

@holgerbrandl
Copy link
Contributor

UDF support is key to efficiently use the ksql-test-runner. Any news on this subject? Or is there any workaround to do so?

Copy link

cla-assistant bot commented Nov 15, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test UDF functions are revealed
6 participants