-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added udf/udaf support to th etesting tool. #3252
base: master
Are you sure you want to change the base?
feat: Added udf/udaf support to th etesting tool. #3252
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the core of this change looks good, but I'm requesting changes because of the usability concern and the test speed concern.
Also, how does this fix #3094?
-------------- | ||
|
||
The KSQL testing tool supports UDFs and UDAFs the same as KSQL. However, unlike KSQL where the location of the jar file that contains the UDF/UDAF code is configured by ``ksql.extension.dir``, | ||
in the KSQL testing tool, you should put the jar file in the same folder as the test statements file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem like a very good user experience. I should be able to point you to where the JAR is instead of forcing me to place it in a specific place for tests (likely duplicating where I actually store it) by providing a --udf-jar path/to/jar
flag or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316900701)
Totally agree - we should not require the jar in the same folder as the statements file.
We should be able to run the test with a directory structure where test files are in sub-directories, and we shouldn't expect people to copy the jar into multiple places.
I would say we just load it from the class path. There is no need to do anything special if we do that.
I'm not sure we need a --udf-jar path/to/jar
flag for the application, the same can be achieved by adding it to the class path on the command line. Of course we could have a wrapper script, (we may already!), that took such a parameter and included it in the class path.
Side note: Can we currently run the tool against a directory of files or are we expecting users to script up reading a directory of files and invoking per-file? Surely the former... otherwise just the cost of spinning up the JVM for each test is going to be prohibitive!
new UdfLoader(mutableFR, | ||
statementFileObject.getParentFile(), | ||
KsqlTestingTool.class.getClassLoader(), | ||
value -> false, new UdfCompiler(Optional.empty()), Optional.empty(), true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value -> false, new UdfCompiler(Optional.empty()), Optional.empty(), true | |
value -> false, | |
new UdfCompiler(Optional.empty()), | |
Optional.empty(), | |
true |
@@ -122,14 +117,16 @@ static void runWithTripleFiles( | |||
true | |||
); | |||
|
|||
final File statementFileObject = new File(statementFile); | |||
final FunctionRegistry functionRegistry | |||
= TestRunnerFunctionRegistryUtil.getFunctionRegistry(statementFileObject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might slow down our tests (in KsqlTestingToolTest
) because we are loading the functions every time. If the test does not supply a custom jar, I think we should fallback to TestFunctionRegistry.INSTANCE.get()
- can you post how long it takes to run the tests before and after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316902086)
So TestFunctionRegistry.INSTANCE
will also be loading UDFs once per test file if the testing tool doesn't support running a directory structure of tests.
If we do as I suggest and require the UDF jar on the class path then TestFunctionRegistry.INSTANCE
will already have loaded the custom UDFs.
We certainly shouldn't be loading UDFs twice. This is really slow!
So our fix should not be to mess around with the logic here - it should always use TestFuncitonRegistry.INSTANCE
- and our fix for performance should be to support running the tool against recursively against a directory structure of tests. This can either be with single combined test files, or with the separate files, but matching them using some naming convention.
@@ -0,0 +1,2 @@ | |||
CREATE STREAM TEST (ID bigint, NAME varchar, VALUE double) WITH (kafka_topic='test_topic', value_format='DELIMITED', key='ID'); | |||
CREATE STREAM S1 as SELECT myudf(name) FROM test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how was test-udf-1.0-SNAPSHOT.jar
created? It's nice to be able to update this jar if necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have test UDFS in the code base, right? We should be either using those, or adding a new test-udf module. We shouldn't be checking in such jars as binaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @hjafarpour
I think a much simpler solution to allow supporting of custom UDFs is just to expect them on the class path. This requires no code changes! And is pretty standard. This also does not require the user to copy the jar anywhere.
I'm not sure if we supply a bash script for executing the testing tool, but if we don't then we should. Such a tool could have a --udf
switch, as @almog suggests, which would handle adding the jar or directory to the class path.
While reviewing this I've noted that it seems to test tool doesn't support operating on a directory structure of tests - is that the case? If so, then we need to fix this urgently, otherwise the performance of operating over many files is going to be terrible. Spinning up a JVM and loading UDFs for each file is going to be a killer!
Also, we should not be checking in jar binaries. Luckily, if we're just loading off the class path there is nothing to do, so no need for the custom UDF jar.
Finally, this PR does not fix the issue of test jars being available in our release (#3094) - this is cause by the testing tool continuing dependency on test jars.
-------------- | ||
|
||
The KSQL testing tool supports UDFs and UDAFs the same as KSQL. However, unlike KSQL where the location of the jar file that contains the UDF/UDAF code is configured by ``ksql.extension.dir``, | ||
in the KSQL testing tool, you should put the jar file in the same folder as the test statements file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316900701)
Totally agree - we should not require the jar in the same folder as the statements file.
We should be able to run the test with a directory structure where test files are in sub-directories, and we shouldn't expect people to copy the jar into multiple places.
I would say we just load it from the class path. There is no need to do anything special if we do that.
I'm not sure we need a --udf-jar path/to/jar
flag for the application, the same can be achieved by adding it to the class path on the command line. Of course we could have a wrapper script, (we may already!), that took such a parameter and included it in the class path.
Side note: Can we currently run the tool against a directory of files or are we expecting users to script up reading a directory of files and invoking per-file? Surely the former... otherwise just the cost of spinning up the JVM for each test is going to be prohibitive!
@@ -122,14 +117,16 @@ static void runWithTripleFiles( | |||
true | |||
); | |||
|
|||
final File statementFileObject = new File(statementFile); | |||
final FunctionRegistry functionRegistry | |||
= TestRunnerFunctionRegistryUtil.getFunctionRegistry(statementFileObject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Replying to Almog's thread: https://github.com/confluentinc/ksql/pull/3252/files#r316902086)
So TestFunctionRegistry.INSTANCE
will also be loading UDFs once per test file if the testing tool doesn't support running a directory structure of tests.
If we do as I suggest and require the UDF jar on the class path then TestFunctionRegistry.INSTANCE
will already have loaded the custom UDFs.
We certainly shouldn't be loading UDFs twice. This is really slow!
So our fix should not be to mess around with the logic here - it should always use TestFuncitonRegistry.INSTANCE
- and our fix for performance should be to support running the tool against recursively against a directory structure of tests. This can either be with single combined test files, or with the separate files, but matching them using some naming convention.
} | ||
|
||
public static FunctionRegistry getFunctionRegistry(final File statementFileObject) { | ||
Objects.requireNonNull(statementFileObject, "statementFileObject"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary check
@@ -119,7 +119,7 @@ public void load() { | |||
} | |||
|
|||
// Does not handle customer udfs, i.e the loader is the ParentClassLoader and path is internal | |||
public void loadUdfFromClass(final Class<?> ... udfClass) { | |||
void loadUdfFromClass(final Class<?> ... udfClass) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is not actually used, (except in tests), you could just delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is actually necessary to test UDFs whose loading crashes the UdfLoader. Moreover, I think it is useful to be able to specify a subset of UDFs one wants to load. Currently, there is no way to specify which specific UDFs to load since all on the classpath are loaded.
@@ -79,7 +79,7 @@ | |||
|
|||
|
|||
@SuppressWarnings("OptionalUsedAsFieldOrParameterType") | |||
UdfLoader( | |||
public UdfLoader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than make this internal constructor public, it would be better to add a new public constructor with just the bits needed. This internal one requires stuff that no client of the class should need to pass - this constructor is for testing only.
I suggest:
public UdfLoader(
final MutableFunctionRegistry functionRegistry,
final Optional<File> customUdfDirectory,
final Optional<Metrics> metrics
) {
this(
functionRegistry,
customUdfDirectory.orElse(new File("")),
Thread.currentThread().getContextClassLoader(),
customUdfDirectory
.map(pluginDir -> (Predicate<String>) new Blacklist(
new File(pluginDir, "resource-blacklist.txt")))
.orElse(s -> false),
new UdfCompiler(metrics),
metrics,
customUdfDirectory.isPresent()
);
}
@SuppressWarnings("OptionalUsedAsFieldOrParameterType")
@VisibleForTesting
UdfLoader(
final MutableFunctionRegistry functionRegistry,
final File pluginDir,
final ClassLoader parentClassLoader,
final Predicate<String> blacklist,
final UdfCompiler compiler,
final Optional<Metrics> metrics,
final boolean loadCustomerUdfs
) {
...
}
This uses the presence of the customPluginDirectory
to determine if plugins should be loaded.
The internal use of the constructor can also be switched to use this new constructor, to remove code duplication:
public static UdfLoader newInstance(final KsqlConfig config,
final MutableFunctionRegistry metaStore,
final String ksqlInstallDir
) {
final boolean loadCustomerUdfs = config.getBoolean(KsqlConfig.KSQL_ENABLE_UDFS);
final boolean collectMetrics = config.getBoolean(KsqlConfig.KSQL_COLLECT_UDF_METRICS);
final String extDirName = config.getString(KsqlConfig.KSQL_EXT_DIR);
final Optional<File> pluginDir = loadCustomerUdfs
? Optional.of(KsqlConfig.DEFAULT_EXT_DIR.equals(extDirName)
? new File(ksqlInstallDir, extDirName)
: new File(extDirName))
: Optional.empty();
final Optional<Metrics> metrics = collectMetrics
? Optional.of(MetricCollectors.getMetrics())
: Optional.empty();
if (config.getBoolean(KsqlConfig.KSQL_UDF_SECURITY_MANAGER_ENABLED)) {
System.setSecurityManager(ExtensionSecurityManager.INSTANCE);
}
return new UdfLoader(metaStore, pluginDir, metrics);
}
For extra brownie points you can drop the loadCustomerUdfs
field completely by switching the pluginDir
field to Optional<File>
...
@@ -0,0 +1,2 @@ | |||
CREATE STREAM TEST (ID bigint, NAME varchar, VALUE double) WITH (kafka_topic='test_topic', value_format='DELIMITED', key='ID'); | |||
CREATE STREAM S1 as SELECT myudf(name) FROM test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have test UDFS in the code base, right? We should be either using those, or adding a new test-udf module. We shouldn't be checking in such jars as binaries.
UDF support is key to efficiently use the ksql-test-runner. Any news on this subject? Or is there any workaround to do so? |
|
Description
This PR adds UDF/UDAF support to the testing tool.
I also removed some unused methods!
It also fixes #3094
Testing done
Test with UDF added.
Reviewer checklist