Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Remove unnecessary arg conversion for UDFs #3595

Merged
merged 1 commit into from
Oct 18, 2019

Conversation

purplefox
Copy link
Contributor

@purplefox purplefox commented Oct 16, 2019

Description

Previously arg coercion for UDFs was attempting to coerce arguments passed to UDF into the types expected by the method signature, including doing such things as casting one numeric type to another or converting strings to other types.

However in reality the caller of the UDF (the expression evaluation) will never pass in invalid types as they are verified by the type parser and the results of any expression evaluation are already coerced to the right type before calling the UDF.

The only useful thing the arg coercion was doing is converting args for a varargs method into an array of the appropriate type - this code has been moved to the DynamicUdfInvoker.

Testing done

Amended unit tests as appropriate.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@purplefox purplefox requested a review from a team as a code owner October 16, 2019 10:48
@agavra
Copy link
Contributor

agavra commented Oct 17, 2019

Do we have any tests covering the end to end coercion mechanism that you can point me to? I'd be much more comfortable approving this with those tests in mind then just trying to scan the code. (And if we don't, then we should add tests for that flow)

@purplefox
Copy link
Contributor Author

purplefox commented Oct 18, 2019

End to end arg coercion tests might be tricky, as it seems there is no arg coercion going on.
I.e. if I try and add a test that tries to invoke a UDF that expects an INT with a STRING and verify what is passed to the invoker, then the invoker never gets called.
I tried various things but I could never get the invoker to be called with arguments other than ones that are correct types for the function no matter what I tried. I tried with both CodeGenRunnerTest and writing QTT tests.
Maybe there is a way of doing this, but it seems the parser will not let me write a query with UDF calls in it, where the types being passed to the UDF don't exactly match at least one of the functions defined in the UDF. As you have more experience here, maybe you know some devious ways of crafting a query that ends up with different types being sent to the UDF?

@big-andy-coates
Copy link
Contributor

Hummm... love PRs that remove unnecessary code!

I'm trying to think how we might end up invoking a UDF with something other than what it expects. KSQL is, in general, strongly typed. UDFs are matched on their signature, and looking at the why that matching works, it doesn't seem to do any implicit conversion of types. (I'm looking at SchemaUtil.isCompatible to see that). So it looks to me like we only match a UDF against the exact types in its signature. @agavra, I know you wrote this code, so you may be able to confirm / deny this.

So, assuming exact matching against the types, a UDF would only be passed the wrong type if something was happening up stream that wasn't checking types were as expected, e.g. a UDF that says its returning a List<Long> but at runtime actually returns a List<Integer>. I think this situation is possible if a UDF had a method signature that returned List and a _schemaProviderthat indicatedARRAY`, but I'm not sure. Of course, the right fix here is not coercion in the UDF code invocation code, but to have checking that return values from UDFs are of the correct (element) type. Such a test may already be in place.

In the future, I know we want/plan to support Object parameters, i.e. any type. But this also won't introduce the need for coercion, as the inputs are still strongly typed.

So... the conclusion I would draw is that this coercion is indeed unnecessary and can be removed. However, it might just be worth trying to write a bad-actor UDF to see how this is handled/detected/reported with the old and new code. As I said, you may find there is already code to handle this, and tests that test it, or maybe there isn't.

@purplefox
Copy link
Contributor Author

e.g. a UDF that says its returning a List but at runtime actually returns a List. I think this situation is possible if a UDF had a method signature that returned List and a _schemaProviderthat indicatedARRAY`, but I'm not sure.

I tried this and the return type matching the schema type is checked in the code so it's not possible to craft such a udf:

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
	at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
	at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

I think that's not an error that is thrown when invoking the UDF, that's an error thrown when building the topology. Which suggests your test isn't actually testing what we want.

I've knocked up a quick test on master with:

@UdfDescription(name = "badudf", description = "UDF that returns a type that does not match the advertised schema")
public class BadUdf {

  @Udf(schema = "ARRAY<VARCHAR>")
  public List<Integer> apply(final String value) {
    return ImmutableList.of(1);
  }
}

Note how it says its returning ARRAY<VARCHAR> but actually returns ARRAY<INT>.

And:

@UdfDescription(name = "GoodUdf", description = "Returns first element")
public class GoodUdf {

  @Udf
  public String apply(final List<String> value) {
    return value.isEmpty() ? "" : value.get(0);
  }
}

Then added a QTT test:

{
      "name": "Create a struct from a string",
      "statements": [
        "CREATE STREAM test (value STRING) WITH (kafka_topic='test_topic', value_format='JSON');",
        "CREATE STREAM OUTPUT AS SELECT GoodUdf(BadUdf(value)) AS value FROM test;"
      ],
      "inputs": [
        {"topic": "test_topic", "key": 1, "value": {"value": "a"}, "timestamp": 0}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": 1, "value": {"VALUE": "a"}, "timestamp": 0}
      ]
    }

When this is run on master the GoodUdf is invoked with a list containing the integer 1, which results in a class cast exception.

If we remove the GoodUdf from the test case then the test fails when trying to serialize the row, as it doesn't match the expected schema.

Conclusion... a bad-actor UDF would not have worked previously, so there's not change in behaviour with this PR.

FYI, while investigating I found this issue: #3620

@purplefox
Copy link
Contributor Author

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

I think that's not an error that is thrown when invoking the UDF, that's an error thrown when building the topology. Which suggests your test isn't actually testing what we want.

Yep, that was the point. I tried to write a test with a UDF where the return type didn't match the schema provider return but it didn't let me do that.

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Oct 18, 2019

Yep, that was the point. I tried to write a test with a UDF where the return type didn't match the schema provider return but it didn't let me do that.

Confused... see my example BadUdf.

@purplefox
Copy link
Contributor Author

purplefox commented Oct 18, 2019

tbh, people are going to be able to write bad UDFs that do naughty things anyway, e.g. its easy to cast a List<Integer> to a List<String> in a UDF so you don't even need to do the different schema to return type trick.

   @Udf("whatever")
  public List<String> foo() {
    List<Integer> bar = new ArrayList<>();
    bar.add(1);
    return (List<String>)(List)bar;
  }

Moreover, generics are erased at compile time, so at invocation time the engine can't distinguish between a List<String> and List<Integer> so doing runtime checks would be hard (well you could iterate through each element of the list and check them but that would be nasty and slow)

@purplefox purplefox merged commit 4c42530 into confluentinc:master Oct 18, 2019
@purplefox purplefox deleted the argcoercer_refactor branch October 28, 2019 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants