refactor: Remove unnecessary arg conversion for UDFs #3595

purplefox · 2019-10-16T10:48:03Z

Description

Previously arg coercion for UDFs was attempting to coerce arguments passed to UDF into the types expected by the method signature, including doing such things as casting one numeric type to another or converting strings to other types.

However in reality the caller of the UDF (the expression evaluation) will never pass in invalid types as they are verified by the type parser and the results of any expression evaluation are already coerced to the right type before calling the UDF.

The only useful thing the arg coercion was doing is converting args for a varargs method into an array of the appropriate type - this code has been moved to the DynamicUdfInvoker.

Testing done

Amended unit tests as appropriate.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

agavra · 2019-10-17T22:08:51Z

Do we have any tests covering the end to end coercion mechanism that you can point me to? I'd be much more comfortable approving this with those tests in mind then just trying to scan the code. (And if we don't, then we should add tests for that flow)

purplefox · 2019-10-18T06:26:59Z

End to end arg coercion tests might be tricky, as it seems there is no arg coercion going on.
I.e. if I try and add a test that tries to invoke a UDF that expects an INT with a STRING and verify what is passed to the invoker, then the invoker never gets called.
I tried various things but I could never get the invoker to be called with arguments other than ones that are correct types for the function no matter what I tried. I tried with both CodeGenRunnerTest and writing QTT tests.
Maybe there is a way of doing this, but it seems the parser will not let me write a query with UDF calls in it, where the types being passed to the UDF don't exactly match at least one of the functions defined in the UDF. As you have more experience here, maybe you know some devious ways of crafting a query that ends up with different types being sent to the UDF?

big-andy-coates · 2019-10-18T09:50:50Z

Hummm... love PRs that remove unnecessary code!

I'm trying to think how we might end up invoking a UDF with something other than what it expects. KSQL is, in general, strongly typed. UDFs are matched on their signature, and looking at the why that matching works, it doesn't seem to do any implicit conversion of types. (I'm looking at SchemaUtil.isCompatible to see that). So it looks to me like we only match a UDF against the exact types in its signature. @agavra, I know you wrote this code, so you may be able to confirm / deny this.

So, assuming exact matching against the types, a UDF would only be passed the wrong type if something was happening up stream that wasn't checking types were as expected, e.g. a UDF that says its returning a List<Long> but at runtime actually returns a List<Integer>. I think this situation is possible if a UDF had a method signature that returned List and a _schemaProviderthat indicatedARRAY`, but I'm not sure. Of course, the right fix here is not coercion in the UDF code invocation code, but to have checking that return values from UDFs are of the correct (element) type. Such a test may already be in place.

In the future, I know we want/plan to support Object parameters, i.e. any type. But this also won't introduce the need for coercion, as the inputs are still strongly typed.

So... the conclusion I would draw is that this coercion is indeed unnecessary and can be removed. However, it might just be worth trying to write a bad-actor UDF to see how this is handled/detected/reported with the old and new code. As I said, you may find there is already code to handle this, and tests that test it, or maybe there isn't.

purplefox · 2019-10-18T12:09:34Z

e.g. a UDF that says its returning a List but at runtime actually returns a List. I think this situation is possible if a UDF had a method signature that returned List and a _schemaProviderthat indicatedARRAY`, but I'm not sure.

I tried this and the return type matching the schema type is checked in the code so it's not possible to craft such a udf:

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
	at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
	at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

big-andy-coates

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

I think that's not an error that is thrown when invoking the UDF, that's an error thrown when building the topology. Which suggests your test isn't actually testing what we want.

I've knocked up a quick test on master with:

@UdfDescription(name = "badudf", description = "UDF that returns a type that does not match the advertised schema")
public class BadUdf {

  @Udf(schema = "ARRAY<VARCHAR>")
  public List<Integer> apply(final String value) {
    return ImmutableList.of(1);
  }
}

Note how it says its returning ARRAY<VARCHAR> but actually returns ARRAY<INT>.

And:

@UdfDescription(name = "GoodUdf", description = "Returns first element")
public class GoodUdf {

  @Udf
  public String apply(final List<String> value) {
    return value.isEmpty() ? "" : value.get(0);
  }
}

Then added a QTT test:

{
      "name": "Create a struct from a string",
      "statements": [
        "CREATE STREAM test (value STRING) WITH (kafka_topic='test_topic', value_format='JSON');",
        "CREATE STREAM OUTPUT AS SELECT GoodUdf(BadUdf(value)) AS value FROM test;"
      ],
      "inputs": [
        {"topic": "test_topic", "key": 1, "value": {"value": "a"}, "timestamp": 0}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": 1, "value": {"VALUE": "a"}, "timestamp": 0}
      ]
    }

When this is run on master the GoodUdf is invoked with a list containing the integer 1, which results in a class cast exception.

If we remove the GoodUdf from the test case then the test fails when trying to serialize the row, as it doesn't match the expected schema.

Conclusion... a bad-actor UDF would not have worked previously, so there's not change in behaviour with this PR.

FYI, while investigating I found this issue: #3620

purplefox · 2019-10-18T14:56:36Z

Caused by: io.confluent.ksql.util.KsqlException: Return type MAP<STRING, BIGINT> of UDF TEST_UDF does not match the declared return type MAP<STRING, STRING>.
at io.confluent.ksql.function.KsqlFunction.checkMatchingReturnTypes(KsqlFunction.java:181)
at io.confluent.ksql.function.KsqlFunction.getReturnType(KsqlFunction.java:155)

I think that's not an error that is thrown when invoking the UDF, that's an error thrown when building the topology. Which suggests your test isn't actually testing what we want.

Yep, that was the point. I tried to write a test with a UDF where the return type didn't match the schema provider return but it didn't let me do that.

big-andy-coates · 2019-10-18T15:05:15Z

Yep, that was the point. I tried to write a test with a UDF where the return type didn't match the schema provider return but it didn't let me do that.

Confused... see my example BadUdf.

purplefox · 2019-10-18T15:48:58Z

tbh, people are going to be able to write bad UDFs that do naughty things anyway, e.g. its easy to cast a List<Integer> to a List<String> in a UDF so you don't even need to do the different schema to return type trick.

   @Udf("whatever")
  public List<String> foo() {
    List<Integer> bar = new ArrayList<>();
    bar.add(1);
    return (List<String>)(List)bar;
  }

Moreover, generics are erased at compile time, so at invocation time the engine can't distinguish between a List<String> and List<Integer> so doing runtime checks would be hard (well you could iterate through each element of the list and check them but that would be nasty and slow)

purplefox requested a review from a team as a code owner October 16, 2019 10:48

purplefox requested review from agavra and big-andy-coates October 16, 2019 10:48

fix: Get rid of UdfArgCoercer

fd41ed1

purplefox force-pushed the argcoercer_refactor branch from 9c19120 to fd41ed1 Compare October 16, 2019 21:09

big-andy-coates approved these changes Oct 18, 2019

View reviewed changes

purplefox merged commit 4c42530 into confluentinc:master Oct 18, 2019

purplefox deleted the argcoercer_refactor branch October 28, 2019 00:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Remove unnecessary arg conversion for UDFs #3595

refactor: Remove unnecessary arg conversion for UDFs #3595

purplefox commented Oct 16, 2019 •

edited

Loading

agavra commented Oct 17, 2019 •

edited

Loading

purplefox commented Oct 18, 2019 •

edited

Loading

big-andy-coates commented Oct 18, 2019

purplefox commented Oct 18, 2019

big-andy-coates left a comment •

edited

Loading

purplefox commented Oct 18, 2019

big-andy-coates commented Oct 18, 2019 •

edited

Loading

purplefox commented Oct 18, 2019 •

edited

Loading

refactor: Remove unnecessary arg conversion for UDFs #3595

refactor: Remove unnecessary arg conversion for UDFs #3595

Conversation

purplefox commented Oct 16, 2019 • edited Loading

Description

Testing done

Reviewer checklist

agavra commented Oct 17, 2019 • edited Loading

purplefox commented Oct 18, 2019 • edited Loading

big-andy-coates commented Oct 18, 2019

purplefox commented Oct 18, 2019

big-andy-coates left a comment • edited Loading

Choose a reason for hiding this comment

purplefox commented Oct 18, 2019

big-andy-coates commented Oct 18, 2019 • edited Loading

purplefox commented Oct 18, 2019 • edited Loading

purplefox commented Oct 16, 2019 •

edited

Loading

agavra commented Oct 17, 2019 •

edited

Loading

purplefox commented Oct 18, 2019 •

edited

Loading

big-andy-coates left a comment •

edited

Loading

big-andy-coates commented Oct 18, 2019 •

edited

Loading

purplefox commented Oct 18, 2019 •

edited

Loading