Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(udfs): generic support for lists/maps in UDFs #3054

Merged
merged 5 commits into from
Jul 24, 2019

Conversation

agavra
Copy link
Contributor

@agavra agavra commented Jul 3, 2019

Description

NOTE: This PR looks large, but the entire non-test, non-description functionality is under 500 lines and contained to a few files (see description below).

This change adds support for a single generic in UDFs so that container types are more accessible in our UDF framework. For example, the following UDF can now be used to create a map from two lists, the second of which has a generic type (see the QTT test for this UDF for an example):

  @Udf
  public final <T> Map<String, T> asMap(
      @UdfParameter final List<String> keys,
      @UdfParameter final List<T> values) {
    final Map<String, T> map = new HashMap<>();
    for (int i = 0; i < keys.size(); i++) {
      map.putIfAbsent(keys.get(i), values.get(i));
    }
    return map;
  }

To do this, the following changes were made:

  • when resolving the return type from a method, KsqlFunction will apply some logic to resolve any generics in the return type based on the actual argument types being passed in
  • UdfIndex will index generic types, and will traverse the trie for any input
  • Generics are represented as BYTES with schema name GENERIC:<name>. If we ever move to our own type system (i.e. not connect) this can have first class support
  • GenericsUtil and JavaToConnect have all of the magic, take a look there
  • some UDFs to showcase this feature were added: SUBLIST, AS_LIST and AS_MAP
  • Some SchemaUtil methods were moved to JavaToConnect (and corresponding tests were moved as well)

Missing

  • DESCRIBE functionality

Testing done

  • unit tets
  • QTTs with some sample UDFs

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra requested a review from blueedgenick July 3, 2019 21:27
@agavra agavra requested a review from a team as a code owner July 3, 2019 21:27
@big-andy-coates
Copy link
Contributor

Generics are represented as BYTES with schema name GENERIC:. If we ever move to our own type system (i.e. not connect) this can have first class support

We should do the move to our own schema type first IMHO.

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its not ready for review... don't raise a review ;)

@agavra
Copy link
Contributor Author

agavra commented Jul 15, 2019

We should do the move to our own schema type first IMHO.

While I agree with you in principle I think it depends on how invasive this change is, and looking at the code it's really not that bad. The hack is very limited to a small part of the code and the investment of moving to our own schema type system is very large.

If its not ready for review... don't raise a review ;)

I did this to share code in a way that was easy to reference :)

@agavra agavra changed the title feat(udfs): generic support for lists/maps in UDFs (PROTOTYPE) feat(udfs): generic support for lists/maps in UDFs Jul 18, 2019
@agavra agavra requested a review from a team July 18, 2019 00:49
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @agavra main thing is the naming and zero-indexing of arrays, and edge cases on the generics that may or may not be an issue. Otherwise looking good. Wouldn't mind another review myself, but if this is holding you up, don't let me stop you.

@@ -129,8 +135,25 @@ private KsqlFunction(
}


public Schema getReturnType() {
return returnType;
public Schema getReturnType(final List<Schema> arguments) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: having a param called arguments and a field with the same name doesn't make the code the easiest to read.

What's the difference between the two? Can we express this in the naming somehow to make it easier to grok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed the field to parameters and what is passed in as arguments. At some point I might go through the codebase and make this change everywhere because it's a distinction that is confused often.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure of the different between arguments and parameters - both are the same to me. So I'm not sure changing the name makes the code in getReturnType any more readable. But if we're changing the name, lets change the name of the getter, constructor param and toString too!

Copy link
Contributor Author

@agavra agavra Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A parameter is a variable in a method definition. When a method is called, the arguments are the data you pass into the method's parameters.

https://stackoverflow.com/a/156787/2258040

EDIT: I will make the corresponding changes in another PR since it touches a surprising amount of files.

return "DECIMAL("
+ DecimalUtil.precision(schema) + ", "
+ DecimalUtil.scale(schema) + ")";
SchemaUtil.requireValidBytesType(schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong - why are we extending the general use formatter to support this? The use of this special schema type should be very self contained within the UDF code...

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a couple of suggestions.

Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took the first pass through, still reviewing tests. Left some comments/questions inline

return Sets.union(
constituentGenerics(type.keySchema()),
constituentGenerics(type.valueSchema()));
case STRUCT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is currently not supported, right? maybe just throw an exception? It made me confused going between this and identifyGenerics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implemented because it's called to determine whether or not something has generics, which is used to determine whether or not to call identifyGenerics. Throwing an exception here would break the normal flow (this method is called for every KsqlFunction)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah that makes perfect sense. thanks!

}
}

return Sets.difference(constituentGenerics(schema), asMap.keySet()).isEmpty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the case when this would be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asMap represents the "resolved" generics. if every one of the constituent generics exists in the mapping of resolved generics, then this would be true (e.g. constituents(Map<K,V>) ~ {K:String, V:Integer} would be true because K and V are both represented)

if you meant to ask when this would ever be false, then you're right - I'm not actually sure it'll ever be false

&& GenericsUtil.instanceOf(schema, argument)) {
// check if this specific generic has already been resolved to some other
// type - if not, reserve it for future checks to accept
final Schema old = reservedGenerics.putIfAbsent(schema, argument);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we be tracking the resolved generic schemas in reservedGenerics, rather than the container schemas?

e.g. if I call accepts with MAP<A, B>, MAP<STRING, INT>, reservedGenerics should have A->STRING, B->INT not MAP<A, B> -> MAP<STRING, INT> ?

This way, if I have a signature like FOO(MAP<A, B>, LIST<A>), I make sure that A is always resolved to the same type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that was the intention - it seems that I only have tests covering the case with generic top levels. Good catch!


- Any generic in the return value of a method must appear in at least one of the method parameters
- The generic must not adhere to any interface. For example, ``<T extends Number>`` is not valid).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support type coercion for generics, but do for specific types, right? e.g. if I have a function like add(T a, T b) and I try to call it with something that resolves to INT, BIGINT, that won't work. But if I did add(Long a, Long b) it would. Might want to call that out in the docs.

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @agavra couple more minors.

Also be aware I've added: #3114.

@@ -129,8 +135,25 @@ private KsqlFunction(
}


public Schema getReturnType() {
return returnType;
public Schema getReturnType(final List<Schema> arguments) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure of the different between arguments and parameters - both are the same to me. So I'm not sure changing the name makes the code in getReturnType any more readable. But if we're changing the name, lets change the name of the getter, constructor param and toString too!

Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit inline, otherwise LGTM

}

final Map<Schema, Schema> genericMapping = GenericsUtil.resolveGenerics(schema, argument);
for (final Entry<Schema, Schema> entry : genericMapping.entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider pulling this pattern out into a common function in GenericsUtil, e.g:

public boolean checkGenericsConsistent(final Collection Entry<Schema, Schema> mappings, Optional<Map<Schema, Schema>> oReserved) {
    final Map<Schema, Schema> reserved = oReserved.orElseGet(new HashMap<>());
    for (final Entry<Schema, Schema> entry : mappings) {
        final Schema old = reserved.putIfAbsent(entry.getKey(), entry.getValue());
        if (old != null && !old.equals(entry.getValue())) {
          return false;
        }
    }
    return true;
}

Then you can use it from instanceOf, resolveGenerics, and here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to have as few methods with side-effects as possible, but the problem is that we need to both return a boolean and update a map... 🤔 I'll see if I can make that clean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember now why I didn't do this originally, for resolveGenerics I wanted to be able to have a detailed exception message that tell me exactly which one was the problem. If I throw an exception after returning I lose that information. I think the amount of boiler plate to get this to work for all three of the methods is more confusing than just doing it in place.

@agavra agavra merged commit a381c48 into confluentinc:master Jul 24, 2019
@agavra agavra deleted the generics_udf branch July 24, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants