[FLINK-39392][table] Support conditional traits for PTFs#27886
Conversation
twalthr
left a comment
There was a problem hiding this comment.
Thank you for this PR @gustavodemorais. Overall I'm +1 for this change. However, we need to clearly define the boundaries, when static arguments are fully resolved and a trait condition has no effect anymore. Some locations look currently very hacky, we should take another look. Also we need Table API support which is not covered by this PR, at least not in tests.
| } | ||
|
|
||
| /** True when the named boolean argument is provided and its value is {@code true}. */ | ||
| static TraitCondition argIsTrue(final String name) { |
There was a problem hiding this comment.
generialize the is true and is false to:
static <T> TraitCondition argIsEqualTo(T obj) {ctx.getScalarArgument(name, obj.getClass) == obj}
| } | ||
|
|
||
| final int timeColumn = inputTimeColumns.get(tableArgCall.getInputIndex()); | ||
| final org.apache.flink.table.types.inference.TraitContext traitCtx = |
There was a problem hiding this comment.
pay attention to full imports, seems Claude loves to do this
| final org.apache.flink.table.types.inference.TraitContext traitCtx = | |
| final TraitContext traitCtx = |
There was a problem hiding this comment.
same comment as above. resolve the static arg as early as possible to not reconstruct TraitContext multiple times
| if (operand.getKind() == SqlKind.DEFAULT || !(operand instanceof RexLiteral)) { | ||
| return Optional.empty(); | ||
| } | ||
| return Optional.ofNullable(((RexLiteral) operand).getValueAs(clazz)); |
There was a problem hiding this comment.
this is too simple, it should follow the same rules as CallContext does. Otherwise it won't be possible e.g. to get Instant.class or other literals.
There was a problem hiding this comment.
That's right, I've extended it to support NULL, DEFAULT, DESCRIPTOR, MAP and literals using a similar logic as we have in CallContext.
We could theoretically create an OperatorBindingCallContext here to avoid code duplication and delegate but it's heavy to have a whole OperatorBindingCallContext only for this and creates a tight coupling between StreamPhysicalProcessTableFunction
| final boolean hasPartitionBy = partitionKeys.length > 0; | ||
| final boolean reportedAsSet = tableCharacteristic.semantics == Semantics.SET; | ||
| final boolean setIsConditional = | ||
| staticArg.hasConditionalTrait(StaticArgumentTrait.SET_SEMANTIC_TABLE); |
There was a problem hiding this comment.
too fragile. determine the effective StaticArgument first and then execute this logic.
There was a problem hiding this comment.
We're using now StreamPhysicalProcessTableFunction.buildTraitContext and resolving the traits here before using it
raminqaf
left a comment
There was a problem hiding this comment.
Did a first pass and left some comments for improvements
| */ | ||
| public StaticArgument addTraitWhen( | ||
| final TraitCondition condition, final StaticArgumentTrait trait) { | ||
| final List<ConditionalTrait> newList = new ArrayList<>(this.conditionalTraits); |
There was a problem hiding this comment.
Any reason we copy the list everytime we add new elements to it?
There was a problem hiding this comment.
In general, not strictly necessary in this case but I follow the immutable builder pattern by default. It's a fluent builder that returns a new instance at each step, just like how StaticArgument.table() returns a new instance. It's the same pattern as EnumSet.of() or List.of() - each call produces a distinct value
| @Nullable Class<?> conversionClass, | ||
| boolean isOptional, | ||
| EnumSet<StaticArgumentTrait> traits, | ||
| List<ConditionalTrait> conditionalTraits) { |
There was a problem hiding this comment.
Not sure if a list is the most suitable data structure for the traits here. Maybe representing as a HashMap would make more sense. With a list we could still allow the user to define two conditional traits with the same StaticArgument.
[StaticArgumentTrait.SET_SEMANTIC_TABLE, hasPartitionBy, StaticArgumentTrait.SET_SEMANTIC_TABLE, hasSomeCondition]
There was a problem hiding this comment.
I've given this some though and decided we could go with OR semantics for multiple conditional traits for the same trait. This makes it possible that the user writes multiple simple condition traits and thus a list makes sense. the trait is activated if any of its conditions is met. I've documented the behavior
| */ | ||
| @PublicEvolving | ||
| @FunctionalInterface | ||
| public interface TraitCondition extends Serializable { |
There was a problem hiding this comment.
How about we extend Java's Predicate?
| public interface TraitCondition extends Serializable { | |
| public interface TraitCondition extends Serializable, Predicate<TraitContext> { |
There was a problem hiding this comment.
If we extend Predicate to inherit and(), or(), negate() those return Predicate not TraitCondition, so composition breaks
Also, we can't simply use Predicate because we need it to be serializable, since these are stored. I think we'll have to go with the TraitCondition as it is
18f4c00 to
9e1ea8a
Compare
|
Thanks for the reviews, @raminqaf and @twalthr. I've addressed all the comments and tried to make the maintain the change scope to only adding the conditional traits concept. I've also only resolved it where it's necessary for now and we can proceed to do that for the other places where we need to do it as we add new features, instead of resolving it everywhere where we read the traits. Take a look |
9e1ea8a to
8e8700d
Compare
|
One note @twalthr, I've had the feeling multiple times that "isPtfUpsert" wasn't the best name for the function so I did a small refactoring and renamed it to "requiresUpdateBefore" adjusting the logic in the appropriate places. That helps with understanding that it's related to the input of the ptf and says exactly what we're checking. I think the code is easier to read now. Take a look and let me know what you think c0e2242 |
twalthr
left a comment
There was a problem hiding this comment.
Thank you @gustavodemorais. I added some more comments to harden the PTF framework for this.
| | `op` | No | A `DESCRIPTOR` with a single column name for the operation code column. Defaults to `op`. | | ||
| | Parameter | Required | Description | | ||
| |:-------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | `input` | Yes | The input table. With `PARTITION BY`, rows with the same key are co-located and run in the same operator instance. Without `PARTITION BY`, each row is processed independently. Accepts insert-only, retract, and upsert tables. For upsert tables, providing `PARTITION BY` is recommended for better performance. | |
There was a problem hiding this comment.
| | `input` | Yes | The input table. With `PARTITION BY`, rows with the same key are co-located and run in the same operator instance. Without `PARTITION BY`, each row is processed independently. Accepts insert-only, retract, and upsert tables. For upsert tables, providing `PARTITION BY` is recommended for better performance. | | |
| | `input` | Yes | The input table. With `PARTITION BY`, rows with the same key are co-located and run in the same operator instance. Without `PARTITION BY`, each row is processed independently. Accepts insert-only, retract, and upsert tables. For upsert tables, a provided `PARTITION BY` must match the upsert key of the subquery. | |
There was a problem hiding this comment.
The key doesn't have to match exactly, it can be a subset. Updating with "For upsert tables, the provided PARTITION BY key should match or be a subset of the upsert key of the subquery"
| } | ||
|
|
||
| /** A trait that is conditionally added based on a {@link TraitCondition}. */ | ||
| private static final class ConditionalTrait implements Serializable { |
There was a problem hiding this comment.
| private static final class ConditionalTrait implements Serializable { | |
| private static final class ConditionalTrait { |
| for (int i = 0; i < staticArgs.size(); i++) { | ||
| final StaticArgument arg = staticArgs.get(i); | ||
| if (arg.is(StaticArgumentTrait.SCALAR) && arg.getName().equals(name)) { | ||
| if (!callContext.isArgumentLiteral(i)) { |
There was a problem hiding this comment.
just do double check: do we also need a null check here via callContext.isNullLiteral? Or is that covered by getArgumentValue?
There was a problem hiding this comment.
getArgumentValue returns Optional.empty() for null args, and isArgumentLiteral returns false for null literals. So the null case is already handled
| if (semantics == null) { | ||
| return Stream.<Field>empty(); | ||
| } |
There was a problem hiding this comment.
nit:
| if (semantics == null) { | |
| return Stream.<Field>empty(); | |
| } | |
| * | ||
| * StaticArgument.table("input", Row.class, false, EnumSet.of(TABLE, SUPPORT_UPDATES)) | ||
| * .withConditionalTrait(SET_SEMANTIC_TABLE, hasPartitionBy()); | ||
| * }</pre> |
There was a problem hiding this comment.
Mention that hashCode/Equals need to be implemented otherwise StaticArgument.equals/hashCode won't work.
| */ | ||
| @PublicEvolving | ||
| @FunctionalInterface | ||
| public interface TraitCondition extends Serializable { |
There was a problem hiding this comment.
| public interface TraitCondition extends Serializable { | |
| public interface TraitCondition { |
There was a problem hiding this comment.
I though we needed it - removed
| if (arg.is(StaticArgumentTrait.ROW_SEMANTIC_TABLE)) { | ||
| semantics = TableCharacteristic.Semantics.ROW; | ||
| } else if (arg.is(StaticArgumentTrait.SET_SEMANTIC_TABLE)) { | ||
| // Report SET if it may apply - which allows the use of Partition BY |
There was a problem hiding this comment.
| // Report SET if it may apply - which allows the use of Partition BY | |
| // Report SET semantics if it may apply - which allows the use of PARTITION BY |
|
|
||
| final int timeColumn = inputTimeColumns.get(tableArgCall.getInputIndex()); | ||
| final StaticArgument resolvedArg = | ||
| tableArg.applyConditionalTraits( |
There was a problem hiding this comment.
Can we apply them earlier? Ideally, this topic is done as early as possible. org.apache.flink.table.planner.plan.rules.physical.stream.StreamPhysicalProcessTableFunctionRule#convert could be a good location.
There was a problem hiding this comment.
Yes, that's a better idea. I've moved the resolution to StreamPhysicalProcessTableFunction's constructor instead of the rule. What do you think?
There was a problem hiding this comment.
I've moved the resolution to StreamPhysicalProcessTableFunction's constructor
Do we then need resolution in ChangelogModeInferenceProgram?
| * which require the full CallContext bridge. | ||
| */ | ||
| @SuppressWarnings("unchecked") | ||
| private static <T> Optional<T> findScalarLiteral( |
There was a problem hiding this comment.
Reuse StreamPhysicalProcessTableFunction.toCallContext
There was a problem hiding this comment.
Thanks for the pointer. I've created an overload that supports feting the scalar literals but doesn't require the additional changelog information which we don't use and do not require. Using it now
public static CallContext toCallContext(RexCall udfCall) {
return toCallContext(udfCall, null, null, null);
}
aedec22 to
aa4f50c
Compare
raminqaf
left a comment
There was a problem hiding this comment.
Thanks for this PR and solving this issue! Left some comments and questions!
| */ | ||
| public StaticArgument withConditionalTrait( | ||
| final StaticArgumentTrait trait, final TraitCondition condition) { | ||
| if (trait == StaticArgumentTrait.SCALAR |
There was a problem hiding this comment.
Should we make these part of the ConditionalTrait class? Suggestion a EnumSet called IllegalConditionalTraits and having a method (isConditionalTrait) that checks it.
|
|
||
| /** True when the named scalar argument equals the expected value. */ | ||
| @SuppressWarnings("unchecked") | ||
| static <T> TraitCondition argIsEqualTo(final String name, final T expected) { |
There was a problem hiding this comment.
Should we insure type safety by passing the Class<T> clazz to the method?
There was a problem hiding this comment.
It's a valid question - I've only added this as a placeholder example for now. We can check if type verification is necessary when we use it in the following prs and adjust it acordingly
There was a problem hiding this comment.
Given that the expected always comes from reviewed implementation, I don't think he class is necessary here.
| * <p>Conditions are evaluated at planning time using the {@link TraitContext} which provides access | ||
| * to the SQL call's properties (PARTITION BY presence, scalar literal values, etc.). | ||
| * | ||
| * <p>Implementations must implement {@code hashCode} and {@code equals} for {@link |
There was a problem hiding this comment.
Javadocs is inforcing to implement hashCode and equals but none of the implementations bellow (argIsEqualTo and not) is doing this.
| @Nullable final TableSemantics semantics, | ||
| final CallContext callContext, | ||
| final List<StaticArgument> staticArgs) { | ||
| return new TraitContext() { |
There was a problem hiding this comment.
We have two implementations of TraitContext. One here and the other one in StreamPhysicalProcessTableFunction any reason for that?
If they need to be similar move the implementation into TraitContext itself as a static factory (TraitContext.of( @Nullable TableSemantics semantics, CallContext callContext, List<StaticArgument> staticArgs))
There was a problem hiding this comment.
Yes, we need the CallContext to resolve the arguments and there no obvious single place to resolve this once so that we don't have multiple places. At logical time, for example when we create TypeInputStrategy, we only have the Logical Operator instance but not the CallContext yet - a logical operator can be reused multiple times throughout multiple calls inside a single SQL command. But I agree with you: I don't like this and that's what i'm doing now, trying to find a better place so we do it only once
There was a problem hiding this comment.
I've moved the implementation to TraitContext and simplified resolution. See #27886 (comment)
| /** ROW and SET semantics are mutually exclusive - adding one removes the other. */ | ||
| private static void removeMutuallyExclusiveTraits( | ||
| final EnumSet<StaticArgumentTrait> traits, final StaticArgumentTrait adding) { | ||
| if (adding == StaticArgumentTrait.SET_SEMANTIC_TABLE) { |
There was a problem hiding this comment.
We can introduce a getIncompatibleWith() method and simplify this to
| if (adding == StaticArgumentTrait.SET_SEMANTIC_TABLE) { | |
| private static void removeMutuallyExclusiveTraits( | |
| EnumSet<StaticArgumentTrait> traits, StaticArgumentTrait adding) { | |
| traits.removeAll(adding.getIncompatibleWith()); | |
| } |
or add an else branch for fast fails (personally prefer a switch with default here)
f7b4498 to
e0c725d
Compare
…nd ExecNode deserialization
e0c725d to
0c98e03
Compare
…or TraitCondition
0c98e03 to
f1fe2f2
Compare
|
I've tried to simplify the code as much as possible, @twalthr. It's not trivial to implement the feature: having only one place where we resolve the traits is unfortunately conceptually not possible. I've adjusted the code so that all downstream code can simply read the traits and assume they're resolved. For that, I had to resolve it in three places
We have to resolve it inside SystemTypeInference's inferType and inferInputType because we only there have the actual Call passed as a param. We have to resolve it in FlinkLogicalTableFunctionScan because it's the first place we actually create the RexCall. And we also have to resolve it StreamExecProcessTableFunction which is where we restore our unresolved RexCall from the compiled plan. Now, all relevant places where we read traits should receive the resolved traits |
twalthr
left a comment
There was a problem hiding this comment.
Great job @gustavodemorais. The PR looks much better now. I left 2 remaining comments. But should be good in the next iteration.
| * kind + args}; the {@code impl} predicate is reused but never compared, so two conditions | ||
| * built from the same factory inputs are equal. | ||
| */ | ||
| final class BuiltInCondition implements TraitCondition { |
There was a problem hiding this comment.
| final class BuiltInCondition implements TraitCondition { | |
| private final class BuiltInCondition implements TraitCondition { |
There was a problem hiding this comment.
Java forbids private on nested types inside an interface. I've moved it to its own file. Top-level package-private class achieves the same encapsulation; users outside org.apache.flink.table.types.inference can't reach BuiltInCondition
| * org.apache.flink.table.types.inference.CallContext}, since the planner doesn't carry one. The | ||
| * validation-time equivalent is {@link TraitContext#of}. | ||
| */ | ||
| private static TraitContext buildTraitContext( |
There was a problem hiding this comment.
Reusing toCallContext from StreamPhysicalProcessFunction was not an option? Relying directly on RexLiteral might cause issues in the future. It is better to reuse the provided wrapper.
There was a problem hiding this comment.
My thinking was that having the code in the BridgingSqlFunction importing thing from the physical layer felt like a bad idea. We would be calling in the early logical layer some code that already in the physical layer and we might create a weird dependency.
But it's a valid point that we should have try to have the logic only once and reuse. I've moved "toCallContext" to the BridgingSqlFunction.java and we're reusing it in StreamPhysicalProcessTableFunction.Java. I think that makes sense and it's also not a lot of code
…stricter encapsulation
…dedup toCallContext
|
Thank you, @twalthr. I've addressed both comments |
twalthr
left a comment
There was a problem hiding this comment.
LGTM, thanks @gustavodemorais!
raminqaf
left a comment
There was a problem hiding this comment.
LGTM! @gustavodemorais Thanks!
0213156 to
d054743
Compare
What is the purpose of the change
Adds conditional traits for PTF table arguments, enabling traits that vary based on the SQL call context (e.g., whether PARTITION BY is present). Applied to TO_CHANGELOG: row semantics by default, set semantics when PARTITION BY is provided.
Brief change log
StaticArgument.withConditionalTrait(trait, condition)andapplyConditionalTraits(ctx)APIsTraitConditioninterface withhasPartitionBy(),argIsEqualTo(),not()TraitContextinterface for read-only trait resolution context.withConditionalTrait(SET_SEMANTIC_TABLE, hasPartitionBy())applyConditionalTraitsin type inference, UID derivation, distribution, and runtime semanticsisPtfUpserttoptfRequiresUpdateBefore[PARTITION BY]syntaxVerifying this change
testSetSemanticsWithPartitionByverifies set semantics produces Exchange(hash)Does this pull request potentially affect one of the following parts
@Public(Evolving): (no)Documentation