[FLINK-39259][table] Support TO_CHANGELOG built-in process table function (upsert stream, flat mode)#27777
[FLINK-39259][table] Support TO_CHANGELOG built-in process table function (upsert stream, flat mode)#27777gustavodemorais wants to merge 8 commits intoapache:masterfrom
Conversation
| final List<SqlNode> operands = mapCall.getOperandList(); | ||
| final Map<String, String> map = new LinkedHashMap<>(); | ||
| for (int i = 0; i < operands.size(); i += 2) { | ||
| final String key = SqlLiteral.unchain(operands.get(i)).getValueAs(String.class); |
There was a problem hiding this comment.
What happens if this unchaining fails? A map can only be literal if all children are also literal, otherwise the getArgumentValue must return Optional.empty()
There was a problem hiding this comment.
how well is null supported in keys and values? officially we support them in Flink.
There was a problem hiding this comment.
Let's catch exceptions here and also check for nulls
| } | ||
|
|
||
| @Override | ||
| public boolean isArgumentLiteral(int pos) { |
There was a problem hiding this comment.
update isArgumentLiteral as well? A map can only be literal if all children are also literal
| if (opDescriptor.isPresent() && opDescriptor.get().getNames().size() != 1) { | ||
| return callContext.fail( | ||
| throwOnFailure, | ||
| "OP descriptor must contain exactly one column name " |
There was a problem hiding this comment.
The descriptor for argument 'op' must contain exactly one column name
| if (!(key instanceof String) || !VALID_ROW_KIND_NAMES.contains(key)) { | ||
| return callContext.fail( | ||
| throwOnFailure, | ||
| "Invalid RowKind name '%s' in op_mapping. Valid names are: %s.", |
There was a problem hiding this comment.
"Invalid target mapping for argument 'op_mapping'."
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.flink.table.planner.functions.ptf; |
There was a problem hiding this comment.
put into org.apache.flink.table.runtime.functions.ptf in table-runtime
| * partition key columns automatically. | ||
| */ | ||
| @Internal | ||
| public class ToChangelogFunction extends ProcessTableFunction<Row> { |
There was a problem hiding this comment.
BuiltInProcessTableFunction similar to BuiltInScalarFunction
There was a problem hiding this comment.
You probably mean we should add a new BuiltInProcessTableFunction similar to BuiltInScalarFunction as start using it as a base class?
| collect(buildOutputRow(input, opCode)); | ||
| } | ||
|
|
||
| private Row buildOutputRow(final Row input, final String opCode) { |
There was a problem hiding this comment.
once you use BuiltIn stack, everything should be RowData. as efficiently as possible with as little conversion as possible
There was a problem hiding this comment.
Done. It was necessary to do one more change to the framework to support this: TypeInferenceUtil.inferInputTypes() (line 484) returns semantics.dataType() for table arguments, which always has Row as conversion class - it ignores the conversionClass specified in StaticArgument.table(). Adding bridgedTo(expectedArg.getConversionClass()).
twalthr
left a comment
There was a problem hiding this comment.
Thank you @gustavodemorais. Sorry for being so picky here but this is the first real built-in PTF so we need to prepare a good template for the future.
|
|
||
| @Override | ||
| public Table toChangelog() { | ||
| return process("TO_CHANGELOG"); |
There was a problem hiding this comment.
don't hard code the name here. use the built in function definition. I know this was an issue for ml predict, but we should fix it properly now.
| } | ||
|
|
||
| @Override | ||
| public Table toChangelog() { |
There was a problem hiding this comment.
add op_mapping parameter and op, same functionality as in sql. to avoid to many overloads, we can accept Expression and force people to use .asArgument.
|
|
||
| /** | ||
| * Converts this dynamic table into an append-only stream with an explicit operation code column | ||
| * using the built-in {@link BuiltInFunctionDefinitions#TO_CHANGELOG TO_CHANGELOG} PTF. |
There was a problem hiding this comment.
Remove references to BuiltInFunctionDefinitions#TO_CHANGELOG TO_CHANGELOG. BuiltInFunctionDefinitions is rather internal.
…ion and other feedback
What is the purpose of the change
This is the first implementation of TO_CHANGELOG. The idea is to implement the use cases one by one interactively. With this ticket we will write the first version which allows flat mapping of upsert stream.
Implement TO_CHANGELOG built-in PTF as specified in FLIP-564, section 4.2.2.2 (retract/upsert stream to upsert stream, flat mode).
TO_CHANGELOG converts a dynamic table (retract/upsert stream) into an append-only stream with an explicit operation code column. Each input row is emitted as an INSERT-only row with a string op column indicating the original RowKind. This is stateless - each event maps directly to one output record.
Brief change log
- Built-in function definition and type strategies
- Runtime implementation
- Table API interface
- MAP literal support in OperatorBindingCallContext and CallBindingCallContext
- Plan tests, semantic tests (SQL, Table API, convenience API)
- Use-case test: LAG on upsert stream (previously impossible without TO_CHANGELOG, like other aggregation functions)
- Documentation
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)Documentation