[FLINK-39538][table] Support upsert output mode for FROM_CHANGELOG#28164
[FLINK-39538][table] Support upsert output mode for FROM_CHANGELOG#28164raminqaf wants to merge 4 commits into
Conversation
|
|
||
| If you are producing an upsert table — that is, you are emitting `UPDATE_AFTER` but no `UPDATE_BEFORE` from your CDC input stream — the partition key you select here will be considered both the primary key and the upsert key by the engine. Make sure the `PARTITION BY` key matches your primary key exactly. | ||
|
|
||
| #### Upsert output |
There was a problem hiding this comment.
I'm wondering to what extend we need to document this behavior.
FROM_CHANGELOG converts and explicit changelog into a Flink dynamic table that is internally handled. From a user's point of view, does it really matter how Flink encodes changes? It shouldn't change the semantics of the following SQL operations. Is using upsert changelog mode not "just" an internal optimization?
There was a problem hiding this comment.
This is a fair point to keep this simple. I thought a bit and still think it'd be good to have the section. We have retract as the default mode. The changelog mode of a table is an user facing option which users are aware which users are often dealing with. So a section telling them what they have to pay attention to when they want to fabricate an upsert table makes sense imo
| -- -D[id:2, name:'Bob'] | ||
| ``` | ||
|
|
||
| Without `PARTITION BY`, or when the active `op_mapping` includes `UPDATE_BEFORE`, the output remains a retract changelog. |
There was a problem hiding this comment.
Document that validation fails if there is no PARTITION BY and no UPDATE_BEFORE?
| } | ||
| final Map<String, String> mapping = opMapping.get(); | ||
| final boolean hasUpdateAfter = | ||
| mapping.values().stream().anyMatch(v -> UPDATE_AFTER.equals(v.trim())); |
There was a problem hiding this comment.
was already checked that v isn't null or might v.trim() throw a NPE?
There was a problem hiding this comment.
good catch! I filter the nulls before motching
| * cannot be resolved as a literal, since the default mapping includes both (retract). | ||
| */ | ||
| @SuppressWarnings("unchecked") | ||
| public static boolean isUpsertConfig(final ChangelogContext ctx) { |
There was a problem hiding this comment.
does this function need to be public?
There was a problem hiding this comment.
This can be (package-)private
| final Map<String, String> mapping, | ||
| final boolean throwOnFailure) { | ||
| final boolean hasUpdateAfter = | ||
| mapping.values().stream().anyMatch(v -> UPDATE_AFTER.equals(v.trim())); |
There was a problem hiding this comment.
same question about NPE as above
| callContext | ||
| .getTableSemantics(ARG_TABLE) | ||
| .map(ts -> ts.partitionByColumns().length > 0) | ||
| .orElse(false); |
There was a problem hiding this comment.
Could be extracted into a helper method (is also used in isUpsertConfig()
There was a problem hiding this comment.
The extraction would not be as beautiful as (you and) I imagined to be. One is using CallContetxt the other one is using ChangelogContext. I will try to play around and make it nice 🙂
| * describes an upsert changelog. Upsert mode requires a key, so the input table must use set | ||
| * semantics via {@code PARTITION BY}. | ||
| */ | ||
| private static Optional<List<DataType>> validateUpsertRequiresPartitionBy( |
There was a problem hiding this comment.
The return type of this method looks strange.
It never returns an actual List, only Optional.empty().
I'd rather return a boolean and move the error message into validateOpMapping().
| mapping.values().stream().anyMatch(v -> UPDATE_AFTER.equals(v.trim())); | ||
| final boolean hasUpdateBefore = | ||
| mapping.values().stream().anyMatch(v -> UPDATE_BEFORE.equals(v.trim())); | ||
| if (!hasUpdateAfter || hasUpdateBefore) { |
There was a problem hiding this comment.
Add a comment describing the rational of this condition.
| + "'DELETE', 'DELETE'])") | ||
| .build(); | ||
|
|
||
| public static final TableTestProgram UPSERT_PARTITION_BY_CUSTOM_MAPPING = |
There was a problem hiding this comment.
Isn't the only difference to the other added test that the mapping uses "c" instead of "INSERT"?
Isn't that essentially the same test?
There was a problem hiding this comment.
Rather add a validation test for the case of UPDATE_AFTER without UPDATE_BEFORE and PARTITION BY?
There was a problem hiding this comment.
Sry for the confusion. It is removed
gustavodemorais
left a comment
There was a problem hiding this comment.
The PR in general looks very good, @raminqaf!
I've added multiple comments and doc suggestions and only one change requested.
|
|
||
| If you are producing an upsert table — that is, you are emitting `UPDATE_AFTER` but no `UPDATE_BEFORE` from your CDC input stream — the partition key you select here will be considered both the primary key and the upsert key by the engine. Make sure the `PARTITION BY` key matches your primary key exactly. | ||
|
|
||
| #### Upsert output |
There was a problem hiding this comment.
This is a fair point to keep this simple. I thought a bit and still think it'd be good to have the section. We have retract as the default mode. The changelog mode of a table is an user facing option which users are aware which users are often dealing with. So a section telling them what they have to pay attention to when they want to fabricate an upsert table makes sense imo
|
|
||
| If you are producing an upsert table — that is, you are emitting `UPDATE_AFTER` but no `UPDATE_BEFORE` from your CDC input stream — the partition key you select here will be considered both the primary key and the upsert key by the engine. Make sure the `PARTITION BY` key matches your primary key exactly. | ||
|
|
||
| #### Upsert output |
There was a problem hiding this comment.
| #### Upsert output | |
| #### Upsert table |
| -- -D[id:2, name:'Bob'] | ||
| ``` | ||
|
|
||
| Without `PARTITION BY`, or when the active `op_mapping` includes `UPDATE_BEFORE`, the output remains a retract changelog. |
There was a problem hiding this comment.
| Without `PARTITION BY`, or when the active `op_mapping` includes `UPDATE_BEFORE`, the output remains a retract changelog. | |
| By default, without `PARTITION BY`, or when the active `op_mapping` includes `UPDATE_BEFORE`, the output remains a retract changelog. |
|
|
||
| #### Upsert output | ||
|
|
||
| When `PARTITION BY` is combined with an `op_mapping` that does NOT include `UPDATE_BEFORE`, the output changelog is an upsert table keyed on the partition columns. Each input row produces an `INSERT`, `UPDATE_AFTER`, or `DELETE` event with the partition key acting as the upsert key. |
There was a problem hiding this comment.
| When `PARTITION BY` is combined with an `op_mapping` that does NOT include `UPDATE_BEFORE`, the output changelog is an upsert table keyed on the partition columns. Each input row produces an `INSERT`, `UPDATE_AFTER`, or `DELETE` event with the partition key acting as the upsert key. | |
| To generate an `Upsert table`, the following requirements must be met: | |
| * **Key Partitioning:** You must use `PARTITION BY <key>`, where the partition key corresponds to the unique/primary key of the dataset. | |
| * **Op Mapping Configuration:** The `op_mapping` must include `UPDATE_AFTER` and must NOT include `UPDATE_BEFORE`. | |
| **How it works:** | |
| The engine assumes that the keys provided in the `PARTITION BY` clause function as the unique upsert keys. The resulting output changelog becomes an upsert table keyed on these partition columns. Each incoming row is evaluated and produces `INSERT`, `UPDATE_AFTER`, or `DELETE` events, using the partition key as the explicit upsert key. Therefore, if the incoming changelog contains unique keys (such as a primary key), they **must** be used in the `PARTITION BY` clause. | |
| The FROM_CHANGELOG PTF assumes events arrived ordered. If the source itself does not guarantee ordering for events for the same PARTITION BY keys, consider using using ORDER BY <link>. |
| * | ||
| * <ul> | ||
| * <li><b>Retract</b> (default): the active {@code op_mapping} includes {@code UPDATE_BEFORE} | ||
| * or no updates at all. The output emits {@code INSERT}, {@code UPDATE_BEFORE}, {@code |
There was a problem hiding this comment.
| * or no updates at all. The output emits {@code INSERT}, {@code UPDATE_BEFORE}, {@code | |
| * or only {@code INSERT} and {@code DELETE} pairs. The output emits {@code INSERT}, {@code UPDATE_BEFORE}, {@code |
| * <p>Upsert mode uses full deletes ({@link ChangelogMode#upsert(boolean) upsert(false)}) | ||
| * because the runtime forwards each input delete row with all fields populated; only the {@link | ||
| * org.apache.flink.types.RowKind} is rewritten. |
There was a problem hiding this comment.
| * <p>Upsert mode uses full deletes ({@link ChangelogMode#upsert(boolean) upsert(false)}) | |
| * because the runtime forwards each input delete row with all fields populated; only the {@link | |
| * org.apache.flink.types.RowKind} is rewritten. | |
| * <p>Upsert mode uses full deletes by default ({@link ChangelogMode#upsert(boolean) upsert(false)}). We currently only support consuming from a changelog stream with full deletes (where not only the primary keys, but all fields fields are populated). |
nit: we could maybe even delete this and add only more info when we add "consume_full_updates"
| * Returns {@code true} when the FROM_CHANGELOG call should emit an upsert changelog: the input | ||
| * table is partitioned AND the resolved {@code op_mapping} contains {@code UPDATE_AFTER} | ||
| * without {@code UPDATE_BEFORE}. Falls back to {@code false} when the mapping is absent or | ||
| * cannot be resolved as a literal, since the default mapping includes both (retract). |
There was a problem hiding this comment.
| * cannot be resolved as a literal, since the default mapping includes both (retract). | |
| * cannot be resolved as a literal. The default mapping maps to retract table. |
| .item("uid", uid) | ||
| .item("select", String.join(",", getRowType().getFieldNames())) | ||
| .item("rowType", getRowType()); | ||
| final Set<ImmutableBitSet> upsertKeys = |
There was a problem hiding this comment.
I don't think you want to do this. This would change the plan output for all PTFs and you'd theoretically have to regenerate all PTF tests in the code base.
There was a problem hiding this comment.
You just want to output the upsertKeys in the plan, right? Then you want to use/thinker your tests to use the testing utility that allows you do so instead of changing the default explain for ptf nodes
There was a problem hiding this comment.
Reverted this and now stringify the plan through TableTestBase
| + "'DELETE', 'DELETE'])") | ||
| .build(); | ||
|
|
||
| public static final TableTestProgram UPSERT_PARTITION_BY_CUSTOM_MAPPING = |
08d52fc to
742d0db
Compare
FROM_CHANGELOG now emits an upsert changelog (INSERT, UPDATE_AFTER, full DELETE) when the input table is partitioned (set semantics via PARTITION BY) and the active op_mapping maps to UPDATE_AFTER without UPDATE_BEFORE. The partition key acts as the upsert key. In all other cases the output remains a retract changelog. Submitting an op_mapping with UPDATE_AFTER but no UPDATE_BEFORE without PARTITION BY is rejected at validation time, because upsert mode requires a key. To enable the strategy to inspect the resolved op_mapping and the input table's partition keys, ChangelogFunction.ChangelogContext is extended with two default methods: getArgumentValue(int, Class) and getTableSemantics(int). Defaults return Optional.empty() to preserve source compatibility for existing implementations. The planner-side wrapper in FlinkChangelogModeInferenceProgram delegates the two new methods to the underlying CallContext. Upsert mode uses full deletes (ChangelogMode.upsert(false)) because the runtime forwards each input delete row with all fields populated; only the RowKind is rewritten. This matches the runtime's behavior and avoids forcing downstream operators to reconstruct rows from state. The upsert key derivation in FlinkRelMdUniqueKeys.getPtfUniqueKeys already returns the partition columns when a PTF emits upsert, so no metadata changes are needed.
ca6642b to
7da043f
Compare
| return ctx.getArgumentValue(ARG_OP_MAPPING, Map.class).stream() | ||
| .anyMatch(FromChangelogTypeStrategy::describesUpsert); |
There was a problem hiding this comment.
nit
| return ctx.getArgumentValue(ARG_OP_MAPPING, Map.class).stream() | |
| .anyMatch(FromChangelogTypeStrategy::describesUpsert); | |
| return ctx.getArgumentValue(ARG_OP_MAPPING, Map.class).map(FromChangelogTypeStrategy::describesUpsert).orElse(false); |
stream() of an Optional surely works, but kinda hides that this is just a single element.
map() is more clear, IMO.
There was a problem hiding this comment.
Thanks! Addressed!
gustavodemorais
left a comment
There was a problem hiding this comment.
Thanks for addressing the feedback, @raminqaf! Take a look
| The engine assumes that the keys provided in the `PARTITION BY` clause function as the unique upsert keys. The resulting output changelog becomes an upsert table keyed on these partition columns. Each incoming row is evaluated and produces `INSERT`, `UPDATE_AFTER`, or `DELETE` events, using the partition key as the explicit upsert key. Therefore, if the incoming changelog contains unique keys (such as a primary key), they **must** be used in the `PARTITION BY` clause. | ||
|
|
||
| <span class="label label-danger">Note</span> | ||
| - An `op_mapping` that produces `UPDATE_AFTER` without `UPDATE_BEFORE` describes an upsert changelog and requires a key. `PARTITION BY` must be present on the table argument; otherwise the call would produce key-less updates and is rejected at validation time with a `ValidationException`. |
There was a problem hiding this comment.
The note is a bit repetitive, we already stated that it's an upsert table and we have explicit above that the partition by is required
| The engine assumes that the keys provided in the `PARTITION BY` clause function as the unique upsert keys. The resulting output changelog becomes an upsert table keyed on these partition columns. Each incoming row is evaluated and produces `INSERT`, `UPDATE_AFTER`, or `DELETE` events, using the partition key as the explicit upsert key. Therefore, if the incoming changelog contains unique keys (such as a primary key), they **must** be used in the `PARTITION BY` clause. | |
| <span class="label label-danger">Note</span> | |
| - An `op_mapping` that produces `UPDATE_AFTER` without `UPDATE_BEFORE` describes an upsert changelog and requires a key. `PARTITION BY` must be present on the table argument; otherwise the call would produce key-less updates and is rejected at validation time with a `ValidationException`. | |
| An `op_mapping` that produces `UPDATE_AFTER` without `UPDATE_BEFORE` describes an upsert changelog and requires a key. The engine assumes that the keys provided in the `PARTITION BY` clause function as the unique upsert keys. The resulting output changelog becomes an upsert table keyed on these partition columns. Each incoming row is evaluated and produces `INSERT`, `UPDATE_AFTER`, or `DELETE` events, using the partition key as the explicit upsert key. Therefore, if the incoming changelog contains unique keys (such as a primary key), they **must** be used in the `PARTITION BY` clause. |
| * or no updates at all. The output emits {@code INSERT}, {@code UPDATE_BEFORE}, {@code | ||
| * UPDATE_AFTER}, and {@code DELETE}. | ||
| * <li><b>Upsert</b>: the {@code op_mapping} maps to {@code UPDATE_AFTER} without {@code | ||
| * UPDATE_BEFORE}. The output emits {@code INSERT}, {@code UPDATE_AFTER}, and full {@code |
There was a problem hiding this comment.
hmmm, it wasn't :D. I see "full"
| * Verify the AST and the optimized rel plan for the given SELECT query. The rendered optimized | ||
| * rel plan includes the `upsertKeys=[...]` term for rel nodes that derive upsert keys. | ||
| */ | ||
| def verifyRelPlanWithUpsertKey(query: String, extraDetails: ExplainDetail*): Unit = { |
There was a problem hiding this comment.
This does not seem to fit the style of the utilities we have here. We don have unique verifyRelPlan for each param we can configure. Take a look at how withDuplicateChanges and withQueryBlockAlias were implemented
What is the purpose of the change
FROM_CHANGELOGnow emits an upsert changelog (INSERT,UPDATE_AFTER, fullDELETE) when the input table is partitioned (set semantics viaPARTITION BY) and the activeop_mappingmaps toUPDATE_AFTERwithoutUPDATE_BEFORE. The partition key acts as the upsert key. In all other cases the output remains a retract changelog.Submitting an
op_mappingwithUPDATE_AFTERbut noUPDATE_BEFOREwithoutPARTITION BYis rejected at validation time, because upsert mode requires a key.To enable the strategy to inspect the resolved op_mapping and the input table's partition keys, ChangelogFunction.ChangelogContext is extended with two default methods: getArgumentValue(int, Class) and getTableSemantics(int). Defaults return Optional.empty() to preserve source compatibility for existing implementations.
The planner-side wrapper in FlinkChangelogModeInferenceProgram delegates the two new methods to the underlying CallContext.
Upsert mode uses full deletes (ChangelogMode.upsert(false)) because the runtime forwards each input delete row with all fields populated; only the RowKind is rewritten. This matches the runtime's behavior and avoids forcing downstream operators to reconstruct rows from state.
The upsert key derivation in FlinkRelMdUniqueKeys.getPtfUniqueKeys already returns the partition columns when a PTF emits upsert, so no metadata changes are needed.
Verifying this change
Changes are backed by semantic tests and plan verification tests
FromChangelogTestProgramFromChangelogTestto verify generated plan and upsertKeysDoes this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)Documentation
Was generative AI tooling used to co-author this PR?
Generated-by: Opus 4.7