[FLINK-12269][table-blink] Support Temporal Table Join in blink planner and runtime #8302
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
|
|
Hi @godfreyhe , I add the It would be nice if you can have a look too. |
| /** | ||
| * The async join runner to lookup the dimension table. | ||
| */ | ||
| public class TemporalTableJoinAsyncRunner extends RichAsyncFunction<BaseRow, BaseRow> { |
add tests for these runners?
Sure
| new OneInputOperatorWrapper(genOperator) | ||
| } | ||
|
|
||
| private[flink] def generateFunction[T <: Function]( |
Why not add this to FunctionCodeGenerator?
Because it actually generates code for a CalcProgram, so it needs to access the private method generateProcessCode in this file. FunctionCodeGenerator only accepts a code body as a parameter, not the CalcProgram.
What do you think about renaming the method to generateCalcFunction, to align with generateCalcOperator?
| <Resource name="sql"> | ||
| < | ||
| val tableScan = call.rel[TableScan](3) | ||
| matches(join, tableScan) |
Should we also check whether it's snapshotted by proctime?
Currently, a snapshot on either proctime or rowtime is translated into a temporal table join. For rowtime, an exception is thrown during physical translation, because we don't support rowtime temporal joins yet.
So I think we don't need to check proctime here.
I noticed that the current TemporalTableJoin is a SingleRel, which is not suitable for further extension once we support scanning data into state and provide event-time joins. So I think it's inappropriate to translate a rowtime snapshot to TemporalTableJoin and then throw an exception inside this physical operator.
Makes sense.
I'm also rethinking the physical node names TemporalTableJoin and TemporalTableFunctionJoin (i.e. temporal join in Flink). The node names are really confusing to users. Actually, TemporalTableJoin joins a dimension table and is a SingleRel.
So how about renaming TemporalTableJoin to DimensionTableJoin, and TemporalTableFunctionJoin to TemporalJoin? Then we can change the translation rule to:
- snapshot on proctime & table source supports LookupableTableSource ==> DimensionTableJoin
- snapshot on rowtime & table source ONLY supports LookupableTableSource ==> exception
- snapshot on proctime/rowtime & table source supports scanning ==> TemporalJoin
What do you think?
+1 to this idea. How about renaming DimensionTableJoin to LookupJoin, and TemporalJoin to TemporalTableJoin?
Conceptually, LookupJoin is very similar to NestedLoopJoin, but it is a SingleRel.
+1 to LookupJoin.
My only concern with TemporalTableJoin is that it actually joins two streams, yet there is a Table in the name. Do you think it matters?
Because I think TemporalTable is more consistent with the SQL standard. TemporalJoin would leave others wondering what exactly is temporal; it's not intuitive that it is actually used for a temporal table join. You can read this name as TemporalTable's Join.
Ok, let's go forward with LookupJoin and TemporalTableJoin.
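As a rough illustration of the translation rule discussed in this thread, here is a minimal Java sketch using the names agreed on above (LookupJoin, TemporalTableJoin). The enums, the `choose` method, and the two boolean capability flags are hypothetical stand-ins, not Flink planner classes. The thread does not say which node wins when a proctime snapshot targets a source that supports both lookup and scanning; the sketch assumes lookup takes precedence.

```java
/** Hypothetical stand-in for the time attribute used in FOR SYSTEM_TIME AS OF. */
enum TimeAttribute { PROCTIME, ROWTIME }

/** Hypothetical stand-in for the two physical nodes named in the thread. */
enum PhysicalJoin { LOOKUP_JOIN, TEMPORAL_TABLE_JOIN }

final class TemporalJoinRuleSketch {

    /**
     * Picks a physical node following the three cases listed in the thread.
     * Assumption: for a proctime snapshot, lookup wins over scanning.
     */
    static PhysicalJoin choose(TimeAttribute snapshotTime,
                               boolean supportsLookup,
                               boolean supportsScan) {
        if (snapshotTime == TimeAttribute.PROCTIME && supportsLookup) {
            // Case 1: proctime snapshot on a lookupable source.
            return PhysicalJoin.LOOKUP_JOIN;
        }
        if (supportsScan) {
            // Case 3: proctime or rowtime snapshot on a scannable source.
            return PhysicalJoin.TEMPORAL_TABLE_JOIN;
        }
        // Case 2: rowtime snapshot on a lookup-only source.
        // Assumption: a source with neither capability also ends up here.
        throw new UnsupportedOperationException(
            "No temporal join translation for " + snapshotTime + " on this source");
    }

    public static void main(String[] args) {
        System.out.println(choose(TimeAttribute.PROCTIME, true, false));
        System.out.println(choose(TimeAttribute.ROWTIME, false, true));
    }
}
```

Failing in the rule itself, rather than inside the physical operator, matches the concern raised earlier in the thread about throwing during physical translation.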
|
Hi @KurtYoung , I renamed the node name to |
| * @param <T> type of the result | ||
| */ | ||
| @PublicEvolving | ||
| public interface LookupableTableSource<T> extends TableSource<T> { |
Should this interface also extend DefinedIndexes and DefinedPrimaryKey?
|
|
||
| new OneInputTransformation( | ||
| inputTransformation, | ||
| "TemporalTableJoin", |
change the name here
|
|
||
| @Override | ||
| public void collect(T record) { | ||
| this.collected = true; |
Shouldn't this method call collector.collect?
And I just found that RowToBaseRowCollector, a sub-class of this class, actually calls getCollector.asInstanceOf[Collector[BaseRow]].collect(result).
Yes, collector.collect should be called by the sub-classes, because the collector returned by getCollector should collect the final result of the Correlate, i.e. a JoinedRow that combines the left input row and the right row.
What I meant is that this class already holds a private Collector<?> collector; and also implements the Collector<T> interface. In this collect(T record) method, why not collect the record with its own collector, instead of relying on the caller to first call getCollector() and then collect the records?
Because the TableFunctionCollector abstract class doesn't know what the record type T is. It might be a BaseRow, but it can also be a Row (see RowToBaseRowCollector). As a result, TableFunctionCollector can't collect by itself, because it doesn't know what the result record is.
Then I think this class shouldn't be a collector, i.e. it shouldn't implement the Collector<T> interface.
We have to keep the interface, because it is the basic implementation of the UDTF's collector, which is used to accept the UDTF's result values.
What about providing an outputResult method to collect the final result, and removing the getCollector() method and the collect(T record) implementation?
    public void outputResult(Object result) {
        this.collected = true;
        collector.collect(result);
    }
All the sub-classes can use outputResult(result) to collect the final result instead of getCollector().collect(result).
yes, let's try this.
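To make the proposal in this thread concrete, here is a small self-contained Java sketch of a base collector that exposes `outputResult` and leaves `collect(T)` to sub-classes. It is only an illustration under assumptions: `SimpleCollector` stands in for Flink's `Collector`, rows are modelled as strings, and `ConcatCollector`, `reset()`, `isCollected()` and `OutputResultDemo` are hypothetical names, not the code merged in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for Flink's Collector interface (only collect is modelled). */
interface SimpleCollector<T> {
    void collect(T record);
}

/** Sketch of the base class after the proposed change. */
abstract class TableFunctionCollectorSketch<T> implements SimpleCollector<T> {
    private SimpleCollector<Object> collector; // the real underlying collector
    private Object input;                      // current left input row
    private boolean collected;                 // true once any result was emitted

    void setCollector(SimpleCollector<Object> collector) { this.collector = collector; }
    void setInput(Object input) { this.input = input; }
    Object getInput() { return input; }
    void reset() { this.collected = false; }
    boolean isCollected() { return collected; }

    /** Emits a final result and records that at least one row was produced. */
    protected void outputResult(Object result) {
        this.collected = true;
        this.collector.collect(result);
    }
}

/** Hypothetical sub-class: joins the left input with each UDTF row. */
final class ConcatCollector extends TableFunctionCollectorSketch<String> {
    @Override
    public void collect(String right) {
        // The sub-class knows the record type, so it builds the joined row.
        outputResult(getInput() + "|" + right);
    }
}

final class OutputResultDemo {
    /** Runs a left join of one input row against the rows a UDTF would emit. */
    static List<Object> run(String left, List<String> udtfRows) {
        List<Object> out = new ArrayList<>();
        ConcatCollector c = new ConcatCollector();
        c.setCollector(out::add);
        c.setInput(left);
        c.reset();
        for (String row : udtfRows) {
            c.collect(row);
        }
        if (!c.isCollected()) {
            out.add(left + "|null"); // pad with null when the UDTF emitted nothing
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("a", List.of("x", "y")));
        System.out.println(run("b", List.of()));
    }
}
```

With this shape the base class still implements the collector interface for the UDTF, while the `collected` flag is set in exactly one place.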
| /** | ||
| * Sets the current collector, which used to emit the final row. | ||
| */ | ||
| public void setCollector(ResultFuture<?> resultFuture) { |
setResultFuture?
OK
| /** | ||
| * Gets the internal collector which used to emit the final row. | ||
| */ | ||
| public ResultFuture<?> getCollector() { |
getResultFuture?
| /** | ||
| * The basic implementation of collector for {@link ResultFuture} in table joining. | ||
| */ | ||
| public abstract class TableFunctionResultFuture<T> extends AbstractRichFunction implements ResultFuture<T> { |
The responsibilities of this class and TableFunctionCollector are not clear. They look like they want to do some encapsulation, but actually only contain some get and set methods. The interface contract also seems inconsistent; see the comment I left in TableFunctionCollector.
Their responsibility is to generate a collector that applies filters or projections to both the left input row and the right table row. That's why we have setInput, getInput, and a setCollector to set the real underlying collector.
The only difference between them is that TableFunctionCollector has an additional collected flag, used to indicate whether the right table is empty. Because a UDTF may call collect(T) zero or several times, we need a way to know whether it was called zero times, so that we can emit a null row for a left join. However, ResultFuture.complete(Collection<OUT> result) is called exactly once, so we don't need a flag to indicate the "zero call": an empty or null result is the "zero call".
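The "exactly once" argument for the async side can be sketched as follows. This is a hedged illustration, not the PR's generated code: `SimpleResultFuture` is a stand-in for the `complete` part of Flink's `ResultFuture`, rows are modelled as strings, and `LeftJoinResultFuture` and `ResultFutureDemo` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Stand-in for the complete(...) part of Flink's ResultFuture. */
interface SimpleResultFuture<T> {
    void complete(Collection<T> result);
}

/** Hypothetical left-join result future: no collected flag is needed. */
final class LeftJoinResultFuture implements SimpleResultFuture<String> {
    private final String leftRow;
    private final List<String> output;

    LeftJoinResultFuture(String leftRow, List<String> output) {
        this.leftRow = leftRow;
        this.output = output;
    }

    @Override
    public void complete(Collection<String> result) {
        // complete is called exactly once, so an empty or null result
        // is itself the signal that the lookup matched nothing.
        if (result == null || result.isEmpty()) {
            output.add(leftRow + "|null");
            return;
        }
        for (String right : result) {
            output.add(leftRow + "|" + right);
        }
    }
}

final class ResultFutureDemo {
    /** Joins one left row with the rows an async lookup returned. */
    static List<String> join(String left, Collection<String> lookupResult) {
        List<String> out = new ArrayList<>();
        new LeftJoinResultFuture(left, out).complete(lookupResult);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join("a", List.of("x", "y")));
        System.out.println(join("b", List.of()));
    }
}
```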
| TableFunctionResultFuture<BaseRow> resultFuture = generatedResultFuture.newInstance( | ||
| getRuntimeContext().getUserCodeClassLoader()); | ||
| FunctionUtils.setFunctionRuntimeContext(resultFuture, getRuntimeContext()); | ||
| FunctionUtils.openFunction(resultFuture, new Configuration()); |
Consider saving parameters of open for use here?
Sure
| TableFunctionResultFuture<BaseRow> resultFuture = generatedResultFuture.newInstance( | ||
| getRuntimeContext().getUserCodeClassLoader()); | ||
| FunctionUtils.setFunctionRuntimeContext(resultFuture, getRuntimeContext()); | ||
| FunctionUtils.openFunction(resultFuture, new Configuration()); |
Consider closing resultFuture?
Ok
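The two suggestions on this hunk (reuse the parameters passed to `open`, and close each result future) could look roughly like the sketch below. Everything here is a hypothetical stand-in: `Map<String, String>` replaces Flink's `Configuration`, `LifecycleFunctionSketch` replaces the generated `TableFunctionResultFuture`, and `createResultFuture` is an assumed helper, since the surrounding runner code is not shown in the diff.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a generated function with an open/close lifecycle. */
class LifecycleFunctionSketch {
    Map<String, String> openedWith;
    boolean closed;

    void open(Map<String, String> parameters) { this.openedWith = parameters; }
    void close() { this.closed = true; }
}

/** Hypothetical runner that remembers its open parameters and closes what it creates. */
final class AsyncRunnerLifecycleSketch {
    private Map<String, String> openParameters;
    private final List<LifecycleFunctionSketch> created = new ArrayList<>();

    void open(Map<String, String> parameters) {
        // Save the parameters so later-created functions see the same configuration.
        this.openParameters = parameters;
    }

    LifecycleFunctionSketch createResultFuture() {
        LifecycleFunctionSketch future = new LifecycleFunctionSketch();
        future.open(openParameters); // instead of a fresh, empty configuration
        created.add(future);
        return future;
    }

    void close() {
        // Close every result future this runner opened.
        for (LifecycleFunctionSketch future : created) {
            future.close();
        }
        created.clear();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("key", "value");
        AsyncRunnerLifecycleSketch runner = new AsyncRunnerLifecycleSketch();
        runner.open(params);
        LifecycleFunctionSketch future = runner.createResultFuture();
        runner.close();
        System.out.println(future.openedWith + " closed=" + future.closed);
    }
}
```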
…er and runtime This closes apache#8302
|
Comments addressed |
|
Rebased. |
|
@KurtYoung |
|
+1 |
What is the purpose of the change
Support translating "FOR SYSTEM_TIME AS OF" queries into temporal table joins for both batch and stream.
Brief change log
LookupableTableSource, AsyncTableFunction, DefinedPrimaryKey, DefinedTableIndex, TableIndex, LookupConfig
Some differences between this PR and blink branch.
FOR SYSTEM_TIME AS OF on left table's proctime field, not a PROCTIME() builtin function. This makes syntax clean.
String, Timestamp...).
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation