
[FLINK-33211][table] support flink table lineage #24618

Open · wants to merge 1 commit into base: master
Conversation

@HuangZhenQiu (Contributor):

What is the purpose of the change

  1. Add a table lineage vertex into the transformation in the planner. The final LineageGraph is generated from the transformations and put into the StreamGraph. The lineage graph will be published to the lineage listener in a follow-up PR.
  2. Deprecated table sources and sinks are not considered, as there is not enough info to derive the name and namespace for a lineage dataset.

Brief change log

  • Add table lineage interfaces and default implementations.
  • Create lineage vertices and add them to transformations during the conversion from physical plan to transformations.
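As a rough sketch of the flow described above (hypothetical, heavily simplified names; Flink's real lineage interfaces carry facets and dataset lists and differ in detail):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the lineage types. This only
// illustrates the wiring: vertices are attached to transformations during
// plan conversion, then collected into a graph for the StreamGraph.
class LineageVertex {
    final String datasetName;
    LineageVertex(String datasetName) { this.datasetName = datasetName; }
}

class Transformation {
    private LineageVertex lineageVertex; // attached during plan-to-transformation conversion
    void setLineageVertex(LineageVertex v) { this.lineageVertex = v; }
    LineageVertex getLineageVertex() { return lineageVertex; }
}

class LineageGraph {
    final List<LineageVertex> sources = new ArrayList<>();
    final List<LineageVertex> sinks = new ArrayList<>();

    // Walk the transformations and collect whatever vertices were attached.
    static LineageGraph fromTransformations(
            List<Transformation> sourceTransforms, List<Transformation> sinkTransforms) {
        LineageGraph g = new LineageGraph();
        for (Transformation t : sourceTransforms) {
            if (t.getLineageVertex() != null) g.sources.add(t.getLineageVertex());
        }
        for (Transformation t : sinkTransforms) {
            if (t.getLineageVertex() != null) g.sinks.add(t.getLineageVertex());
        }
        return g;
    }
}
```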

Verifying this change

  1. Added TableLineageGraphTest for both stream and batch.
  2. Added LineageGraph verification in TransformationsTest for legacy sources.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@HuangZhenQiu (Contributor Author):

@flinkbot run azure

@HuangZhenQiu HuangZhenQiu force-pushed the support-table-lineage branch 3 times, most recently from 34f3667 to 3a01cfd Compare April 3, 2024 20:18
@flinkbot (Collaborator):

flinkbot commented Apr 4, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@HuangZhenQiu (Contributor Author):

@flinkbot run azure

@PatrickRen (Contributor) left a comment:

@HuangZhenQiu Thanks for the contribution! I left some comments.

Also, I found a lot of one-line changes removing blank lines in file headers. Could you split them into a separate hotfix commit, or directly revert them, as they are not really necessary?

@@ -34,6 +35,7 @@
public abstract class PhysicalTransformation<T> extends Transformation<T> {

private boolean supportsConcurrentExecutionAttempts = true;
private LineageVertex lineageVertex;
Contributor:

It looks like only source and sink transformations have lineage vertex. What about we only add it to source / sink transformations?

Contributor Author:

Makes sense. Added a TransformationWithLineage class for this purpose.
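A minimal sketch of that idea (hypothetical class shapes, not the actual Flink API): instead of giving every PhysicalTransformation a lineage field, only source/sink transformations extend a lineage-aware base class.

```java
import java.util.Optional;

// Hypothetical, simplified sketch. Only transformations that represent a
// dataset boundary (sources and sinks) carry a lineage vertex; intermediate
// transformations like a map stay lineage-free.
class LineageVertex {
    final String name;
    LineageVertex(String name) { this.name = name; }
}

abstract class Transformation<T> {}

abstract class TransformationWithLineage<T> extends Transformation<T> {
    private LineageVertex lineageVertex;

    public void setLineageVertex(LineageVertex v) { this.lineageVertex = v; }

    // Optional, since a vertex may not have been attached yet.
    public Optional<LineageVertex> getLineageVertex() {
        return Optional.ofNullable(lineageVertex);
    }
}

class SourceTransformation<T> extends TransformationWithLineage<T> {}

class MapTransformation<T> extends Transformation<T> {} // no lineage field
```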


public TableLineageDatasetImpl(ContextResolvedTable contextResolvedTable) {
this.name = contextResolvedTable.getIdentifier().asSummaryString();
this.namespace = inferNamespace(contextResolvedTable.getTable()).orElse("");
Contributor:

I'm not sure if the implementation here matches the definition on the interface. From Javadoc of LineageDataset:

/* Unique name for this dataset's storage, for example, url for jdbc connector and location for lakehouse connector. */

Let's take the JDBC connector as an example. My assumption is that the namespace should describe the URL of the database, or at least some identifier that can tell the difference between different DB instances. Here the implementation only writes jdbc as the namespace.
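The reviewer's point can be sketched as follows (hypothetical helper; `connector` and `url` follow Flink's JDBC table option names, but the inference logic here is illustrative, not the PR's actual code):

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical namespace inference from table options: for a JDBC table,
// prefer the JDBC URL (which identifies the storage instance) over the
// bare connector name, which cannot distinguish two DB instances.
class NamespaceInference {
    static String inferNamespace(Map<String, String> tableOptions) {
        String connector = tableOptions.getOrDefault("connector", "");
        if ("jdbc".equals(connector)) {
            // Fall back to the connector name only if no URL is configured.
            return Optional.ofNullable(tableOptions.get("url")).orElse(connector);
        }
        return connector;
    }
}
```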

@@ -90,6 +90,7 @@ protected Transformation<RowData> createConversionTransformationIfNeeded(
final RowType outputType = (RowType) getOutputType();
final Transformation<RowData> transformation;
final int[] fieldIndexes = computeIndexMapping(true);

Contributor:

This was added by mistake, I assume.

Contributor Author:

Yes.

@@ -123,6 +124,7 @@ public class StreamGraph implements Pipeline {
private CheckpointStorage checkpointStorage;
private Set<Tuple2<StreamNode, StreamNode>> iterationSourceSinkPairs;
private InternalTimeServiceManager.Provider timerServiceProvider;
private LineageGraph lineageGraph;
Contributor:

The only usage of this field is for tests. Is it possible not to introduce it in StreamGraph?

Contributor Author:

As we discussed offline, we will keep it here.

import java.util.Map;

/** Default implementation for DatasetSchemaFacet. */
public class TableDataSetSchemaFacet implements DatasetSchemaFacet {
Contributor:

DataSet -> Dataset

Contributor:

Actually we don't need facets for table schema and table config. They are already included in TableLineageDataset#table.

Contributor Author:

I agree. It is just for exposing the data in a structured way. After the implementation, I feel we probably don't need to expose CatalogContext and CatalogBaseTable to users.

@HuangZhenQiu HuangZhenQiu force-pushed the support-table-lineage branch 2 times, most recently from e5f8a17 to 5f29a04 Compare April 29, 2024 04:32
@HuangZhenQiu (Contributor Author):

@PatrickRen
Thanks for reviewing the PR. For testing purposes, I only added a lineage provider implementation for the Values-related source functions and input format. I will add a lineage provider for Hive in a separate PR.

@HuangZhenQiu (Contributor Author):

@davidradl
Thanks for reviewing this PR. It mainly handles source/sink-level lineage; column-level lineage will need further discussion in the community. Resolved most of your comments.

@HuangZhenQiu HuangZhenQiu force-pushed the support-table-lineage branch 2 times, most recently from 47ec379 to 81da89d Compare May 6, 2024 17:46
@HuangZhenQiu (Contributor Author):

@PatrickRen
I have removed the schema and config facets, given that this info is already provided by CatalogBaseTable. It greatly reduced the size of the PR. Would you please take one more round of review?
