
[#2543] feat(spark-connector): support row-level operations to iceberg Table #2642

Closed · wants to merge 345 commits

Conversation

@caican00 (Contributor) commented Mar 22, 2024

What changes were proposed in this pull request?

Support row-level operations on Iceberg tables:

1. `UPDATE tableName SET c1 = v1, c2 = v2, ...`

2. `MERGE INTO targetTable t USING sourceTable s ON s.key = t.key WHEN MATCHED THEN ... WHEN NOT MATCHED THEN ...`

3. `DELETE FROM tableName WHERE ...`
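
For illustration, a minimal smoke test of the three operations through `SparkSession.sql`; the catalog, table, and column names are hypothetical, not taken from this PR, and a configured Spark environment (e.g. via spark-submit) is assumed:

```java
import org.apache.spark.sql.SparkSession;

public class RowLevelOpsDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("row-level-ops-demo")
        .getOrCreate();

    // UPDATE: rewrite matching rows in place.
    spark.sql("UPDATE iceberg_catalog.db.target SET price = 0 WHERE id = 1");

    // MERGE INTO: upsert rows from a source table into a target table.
    spark.sql("MERGE INTO iceberg_catalog.db.target t "
        + "USING iceberg_catalog.db.source s ON s.id = t.id "
        + "WHEN MATCHED THEN UPDATE SET t.price = s.price "
        + "WHEN NOT MATCHED THEN INSERT *");

    // DELETE FROM: remove matching rows.
    spark.sql("DELETE FROM iceberg_catalog.db.target WHERE id = 2");

    spark.stop();
  }
}
```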

Why are the changes needed?

Support row-level operations on Iceberg tables.

Fix: #2543

Does this PR introduce any user-facing change?

Yes: support for `UPDATE ...`, `MERGE INTO ...`, and `DELETE FROM ...`.

How was this patch tested?

New ITs, and tested locally.

@caican00 caican00 marked this pull request as draft March 22, 2024 07:33
@caican00 caican00 changed the title [WIP][#2543] feat(spark-connector): support row-level operations to iceberg Table [#2543] feat(spark-connector): support row-level operations to iceberg Table Mar 27, 2024
@caican00 caican00 changed the title [#2543] feat(spark-connector): support row-level operations to iceberg Table [WIP][#2543] feat(spark-connector): support row-level operations to iceberg Table Mar 27, 2024
@caican00 caican00 changed the title [WIP][#2543] feat(spark-connector): support row-level operations to iceberg Table [#2543] feat(spark-connector): support row-level operations to iceberg Table Mar 27, 2024
@FANNG1 (Contributor) commented Apr 30, 2024

Another thought: how about upgrading Iceberg to a newer version, since the problem only exists in older versions? @caican00 @qqqttt123 @jerryshao WDYT?

@caican00 (Contributor, Author) commented Apr 30, 2024

> Another thought: how about upgrading Iceberg to a newer version, since the problem only exists in older versions? @caican00 @qqqttt123 @jerryshao WDYT?

@FANNG1

1. IMO, we have to support multiple Spark versions, such as Spark 3.1, 3.3, 3.4, 3.5, etc.
2. The Iceberg parser only has this problem before Spark 3.5, but some physical plans in the Iceberg spark-connector, such as AddPartitionFieldExec and SetWriteDistributionAndOrderingExec, have this problem in all versions.

For example, in Spark 3.5: https://github.com/apache/iceberg/blob/426818bfe7fa93e8c677ebf886638d5c50db597b/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/SetWriteDistributionAndOrderingExec.scala#L47

https://github.com/apache/iceberg/blob/426818bfe7fa93e8c677ebf886638d5c50db597b/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ExtendedDataSourceV2Strategy.scala#L166

Unless SparkIcebergTable extends SparkTable and GravitinoIcebergCatalog extends SparkCatalog.
This solution, discussed earlier, also requires overriding some Scala methods, such as productElement, productArity, and canEqual. WDYT? cc @FANNG1 @qqqttt123 @jerryshao
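
A hedged sketch of the kind of check being described: Iceberg's planner rules key off the concrete catalog class, so a Gravitino catalog that wraps rather than extends `org.apache.iceberg.spark.SparkCatalog` would not be matched. Illustrative only; the real check lives in ExtendedDataSourceV2Strategy, linked above:

```java
import org.apache.spark.sql.connector.catalog.CatalogPlugin;

final class IcebergCatalogCheck {
  // Returns true only for the concrete Iceberg catalog class; a wrapper
  // catalog that merely delegates to a SparkCatalog instance fails this test.
  static boolean isIcebergCatalog(CatalogPlugin catalog) {
    return catalog instanceof org.apache.iceberg.spark.SparkCatalog;
  }
}
```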

@FANNG1 (Contributor) commented Apr 30, 2024

My concern is that the current implementation seems hard to maintain, especially across different versions of Spark and Iceberg. If there is no simple solution, I prefer the original implementation, which returns a GravitinoIcebergTable extending SparkTable.

@caican00 (Contributor, Author) commented Apr 30, 2024

> My concern is that the current implementation seems hard to maintain, especially across different versions of Spark and Iceberg. If there is no simple solution, I prefer the original implementation, which returns a GravitinoIcebergTable extending SparkTable.

I prefer this solution too.

@FANNG1 Should I fall back to the original implementation? I would like to finish it today.
Another thought: could we make only SparkIcebergTable extend SparkTable, and let Hive, JDBC, and so on be inconsistent?
For example, it seems unnecessary to make SparkHiveTable also extend Kyuubi's HiveTable.

@FANNG1 (Contributor) commented Apr 30, 2024

> @FANNG1 Should I fall back to the original implementation? I would like to finish it today. Another thought: could we make only SparkIcebergTable extend SparkTable, and let Hive, JDBC, and so on be inconsistent? For example, it seems unnecessary to make SparkHiveTable also extend Kyuubi's HiveTable.

@qqqttt123 @jerryshao WDYT?

@qqqttt123 (Contributor)

Some questions:

1. How does Trino solve this issue?
2. Is it necessary to support the parser if we support row-level operations?

@FANNG1 (Contributor) commented Apr 30, 2024

> Some questions:
>
> 1. How does Trino solve this issue?

How Trino solves it doesn't seem relevant here; Spark and Trino have different frameworks.

@caican00 (Contributor, Author) commented Apr 30, 2024

> Some questions:
>
> 1. How does Trino solve this issue?
> 2. Is it necessary to support the parser if we support row-level operations?

@qqqttt123

1. Trino's SQL parser does not have this issue.
2. The Iceberg spark-connector explicitly uses SparkTable to identify whether a table is an Iceberg table, so we have to either rewrite the parser or inherit from SparkTable to make row-level commands recognizable.

https://github.com/apache/iceberg/blob/426818bfe7fa93e8c677ebf886638d5c50db597b/spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L127-L186
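
A hedged illustration of that check: the parser only treats a relation as Iceberg if the resolved table is the concrete SparkTable class, so a wrapper table fails the test. Illustrative only; the real logic is in IcebergSparkSqlExtensionsParser, linked above:

```java
import org.apache.spark.sql.connector.catalog.Table;

final class IcebergTableCheck {
  // A Gravitino table that only *contains* a SparkTable is not an instance
  // of it, so row-level commands against it are not recognized as Iceberg's.
  static boolean isIcebergTable(Table table) {
    return table instanceof org.apache.iceberg.spark.source.SparkTable;
  }
}
```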

@caican00 (Contributor, Author)

> How Trino solves it doesn't seem relevant here; Spark and Trino have different frameworks.

Yes, Trino actually doesn't have the same problem as Spark, and neither does Flink.

@qqqttt123 (Contributor)

What's the problem with extending SparkTable?

@wForget (Contributor) commented Apr 30, 2024

Why not just return org.apache.iceberg.spark.source.SparkTable instead of wrapping it? Did we do any extra work?

@caican00 (Contributor, Author) commented Apr 30, 2024

> What's the problem with extending SparkTable?

@qqqttt123

1. We would have to turn SparkBaseTable into an interface, since a Java class can only extend one class.
   https://github.com/datastrato/gravitino/blob/main/spark-connector/spark-connector/src/main/java/com/datastrato/gravitino/spark/connector/table/SparkBaseTable.java
   With a combined interface, it seems we can no longer extract a common base table. Each data source would then need to implement the Table, SupportsRead, and SupportsWrite interfaces separately, which causes redundant code and hurts readability.

2. Some physical rules have the same problem. For example, ExtendedDataSourceV2Strategy explicitly uses org.apache.iceberg.spark.SparkCatalog to identify whether a catalog is an Iceberg catalog: https://github.com/apache/iceberg/blob/426818bfe7fa93e8c677ebf886638d5c50db597b/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ExtendedDataSourceV2Strategy.scala#L166
   Therefore we would also have to make GravitinoIcebergCatalog extend Iceberg's SparkCatalog, and in turn make Gravitino's BaseCatalog an interface, which causes the same problem as above.
   https://github.com/datastrato/gravitino/blob/main/spark-connector/spark-connector/src/main/java/com/datastrato/gravitino/spark/connector/catalog/BaseCatalog.java

@qqqttt123 (Contributor)

> 1. We would have to turn SparkBaseTable into an interface, since a Java class can only extend one class. Each data source would then need to implement the Table, SupportsRead, and SupportsWrite interfaces separately, which causes redundant code and hurts readability.
> 2. Some physical rules have the same problem, such as ExtendedDataSourceV2Strategy, so we would also have to make GravitinoIcebergCatalog extend Iceberg's SparkCatalog, and in turn make Gravitino's BaseCatalog an interface.

This is not a problem. We can implement the interface and extract the common logic into a class held as a field. It's OK.

@caican00 (Contributor, Author)

> Why not just return org.apache.iceberg.spark.source.SparkTable instead of wrapping it? Did we do any extra work?

We have wrapped org.apache.iceberg.spark.source.SparkTable inside com.datastrato.gravitino.spark.connector.iceberg.SparkIcebergTable, but SparkIcebergTable's parent class is not org.apache.iceberg.spark.source.SparkTable.

@caican00 (Contributor, Author)

> This is not a problem. We can implement the interface and extract the common logic into a class held as a field. It's OK.

OK for me. @FANNG1 WDYT?

If we all think it is OK, I will go ahead. Thank you.

@FANNG1 (Contributor) commented Apr 30, 2024

> If we all think it is OK, I will go ahead. Thank you.

It's OK for me. Some things to keep consistent before the refactor:

1. Please keep the implementations of HiveTable and IcebergTable consistent.
2. Refactor the common logic in SparkBaseTable into a common helper class, not an interface (see the sketch after this comment).
3. Close the current PR and propose a new PR.

WDYT? @caican00
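
A minimal sketch of this direction, under stated assumptions: `GravitinoTableHelper` is a hypothetical name for the common logic extracted from SparkBaseTable, and the constructor mirrors Iceberg's `SparkTable(icebergTable, refreshEagerly)`. This is illustrative, not the PR's actual code:

```java
import org.apache.iceberg.spark.source.SparkTable;

// Hypothetical holder for the logic formerly shared via the SparkBaseTable
// base class; used through composition instead of inheritance.
interface GravitinoTableHelper {
  String qualifiedName();
}

// Extends Iceberg's SparkTable directly, so Iceberg's parser and planner
// rules recognize it, while delegating Gravitino-specific behavior.
public class SparkIcebergTable extends SparkTable {
  private final GravitinoTableHelper helper;

  public SparkIcebergTable(org.apache.iceberg.Table icebergTable,
                           boolean refreshEagerly,
                           GravitinoTableHelper helper) {
    super(icebergTable, refreshEagerly);
    this.helper = helper;
  }

  @Override
  public String name() {
    // Cross-cutting concerns go through the helper, not a superclass.
    return helper.qualifiedName();
  }
}
```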

@caican00 (Contributor, Author) commented Apr 30, 2024

> 1. Please keep the implementations of HiveTable and IcebergTable consistent.

@FANNG1 I'm a little confused: is it necessary to keep them consistent? If we do it like this, why not just use the parent class's implementation directly? In that case, why implement SparkBaseTable at all? It seems a bit redundant.

In addition, Kyuubi's HiveTable is a Scala class, and extending it from Java requires overriding some Scala methods, such as productElement, productArity, and canEqual (see the sketch below).
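
A hedged sketch of the Scala-interop boilerplate being referred to, assuming the parent is a Scala case class (which implements scala.Product); `HypotheticalScalaHiveTable` is a made-up stand-in, not Kyuubi's actual API:

```java
// Java subclass of a (hypothetical) Scala case class. Because case classes
// implement scala.Product, a subclass that wants equality and pattern
// matching to stay coherent may need to override these members.
public class SparkHiveTable extends HypotheticalScalaHiveTable {
  public SparkHiveTable(String db, String table) {
    super(db, table);
  }

  @Override
  public int productArity() {
    return super.productArity();
  }

  @Override
  public Object productElement(int n) {
    return super.productElement(n);
  }

  @Override
  public boolean canEqual(Object that) {
    // Keep case-class equality from conflating parent and subclass instances.
    return that instanceof SparkHiveTable;
  }
}
```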

@FANNG1 (Contributor) commented Apr 30, 2024

> @FANNG1 I'm a little confused: is it necessary to keep them consistent? If we do it like this, why not just use the parent class's implementation directly? In that case, why implement SparkBaseTable at all? It seems a bit redundant.

Because consistency makes maintenance easier. Having SparkIcebergTable extend SparkTable while SparkHiveTable composes Kyuubi's HiveTable is really confusing to new developers: when implementing new features they would have to consider both cases, and if they don't, they may introduce bugs that are hard to maintain.

@caican00 (Contributor, Author)

> Because consistency makes maintenance easier. Having SparkIcebergTable extend SparkTable while SparkHiveTable composes Kyuubi's HiveTable is really confusing to new developers: when implementing new features they would have to consider both cases, and if they don't, they may introduce bugs that are hard to maintain.

@FANNG1 Got it. Should I submit a separate PR to refactor the table implementation, not including the row-level operations? In that new PR we can explain the reasons for the refactoring, and then submit the row-level PR.

@FANNG1 (Contributor) commented Apr 30, 2024

> @FANNG1 Got it. Should I submit a separate PR to refactor the table implementation, not including the row-level operations? In that new PR we can explain the reasons for the refactoring, and then submit the row-level PR.

Both are OK for me.

@caican00 (Contributor, Author)

> Both are OK for me.

OK.

@caican00 (Contributor, Author) commented May 1, 2024

Closing this PR; I will create a new PR to refactor the table implementation and support the row-level operations feature.

@caican00 caican00 closed this May 1, 2024