[SPARK-34360][SQL] Support truncation of v2 tables #31475
Conversation
Kubernetes integration test starting

Kubernetes integration test status success

Test build #134876 has finished for PR 31475 at commit
@cloud-fan @HyukjinKwon Could you review this, please?

@cloud-fan @HyukjinKwon Any objections to this PR?

LGTM, also cc @rdblue @imback82 @dongjoon-hyun
jaceklaskowski left a comment
LGTM (non-binding)
imback82 left a comment
LGTM
@MaxGekk, why is this necessary instead of deleting from the table or overwriting everything with no new records? I don't see a good reason to do this, especially at the catalog level instead of the table level. Introducing new ways to do something that is already possible over-complicates the API and is a step in the wrong direction. Please consider this a -1 until we come to consensus -- I may support it in the end, but I don't want anyone choosing to commit despite disagreement in the meantime.
In general, I believe we should not hide our intention from catalog implementations - truncation should be explicit. Table catalog implementations should decide how to implement it in the most optimal way. So, if they can emulate truncation via overwriting with no rows, that's fine, it is up to them.
@MaxGekk, can you share the use case that you have for this? You mentioned truncation-specific optimizations. I think working with concrete use cases is usually a good idea. If these are theoretical only -- like a user that can drop all data but not a subset -- then we should put this off. If there's a specific case, then let's discuss it.

I agree that there may be good reason to pass that the engine's intent was to truncate. That's why we have

Those points may indicate that an interface to truncate a table as a stand-alone operation is valid, although I still think that it is a bad idea to add more interfaces to v2 without a reasonable expectation that they will actually be used.

Another problem here is that this operation is proposed at the catalog level, which does not fit with how v2 works. I think the reason for this is emulating what Hive does, but that's not usually a good choice. In v2, catalogs load tables and tables are modified. That's why

Assuming that it is worth adding this interface, I would expect it to be a mix-in for
@rdblue For instance, v2 table catalog for JDBC:
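The concrete JDBC example is elided in this rendering of the thread, but the kind of optimization a JDBC-backed catalog could make is pushing truncation down as a single `TRUNCATE TABLE` statement instead of a row-level delete. The sketch below is purely illustrative: the class and the quoting helper are hypothetical, not Spark or JDBC API.

```java
// Hypothetical sketch: a JDBC-backed table could implement truncation by
// generating a single TRUNCATE TABLE statement and sending it to the
// database, rather than deleting rows one by one. `quoteIdentifier` is an
// illustrative ANSI-quoting helper, not part of any Spark dialect API.
final class JdbcTruncateSketch {

  static String quoteIdentifier(String name) {
    // Double any embedded double quotes, then wrap in ANSI double quotes.
    return "\"" + name.replace("\"", "\"\"") + "\"";
  }

  static String truncateStatement(String table) {
    return "TRUNCATE TABLE " + quoteIdentifier(table);
  }

  public static void main(String[] args) {
    // Prints: TRUNCATE TABLE "tbl"
    System.out.println(truncateStatement("tbl"));
  }
}
```

In a real connector, the generated statement would then be executed over a `java.sql.Connection`; the point is only that the engine's intent to truncate survives down to the data source.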
@MaxGekk, thanks. Then let's work on updating this to fit more cleanly with the design of v2 catalogs and tables. This should be a
```java
 *
 * @since 3.2.0
 */
boolean truncateTable();
```
Why include "table" in the method name? table.truncate() seems clear enough to me.
Actually, this seems to mirror the other uses so I'm okay with it. I'd probably remove it but it seems okay to leave as is if you feel strongly about it.
I just followed the naming convention (maybe an informal one) in other interfaces. For instance, if a table implements TruncatableTable, SupportsPartitionManagement and SupportsAtomicPartitionManagement, we have 3 methods in the same namespace:
- truncatePartition()
- truncatePartitions()
- and truncate()
In that case, maybe it is better to name this method truncateTable(), which highlights that it applies to the entire table.
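The namespace concern above can be made concrete with a toy example. The interfaces below are hypothetical, simplified stand-ins for the Spark connector traits being discussed, just to show all three truncation methods landing on one implementation:

```java
// Illustrative only: when a single table implementation mixes in all three
// truncation-related interfaces, the three methods share one namespace, and
// a bare truncate() would not say whether it targets a partition or the
// whole table - the argument for the explicit truncateTable() name.
final class NamespaceSketch {
  interface TruncatableTable { boolean truncateTable(); }
  interface SupportsPartitionManagement { boolean truncatePartition(String ident); }
  interface SupportsAtomicPartitionManagement { boolean truncatePartitions(String[] idents); }

  static final class MyTable
      implements TruncatableTable, SupportsPartitionManagement, SupportsAtomicPartitionManagement {
    @Override public boolean truncateTable() { return true; }
    @Override public boolean truncatePartition(String ident) { return true; }
    @Override public boolean truncatePartitions(String[] idents) { return true; }
  }
}
```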
```java
public interface TruncatableTable extends Table {
  /**
   * Truncate a table by removing all rows from the table atomically.
   * If the table supports partitions, the method removes all existing partitions.
```
Why is this required? A table may support empty partitions that exist independent of rows, like Hive tables. Is there a compelling reason to force this behavior?
SupportsPartitionManagement has a truncatePartition method that will "Truncate a partition in the table by completely removing partition data."
That conflicts with the behavior of truncate here, which drops partitions. I think this requirement should be removed.
> Is there a compelling reason to force this behavior?
Yes, I wanted to highlight that it can remove not only all rows but also partitions, to align with Spark's v1 (and Hive) TRUNCATE TABLE. But in fact the v1 command doesn't drop partitions:
```sql
spark-sql> CREATE TABLE tbl (col0 INT) PARTITIONED BY (part INT);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
0	0
spark-sql> TRUNCATE TABLE tbl;
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
spark-sql>
```
But in this PR, I drop empty partitions as well. The question is: should v2 TRUNCATE TABLE be aligned with the v1 implementation and preserve empty partitions? I guess the answer is yes.
Surprisingly, we don't have any tests that check partition existence after entire-table truncation. I opened PR #31544 with such checks for tables from the v1 In-Memory and Hive external catalogs.
```java
 * Represents a table which can be atomically truncated.
 */
@Evolving
public interface TruncatableTable extends Table {
```
The other traits use Supports as the first word, but SupportsTruncate already exists. This name is okay, but others may want to change it.
Maybe we need Atomic in the name? For example, would SupportsAtomicTruncate be better?
All of the operations should be atomic, right? Certainly, that's the expectation of a write that truncates and then appends data. @MaxGekk cited that as a reason above why this interface is needed.
If we name this interface SupportsAtomicTruncate, someone may guess that SupportsTruncate is its opposite (a non-atomic truncate) or covers both atomic and non-atomic truncation. But actually, those two interfaces are "orthogonal" in some sense.
I don't think that adding Atomic to the name is a good idea. The other operations don't specify whether an operation is atomic and I don't think that this should necessarily either. If I understand correctly, the purpose of this is to allow using TRUNCATE for JDBC or similar optimizations. That's more likely to be atomic, but may not be. It looks like the Hive implementation would not be because the partitions are kept.
```diff
   val ordering: Array[SortOrder] = Array.empty)
   extends Table with SupportsRead with SupportsWrite with SupportsDelete
-  with SupportsMetadataColumns {
+  with SupportsMetadataColumns with TruncatableTable {
```
I think that SupportsDelete should implement TruncatableTable, similar to how SupportsOverwrite implements SupportsTruncate. That way it isn't necessary for implementations to support both delete and truncate separately.
Are you sure that if an implementation supports SupportsDelete then it must support TruncatableTable too? I slightly doubt it. I can imagine a case where an implementation can delete rows by a filter but cannot guarantee atomic truncation.
Truncate is equivalent to deleteWhere(true). Why would that not be equivalent?
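The equivalence claimed here can be sketched with simplified stand-ins for Spark's SupportsDelete and Filter (these are toy interfaces for illustration, not the real connector API): a default truncateTable() simply delegates to deleteWhere with a trivially-true filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified sketch of the "truncate == deleteWhere(true)" argument.
// Filter, SupportsDelete, and InMemoryTable below are toy stand-ins for
// the Spark connector types of the same names, not the real API.
final class TruncateViaDeleteSketch {

  interface Filter extends Predicate<Integer> {}

  // Trivially-true filter, playing the role of Spark's AlwaysTrue.
  static final Filter ALWAYS_TRUE = row -> true;

  interface SupportsDelete {
    void deleteWhere(Filter filter);

    // Truncation falls out of deleting with an always-true filter.
    default boolean truncateTable() {
      deleteWhere(ALWAYS_TRUE);
      return true;
    }
  }

  static final class InMemoryTable implements SupportsDelete {
    final List<Integer> rows = new ArrayList<>(List.of(1, 2, 3));

    @Override
    public void deleteWhere(Filter filter) {
      rows.removeIf(filter);
    }
  }

  public static void main(String[] args) {
    InMemoryTable t = new InMemoryTable();
    t.truncateTable();
    System.out.println(t.rows.size()); // prints 0
  }
}
```

MaxGekk's counterpoint above still applies: the delegation gives the same end state, but it cannot by itself promise atomicity if the underlying deleteWhere is not atomic.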
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135081 has finished for PR 31475 at commit
I like the idea to allow the catalog implementations to distinguish between

One idea is to keep the name
Test build #135098 has finished for PR 31475 at commit

Test build #135103 has finished for PR 31475 at commit
```
# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala
```
Kubernetes integration test starting

Kubernetes integration test status success

Test build #135112 has finished for PR 31475 at commit
```java
@Override
default boolean truncateTable() {
  Filter[] filters = new Filter[] { new AlwaysTrue() };
```
Could you make this a constant instead of creating a new filter and array instance here?
Looks good now. Thanks for fixing this, @MaxGekk!
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135121 has finished for PR 31475 at commit
```java
 */
void deleteWhere(Filter[] filters);
```

```java
Filter[] ALWAYS_TRUE_FILTER = new Filter[] { new AlwaysTrue() };
```
This becomes a public API as well. Shall we put it in an internal object like CatalogV2Util?
I wouldn't do that, because:
- I believe interfaces should be independent from internals as much as possible.
- ALWAYS_TRUE_FILTER can be used in other methods of the interface like deleteWhere(). For example, when an implementation overrides truncateTable() but a user wants to delete all rows via deleteWhere(), he/she can re-use the constant.
- Other interfaces have constants too. For example, TableCatalog has (sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java, line 47 in ec1560a):

  `String PROP_LOCATION = "location";`

  I don't see much difference between PROP_LOCATION and ALWAYS_TRUE_FILTER.
Can't people use AlwaysTrue directly if they want? I would also prefer to avoid having multiple ways to do the same thing. It looks odd to have the AlwaysTrue predicate constant as an API under SupportsDelete interface.
How about reverting this commit d1e5a18 and implementing it as:

```java
default boolean truncateTable() {
  Filter[] filters = new Filter[] { new AlwaysTrue() };
  ...
```

I am not sure it is worth doing this premature optimization. Compared to the truncation op itself, the allocation overhead is small. If it is a hot spot, the JVM should do all the work for us and optimize it, I do believe. @rdblue @cloud-fan @HyukjinKwon WDYT?
Okay, I just noticed that it was changed per the review comment above. I think it's fine to remove as the overhead is small, and avoid exposing ALWAYS_TRUE_FILTER as an API.
This reverts commit d1e5a18.
HyukjinKwon left a comment
Looks fine to me too.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135225 has finished for PR 31475 at commit
dongjoon-hyun left a comment
+1, LGTM.
Merged to master.
What changes were proposed in this pull request?
- Added the TruncatableTable interface, which represents tables that allow atomic truncation.
- Implemented it in InMemoryTable and in InMemoryPartitionTable.

Why are the changes needed?
To support TRUNCATE TABLE for v2 tables.

Does this PR introduce any user-facing change?
Should not.

How was this patch tested?
Added new tests to TableCatalogSuite that check truncation of non-partitioned and partitioned tables: