
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Feb 4, 2021

What changes were proposed in this pull request?

  1. Add a new interface, TruncatableTable, which represents tables that allow atomic truncation.
  2. Implement the new method in InMemoryTable and InMemoryPartitionTable.

Why are the changes needed?

To support TRUNCATE TABLE for v2 tables.

Does this PR introduce any user-facing change?

Should not.

How was this patch tested?

Added new tests to TableCatalogSuite that check truncation of non-partitioned and partitioned tables:

$ build/sbt "test:testOnly *TableCatalogSuite"

@github-actions github-actions bot added the SQL label Feb 4, 2021
@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39464/

@SparkQA

SparkQA commented Feb 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39464/

@SparkQA

SparkQA commented Feb 4, 2021

Test build #134876 has finished for PR 31475 at commit b77a210.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 4, 2021

@cloud-fan @HyukjinKwon Could you review this, please?

@MaxGekk
Member Author

MaxGekk commented Feb 8, 2021

@cloud-fan @HyukjinKwon Any objections to this PR?

@cloud-fan
Contributor

LGTM, also cc @rdblue @imback82 @dongjoon-hyun

Contributor

@jaceklaskowski jaceklaskowski left a comment

LGTM (non-binding)

Contributor

@imback82 imback82 left a comment

LGTM

@rdblue
Contributor

rdblue commented Feb 8, 2021

@MaxGekk, why is this necessary instead of deleting from the table or overwriting everything with no new records? I don't see a good reason to do this, especially at the catalog level instead of the table level. Introducing new ways to do something that is already possible over-complicates the API and is a step in the wrong direction.

Please consider this a -1 until we come to consensus -- I may support it in the end, but I don't want anyone choosing to commit despite disagreement in the mean time.

@MaxGekk
Member Author

MaxGekk commented Feb 8, 2021

... why is this necessary instead of deleting from the table or overwriting everything with no new records?

  1. Emulating table truncation via the insertion of no rows requires an atomic delete + insert, but a concrete implementation might not support that combination even though it can atomically truncate a table.
  2. It closes the door on truncation-specific optimizations. If a catalog implementation knew in advance that we want to truncate the entire table rather than delete all rows, it could do so more efficiently. For example, a file-based implementation could move the table folder to a trash folder with a single atomic syscall.
  3. From a security or permissions point of view, we could distinguish insert-with-overwrite (or delete) from truncation. I can imagine a case where some roles/users have only truncation permissions but no insert or delete permissions.
  4. It is also possible that a truncation is just a record in a catalog-level log while inserts/deletes are records in table-level logs. We cannot map onto such an implementation smoothly if we emulate table truncation via inserts/deletes.

In general, I do believe we should not hide our intention from catalog implementations: truncation should be explicit. The table catalog implementation should decide how to implement it optimally. If it can emulate truncation via overwriting with no rows, fine, that is up to it.
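The distinction argued for above can be sketched with a toy example. This is a hedged illustration only: DemoTable, FileBackedDemoTable, and both method names are hypothetical stand-ins, not the actual Spark v2 connector API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two paths discussed above.
interface DemoTable {
    void deleteAllRows();     // emulation path: generic delete of every row
    boolean truncateTable();  // explicit path: intent is visible, may optimize
}

class FileBackedDemoTable implements DemoTable {
    private List<String> rows = new ArrayList<>();

    void append(String row) { rows.add(row); }
    int rowCount() { return rows.size(); }

    @Override
    public void deleteAllRows() {
        // Generic row-by-row delete: the implementation cannot tell
        // that the caller's intent is to empty the whole table.
        rows.removeIf(r -> true);
    }

    @Override
    public boolean truncateTable() {
        // Explicit truncation: the intent is known, so a cheaper path is
        // possible (analogous to moving the table folder aside in one syscall).
        rows = new ArrayList<>();
        return true;
    }
}
```

Both paths leave the table empty; the point is that only the second one carries the caller's intent down to the implementation.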

@rdblue
Contributor

rdblue commented Feb 8, 2021

@MaxGekk, can you share the use case that you have for this? You mentioned truncation-specific optimizations. I think working with concrete use cases is usually a good idea. If these are theoretical only -- like a user that can drop all data but not a subset -- then we should put this off. If there's a specific case, then let's discuss it.

I agree that there may be good reason to pass along that the engine's intent was to truncate. That's why we have SupportsTruncate for the write builder. And I agree with you that we don't necessarily need to use an atomic operation that could truncate and add data at the same time. Your point about not having insert permissions is a good one to justify not using SupportsTruncate, although the case of a user that can drop all data but not subsets doesn't sound real. The point about truncation possibly being a metadata operation is why we added SupportsDelete at the table level.

Those points may indicate that an interface to truncate a table as a stand-alone operation is valid, although I still think that it is a bad idea to add more interfaces to v2 without a reasonable expectation that they will actually be used.

Another problem is that this operation is proposed at the catalog level, which does not fit with how v2 works. I think the reason for this is to emulate what Hive does, but that's not usually a good choice.

In v2, catalogs load tables and tables are modified. That's why SupportsDelete extends Table and not TableCatalog. This keeps concerns separate, so we have a way to handle tables that don't exist and a separate way to handle tables that don't support a certain operation. Mixing those two together at the catalog level over-complicates the API, requiring a source to throw one exception if the table doesn't exist and another if it doesn't support truncation. (We also went through this discussion with the recently added interfaces to add/drop partitions.)

Assuming that it is worth adding this interface, I would expect it to be a mix-in for Table. And like SupportsOverwrite that implements SupportsTruncate, I think this should also update SupportsDelete so that tables don't need to implement both interfaces.
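The mix-in arrangement described above — SupportsDelete providing truncation through a default method so that implementations need not implement both operations — could look roughly like this. These are simplified, self-contained stand-ins for the real Spark connector classes; only the default-method pattern is the point.

```java
// Simplified stand-ins for the Spark v2 connector classes.
class Filter {}
class AlwaysTrue extends Filter {}

interface Table { String name(); }

interface TruncatableTable extends Table {
    boolean truncateTable();
}

interface SupportsDelete extends TruncatableTable {
    void deleteWhere(Filter[] filters);

    // Default implementation: truncation as "delete everything", so a
    // table that only implements deleteWhere still supports truncation.
    @Override
    default boolean truncateTable() {
        deleteWhere(new Filter[] { new AlwaysTrue() });
        return true;
    }
}

class InMemoryDemoTable implements SupportsDelete {
    private final java.util.List<int[]> rows = new java.util.ArrayList<>();
    public String name() { return "demo"; }
    public void append(int[] row) { rows.add(row); }
    public int rowCount() { return rows.size(); }

    @Override
    public void deleteWhere(Filter[] filters) {
        // Toy semantics: treat a single AlwaysTrue filter as "delete all rows".
        if (filters.length == 1 && filters[0] instanceof AlwaysTrue) {
            rows.clear();
        }
    }
}
```

An implementation that can truncate more cheaply than a full delete is still free to override truncateTable().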

@MaxGekk
Member Author

MaxGekk commented Feb 9, 2021

... can you share the use case that you have for this?

@rdblue For instance, v2 table catalog for JDBC:

  1. I assume that if a table supports the SupportsTruncate interface, it must support an atomic removal of all data plus a write of any set of rows (empty or non-empty).
  2. Supporting atomic SupportsTruncate is not so easy in the case of JDBC. For example, we still don't support it in the JDBC v2 Table Catalog, see SPARK-32595.
  3. DBMSs usually provide a special command for table truncation (see DB2, Oracle, PostgreSQL, Hive), so we could map the new method truncateTable to such a DBMS command.
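For the JDBC case, the mapping could be as direct as issuing the dialect's native truncation statement. A hedged sketch using plain java.sql — JdbcTruncateSketch and its methods are hypothetical helpers, not Spark's actual JDBC dialect API:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper mapping truncateTable() onto a native DBMS command.
// The statement below is the common form; real dialects add options
// (e.g. PostgreSQL's CASCADE, Oracle's DROP STORAGE) that a production
// dialect layer would handle.
class JdbcTruncateSketch {
    static String truncateStatement(String qualifiedTableName) {
        return "TRUNCATE TABLE " + qualifiedTableName;
    }

    static boolean truncate(Connection conn, String qualifiedTableName) {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(truncateStatement(qualifiedTableName));
            return true;  // the DBMS performed its native (often atomic) truncate
        } catch (SQLException e) {
            return false; // surface failure through the boolean contract
        }
    }
}
```

Whether the resulting truncation is atomic then depends on the DBMS, not on Spark emulating it with delete + insert.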

@rdblue
Contributor

rdblue commented Feb 9, 2021

@MaxGekk, thanks. Then let's work on updating this to fit more cleanly with the design of v2 catalogs and tables. This should be a Table interface, not an extension to TableCatalog.

@MaxGekk MaxGekk changed the title [SPARK-34360][SQL] Support table truncation by v2 Table Catalogs [SPARK-34360][SQL] Support truncation of v2 tables Feb 9, 2021
*
* @since 3.2.0
*/
boolean truncateTable();
Contributor

Why include "table" in the method name? table.truncate() seems clear enough to me.

Contributor

Actually, this seems to mirror the other uses so I'm okay with it. I'd probably remove it but it seems okay to leave as is if you feel strongly about it.

Member Author

I just followed the (maybe informal) naming convention in other interfaces. For instance, if a table implements TruncatableTable, SupportsPartitionManagement and SupportsAtomicPartitionManagement, we have 3 methods in the same namespace:

  • truncatePartition()
  • truncatePartitions()
  • and truncate()

In that case, maybe it is better to name this method truncateTable(), which highlights that it applies to the entire table.

public interface TruncatableTable extends Table {
/**
* Truncate a table by removing all rows from the table atomically.
* If the table supports partitions, the method removes all existing partitions.
Contributor

Why is this required? A table may support empty partitions that exist independent of rows, like Hive tables. Is there a compelling reason to force this behavior?

Contributor

SupportsPartitionManagement has a truncatePartition method that will "Truncate a partition in the table by completely removing partition data."

That conflicts with the behavior of truncate here, which drops partitions. I think this requirement should be removed.

Member Author

@MaxGekk MaxGekk Feb 10, 2021

Is there a compelling reason to force this behavior?

Yes, I wanted to highlight that it removes not only all rows but also partitions, to align with Spark's v1 (and Hive) TRUNCATE TABLE. But in fact the v1 command doesn't drop partitions:

spark-sql> CREATE TABLE tbl (col0 INT) PARTITIONED BY (part INT);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
0	0
spark-sql> TRUNCATE TABLE tbl;
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
spark-sql>

Member Author

but in this PR, I drop empty partitions as well. The question is: should v2 TRUNCATE TABLE be aligned with the v1 implementation and preserve empty partitions? I guess the answer is yes.

Member Author

Surprisingly, we don't have any tests that check partition existence after truncating an entire table. I opened PR #31544 with such checks for tables from the v1 In-Memory and Hive external catalogs.

* Represents a table which can be atomically truncated.
*/
@Evolving
public interface TruncatableTable extends Table {
Contributor

The other traits use Supports as the first word, but SupportsTruncate already exists. This name is okay, but others may want to change it.

Member

Maybe we need Atomic in the name? For example, would SupportsAtomicTruncate be better?

Contributor

@rdblue rdblue Feb 9, 2021

All of the operations should be atomic, right? Certainly, that's the expectation of a write that truncates and then appends data. @MaxGekk cited that as a reason above why this interface is needed.

Member Author

If we name this interface SupportsAtomicTruncate, someone may guess that SupportsTruncate is its opposite (a non-atomic truncate) or covers both atomic and non-atomic truncation. But actually, the two interfaces are "orthogonal" in some sense.

Contributor

I don't think that adding Atomic to the name is a good idea. The other operations don't specify whether an operation is atomic and I don't think that this should necessarily either. If I understand correctly, the purpose of this is to allow using TRUNCATE for JDBC or similar optimizations. That's more likely to be atomic, but may not be. It looks like the Hive implementation would not be because the partitions are kept.

val ordering: Array[SortOrder] = Array.empty)
extends Table with SupportsRead with SupportsWrite with SupportsDelete
with SupportsMetadataColumns {
with SupportsMetadataColumns with TruncatableTable {
Contributor

I think that SupportsDelete should implement TruncatableTable, similar to how SupportsOverwrite implements SupportsTruncate. That way it isn't necessary for implementations to support both delete and truncate separately.

Member Author

Are you sure that if an implementation supports SupportsDelete, it must support TruncatableTable too? I slightly doubt it. I can imagine a case where an implementation can delete rows by a filter but cannot guarantee atomic truncation.

Contributor

Truncate is equivalent to deleteWhere(true). Why would that not be equivalent?

@SparkQA

SparkQA commented Feb 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39663/

@SparkQA

SparkQA commented Feb 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39663/

@SparkQA

SparkQA commented Feb 9, 2021

Test build #135081 has finished for PR 31475 at commit 141919c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I like the idea to allow the catalog implementations to distinguish between TRUNCATE TABLE t and INSERT OVERWRITE t SELECT * FROM empty_table. It seems fine to move the API to the table side. But the name conflict is a bit annoying.

One idea is to keep the name SupportsTruncate, and put it in org.apache.spark.sql.connector.catalog, which is the same package of SupportsDelete. Then there is no conflict.

@SparkQA

SparkQA commented Feb 10, 2021

Test build #135098 has finished for PR 31475 at commit 2741a40.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2021

Test build #135103 has finished for PR 31475 at commit bcc01e1.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala
@SparkQA

SparkQA commented Feb 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39694/

@SparkQA

SparkQA commented Feb 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39694/

@SparkQA

SparkQA commented Feb 11, 2021

Test build #135112 has finished for PR 31475 at commit 5af950a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@Override
default boolean truncateTable() {
Filter[] filters = new Filter[] { new AlwaysTrue() };
Contributor

Could you make this a constant instead of creating a new filter and array instance here?

@rdblue
Contributor

rdblue commented Feb 11, 2021

Looks good now. Thanks for fixing this, @MaxGekk!

@SparkQA

SparkQA commented Feb 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39702/

@SparkQA

SparkQA commented Feb 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39702/

@SparkQA

SparkQA commented Feb 12, 2021

Test build #135121 has finished for PR 31475 at commit d1e5a18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
void deleteWhere(Filter[] filters);

Filter[] ALWAYS_TRUE_FILTER = new Filter[] { new AlwaysTrue() };
Contributor

This becomes a public API as well. Shall we put it in an internal object like CatalogV2Util?

Member Author

I wouldn't do that because:

  • I believe interfaces should be as independent from internals as possible.
  • ALWAYS_TRUE_FILTER can be used with other methods of the interface like deleteWhere(). For example, when an implementation overrides truncateTable() but a user wants to delete all rows via deleteWhere(), they can reuse the constant.
  • Other interfaces have constants too; I don't see much difference between PROP_LOCATION and ALWAYS_TRUE_FILTER.

Member

Can't people use AlwaysTrue directly if they want? I would also prefer to avoid having multiple ways to do the same thing. It looks odd to have the AlwaysTrue predicate constant as part of the API of the SupportsDelete interface.

Member Author

@MaxGekk MaxGekk Feb 18, 2021

How about reverting commit d1e5a18 and implementing it as:

default boolean truncateTable() {
    Filter[] filters = new Filter[] { new AlwaysTrue() };
    ...

I am not sure this premature optimization is worth it. Compared to the truncation op, the allocation overhead is small. If it is a hot spot, I do believe the JVM should do all the work for us and optimize it. @rdblue @cloud-fan @HyukjinKwon WDYT?

Member

Okay, I just noticed that it was changed per the review comment above. I think it's fine to remove it, since the overhead is small, and to avoid exposing ALWAYS_TRUE_FILTER as an API.

Member

@HyukjinKwon HyukjinKwon left a comment

Looks fine to me too.

@SparkQA

SparkQA commented Feb 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39807/

@SparkQA

SparkQA commented Feb 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39807/

@SparkQA

SparkQA commented Feb 18, 2021

Test build #135225 has finished for PR 31475 at commit b758060.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@HyukjinKwon
Member

Merged to master.
