[SPARK-34360][SQL] Support truncation of v2 tables #31475
Conversation
Kubernetes integration test starting

Kubernetes integration test status success

Test build #134876 has finished for PR 31475 at commit
@cloud-fan @HyukjinKwon Could you review this, please?

@cloud-fan @HyukjinKwon Any objections to this PR?

LGTM, also cc @rdblue @imback82 @dongjoon-hyun
jaceklaskowski left a comment
LGTM (non-binding)
imback82 left a comment
LGTM
@MaxGekk, why is this necessary instead of deleting from the table or overwriting everything with no new records? I don't see a good reason to do this, especially at the catalog level instead of the table level. Introducing new ways to do something that is already possible over-complicates the API and is a step in the wrong direction. Please consider this a -1 until we come to consensus -- I may support it in the end, but I don't want anyone choosing to commit despite disagreement in the meantime.
In general, I believe we should not hide our intention from catalog implementations - truncation should be explicit. Table catalog implementations should decide how to implement it in the most optimal way. So, if they can emulate truncation via overwriting with no rows, that's fine, it is up to them.
@MaxGekk, can you share the use case that you have for this? You mentioned truncation-specific optimizations. I think working with concrete use cases is usually a good idea. If these are theoretical only -- like a user that can drop all data but not a subset -- then we should put this off. If there's a specific case, then let's discuss it.

I agree that there may be good reason to pass that the engine's intent was to truncate. That's why we have

Those points may indicate that an interface to truncate a table as a stand-alone operation is valid, although I still think that it is a bad idea to add more interfaces to v2 without a reasonable expectation that they will actually be used.

Another problem here is that this operation is proposed at the catalog level, which does not fit with how v2 works. I think the reason for this is emulating what Hive does, but that's not usually a good choice. In v2, catalogs load tables and tables are modified. That's why

Assuming that it is worth adding this interface, I would expect it to be a mix-in for
@rdblue For instance, v2 table catalog for JDBC:
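The concrete JDBC example is elided in this rendering of the thread, but the kind of optimization a JDBC-backed catalog could make is pushing truncation down as a single `TRUNCATE TABLE` statement instead of a row-level delete. The sketch below is purely illustrative: the class and the quoting helper are hypothetical, not Spark or JDBC API.

```java
// Hypothetical sketch: a JDBC-backed table could implement truncation by
// generating a single TRUNCATE TABLE statement and sending it to the
// database, rather than deleting rows one by one. `quoteIdentifier` is an
// illustrative ANSI-quoting helper, not part of any Spark dialect API.
final class JdbcTruncateSketch {

  static String quoteIdentifier(String name) {
    // Double any embedded double quotes, then wrap in ANSI double quotes.
    return "\"" + name.replace("\"", "\"\"") + "\"";
  }

  static String truncateStatement(String table) {
    return "TRUNCATE TABLE " + quoteIdentifier(table);
  }

  public static void main(String[] args) {
    // Prints: TRUNCATE TABLE "tbl"
    System.out.println(truncateStatement("tbl"));
  }
}
```

In a real connector, the generated statement would then be executed over a `java.sql.Connection`; the point is only that the engine's intent to truncate survives down to the data source.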
@MaxGekk, thanks. Then let's work on updating this to fit more cleanly with the design of v2 catalogs and tables. This should be a
```java
 *
 * @since 3.2.0
 */
boolean truncateTable();
```
Why include "table" in the method name? table.truncate() seems clear enough to me.
Actually, this seems to mirror the other uses so I'm okay with it. I'd probably remove it but it seems okay to leave as is if you feel strongly about it.
I just followed the naming convention (maybe an informal one) in other interfaces. For instance, if a table implements TruncatableTable, SupportsPartitionManagement and SupportsAtomicPartitionManagement, we have 3 methods in the same namespace:
- truncatePartition()
- truncatePartitions()
- and truncate()
In that case, maybe it is better to name this method truncateTable(), which highlights that it applies to the entire table.
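The namespace concern above can be made concrete with a toy example. The interfaces below are hypothetical, simplified stand-ins for the Spark connector traits being discussed, just to show all three truncation methods landing on one implementation:

```java
// Illustrative only: when a single table implementation mixes in all three
// truncation-related interfaces, the three methods share one namespace, and
// a bare truncate() would not say whether it targets a partition or the
// whole table - the argument for the explicit truncateTable() name.
final class NamespaceSketch {
  interface TruncatableTable { boolean truncateTable(); }
  interface SupportsPartitionManagement { boolean truncatePartition(String ident); }
  interface SupportsAtomicPartitionManagement { boolean truncatePartitions(String[] idents); }

  static final class MyTable
      implements TruncatableTable, SupportsPartitionManagement, SupportsAtomicPartitionManagement {
    @Override public boolean truncateTable() { return true; }
    @Override public boolean truncatePartition(String ident) { return true; }
    @Override public boolean truncatePartitions(String[] idents) { return true; }
  }
}
```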
```java
public interface TruncatableTable extends Table {
  /**
   * Truncate a table by removing all rows from the table atomically.
   * If the table supports partitions, the method removes all existing partitions.
```
Why is this required? A table may support empty partitions that exist independent of rows, like Hive tables. Is there a compelling reason to force this behavior?
SupportsPartitionManagement has a truncatePartition method that will "Truncate a partition in the table by completely removing partition data."
That conflicts with the behavior of truncate here, which drops partitions. I think this requirement should be removed.
> Is there a compelling reason to force this behavior?
Yes, I wanted to highlight that it can remove not only all rows but also partitions, to align with Spark's v1 (and Hive) TRUNCATE TABLE. But in fact the v1 command doesn't drop partitions:
```sql
spark-sql> CREATE TABLE tbl (col0 INT) PARTITIONED BY (part INT);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
0	0
spark-sql> TRUNCATE TABLE tbl;
spark-sql> SHOW PARTITIONS tbl;
part=0
part=1
spark-sql> SELECT * FROM tbl;
spark-sql>
```
But in this PR, I drop empty partitions as well. The question is: should v2 TRUNCATE TABLE be aligned with the v1 implementation and preserve empty partitions? I guess the answer is yes.
Surprisingly, we don't have any tests that check partition existence after entire-table truncation. I opened PR #31544 with such checks for tables from the v1 In-Memory and Hive external catalogs.
```java
 * Represents a table which can be atomically truncated.
 */
@Evolving
public interface TruncatableTable extends Table {
```
The other traits use Supports as the first word, but SupportsTruncate already exists. This name is okay, but others may want to change it.
Maybe we need Atomic in the name? For example, would SupportsAtomicTruncate be better?
All of the operations should be atomic, right? Certainly, that's the expectation of a write that truncates and then appends data. @MaxGekk cited that as a reason above why this interface is needed.
If we name this interface SupportsAtomicTruncate, someone may guess that SupportsTruncate is its opposite (a non-atomic truncate) or covers both atomic and non-atomic truncation. But actually, those two interfaces are "orthogonal" in some sense.
I don't think that adding Atomic to the name is a good idea. The other operations don't specify whether an operation is atomic and I don't think that this should necessarily either. If I understand correctly, the purpose of this is to allow using TRUNCATE for JDBC or similar optimizations. That's more likely to be atomic, but may not be. It looks like the Hive implementation would not be because the partitions are kept.
```diff
   val ordering: Array[SortOrder] = Array.empty)
   extends Table with SupportsRead with SupportsWrite with SupportsDelete
-  with SupportsMetadataColumns {
+  with SupportsMetadataColumns with TruncatableTable {
```
I think that SupportsDelete should implement TruncatableTable, similar to how SupportsOverwrite implements SupportsTruncate. That way it isn't necessary for implementations to support both delete and truncate separately.
Are you sure that if an implementation supports SupportsDelete then it must support TruncatableTable too? I slightly doubt it. I can imagine a case where an implementation can delete rows by a filter but cannot guarantee atomic truncation.
Truncate is equivalent to deleteWhere(true). Why would that not be equivalent?
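The equivalence claimed here can be sketched with simplified stand-ins for Spark's SupportsDelete and Filter (these are toy interfaces for illustration, not the real connector API): a default truncateTable() simply delegates to deleteWhere with a trivially-true filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified sketch of the "truncate == deleteWhere(true)" argument.
// Filter, SupportsDelete, and InMemoryTable below are toy stand-ins for
// the Spark connector types of the same names, not the real API.
final class TruncateViaDeleteSketch {

  interface Filter extends Predicate<Integer> {}

  // Trivially-true filter, playing the role of Spark's AlwaysTrue.
  static final Filter ALWAYS_TRUE = row -> true;

  interface SupportsDelete {
    void deleteWhere(Filter filter);

    // Truncation falls out of deleting with an always-true filter.
    default boolean truncateTable() {
      deleteWhere(ALWAYS_TRUE);
      return true;
    }
  }

  static final class InMemoryTable implements SupportsDelete {
    final List<Integer> rows = new ArrayList<>(List.of(1, 2, 3));

    @Override
    public void deleteWhere(Filter filter) {
      rows.removeIf(filter);
    }
  }

  public static void main(String[] args) {
    InMemoryTable t = new InMemoryTable();
    t.truncateTable();
    System.out.println(t.rows.size()); // prints 0
  }
}
```

MaxGekk's counterpoint above still applies: the delegation gives the same end state, but it cannot by itself promise atomicity if the underlying deleteWhere is not atomic.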
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135081 has finished for PR 31475 at commit
I like the idea to allow the catalog implementations to distinguish between

One idea is to keep the name
Test build #135098 has finished for PR 31475 at commit

Test build #135103 has finished for PR 31475 at commit
```
# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala
```
Kubernetes integration test starting

Kubernetes integration test status success

Test build #135112 has finished for PR 31475 at commit
```java
@Override
default boolean truncateTable() {
  Filter[] filters = new Filter[] { new AlwaysTrue() };
```
Could you make this a constant instead of creating a new filter and array instance here?
Looks good now. Thanks for fixing this, @MaxGekk!
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135121 has finished for PR 31475 at commit
```java
 */
void deleteWhere(Filter[] filters);
```

```java
Filter[] ALWAYS_TRUE_FILTER = new Filter[] { new AlwaysTrue() };
```
This becomes a public API as well. Shall we put it in an internal object like CatalogV2Util?
I wouldn't do that, because:
- I believe interfaces should be independent from internals as much as possible.
- ALWAYS_TRUE_FILTER can be used in other methods of the interface like deleteWhere(). For example, when an implementation overrides truncateTable() but a user wants to delete all rows via deleteWhere(), he/she can re-use the constant.
- Other interfaces have constants too. For example, TableCatalog has (sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java, line 47 in ec1560a):

  `String PROP_LOCATION = "location";`

  I don't see much difference between PROP_LOCATION and ALWAYS_TRUE_FILTER.
Can't people use AlwaysTrue directly if they want? I would also prefer to avoid having multiple ways to do the same thing. It looks odd to have the AlwaysTrue predicate constant as an API under SupportsDelete interface.
How about reverting this commit d1e5a18 and implementing it as:

```java
default boolean truncateTable() {
  Filter[] filters = new Filter[] { new AlwaysTrue() };
  ...
```

I am not sure it is worth doing this premature optimization. Compared to the truncation op itself, the allocation overhead is small. If it is a hot spot, the JVM should do all the work for us and optimize it, I do believe. @rdblue @cloud-fan @HyukjinKwon WDYT?
Okay, I just noticed that it was changed per the review comment above. I think it's fine to remove as the overhead is small, and avoid exposing ALWAYS_TRUE_FILTER as an API.
This reverts commit d1e5a18.
HyukjinKwon left a comment
Looks fine to me too.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #135225 has finished for PR 31475 at commit
dongjoon-hyun left a comment
+1, LGTM.
Merged to master.
What changes were proposed in this pull request?
- Added the TruncatableTable interface, which represents tables that allow atomic truncation.
- Implemented it in InMemoryTable and in InMemoryPartitionTable.

Why are the changes needed?
To support TRUNCATE TABLE for v2 tables.

Does this PR introduce any user-facing change?
Should not.

How was this patch tested?
Added new tests to TableCatalogSuite that check truncation of non-partitioned and partitioned tables: