
[WIP] Core: Fix drop table without purge for hadoop catalog #5283

Closed
ajantha-bhat wants to merge 1 commit into apache:master from ajantha-bhat:hadoop

Conversation

@ajantha-bhat
Member

When dropTable() is called with purge=false, it deletes the data and metadata files.

The expected behaviour for purge = false is that the data and metadata files are left intact; only the catalog's entries should be deleted.
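For illustration, a minimal sketch of that contract against the Iceberg `Catalog` API; the `db.tbl` identifier and the wrapper class are hypothetical:

```java
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class DropTableContractSketch {
  // `catalog` is any configured Iceberg catalog; "db.tbl" is a placeholder.
  static void demonstrate(Catalog catalog) {
    TableIdentifier id = TableIdentifier.of("db", "tbl");

    // purge = false: only the catalog entry should be removed;
    // data and metadata files must stay where they are.
    catalog.dropTable(id, false /* purge */);

    // purge = true: the entry is removed AND data/metadata files are deleted.
    // catalog.dropTable(id, true /* purge */);
  }
}
```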

@github-actions bot added the core label on Jul 15, 2022
@ajantha-bhat marked this pull request as draft on July 15, 2022 08:11
@ajantha-bhat changed the title from "Core: Fix drop table without purge for hadoop catalog" to "[WIP] Core: Fix drop table without purge for hadoop catalog" on Jul 15, 2022
@ajantha-bhat
Member Author

The problem with the test cases is that, by default, Spark issues a plain "DROP TABLE" SQL statement, which doesn't purge the data. But because the warehouse is a temp dir, cleanup happens automatically at the end of each test case.

Now that the Hadoop catalog supports purge = false, the test cases no longer clean up the data, and they fail with a table-already-exists error.

Also, even without the version-hint file, the Hadoop catalog derives the version by reading the metadata file names.

I probably need to modify the test cases to use "DROP TABLE PURGE" SQL, or stop deriving the version info when the version-hint file doesn't exist (but I'm not sure about the impact of the latter). A sketch of the test-side option follows below.
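A minimal sketch of that test-side option, assuming a `SparkSession` wired to an Iceberg catalog; the table name and helper class are hypothetical:

```java
import org.apache.spark.sql.SparkSession;

class DropTablePurgeSketch {
  static void dropWithPurge(SparkSession spark) {
    // A plain "DROP TABLE db.tbl" would now leave the files behind,
    // so tests that rely on cleanup need the explicit PURGE clause.
    spark.sql("DROP TABLE db.tbl PURGE");
  }
}
```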

Review comment on the diff (the dropTable path in the Hadoop catalog):

```java
      CatalogUtil.dropTableData(ops.io(), lastMetadata);
      return fs.delete(tablePath, true /* recursive */);
    } else {
      // just drop the version-hint.txt file
```
@rdblue
Contributor

The version hint is a hint. If any metadata file exists, then the table will still exist.

I think that the original version is correct. For Hadoop tables, dropping a table means deleting its directory. The confusion here is one reason why Hadoop tables are not recommended for production use.
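To illustrate the point above, a sketch of how the latest version can be recovered without version-hint.txt, assuming the `vN.metadata.json` naming that Hadoop tables use; the helper itself is hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LatestVersionSketch {
  private static final Pattern VERSION = Pattern.compile("^v(\\d+)\\.metadata\\.json$");

  // Scans metadata file names such as "v3.metadata.json" and returns the
  // highest version found, or -1 if no metadata file exists (no table).
  static int latestVersion(String[] fileNames) {
    int latest = -1;
    for (String name : fileNames) {
      Matcher m = VERSION.matcher(name);
      if (m.matches()) {
        latest = Math.max(latest, Integer.parseInt(m.group(1)));
      }
    }
    return latest;
  }
}
```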

@ajantha-bhat
Member Author
The reason will be displayed to describe this comment to others. Learn more.

@rdblue: yeah, that's exactly why the test cases are failing.

Also, it seems odd to me that HadoopCatalog extends BaseMetastoreCatalog; it is not really a metastore catalog and should be a file-system catalog.

I will most likely close this PR.

We are working on a catalog migration API (any catalog to any catalog, and of course we will contribute it here). The API itself is simple, but adding cross-catalog test cases is more work.
While adding that API, we call dropTable with purge = false on the source catalog after a table has been migrated. The Hadoop catalog was deleting the migrated table's data at that point too, hence this PR; the sketch below shows the flow.
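A sketch of that migration flow, assuming both catalogs are already configured; the cast and the helper method are illustrative, not the actual migration API:

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class CatalogMigrationSketch {
  static void migrate(Catalog source, Catalog target, TableIdentifier id) {
    // Re-register the table in the target catalog at its current
    // metadata file location.
    String metadataLocation =
        ((BaseTable) source.loadTable(id)).operations().current().metadataFileLocation();
    target.registerTable(id, metadataLocation);

    // Drop only the source catalog's entry. With purge = false the migrated
    // table's data and metadata files must survive, which is exactly what
    // the Hadoop catalog was not honoring before this fix.
    source.dropTable(id, false /* purge */);
  }
}
```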
