
[WIP] Core: Fix drop table without purge for hadoop catalog #5283

Closed
ajantha-bhat wants to merge 1 commit into apache:master from ajantha-bhat:hadoop

Conversation

@ajantha-bhat
Member

When dropTable() is called with purge=false, it deletes the data and metadata files.

The expected behaviour for purge = false is that the data and metadata files are left intact; only the catalog's entries should be deleted.
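For illustration, a minimal sketch of that contract against the Iceberg `Catalog` API; the `db.tbl` identifier and the wrapper class are hypothetical:

```java
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class DropTableContractSketch {
  // `catalog` is any configured Iceberg catalog; "db.tbl" is a placeholder.
  static void demonstrate(Catalog catalog) {
    TableIdentifier id = TableIdentifier.of("db", "tbl");

    // purge = false: only the catalog entry should be removed;
    // data and metadata files must stay where they are.
    catalog.dropTable(id, false /* purge */);

    // purge = true: the entry is removed AND data/metadata files are deleted.
    // catalog.dropTable(id, true /* purge */);
  }
}
```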

@github-actions bot added the core label on Jul 15, 2022
@ajantha-bhat marked this pull request as draft on July 15, 2022 08:11
@ajantha-bhat changed the title from "Core: Fix drop table without purge for hadoop catalog" to "[WIP] Core: Fix drop table without purge for hadoop catalog" on Jul 15, 2022
@ajantha-bhat
Member Author

The problem with the test cases is that, by default, Spark issues a plain "DROP TABLE" SQL statement, which doesn't purge the data. But because the warehouse is a temp dir, cleanup happens automatically at the end of each test case.

Now that the Hadoop catalog supports purge = false, the test cases no longer clean up the data, and they fail with a table-already-exists error.

Also, even without the version-hint file, the Hadoop catalog derives the version by reading the metadata file names.

I probably need to modify the test cases to use "DROP TABLE PURGE" SQL, or stop deriving the version info when the version-hint file doesn't exist (but I'm not sure about the impact of the latter). A sketch of the test-side option follows below.
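A minimal sketch of that test-side option, assuming a `SparkSession` wired to an Iceberg catalog; the table name and helper class are hypothetical:

```java
import org.apache.spark.sql.SparkSession;

class DropTablePurgeSketch {
  static void dropWithPurge(SparkSession spark) {
    // A plain "DROP TABLE db.tbl" would now leave the files behind,
    // so tests that rely on cleanup need the explicit PURGE clause.
    spark.sql("DROP TABLE db.tbl PURGE");
  }
}
```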

Review comment on the diff (the dropTable path in the Hadoop catalog):

```java
      CatalogUtil.dropTableData(ops.io(), lastMetadata);
      return fs.delete(tablePath, true /* recursive */);
    } else {
      // just drop the version-hint.txt file
```
@rdblue
Contributor

The version hint is a hint. If any metadata file exists, then the table will still exist.

I think that the original version is correct. For Hadoop tables, dropping a table means deleting its directory. The confusion here is one reason why Hadoop tables are not recommended for production use.
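To illustrate the point above, a sketch of how the latest version can be recovered without version-hint.txt, assuming the `vN.metadata.json` naming that Hadoop tables use; the helper itself is hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LatestVersionSketch {
  private static final Pattern VERSION = Pattern.compile("^v(\\d+)\\.metadata\\.json$");

  // Scans metadata file names such as "v3.metadata.json" and returns the
  // highest version found, or -1 if no metadata file exists (no table).
  static int latestVersion(String[] fileNames) {
    int latest = -1;
    for (String name : fileNames) {
      Matcher m = VERSION.matcher(name);
      if (m.matches()) {
        latest = Math.max(latest, Integer.parseInt(m.group(1)));
      }
    }
    return latest;
  }
}
```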

@ajantha-bhat
Member Author
The reason will be displayed to describe this comment to others. Learn more.

@rdblue: yeah, that's exactly why the test cases are failing.

Also, it seems odd to me that HadoopCatalog extends BaseMetastoreCatalog; it is not really a metastore catalog and should be a file-system catalog.

I will most likely close this PR.

We are working on a catalog migration API (any catalog to any catalog, and of course we will contribute it here). The API itself is simple, but adding cross-catalog test cases is more work.
While adding that API, we call dropTable with purge = false on the source catalog after a table has been migrated. The Hadoop catalog was deleting the migrated table's data at that point too, hence this PR; the sketch below shows the flow.
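A sketch of that migration flow, assuming both catalogs are already configured; the cast and the helper method are illustrative, not the actual migration API:

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class CatalogMigrationSketch {
  static void migrate(Catalog source, Catalog target, TableIdentifier id) {
    // Re-register the table in the target catalog at its current
    // metadata file location.
    String metadataLocation =
        ((BaseTable) source.loadTable(id)).operations().current().metadataFileLocation();
    target.registerTable(id, metadataLocation);

    // Drop only the source catalog's entry. With purge = false the migrated
    // table's data and metadata files must survive, which is exactly what
    // the Hadoop catalog was not honoring before this fix.
    source.dropTable(id, false /* purge */);
  }
}
```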
