[CARBONDATA-3052] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files #2868

Closed
wants to merge 1 commit into from

Conversation

manishgupta88
Contributor

Problem
The current drop table command takes more than 1 minute to delete 3000 files from HDFS.

Analysis
Even though we are using the HDFS file system, we explicitly iterate through the table folders recursively and delete each file. One RPC call is made to the NameNode for every file deletion and for every file listing. So deleting 3000 files costs 3000 NameNode RPC calls for the deletions, plus a few more RPC calls for listing the files in each folder.

Solution
HDFS provides an API that deletes all folders and files under a given path recursively in a single RPC call. Use that API to improve drop table performance.

Result: After these code changes, the time taken by the drop table operation to delete 3000 files from HDFS has dropped from more than 1 minute to ~2 seconds.
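
For illustration, a minimal sketch of the approach, assuming Hadoop's FileSystem client API and a hypothetical table path (this is not the exact CarbonData change):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class RecursiveDropExample {
  public static void main(String[] args) throws IOException {
    // Hypothetical table location used only for this sketch.
    Path tablePath = new Path("hdfs://namenode:8020/user/carbon/store/db1/t1");
    FileSystem fs = tablePath.getFileSystem(new Configuration());

    // A single call with recursive=true issues one delete RPC to the NameNode,
    // instead of one list/delete RPC per folder and file.
    boolean deleted = fs.delete(tablePath, true);
    if (!deleted) {
      throw new IOException("Failed to delete table path: " + tablePath);
    }
  }
}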

  • Any interfaces changed?
    No
  • Any backward compatibility impacted?
    No
  • Document update required?
    No
  • Testing done
    Verified on cluster
  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    NA

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1102/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1313/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9365/

@manishgupta88 manishgupta88 changed the title [WIP] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files [CARBONDATA-3052] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files Oct 29, 2018
@manishgupta88
Contributor Author

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1115/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9377/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1325/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1118/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1330/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9382/

@@ -64,15 +67,14 @@ class SubqueryWithFilterAndSortTestCase extends QueryTest with BeforeAndAfterAll
     dis.close()
   }
   def deleteFile(filePath: String) {
-    val file = FileFactory.getCarbonFile(filePath, FileFactory.getFileType(filePath))
+    val file = new File(filePath)
Contributor

why is this modification needed?

Contributor Author

Not required. I will remove

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1134/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9398/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1346/

@jackylk
Contributor

jackylk commented Oct 30, 2018

If the table is on S3, will it behave correctly since it does not have "folder" concept?

/**
* This method will delete the files recursively from file system
*
* @return
Contributor

complete the comment

Contributor Author

ok

    try {
      return deleteFile(file.getAbsolutePath(), FileFactory.getFileType(file.getAbsolutePath()));
    } catch (IOException e) {
      LOGGER.error("Exception occurred:" + e.getMessage());
Contributor

include the exception in the error log

Contributor Author

ok
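
For reference, a hedged sketch of what the reviewer's suggestion could look like, assuming a Log4j-style LOGGER with a (message, throwable) overload; the deleteFile helper here is a simplified stand-in, not CarbonData's actual method:

import java.io.File;
import java.io.IOException;
import org.apache.log4j.Logger;

public class DeleteWithLogging {
  private static final Logger LOGGER = Logger.getLogger(DeleteWithLogging.class);

  // Simplified stand-in for the real file-system delete call.
  static boolean deleteFile(String path) throws IOException {
    return new File(path).delete();
  }

  public static boolean tryDelete(String path) {
    try {
      return deleteFile(path);
    } catch (IOException e) {
      // Passing the exception as the second argument logs the full stack trace,
      // not just the message text.
      LOGGER.error("Exception occurred: " + e.getMessage(), e);
      return false;
    }
  }
}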

@manishgupta88
Contributor Author

If the table is on S3, will it behave correctly since it does not have "folder" concept?

I have not changed any existing behavior, so it should work fine

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1167/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1380/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9431/

    CarbonFile carbonFile = FileFactory.getCarbonFile(path[i].getAbsolutePath());
    boolean delete = carbonFile.delete();
    if (!delete) {
      throw new IOException("Error while deleting the folders and files");
Contributor

better to print the file location

Contributor Author

ok

    deleteRecursive(file[i]);
    boolean delete = file[i].delete();
    if (!delete) {
      throw new IOException("Error while deleting the folders and files");
Contributor

better to print the file location

Contributor Author

ok
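
Similarly, a hedged sketch of including the failing location in the error message, using a simplified recursive-delete helper rather than the actual CarbonData code:

import java.io.File;
import java.io.IOException;

public class RecursiveDelete {
  public static void deleteRecursive(File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursive(child);
      }
    }
    if (!dir.delete()) {
      // Name the exact path that could not be deleted so the failure is traceable.
      throw new IOException("Error while deleting the folders and files: "
          + dir.getAbsolutePath());
    }
  }
}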

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1176/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1389/

@manishgupta88
Contributor Author

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1179/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9443/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1392/

@jackylk
Contributor

jackylk commented Nov 1, 2018

LGTM

@asfgit asfgit closed this in 82eec10 Nov 1, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018
…node RPC calls during physical deletion of files

Problem
The current drop table command takes more than 1 minute to delete 3000 files from HDFS.

Analysis
Even though we are using the HDFS file system, we explicitly iterate through the table folders recursively and delete each file. One RPC call is made to the NameNode for every file deletion and for every file listing. So deleting 3000 files costs 3000 NameNode RPC calls for the deletions, plus a few more RPC calls for listing the files in each folder.

Solution
HDFS provides an API for deleting all folders and files recursively for a given path in a single RPC call. Use that API and improve the drop table operation performance.

Result: After these code changes, the time taken by the drop table operation to delete 3000 files from HDFS has dropped from more than 1 minute to ~2 seconds.

This closes #2868