[CARBONDATA-3052] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files #2868

Closed
wants to merge 1 commit into from

Conversation

manishgupta88
Contributor

Problem
The current drop table command takes more than 1 minute to delete 3000 files from HDFS.

Analysis
Even though we are using the HDFS file system, we explicitly iterate through the table folders recursively and delete each file. One RPC call is made to the NameNode for every file deletion and for every file listing. So deleting 3000 files costs 3000 NameNode RPC calls for the deletions, plus a few more RPC calls for listing the files in each folder.

Solution
HDFS provides an API that deletes all folders and files under a given path recursively in a single RPC call. Use that API to improve drop table performance.

Result: After these code changes, the time taken by the drop table operation to delete 3000 files from HDFS has dropped from more than 1 minute to ~2 seconds.
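
For illustration, a minimal sketch of the approach, assuming Hadoop's FileSystem client API and a hypothetical table path (this is not the exact CarbonData change):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class RecursiveDropExample {
  public static void main(String[] args) throws IOException {
    // Hypothetical table location used only for this sketch.
    Path tablePath = new Path("hdfs://namenode:8020/user/carbon/store/db1/t1");
    FileSystem fs = tablePath.getFileSystem(new Configuration());

    // A single call with recursive=true issues one delete RPC to the NameNode,
    // instead of one list/delete RPC per folder and file.
    boolean deleted = fs.delete(tablePath, true);
    if (!deleted) {
      throw new IOException("Failed to delete table path: " + tablePath);
    }
  }
}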

  • Any interfaces changed?
    No
  • Any backward compatibility impacted?
    No
  • Document update required?
    No
  • Testing done
    Verified on cluster
  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    NA

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1102/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1313/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9365/

@manishgupta88 manishgupta88 changed the title [WIP] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files [CARBONDATA-3052] Improve drop table performance by reducing the namenode RPC calls during physical deletion of files Oct 29, 2018
@manishgupta88
Contributor Author

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1115/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9377/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1325/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1118/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1330/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9382/

@@ -64,15 +67,14 @@ class SubqueryWithFilterAndSortTestCase extends QueryTest with BeforeAndAfterAll
     dis.close()
   }
   def deleteFile(filePath: String) {
-    val file = FileFactory.getCarbonFile(filePath, FileFactory.getFileType(filePath))
+    val file = new File(filePath)
Contributor

why is this modification needed?

Contributor Author

Not required. I will remove

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1134/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9398/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1346/

@jackylk
Contributor

jackylk commented Oct 30, 2018

If the table is on S3, will it behave correctly since it does not have "folder" concept?

/**
* This method will delete the files recursively from file system
*
* @return
Contributor

complete the comment

Contributor Author

ok

    try {
      return deleteFile(file.getAbsolutePath(), FileFactory.getFileType(file.getAbsolutePath()));
    } catch (IOException e) {
      LOGGER.error("Exception occurred:" + e.getMessage());
Contributor

include the exception in the error log

Contributor Author

ok
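
For reference, a hedged sketch of what the reviewer's suggestion could look like, assuming a Log4j-style LOGGER with a (message, throwable) overload; the deleteFile helper here is a simplified stand-in, not CarbonData's actual method:

import java.io.File;
import java.io.IOException;
import org.apache.log4j.Logger;

public class DeleteWithLogging {
  private static final Logger LOGGER = Logger.getLogger(DeleteWithLogging.class);

  // Simplified stand-in for the real file-system delete call.
  static boolean deleteFile(String path) throws IOException {
    return new File(path).delete();
  }

  public static boolean tryDelete(String path) {
    try {
      return deleteFile(path);
    } catch (IOException e) {
      // Passing the exception as the second argument logs the full stack trace,
      // not just the message text.
      LOGGER.error("Exception occurred: " + e.getMessage(), e);
      return false;
    }
  }
}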

@manishgupta88
Contributor Author

If the table is on S3, will it behave correctly since it does not have "folder" concept?

I have not changed any existing behavior, so it should work fine

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1167/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1380/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9431/

    CarbonFile carbonFile = FileFactory.getCarbonFile(path[i].getAbsolutePath());
    boolean delete = carbonFile.delete();
    if (!delete) {
      throw new IOException("Error while deleting the folders and files");
Contributor

better to print the file location

Contributor Author

ok

    deleteRecursive(file[i]);
    boolean delete = file[i].delete();
    if (!delete) {
      throw new IOException("Error while deleting the folders and files");
Contributor

better to print the file location

Contributor Author

ok
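
Similarly, a hedged sketch of including the failing location in the error message, using a simplified recursive-delete helper rather than the actual CarbonData code:

import java.io.File;
import java.io.IOException;

public class RecursiveDelete {
  public static void deleteRecursive(File dir) throws IOException {
    File[] children = dir.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursive(child);
      }
    }
    if (!dir.delete()) {
      // Name the exact path that could not be deleted so the failure is traceable.
      throw new IOException("Error while deleting the folders and files: "
          + dir.getAbsolutePath());
    }
  }
}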

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1176/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1389/

@manishgupta88
Contributor Author

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1179/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9443/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1392/

@jackylk
Contributor

jackylk commented Nov 1, 2018

LGTM

@asfgit asfgit closed this in 82eec10 Nov 1, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018
…node RPC calls during physical deletion of files

Problem
The current drop table command takes more than 1 minute to delete 3000 files from HDFS.

Analysis
Even though we are using the HDFS file system, we explicitly iterate through the table folders recursively and delete each file. One RPC call is made to the NameNode for every file deletion and for every file listing. So deleting 3000 files costs 3000 NameNode RPC calls for the deletions, plus a few more RPC calls for listing the files in each folder.

Solution
HDFS provides an API for deleting all folders and files recursively for a given path in a single RPC call. Use that API and improve the drop table operation performance.

Result: After these code changes, the time taken by the drop table operation to delete 3000 files from HDFS has dropped from more than 1 minute to ~2 seconds.

This closes #2868