[CARBONDATA-3906] Optimize sort performance in writing file #3847
base: master
Conversation
Can one of the admins verify this patch?
Add to whitelist
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3402/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1661/
please check the build failure info
// intermediate merging of sort temp files will be triggered
unsafeInMemoryIntermediateFileMerger.addFileToMerge(file);
} catch (IOException | MemoryException e) {
  e.printStackTrace();
use log4j instead of printStackTrace
ok, done.
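The fix the reviewer asks for, replacing `e.printStackTrace()` with a proper logger call, can be sketched as below. This is a minimal, self-contained illustration using `java.util.logging` as a stand-in (the actual patch uses the log4j `Logger` already imported in `SortParameters`); the class and message names here are hypothetical:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogInsteadOfPrint {
  private static final Logger LOGGER = Logger.getLogger(LogInsteadOfPrint.class.getName());

  // Simulates the catch block: log the exception with context
  // instead of calling e.printStackTrace().
  static String handleMerge() {
    final StringBuilder captured = new StringBuilder();
    Handler handler = new Handler() {
      @Override public void publish(LogRecord r) { captured.append(r.getMessage()); }
      @Override public void flush() { }
      @Override public void close() { }
    };
    LOGGER.addHandler(handler);
    try {
      throw new java.io.IOException("disk full");
    } catch (java.io.IOException e) {
      // the logger records both a contextual message and the stack trace
      LOGGER.log(Level.SEVERE, "Problem while adding file to merge: " + e.getMessage(), e);
    }
    LOGGER.removeHandler(handler);
    return captured.toString();
  }

  public static void main(String[] args) {
    String logged = handleMerge();
    if (!logged.contains("disk full")) {
      throw new AssertionError("expected exception message in log, got: " + logged);
    }
    System.out.println("ok");
  }
}
```

Unlike `printStackTrace()`, which writes unconditionally to stderr, a logger call goes through the configured appenders and carries a severity level, so operators can find the failure in the load logs.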
@@ -37,6 +40,13 @@
import org.apache.log4j.Logger;

public class SortParameters implements Serializable {

  private ExecutorService writeService = Executors.newFixedThreadPool(5,
Suggest making the core pool size of the thread pool configurable instead of hard-coding it
ok, done.
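The reviewer's suggestion, reading the pool size from configuration rather than hard-coding `5`, can be sketched as follows. This is an illustration only: the property name `carbon.sort.temp.file.write.pool.size` is hypothetical (the actual patch reads the value via CarbonData's own property mechanism), and a system property stands in for the configuration store:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConfigurablePool {
  // hypothetical property key, for illustration; the real CarbonData key differs
  static final String POOL_SIZE_PROP = "carbon.sort.temp.file.write.pool.size";

  // Build the write service with a configurable core pool size,
  // falling back to the number of available processors.
  static ExecutorService newWriteService() {
    int defaultSize = Runtime.getRuntime().availableProcessors();
    int size = Integer.parseInt(
        System.getProperty(POOL_SIZE_PROP, String.valueOf(defaultSize)));
    return Executors.newFixedThreadPool(size);
  }

  public static void main(String[] args) throws Exception {
    System.setProperty(POOL_SIZE_PROP, "3");
    ExecutorService pool = newWriteService();
    Future<Integer> result = pool.submit(() -> 21 * 2);
    System.out.println(result.get()); // prints 42
    pool.shutdown();
  }
}
</code>
```

Making the size configurable matters here because the right number of writer threads depends on disk bandwidth and on how many cores the load is already using.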
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1693/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3435/
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1695/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3437/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3447/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1705/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1718/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3460/
retest this please
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1722/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3464/
@shunlean: please handle the comments given by @Zhangshunyu
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),
If we increase carbon.number.of.cores.while.loading, there will be more UnsafeSortDataRows instances, and writing temp files can finish faster without any of these changes.
Is it necessary to introduce another level of multi-threading here?
Please share your opinion @kevinjmh @kumarvishal09
@ajantha-bhat Good point. So the only difference is whether we add threads horizontally or vertically. If each thread takes the same time to process its data and writes at the same time, performance may degrade due to IO preemption. But the difference may not be big when the number of input splits is large enough. @shunlean could you please run some tests to confirm?
@kevinjmh: Yes, if cores are available, adding threads horizontally can speed up not just sort but the other steps in data loading as well.
If cores are not available, adding threads vertically is also of no use, as they will end up waiting for CPU.
So I feel the changes in this PR are not required, and the user can instead increase carbon.number.of.cores.while.loading
Why is this PR needed?
The write (sort temp file) operation can only run after sorting finishes.
For better performance, we want to run the writeDataToFile and SortDataRows operations in parallel.
What changes were proposed in this PR?
In (Unsafe)SortDataRows, we add new threads to run the file-write operation.
In one test case, about 10% of the loading time was saved by this parallel operation.
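The idea of the change, handing each sorted batch to a writer thread pool so the next batch can be sorted while the previous one is being written, can be sketched as below. This is a simplified stand-alone model, not the actual CarbonData code: the batch data and the `written` list (which stands in for writing a sort temp file) are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSortWrite {
  public static void main(String[] args) throws Exception {
    ExecutorService writeService = Executors.newFixedThreadPool(2);
    List<Future<?>> pending = new ArrayList<>();
    List<int[]> written = Collections.synchronizedList(new ArrayList<>());

    // simulate batches of rows arriving to be sorted
    for (int b = 0; b < 4; b++) {
      final int[] batch = {3 - b, b, 2};
      Arrays.sort(batch);                      // sort on the current thread
      pending.add(writeService.submit(() -> {  // overlap: writer pool persists the batch
        written.add(batch);                    // stands in for writing a sort temp file
      }));
    }
    for (Future<?> f : pending) {
      f.get();                                 // wait for all writes before merge starts
    }
    writeService.shutdown();
    System.out.println(written.size());        // prints 4
  }
}
```

Whether this helps depends on the point raised in the review discussion: if all cores are already busy (e.g. via carbon.number.of.cores.while.loading), the extra writer threads just contend for CPU and IO instead of overlapping with the sort.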
Does this PR introduce any user interface change?
Is any new testcase added?