Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CARBONDATA-3906] Optimize sort performance in writting file #3847

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

shunlean
Copy link

Why is this PR needed?

Only after sorting temp, the write(sortTemp file) operation can run.
For better performance, we want to do the writeDataToFile and SortDataRows operations in parallel.

What changes were proposed in this PR?

In (Unsafe)SortDataRows, we add new threads to run write the file operation.
About 10% time is reduced with parallel operation in one case.

Does this PR introduce any user interface change?

  • No
  • Yes. (please explain the change and update document)

Is any new testcase added?

  • No
  • Yes

@CarbonDataQA1
Copy link

Can one of the admins verify this patch?

@ajantha-bhat
Copy link
Member

Add to whitelist

@ajantha-bhat
Copy link
Member

retest this please

@CarbonDataQA1
Copy link

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3402/

@CarbonDataQA1
Copy link

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1661/

@shunlean shunlean marked this pull request as ready for review July 17, 2020 01:36
@Zhangshunyu
Copy link
Contributor

please check the build failure info

// intermediate merging of sort temp files will be triggered
unsafeInMemoryIntermediateFileMerger.addFileToMerge(file);
} catch (IOException | MemoryException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use log4j instead of printStackStrace

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done.

@@ -37,6 +40,13 @@
import org.apache.log4j.Logger;

public class SortParameters implements Serializable {

private ExecutorService writeService = Executors.newFixedThreadPool(5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to make it configurable when set core pool size for threadpool

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,done.

@shunlean
Copy link
Author

retest this please

@CarbonDataQA1
Copy link

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1693/

@CarbonDataQA1
Copy link

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3435/

@shunlean
Copy link
Author

retest this please

@CarbonDataQA1
Copy link

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1695/

@CarbonDataQA1
Copy link

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3437/

@CarbonDataQA1
Copy link

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3447/

@CarbonDataQA1
Copy link

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1705/

@CarbonDataQA1
Copy link

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1718/

@CarbonDataQA1
Copy link

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3460/

@shunlean
Copy link
Author

retest this please

@CarbonDataQA1
Copy link

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1722/

@CarbonDataQA1
Copy link

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3464/

@ajantha-bhat
Copy link
Member

@shunlean : please handle the comments given by @Zhangshunyu

@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we increase carbon.number.of.cores.while.loading, there will be more UnsafeSortDataRows and writing temp files can finish faster without any of these changes.

Is it necessary to introduce another multi-thread here ?
please tell your opinion @kevinjmh @kumarvishal09

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajantha-bhat Good point. So the only difference is adding threads horizontally or vertically. If each thread takes same time to process the data and writes at same time, performance may degrade caused by IO preemption. But the different may not big when number of input split is large enough. @shunlean could you please do some test to confirm ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjmh : Yes, If cores are available, adding threads horizontally can speedup not just sort, but other steps in data loading also.
If cores are not available, adding threads vertically also no use as they will end up waiting for cpu.

so, I felt. This PR changes not required and user can increase carbon.number.of.cores.while.loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants