[CARBONDATA-3906] Optimize sort performance in writing file #3847
base: master
Conversation
Can one of the admins verify this patch?
Add to whitelist
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3402/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1661/
please check the build failure info
// intermediate merging of sort temp files will be triggered
unsafeInMemoryIntermediateFileMerger.addFileToMerge(file);
} catch (IOException | MemoryException e) {
  e.printStackTrace();
use log4j instead of printStackTrace
ok, done.
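The fix the reviewer asks for, replacing `e.printStackTrace()` with a proper logger call, can be sketched as below. This is a minimal, self-contained illustration using `java.util.logging` as a stand-in (the actual patch uses the log4j `Logger` already imported in `SortParameters`); the class and message names here are hypothetical:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogInsteadOfPrint {
  private static final Logger LOGGER = Logger.getLogger(LogInsteadOfPrint.class.getName());

  // Simulates the catch block: log the exception with context
  // instead of calling e.printStackTrace().
  static String handleMerge() {
    final StringBuilder captured = new StringBuilder();
    Handler handler = new Handler() {
      @Override public void publish(LogRecord r) { captured.append(r.getMessage()); }
      @Override public void flush() { }
      @Override public void close() { }
    };
    LOGGER.addHandler(handler);
    try {
      throw new java.io.IOException("disk full");
    } catch (java.io.IOException e) {
      // the logger records both a contextual message and the stack trace
      LOGGER.log(Level.SEVERE, "Problem while adding file to merge: " + e.getMessage(), e);
    }
    LOGGER.removeHandler(handler);
    return captured.toString();
  }

  public static void main(String[] args) {
    String logged = handleMerge();
    if (!logged.contains("disk full")) {
      throw new AssertionError("expected exception message in log, got: " + logged);
    }
    System.out.println("ok");
  }
}
```

Unlike `printStackTrace()`, which writes unconditionally to stderr, a logger call goes through the configured appenders and carries a severity level, so operators can find the failure in the load logs.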
@@ -37,6 +40,13 @@
import org.apache.log4j.Logger;

public class SortParameters implements Serializable {

  private ExecutorService writeService = Executors.newFixedThreadPool(5,
Suggest making the core pool size of the thread pool configurable instead of hard-coding it
ok, done.
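The reviewer's suggestion, reading the pool size from configuration rather than hard-coding `5`, can be sketched as follows. This is an illustration only: the property name `carbon.sort.temp.file.write.pool.size` is hypothetical (the actual patch reads the value via CarbonData's own property mechanism), and a system property stands in for the configuration store:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConfigurablePool {
  // hypothetical property key, for illustration; the real CarbonData key differs
  static final String POOL_SIZE_PROP = "carbon.sort.temp.file.write.pool.size";

  // Build the write service with a configurable core pool size,
  // falling back to the number of available processors.
  static ExecutorService newWriteService() {
    int defaultSize = Runtime.getRuntime().availableProcessors();
    int size = Integer.parseInt(
        System.getProperty(POOL_SIZE_PROP, String.valueOf(defaultSize)));
    return Executors.newFixedThreadPool(size);
  }

  public static void main(String[] args) throws Exception {
    System.setProperty(POOL_SIZE_PROP, "3");
    ExecutorService pool = newWriteService();
    Future<Integer> result = pool.submit(() -> 21 * 2);
    System.out.println(result.get()); // prints 42
    pool.shutdown();
  }
}
</code>
```

Making the size configurable matters here because the right number of writer threads depends on disk bandwidth and on how many cores the load is already using.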
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1693/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3435/
retest this please
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1695/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3437/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3447/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1705/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1718/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3460/
retest this please
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1722/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3464/
@shunlean: please handle the comments given by @Zhangshunyu
@@ -99,6 +101,8 @@ public void initialize(SortParameters sortParameters) {
UnsafeSortDataRows[] sortDataRows = new UnsafeSortDataRows[columnRangeInfo.getNumOfRanges()];
intermediateFileMergers = new UnsafeIntermediateMerger[columnRangeInfo.getNumOfRanges()];
SortParameters[] sortParameterArray = new SortParameters[columnRangeInfo.getNumOfRanges()];
this.writeService = Executors.newFixedThreadPool(originSortParameters.getNumberOfCores(),
If we increase carbon.number.of.cores.while.loading, there will be more UnsafeSortDataRows instances, and writing temp files can finish faster without any of these changes.
Is it necessary to introduce another level of multi-threading here?
Please share your opinion @kevinjmh @kumarvishal09
@ajantha-bhat Good point. So the only difference is whether we add threads horizontally or vertically. If each thread takes the same time to process its data and writes at the same time, performance may degrade due to IO preemption. But the difference may not be big when the number of input splits is large enough. @shunlean could you please run some tests to confirm?
@kevinjmh: Yes, if cores are available, adding threads horizontally can speed up not just sort but the other steps in data loading as well.
If cores are not available, adding threads vertically is also of no use, as they will end up waiting for CPU.
So I feel the changes in this PR are not required, and the user can instead increase carbon.number.of.cores.while.loading
Why is this PR needed?
The write (sort temp file) operation can only run after sorting finishes.
For better performance, we want to run the writeDataToFile and SortDataRows operations in parallel.
What changes were proposed in this PR?
In (Unsafe)SortDataRows, we add new threads to run the file-write operation.
In one test case, about 10% of the loading time was saved by this parallel operation.
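The idea of the change, handing each sorted batch to a writer thread pool so the next batch can be sorted while the previous one is being written, can be sketched as below. This is a simplified stand-alone model, not the actual CarbonData code: the batch data and the `written` list (which stands in for writing a sort temp file) are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSortWrite {
  public static void main(String[] args) throws Exception {
    ExecutorService writeService = Executors.newFixedThreadPool(2);
    List<Future<?>> pending = new ArrayList<>();
    List<int[]> written = Collections.synchronizedList(new ArrayList<>());

    // simulate batches of rows arriving to be sorted
    for (int b = 0; b < 4; b++) {
      final int[] batch = {3 - b, b, 2};
      Arrays.sort(batch);                      // sort on the current thread
      pending.add(writeService.submit(() -> {  // overlap: writer pool persists the batch
        written.add(batch);                    // stands in for writing a sort temp file
      }));
    }
    for (Future<?> f : pending) {
      f.get();                                 // wait for all writes before merge starts
    }
    writeService.shutdown();
    System.out.println(written.size());        // prints 4
  }
}
```

Whether this helps depends on the point raised in the review discussion: if all cores are already busy (e.g. via carbon.number.of.cores.while.loading), the extra writer threads just contend for CPU and IO instead of overlapping with the sort.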
Does this PR introduce any user interface change?
Is any new testcase added?