[CARBONDATA-2836]Fixed data loading performance issue #2611
Conversation
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7801/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6525/
SDV Build Failed, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6180/
Force-pushed from 5a2ebf3 to a43e4e9
SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6181/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7802/
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6526/
@kumarvishal09 Can you explain this modification? With the previous implementation, we saw about an 8% improvement in data loading.
@xuchuanyin Please check the description.
@@ -31,6 +25,7 @@
  private int[] dictSortDims;
  private byte[][] noDictSortDims;
  private byte[] noSortDimsAndMeasures;
+ private Object[] measures;
Add a comment explaining why this field is needed.
sortTempRow.unpackNoSortFromBytes(dictNoSortDims, noDictNoSortAndVarcharComplexDims, measures,
    this.dataTypes, this.varcharDimCnt, this.complexDimCnt);
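For context, the call above unpacks the no-sort dimensions and measures that are stored as a packed byte region of the sort temp row. A minimal sketch of that kind of unpacking is below; the binary layout, class, and method names here are illustrative assumptions, not CarbonData's actual format.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: assumed layout is
// [int dictDimCount][int dict values...][int measureCount][double measures...]
class NoSortUnpacker {

  /** Decode packed no-sort bytes into caller-provided output arrays. */
  static void unpack(byte[] packed, int[] dictDimsOut, double[] measuresOut) {
    ByteBuffer buf = ByteBuffer.wrap(packed);
    int dictCnt = buf.getInt();
    for (int i = 0; i < dictCnt; i++) {
      dictDimsOut[i] = buf.getInt();
    }
    int measureCnt = buf.getInt();
    for (int i = 0; i < measureCnt; i++) {
      measuresOut[i] = buf.getDouble();
    }
  }

  /** Inverse operation, useful for round-trip testing the layout. */
  static byte[] pack(int[] dictDims, double[] measures) {
    ByteBuffer buf = ByteBuffer.allocate(4 + 4 * dictDims.length + 4 + 8 * measures.length);
    buf.putInt(dictDims.length);
    for (int d : dictDims) buf.putInt(d);
    buf.putInt(measures.length);
    for (double m : measures) buf.putDouble(m);
    return buf.array();
  }
}
```

Keeping the row packed as bytes until it is actually needed avoids object allocation during sorting; the unpack step is where that cost is finally paid.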
/** | ||
* Read intermediate sort temp row from InputStream. |
Update the comment
SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6186/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7809/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6533/
LGTM
Problem: Data loading takes more time when the number of records is high (3.5 billion records). Root Cause: During the final merge, sort temp row conversion is done in the main thread, which slows down the final step. Solution: Move the conversion logic to the pre-fetch thread for parallel processing. This is done only for the final merge; the intermediate merge does not need to convert no-sort columns. This closes #2611
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Tested with 3.5 billion records ...
Older time: 4.7 hours
New time: 4.3 hours
Total improvement: 24 minutes
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.