[CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level #2210
Conversation
Force-pushed 857e5d2 to 376253f (compare)
Force-pushed 376253f to 3c33c71 (compare)
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4145/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4150/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5329/
Force-pushed 3c33c71 to b42dc5a (compare)
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4176/
SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4491/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5348/
@@ -45,31 +46,37 @@
    * @return
    */
   public static AbstractScannedResultCollector getScannedResultCollector(
-      BlockExecutionInfo blockExecutionInfo) {
+      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
Instead of changing the constructor, can we add the statistics model to BlockExecutionInfo?
I think it is better to pass it in the constructor instead of making the BlockExecutionInfo object heavier. Also, the QueryStatisticsModel is at the reader level and one reader can have multiple blocks, so in my opinion keeping it in BlockExecutionInfo would not be a good idea.
For one task we create only one QueryStatisticsModel, even when one task handles multiple blocks (in the case of merging small files). Passing it in the constructor or keeping it in BlockExecutionInfo has the same cost (the same reference). Just to hold the reference we are changing multiple classes' constructors, and BlockExecutionInfo is also just a holder. Besides, if we add it to the abstract class constructor, in future every concrete class will need to maintain a QueryStatisticsModel even if it doesn't use it.
ok
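The two wiring options debated above can be sketched as follows. This is a simplified, hypothetical illustration of the trade-off, not the actual CarbonData classes; field and class bodies are stubbed out.

```java
// Hypothetical sketch of the two options discussed in this thread.
class QueryStatisticsModel { }

class BlockExecutionInfo {
    // Option A (reviewer's suggestion): hold the reader-level model on the
    // per-block holder object, so constructors stay unchanged.
    QueryStatisticsModel queryStatisticsModel;
}

abstract class AbstractScannedResultCollector {
    protected final BlockExecutionInfo blockExecutionInfo;
    protected final QueryStatisticsModel queryStatisticsModel;

    // Option B (what the PR does): pass the model explicitly, since one
    // reader-level QueryStatisticsModel can span multiple blocks and
    // BlockExecutionInfo is per block.
    AbstractScannedResultCollector(BlockExecutionInfo blockExecutionInfo,
                                   QueryStatisticsModel queryStatisticsModel) {
        this.blockExecutionInfo = blockExecutionInfo;
        this.queryStatisticsModel = queryStatisticsModel;
    }
}
```

With Option B every concrete collector must thread the model through its constructor; with Option A the per-block holder grows a field that logically belongs to the reader. The thread above settles on Option B.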
   * dimension filling time includes time taken for reading all dimensions data from a given offset
   * and filling each column data to byte array. Includes total time for 1 query result iterator.
   */
  String DIMENSION_FILLING_TIME = "dimension filling time";
Change it to "key column filling time".
ok
I think we need these changes in DictionaryBasedResultCollector also: when the number of columns is more than 100, the flow goes to DictionaryBasedResultCollector, so this will improve query performance when the number of projection columns is more than 100.
@kumarvishal09 I agree with you that we need these changes in DictionaryBasedResultCollector also, to improve query performance when the number of columns is greater than 100.
Force-pushed b42dc5a to 9fb1e7d (compare)
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5420/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4252/
Force-pushed 9fb1e7d to 72444b6 (compare)
SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4558/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4265/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5432/
Force-pushed 72444b6 to cb7a4fd (compare)
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4289/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5454/
retest this please
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5464/
SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4575/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4301/
Analysis:
1. During compaction, result filling is done in row format. Because of this, the dimension and measure data filling time increases as the number of columns grows: row filling cannot take advantage of OS-cacheable buffers, since we continuously jump to read data for the next column.
2. Compaction uses a page-level reader flow in which both IO and uncompression are done at page level, so IO and uncompression time increase in this model.

Solution:
1. Implement a columnar-format filling data structure for the compaction process for filling dimension and measure data.
2. Perform IO at blocklet level and uncompression at page level.
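The access-pattern difference behind point 1 can be sketched as below. This is an illustrative toy, not CarbonData code: row-format filling touches every column for each row (strided reads across column buffers), while columnar filling copies each column's backing array in one sequential pass, which is what lets OS read-ahead and cacheable buffers help.

```java
// Illustrative sketch of row-format vs columnar-format result filling.
class ColumnarFillSketch {
    // Row-format filling: for each row, read one value from every column.
    // The reads stride across all column buffers, defeating read-ahead.
    static byte[][] fillRowFormat(byte[][] columnData, int rowCount) {
        byte[][] rows = new byte[rowCount][columnData.length];
        for (int row = 0; row < rowCount; row++) {
            for (int col = 0; col < columnData.length; col++) {
                rows[row][col] = columnData[col][row];
            }
        }
        return rows;
    }

    // Columnar filling: copy each column's data in one sequential pass,
    // so each buffer is read front to back exactly once.
    static byte[][] fillColumnarFormat(byte[][] columnData, int rowCount) {
        byte[][] columns = new byte[columnData.length][];
        for (int col = 0; col < columnData.length; col++) {
            columns[col] = java.util.Arrays.copyOf(columnData[col], rowCount);
        }
        return columns;
    }
}
```

Both produce the same values; only the memory access order differs, which is why the benefit grows with the number of columns.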
Force-pushed cb7a4fd to d966ab0 (compare)
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4310/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5477/
retest sdv please
SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4625/
LGTM
During compaction, result filling is done in row format. Because of this, the dimension and measure data filling time increases as the number of columns grows, since row filling cannot take advantage of OS-cacheable buffers as we continuously read data for the next column. Implement a columnar-format filling data structure for the compaction process for filling dimension and measure data. This closes apache#2210
Problem: Compaction performance is slow as compared to data load
Analysis:
Solution:
Be sure to complete all of the following checklist items to help us incorporate
your contribution quickly and easily:
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance test report.
- Any additional information to help reviewers in testing this change.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.