
[CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level #2210

Closed

Conversation

manishgupta88
Contributor

Problem: Compaction performance is slow compared to data loading.
Analysis:

  1. During compaction, result filling is done in row format, so the dimension and measure data filling time grows as the number of columns increases. This happens because row-wise filling cannot take advantage of OS cacheable buffers: each row read keeps jumping to the data of the next column.
  2. Compaction uses a page-level reader flow in which both IO and decompression happen per page, which inflates the total IO and decompression time.

Solution:

  1. Implement a columnar data-filling structure in the compaction flow for filling dimension and measure data (see the sketch below).
  2. Perform IO at blocklet level and decompression at page level.
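
For illustration, a minimal sketch (plain Java, not the PR's actual code; all class and method names are assumptions) contrasting row-wise and columnar result filling. The columnar variant consumes each column buffer sequentially, which is the access pattern this PR adopts for compaction:

```java
// Each value is one byte for simplicity, so columnData[c][r] is row r of column c.
public final class ColumnarFillSketch {

  // Row-wise filling: every output row touches every column buffer, so reads
  // keep alternating between unrelated memory regions (cache-unfriendly).
  static byte[][] fillRowWise(byte[][] columnData, int rowCount) {
    byte[][] rows = new byte[rowCount][columnData.length];
    for (int r = 0; r < rowCount; r++) {
      for (int c = 0; c < columnData.length; c++) {
        rows[r][c] = columnData[c][r];
      }
    }
    return rows;
  }

  // Columnar filling: each column buffer is copied sequentially before the
  // next one is touched, so the OS read-ahead and CPU cache stay effective.
  static byte[][] fillColumnar(byte[][] columnData, int rowCount) {
    byte[][] columns = new byte[columnData.length][];
    for (int c = 0; c < columnData.length; c++) {
      columns[c] = java.util.Arrays.copyOf(columnData[c], rowCount);
    }
    return columns;
  }

  public static void main(String[] args) {
    byte[][] columnData = { {1, 2, 3}, {4, 5, 6} }; // 2 columns x 3 rows
    System.out.println(java.util.Arrays.deepToString(fillRowWise(columnData, 3)));
    System.out.println(java.util.Arrays.deepToString(fillColumnar(columnData, 3)));
  }
}
```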

Be sure to complete all of the following checklist items to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@manishgupta88 manishgupta88 changed the title [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level [WIP] [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level Apr 23, 2018
@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4145/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4150/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5329/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4176/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4491/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5348/

```diff
@@ -45,31 +46,37 @@
    * @return
    */
   public static AbstractScannedResultCollector getScannedResultCollector(
-      BlockExecutionInfo blockExecutionInfo) {
+      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
```
Contributor:

Instead of changing the constructor, can we add the statistics model to BlockExecutionInfo?

Contributor Author:

I think it is better to pass it in the constructor instead of making the blockExecutionInfo object heavier. Also, the QueryStatisticsModel is at the reader level and one reader can have multiple blocks, so in my opinion keeping it in blockExecutionInfo would not be a good idea.

Contributor:

For one task we create only one QueryStatisticsModel, even when that task handles multiple blocks (in the case of merging small files). Passing it in the constructor or keeping it in blockExecutionInfo has the same cost (it is the same reference), yet just to hold the reference we are changing the constructors of multiple classes. BlockExecutionInfo is also just a holder. And if we add it to the abstract class constructor, in future every concrete class will need to maintain a QueryStatisticsModel even if it doesn't use it.

Contributor Author:

ok
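
For context, a minimal sketch of the approach this thread settled on: holding the QueryStatisticsModel in BlockExecutionInfo instead of widening every collector constructor. These are simplified stand-ins, not the actual CarbonData classes, and the subclass name is hypothetical.

```java
// One QueryStatisticsModel instance per task, shared even when the task
// handles multiple blocks.
class QueryStatisticsModel {
}

class BlockExecutionInfo {
  // Agreed approach: the holder object carries the shared reference, so the
  // collector constructors stay unchanged.
  private QueryStatisticsModel queryStatisticsModel;

  QueryStatisticsModel getQueryStatisticsModel() {
    return queryStatisticsModel;
  }

  void setQueryStatisticsModel(QueryStatisticsModel model) {
    this.queryStatisticsModel = model;
  }
}

abstract class AbstractScannedResultCollector {
  protected final QueryStatisticsModel queryStatisticsModel;

  AbstractScannedResultCollector(BlockExecutionInfo blockExecutionInfo) {
    // Read the shared reference from the holder instead of widening the
    // constructor signature (the alternative originally shown in the diff).
    this.queryStatisticsModel = blockExecutionInfo.getQueryStatisticsModel();
  }
}

class ResultCollectorSketch extends AbstractScannedResultCollector {
  ResultCollectorSketch(BlockExecutionInfo info) {
    super(info);
  }

  public static void main(String[] args) {
    BlockExecutionInfo info = new BlockExecutionInfo();
    info.setQueryStatisticsModel(new QueryStatisticsModel());
    AbstractScannedResultCollector collector = new ResultCollectorSketch(info);
    System.out.println(collector.queryStatisticsModel != null); // prints: true
  }
}
```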

```java
  /**
   * Dimension filling time includes the time taken for reading all dimensions' data from a given
   * offset and filling each column's data into a byte array. Includes the total time for one
   * query result iterator.
   */
  String DIMENSION_FILLING_TIME = "dimension filling time";
```
Contributor:

Change it to key column filling time

Contributor Author:

ok
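
As an aside, a self-contained sketch of how a filling-time statistic like this could be measured. The constant value follows the rename agreed above; the measurement code itself is an illustrative assumption, not the actual CarbonData statistics API.

```java
import java.util.concurrent.TimeUnit;

public final class FillingTimeSketch {
  // Renamed per the review: "key column filling time" instead of
  // "dimension filling time".
  static final String KEY_COLUMN_FILLING_TIME = "key column filling time";

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    TimeUnit.MILLISECONDS.sleep(5); // stand-in for the actual column filling work
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println(KEY_COLUMN_FILLING_TIME + ": " + elapsedMs + " ms");
  }
}
```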

@kumarvishal09
Contributor

I think we need these changes in DictionaryBasedResultCollector also: when the number of columns is more than 100 the query goes to DictionaryBasedResultCollector, so this will improve query performance when there are more than 100 projection columns.

@manishgupta88
Contributor Author

@kumarvishal09 ...I agree with you that we need these changes in DictionaryBasedResultCollector also, to improve query performance when the number of columns is greater than 100.
As this PR is specifically a compaction performance fix, I will raise a separate PR for that. I have raised a sub-task under the same JIRA to track the issue:
https://issues.apache.org/jira/browse/CARBONDATA-2405

@manishgupta88 manishgupta88 changed the title [WIP] [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level Apr 26, 2018
@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5420/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4252/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4558/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4265/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5432/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4289/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5454/

@ravipesala
Contributor

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5464/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4575/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4301/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4310/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5477/

@manishgupta88
Contributor Author

retest sdv please

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4625/

@kumarvishal09
Contributor

LGTM

@asfgit asfgit closed this in 26607fb Apr 30, 2018
anubhav100 pushed a commit to anubhav100/incubator-carbondata that referenced this pull request Jun 22, 2018
During compaction, result filling is done in row format, so as the number of columns increases the dimension and measure data filling time increases; row-wise filling cannot take advantage of OS cacheable buffers because we continuously read data for the next column. Implement a columnar data-filling structure for the compaction process for filling dimension and measure data.

This closes apache#2210