
[CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level #2210

Closed

Conversation

manishgupta88
Contributor

Problem: Compaction performance is slow compared to data loading.
Analysis:

  1. During compaction, result filling is done in row format, so the dimension and measure data filling time grows as the number of columns increases. This happens because row-wise filling cannot take advantage of OS cacheable buffers: each row read keeps jumping to the data of the next column.
  2. Compaction uses a page-level reader flow in which both IO and decompression happen per page, which inflates the total IO and decompression time.

Solution:

  1. Implement a columnar data-filling structure in the compaction flow for filling dimension and measure data (see the sketch below).
  2. Perform IO at blocklet level and decompression at page level.
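
For illustration, a minimal sketch (plain Java, not the PR's actual code; all class and method names are assumptions) contrasting row-wise and columnar result filling. The columnar variant consumes each column buffer sequentially, which is the access pattern this PR adopts for compaction:

```java
// Each value is one byte for simplicity, so columnData[c][r] is row r of column c.
public final class ColumnarFillSketch {

  // Row-wise filling: every output row touches every column buffer, so reads
  // keep alternating between unrelated memory regions (cache-unfriendly).
  static byte[][] fillRowWise(byte[][] columnData, int rowCount) {
    byte[][] rows = new byte[rowCount][columnData.length];
    for (int r = 0; r < rowCount; r++) {
      for (int c = 0; c < columnData.length; c++) {
        rows[r][c] = columnData[c][r];
      }
    }
    return rows;
  }

  // Columnar filling: each column buffer is copied sequentially before the
  // next one is touched, so the OS read-ahead and CPU cache stay effective.
  static byte[][] fillColumnar(byte[][] columnData, int rowCount) {
    byte[][] columns = new byte[columnData.length][];
    for (int c = 0; c < columnData.length; c++) {
      columns[c] = java.util.Arrays.copyOf(columnData[c], rowCount);
    }
    return columns;
  }

  public static void main(String[] args) {
    byte[][] columnData = { {1, 2, 3}, {4, 5, 6} }; // 2 columns x 3 rows
    System.out.println(java.util.Arrays.deepToString(fillRowWise(columnData, 3)));
    System.out.println(java.util.Arrays.deepToString(fillColumnar(columnData, 3)));
  }
}
```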

Be sure to complete all of the following checklist items to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@manishgupta88 manishgupta88 changed the title [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level [WIP] [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level Apr 23, 2018
@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4145/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4150/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5329/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4176/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4491/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5348/

```diff
@@ -45,31 +46,37 @@
    * @return
    */
   public static AbstractScannedResultCollector getScannedResultCollector(
-      BlockExecutionInfo blockExecutionInfo) {
+      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
```
Contributor:

Instead of changing the constructor, can we add the statistics model to BlockExecutionInfo?

Contributor Author:

I think it is better to pass it in the constructor instead of making the blockExecutionInfo object heavier. Also, the QueryStatisticsModel is at the reader level and one reader can have multiple blocks, so in my opinion keeping it in blockExecutionInfo would not be a good idea.

Contributor:

For one task we create only one QueryStatisticsModel, even when that task handles multiple blocks (in the case of merging small files). Passing it in the constructor or keeping it in blockExecutionInfo has the same cost (it is the same reference), yet just to hold the reference we are changing the constructors of multiple classes. BlockExecutionInfo is also just a holder. And if we add it to the abstract class constructor, in future every concrete class will need to maintain a QueryStatisticsModel even if it doesn't use it.

Contributor Author:

ok
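
For context, a minimal sketch of the approach this thread settled on: holding the QueryStatisticsModel in BlockExecutionInfo instead of widening every collector constructor. These are simplified stand-ins, not the actual CarbonData classes, and the subclass name is hypothetical.

```java
// One QueryStatisticsModel instance per task, shared even when the task
// handles multiple blocks.
class QueryStatisticsModel {
}

class BlockExecutionInfo {
  // Agreed approach: the holder object carries the shared reference, so the
  // collector constructors stay unchanged.
  private QueryStatisticsModel queryStatisticsModel;

  QueryStatisticsModel getQueryStatisticsModel() {
    return queryStatisticsModel;
  }

  void setQueryStatisticsModel(QueryStatisticsModel model) {
    this.queryStatisticsModel = model;
  }
}

abstract class AbstractScannedResultCollector {
  protected final QueryStatisticsModel queryStatisticsModel;

  AbstractScannedResultCollector(BlockExecutionInfo blockExecutionInfo) {
    // Read the shared reference from the holder instead of widening the
    // constructor signature (the alternative originally shown in the diff).
    this.queryStatisticsModel = blockExecutionInfo.getQueryStatisticsModel();
  }
}

class ResultCollectorSketch extends AbstractScannedResultCollector {
  ResultCollectorSketch(BlockExecutionInfo info) {
    super(info);
  }

  public static void main(String[] args) {
    BlockExecutionInfo info = new BlockExecutionInfo();
    info.setQueryStatisticsModel(new QueryStatisticsModel());
    AbstractScannedResultCollector collector = new ResultCollectorSketch(info);
    System.out.println(collector.queryStatisticsModel != null); // prints: true
  }
}
```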

```java
  /**
   * Dimension filling time includes the time taken for reading all dimensions' data from a given
   * offset and filling each column's data into a byte array. Includes the total time for one
   * query result iterator.
   */
  String DIMENSION_FILLING_TIME = "dimension filling time";
```
Contributor:

Change it to key column filling time

Contributor Author:

ok
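
As an aside, a self-contained sketch of how a filling-time statistic like this could be measured. The constant value follows the rename agreed above; the measurement code itself is an illustrative assumption, not the actual CarbonData statistics API.

```java
import java.util.concurrent.TimeUnit;

public final class FillingTimeSketch {
  // Renamed per the review: "key column filling time" instead of
  // "dimension filling time".
  static final String KEY_COLUMN_FILLING_TIME = "key column filling time";

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    TimeUnit.MILLISECONDS.sleep(5); // stand-in for the actual column filling work
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println(KEY_COLUMN_FILLING_TIME + ": " + elapsedMs + " ms");
  }
}
```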

@kumarvishal09
Contributor

I think we need these changes in DictionaryBasedResultCollector also: when the number of columns is more than 100 the query goes to DictionaryBasedResultCollector, so this will improve query performance when there are more than 100 projection columns.

@manishgupta88
Contributor Author

@kumarvishal09 ...I agree with you that we need these changes in DictionaryBasedResultCollector also, to improve query performance when the number of columns is greater than 100.
As this PR is specifically a compaction performance fix, I will raise a separate PR for that. I have raised a sub-task under the same JIRA to track the issue:
https://issues.apache.org/jira/browse/CARBONDATA-2405

@manishgupta88 manishgupta88 changed the title [WIP] [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level Apr 26, 2018
@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5420/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4252/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4558/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4265/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5432/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4289/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5454/

@ravipesala
Contributor

retest this please

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5464/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4575/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4301/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4310/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5477/

@manishgupta88
Contributor Author

retest sdv please

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4625/

@kumarvishal09
Contributor

LGTM

@asfgit asfgit closed this in 26607fb Apr 30, 2018
anubhav100 pushed a commit to anubhav100/incubator-carbondata that referenced this pull request Jun 22, 2018
During compaction, result filling is done in row format, so as the number of columns increases the dimension and measure data filling time increases; row-wise filling cannot take advantage of OS cacheable buffers because we continuously read data for the next column. Implement a columnar data-filling structure for the compaction process for filling dimension and measure data.

This closes apache#2210