[CARBONDATA-3088][Compaction] support prefetch for compaction #2906

Closed

Conversation

@xuchuanyin (Contributor) commented on Nov 7, 2018

Current compaction performance is low. By adding logs to observe the
compaction procedure, we found that
`CarbonFactDataHandlerColumnar.addDataToStore(CarbonRow)` waits about
30 ms before submitting a new TablePage producer. Since the method
`addDataToStore` is called from a single thread, this wait occurs every
32000 records, because 32000 records are collected to form a TablePage.

To reduce the waiting time, we can prepare the 32000 records ahead of
time. This can be achieved using prefetch.

We will prepare two buffers: one provides records to the downstream
(`addDataToStore`) while the other prepares records asynchronously. The
first is called the working buffer and the second the backup buffer.
Once the working buffer is exhausted, the two buffers exchange roles:
the backup buffer becomes the new working buffer, and the old working
buffer becomes the new backup buffer and is refilled asynchronously.
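
To make the buffer swap concrete, here is a minimal, self-contained sketch of the scheme. It is not the actual CarbonData implementation: the class and member names (DoubleBufferPrefetcher, fill, and so on) are invented for illustration, and only the working/backup swap described above is modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative double-buffer prefetcher: the working buffer serves the
// consumer while the backup buffer is filled asynchronously; when the
// working buffer is exhausted, the two buffers swap roles.
class DoubleBufferPrefetcher<T> {
  private final Iterator<T> source;
  private final int batchSize;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  private List<T> workingBuffer;
  private Future<List<T>> backupBuffer;
  private int cursor;

  DoubleBufferPrefetcher(Iterator<T> source, int batchSize) {
    this.source = source;
    this.batchSize = batchSize;
    this.workingBuffer = fill();                      // first batch, synchronous
    this.backupBuffer = executor.submit(this::fill);  // prefetch the next batch
  }

  // Returns the next record, or null once the source is exhausted.
  T next() throws Exception {
    if (cursor >= workingBuffer.size()) {
      workingBuffer = backupBuffer.get();             // backup becomes working
      cursor = 0;
      if (workingBuffer.isEmpty()) {                  // nothing left to read
        executor.shutdown();
        return null;
      }
      backupBuffer = executor.submit(this::fill);     // refill old buffer async
    }
    return workingBuffer.get(cursor++);
  }

  private List<T> fill() {
    List<T> batch = new ArrayList<>(batchSize);
    while (batch.size() < batchSize && source.hasNext()) {
      batch.add(source.next());
    }
    return batch;
  }
}
```

In the PR itself, the prefetched records are the converted rows coming out of the compaction query's raw result iterator (see the code comment further down).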

Two parameters are involved in this feature:

  1. carbon.detail.batch.size: This is an existing parameter with a default
    value of 100. It controls the batch size of records returned to the
    client. For a normal query it is fine to keep it at 100, but for
    compaction, since every record will be processed, we suggest setting it
    to a larger value such as 32000 (the maximum number of rows in a table
    page that the downstream expects).

  2. carbon.compaction.prefetch.enable: This is a new parameter with a
    default value of false (we may change it to true later). It controls
    whether records are prefetched for compaction. A configuration sketch
    follows this list.
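
As a rough usage sketch (not part of the PR), the two settings could be applied programmatically through carbondata-core's CarbonProperties helper before triggering compaction; the keys come from the description above, the values are only examples, and the same keys can equally be set in the carbon.properties file.

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class CompactionPrefetchConfig {
  public static void main(String[] args) {
    // Example values only: turn on compaction prefetch and enlarge the
    // batch size so a whole 32000-row table page can be handed over at once.
    CarbonProperties carbonProperties = CarbonProperties.getInstance();
    carbonProperties.addProperty("carbon.compaction.prefetch.enable", "true");
    carbonProperties.addProperty("carbon.detail.batch.size", "32000");
  }
}
```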

By using this prefetch feature, we can improve compaction performance.
More test results can be found in the PR description.

Test1

3 Huawei ECS instances as workers, each with 16 cores and 32 GB. Spark executors use 12 cores and 24 GB. Using the 74 GB LineItem table from 100 GB TPCH.

| Code Branch | Inverted Index | Prefetch | Batch Size (default 100) | Load1/2/3 (s) | Compact 3 Loads (s) | Time Reduced | Perf Enhanced |
|---|---|---|---|---|---|---|---|
| master@20181107 | TRUE | NA | 100 | 447.4/445.9/450.1 | 661.3 | Baseline | Baseline |
| master@20181107 | TRUE | NA | 32000 | 441.5/454.4/456.8 | 641.2 | 3.0% | 3.1% |
| PR2906@20181107 | TRUE | enable | 100 | 445.3/450.2/445.3 | 411.8 | 37.7% | 60.6% |
| PR2906@20181107 | TRUE | enable | 32000 | 438.7/446.8/441.8 | 333.1 | 49.6% | 98.5% |
| PR2906@20181107 | TRUE | disable | 100 | 458.1/459.4/450.9 | 659.5 | 0.3% | 0.3% |
| PR2906@20181107 | TRUE | disable | 32000 | 472.0/446.8/457.1 | 654.5 | 1.0% | 1.0% |
| master@20181120 | FALSE | NA | 100 | 427.4/440.0/432.0 | 596.5 | 9.8% | 10.9% |
| master@20181120 | FALSE | NA | 32000 | 431.1/430.7/433.6 | 603.1 | 8.8% | 9.7% |
| PR2906@20181120 | FALSE | enable | 100 | 436.9/430.8/424.3 | 386.7 | 41.5% | 71.0% |
| PR2906@20181120 | FALSE | enable | 32000 | 443.7/432.9/439.7 | 306.0 | 53.7% | 116.1% |
| PR2906@20181120 | FALSE | disable | 100 | 448.1/437.1/436.4 | 611.0 | 7.6% | 8.2% |
| PR2906@20181120 | FALSE | disable | 32000 | 437.5/431.7/448.1 | 597.2 | 9.7% | 10.7% |

Test2

1 Huawei RH2288 with 32 cores and 128 GB. Spark executors use 30 cores and 90 GB. Using the 7.3 GB LineItem table from 10 GB TPCH.

| Code Branch | Prefetch | Batch Size (default 100) | Load1 (s) | Load2 (s) | Load3 (s) | Compact 3 Loads (s) | Time Reduced | Perf Enhanced |
|---|---|---|---|---|---|---|---|---|
| master | NA | 100 | 147.4 | 142.3 | 144.6 | 201.4 | Baseline | Baseline |
| master | NA | 32000 | 140.8 | 138.7 | 141.6 | 196.2 | 2.6% | 2.7% |
| PR2906 | enable | 100 | 143.9 | 142.5 | 146.2 | 99.9 | 50.4% | 101.6% |
| PR2906 | enable | 32000 | 142.1 | 139.3 | 136.9 | 98.3 | 51.2% | 104.9% |
| PR2906 | disable | 100 | 146.7 | 137.4 | 139.6 | 200.6 | 0.4% | 0.4% |
| PR2906 | disable | 32000 | 145.2 | 145.0 | 139.7 | 195.7 | 2.8% | 2.9% |

Note:
Prefetch: carbon.compaction.prefetch.enable
Batch Size: carbon.detail.batch.size

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xuchuanyin (Contributor, Author):

Please note that this PR is essentially the change from PR #2133, plus melding the convertRow step into the backup-buffer filling procedure.

// Convert each raw row while filling the backup buffer, so the downstream receives already-converted rows.
List<Object[]> converted = new ArrayList<>();
if (detailRawQueryResultIterator.hasNext()) {
  for (Object[] r : detailRawQueryResultIterator.next().getRows()) {
    converted.add(convertRow(r));
  }
}

FYI: This is the key difference from PR #2133.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1325/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1536/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9584/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1326/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9585/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1537/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1328/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1539/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9587/

@manishgupta88 (Contributor):

LGTM

asfgit closed this in 51b10ba on Nov 21, 2018
asfgit pushed a commit that referenced this pull request Nov 21, 2018

This closes #2906