
[CARBONDATA-2836]Fixed data loading performance issue #2611

Closed

Conversation

@kumarvishal09 (Contributor) commented Aug 6, 2018

Problem: Data loading takes more time when the number of records is high (3.5 billion records).

Root Cause: During the final merge, sort temp row conversion is done in the main thread, which slows down the final step.

Solution: Move the conversion logic to the pre-fetch thread so it runs in parallel. This is done only for the final (single) merge; the intermediate merges do not need to convert the no-sort columns.
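
A minimal sketch of the prefetch-and-convert idea (illustrative only: the class, queue size, and method names below are hypothetical, not the actual CarbonData sort/merge code):

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not a CarbonData class.
final class PrefetchingConverter<R, C> implements AutoCloseable {

  private final BlockingQueue<C> converted = new ArrayBlockingQueue<>(4096);
  private final ExecutorService prefetchPool = Executors.newSingleThreadExecutor();

  PrefetchingConverter(Iterator<R> sortTempRows, Function<R, C> convert) {
    // The pre-fetch thread reads sort temp rows and converts them ahead of time,
    // so the main (final merge) thread only consumes already-converted rows.
    prefetchPool.submit(() -> {
      try {
        while (sortTempRows.hasNext()) {
          converted.put(convert.apply(sortTempRows.next()));
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
  }

  // Called from the main merge thread; blocks until a converted row is available.
  // End-of-stream signalling (e.g. a sentinel row) is omitted for brevity.
  C nextConvertedRow() throws InterruptedException {
    return converted.take();
  }

  @Override
  public void close() {
    prefetchPool.shutdownNow();
  }
}
```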

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Tested with 3.5 billion records ...
    Older time: 4.7 hours
    New time: 4.3 hours
    Total improvement: 24 minutes

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7801/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6525/

@ravipesala (Contributor)

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6180/

@ravipesala (Contributor)

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6181/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7802/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6526/

@xuchuanyin (Contributor) commented Aug 7, 2018

@kumarvishal09 can you explain this modification?
In the previous implementation, we split a record into 'dict-sort', 'no-dict-sort' and 'noSortDims & measures' parts. The 'noSortDims & measures' part is packed into bytes to avoid serialization/deserialization while reading/writing records to the sort temp files.

That previous implementation gave about an 8% improvement in data loading.
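
For reference, a hedged sketch of the row split being described (the three field names mirror the diff excerpt below; the class itself and the helper method are illustrative, not the actual CarbonData row class):

```java
import java.nio.ByteBuffer;

// Illustrative only; not the real CarbonData sort temp row class.
final class SortTempRowLayout {
  // Sort columns stay in directly comparable form so the merge can order rows.
  int[] dictSortDims;        // dictionary-encoded sort dimensions
  byte[][] noDictSortDims;   // non-dictionary sort dimensions

  // Non-sort dimensions and measures are packed into a single byte array, so
  // they can be written to and read from sort temp files as one blob without
  // per-field serialization/deserialization.
  byte[] noSortDimsAndMeasures;

  // Example: reading one double measure back out of the packed blob; the
  // offset would come from the table schema in real code.
  double readDoubleMeasure(int offsetInBlob) {
    return ByteBuffer.wrap(noSortDimsAndMeasures).getDouble(offsetInBlob);
  }
}
```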

@kumarvishal09 changed the title from "[WIP]Fixed data loading performance issue" to "[CARBONDATA-2836]Fixed data loading performance issue" on Aug 7, 2018
@kumarvishal09 (Contributor, Author)

@xuchuanyin Please check description

@@ -31,6 +25,7 @@
private int[] dictSortDims;
private byte[][] noDictSortDims;
private byte[] noSortDimsAndMeasures;
private Object[] measures;

Add a comment explaining why it is needed

sortTempRow.unpackNoSortFromBytes(dictNoSortDims, noDictNoSortAndVarcharComplexDims, measures,
this.dataTypes, this.varcharDimCnt, this.complexDimCnt);
/**
* Read intermediate sort temp row from InputStream.

Update the comment

@ravipesala (Contributor)

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6186/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7809/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6533/

@ravipesala (Contributor)

LGTM

@asfgit closed this in f27efb3 on Aug 7, 2018
asfgit pushed a commit that referenced this pull request Aug 9, 2018
Problem: Data loading takes more time when the number of records is high (3.5 billion records).

Root Cause: During the final merge, sort temp row conversion is done in the main thread, which slows down the final step.

Solution: Move the conversion logic to the pre-fetch thread so it runs in parallel. This is done only for the final (single) merge; the intermediate merges do not need to convert the no-sort columns.

This closes #2611