
[CARBONDATA-2836]Fixed data loading performance issue #2611

Closed

Conversation

@kumarvishal09 (Contributor) commented Aug 6, 2018

Problem: Data loading takes more time when the number of records is high (3.5 billion records).

Root Cause: During the final merge, sort temp row conversion is done in the main thread, which slows down the final step.

Solution: Move the conversion logic to the pre-fetch thread so it runs in parallel. This is done only for the final (single) merge; the intermediate merges do not need to convert the no-sort columns.
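
A minimal sketch of the prefetch-and-convert idea (illustrative only: the class, queue size, and method names below are hypothetical, not the actual CarbonData sort/merge code):

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not a CarbonData class.
final class PrefetchingConverter<R, C> implements AutoCloseable {

  private final BlockingQueue<C> converted = new ArrayBlockingQueue<>(4096);
  private final ExecutorService prefetchPool = Executors.newSingleThreadExecutor();

  PrefetchingConverter(Iterator<R> sortTempRows, Function<R, C> convert) {
    // The pre-fetch thread reads sort temp rows and converts them ahead of time,
    // so the main (final merge) thread only consumes already-converted rows.
    prefetchPool.submit(() -> {
      try {
        while (sortTempRows.hasNext()) {
          converted.put(convert.apply(sortTempRows.next()));
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
  }

  // Called from the main merge thread; blocks until a converted row is available.
  // End-of-stream signalling (e.g. a sentinel row) is omitted for brevity.
  C nextConvertedRow() throws InterruptedException {
    return converted.take();
  }

  @Override
  public void close() {
    prefetchPool.shutdownNow();
  }
}
```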

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Tested with 3.5 billion records ...
    Older time: 4.7 hours
    New time: 4.3 hours
    Total improvement: 24 minutes

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7801/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6525/

@ravipesala (Contributor)

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6180/

@ravipesala (Contributor)

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6181/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7802/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6526/

@xuchuanyin (Contributor) commented Aug 7, 2018

@kumarvishal09 can you explain this modification?
In the previous implementation, we split a record into 'dict-sort', 'no-dict-sort' and 'noSortDims & measures' parts. The 'noSortDims & measures' part is packed into bytes to avoid serialization/deserialization while reading/writing records to the sort temp files.

That previous implementation gave about an 8% improvement in data loading.
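
For reference, a hedged sketch of the row split being described (the three field names mirror the diff excerpt below; the class itself and the helper method are illustrative, not the actual CarbonData row class):

```java
import java.nio.ByteBuffer;

// Illustrative only; not the real CarbonData sort temp row class.
final class SortTempRowLayout {
  // Sort columns stay in directly comparable form so the merge can order rows.
  int[] dictSortDims;        // dictionary-encoded sort dimensions
  byte[][] noDictSortDims;   // non-dictionary sort dimensions

  // Non-sort dimensions and measures are packed into a single byte array, so
  // they can be written to and read from sort temp files as one blob without
  // per-field serialization/deserialization.
  byte[] noSortDimsAndMeasures;

  // Example: reading one double measure back out of the packed blob; the
  // offset would come from the table schema in real code.
  double readDoubleMeasure(int offsetInBlob) {
    return ByteBuffer.wrap(noSortDimsAndMeasures).getDouble(offsetInBlob);
  }
}
```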

@kumarvishal09 changed the title from "[WIP]Fixed data loading performance issue" to "[CARBONDATA-2836]Fixed data loading performance issue" on Aug 7, 2018
@kumarvishal09 (Contributor, Author)

@xuchuanyin Please check description

@@ -31,6 +25,7 @@
private int[] dictSortDims;
private byte[][] noDictSortDims;
private byte[] noSortDimsAndMeasures;
private Object[] measures;

Add a comment explaining why it is needed

sortTempRow.unpackNoSortFromBytes(dictNoSortDims, noDictNoSortAndVarcharComplexDims, measures,
this.dataTypes, this.varcharDimCnt, this.complexDimCnt);
/**
* Read intermediate sort temp row from InputStream.

Update the comment

@ravipesala (Contributor)

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6186/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7809/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6533/

@ravipesala (Contributor)

LGTM

@asfgit closed this in f27efb3 on Aug 7, 2018
asfgit pushed a commit that referenced this pull request Aug 9, 2018
Problem: Data loading takes more time when the number of records is high (3.5 billion records).

Root Cause: During the final merge, sort temp row conversion is done in the main thread, which slows down the final step.

Solution: Move the conversion logic to the pre-fetch thread so it runs in parallel. This is done only for the final (single) merge; the intermediate merges do not need to convert the no-sort columns.

This closes #2611