[CARBONDATA-3219] Support range partition the input data for local_sort/global sort data loading #2971
Conversation
@@ -305,4 +307,107 @@ object DataLoadProcessorStepOnSpark {
      e)
    }
  }

  def sortAdnWriteFunc(
Please change the method name from sortAdnWriteFunc to sortAndWriteFunc
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1617/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1828/
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9877/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1622/
Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9882/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1833/
      dataFrame: Option[DataFrame],
      model: CarbonLoadModel,
      hadoopConf: Configuration): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
    val originRDD = if (dataFrame.isDefined) {
This method has some of the same code as loadDataUsingGlobalSort; I recommend refactoring these two methods.
It would be better, but after refactoring the code logic is not clear. These two flows already reuse the processing steps.
// Here it assumes the compression ratio of CarbonData is about 33%,
// so it multiplies by 3 to get the split size of the CSV files.
val splitSize = Math.max(blockletSize, (blockSize - blockletSize)) * 3
numPartitions = Math.ceil(totalSize / splitSize).toInt
If the insert uses a DataFrame, I think totalSize will be 0.
yes, insert will use global sort
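To make the estimate above concrete, here is a minimal sketch of the same arithmetic with hypothetical sizes; the MB values are assumptions for illustration, not defaults taken from this PR.

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sizes in MB; the actual configured values may differ.
    val blockletSize = 64.0
    val blockSize = 1024.0
    val totalSize = 10240.0 // total size of the input CSV files

    // CarbonData is assumed to compress to roughly 33% of the CSV size,
    // so the CSV split size is about 3x the CarbonData block size.
    val splitSize = Math.max(blockletSize, blockSize - blockletSize) * 3
    val numPartitions = Math.ceil(totalSize / splitSize).toInt

    // With the sizes above: splitSize = 2880.0, numPartitions = 4
    println(s"splitSize = $splitSize, numPartitions = $numPartitions")
  }
}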
@@ -188,6 +188,8 @@
    optionsFinal.put(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
        Maps.getOrDefault(options, CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
            CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB_DEFAULT));

    optionsFinal.put("range_column", Maps.getOrDefault(options, "range_column", null));
Does makeCreateTableString of CarbonDataFrameWriter need to add "range_column"?
Currently it only supports the LOAD DATA command.
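For context, a minimal sketch of what a load with the new option can look like, based on the usage described in the commit message; the table name, path, and column name are placeholders, and a CarbonData-enabled SparkSession is assumed to already be configured.

import org.apache.spark.sql.SparkSession

object RangeColumnLoadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RangeColumnLoadExample")
      .getOrCreate()

    // Fixed number of range partitions when the input size is known.
    spark.sql(
      "LOAD DATA INPATH '/tmp/sales.csv' INTO TABLE sales " +
        "OPTIONS('RANGE_COLUMN'='order_id', 'global_sort_partitions'='10')")

    // Size-based partition count when the input size is unknown.
    spark.sql(
      "LOAD DATA INPATH '/tmp/sales.csv' INTO TABLE sales " +
        "OPTIONS('RANGE_COLUMN'='order_id', 'scale_factor'='10')")

    spark.stop()
  }
}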
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1627/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1838/
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9887/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1630/
Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9890/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1841/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1668/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9928/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1880/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1680/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1892/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9940/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1698/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1908/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9958/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1943/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10196/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2152/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10396/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2366/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2153/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10407/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2157/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10411/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2370/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2158/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2371/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10412/
@ravipesala
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2159/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2372/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10413/
@ravipesala @kumarvishal09
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2164/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2377/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10418/
@QiangCai My question is how the user benefits from choosing a different range column for each load. I feel the range column should be at the table level, not at the load level.
@ravipesala
@QiangCai We should restrict changing that property from table properties.
@ravipesala
LGTM @QiangCai I feel it is better to keep it in table properties, as it is not supposed to change for each load. We can discuss further and raise another PR if needed; I am merging this now. Thanks for working on it.
[CARBONDATA-3219] Support range partition the input data for local_sort/global sort data loading

For a global_sort/local_sort table, the LOAD DATA command adds a RANGE_COLUMN option:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>')

When we know the total size of the input data, we can calculate the number of partitions:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>', 'global_sort_partitions'='10')

When we don't know the total size of the input data, we can give the size of each partition:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>', 'scale_factor'='10')

It will calculate the number of partitions as follows:

splitSize = Math.max(blocklet_size, (block_size - blocklet_size)) * scale_factor
numPartitions = Math.ceil(total_size / splitSize)

Limitations:
- does not support INSERT INTO, supports only the LOAD DATA command
- does not support multiple range columns, supports only one range column
- data skew may exist

This closes #2971
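For readers unfamiliar with the underlying idea, here is a minimal, self-contained sketch of range partitioning using Spark's stock RangePartitioner; it illustrates the general technique only and is not the implementation in this PR.

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangePartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RangePartitionSketch").setMaster("local[2]"))

    // Hypothetical input rows keyed by the range column value.
    val rows = sc.parallelize(Seq((7, "row7"), (1, "row1"), (5, "row5"), (3, "row3")))

    // RangePartitioner samples the keys and splits them into numPartitions ranges,
    // so each range can then be sorted and written independently
    // (a local sort per range instead of one global sort).
    val numPartitions = 2
    val partitioned = rows.partitionBy(new RangePartitioner(numPartitions, rows))

    partitioned.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.map(_._1).sorted.mkString(", ")}")
    }

    sc.stop()
  }
}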