
[CARBONDATA-3219] Support range partition the input data for local_sort/global sort data loading #2971

Closed
wants to merge 2 commits

Conversation

QiangCai (Contributor) commented Dec 3, 2018

For a global_sort/local_sort table, the load data command adds a RANGE_COLUMN option:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>')

  1. When we know the total size of the input data, we can specify the number of partitions directly:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>', 'global_sort_partitions'='10')

  2. When we don't know the total size of the input data, we can give the size of each partition instead:

load data inpath '<path>' into table <table name>
options('RANGE_COLUMN'='<a column>', 'scale_factor'='10')

With scale_factor, it calculates the number of partitions as follows:

splitSize = Math.max(blocklet_size, (block_size - blocklet_size)) * scale_factor
numPartitions = Math.ceil(total_size / splitSize)
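
As a worked illustration with assumed values (not defaults from this PR): blocklet_size = 64 MB, block_size = 1024 MB, scale_factor = 10, and 100 GB (102400 MB) of input data give

splitSize = Math.max(64, 1024 - 64) * 10 = 9600 (MB)
numPartitions = Math.ceil(102400 / 9600) = 11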

Limitations:

  1. Only the load data command is supported, not insert into.
  2. Only a single range column is supported, not multiple range columns.
  3. Data skew may still exist.

@@ -305,4 +307,107 @@ object DataLoadProcessorStepOnSpark {
e)
}
}

def sortAdnWriteFunc(
Contributor

Please change the method name from sortAdnWriteFunc to sortAndWriteFunc

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1617/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1828/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9877/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1622/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9882/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1833/

dataFrame: Option[DataFrame],
model: CarbonLoadModel,
hadoopConf: Configuration): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
val originRDD = if (dataFrame.isDefined) {
Contributor

This method shares some code with loadDataUsingGlobalSort; I recommend refactoring these two methods.

Contributor Author

That would be better, but after refactoring the code logic would not be clear. These two flows already reuse the processing steps.

// here it assumes the compression ratio of CarbonData is about 33%,
// so it multiplies by 3 to get the split size of the CSV files.
val splitSize = Math.max(blockletSize, (blockSize - blockletSize)) * 3
numPartitions = Math.ceil(totalSize / splitSize).toInt
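
As a numeric illustration of this heuristic, with the same assumed sizes as above (blockletSize = 64 MB, blockSize = 1024 MB, not values from this PR): since CarbonData output is assumed to be about one third of the CSV input, each partition covers roughly three blocks' worth of CSV:

splitSize = Math.max(64, 1024 - 64) * 3 = 2880 (MB of CSV per partition)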
Contributor

If inserting using a DataFrame, I think totalSize will be 0.

Contributor Author

Yes, insert will use global sort.

@@ -188,6 +188,8 @@
optionsFinal.put(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
Maps.getOrDefault(options, CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB_DEFAULT));

optionsFinal.put("range_column", Maps.getOrDefault(options, "range_column", null));
Contributor

Does makeCreateTableString of CarbonDataFrameWriter need to add "range_column"?

Contributor Author

For now it only tries to support the load data command.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1627/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1838/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9887/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1630/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9890/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1841/

QiangCai changed the title from "[TEST] Test loading performance of range_sort" to "[TEST] Test loading performance with range_column" on Dec 5, 2018
@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1668/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9928/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1880/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1680/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1892/

@CarbonDataQA

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9940/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1698/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1908/

@CarbonDataQA

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9958/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1943/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10196/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2152/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10396/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2366/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2153/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10407/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2157/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10411/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2370/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2158/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2371/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10412/

QiangCai (Contributor Author) commented Jan 4, 2019

@ravipesala
After compaction, it will become local_sort.
In my opinion, we can use the range column to partition the input data.
This reduces the scope of sorting during data loading, which improves loading performance.
In some cases, it can also improve query performance (like global_sort).
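
To illustrate the idea in Spark terms, here is a minimal sketch under assumed names and row shape, not this PR's actual implementation: range-partition the rows by the range-column value, then sort within each partition, so only per-partition sorts are needed instead of one global sort.

import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

object RangeLoadSketch {
  // Minimal sketch (assumed row shape, not this PR's code): key each row by
  // its range-column value, range-partition by that key, and sort within
  // each partition in a single shuffle.
  def rangePartitionAndSort(
      rows: RDD[(String, Array[String])], // (rangeColumnValue, restOfRow)
      numPartitions: Int): RDD[(String, Array[String])] = {
    // RangePartitioner samples the keys of this load to compute partition
    // boundaries, so every load (segment) gets its own boundaries.
    val partitioner = new RangePartitioner(numPartitions, rows)
    // One shuffle: rows land in their range partition already sorted locally,
    // so no global sort over the whole dataset is needed.
    rows.repartitionAndSortWithinPartitions(partitioner)
  }
}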

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2159/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2372/

@CarbonDataQA

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10413/

QiangCai (Contributor Author) commented Jan 4, 2019

@ravipesala @kumarvishal09
Please review again.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2164/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2377/

@CarbonDataQA

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10418/

ravipesala (Contributor)

@QiangCai My question is how the user benefits from choosing a different range column for each load. I feel the range column should be at the table level, not at the load level.
And regarding compaction: yes, currently after compaction it becomes local sort, but there is a way we can support range-column compaction, similar to how we do compaction for partitions. That work can be done in the future. But if you allow the user to choose the range column for each load, then this type of compaction cannot be done.

QiangCai (Contributor Author) commented Jan 7, 2019

@ravipesala
I agree with you about adding it to the table properties.
Even if it becomes a table property, the user may still be able to change it, right?
Range_column is different from a partitioned table:
for range_column, the range boundaries are different for every segment (as with global_sort);
for a partitioned table, the range boundaries are the same for all segments.
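
For example (illustrative values only): with the range column id, one load might sample boundaries like [100, 500, 900] while the next load samples [80, 420, 1100], since each load computes boundaries from its own data; a partitioned table instead uses the same partition values for every segment.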

ravipesala (Contributor)

@QiangCai We should restrict changing that property in the table properties.
I am just explaining how we can do compaction on the range column; since there are similarities with partitioning, I mentioned it here.
I feel the range boundaries can be recalculated during compaction using the min/max of the range column, and then we can go for a merge sort.

QiangCai (Contributor Author) commented Jan 8, 2019

@ravipesala
In my opinion, it is unnecessary to restrict changing it.
Users will keep the range_column unchanged as much as possible.
So I only added this option to the load command.

ravipesala (Contributor)

LGTM. @QiangCai I feel it is better to keep it in table properties, as it is not supposed to change for each load. We can discuss further and raise another PR if needed; I am merging this now. Thanks for working on it.

asfgit closed this in 45951c7 on Jan 8, 2019
asfgit pushed a commit that referenced this pull request Jan 21, 2019
…rt/global sort data loading

This closes #2971
qiuchenjian pushed a commit to qiuchenjian/carbondata that referenced this pull request Jun 14, 2019
…rt/global sort data loading

This closes apache#2971