[CARBONDATA-3293] Prune datamaps improvement #3126
Conversation
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2572/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2802/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10831/
Force-pushed from cda1dc6 to 9cc9961 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2573/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2803/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10832/
Force-pushed from 9cc9961 to 4bad2e4 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2577/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10836/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2807/
Force-pushed from 4bad2e4 to 0581c50 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2579/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2809/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10838/
Force-pushed from 0581c50 to 2ed4eb0 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2580/
} else if (dataType == DataTypes.INT) {
  getUnsafe()
      .putInt(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
          row.getInt(index));
  runningLength += row.getSizeInBytes(index);
  int sizeInBytes = row.getSizeInBytes(index);
Duplicate code; better to extract it into a method.
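The repeated "write value, advance runningLength" pair flagged here could be folded into one helper per data type. A minimal illustrative sketch, using a ByteBuffer as a stand-in for the real Unsafe-backed memory block; the class and method names are hypothetical, not the PR's actual code:

```java
import java.nio.ByteBuffer;

// Sketch only: ByteBuffer substitutes for the Unsafe-based memory block,
// so the write/advance pattern can be shown without platform internals.
public class FixedWidthWriter {
  private final ByteBuffer block;
  private int runningLength;

  public FixedWidthWriter(int capacity) {
    this.block = ByteBuffer.allocate(capacity);
  }

  // One helper per fixed-width type instead of repeating the
  // put-then-advance pair inline for every branch.
  public void putByte(byte value) {
    block.put(runningLength, value);
    runningLength += Byte.BYTES;
  }

  public void putInt(int value) {
    block.putInt(runningLength, value);
    runningLength += Integer.BYTES;
  }

  public int length() {
    return runningLength;
  }

  public int readInt(int offset) {
    return block.getInt(offset);
  }
}
```

Each fixed-width branch of the switch would then reduce to a single helper call.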
if (data instanceof List) {
  sum += dataPosLength((List<Object>) data);
} else {
  sum += (int) data;
Could this sum reach Integer.MAX_VALUE?
No, it will not reach Integer.MAX_VALUE. I have only modified the existing code; I think it will not cause any issues.
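If the overflow question above ever became a real concern, a defensive variant of the quoted pattern is straightforward: Math.addExact fails fast with an ArithmeticException instead of silently wrapping past Integer.MAX_VALUE. A sketch with an illustrative class name, not code from the PR:

```java
import java.util.List;

// Sketch: the quoted recursive sum over nested position lists, with an
// explicit overflow guard via Math.addExact.
public class DataPosLength {
  @SuppressWarnings("unchecked")
  public static int dataPosLength(List<Object> dataPos) {
    int sum = 0;
    for (Object data : dataPos) {
      if (data instanceof List) {
        // Recurse into nested lists, guarding the accumulated sum.
        sum = Math.addExact(sum, dataPosLength((List<Object>) data));
      } else {
        sum = Math.addExact(sum, (int) data);
      }
    }
    return sum;
  }
}
```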
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10839/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2810/
retest this please
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2581/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2811/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10840/
Map<String, Integer> blockletToRowCountMap = new HashMap<>();
for (Segment segment : segments) {
  List<CoarseGrainDataMap> dataMaps = dataMapFactory.getDataMaps(segment);
  for (CoarseGrainDataMap dataMap : dataMaps) {
Not from all datamaps; it should be only from the default datamap.
 * Prune the data maps for finding the row count. It returns a Map of
 * blockletpath and the row count
 */
Map<String, Integer> pruneRowCount(Segment segment, SegmentProperties segmentProperties,
It is just getting the row count, not pruning anything here, so please rename it accordingly.
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
Did you verify the performance with this? How much has performance improved?
@ravipesala I have attached the performance report in the description; please check.
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
      dataPos.add(sizeInBytes);
Storing this in another list is not a good choice; it will increase the heap size a lot. We need to come up with a better storage layout that allows traversing the row to any position. Maybe storing the positions at the start of the key would be better.
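The layout suggested here, with positions stored at the start of the key rather than in a separate on-heap list, could look roughly like this sketch. All names are hypothetical and the ByteBuffer stands in for the real memory block; this is not the PR's actual implementation:

```java
import java.nio.ByteBuffer;

// Sketch: serialize a row as [offset table][data], so any column is
// reachable by reading its offset, with no whole-row traversal and no
// auxiliary position list on the heap.
public class OffsetPrefixedRow {

  // Writes int columns as: offsets for each column, then the values.
  public static ByteBuffer write(int[] values) {
    int headerBytes = values.length * Integer.BYTES;
    ByteBuffer buf = ByteBuffer.allocate(headerBytes + values.length * Integer.BYTES);
    int dataPos = headerBytes;
    for (int i = 0; i < values.length; i++) {
      buf.putInt(i * Integer.BYTES, dataPos); // offset table entry
      buf.putInt(dataPos, values[i]);         // the value itself
      dataPos += Integer.BYTES;
    }
    return buf;
  }

  // Random access: read the offset, then the value at that offset.
  public static int read(ByteBuffer row, int index) {
    int offset = row.getInt(index * Integer.BYTES);
    return row.getInt(offset);
  }
}
```

With variable-width columns the offset table serves the same purpose: one absolute read of the offset replaces a walk over all preceding columns.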
Force-pushed from 2ed4eb0 to 87ef80f (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2588/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2819/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10848/
@dhatchayani Is this the final PR, or are any changes needed on it?
Changes are needed on this PR and will be raised as a separate PR. I will close this one.
Problem:
(1) Currently, pruning for count(*) is the same as for a select * query: Blocklet and ExtendedBlocklet objects are formed from the DataMapRow, which is unnecessary and time consuming.
(2) Pruning in a select * query spends time in convertToSafeRow(): to get the position of data in an unsafe row, the whole row must be traversed to reach that position.
(3) In filter queries, the DataMapRow is converted to a safe row whether the blocklet is valid or invalid. This conversion is time consuming, and the cost grows with the number of blocklets.
Solution:
(1) We already have the blocklet row count in the DataMapRow itself, so it is enough to just read that count. With this, count(*) query performance can be improved.
(2) Maintain the data length in the DataMapRow as well, so that traversing the whole row can be avoided; with the length we can jump directly to the data position.
(3) Read only the min/max from the DataMapRow, decide whether a scan is required on that blocklet, and only if required convert it to a safe row, if needed.
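Point (3) of the solution, deciding from min/max alone whether a blocklet needs scanning before paying for safe-row conversion, can be sketched as follows for a simple equality filter. The names and types are illustrative, not CarbonData's actual API:

```java
// Sketch: a blocklet whose [min, max] range excludes the filter value
// cannot contain matching rows, so its DataMapRow never needs to be
// converted to a safe row.
public class MinMaxPrune {
  public static boolean scanRequired(int filterValue, int blockletMin, int blockletMax) {
    // Inside the range: the blocklet may match, so a scan (and safe-row
    // conversion) is required. Outside: skip the blocklet entirely.
    return filterValue >= blockletMin && filterValue <= blockletMax;
  }
}
```

The safe-row conversion then happens only for the blocklets that survive this check, instead of for every blocklet in the segment.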
Performance Report:
3 node cluster
Number of cores - 30
Number of segments - 5000
Number of DataMaps per Segment - 150
Total record count - 5000000
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Existing UT
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.