[CARBONDATA-3293] Prune datamaps improvement #3126
Conversation
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2572/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2802/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10831/
Force-pushed from cda1dc6 to 9cc9961 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2573/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2803/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10832/
Force-pushed from 9cc9961 to 4bad2e4 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2577/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10836/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2807/
Force-pushed from 4bad2e4 to 0581c50 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2579/
Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2809/
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10838/
Force-pushed from 0581c50 to 2ed4eb0 (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2580/
} else if (dataType == DataTypes.INT) {
  getUnsafe()
      .putInt(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
          row.getInt(index));
  runningLength += row.getSizeInBytes(index);
  int sizeInBytes = row.getSizeInBytes(index);
Duplicate code; better to extract it into a method.
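The repeated "write value, advance runningLength" pair flagged here could be folded into one helper per data type. A minimal illustrative sketch, using a ByteBuffer as a stand-in for the real Unsafe-backed memory block; the class and method names are hypothetical, not the PR's actual code:

```java
import java.nio.ByteBuffer;

// Sketch only: ByteBuffer substitutes for the Unsafe-based memory block,
// so the write/advance pattern can be shown without platform internals.
public class FixedWidthWriter {
  private final ByteBuffer block;
  private int runningLength;

  public FixedWidthWriter(int capacity) {
    this.block = ByteBuffer.allocate(capacity);
  }

  // One helper per fixed-width type instead of repeating the
  // put-then-advance pair inline for every branch.
  public void putByte(byte value) {
    block.put(runningLength, value);
    runningLength += Byte.BYTES;
  }

  public void putInt(int value) {
    block.putInt(runningLength, value);
    runningLength += Integer.BYTES;
  }

  public int length() {
    return runningLength;
  }

  public int readInt(int offset) {
    return block.getInt(offset);
  }
}
```

Each fixed-width branch of the switch would then reduce to a single helper call.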
if (data instanceof List) {
  sum += dataPosLength((List<Object>) data);
} else {
  sum += (int) data;
Could this sum reach Integer.MAX_VALUE?
No, it will not reach Integer.MAX_VALUE. I have only modified the existing code; I think it will not cause any issues.
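If the overflow question above ever became a real concern, a defensive variant of the quoted pattern is straightforward: Math.addExact fails fast with an ArithmeticException instead of silently wrapping past Integer.MAX_VALUE. A sketch with an illustrative class name, not code from the PR:

```java
import java.util.List;

// Sketch: the quoted recursive sum over nested position lists, with an
// explicit overflow guard via Math.addExact.
public class DataPosLength {
  @SuppressWarnings("unchecked")
  public static int dataPosLength(List<Object> dataPos) {
    int sum = 0;
    for (Object data : dataPos) {
      if (data instanceof List) {
        // Recurse into nested lists, guarding the accumulated sum.
        sum = Math.addExact(sum, dataPosLength((List<Object>) data));
      } else {
        sum = Math.addExact(sum, (int) data);
      }
    }
    return sum;
  }
}
```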
Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10839/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2810/
retest this please
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2581/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2811/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10840/
Map<String, Integer> blockletToRowCountMap = new HashMap<>();
for (Segment segment : segments) {
  List<CoarseGrainDataMap> dataMaps = dataMapFactory.getDataMaps(segment);
  for (CoarseGrainDataMap dataMap : dataMaps) {
Not from all datamaps; it should be only from the default datamap.
 * Prune the data maps for finding the row count. It returns a Map of
 * blockletpath and the row count
 */
Map<String, Integer> pruneRowCount(Segment segment, SegmentProperties segmentProperties,
It is just getting the row count, not pruning anything here, so please rename it accordingly.
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
Did you verify the performance with this? How much has performance improved?
@ravipesala I have attached the performance report in the description; please check.
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
      dataPos.add(sizeInBytes);
Storing this in another list is not a good choice; it will increase the heap size a lot. We need to come up with a better storage layout that allows traversing the row to any position. Maybe storing the positions at the start of the key would be better.
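The layout suggested here, with positions stored at the start of the key rather than in a separate on-heap list, could look roughly like this sketch. All names are hypothetical and the ByteBuffer stands in for the real memory block; this is not the PR's actual implementation:

```java
import java.nio.ByteBuffer;

// Sketch: serialize a row as [offset table][data], so any column is
// reachable by reading its offset, with no whole-row traversal and no
// auxiliary position list on the heap.
public class OffsetPrefixedRow {

  // Writes int columns as: offsets for each column, then the values.
  public static ByteBuffer write(int[] values) {
    int headerBytes = values.length * Integer.BYTES;
    ByteBuffer buf = ByteBuffer.allocate(headerBytes + values.length * Integer.BYTES);
    int dataPos = headerBytes;
    for (int i = 0; i < values.length; i++) {
      buf.putInt(i * Integer.BYTES, dataPos); // offset table entry
      buf.putInt(dataPos, values[i]);         // the value itself
      dataPos += Integer.BYTES;
    }
    return buf;
  }

  // Random access: read the offset, then the value at that offset.
  public static int read(ByteBuffer row, int index) {
    int offset = row.getInt(index * Integer.BYTES);
    return row.getInt(offset);
  }
}
```

With variable-width columns the offset table serves the same purpose: one absolute read of the offset replaces a walk over all preceding columns.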
Force-pushed from 2ed4eb0 to 87ef80f (Compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2588/
Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2819/
Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10848/
@dhatchayani Is this the final PR, or are any changes needed on it?
Changes are needed on this PR and will be raised as a separate PR. I will close this one.
Problem:
(1) Currently, pruning for count(*) is the same as for a select * query: Blocklet and ExtendedBlocklet objects are formed from the DataMapRow, which is unnecessary and time consuming.
(2) Pruning in a select * query spends time in convertToSafeRow(): to get the position of data in an unsafe row, the whole row must be traversed to reach that position.
(3) In filter queries, the DataMapRow is converted to a safe row whether the blocklet is valid or invalid. This conversion is time consuming, and the cost grows with the number of blocklets.
Solution:
(1) We already have the blocklet row count in the DataMapRow itself, so it is enough to just read that count. With this, count(*) query performance can be improved.
(2) Maintain the data length in the DataMapRow as well, so that traversing the whole row can be avoided; with the length we can jump directly to the data position.
(3) Read only the min/max from the DataMapRow, decide whether a scan is required on that blocklet, and only if required convert it to a safe row, if needed.
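Point (3) of the solution, deciding from min/max alone whether a blocklet needs scanning before paying for safe-row conversion, can be sketched as follows for a simple equality filter. The names and types are illustrative, not CarbonData's actual API:

```java
// Sketch: a blocklet whose [min, max] range excludes the filter value
// cannot contain matching rows, so its DataMapRow never needs to be
// converted to a safe row.
public class MinMaxPrune {
  public static boolean scanRequired(int filterValue, int blockletMin, int blockletMax) {
    // Inside the range: the blocklet may match, so a scan (and safe-row
    // conversion) is required. Outside: skip the blocklet entirely.
    return filterValue >= blockletMin && filterValue <= blockletMax;
  }
}
```

The safe-row conversion then happens only for the blocklets that survive this check, instead of for every blocklet in the segment.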
Performance Report:
3 node cluster
Number of cores - 30
Number of segments - 5000
Number of DataMaps per Segment - 150
Total record count - 5000000
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
Existing UT
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.