
[CARBONDATA-3293] Prune datamaps improvement #3126

Closed

Conversation


@dhatchayani (Contributor) commented Feb 15, 2019

Problem:

(1) Currently, pruning for a count(*) query is the same as for a select * query: Blocklet and ExtendedBlocklet objects are built from the DataMapRow, which is unnecessary for counting and is a time-consuming process.

(2) Pruning in a select * query spends time in convertToSafeRow(), which converts the DataMapRow from unsafe to safe format. In an unsafe row, finding the position of a value requires traversing the whole row.

(3) In filter queries, we convert the DataMapRow to a safe row whether the blocklet turns out to be valid or invalid. This conversion grows more expensive as the number of blocklets increases.

Solution:

(1) The blocklet row count is already present in the DataMapRow itself, so it is enough to read just that count. This improves count(*) query performance (see the sketch after this list).

(2) Also maintain the data lengths in the DataMapRow so that traversing the whole row can be avoided; with the lengths we can jump directly to a data position.

(3) Read only the min/max values from the DataMapRow to decide whether a scan is required on that blocklet; only if a scan is required convert the row to a safe row.
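A minimal sketch of the count(*) fast path in point (1), assuming a simplified view of the index row; DataMapRowView, getBlockletPath, and getRowCount are illustrative names, not the actual CarbonData API:

```java
import java.util.HashMap;
import java.util.Map;

final class RowCountPruneSketch {

  /** Hypothetical read-only view of a DataMapRow; only the fields needed here. */
  interface DataMapRowView {
    String getBlockletPath(); // identifies the blocklet this index row describes
    int getRowCount();        // row count already stored in the DataMapRow
  }

  /**
   * For count(*): skip Blocklet/ExtendedBlocklet construction entirely and
   * read the per-blocklet row count straight out of each DataMapRow.
   */
  static Map<String, Integer> rowCountByBlocklet(Iterable<DataMapRowView> rows) {
    Map<String, Integer> counts = new HashMap<>();
    for (DataMapRowView row : rows) {
      // merge() sums counts if multiple index rows describe the same blocklet path
      counts.merge(row.getBlockletPath(), row.getRowCount(), Integer::sum);
    }
    return counts;
  }
}
```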

Performance Report:
3 node cluster
Number of cores - 30
Number of segments - 5000
Number of DataMaps per Segment - 150
Total record count - 5000000

| | count(*) E2E Time (secs) | count(*) Prune Time (secs) | CTAS (parquet) Prune Time (secs) |
| --- | --- | --- | --- |
| Before Fix | 425.097 | 331 | 332 |
| After Fix | 142.107 | 136 | 172 |

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Existing UT

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


```java
} else if (dataType == DataTypes.INT) {
  getUnsafe()
      .putInt(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
          row.getInt(index));
  runningLength += row.getSizeInBytes(index);
  int sizeInBytes = row.getSizeInBytes(index);
```
Contributor:

Duplicate code; better to extract this into a method.
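As a rough illustration of that suggestion, the repeated write/advance/record pattern in each datatype branch could be folded into one helper. This is a hypothetical refactor shape with simplified stand-in types (a ByteBuffer instead of the unsafe memory block), not the PR's actual code:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class PutFixedValueSketch {
  private final ByteBuffer memoryBlock = ByteBuffer.allocate(1024); // stand-in for the unsafe memory block
  private final List<Integer> dataPos = new ArrayList<>();
  private int runningLength = 0;

  /** One helper replaces the per-datatype copies of the same three lines. */
  private void putFixed(byte[] valueBytes) {
    memoryBlock.position(runningLength);
    memoryBlock.put(valueBytes);        // write the fixed-length value
    runningLength += valueBytes.length; // advance the write cursor
    dataPos.add(valueBytes.length);     // record the size, as the diff does
  }

  void putInt(int value) {
    putFixed(ByteBuffer.allocate(Integer.BYTES).putInt(value).array());
  }

  void putByte(byte value) {
    putFixed(new byte[] { value });
  }
}
```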

```java
if (data instanceof List) {
  sum += dataPosLength((List<Object>) data);
} else {
  sum += (int) data;
}
```
Contributor:

Could this sum reach Integer.MAX_VALUE?

Author:

No, it will not reach Integer.MAX_VALUE. I have only modified the existing code, so I do not think it will cause any issues.
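For reference only: if the summed lengths could ever approach the int range, Math.addExact would fail fast instead of silently overflowing. This is a defensive variant of the helper, not a change the PR makes; dataPosLength mirrors the method in the diff above:

```java
import java.util.List;

final class DataPosLengthSketch {
  @SuppressWarnings("unchecked")
  static int dataPosLength(List<Object> dataPos) {
    int sum = 0;
    for (Object data : dataPos) {
      if (data instanceof List) {
        // recurse into nested lists of sizes; addExact throws ArithmeticException on overflow
        sum = Math.addExact(sum, dataPosLength((List<Object>) data));
      } else {
        sum = Math.addExact(sum, (int) data);
      }
    }
    return sum;
  }
}
```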


@dhatchayani (Author):

retest this please


```java
Map<String, Integer> blockletToRowCountMap = new HashMap<>();
for (Segment segment : segments) {
  List<CoarseGrainDataMap> dataMaps = dataMapFactory.getDataMaps(segment);
  for (CoarseGrainDataMap dataMap : dataMaps) {
```
Contributor:

Not from all datamaps; it should be only from the default datamap.
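A sketch of what the reviewer is asking for: query only the default blocklet datamap for row counts instead of iterating every datamap on the segment. The accessors (getDefaultDataMap, getRowCountByBlocklet) are assumed stand-ins, not the real DataMapFactory API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DefaultDataMapRowCountSketch {

  /** Stand-in for a coarse-grain datamap exposing per-blocklet row counts. */
  interface CoarseGrainDataMap {
    Map<String, Integer> getRowCountByBlocklet();
  }

  /** Stand-in for a segment that can hand back only its default datamap. */
  interface Segment {
    CoarseGrainDataMap getDefaultDataMap();
  }

  static Map<String, Integer> rowCountFromDefaultDataMap(List<Segment> segments) {
    Map<String, Integer> blockletToRowCountMap = new HashMap<>();
    for (Segment segment : segments) {
      // only the default datamap; other datamaps would repeat the same counts
      blockletToRowCountMap.putAll(segment.getDefaultDataMap().getRowCountByBlocklet());
    }
    return blockletToRowCountMap;
  }
}
```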

```java
 * Prune the data maps for finding the row count. It returns a Map of
 * blocklet path and the row count
 */
Map<String, Integer> pruneRowCount(Segment segment, SegmentProperties segmentProperties,
```
Contributor:

It is just getting the row count, not pruning anything here, so please rename it accordingly.

```java
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
```
Contributor:

Did you verify the performance with this? How much did performance improve?

Author:

@ravipesala I have attached the performance report in the description; please check.

```java
switch (schema.getSchemaType()) {
  case FIXED:
    DataType dataType = schema.getDataType();
    if (dataType == DataTypes.BYTE) {
      getUnsafe()
          .putByte(memoryBlock.getBaseObject(), memoryBlock.getBaseOffset() + runningLength,
              row.getByte(index));
      runningLength += row.getSizeInBytes(index);
      int sizeInBytes = row.getSizeInBytes(index);
      dataPos.add(sizeInBytes);
```
Contributor:

Storing the sizes in another list is not a good choice; it will increase the heap size a lot. We need a better storage layout that lets us jump to any position in the row. Storing the positions at the start of the key may be better.
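A sketch of the layout the reviewer suggests, assuming a simplified serialization (a plain ByteBuffer and byte[] columns, not CarbonData's actual row format): write an int-offset header at the front of the row so any column position is an O(1) lookup, with no separate on-heap list:

```java
import java.nio.ByteBuffer;

final class OffsetHeaderLayoutSketch {

  /** Serialize variable-length columns with a fixed-size offset header in front. */
  static ByteBuffer write(byte[][] columns) {
    int headerLen = columns.length * Integer.BYTES;
    int dataLen = 0;
    for (byte[] col : columns) {
      dataLen += col.length;
    }
    ByteBuffer buf = ByteBuffer.allocate(headerLen + dataLen);
    int offset = headerLen;
    for (byte[] col : columns) {
      buf.putInt(offset); // header slot i holds the start offset of column i
      offset += col.length;
    }
    for (byte[] col : columns) {
      buf.put(col);       // column bytes follow the header contiguously
    }
    buf.flip();
    return buf;
  }

  /** Jump straight to column i: read its offset from the header, no traversal. */
  static int columnOffset(ByteBuffer buf, int i) {
    return buf.getInt(i * Integer.BYTES);
  }
}
```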

@dhatchayani changed the title from [WIP][CARBONDATA-3293] Prune datamaps improvement to [CARBONDATA-3293] Prune datamaps improvement on Feb 19, 2019

@ravipesala (Contributor):

@dhatchayani Is this the final PR, or are changes still needed on it?

@dhatchayani (Author):

Changes are needed on this PR and will be raised as a separate PR; I will close this one.
