HBASE-24859: Optimize in-memory representation of HBase map reduce table splits#2591
Conversation
8c991e8 to
38c3384
Compare
|
💔 -1 overall
This message was automatically generated. |
38c3384 to
688aaa5
Compare
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| List<InputSplit> splits = new ArrayList<>(1); | ||
| long regionSize = sizeCalculator.getRegionSize(regLoc.getRegionInfo().getRegionName()); | ||
| TableSplit split = new TableSplit(tableName, scan, | ||
| TableSplit split = new TableSplit(tableName, |
There was a problem hiding this comment.
So, we do not pass the scan object and then we save a bunch of memory? The TableSplit doesn't carry around the heavy scan instance anymore? Is that it? Thanks.
There was a problem hiding this comment.
@saintstack that is correct! If you see the jira for description, there is a heap dump screenshots which shows the scan object may occupy much memory in case of tables with large number of regions. This patch just fix the TableInputFormat for single table where we don’t use the scan object from TableSplit since we use it from MR Job conf directly. There should be another patch to fix the similar fix with more code changes for MultiTableInputFormat. I will try to fix that in a separate patch.
saintstack
left a comment
There was a problem hiding this comment.
Thank you. Patch looks good to me but I defer to those who have been following the issue closer than I... Need a +1 from them before merge I'd say.
|
@saintstack I can sign-off on this change, I have some context on the bug, will take a look. |
|
@bharathv thank you |
|
Created a separate jira https://issues.apache.org/jira/browse/HBASE-25226 for optimizing the memory consumption in |
bharathv
left a comment
There was a problem hiding this comment.
Patch seems fine, just a few nits, I think we should consider fixing MultiTableInputFormat..
| public TableSplit(TableName tableName, byte[] startRow, byte[] endRow, | ||
| final String location) { | ||
| final String location, final String encodedRegionName, long length) { | ||
| this(tableName, null, startRow, endRow, location, encodedRegionName, length); |
There was a problem hiding this comment.
Why do we need a new constructor? Just pass null wherever it is not needed?
There was a problem hiding this comment.
@bharathv I think the separate constructor is more intuitive that it is okay to not have scan object? What do you think?
There was a problem hiding this comment.
can't we ignore the scan assignment, just using this.scan = "" anyway, instead of the following:
try {
this.scan =
(null == scan) ? "" : TableMapReduceUtil.convertScanToString(scan);
} catch (IOException e) {
LOG.warn("Failed to convert Scan to String", e);
}
I also prefer not to add a new constructor.
There was a problem hiding this comment.
iff we make sure the scan is useless here
There was a problem hiding this comment.
I think the separate constructor is more intuitive that it is okay to not have scan object? What do you think?
IMHO more constructors is less readability. In this case we have 8 constructors for TableSplit. If I were to use it, I'd confused unless I see the usages of each of these. Either we should have a fluent interface or fewer constructors is what I think. Subjective of course.
There was a problem hiding this comment.
@Reidddddd Scan is not useless for all the input formats of map-reduce, it is for the TableInputFormat.
We do need the scan object for MultiTableInputFormat
b49c196 to
7fef633
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
| public TableSplit(TableName tableName, byte[] startRow, byte[] endRow, | ||
| final String location) { | ||
| final String location, final String encodedRegionName, long length) { | ||
| this(tableName, null, startRow, endRow, location, encodedRegionName, length); |
There was a problem hiding this comment.
can't we ignore the scan assignment, just using this.scan = "" anyway, instead of the following:
try {
this.scan =
(null == scan) ? "" : TableMapReduceUtil.convertScanToString(scan);
} catch (IOException e) {
LOG.warn("Failed to convert Scan to String", e);
}
I also prefer not to add a new constructor.
| public TableSplit(TableName tableName, byte[] startRow, byte[] endRow, | ||
| final String location) { | ||
| final String location, final String encodedRegionName, long length) { | ||
| this(tableName, null, startRow, endRow, location, encodedRegionName, length); |
There was a problem hiding this comment.
iff we make sure the scan is useless here
| public TableSplit(TableName tableName, byte[] startRow, byte[] endRow, | ||
| final String location) { | ||
| final String location, final String encodedRegionName, long length) { | ||
| this(tableName, null, startRow, endRow, location, encodedRegionName, length); |
There was a problem hiding this comment.
I think the separate constructor is more intuitive that it is okay to not have scan object? What do you think?
IMHO more constructors is less readability. In this case we have 8 constructors for TableSplit. If I were to use it, I'd confused unless I see the usages of each of these. Either we should have a fluent interface or fewer constructors is what I think. Subjective of course.
bharathv
left a comment
There was a problem hiding this comment.
Few nits but lgtm, lets see what the QA bot says..
| private String regionLocation; | ||
| private String encodedRegionName = ""; | ||
|
|
||
| /** The scan object may be null but the serialized form of scan is never null |
There was a problem hiding this comment.
nit: new lines missing
/**
* .....
*/
|
💔 -1 overall
This message was automatically generated. |
4514746 to
ff34744
Compare
ff34744 to
244e482
Compare
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
bharathv
left a comment
There was a problem hiding this comment.
@Reidddddd Any more comments on this or is this good to go?
It has been observed that when the table has too many regions, MR jobs consume a lot of memory in the client. This is because we keep the region level information in memory and the memory heavy object is TableSplit because of the Scan object as a part of it.
However, it looks like the TableInputFormat for single table doesn't need to store the scan object in the TableSplit because we do not use it and all the splits are expected to have the exact same scan object. In
TableInputFormatwe use the scan object directly from the MR conf.