Minor improvements to BulkImport.estimateSizes()#3424
Conversation
| for (Entry<KeyExtent,MLong> entry : counts.entrySet()) { | ||
| if (entry.getKey().contains(row)) { | ||
| entry.getValue().l++; | ||
| int start = 0, end = sortedKeyExtents.size() - 1; |
There was a problem hiding this comment.
Why not Collections.binarySearch() on sortedKeyExtents list?
There was a problem hiding this comment.
In order to use Collections.binarySearch() we need to provide a KeyExtent to search for, but in this case we only have the row. So unless I'm overlooking something, I don't think Collections.binarySearch() will work.
There was a problem hiding this comment.
The biggest problem here is you are defeating the entire purpose by creating a new ArrayList. Creating a new collection copies all the entries so at that point you are actually just iterating over the entire collection anyways, and then now have to iterate again later. So this change actually makes things slower and uses more memory than before.
There was a problem hiding this comment.
yea good point. I have reverted the binary search here and removed the comment about it. I kept the change adding the FileSKVIterator to try-with-resources block.
There was a problem hiding this comment.
The comment may have been referring to leveraging the counts TreeMap to do a binary search. The following example illustrates some concepts that may be used to do this. I ran it locally and it prints what I would expect.
static KeyExtent nke(String er, String per) {
return new KeyExtent(TableId.of("1"), er == null ? null : new Text(er), per == null ? null : new Text(per));
}
public static void main(String[] args) {
TreeMap<KeyExtent, String> tm = new TreeMap<>();
tm.put(nke("c",null), "1");
tm.put(nke("f","c"), "2");
tm.put(nke("m","f"), "3");
tm.put(nke("ma","m"), "4");
tm.put(nke(null,"ma"), "5");
System.out.println("1 "+tm.ceilingEntry(nke("b",null)).getValue());
System.out.println("1 "+tm.ceilingEntry(nke("c",null)).getValue());
System.out.println("2 "+tm.ceilingEntry(nke("d",null)).getValue());
System.out.println("2 "+tm.ceilingEntry(nke("f",null)).getValue());
System.out.println("3 "+tm.ceilingEntry(nke("g",null)).getValue());
System.out.println("3 "+tm.ceilingEntry(nke("m",null)).getValue());
System.out.println("4 "+tm.ceilingEntry(nke("m ",null)).getValue());
System.out.println("4 "+tm.ceilingEntry(nke("ma",null)).getValue());
System.out.println("5 "+tm.ceilingEntry(nke("mb",null)).getValue());
System.out.println("5 "+tm.ceilingEntry(nke(null,null)).getValue());
}
Always setting prevEndRow to null for the lookup exent in the example above is something I think is done elsewhere in the Accumulo code, but not sure. I looked at the comparator for KeyExtent and I think its correct, but it would be good to double check that.
In the example above the extents go from -inf to +inf with no gaps and no overlap of the extent. So that example does not capture the complexity of this code. The existing loop in the code handles gaps and overlaps w/o issue because it checks each extent for overlap. I don't know the answer to this, but could we use the tailMap (and maybe headMap) functions of TreeMap to efficiently find the extents that overlap?
There was a problem hiding this comment.
The biggest problem here is you are defeating the entire purpose by creating a new ArrayList.
I don't think creating the ArrayList was necessarily a perf problem because it was done once outside of a loop and the binary search was done many times inside a loop.
@DomGarguilo I unresolved this in case you wanted to look into using tailMap in the TreeMap. Can always resolve this comment if not interested in looking into that.
There was a problem hiding this comment.
Since this is merged, if anything further is done, it should be in a new PR.
ctubbsii
left a comment
There was a problem hiding this comment.
Seems a shame that all the work ended with a mere use of try-with-resources, but sometimes it works out that way. Thanks for looking into this!
This PR addresses a TODO about using a binary search. In addition I moved theFileSKVIterator indexto the try-with-resources block.Edit:
This PR now removes a TODO comment about using binary search (after determining that it is not an improvement) and moves a resource to a try-wth-resources block in
BulkImport.estimateSizes()