Skip to content

Minor improvements to BulkImport.estimateSizes()#3424

Merged
ctubbsii merged 2 commits intoapache:mainfrom
DomGarguilo:bulkImportEstimateSizes
May 26, 2023
Merged

Minor improvements to BulkImport.estimateSizes()#3424
ctubbsii merged 2 commits intoapache:mainfrom
DomGarguilo:bulkImportEstimateSizes

Conversation

@DomGarguilo
Copy link
Member

@DomGarguilo DomGarguilo commented May 23, 2023

This PR addresses a TODO about using a binary search. In addition I moved the FileSKVIterator index to the try-with-resources block.

Edit:
This PR now removes a TODO comment about using binary search (after determining that it is not an improvement) and moves a resource to a try-wth-resources block in BulkImport.estimateSizes()

for (Entry<KeyExtent,MLong> entry : counts.entrySet()) {
if (entry.getKey().contains(row)) {
entry.getValue().l++;
int start = 0, end = sortedKeyExtents.size() - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Collections.binarySearch() on sortedKeyExtents list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to use Collections.binarySearch() we need to provide a KeyExtent to search for, but in this case we only have the row. So unless I'm overlooking something, I don't think Collections.binarySearch() will work.

Copy link
Contributor

@cshannon cshannon May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest problem here is you are defeating the entire purpose by creating a new ArrayList. Creating a new collection copies all the entries so at that point you are actually just iterating over the entire collection anyways, and then now have to iterate again later. So this change actually makes things slower and uses more memory than before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea good point. I have reverted the binary search here and removed the comment about it. I kept the change adding the FileSKVIterator to try-with-resources block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment may have been referring to leveraging the counts TreeMap to do a binary search. The following example illustrates some concepts that may be used to do this. I ran it locally and it prints what I would expect.

    static KeyExtent nke(String er, String per) {
        return new KeyExtent(TableId.of("1"), er == null ? null : new Text(er), per == null ? null : new Text(per));
    }

    public static void main(String[] args) {
        TreeMap<KeyExtent, String> tm = new TreeMap<>();

        tm.put(nke("c",null), "1");
        tm.put(nke("f","c"), "2");
        tm.put(nke("m","f"), "3");
        tm.put(nke("ma","m"), "4");
        tm.put(nke(null,"ma"), "5");

        System.out.println("1 "+tm.ceilingEntry(nke("b",null)).getValue());
        System.out.println("1 "+tm.ceilingEntry(nke("c",null)).getValue());
        System.out.println("2 "+tm.ceilingEntry(nke("d",null)).getValue());
        System.out.println("2 "+tm.ceilingEntry(nke("f",null)).getValue());
        System.out.println("3 "+tm.ceilingEntry(nke("g",null)).getValue());
        System.out.println("3 "+tm.ceilingEntry(nke("m",null)).getValue());
        System.out.println("4 "+tm.ceilingEntry(nke("m ",null)).getValue());
        System.out.println("4 "+tm.ceilingEntry(nke("ma",null)).getValue());
        System.out.println("5 "+tm.ceilingEntry(nke("mb",null)).getValue());
        System.out.println("5 "+tm.ceilingEntry(nke(null,null)).getValue());
    }

Always setting prevEndRow to null for the lookup exent in the example above is something I think is done elsewhere in the Accumulo code, but not sure. I looked at the comparator for KeyExtent and I think its correct, but it would be good to double check that.

In the example above the extents go from -inf to +inf with no gaps and no overlap of the extent. So that example does not capture the complexity of this code. The existing loop in the code handles gaps and overlaps w/o issue because it checks each extent for overlap. I don't know the answer to this, but could we use the tailMap (and maybe headMap) functions of TreeMap to efficiently find the extents that overlap?

Copy link
Contributor

@keith-turner keith-turner May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest problem here is you are defeating the entire purpose by creating a new ArrayList.

I don't think creating the ArrayList was necessarily a perf problem because it was done once outside of a loop and the binary search was done many times inside a loop.

@DomGarguilo I unresolved this in case you wanted to look into using tailMap in the TreeMap. Can always resolve this comment if not interested in looking into that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is merged, if anything further is done, it should be in a new PR.

@DomGarguilo DomGarguilo changed the title Improve performance of BulkImport.estimateSizes() Minor improvements to BulkImport.estimateSizes() May 26, 2023
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a shame that all the work ended with a mere use of try-with-resources, but sometimes it works out that way. Thanks for looking into this!

@ctubbsii ctubbsii merged commit bb48f00 into apache:main May 26, 2023
@DomGarguilo DomGarguilo deleted the bulkImportEstimateSizes branch May 30, 2023 15:10
@ctubbsii ctubbsii added this to the 3.0.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants