Skip to content

Conversation

@keith-turner
Copy link
Contributor

In #3500 it was observed that GC spends a lot of time creating Path objects for each Reference. This changes avoids creating Path objects for the GC case.

Currently have tested these changes without issue against tests matching the pattern *GC*IT and *Garb*IT. Looking into doing some further performance test to see if there is an observable difference.

In apache#3500 it was observed that GC spends a lot of time creating Path
objects for each Reference.  This changes avoids creating Path objects
for the GC case.
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StoredTabletFile is still a somewhat nicer API than going back to using String again (especially when it's a collection of pairs of strings, which gets quite confusing). If the problem is that StoredTabletFile is doing unnecessary validation, can we have a specific kind of TabletFile that is lighter weight, but still more expressive than strings?

@keith-turner
Copy link
Contributor Author

keith-turner commented Jun 15, 2023

Tried running the test mentioned on #3500 accumulo-jmh-test with these changes. Seeing a noticeable difference. Against 2.1.0 saw the following for the benchmark.

Benchmark                                          Mode  Cnt  Score   Error  Units
GarbageCollectorPerformanceIT.benchmarkCandidates  avgt    3  0.012 ± 0.005  ms/op

With these changes and running the test against a locally installed 2.1.1-SNAPSHOT saw the following.

Benchmark                                          Mode  Cnt  Score   Error  Units
GarbageCollectorPerformanceIT.benchmarkCandidates  avgt    3  0.003 ± 0.003  ms/op

@keith-turner
Copy link
Contributor Author

keith-turner commented Jun 15, 2023

The StoredTabletFile is still a somewhat nicer API than going back to using String again (especially when it's a collection of pairs of strings, which gets quite confusing). If the problem is that StoredTabletFile is doing unnecessary validation, can we have a specific kind of TabletFile that is lighter weight, but still more expressive than strings?

I looked into refactoring StoredTabletFile first and decided against that as I could not see an easy way to do it. I think the changes @cshannon has made in 3.0 branch would be a prereq for doing this refactoring.

We could open a follow on issue about making StoredTabletFile do lazy validation for 3.0. When those changes are made in 3.0 these changes could be removed.

@keith-turner keith-turner marked this pull request as ready for review June 15, 2023 21:27
@ctubbsii ctubbsii linked an issue Jun 15, 2023 that may be closed by this pull request
@cshannon
Copy link
Contributor

@keith-turner - I can do a follow on issue to do lazy validation because I agree with @ctubbsii that keeping the new API is a lot nicer than strings so being able to revert these changes in 3.0 would be nice.

@cshannon
Copy link
Contributor

I added a PR accumulo-jmh-test so it can now run on the command line.

Copy link
Contributor

@cshannon cshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, I am working on a PR now for main that will fix this issue inside StoredTabletFile and will be up shortly.

@dlmarion
Copy link
Contributor

At it's core, Path is a URI. The Path constructors perform validation on the input, then create and set the URI. Except, when you use the constructor Path(URI), that does not occur and does not incur the cost of the validation. If we used the Path constructors (and the validation) when writing to the metadata table, but used Path(URI) when reading from the metadata table, then we could still keep the Path object vs going back to String.

@cshannon
Copy link
Contributor

At it's core, Path is a URI. The Path constructors perform validation on the input, then create and set the URI. Except, when you use the constructor Path(URI), that does not occur and does not incur the cost of the validation. If we used the Path constructors (and the validation) when writing to the metadata table, but used Path(URI) when reading from the metadata table, then we could still keep the Path object vs going back to String.

The primary bottleneck does not appear to be he actual creation of the Path(). It turns out it's the call to Path.getParent() that is the main problem which is only done in our validation logic.

@cshannon
Copy link
Contributor

PR to fix this in main is up: #3502

@dlmarion
Copy link
Contributor

At it's core, Path is a URI. The Path constructors perform validation on the input, then create and set the URI. Except, when you use the constructor Path(URI), that does not occur and does not incur the cost of the validation. If we used the Path constructors (and the validation) when writing to the metadata table, but used Path(URI) when reading from the metadata table, then we could still keep the Path object vs going back to String.

The primary bottleneck does not appear to be he actual creation of the Path(). It turns out it's the call to Path.getParent() that is the main problem which is only done in our validation logic.

Right, so we call Path.getParent(), with calls new Path(String, String, String). Here's Path.getParent:

  /**
   * Returns the parent of a path or null if at root.
   * @return the parent of a path or null if at root
   */
  public Path getParent() {
    String path = uri.getPath();
    int lastSlash = path.lastIndexOf('/');
    int start = startPositionWithoutWindowsDrive(path);
    if ((path.length() == start) ||               // empty path
        (lastSlash == start && path.length() == start+1)) { // at root
      return null;
    }
    String parent;
    if (lastSlash==-1) {
      parent = CUR_DIR;
    } else {
      parent = path.substring(0, lastSlash==start?start+1:lastSlash);
    }
    return new Path(uri.getScheme(), uri.getAuthority(), parent);
  }

The issue is the last line of the method, right? Could we not create a URI from (String, String, String) and then call Path(URI)?

@ctubbsii
Copy link
Member

Is the plan to merge this to 2.1, and omit this change when merging to 3.0 (-s ours), in favor of #3502 fixing it there? Or is there something from this that would still be desirable to keep when going forward?

@cshannon
Copy link
Contributor

Is the plan to merge this to 2.1, and omit this change when merging to 3.0 (-s ours), in favor of #3502 fixing it there? Or is there something from this that would still be desirable to keep when going forward?

The only thing to merge forward would be the change to Pair so maybe that should be a separate PR.

@ctubbsii
Copy link
Member

Is the plan to merge this to 2.1, and omit this change when merging to 3.0 (-s ours), in favor of #3502 fixing it there? Or is there something from this that would still be desirable to keep when going forward?

The only thing to merge forward would be the change to Pair so maybe that should be a separate PR.

Or just commit the change to Pair, because it's so trivial. If you want it "reviewed", consider this comment my +1 for the Pair changes.

@keith-turner
Copy link
Contributor Author

I think I will pull the pair changes out as their own commit push merge them and as @ctubbsii mentioned they are reviewed so I will not do another PR.

@keith-turner
Copy link
Contributor Author

The issue is the last line of the method, right? Could we not create a URI from (String, String, String) and then call Path(URI)?

@dlmarion do you think we might be able to optimize the validation instead of avoid it?

@keith-turner
Copy link
Contributor Author

Pushed the pair changes to 2.1 and main in fbb54c3

@dlmarion
Copy link
Contributor

The issue is the last line of the method, right? Could we not create a URI from (String, String, String) and then call Path(URI)?

@dlmarion do you think we might be able to optimize the validation instead of avoid it?

You mean like override getParent() in a subclass, copy the method contents and change the last line to use Path(URI) ?

@dlmarion
Copy link
Contributor

// use Path object to step backwards from the filename through all the parts
this.fileName = metaPath.getName();
ValidationUtil.validateFileName(fileName);
Path tabletDirPath = Objects.requireNonNull(metaPath.getParent(), errorMsg);
Path tableIdPath = Objects.requireNonNull(tabletDirPath.getParent(), errorMsg);
var id = tableIdPath.getName();
Path tablePath = Objects.requireNonNull(tableIdPath.getParent(), errorMsg);
String tpString = "/" + tablePath.getName();
Preconditions.checkArgument(tpString.equals(HDFS_TABLES_DIR), errorMsg);
Path volumePath = Objects.requireNonNull(tablePath.getParent(), errorMsg);
Preconditions.checkArgument(volumePath.toUri().getScheme() != null, errorMsg);
var volume = volumePath.toString();
this.tabletDir = new TabletDirectory(volume, TableId.of(id), tabletDirPath.getName());

Seems like we could just change the validation code to operate on the metaPath constructor argument differently. Maybe instead of calling Path.getParent(), just use metaPath.toURI() and parse the path?

@keith-turner
Copy link
Contributor Author

You mean like override getParent() in a subclass, copy the method contents and change the last line to use Path(URI) ?

I am not sure, I have not looked into optimizing what is there. When I made this PR I was working under the assumption that it was the path object in general that was causing problems. However after making this PR it was discovered it was the way Accumulo using the path object. So I was wondering if we should pursue optimizing the validation instead of avoiding it. If we are going to avoid validating on read in 2.1.2, then I am wondering if we should validate on write in 2.1.2 as proposed in #3504.

@dlmarion
Copy link
Contributor

@keith-turner - see #3509

@dlmarion
Copy link
Contributor

The primary bottleneck does not appear to be he actual creation of the Path(). It turns out it's the call to Path.getParent() that is the main problem which is only done in our validation logic.

I removed the calls to Path.getParent() in #3509 . Given the comment above, do we still need the changes in this and #3502?

@cshannon
Copy link
Contributor

If we merged in #3509 then I would not be opposed to reverting #3502, but I could see an argument either way. Lazy loading and lazy validation helps performance if you don't need to validate but there is always the issue of a mistake being made and you do lose the immediate check on creation so you may cause a runtime exception later. Adding validation to MetadataContraints would help prevent that in #3504 and #3506 but no guarantee future changes wouldn't invalid metadata be passed and not caught until later . It would make the changes in #3504 simpler if we reverted the lazy load as I had to do some extra work to specifically not lazy load to ensure validation.

@keith-turner
Copy link
Contributor Author

Given the comment above, do we still need the changes in this and #3502?

I am hoping we do not. I do not plan on merging this and hope to close it w/o merging if the perf of #3509 is sufficient.

@ctubbsii
Copy link
Member

Closing based on last comment because #3509 is merged and the performance numbers look good on that PR. Can re-open to address this as follow-on, if still needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GC seems to be spending alot of time during Path evaluation

4 participants