
Support stable ~del split points and avoid ~del hotspots #1043

Closed
keith-turner opened this issue Mar 18, 2019 · 18 comments

@keith-turner
Contributor

When a file is deleted, an entry is written to the metadata table of the form ~del<path-to-delete>. This schema causes two problems. First, bulk imports that happened around the same time sort together in the table. If many bulk imports that happened close in time also compact close in time, it can lead to a hotspot in the metadata table, with many tablet servers trying to write to the same tablet. The second problem is that the active splits for the ~del prefix of the metadata table change over time, making it hard to presplit this portion of the metadata table.

One possible solution to the problem is to hash the path and include that in the metadata table like ~del<hex(hash(path))><path>. This would lead to a stable set of split points and avoid hotspots. This assumes nothing cares about the sorted order, which needs to be validated. It's possible the sorted order matters for the directory ~del entries.
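For illustration, here is a minimal, self-contained sketch of what building such a row could look like (the prefix constant, the choice of hash, and the class/method names are assumptions made for the sketch, not the actual Accumulo schema code):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DelRowSketch {
  // Illustrative prefix; the real prefix lives in MetadataSchema.DeletesSection.
  private static final String DEL_PREFIX = "~del";

  // Build ~del<hex(hash(path))><path>. The fixed-width hex hash spreads rows
  // evenly over a stable keyspace, so split points no longer depend on which
  // paths happen to be pending deletion.
  static String deleteRow(String path) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(path.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (int i = 0; i < 4; i++) {
      hex.append(String.format("%02x", digest[i]));
    }
    return DEL_PREFIX + hex + path;
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(deleteRow("hdfs://nn:8020/accumulo/tables/2/t-0001/F0001.rf"));
  }
}

With a fixed-width hash like this, the ~del section could be presplit on, say, the first hex digit of the hash, and those splits would remain useful regardless of when files were bulk imported or compacted.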

If this can be done, upgrade needs to be considered. Below are two possible upgrade designs.

  • Rewrite the metadata table on upgrade and recompute the split points.
  • Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

May make sense to make the same change for the ~blip section of the metadata table also.

@ctubbsii ctubbsii added this to To do in 2.1.0 via automation Jun 14, 2019
@ctubbsii ctubbsii removed the v2.1.0 label Jun 17, 2019
@hkeebler
Contributor

hkeebler commented Jul 25, 2019

I was reviewing what I think is the related code for this ticket. I believe it's the GC that actually does the deletion of the metadata ~del file entries and the files themselves. To address your concern about the sort order: it looks as though the SGC expects the directory to be first in the sort, but before that point the prefix is removed and the data re-sorted, so I don't think it's an issue. The only other side effect is that if memory is low before all the deletion candidates can be retrieved, then the directory may not be in that set, but who cares; everything will still work.

The hash would need to be a known size or delimited from the <path>.

In GarbageCollectionAlgorithm the following piece of code was intriguing, for lack of a better word. I haven't had a chance to really go through and understand its necessity, but holy moly! There may be a very good reason for all it does.

 private String makeRelative(String path, int expectedLen) {
    String relPath = path;

    if (relPath.startsWith("../"))
      relPath = relPath.substring(3);

    while (relPath.endsWith("/"))
      relPath = relPath.substring(0, relPath.length() - 1);

    while (relPath.startsWith("/"))
      relPath = relPath.substring(1);

    String[] tokens = relPath.split("/");

    // handle paths like a//b///c
    boolean containsEmpty = false;
    for (String token : tokens) {
      if (token.equals("")) {
        containsEmpty = true;
        break;
      }
    }

    if (containsEmpty) {
      ArrayList<String> tmp = new ArrayList<>();
      for (String token : tokens) {
        if (!token.equals("")) {
          tmp.add(token);
        }
      }

      tokens = tmp.toArray(new String[tmp.size()]);
    }

    if (tokens.length > 3 && path.contains(":")) {
      if (tokens[tokens.length - 4].equals(ServerConstants.TABLE_DIR)
          && (expectedLen == 0 || expectedLen == 3)) {
        relPath = tokens[tokens.length - 3] + "/" + tokens[tokens.length - 2] + "/"
            + tokens[tokens.length - 1];
      } else if (tokens[tokens.length - 3].equals(ServerConstants.TABLE_DIR)
          && (expectedLen == 0 || expectedLen == 2)) {
        relPath = tokens[tokens.length - 2] + "/" + tokens[tokens.length - 1];
      } else {
        throw new IllegalArgumentException(path);
      }
    } else if (tokens.length == 3 && (expectedLen == 0 || expectedLen == 3)
        && !path.contains(":")) {
      relPath = tokens[0] + "/" + tokens[1] + "/" + tokens[2];
    } else if (tokens.length == 2 && (expectedLen == 0 || expectedLen == 2)
        && !path.contains(":")) {
      relPath = tokens[0] + "/" + tokens[1];
    } else {
      throw new IllegalArgumentException(path);
    }

    return relPath;
  }

@ctubbsii
Member

@hkeebler When you post code blocks in GitHub issues/PRs, you can get nice syntax highlighting by specifying the language. For example, use: ```java to open a block instead of ```

@hkeebler
Contributor

@ctubbsii thanks - let me edit it and see.

@hkeebler
Contributor

you did it. thanks.

@ctubbsii
Member

As for the makeRelative code... it's a bit tricky. What we're doing here is normalizing the paths to effectively trim out the volume information, so that we don't unintentionally delete a file that is still in use. We want to detect that a file is still in use even if it looks like it's on a different volume, which can happen in many weird scenarios (such as the hostname of the HDFS volume changing or being remapped).

At one point, we assumed all files were on a single HDFS volume, so all paths in the metadata were "relative" to some root volume (specified by instance.dfs.uri, but defaulting to Hadoop's fs.defaultFS) and directory (specified by instance.dfs.dir, defaulting to /accumulo). Old metadata can still be written that way, but new data should be written with the full volume and absolute path to the file in HDFS, because we now support running on top of multiple HDFS volumes (like mounting different disks, say /dev/sda and /dev/sdb). So, we need to store more than just the relative paths.

I hope that helps provide some context for this confusing code. Specifically what it's doing to normalize... I'd have to look at it more carefully, but I think all of it can be ignored for the purposes of this ticket. What matters here is that we have a distinct prefix (to replace the current ~del), and we normalize whatever we have left after we strip off the prefix.
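As a concrete, made-up example of that normalization, based on the makeRelative code quoted above:

// Both of these inputs reduce to the same relative path, "2/t-0001/F0001.rf",
// so a file that is still referenced under a different volume name is not
// mistaken for a delete candidate. The paths and host name are illustrative.
makeRelative("hdfs://namenode-a:8020/accumulo/tables/2/t-0001/F0001.rf", 0);
makeRelative("../2/t-0001/F0001.rf", 0);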

@hkeebler
Contributor

Thanks @ctubbsii for the explanation. It helps explain why the test is using relative paths where I thought absolute ones should be expected. The above-mentioned method throws an IllegalArgumentException if the parsed path doesn't appear as it should. Why would this ever occur? This is controlled/generated data.

@ctubbsii
Member

Why would this ever occur?

IllegalStateException is probably more appropriate, since it isn't user-provided. This can occur if data gets corrupted, or if we have a bug, or if users were performing manual surgery on the metadata table to recover from a failure and made a mistake. It really shouldn't ever happen, but if it does, we probably want the accumulo-gc service to die from the RuntimeException, rather than potentially delete files that it shouldn't.

@hkeebler
Contributor

hkeebler commented Jul 29, 2019

Considering the migration strategy:

If this can be done, upgrade needs to be considered. Below are two possible upgrade designs.

Rewrite the metadata table on upgrade and recompute the split points.
Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

May make sense to make the same change for the ~blip section of the metadata table also.

The first seems desirable because it does not require supporting two formats down the road. How costly is the metadata rewrite? I'm thinking the metadata tables are much smaller than data tables, the ~del entries should be minimal (especially if a GC is done first), and because they are lumped together the cost of finding them is low. Can you turn off compaction and balancing (whatever) during a migration upgrade? For ~blip, can you require that bulk imports are all complete prior to the upgrade?

@keith-turner
Contributor Author

The first seems desirable because it does not require supporting two formats down the road.

I agree it will make future maintenance of the code easier.

Can you turn off compaction and balancing (whatever) during a migration upgrade?

The master process can control what it does. Since the master does migration, balancing, and loading, it could pause those; I can't remember if it does. However, it may not be able to control things done by tablet servers. I opened #1300 to further explore this.

For the ~del markers there is the possibility that the Accumulo GC will temporarily miss a ~del marker, but this is ok because it will be seen later. Just need to make sure the master's move of the ~del markers is idempotent.

The master process will abort if there are any outstanding FATE transactions and an upgrade is needed.

@hkeebler
Contributor

hkeebler commented Aug 7, 2019

Currently one is allowed to create invalid or empty delete entries as demonstrated by the GarbageCollectorIT test:

bw.addMutation(createDelMutation("", "", "", ""));
bw.addMutation(createDelMutation("", "testDel", "test", "valueTest"));
bw.addMutation(createDelMutation("/", "", "", ""));

The flagging of the error does not occur until GarbageCollectionAlgorithm.makeRelative is called, which happens after the prefix is stripped and the mutation is deciphered into a relative path name. Since this ticket is talking about using a hash of the path as part of the key, what should be done with an empty or null path?

  1. Allow it and use a default null hash, something like 00000000, later to be caught by makeRelative?
  2. Throw a runtime exception at creation, in MetadataSchema.DeletesSection.getRowPrefix(path):

     if (path == null || path.isEmpty())
       return section.getRowPrefix() + "00000000";
     // OR
       throw new IllegalArgumentException();

No explicit runtime exceptions are thrown in MetadataSchema at present.

@hkeebler
Contributor

hkeebler commented Aug 7, 2019

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

@ctubbsii
Member

ctubbsii commented Aug 7, 2019

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

Yes. Any code required to upgrade the metadata, or to handle the old delete markers would be in the 2.1 code checked into the master branch.

@ctubbsii
Member

ctubbsii commented Aug 7, 2019

  • Allow it and use a default null hash, something like 00000000, later to be caught by makeRelative?

This seems reasonable to me.

@hkeebler
Contributor

hkeebler commented Aug 8, 2019

Rewrite the metadata table on upgrade and recompute the split points.
Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

These may be combined by utilizing one of the "un-utilized" fields, the column family or column qualifier. These are currently empty, as are the visibility, timestamp, etc. If the new hashed delete also stored the hash in the column family, that could be used to determine the length of the prefix to remove, which would be 0 for an empty column family, i.e. an old delete entry. Something like the following:

// find candidates for deletion; chop off the prefix
for (Entry<Key,Value> entry : scanner) {
  String cand = entry.getKey().getRow().toString()
      // old: .substring(MetadataSchema.DeletesSection.getRowPrefix().length());
      .substring(MetadataSchema.DeletesSection.getRowPrefixLength()
          + entry.getKey().getColumnFamily().toString().length());
  result.add(cand);
}
@keith-turner
Contributor Author

I think it would make sense to fail fast on empty paths, throwing an exception when trying to set them. Anything that consumes GC candidate paths should also fail on empty paths, but in practice it would not see them. I have found it's nice if code that reads persisted data does not assume that it was written in a certain way.

In #1313 I pushed all reading and writing of GC candidates into Ample and moved this out of the GC code. I think it would make sense if Ample added and removed the hashes, so that code outside of Ample is never aware of the hashes and only deals with paths.
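A rough sketch of that division of responsibility (the class, the method names, and the hash format below are hypothetical, not the actual Ample API):

// Hypothetical codec kept inside Ample: callers pass plain paths and never see the hash.
class DeleteCandidateCodec {
  private static final String PREFIX = "~del"; // illustrative

  static String encodeRow(String path) {
    if (path == null || path.isEmpty()) {
      // fail fast on empty paths, as suggested above
      throw new IllegalArgumentException("empty delete candidate path");
    }
    // hashCode() keeps the sketch short; a real implementation would use a
    // stronger, stable hash function
    return PREFIX + String.format("%08x", path.hashCode()) + path;
  }

  static String decodePath(String row) {
    // prefix plus 8 hex characters of hash, then the original path
    return row.substring(PREFIX.length() + 8);
  }
}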

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

The metadata should be rewritten in Upgrader9to10.upgradeMetadata()
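Very roughly, that rewrite could scan the old ~del section and re-insert each candidate under a hashed row, something like the sketch below. This uses the public client API purely for illustration (the real upgrade code would go through the internal metadata APIs), the prefix handling and hash format are assumptions, and it follows the alternate-prefix idea from the issue description so that rerunning the loop stays idempotent.

// Sketch only: assumes old rows are "~del<path>" and new rows are "~deh<8-hex-hash><path>".
// "client" is an already-connected AccumuloClient; imports and error handling are omitted.
try (Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY);
    BatchWriter writer = client.createBatchWriter("accumulo.metadata")) {
  scanner.setRange(new Range("~del", "~dem")); // the old, un-hashed delete section
  for (Map.Entry<Key,Value> entry : scanner) {
    String oldRow = entry.getKey().getRow().toString();
    String path = oldRow.substring("~del".length());
    Mutation add = new Mutation("~deh" + String.format("%08x", path.hashCode()) + path);
    add.put("", "", "");    // re-insert under the hashed row
    Mutation drop = new Mutation(oldRow);
    drop.putDelete("", ""); // remove the old row; rerunning the loop is safe
    writer.addMutation(add);
    writer.addMutation(drop);
  }
}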

@hkeebler
Contributor

hkeebler commented Aug 9, 2019

So we should say that this is dependent on #1313 being completed. I will start taking a look at the upgrade code necessary but what did anyone think of the solution in my previous comment that would not require an upgrade?

  • Pros - no upgrade code to be written, no conditional checking for old vs new deletes, supports a multi-sized hash
  • Cons - a slightly more complex process for stripping the delete prefix, and a key field not really being used for its intended purpose; BUT if the value were used instead (which is also currently ""), then field intent is maintained.

@keith-turner
Contributor Author

what did anyone think of the solution in my previous comment that would not require an upgrade?

I think it's a clever solution, and I think that utilizing the values is a good way to go if we do not resolve this in upgrade. I am leaning towards resolving this in upgrade because I think it keeps the existing code simpler, but I am not 100% sure this is the best path to take.

The makeRelative() function you asked about earlier has some code in it to handle legacy paths that makes it more complicated. We should improve makeRelative() to clearly document parts of it that are handling legacy paths. That code is confusing now because of the legacy support.

I think if you pursue not resolving this in upgrade, it would be best to clearly document and isolate the code that handles legacy delete candidates.

If you are going to pursue the upgrade path, the following is one possible way to do it.

  • First submit a PR that changes the code to use hashes without any concern for upgrade. So it only works with a newly initialized system at first. Just add something that throws an exception in the upgrade code.
  • Second submit a PR for the upgrade code.

I will try to get #1313 in ASAP so you can build on it.

@keith-turner
Contributor Author

Another reason to resolve this in upgrade is that, over time, handling more legacy persisted formats becomes harder and harder to test. For example, assume in the future there are 5 different legacy data formats for the metadata table; will we ever test different combinations of the legacy formats? No, that testing will likely not be done.

Doing a transformation in the upgrade step and testing that transformation independently seems more maintainable. The current code only needs to worry about testing the current format. The testing is much simpler and can better cover the data expected.

@hkeebler hkeebler self-assigned this Aug 22, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Aug 29, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 16, 2019
Fix 1043 Support Stable ~del split points (includes recommended changes and upgrade code)
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 17, 2019
2.1.0 automation moved this from To do to Done Sep 17, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 18, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 19, 2019
updated to add value to all new ~del
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 26, 2019
hkeebler added a commit that referenced this issue Sep 27, 2019
* Fix #1365 2.1 Upgrade processing for #1043 ~del