
Support stable ~del split points and avoid ~del hotspots #1043

Closed
keith-turner opened this issue Mar 18, 2019 · 18 comments

@keith-turner
Contributor

When a file is deleted, an entry is written to the metadata table of the form ~del<path-to-delete>. This schema causes two problems. First, bulk imports that happened around the same time sort together in the table. If many bulk imports that happened close in time also compact close in time, it can lead to a hotspot in the metadata table, with many tablet servers trying to write to the same tablet. The second problem is that the active splits for the ~del prefix of the metadata table change over time, making it hard to presplit this portion of the metadata table.

One possible solution to the problem is to hash the path and include that in the metadata table like ~del<hex(hash(path))><path>. This would lead to a stable set of split points and avoid hotspots. This assumes nothing cares about the sorted order, which needs to be validated. It's possible the sorted order matters for the directory ~del entries.
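For illustration, here is a minimal, self-contained sketch of what building such a row could look like (the prefix constant, the choice of hash, and the class/method names are assumptions made for the sketch, not the actual Accumulo schema code):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DelRowSketch {
  // Illustrative prefix; the real prefix lives in MetadataSchema.DeletesSection.
  private static final String DEL_PREFIX = "~del";

  // Build ~del<hex(hash(path))><path>. The fixed-width hex hash spreads rows
  // evenly over a stable keyspace, so split points no longer depend on which
  // paths happen to be pending deletion.
  static String deleteRow(String path) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(path.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (int i = 0; i < 4; i++) {
      hex.append(String.format("%02x", digest[i]));
    }
    return DEL_PREFIX + hex + path;
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(deleteRow("hdfs://nn:8020/accumulo/tables/2/t-0001/F0001.rf"));
  }
}

With a fixed-width hash like this, the ~del section could be presplit on, say, the first hex digit of the hash, and those splits would remain useful regardless of when files were bulk imported or compacted.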

If this can be done, upgrade needs to be considered. Below are two possible upgrade designs.

  • Rewrite the metadata table on upgrade and recompute the split points.
  • Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

May make sense to make the same change for the ~blip section of the metadata table also.

@ctubbsii ctubbsii added this to To do in 2.1.0 via automation Jun 14, 2019
@ctubbsii ctubbsii removed the v2.1.0 label Jun 17, 2019
@hkeebler
Contributor

hkeebler commented Jul 25, 2019

I was reviewing what I think is the related code for this ticket. I believe it's the GC that actually does the deletion of the metadata ~del file entries and the files themselves. To address your concern about the sort order: it looks as though the SGC expects the directory to be first in the sort, but before that point the prefix is removed and the data re-sorted, so I don't think it's an issue. The only other side effect is that if memory is low before all the deletion candidates can be retrieved, then the directory may not be in that set, but who cares; everything will still work.

The hash would need to be a known size or delimited from the <path>.

In GarbageCollectionAlgorithm the following piece of code was intriguing, for lack of a better word. I haven't had a chance to really go through and understand its necessity, but holy moly! There may be a very good reason for all it does.

 private String makeRelative(String path, int expectedLen) {
    String relPath = path;

    if (relPath.startsWith("../"))
      relPath = relPath.substring(3);

    while (relPath.endsWith("/"))
      relPath = relPath.substring(0, relPath.length() - 1);

    while (relPath.startsWith("/"))
      relPath = relPath.substring(1);

    String[] tokens = relPath.split("/");

    // handle paths like a//b///c
    boolean containsEmpty = false;
    for (String token : tokens) {
      if (token.equals("")) {
        containsEmpty = true;
        break;
      }
    }

    if (containsEmpty) {
      ArrayList<String> tmp = new ArrayList<>();
      for (String token : tokens) {
        if (!token.equals("")) {
          tmp.add(token);
        }
      }

      tokens = tmp.toArray(new String[tmp.size()]);
    }

    if (tokens.length > 3 && path.contains(":")) {
      if (tokens[tokens.length - 4].equals(ServerConstants.TABLE_DIR)
          && (expectedLen == 0 || expectedLen == 3)) {
        relPath = tokens[tokens.length - 3] + "/" + tokens[tokens.length - 2] + "/"
            + tokens[tokens.length - 1];
      } else if (tokens[tokens.length - 3].equals(ServerConstants.TABLE_DIR)
          && (expectedLen == 0 || expectedLen == 2)) {
        relPath = tokens[tokens.length - 2] + "/" + tokens[tokens.length - 1];
      } else {
        throw new IllegalArgumentException(path);
      }
    } else if (tokens.length == 3 && (expectedLen == 0 || expectedLen == 3)
        && !path.contains(":")) {
      relPath = tokens[0] + "/" + tokens[1] + "/" + tokens[2];
    } else if (tokens.length == 2 && (expectedLen == 0 || expectedLen == 2)
        && !path.contains(":")) {
      relPath = tokens[0] + "/" + tokens[1];
    } else {
      throw new IllegalArgumentException(path);
    }

    return relPath;
  }

@ctubbsii
Member

@hkeebler When you post code blocks in GitHub issues/PRs, you can get nice syntax highlighting by specifying the language. For example, use: ```java to open a block instead of ```

@hkeebler
Contributor

@ctubbsii thanks - let me edit it and see.

@hkeebler
Contributor

you did it. thanks.

@ctubbsii
Member

As for the makeRelative code... it's a bit tricky. What we're doing here is normalizing the paths to effectively trim out the volume information, so that we don't unintentionally delete a file that is still in use. We want to detect that a file is still in use even if it looks like it's on a different volume, which can happen in many weird scenarios (such as the hostname of the HDFS volume changing or being remapped).

At one point, we assumed all files were on a single HDFS volume, so all paths in the metadata were "relative" to some root volume (specified by instance.dfs.uri, but defaulting to Hadoop's fs.defaultFS) and directory (specified by instance.dfs.dir, defaulting to /accumulo). Old metadata can still be written that way, but new data should be written with the full volume and absolute path to the file in HDFS, because we now support running on top of multiple HDFS volumes (like mounting different disks, say /dev/sda and /dev/sdb). So, we need to store more than just the relative paths.

I hope that helps provide some context for this confusing code. Specifically what it's doing to normalize... I'd have to look at it more carefully, but I think all of it can be ignored for the purposes of this ticket. What matters here is that we have a distinct prefix (to replace the current ~del), and we normalize whatever we have left after we strip off the prefix.
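As a concrete, made-up example of that normalization, based on the makeRelative code quoted above:

// Both of these inputs reduce to the same relative path, "2/t-0001/F0001.rf",
// so a file that is still referenced under a different volume name is not
// mistaken for a delete candidate. The paths and host name are illustrative.
makeRelative("hdfs://namenode-a:8020/accumulo/tables/2/t-0001/F0001.rf", 0);
makeRelative("../2/t-0001/F0001.rf", 0);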

@hkeebler
Contributor

Thanks @ctubbsii for the explanation. It helps explain why the test is using relative paths where I thought absolute ones should be expected. The above-mentioned method throws an IllegalArgumentException if the parsed path doesn't appear as it should. Why would this ever occur? This is controlled/generated data.

@ctubbsii
Member

Why would this ever occur?

IllegalStateException is probably more appropriate, since it isn't user-provided. This can occur if data gets corrupted, or if we have a bug, or if users were performing manual surgery on the metadata table to recover from a failure and made a mistake. It really shouldn't ever happen, but if it does, we probably want the accumulo-gc service to die from the RuntimeException, rather than potentially delete files that it shouldn't.

@hkeebler
Contributor

hkeebler commented Jul 29, 2019

Considering the migration strategy:

If this can be done, upgrade needs to be considered. Below are two possible upgrade designs.

Rewrite the metadata table on upgrade and recompute the split points.
Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

May make sense to make the same change for the ~blip section of the metadata table also.

The first seems desirable because it does not require supporting two formats down the road. How costly is the metadata rewrite? I'm thinking the metadata tables are much smaller than data tables, the ~del entries should be minimal (especially if a GC is done first), and because they are lumped together the cost of finding them is low. Can you turn off compaction and balancing (whatever) during a migration upgrade? For ~blip, can you require that bulk imports are all complete prior to the upgrade?

@keith-turner
Contributor Author

The first seems desirable because it does not require supporting two formats down the road.

I agree it will make future maintenance of the code easier.

Can you turn off compaction and balancing (whatever) during a migration upgrade?

The master process can control what it does. Since the master does migration, balancing, and loading, it could pause those; I can't remember if it does. However, it may not be able to control things done by tablet servers. I opened #1300 to further explore this.

For the ~del markers there is the possibility that the Accumulo GC will temporarily miss a ~del marker, but this is ok because it will be seen later. Just need to make sure the master's move of the ~del markers is idempotent.

The master process will abort if there are any outstanding FATE transactions and an upgrade is needed.

@hkeebler
Contributor

hkeebler commented Aug 7, 2019

Currently one is allowed to create invalid or empty delete entries as demonstrated by the GarbageCollectorIT test:

bw.addMutation(createDelMutation("", "", "", ""));
bw.addMutation(createDelMutation("", "testDel", "test", "valueTest"));
bw.addMutation(createDelMutation("/", "", "", ""));

The flagging of the error does not occur until GarbageCollectionAlgorithm.makeRelative is called, which happens after the prefix is stripped and the mutation is deciphered into a relative path name. Since this ticket is talking about using a hash of the path as part of the key, what should be done with an empty or null path?

  1. Allow it and use a default null hash, something like 00000000, later to be caught by makeRelative?
  2. Throw a runtime exception at creation, in MetadataSchema.DeletesSection.getRowPrefix(path):

     if (path == null || path.isEmpty())
       return section.getRowPrefix() + "00000000";
     // OR
       throw new IllegalArgumentException();

No explicit runtime exceptions are thrown in MetadataSchema at present.

@hkeebler
Contributor

hkeebler commented Aug 7, 2019

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

@ctubbsii
Member

ctubbsii commented Aug 7, 2019

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

Yes. Any code required to upgrade the metadata, or to handle the old delete markers would be in the 2.1 code checked into the master branch.

@ctubbsii
Member

ctubbsii commented Aug 7, 2019

  • Allow it and use a default null hash, something like 00000000, later to be caught by makeRelative?

This seems reasonable to me.

@hkeebler
Contributor

hkeebler commented Aug 8, 2019

Rewrite the metadata table on upgrade and recompute the split points.
Support del entries with and without the hash. One possible way to do this is with a different prefix like ~deh that indicates a hash is present.

These may be combined by utilizing one of the "un-utilized" fields, the column family or column qualifier. These are currently empty, as are the visibility, timestamp, etc. If the new hashed delete also stored the hash in the column family, that could be used to determine the length of the prefix to remove, which would be 0 for an empty column family, i.e. an old delete entry. Something like the following:

// find candidates for deletion; chop off the prefix
for (Entry<Key,Value> entry : scanner) {
  String cand = entry.getKey().getRow().toString()
      // old: .substring(MetadataSchema.DeletesSection.getRowPrefix().length());
      .substring(MetadataSchema.DeletesSection.getRowPrefixLength()
          + entry.getKey().getColumnFamily().toString().length());
  result.add(cand);
}
@keith-turner
Contributor Author

I think it would make sense to fail fast on empty paths, throwing an exception when trying to set them. Anything that consumes GC candidate paths should also fail on empty paths, but in practice it would not see them. I have found it's nice if code that reads persisted data does not assume that it was written in a certain way.

In #1313 I pushed all reading and writing of GC candidates into Ample and moved this out of the GC code. I think it would make sense if Ample added and removed the hashes, so that code outside of Ample is never aware of the hashes and only deals with paths.
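A rough sketch of that division of responsibility (the class, the method names, and the hash format below are hypothetical, not the actual Ample API):

// Hypothetical codec kept inside Ample: callers pass plain paths and never see the hash.
class DeleteCandidateCodec {
  private static final String PREFIX = "~del"; // illustrative

  static String encodeRow(String path) {
    if (path == null || path.isEmpty()) {
      // fail fast on empty paths, as suggested above
      throw new IllegalArgumentException("empty delete candidate path");
    }
    // hashCode() keeps the sketch short; a real implementation would use a
    // stronger, stable hash function
    return PREFIX + String.format("%08x", path.hashCode()) + path;
  }

  static String decodePath(String row) {
    // prefix plus 8 hex characters of hash, then the original path
    return row.substring(PREFIX.length() + 8);
  }
}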

My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?

The metadata should be rewritten in Upgrader9to10.upgradeMetadata()
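Very roughly, that rewrite could scan the old ~del section and re-insert each candidate under a hashed row, something like the sketch below. This uses the public client API purely for illustration (the real upgrade code would go through the internal metadata APIs), the prefix handling and hash format are assumptions, and it follows the alternate-prefix idea from the issue description so that rerunning the loop stays idempotent.

// Sketch only: assumes old rows are "~del<path>" and new rows are "~deh<8-hex-hash><path>".
// "client" is an already-connected AccumuloClient; imports and error handling are omitted.
try (Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY);
    BatchWriter writer = client.createBatchWriter("accumulo.metadata")) {
  scanner.setRange(new Range("~del", "~dem")); // the old, un-hashed delete section
  for (Map.Entry<Key,Value> entry : scanner) {
    String oldRow = entry.getKey().getRow().toString();
    String path = oldRow.substring("~del".length());
    Mutation add = new Mutation("~deh" + String.format("%08x", path.hashCode()) + path);
    add.put("", "", "");    // re-insert under the hashed row
    Mutation drop = new Mutation(oldRow);
    drop.putDelete("", ""); // remove the old row; rerunning the loop is safe
    writer.addMutation(add);
    writer.addMutation(drop);
  }
}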

@hkeebler
Contributor

hkeebler commented Aug 9, 2019

So we should say that this is dependent on #1313 being completed. I will start taking a look at the upgrade code necessary but what did anyone think of the solution in my previous comment that would not require an upgrade?

  • Pros - no upgrade code to be written, no conditional checking for old vs new deletes, supports a multi-sized hash
  • Cons - a slightly more complex process for stripping the delete prefix, and a key field not really being used for its intended purpose; BUT if the value were used instead (which is also currently ""), then field intent is maintained.

@keith-turner
Contributor Author

what did anyone think of the solution in my previous comment that would not require an upgrade?

I think it's a clever solution, and I think that utilizing the values is a good way to go if we do not resolve this in upgrade. I am leaning towards resolving this in upgrade because I think it keeps the existing code simpler, but I am not 100% sure this is the best path to take.

The makeRelative() function you asked about earlier has some code in it to handle legacy paths that makes it more complicated. We should improve makeRelative() to clearly document parts of it that are handling legacy paths. That code is confusing now because of the legacy support.

I think if you pursue not resolving this in upgrade, it would be best to clearly document and isolate the code that handles legacy delete candidates.

If you are going to pursue the upgrade path, the following is one possible way to do it.

  • First submit a PR that changes the code to use hashes without any concern for upgrade. So it only works with a newly initialized system at first. Just add something that throws an exception in the upgrade code.
  • Second submit a PR for the upgrade code.

I will try to get #1313 in ASAP so you can build on it.

@keith-turner
Contributor Author

Another reason to resolve this in upgrade is that, over time, handling more legacy persisted formats becomes harder and harder to test. For example, assume in the future there are 5 different legacy data formats for the metadata table; will we ever test different combinations of the legacy formats? No, that testing will likely not be done.

Doing a transformation in the upgrade step and testing that transformation independently seems more maintainable. The current code only needs to worry about testing the current format. The testing is much simpler and can better cover the data expected.

@hkeebler hkeebler self-assigned this Aug 22, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Aug 29, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 16, 2019
Fix 1043 Support Stable ~del split points (includes recommended changes and upgrade code)
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 17, 2019
2.1.0 automation moved this from To do to Done Sep 17, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 18, 2019
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 19, 2019
updated to add value to all new ~del
hkeebler added a commit to hkeebler/accumulo that referenced this issue Sep 26, 2019
hkeebler added a commit that referenced this issue Sep 27, 2019
* Fix #1365 2.1 Upgrade processing for #1043 ~del