Support stable ~del split points and avoid ~del hotspots #1043
I was reviewing what I think is the related code set for this ticket. I believe it's the GC that actually deletes the metadata tablet file entries and the files themselves. To address your concern about the sorting difference: it looks as though the SGC expects the directory to be first in the sort, but before that point the prefix is removed and the data re-sorted, so I don't think it's an issue. The only other side effect is that if memory runs low before all the deletion candidates can be retrieved, the directory may not be in that set, but that's harmless; everything will still work. The hash would need to be a known size or delimited from the path.

In GarbageCollectionAlgorithm the following piece of code was intriguing, for lack of a better word. I haven't had a chance to really go through and understand its necessity, but holy moly! There may be a very good reason for all it does.

private String makeRelative(String path, int expectedLen) {
  String relPath = path;

  if (relPath.startsWith("../"))
    relPath = relPath.substring(3);

  while (relPath.endsWith("/"))
    relPath = relPath.substring(0, relPath.length() - 1);

  while (relPath.startsWith("/"))
    relPath = relPath.substring(1);

  String[] tokens = relPath.split("/");

  // handle paths like a//b///c
  boolean containsEmpty = false;
  for (String token : tokens) {
    if (token.equals("")) {
      containsEmpty = true;
      break;
    }
  }

  if (containsEmpty) {
    ArrayList<String> tmp = new ArrayList<>();
    for (String token : tokens) {
      if (!token.equals("")) {
        tmp.add(token);
      }
    }
    tokens = tmp.toArray(new String[tmp.size()]);
  }

  if (tokens.length > 3 && path.contains(":")) {
    if (tokens[tokens.length - 4].equals(ServerConstants.TABLE_DIR)
        && (expectedLen == 0 || expectedLen == 3)) {
      relPath = tokens[tokens.length - 3] + "/" + tokens[tokens.length - 2] + "/"
          + tokens[tokens.length - 1];
    } else if (tokens[tokens.length - 3].equals(ServerConstants.TABLE_DIR)
        && (expectedLen == 0 || expectedLen == 2)) {
      relPath = tokens[tokens.length - 2] + "/" + tokens[tokens.length - 1];
    } else {
      throw new IllegalArgumentException(path);
    }
  } else if (tokens.length == 3 && (expectedLen == 0 || expectedLen == 3)
      && !path.contains(":")) {
    relPath = tokens[0] + "/" + tokens[1] + "/" + tokens[2];
  } else if (tokens.length == 2 && (expectedLen == 0 || expectedLen == 2)
      && !path.contains(":")) {
    relPath = tokens[0] + "/" + tokens[1];
  } else {
    throw new IllegalArgumentException(path);
  }

  return relPath;
}
@hkeebler When you post code blocks in GitHub issues/PRs, you can get nice syntax highlighting by specifying the language on the opening fence. For example, open the block with ```java rather than a bare ```.
@ctubbsii thanks - let me edit it and see.
you did it. thanks. |
As for the makeRelative code... it's a bit tricky. What we're doing here is normalizing the paths to effectively trim out the volume information, so that we don't unintentionally delete a file that is still in use. We want to detect that a file is still in use even if it looks like it's on a different volume, because it could appear to be on a different volume due to many weird scenarios (such as the hostname of the HDFS volume changing or being remapped). At one point, we assumed all files were on a single HDFS volume, so all paths in the metadata were "relative" to some root volume specified in the configuration.

I hope that helps provide some context for this confusing code. As for specifically what it's doing to normalize... I'd have to look at it more carefully, but I think all of it can be ignored for the purposes of this ticket. What matters here is that we have a distinct prefix (to replace the current ~del prefix).
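To illustrate why that normalization matters, here is a simplified sketch (made-up paths and helper, not the actual makeRelative logic): the same file referenced through two different volumes should reduce to one relative form.

public class MakeRelativeSketch {
  // Strips everything up through the tables directory so the same file on
  // different volumes compares equal. Assumes the table dir component is "tables".
  static String makeRelative(String path) {
    int idx = path.indexOf("/tables/");
    if (idx < 0)
      throw new IllegalArgumentException(path);
    return path.substring(idx + "/tables/".length());
  }

  public static void main(String[] args) {
    // Both volume-qualified forms normalize to "2/t-0001/F0000abc.rf"
    System.out.println(makeRelative("hdfs://nn-a:9000/accumulo/tables/2/t-0001/F0000abc.rf"));
    System.out.println(makeRelative("hdfs://nn-b:8020/accumulo/tables/2/t-0001/F0000abc.rf"));
  }
}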
Thanks @ctubbsii for the explanation. It helps explain why the test is using relative paths where I thought absolute ones should be expected. The above-mentioned method throws an IllegalArgumentException if the parsed path doesn't appear as it should. Why would this ever occur? This is controlled/generated data.
Considering the migration strategy: the first option seems desirable because it does not require supporting two formats down the road. How costly is the metadata rewrite? I'm thinking the metadata tables are much smaller than the data tables, and the ~del entries should be minimal, especially if a GC is done first; because they are lumped together, the cost of finding them is low. Can you turn off compaction and balancing (and the like) during a migration upgrade? For ~blip, can you require that bulk imports are all complete prior to the upgrade?
I agree it will make future maintenance of the code easier.
The master process can control what it does. Since the master does migration, balancing, and loading, it could pause those; I can't remember if it does. However, it may not be able to control things done by tablet servers, so I opened #1300 to further explore this. For the ~blip question: the master process will abort if there are any outstanding FATE transactions and an upgrade is needed.
Currently one is allowed to create invalid or empty delete entries, as demonstrated by the GarbageCollectorIT test:

bw.addMutation(createDelMutation("", "", "", ""));
bw.addMutation(createDelMutation("", "testDel", "test", "valueTest"));
bw.addMutation(createDelMutation("/", "", "", ""));

The flagging of the error does not occur until GarbageCollectionAlgorithm.makeRelative is called, which happens after the prefix is stripped and the mutation is deciphered into a relative path name. Since this ticket is talking about using a hash code of the path as part of the key, what should be done with an empty or null path?
if (path == null || path.isEmpty())
  return section.getRowPrefix() + "00000000";

OR

throw new IllegalArgumentException();

No exceptions or explicit runtime exceptions are thrown in MetadataSchema at present.
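For what it's worth, a minimal sketch of the fail-fast option (the method name is illustrative, not existing code):

// Hypothetical guard applied wherever a delete candidate is created, so
// malformed entries never reach the metadata table in the first place.
static String requireValidDeletePath(String path) {
  if (path == null || path.isEmpty())
    throw new IllegalArgumentException("empty delete candidate path");
  return path;
}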
My next question - how is code that requires a migration path managed? Would it be checked into the master baseline for 2.1?
Yes. Any code required to upgrade the metadata, or to handle the old delete markers, would be in the 2.1 code checked into the master branch.
This seems reasonable to me.
These may be combined by utilizing one of the "un-utilized" column family or qualifier fields, which are currently empty (as are visibility, timestamp, etc.). If the new hashed delete also stored the hash code as a column family, it could be used to determine the length of the prefix to remove, which would be 0 for an empty column family, i.e., an old delete entry. Something like the following:

// find candidates for deletion; chop off the prefix
for (Entry<Key,Value> entry : scanner) {
  String cand = entry.getKey().getRow().toString()
      // old: .substring(MetadataSchema.DeletesSection.getRowPrefix().length());
      .substring(MetadataSchema.DeletesSection.getRowPrefixLength()
          + entry.getKey().getColumnFamily().toString().length());
  result.add(cand);
}
I think it would make sense to fail fast on empty paths, throwing an exception when trying to set them. Anything that consumes GC candidate paths should also fail on empty paths, but in practice it would not see them. I have found it's nice if code that reads persisted data does not assume that it was written in a certain way. In #1313 I pushed all reading and writing of GC candidates into Ample and moved this out of the GC code. I think it would make sense if Ample added and removed the hashes, but code outside of Ample was never aware of the hashes and only dealt with paths.
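As a rough illustration of that boundary (a hypothetical interface, not the actual Ample API), code outside Ample would only ever see plain paths:

// Hypothetical sketch: Ample owns the hash encoding; callers deal only in paths.
interface DeleteCandidateStore {
  void putDeleteCandidate(String path);    // internally persists ~del<hex(hash)><path>
  Iterable<String> getDeleteCandidates();  // returns plain paths, hashes stripped
  void removeDeleteCandidate(String path);
}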
The metadata should be rewritten in Upgrader9to10.upgradeMetadata().
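A minimal sketch, assuming the 2.x public client API, of what that rewrite might look like (the range bounds, empty column family/qualifier, and hash choice are assumptions; the real upgrader would work against the metadata internals directly):

import java.util.Map;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class DeleteEntryRewriteSketch {
  // Rewrites legacy ~del<path> rows into the proposed ~del<hex(hash)><path> form.
  static void rewriteDeleteEntries(AccumuloClient client) throws Exception {
    try (Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY);
        BatchWriter writer = client.createBatchWriter("accumulo.metadata")) {
      scanner.setRange(new Range("~del", "~del~")); // covers the legacy delete section
      for (Map.Entry<Key,Value> entry : scanner) {
        String oldRow = entry.getKey().getRow().toString();
        String path = oldRow.substring("~del".length());
        // write the new hashed entry, then delete the old one
        Mutation add = new Mutation("~del" + String.format("%08x", path.hashCode()) + path);
        add.put("", "", new Value(new byte[0]));
        Mutation del = new Mutation(oldRow);
        del.putDelete("", "");
        writer.addMutation(add);
        writer.addMutation(del);
      }
    }
  }
}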
So we should say that this is dependent on #1313 being completed. I will start taking a look at the upgrade code necessary, but did anyone have thoughts on the solution in my previous comment that would not require an upgrade?
I think it's a clever solution, and utilizing the values is a good way to go if we do not resolve this in upgrade. I am leaning towards resolving this in upgrade because I think it keeps the existing code simpler, but I am not 100% sure this is the best path to take. I think if you pursue not resolving this in upgrade, it would be best to clearly document and isolate the code that handles legacy delete candidates. If you are going to pursue the upgrade path, the following is one possible way to do it.
I will try to get #1313 in ASAP so you can build on it.
Another reason to resolve in upgrade is that over time, handling more legacy persisted formats becomes harder and harder to test. For example, assume in the future there are 5 different legacy data formats for the metadata table; will we ever test different combinations of the legacy formats? No, that testing will likely not be done. Doing a transformation in the upgrade step and testing that transformation independently seems more maintainable. The current code only needs to worry about testing the current format. The testing is much simpler and can better cover the data expected.
Fix 1043 Support Stable ~del split points (recommended changes applied; includes upgrade code)
updated to add value to all new ~del
When a file is deleted, an entry is written to the metadata table of the form ~del<path-to-delete>. This schema causes two problems. First, bulk imports that happened around the same time sort together in the table. If many bulk imports that happened close in time also compact close in time, it can lead to a hotspot in the metadata table, with many tablet servers trying to write to the same tablet. The second problem is that the active splits for the ~del prefix of the metadata table change over time, making it hard to presplit this portion of the metadata table.

One possible solution to the problem is to hash the path and include that in the metadata table, like ~del<hex(hash(path))><path>. This would lead to a stable set of split points and avoid hotspots. This assumes nothing cares about the sorted order, which needs to be validated. It's possible the sorted order matters for the directory ~del entries.

If this can be done, upgrade needs to be considered. Below are two possible upgrade designs.
- Rewrite the existing ~del entries to the new format during upgrade.
- Support both formats, using a new prefix like ~deh that indicates a hash is present.

It may make sense to make the same change for the ~blip section of the metadata table also.
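For illustration, a minimal sketch of the proposed row encoding (the fixed-width hex hash, helper names, and example path are assumptions, not an implemented format):

// Sketch of ~del<hex(hash(path))><path>: a fixed-width hash prefix gives
// stable split points, and the path can still be recovered from the row.
public class DelRowSketch {
  private static final String PREFIX = "~del";
  private static final int HASH_HEX_LEN = 8;

  static String encodeRow(String path) {
    // Any stable hash works; String.hashCode() is used here only for brevity.
    return PREFIX + String.format("%08x", path.hashCode()) + path;
  }

  static String decodePath(String row) {
    // The hash is fixed width, so stripping it is trivial.
    return row.substring(PREFIX.length() + HASH_HEX_LEN);
  }

  public static void main(String[] args) {
    String row = encodeRow("hdfs://nn1/accumulo/tables/2/t-0001/F0000abc.rf");
    System.out.println(row);
    System.out.println(decodePath(row));
  }
}

With a fixed-width hash like this, the ~del section could be presplit at fixed points (e.g., ~del0 through ~delf) independent of which paths actually exist.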