Fixed FileNormalizationIT after #5237#5240
Conversation
The changes in apache#5237 to make the external compaction json metadata entry easier to use ended up breaking FileNormalizationIT. apache#5237 removed the metadata entry string which ended up escaping the double quotes in the string for the json and replaced it with an object that contains the path, start, and end row. The problem with this is that the metadata entry retains any misnormalization of the file name (e.g. double slashes in the path) and that path alone does not. This fixes the issue by putting the metadata entry into the new object that is serialized to json so that the json for the external compaction files now contains the metadata entry, normalized path, start and end row.
cshannon
left a comment
There was a problem hiding this comment.
LGTM, just a couple small comments on the package scope of those methods.
I like that it's cleaner now by using StoredTabletFile directly inside the TabletFileCqMetadataGson constructor
core/src/main/java/org/apache/accumulo/core/metadata/StoredTabletFile.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/accumulo/core/metadata/StoredTabletFile.java
Outdated
Show resolved
Hide resolved
| protected byte[] startRow; | ||
| protected byte[] endRow; |
There was a problem hiding this comment.
I'm wondering if we should remove these at this point. They are contained in the metadata entry and the path is really only used for readability at this point.
There was a problem hiding this comment.
It just occurred to me actually, now that you brought back the metadataEntry and not just the path, wouldn't the extra quotes come back again? I need to look at the output to check
There was a problem hiding this comment.
I am trying to figure out how this looks when serialized. The goal of metadata entry in stored tablet file is to record the exact column qual in the metadata table so that updates can use that even if normalization changes it, so not sure what serializing it will look like. Seems like it may contain itself.
There was a problem hiding this comment.
Here is the output of the System.out.println in the new test:
{"metadataEntry":"{ \"path\":\"hdfs://localhost:8020/accumulo//tables//1/t-0000000/A000003v.rf\",\"startRow\":\"AmEA\",\"endRow\":\"AnoA\" }","path":"hdfs://localhost:8020/accumulo/tables/1/t-0000000/A000003v.rf","startRow":"AmEA","endRow":"AnoA"}
There was a problem hiding this comment.
The new output looks correct to me and what we want, so I think you can go ahead and remove the separate startRow and endRow as we don't really need to access it separately as it's part of the metadata path and will be parsed already when rebuilding StoredTabletFile from the json
There was a problem hiding this comment.
@cshannon - I think the issue I had was the name
TabletFileCq. The name implies that it's being used for a column qualifier. I think a name likeFileInfoor something may be less confusing.
Yeah good point, I think some of this was just natural evolution while working on it. I don't remember exactly how it went by when working on the no-chop merge stuff over a few months i'm sure a lot got tweaked and renamed. We could rename it but I'd want to do it in 3.1 branch as well so it's consistent.
There was a problem hiding this comment.
@keith-turner - I guess the question is whether or not the convenience of having the file path in an easier to read format in the external compaction json is worth having it in addition to the metadata entry. If not, then we should revert #5237. If it is worth it, then I can try and clean this up.
I think the extra data may make it hard to read. I was looking at the serialized compaction data in the manager logs w/ compaction bugs that caused thousands of files to build up per tablet. Having more data for that case may make it more confusing.
Could we write a tool or something to make scanning and reading the entry easier?
Yeah that seems good. The data in the metadata table should probably be compact and correct (so it would only have the escaped data) and maybe we can have tooling to make the data more readable in logs and provide some mechanism to make the data more readable when scanning the metadata table.
There was a problem hiding this comment.
Ok, I'm going to close this, then revert #5237 directly on main and push that.
There was a problem hiding this comment.
Ok, I'm going to close this, then revert #5237 directly on main and push that.
After that is done I will open a PR to add a comment to the compaction metadata code about maintaining exact data from file family.
The changes in #5237 to make the external compaction json metadata entry easier to use ended up breaking FileNormalizationIT. #5237 removed the metadata entry string which ended up escaping the double quotes in the string for the json and replaced it with an object that contains the path, start, and end row. The problem with this is that the metadata entry retains any misnormalization of the file name (e.g. double slashes in the path) and that path alone does not. This fixes the issue by putting the metadata entry into the new object that is serialized to json so that the json for the external compaction files now contains the metadata entry, normalized path, start and end row.