Skip to content

Fixed FileNormalizationIT after #5237#5240

Closed
dlmarion wants to merge 2 commits intoapache:mainfrom
dlmarion:fix-file-normalization-it
Closed

Fixed FileNormalizationIT after #5237#5240
dlmarion wants to merge 2 commits intoapache:mainfrom
dlmarion:fix-file-normalization-it

Conversation

@dlmarion
Copy link
Copy Markdown
Contributor

@dlmarion dlmarion commented Jan 9, 2025

The changes in #5237 to make the external compaction json metadata entry easier to use ended up breaking FileNormalizationIT. #5237 removed the metadata entry string which ended up escaping the double quotes in the string for the json and replaced it with an object that contains the path, start, and end row. The problem with this is that the metadata entry retains any misnormalization of the file name (e.g. double slashes in the path) and that path alone does not. This fixes the issue by putting the metadata entry into the new object that is serialized to json so that the json for the external compaction files now contains the metadata entry, normalized path, start and end row.

The changes in apache#5237 to make the external compaction json metadata
entry easier to use ended up breaking FileNormalizationIT. apache#5237
removed the metadata entry string which ended up escaping the
double quotes in the string for the json and replaced it with an
object that contains the path, start, and end row. The problem
with this is that the metadata entry retains any misnormalization
of the file name (e.g. double slashes in the path) and that path
alone does not. This fixes the issue by putting the metadata entry
into the new object that is serialized to json so that the json
for the external compaction files now contains the metadata entry,
normalized path, start and end row.
@dlmarion dlmarion added this to the 4.0.0 milestone Jan 9, 2025
@dlmarion dlmarion self-assigned this Jan 9, 2025
Copy link
Copy Markdown
Contributor

@cshannon cshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple small comments on the package scope of those methods.

I like that it's cleaner now by using StoredTabletFile directly inside the TabletFileCqMetadataGson constructor

Comment on lines +297 to +298
protected byte[] startRow;
protected byte[] endRow;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should remove these at this point. They are contained in the metadata entry and the path is really only used for readability at this point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just occurred to me actually, now that you brought back the metadataEntry and not just the path, wouldn't the extra quotes come back again? I need to look at the output to check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out how this looks when serialized. The goal of metadata entry in stored tablet file is to record the exact column qual in the metadata table so that updates can use that even if normalization changes it, so not sure what serializing it will look like. Seems like it may contain itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the output of the System.out.println in the new test:

{"metadataEntry":"{ \"path\":\"hdfs://localhost:8020/accumulo//tables//1/t-0000000/A000003v.rf\",\"startRow\":\"AmEA\",\"endRow\":\"AnoA\" }","path":"hdfs://localhost:8020/accumulo/tables/1/t-0000000/A000003v.rf","startRow":"AmEA","endRow":"AnoA"}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new output looks correct to me and what we want, so I think you can go ahead and remove the separate startRow and endRow as we don't really need to access it separately as it's part of the metadata path and will be parsed already when rebuilding StoredTabletFile from the json

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cshannon - I think the issue I had was the name TabletFileCq. The name implies that it's being used for a column qualifier. I think a name like FileInfo or something may be less confusing.

Yeah good point, I think some of this was just natural evolution while working on it. I don't remember exactly how it went by when working on the no-chop merge stuff over a few months i'm sure a lot got tweaked and renamed. We could rename it but I'd want to do it in 3.1 branch as well so it's consistent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keith-turner - I guess the question is whether or not the convenience of having the file path in an easier to read format in the external compaction json is worth having it in addition to the metadata entry. If not, then we should revert #5237. If it is worth it, then I can try and clean this up.

I think the extra data may make it hard to read. I was looking at the serialized compaction data in the manager logs w/ compaction bugs that caused thousands of files to build up per tablet. Having more data for that case may make it more confusing.

Could we write a tool or something to make scanning and reading the entry easier?

Yeah that seems good. The data in the metadata table should probably be compact and correct (so it would only have the escaped data) and maybe we can have tooling to make the data more readable in logs and provide some mechanism to make the data more readable when scanning the metadata table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm going to close this, then revert #5237 directly on main and push that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm going to close this, then revert #5237 directly on main and push that.

After that is done I will open a PR to add a comment to the compaction metadata code about maintaining exact data from file family.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert done.

@dlmarion dlmarion closed this Jan 9, 2025
@dlmarion dlmarion deleted the fix-file-normalization-it branch January 9, 2025 17:52
@ctubbsii ctubbsii removed this from the 4.0.0 milestone Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants