Skip to content

[HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible#9337

Merged
codope merged 14 commits intoapache:masterfrom
the-other-tim-brown:HUDI-6628
Aug 4, 2023
Merged

[HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible#9337
codope merged 14 commits intoapache:masterfrom
the-other-tim-brown:HUDI-6628

Conversation

@the-other-tim-brown
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown commented Aug 1, 2023

Change Logs

  • Updates sections of the code to use the getters in the HoodieBaseFile and HoodieLogFile instead of FSUtils to move away from relying directly on the path for getting metadata about the file when possible
  • Sets file ID and commit time in HoodieBaseFile on construction and avoids running split on the file name twice to improve efficiency
  • Sets fileId, baseCommitTime, logVersion, logWriteToken, fileExtension, suffix, and path when creating HoodieLogFile instead of making multiple calls to a regex based matcher to improve efficiency
  • Uses CachingPath instead of Path in HoodieLogFile for improved efficiency

Impact

Lowers overhead when extracting metadata about HoodieBaseFiles or HoodieLogFiles

Risk level (write none, low medium or high below)

low, unit tests were added to assert behavior is maintained

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

}

public String getFileId() {
return FSUtils.getFileIdFromLogPath(getPath());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously all of these methods would construct a path object and then run a matcher on the fileName from that path. Now we'll make a single Path object when creating the object and we'll run the matcher once and extract all the values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +67 to +86
private String[] getFileIdAndCommitTimeFromFileName() {
String[] values = new String[2];
short underscoreCount = 0;
short lastUnderscoreIndex = 0;
for (int i = 0; i < fileName.length(); i++) {
char c = fileName.charAt(i);
if (c == '_') {
if (underscoreCount == 0) {
values[0] = fileName.substring(0, i);
}
lastUnderscoreIndex = (short) i;
underscoreCount++;
} else if (c == '.') {
if (underscoreCount == 2) {
values[1] = fileName.substring(lastUnderscoreIndex + 1, i);
break;
}
}
}
return values;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if this code can be refactored a bit to make it more readable as to what is happening. 'File Id' is a UUID, so should not contain a dot. Would the following work:

/** @return {@link String} array where first element is file id and second element is commit time. */
private String[] getFileIdAndCommitTimeFromFileName() {
   int endOfFileId = fileName.lastIndexOf('-');
   int endOfCommitTime = fileName.indexOf('.', endOfFileId + 1);

   if (endOfFileId >= 0 && endOfCommitTime >= 0) {
     return new String[]{fileName.substring(0, endOfFileId), fileName.substring(endOfFileId + 1, endOfCommitTime)};
   }

   return new String[]{null, null};
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we strictly require file IDs to be a UUID so I don't think this will work. It will require multiple traversals of the string which is what I was trying to avoid.

.map(partitionBaseFilePair -> Pair.of(partitionBaseFilePair.getLeft(), partitionBaseFilePair.getRight().getFileName()))
.sorted()
.collect(toList());
Collections.sort(partitionFileNameList); // TODO why does this need to be sorted?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan or @yihua do you know why this partitionFileNameList has to be sorted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code. I don't think we need sorting here. anyways, internaly when polling col stats from MDT, after constructing all record keys to be looked up, we sort before looking up in hfile. we can probably remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there good testing around this? I can remove and make sure the tests still pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks like this sorting is not needed anymore. Likely leftover from the previous refactoring. The keys to look up in MDT needs to be sorted after generation. The list of partition and file name pairs need not tobe sorted here.

@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Aug 3, 2023
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok w/ changes in HoodieBaseFile and HoodieLogFile.
but in index and elsewhere, we should be cautious in bringing more memory to driver(HoodieBaseFile object instead of a string). lets re-think about that.

.map(partitionBaseFilePair -> Pair.of(partitionBaseFilePair.getLeft(), partitionBaseFilePair.getRight().getFileName()))
.sorted()
.collect(toList());
Collections.sort(partitionFileNameList); // TODO why does this need to be sorted?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code. I don't think we need sorting here. anyways, internaly when polling col stats from MDT, after constructing all record keys to be looked up, we sort before looking up in hfile. we can probably remove this.

}

public String getFileId() {
return FSUtils.getFileIdFromLogPath(getPath());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// further
this.logFile = new HoodieLogFile(FSUtils.makeQualified(fs, logFile.getPath()), logFile.getFileSize());
Path updatedPath = FSUtils.makeQualified(fs, logFile.getPath());
this.logFile = updatedPath.equals(logFile.getPath()) ? logFile : new HoodieLogFile(updatedPath, logFile.getFileSize());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may I know what this change is for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid creating an extra object and recomputing the metadata from the file name

@hudi-bot
Copy link
Collaborator

hudi-bot commented Aug 4, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker release-0.14.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants