[HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible#9337
| } | ||
|
|
||
| public String getFileId() { | ||
| return FSUtils.getFileIdFromLogPath(getPath()); |
Previously, each of these methods would construct a Path object and then run a matcher on the file name from that path. Now we create a single Path object when the file object is constructed, run the matcher once, and extract all the values from that one match.
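To illustrate the "match once, read many times" idea described above, here is a minimal sketch. The class name `ParsedLogFileName` and the simplified name layout `.<fileId>_<commitTime>.log.<version>` are assumptions made for this example; the actual Hudi log file pattern and the `HoodieLogFile` fields may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder that parses a log file name once, at construction time. */
final class ParsedLogFileName {
  // Simplified, assumed layout: .<fileId>_<commitTime>.log.<version>
  private static final Pattern NAME_PATTERN =
      Pattern.compile("^\\.(.+)_(.*)\\.log\\.(\\d+)$");

  private final String fileId;
  private final String commitTime;
  private final int version;

  ParsedLogFileName(String fileName) {
    // The matcher runs exactly once; every getter below reads a cached field.
    Matcher matcher = NAME_PATTERN.matcher(fileName);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Unexpected log file name: " + fileName);
    }
    this.fileId = matcher.group(1);
    this.commitTime = matcher.group(2);
    this.version = Integer.parseInt(matcher.group(3));
  }

  String getFileId() {
    return fileId;
  }

  String getCommitTime() {
    return commitTime;
  }

  int getVersion() {
    return version;
  }

  public static void main(String[] args) {
    ParsedLogFileName parsed = new ParsedLogFileName(".abc-0_20230801.log.3");
    System.out.println(parsed.getFileId() + " " + parsed.getCommitTime() + " " + parsed.getVersion());
  }
}
```

The trade-off is a few extra fields per object in exchange for not re-running the regex on every getter call.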
| private String[] getFileIdAndCommitTimeFromFileName() { | ||
| String[] values = new String[2]; | ||
| short underscoreCount = 0; | ||
| short lastUnderscoreIndex = 0; | ||
| for (int i = 0; i < fileName.length(); i++) { | ||
| char c = fileName.charAt(i); | ||
| if (c == '_') { | ||
| if (underscoreCount == 0) { | ||
| values[0] = fileName.substring(0, i); | ||
| } | ||
| lastUnderscoreIndex = (short) i; | ||
| underscoreCount++; | ||
| } else if (c == '.') { | ||
| if (underscoreCount == 2) { | ||
| values[1] = fileName.substring(lastUnderscoreIndex + 1, i); | ||
| break; | ||
| } | ||
| } | ||
| } | ||
| return values; |
I am wondering if this code can be refactored a bit to make it more readable as to what is happening. 'File Id' is a UUID, so should not contain a dot. Would the following work:
/** @return {@link String} array where first element is file id and second element is commit time. */
private String[] getFileIdAndCommitTimeFromFileName() {
int endOfFileId = fileName.lastIndexOf('-');
int endOfCommitTime = fileName.indexOf('.', endOfFileId + 1);
if (endOfFileId >= 0 && endOfCommitTime >= 0) {
return new String[]{fileName.substring(0, endOfFileId), fileName.substring(endOfFileId + 1, endOfCommitTime)};
}
return new String[]{null, null};
}
I don't think we strictly require file IDs to be UUIDs, so I don't think this will work. It would also require multiple traversals of the string, which is what I was trying to avoid.
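For reference, the single-pass approach discussed in this thread can be tried in isolation with a sketch like the one below. The helper class `BaseFileNameParser` is hypothetical, and the layout `<fileId>_<writeToken>_<commitTime>.<extension>` is an assumption inferred from the loop in the diff. The file ID in the usage example deliberately contains a dot and dashes to show that no UUID shape is assumed.

```java
/** Hypothetical stand-alone helper; not the actual Hudi class. */
final class BaseFileNameParser {

  private BaseFileNameParser() {
  }

  /**
   * Extracts the file ID and commit time in one left-to-right pass over a name
   * assumed to look like fileId_writeToken_commitTime.extension.
   *
   * @return two-element array [fileId, commitTime]; an entry is null if not found.
   */
  static String[] parseFileIdAndCommitTime(String fileName) {
    String[] values = new String[2];
    int underscoreCount = 0;
    int lastUnderscoreIndex = -1;
    for (int i = 0; i < fileName.length(); i++) {
      char c = fileName.charAt(i);
      if (c == '_') {
        if (underscoreCount == 0) {
          // Everything before the first underscore is the file ID.
          values[0] = fileName.substring(0, i);
        }
        lastUnderscoreIndex = i;
        underscoreCount++;
      } else if (c == '.' && underscoreCount == 2) {
        // The first dot after the second underscore ends the commit time.
        values[1] = fileName.substring(lastUnderscoreIndex + 1, i);
        break;
      }
    }
    return values;
  }

  public static void main(String[] args) {
    String[] parsed = parseFileIdAndCommitTime("file.id-0_1-0-1_20230801.parquet");
    System.out.println(parsed[0] + " " + parsed[1]); // prints "file.id-0 20230801"
  }
}
```

This sketch uses `int` rather than `short` for the index and counter, which avoids the cast without changing behavior for realistic file names.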
| .map(partitionBaseFilePair -> Pair.of(partitionBaseFilePair.getLeft(), partitionBaseFilePair.getRight().getFileName())) | ||
| .sorted() | ||
| .collect(toList()); | ||
| Collections.sort(partitionFileNameList); // TODO why does this need to be sorted? |
@nsivabalan or @yihua, do you know why this partitionFileNameList has to be sorted?
I checked the code. I don't think we need sorting here. Internally, when polling col stats from MDT, we construct all the record keys to be looked up and then sort them before looking them up in the HFile anyway. We can probably remove this.
Is there good testing around this? I can remove it and make sure the tests still pass.
Yes, it looks like this sorting is not needed anymore. It is likely a leftover from the previous refactoring. The keys to look up in MDT need to be sorted after generation; the list of partition and file name pairs does not need to be sorted here.
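The reasoning in this thread can be checked with a small sketch: if the lookup keys are sorted after they are generated, the order of the input pairs does not affect the result. The class `LookupKeyOrdering` and the `partition|fileName` key format are hypothetical stand-ins; the real MDT key encoding is different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration; not the actual Hudi lookup code. */
final class LookupKeyOrdering {

  private LookupKeyOrdering() {
  }

  /** Assumed key format for this sketch only. */
  static String toLookupKey(String partition, String fileName) {
    return partition + "|" + fileName;
  }

  /** Builds keys from pairs in any order, then sorts the keys themselves. */
  static List<String> sortedLookupKeys(List<String[]> partitionFileNamePairs) {
    return partitionFileNamePairs.stream()
        .map(pair -> toLookupKey(pair[0], pair[1]))
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String[]> pairs = new ArrayList<>(Arrays.asList(
        new String[] {"2023/08/02", "b.parquet"},
        new String[] {"2023/08/01", "a.parquet"},
        new String[] {"2023/08/01", "c.parquet"}));
    List<String> fromOriginalOrder = sortedLookupKeys(pairs);
    Collections.reverse(pairs);
    List<String> fromReversedOrder = sortedLookupKeys(pairs);
    // Same sorted keys regardless of how the input pairs were ordered.
    System.out.println(fromOriginalOrder.equals(fromReversedOrder)); // prints "true"
  }
}
```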
nsivabalan
left a comment
I am OK with the changes in HoodieBaseFile and HoodieLogFile.
But in the index and elsewhere, we should be cautious about bringing more memory to the driver (a HoodieBaseFile object instead of a string). Let's rethink that.
| // further | ||
| this.logFile = new HoodieLogFile(FSUtils.makeQualified(fs, logFile.getPath()), logFile.getFileSize()); | ||
| Path updatedPath = FSUtils.makeQualified(fs, logFile.getPath()); | ||
| this.logFile = updatedPath.equals(logFile.getPath()) ? logFile : new HoodieLogFile(updatedPath, logFile.getFileSize()); |
May I know what this change is for?
This is to avoid creating an extra object and recomputing the metadata from the file name when the path is already qualified.
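A minimal sketch of this reuse pattern follows. The nested `LogFile` class is a hypothetical stand-in for `HoodieLogFile`, and `toAbsolutePath().normalize()` from `java.nio.file` stands in for `FSUtils.makeQualified`, purely so the example runs without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration of "reuse the instance if qualifying changes nothing". */
final class QualifiedLogFileExample {

  private QualifiedLogFileExample() {
  }

  /** Stand-in for HoodieLogFile: a path plus a size. */
  static final class LogFile {
    final Path path;
    final long size;

    LogFile(Path path, long size) {
      this.path = path;
      this.size = size;
    }
  }

  /** Returns the same instance when the path is already qualified. */
  static LogFile withQualifiedPath(LogFile logFile) {
    // Assumption: absolute + normalized plays the role of a "qualified" path here.
    Path qualified = logFile.path.toAbsolutePath().normalize();
    return qualified.equals(logFile.path)
        ? logFile
        : new LogFile(qualified, logFile.size);
  }

  public static void main(String[] args) {
    Path absolute = Paths.get("data", ".f1_100.log.1").toAbsolutePath().normalize();
    LogFile alreadyQualified = new LogFile(absolute, 42L);
    // No new object is allocated when nothing would change.
    System.out.println(withQualifiedPath(alreadyQualified) == alreadyQualified);
  }
}
```

In the real class, skipping the constructor call would also skip re-parsing the file name, which is the saving described above.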
Change Logs
split on the file name twice to improve efficiency
Impact
Lowers overhead when extracting metadata about HoodieBaseFiles or HoodieLogFiles
Risk level (write none, low, medium or high below)
Low. Unit tests were added to assert that behavior is maintained.