[HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible by the-other-tim-brown · Pull Request #9337 · apache/hudi

the-other-tim-brown · 2023-08-01T21:34:29Z

Change Logs

Updates sections of the code to use the getters in the HoodieBaseFile and HoodieLogFile instead of FSUtils to move away from relying directly on the path for getting metadata about the file when possible
Sets file ID and commit time in HoodieBaseFile on construction and avoids running split on the file name twice to improve efficiency
Sets fileId, baseCommitTime, logVersion, logWriteToken, fileExtension, suffix, and path when creating HoodieLogFile instead of making multiple calls to a regex based matcher to improve efficiency
Uses CachingPath instead of Path in HoodieLogFile for improved efficiency

Impact

Lowers overhead when extracting metadata about HoodieBaseFiles or HoodieLogFiles

Risk level (write none, low medium or high below)

low, unit tests were added to assert behavior is maintained

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

the-other-tim-brown · 2023-08-01T21:48:22Z

hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java

  }

  public String getFileId() {
-    return FSUtils.getFileIdFromLogPath(getPath());


Previously, each of these methods constructed a Path object and then ran a matcher on the file name from that path. Now we create a single Path object when constructing the HoodieLogFile, run the matcher once, and extract all the values.
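A minimal sketch of the "match once, keep every group" idea described above. The class name `ParsedLogFileName` and the simplified name layout in the regex are assumptions made for illustration; this is not Hudi's actual `HoodieLogFile` or its log file pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a log file descriptor; not Hudi's HoodieLogFile. */
final class ParsedLogFileName {
  // Assumed, simplified layout: .<fileId>_<baseCommitTime>.<extension>.<version>[_<writeToken>]
  private static final Pattern LOG_NAME =
      Pattern.compile("^\\.([^_]+)_([^.]+)\\.([a-z]+)\\.(\\d+)(?:_(\\d+-\\d+-\\d+))?$");

  final String fileId;
  final String baseCommitTime;
  final String extension;
  final int version;
  final String writeToken; // null when the name carries no write token

  ParsedLogFileName(String fileName) {
    // Run the matcher a single time and store every captured group as a field,
    // so the getters never have to touch the regex again.
    Matcher m = LOG_NAME.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a log file name: " + fileName);
    }
    this.fileId = m.group(1);
    this.baseCommitTime = m.group(2);
    this.extension = m.group(3);
    this.version = Integer.parseInt(m.group(4));
    this.writeToken = m.group(5);
  }
}
```

With this shape, reading `fileId`, `version`, and `writeToken` from the same object costs one regex match in total instead of one per accessor.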

amrishlal · 2023-08-01T22:47:54Z

hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java

+  private String[] getFileIdAndCommitTimeFromFileName() {
+    String[] values = new String[2];
+    short underscoreCount = 0;
+    short lastUnderscoreIndex = 0;
+    for (int i = 0; i < fileName.length(); i++) {
+      char c = fileName.charAt(i);
+      if (c == '_') {
+        if (underscoreCount == 0) {
+          values[0] = fileName.substring(0, i);
+        }
+        lastUnderscoreIndex = (short) i;
+        underscoreCount++;
+      } else if (c == '.') {
+        if (underscoreCount == 2) {
+          values[1] = fileName.substring(lastUnderscoreIndex + 1, i);
+          break;
+        }
+      }
+    }
+    return values;


I am wondering if this code can be refactored a bit to make it more readable as to what is happening. 'File Id' is a UUID, so should not contain a dot. Would the following work:

/** @return {@link String} array where first element is file id and second element is commit time. */
private String[] getFileIdAndCommitTimeFromFileName() {
  int endOfFileId = fileName.lastIndexOf('-');
  int endOfCommitTime = fileName.indexOf('.', endOfFileId + 1);
  if (endOfFileId >= 0 && endOfCommitTime >= 0) {
    return new String[]{fileName.substring(0, endOfFileId), fileName.substring(endOfFileId + 1, endOfCommitTime)};
  }
  return new String[]{null, null};
}

I don't think we strictly require file IDs to be UUIDs, so I don't think this will work. It would also require multiple traversals of the string, which is what I was trying to avoid.
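For reference, the single-pass approach from the diff can be written as a self-contained sketch. It assumes a base file name of the form `<fileId>_<writeToken>_<commitTime>.<extension>`; the class name `BaseFileNameParser` is hypothetical, and the sketch uses `int` indices rather than the `short` ones in the diff.

```java
/** Hypothetical helper that mirrors the single-pass idea; not Hudi's actual class. */
final class BaseFileNameParser {
  /** Returns {fileId, commitTime}; an element stays null if it could not be found. */
  static String[] parse(String fileName) {
    String[] values = new String[2];
    int underscoreCount = 0;
    int lastUnderscoreIndex = -1;
    for (int i = 0; i < fileName.length(); i++) {
      char c = fileName.charAt(i);
      if (c == '_') {
        if (underscoreCount == 0) {
          // Everything before the first '_' is the file id, whatever it contains.
          values[0] = fileName.substring(0, i);
        }
        lastUnderscoreIndex = i;
        underscoreCount++;
      } else if (c == '.' && underscoreCount == 2) {
        // The first '.' after the second '_' ends the commit time.
        values[1] = fileName.substring(lastUnderscoreIndex + 1, i);
        break;
      }
    }
    return values;
  }
}
```

Because the file id is delimited only by the first underscore, a file id that contains a dot or a hyphen is still extracted correctly, and the string is scanned once.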

the-other-tim-brown · 2023-08-02T02:05:29Z

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java

-            .map(partitionBaseFilePair -> Pair.of(partitionBaseFilePair.getLeft(), partitionBaseFilePair.getRight().getFileName()))
-            .sorted()
-            .collect(toList());
+    Collections.sort(partitionFileNameList); // TODO why does this need to be sorted?


@nsivabalan or @yihua do you know why this partitionFileNameList has to be sorted?

I checked the code. I don't think we need sorting here. Anyway, internally, when polling col stats from MDT, after constructing all the record keys to be looked up, we sort them before looking them up in the HFile. We can probably remove this.

Is there good testing around this? I can remove it and make sure the tests still pass.

Yes, it looks like this sorting is not needed anymore. It is likely a leftover from the previous refactoring. The keys to look up in MDT need to be sorted after generation. The list of partition and file name pairs need not be sorted here.
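The reasoning above can be illustrated with a small sketch: if the lookup keys are sorted after they are generated, the order of the input pairs does not matter. The `key` format below is a hypothetical placeholder; the real metadata table keys are built differently.

```java
import java.util.List;
import java.util.stream.Collectors;

final class LookupKeys {
  /** Hypothetical key format, for illustration only. */
  static String key(String partition, String fileName) {
    return partition + "/" + fileName;
  }

  /** Builds lookup keys from {partition, fileName} pairs and sorts the keys themselves. */
  static List<String> sortedKeys(List<String[]> partitionFileNames) {
    return partitionFileNames.stream()
        .map(p -> key(p[0], p[1]))
        .sorted()
        .collect(Collectors.toList());
  }
}
```

Calling `sortedKeys` with the same pairs in any order yields the same sorted key list, so pre-sorting the pairs adds work without changing the lookup.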

nsivabalan

I am OK with the changes in HoodieBaseFile and HoodieLogFile,
but in the index and elsewhere we should be cautious about bringing more memory to the driver (a HoodieBaseFile object instead of a string). Let's rethink that.
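One way to picture the concern: project each file descriptor down to the strings the driver actually needs before collecting. The sketch below uses a hypothetical `FileDescriptor` class as a stand-in for a richer object and `Map.Entry` in place of Hudi's `Pair`; it is an illustration under those assumptions, not the code in this PR.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DriverSideProjection {
  /** Hypothetical stand-in for a richer per-file object. */
  static final class FileDescriptor {
    final String fileName;
    final String fileId;
    final String commitTime;
    final long fileSize;

    FileDescriptor(String fileName, String fileId, String commitTime, long fileSize) {
      this.fileName = fileName;
      this.fileId = fileId;
      this.commitTime = commitTime;
      this.fileSize = fileSize;
    }
  }

  /** Keeps only (partition, fileName) so the collected list holds plain strings. */
  static List<Map.Entry<String, String>> toPartitionFileNames(
      List<Map.Entry<String, FileDescriptor>> files) {
    return files.stream()
        .<Map.Entry<String, String>>map(
            e -> new SimpleImmutableEntry<>(e.getKey(), e.getValue().fileName))
        .collect(Collectors.toList());
  }
}
```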

nsivabalan · 2023-08-03T15:08:40Z

I checked the code. I don't think we need sorting here. Anyway, internally, when polling col stats from MDT, after constructing all the record keys to be looked up, we sort them before looking them up in the HFile. We can probably remove this.

...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

nsivabalan · 2023-08-03T15:16:09Z

hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java

  }

  public String getFileId() {
-    return FSUtils.getFileIdFromLogPath(getPath());


nsivabalan · 2023-08-03T15:18:31Z

hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java

    //       further
-    this.logFile = new HoodieLogFile(FSUtils.makeQualified(fs, logFile.getPath()), logFile.getFileSize());
+    Path updatedPath = FSUtils.makeQualified(fs, logFile.getPath());
+    this.logFile = updatedPath.equals(logFile.getPath()) ? logFile : new HoodieLogFile(updatedPath, logFile.getFileSize());


May I know what this change is for?

This is to avoid creating an extra object and recomputing the metadata from the file name.
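The pattern in the diff above (reuse the existing instance when qualification leaves the path unchanged) can be sketched without Hadoop types. `LogFileHandle` and its `qualify` helper are hypothetical names, and plain strings stand in for `Path`; the real code calls `FSUtils.makeQualified`.

```java
/** Hypothetical stand-in for a log file object that parses its name on construction. */
final class LogFileHandle {
  final String path;
  final long fileSize;
  final String fileId;

  LogFileHandle(String path, long fileSize) {
    this.path = path;
    this.fileSize = fileSize;
    String name = path.substring(path.lastIndexOf('/') + 1);
    int underscore = name.indexOf('_');
    if (underscore < 0) {
      throw new IllegalArgumentException("Unexpected file name: " + name);
    }
    // Stand-in for the (comparatively expensive) file name parsing.
    this.fileId = name.substring(name.startsWith(".") ? 1 : 0, underscore);
  }

  /** Assumed qualification step: adds a scheme only when the path has none. */
  static String qualify(String path, String defaultScheme) {
    return path.contains("://") ? path : defaultScheme + "://" + path;
  }

  /** Returns the same instance when the path is already qualified. */
  static LogFileHandle withQualifiedPath(LogFileHandle logFile, String defaultScheme) {
    String updated = qualify(logFile.path, defaultScheme);
    return updated.equals(logFile.path) ? logFile : new LogFileHandle(updated, logFile.fileSize);
  }
}
```

When the incoming path is already qualified, no new object is allocated and the name is not parsed a second time.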

hudi-bot · 2023-08-04T08:36:17Z

CI report:

dbd715a Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

the-other-tim-brown added 5 commits August 1, 2023 13:14

change to method usage

df46bc4

log file testing and refactor

6e6b63c

update basefile code, update another usage of file id, optimize FSUtils.getFileId

01f6db3

return instance of path in log file

d14b5be

add suffix test case

655dfaf

the-other-tim-brown commented Aug 1, 2023

amrishlal reviewed Aug 1, 2023

the-other-tim-brown added 3 commits August 1, 2023 16:45

fix suffix handling

89d3dd3

mark path as transient for serialization

e8527c9

handle base file naming edge case

48b44eb

the-other-tim-brown commented Aug 2, 2023

the-other-tim-brown added 2 commits August 2, 2023 09:08

more optimizations

9cbb48c

provide CachingPath instance to log file constructor when possible

ee75581

nsivabalan added the release-0.14.0 and priority:blocker (Production down; release blocker) labels Aug 3, 2023

nsivabalan requested changes Aug 3, 2023

remove sorting

57f4543

nsivabalan reviewed Aug 3, 2023

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java

set load factor on map to avoid resize

306b6c9

nsivabalan approved these changes Aug 3, 2023

nsivabalan reviewed Aug 3, 2023

hudi-common/src/main/java/org/apache/hudi/common/model/HoodieBaseFile.java

the-other-tim-brown added 2 commits August 3, 2023 19:35

fix HoodieMetadataBloomFilterProbingFunction bug

36a64db

lazy parse on hoodie log file, lazy init caching path as well

dbd715a

codope assigned nsivabalan and codope Aug 4, 2023

codope merged commit 3840922 into apache:master Aug 4, 2023

danny0405 mentioned this pull request Oct 8, 2024

[HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function #6384

Closed

4 tasks

hudi-bot mentioned this pull request Dec 9, 2025

Rely on HoodieBaseFile and HoodieLogFile methods over FsUtils #16143

Closed

6 participants
