[WIP] hudi cluster write path poc #2082
Conversation
bvaradar
left a comment
Yet to finish a full pass; adding comments so far.
Once the REPLACE action changes from @satishkotha (#2048) are merged with this change, this would simplify.
The naming "cluster" vs. "clustering" needs to be standardized.
@Override
protected void addPendingClusteringOperations(Stream<Pair<String, ClusteringOperation>> operations) {
  // TODO
This TODO and others are needed for incremental filesystem view syncing.
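A minimal sketch of what that TODO could look like: index each pending clustering operation by its file group id so the filesystem view can answer lookups during incremental syncing. This is only an illustration with plain Strings standing in for Hudi's `HoodieFileGroupId` / `ClusteringOperation` types; the class and method shapes here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Simplified stand-in: each operation is a {instantTime, fileGroupId} pair.
public class PendingClusteringIndex {

  // fileGroupId -> instant time of its pending clustering operation
  private final Map<String, String> fileIdToPendingClustering = new HashMap<>();

  // One possible shape for the TODO: record each operation, failing fast if a
  // file group already has a pending clustering operation.
  public void addPendingClusteringOperations(Stream<String[]> operations) {
    operations.forEach(op -> {
      String instantTime = op[0];
      String fileGroupId = op[1];
      String previous = fileIdToPendingClustering.putIfAbsent(fileGroupId, instantTime);
      if (previous != null) {
        throw new IllegalStateException(
            "Duplicate pending clustering for file group " + fileGroupId);
      }
    });
  }

  public boolean isPendingClustering(String fileGroupId) {
    return fileIdToPendingClustering.containsKey(fileGroupId);
  }
}
```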
  return fileIdToPendingCompaction;
}

protected Map<HoodieFileGroupId, Pair<String, ClusteringOperation>> createFileIdToPendingClusteringMap(
It would be a good idea to rebase this with #2048 as it is almost ready.
} else {
  return HoodieTimeline.makeCommitFileName(timestamp);
}
} else if (HoodieTimeline.CLUSTERING_ACTION.equals(action)) {
I am assuming this would be changed to REPLACE_ACTION from the other PR.
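If CLUSTERING_ACTION is indeed folded into REPLACE_ACTION per #2048, the branching above could collapse into a simple action-to-extension mapping. A self-contained sketch (the constants and `makeFileName` helper here are hypothetical simplifications, not Hudi's actual `HoodieTimeline` API):

```java
public class CommitFileNames {

  // Hypothetical constants mirroring timeline action names.
  static final String COMMIT_ACTION = "commit";
  static final String REPLACE_ACTION = "replacecommit";

  // With clustering reusing REPLACE, one switch covers both cases and no
  // separate CLUSTERING_ACTION branch is needed.
  static String makeFileName(String action, String timestamp) {
    switch (action) {
      case COMMIT_ACTION:
        return timestamp + ".commit";
      case REPLACE_ACTION:
        return timestamp + ".replacecommit";
      default:
        throw new IllegalArgumentException("Unknown action: " + action);
    }
  }
}
```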
.collect(Collectors.toSet());
List<HoodieClusteringOperation> operations = jsc.parallelize(partitionPaths, partitionPaths.size()).map((Function<String, ClusteringOperation>) partitionPath -> {
Stream<FileSlice> fileSliceStream = fileSystemView.getLatestFileSlices(partitionPath);
List<HoodieBaseFile> baseFiles = fileSliceStream.filter(slice -> (!fgIdsInPendingClusterings.contains(slice.getFileGroupId()) && !fgIdsPendingCompactions.contains(slice.getFileGroupId())))
For a COW table, pending compactions would not be present.
Some of this code seems common with compaction handling. Any chance of moving it into common helper functions?
Thanks, will reuse this code.
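The shared piece both compaction and clustering scheduling need is "keep only file groups with no pending operation of either kind". A possible common helper, sketched with plain Strings in place of `HoodieFileGroupId` (class and method names are hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClusteringCandidateFilter {

  // Shared filter usable from both schedulers: exclude file groups that
  // already have a pending clustering or pending compaction operation.
  static List<String> filterOutPendingFileGroups(Stream<String> latestFileGroupIds,
                                                 Set<String> pendingClustering,
                                                 Set<String> pendingCompaction) {
    return latestFileGroupIds
        .filter(fg -> !pendingClustering.contains(fg) && !pendingCompaction.contains(fg))
        .collect(Collectors.toList());
  }
}
```

For a COW table the `pendingCompaction` set would simply be empty, so the same helper covers both table types.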
}

@Override
public int compare(HoodieClusteringOperation op1, HoodieClusteringOperation op2) {
Can this be moved to a separate comparator class?
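Pulling the anonymous `compare()` into a named, reusable comparator class would look roughly like this. The stand-in `ClusteringOp` type and the sort key (size first, then partition) are illustrative assumptions, not the PR's actual ordering:

```java
import java.util.Comparator;

// Stand-in for HoodieClusteringOperation with just the fields the comparator needs.
class ClusteringOp {
  final String partitionPath;
  final long totalBytes;

  ClusteringOp(String partitionPath, long totalBytes) {
    this.partitionPath = partitionPath;
    this.totalBytes = totalBytes;
  }
}

// A named comparator class, as the reviewer suggests, instead of an inline
// anonymous implementation.
public class ClusteringOperationComparator implements Comparator<ClusteringOp> {
  @Override
  public int compare(ClusteringOp op1, ClusteringOp op2) {
    int bySize = Long.compare(op2.totalBytes, op1.totalBytes); // larger first
    return bySize != 0 ? bySize : op1.partitionPath.compareTo(op2.partitionPath);
  }
}
```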
String partitionPath = partitionStat.getKey();
String fileId = updateLocEntry.getKey();
if (partitionFileIdPairs.contains(Pair.of(partitionPath, fileId))) {
  LOG.error("Not allowed to update the clustering files, partition: " + partitionPath + ", fileID " + fileId + ", please use other strategy.");
"please use other strategy" => reword? For pending clustering operations, we are not going to support updates for now, right?
Yes, the first step will not support updates during clustering.
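The rejection check in the snippet above could be factored into a small validator that fails any commit whose updates touch a (partition, fileId) pair under pending clustering. A self-contained sketch with plain map entries standing in for Hudi's `Pair` (all names here are hypothetical):

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Set;

public class ClusteringUpdateRejection {

  // Since the first step does not support updates during clustering, reject any
  // update whose (partition, fileId) pair has a pending clustering operation.
  static void validateNoUpdatesToPendingClustering(
      Map<String, String> updatedPartitionToFileId,
      Set<Map.Entry<String, String>> pendingClusteringFileIds) {
    for (Map.Entry<String, String> update : updatedPartitionToFileId.entrySet()) {
      if (pendingClusteringFileIds.contains(
          new SimpleImmutableEntry<>(update.getKey(), update.getValue()))) {
        throw new IllegalStateException("Updates to file groups under pending "
            + "clustering are not supported: partition=" + update.getKey()
            + ", fileId=" + update.getValue());
      }
    }
  }
}
```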
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{
Are we going to reuse the Replace metadata or this one?
INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION,
INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION,
INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION));
INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION,
Are we going to reuse the REPLACE action here?
  return fileIdToPendingCompaction;
}

protected Map<HoodieFileGroupId, Pair<String, ClusteringOperation>> createFileIdToPendingClusteringMap(
Assuming we are going to reuse REPLACE, many of these changes should go away?
public HoodieWriteMetadata clustering(JavaSparkContext jsc, String compactionInstantTime) {
  return new RunClusteringActionExecutor(jsc, config, this, compactionInstantTime).execute();
}
Need to ensure small file handling doesn't step on clustering. In other words, we should apply the update rejection policy to small file handling too.
Thanks, will pay attention to this.
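One way to keep small-file handling from colliding with clustering is to apply the same rejection policy when routing inserts: never pick a "small file" whose file group has a pending clustering operation. A self-contained sketch (the `BaseFile` stand-in, threshold, and method names are illustrative assumptions):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SmallFilePicker {

  // Minimal stand-in for a base file: just its file group id and size.
  static class BaseFile {
    final String fileGroupId;
    final long sizeBytes;

    BaseFile(String fileGroupId, long sizeBytes) {
      this.fileGroupId = fileGroupId;
      this.sizeBytes = sizeBytes;
    }
  }

  // Small-file handling that respects clustering: a file qualifies only if it
  // is under the size limit AND its file group has no pending clustering
  // operation, so new inserts never land in a file group being rewritten.
  static List<BaseFile> pickSmallFiles(List<BaseFile> candidates,
                                       long smallFileLimitBytes,
                                       Set<String> pendingClusteringFileGroups) {
    return candidates.stream()
        .filter(f -> f.sizeBytes < smallFileLimitBytes)
        .filter(f -> !pendingClusteringFileGroups.contains(f.fileGroupId))
        .collect(Collectors.toList());
  }
}
```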
Sure. Considering I am a little busy these days, it would be wonderful if you, @satishkotha, would take over the PR and land it. Thanks.
@leesf @satishkotha what is your process? I am interested in taking this over and landing it. Thanks.
@lw309637554 I've already started working on this. Perhaps you could help with one of the follow-up tasks of #2048? These are tracked as subtasks at https://issues.apache.org/jira/browse/HUDI-868. Subtasks 2 and 4 are easy to get started with, but feel free to pick others too. @vinothchandar Maybe we can close this PR to avoid confusion? I'll open a new PR when I'm ready and have run some basic tests.
@satishkotha ok, I can take some subtasks in https://issues.apache.org/jira/browse/HUDI-868.
Closing this, since we now have the actual PR from @satishkotha.