[HUDI-1575] Early Conflict Detection For Multi-writer #6133

zhangyue19921010 · 2022-07-18T09:30:29Z

Replaced #6059

Change Logs

Please take a look at #6003 for more details.

Impact

no impact

Risk level low

Documentation Update

Please take a look at #6003 for more details.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

yanghua · 2022-07-20T09:10:47Z

@zhangyue19921010 Would you please update the PR to fix the conflicts.

zhangyue19921010 · 2022-07-22T11:15:34Z

Hi @yanghua and @yihua Sorry for the late response.
Resolved conflict! PTAL :)

zhangyue19921010 · 2022-10-26T11:32:57Z

@hudi-bot run azure

zhangyue19921010 · 2022-11-12T12:50:44Z

@hudi-bot run azure

yihua

Generally, the logic looks good and follows the design. We need to think about better code abstraction and reuse to avoid any discrepancy compared to the existing conflict detection and resolution strategy.

...nt/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java

yihua · 2022-11-15T19:36:59Z

...nt/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java

+   */
+  private TypedProperties refreshLockConfig(HoodieWriteConfig writeConfig, String key) {
+    TypedProperties props = new TypedProperties(writeConfig.getProps());
+    props.setProperty(LockConfiguration.ZK_LOCK_KEY_PROP_KEY, key);


Here it should check if the ZK-based lock is configured. Otherwise, it should throw an exception.

Generally, we should think about how to support different lock provider implementations. For the first cut, it may be okay to have this specific logic here.

Sure thing, changed. We could also mark a TODO here to support more lock providers as a next step.

Sg. Let's use LOCK_PROVIDER_CLASS_NAME instead of ZK_BASE_PATH_PROP_KEY for checking whether ZK-based lock is configured.

@zhangyue19921010 could you file a JIRA ticket besides the TODO because this requires more work?
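
For illustration, the check suggested in this thread could look roughly like the sketch below. It is not the code from this PR: it uses plain `java.util.Properties` instead of `TypedProperties`, and the property keys and the ZooKeeper provider class name are assumptions that should be verified against `HoodieLockConfig` and `LockConfiguration`.

```java
import java.util.Properties;

/** Sketch: validate the configured lock provider before overriding the ZooKeeper lock key. */
final class LockConfigRefreshSketch {
  // Assumed property keys and provider class name; verify against the Hudi config classes.
  static final String LOCK_PROVIDER_CLASS_NAME = "hoodie.write.lock.provider";
  static final String ZK_LOCK_KEY_PROP_KEY = "hoodie.write.lock.zookeeper.lock_key";
  static final String ZK_LOCK_PROVIDER =
      "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider";

  /** Returns a copy of the writer properties with the lock key replaced. */
  static Properties refreshLockConfig(Properties writerProps, String key) {
    String provider = writerProps.getProperty(LOCK_PROVIDER_CLASS_NAME);
    if (!ZK_LOCK_PROVIDER.equals(provider)) {
      // TODO: support other lock provider implementations.
      throw new IllegalStateException(
          "Only the ZooKeeper-based lock provider is supported here, but got: " + provider);
    }
    Properties props = new Properties();
    props.putAll(writerProps); // copy so the caller's config is not mutated
    props.setProperty(ZK_LOCK_KEY_PROP_KEY, key);
    return props;
  }
}
```

Checking the provider class name rather than the presence of a ZooKeeper base path avoids treating leftover ZooKeeper settings as proof that the ZooKeeper-based lock is actually in use.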

.../hudi-client-common/src/main/java/org/apache/hudi/client/transaction/TransactionManager.java

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java

yihua · 2022-11-15T21:20:33Z

...-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java

+        }
+
+        if (earlyConflictDetectionStrategy.hasMarkerConflict()) {
+          earlyConflictDetectionStrategy.resolveMarkerConflict(basePath, markerDir, markerName);


No exception should be thrown here at the timeline server if there is detected conflict. The timeline server should simply return false for the marker creation request and let the executor/write handle resolve the marker conflict (throw the exception).

Same reason: this check is batched and asynchronous. When a specific request gets a false result, it means the marker checker found a conflict, but that conflict may not involve the marker of the current request.

So is it possible to let the current executor handle other writers' conflicts, given that the timeline server checks asynchronously and in batches? :)

The timeline server should simply return false for the marker creation request

Totally agree with it.
For now, the timeline server returns false for the executor's request, and the executor then does:

    if (success) {
      return Option.of(new Path(FSUtils.getPartitionPath(markerDirPath, partitionPath), markerFileName));
    } else {
      // This failure may be due to early conflict detection, so we need to throw.
      throw new HoodieEarlyConflictDetectionException(
          new ConcurrentModificationException("Early conflict detected but cannot resolve conflicts for overlapping writes"));
    }
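
To make the division of responsibility in this thread concrete, here is a minimal, self-contained sketch. All class and method names are hypothetical stand-ins, not the PR's classes, and `IllegalStateException` stands in for `HoodieEarlyConflictDetectionException`. The server side only reports success or failure; the executor side converts a failed request into an exception.

```java
import java.util.ConcurrentModificationException;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

final class MarkerRequestSketch {
  /** Stand-in for the timeline server: never throws, only reports success or failure. */
  static final class TimelineServerStub {
    // Set by the asynchronous, batched checker when it finds any marker conflict.
    private final AtomicBoolean conflictDetected = new AtomicBoolean(false);

    void reportConflictFromAsyncCheck() {
      conflictDetected.set(true);
    }

    /** Returns false once a conflict is known, even if this marker is not the conflicting one. */
    boolean createMarker(String markerName) {
      return !conflictDetected.get();
    }
  }

  /** Stand-in for the executor / write handle: turns a failed request into an exception. */
  static Optional<String> createMarkerOrFail(TimelineServerStub server, String markerDir, String markerName) {
    if (server.createMarker(markerName)) {
      return Optional.of(markerDir + "/" + markerName);
    }
    throw new IllegalStateException(
        "Marker creation failed, possibly due to early conflict detection",
        new ConcurrentModificationException("Early conflict detected for overlapping writes"));
  }
}
```

Because the flag is set by a batched check, a request can fail even when its own marker is not part of the conflict, which is the concern raised above.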

...ce/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerCheckerRunnable.java

yihua · 2022-11-15T21:31:39Z

...ce/src/main/java/org/apache/hudi/timeline/service/handlers/marker/MarkerCheckerRunnable.java

+   * @param instants
+   * @return
+   */
+  private List<String> getCandidateInstants(List<Path> instants, String currentInstantTime) {


Can we adapt the common logic from ConflictResolutionStrategy instead of reinventing similar logic?

Yep, there are actually some differences here despite the shared name :)
For OCC, getCandidateInstants depends on state:

    // To find which instants are conflicting, we apply the following logic
    // 1. Get completed instants timeline only for commits that have happened since the last successful write.
    // 2. Get any scheduled or completed compaction or clustering operations that have started and/or finished
    // after the current instant. We need to check for write conflicts since they may have mutated the same files
    // that are being newly created by the current write.

For the current early conflict detection, getCandidateInstants is:

    /**
     * Get Candidate Instant to do conflict checking:
     * 1. Skip current writer related instant(currentInstantTime)
     * 2. Skip all instants after currentInstantTime
     * 3. Skip dead writers related instants based on heart-beat
     * @param instants
     * @return
     */

Also, thanks for reviewing! Really appreciate it!
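
The three skip rules in that Javadoc can be sketched as follows. This is only an illustration under stated assumptions: instants are represented as equal-length timestamp strings that sort lexicographically (the PR's method takes `List<Path>`), and the heartbeat lookup is a hypothetical map from instant time to last heartbeat in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CandidateInstantSketch {
  /**
   * Keeps only instants that started before the current one and whose writer is still alive.
   * The heartbeat map and the time parameters are hypothetical stand-ins for Hudi's heartbeat client.
   */
  static List<String> getCandidateInstants(List<String> instants,
                                           String currentInstantTime,
                                           Map<String, Long> lastHeartbeatMs,
                                           long nowMs,
                                           long maxAllowableHeartbeatIntervalMs) {
    List<String> candidates = new ArrayList<>();
    for (String instant : instants) {
      // Rules 1 and 2: skip the current instant and every instant after it.
      if (instant.compareTo(currentInstantTime) >= 0) {
        continue;
      }
      // Rule 3: skip writers with no heartbeat or an expired heartbeat.
      Long lastBeat = lastHeartbeatMs.get(instant);
      if (lastBeat == null || nowMs - lastBeat > maxAllowableHeartbeatIntervalMs) {
        continue;
      }
      candidates.add(instant);
    }
    return candidates;
  }
}
```

Unlike the OCC variant, this filter needs no knowledge of the last successful write; it only looks at which earlier writers are still in flight.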

...udi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java

...nt/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java

yihua · 2023-01-06T07:16:29Z

...nt/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java

+   */
+  private TypedProperties refreshLockConfig(HoodieWriteConfig writeConfig, String key) {
+    TypedProperties props = new TypedProperties(writeConfig.getProps());
+    props.setProperty(LockConfiguration.ZK_LOCK_KEY_PROP_KEY, key);


Sg. Let's use LOCK_PROVIDER_CLASS_NAME instead of ZK_BASE_PATH_PROP_KEY for checking whether ZK-based lock is configured.

@zhangyue19921010 could you file a JIRA ticket besides the TODO because this requires more work?

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

...client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java

hudi-common/src/main/java/org/apache/hudi/common/util/MarkerUtils.java

.../hudi-client-common/src/main/java/org/apache/hudi/client/transaction/TransactionManager.java

...nt/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/TimelineService.java

hudi-common/src/main/java/org/apache/hudi/common/util/MarkerUtils.java

yihua · 2023-01-09T02:41:16Z

...client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java

+    long maxAllowableHeartbeatIntervalInMs = config.getHoodieClientHeartbeatIntervalInMs() * config.getHoodieClientHeartbeatTolerableMisses();
+
+    HoodieDirectMarkerBasedEarlyConflictDetectionStrategy strategy =
+        (HoodieDirectMarkerBasedEarlyConflictDetectionStrategy) ReflectionUtils.loadClass(config.getEarlyConflictDetectionStrategyClassName(),


nit: we can think about loading the strategy class through reflection in a common place for reuse, instead of loading for every marker creation.
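
One possible shape for this nit is sketched below, using only JDK reflection rather than Hudi's `ReflectionUtils`. It assumes the strategy class has a no-argument constructor and holds no per-marker state, which may not match the real strategy classes; the class and method names are hypothetical. The heartbeat helper mirrors the interval-times-tolerable-misses computation in the diff above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StrategyLoaderSketch {
  // One instance per class name, created on first use and then reused.
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  /** Loads the class reflectively once; later calls return the cached instance. */
  static Object loadOnce(String className) {
    return CACHE.computeIfAbsent(className, name -> {
      try {
        // Assumes a public no-arg constructor and a stateless strategy.
        return Class.forName(name).getDeclaredConstructor().newInstance();
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("Cannot load strategy class " + name, e);
      }
    });
  }

  /** A writer is considered dead after this many milliseconds without a heartbeat. */
  static long maxAllowableHeartbeatIntervalMs(long heartbeatIntervalMs, int tolerableMisses) {
    return heartbeatIntervalMs * tolerableMisses;
  }
}
```

If the strategy needs constructor arguments that differ per marker, caching the resolved `Class` or `Constructor` instead of the instance would be the safer variant.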

...-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java

zhangyue19921010 · 2023-01-10T05:31:06Z

@hudi-bot run azure

yihua

LGTM. The marker APIs can be further improved which can be addressed in a separate PR. @zhangyue19921010 Good job on getting the implementation done!

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java

...-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java

yihua · 2023-01-20T18:13:18Z

I'm doing more thorough tests in CI. Please do not merge this PR now.

yihua · 2023-01-23T06:01:06Z

@hudi-bot run azure

hudi-bot · 2023-01-23T09:09:17Z

CI report:

dbe3db8 UNKNOWN
678cce4 UNKNOWN
6fc5bf1 UNKNOWN
0b74647 UNKNOWN
3369e5e UNKNOWN
1ccecb4 UNKNOWN
6fdf901 UNKNOWN
0a77616 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua · 2023-01-23T17:41:51Z

The Azure CI run with the feature flag turned on by default (#7703) has succeeded. The CI failure of this PR is due to flaky tests. Merging this PR.

Before this PR, Hudi implements OCC (Optimistic Concurrency Control) to detect the write conflict at the pre-commit time to ensure data consistency, integrity, and correctness between multiple writers. OCC detects the conflict at Hudi's file group level, i.e., two concurrent writers updating the same file group are detected as a conflict. Currently, conflict detection is performed before committing metadata and after the data writing is completed. If any conflict is detected, it leads to a waste of cluster resources because computing and writing are finished already.

To solve this problem, this change implements an early conflict detection mechanism to detect the conflict during the data writing phase and abort the writing early if a conflict is detected, using Hudi's marker mechanism. Before writing each data file, the writer creates a corresponding marker to mark that the file is created, so that later on, the writer can use the markers to automatically clean up uncommitted data files in the failure and rollback scenarios. We leverage the markers to identify the conflict at the file group level during writing data.

There are subtle differences in the early conflict detection workflow among different types of markers. For direct markers, the writer lists necessary marker files directly and checks the conflict before creating markers and starting to write the corresponding data file. For the timeline-server-based markers, the writer gets the result of marker conflict detection by contacting the timeline server. The conflict detection is asynchronously and periodically executed at the timeline server so that the write conflicts can be detected as early as possible. Both writers may still write the data files of the same file slice until the conflict is detected in the next round of detection.

Note that the implemented early conflict detection operates within OCC. Any conflict detection outside the scope of OCC is not handled. For example, the current OCC for multiple writers cannot detect the conflict if two concurrent writers perform INSERT operations for the same set of record keys, because the writers write to different file groups. This set of changes does not intend to address this problem.

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
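
As a rough illustration of the direct-marker path described in this commit message, the sketch below checks for a file-group-level conflict before a marker is created. The `Marker` type and its fields are a hypothetical simplification; the real implementation lists marker files on storage and parses their names.

```java
import java.util.List;

final class DirectMarkerConflictSketch {
  /** Hypothetical, simplified marker: which writer (instant) intends to write which file group. */
  static final class Marker {
    final String instantTime;
    final String partitionPath;
    final String fileId;

    Marker(String instantTime, String partitionPath, String fileId) {
      this.instantTime = instantTime;
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }
  }

  /** True if another writer already holds a marker for the same file group. */
  static boolean hasConflict(List<Marker> existingMarkers,
                             String currentInstantTime,
                             String partitionPath,
                             String fileId) {
    return existingMarkers.stream()
        .filter(m -> !m.instantTime.equals(currentInstantTime)) // ignore this writer's own markers
        .anyMatch(m -> m.partitionPath.equals(partitionPath) && m.fileId.equals(fileId));
  }
}
```

Two writers inserting the same record keys into different file groups would pass this check, which matches the limitation stated in the commit message.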

yuezhang added 4 commits July 18, 2022 15:42

need more test

79feeb3

tested

6331439

tested

de69c0e

tested

fcaaf9d

zhangyue19921010 mentioned this pull request Jul 18, 2022

[HUDI-1575] Early Conflict Detection For Multi-writer #6059

Closed

5 tasks

yuezhang added 3 commits July 18, 2022 18:02

fix liences

553fb00

fix config

dbe3db8

add uts

66b7d1b

yanghua assigned zhangyue19921010 Jul 19, 2022

resolve conflict

64819e4

yihua added the priority:blocker, multi-writer, writer-core, and big-needle-movers labels Sep 12, 2022

yihua mentioned this pull request Sep 29, 2022

[HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer #6003

Merged

5 tasks

yuezhang added 4 commits October 19, 2022 14:12

merge from master && resolve conflicts

678cce4

merge from master && resolve conflicts

5842dcf

fix checkstyle

645766d

merge from master

e23ab61

yihua added 2 commits November 11, 2022 11:51

Resolve conflict with master

5d0d05f

Merge branch 'master' into early-conflict-detection-based-on-occ-simp…

465536f

…le-final

yihua reviewed Nov 15, 2022

yuezhang added 4 commits November 18, 2022 18:41

refact abstraction

c6bc22d

refact abstraction

7d8f3bc

address comments

fc5927a

address comments

3bde14b

yihua reviewed Jan 9, 2023

yihua added 2 commits January 8, 2023 17:04

Improve abstraction for lock and transaction manager, rename configs,…

c412635

… and revise config description

Replace checker naming

6bb1974

yihua reviewed Jan 9, 2023

yuezhang added 6 commits January 10, 2023 00:47

address comments

6d19d03

merge from master and resolve conflict

b90ea04

address comments

3f2118a

address comments

be0d5b4

address comments

aad218a

address comments

1b837ec

Address review comments

c34fb52

yihua approved these changes Jan 19, 2023

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java (outdated)

...-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java (outdated)

Fix build

67b3892

This was referenced Jan 19, 2023

[HUDI-1575][DO NOT MERGE] Testing early conflict detection with feature flag enabled by default #7703

Closed

[HUDI-5589] Fix Hudi config inference #7713

Merged

yihua added 4 commits January 19, 2023 19:53

Fix diverging changes from master and nits

a2980b7

Fix config inference

0579f9b

Add and revise javadocs, make strategy class names shorter

2976167

Revise other names

7344fab

yihua added 3 commits January 20, 2023 13:54

Improve async timeline-server-based conflict detection

501e47f

Add validation of conflict detection strategy to be compatible with t…

46e80ae

…he marker type and fallback to default

Merge branch 'master' into early-conflict-detection-based-on-occ-simp…

0a77616

…le-final

yihua merged commit c18d615 into apache:master Jan 23, 2023
