
[#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId #1529

Merged
merged 5 commits on Feb 20, 2024

Conversation

@EnricoMi
Contributor

EnricoMi commented Feb 15, 2024

What changes were proposed in this pull request?

Use map index and task attempt number as the task attempt id in Spark3.

This requires reworking the bit layout of the blockId to maximize bit utilization for Spark3:

// BlockId is a long and consists of partitionId, taskAttemptId and atomicInt;
// their combined length is ATOMIC_INT_MAX_LENGTH + PARTITION_ID_MAX_LENGTH +
// TASK_ATTEMPT_ID_MAX_LENGTH = 63
public static final int PARTITION_ID_MAX_LENGTH = 24;
public static final int TASK_ATTEMPT_ID_MAX_LENGTH = 21;
public static final int ATOMIC_INT_MAX_LENGTH = 18;

Ideally, TASK_ATTEMPT_ID_MAX_LENGTH is set to PARTITION_ID_MAX_LENGTH plus the number of bits required to store the largest task attempt number. The largest task attempt number is maxFailures - 1, or maxFailures if speculative execution is enabled (configured via spark.speculation, disabled by default). maxFailures itself is configured via spark.task.maxFailures and defaults to 4. So by default, two bits are required to store the largest attempt number, and TASK_ATTEMPT_ID_MAX_LENGTH should be set to PARTITION_ID_MAX_LENGTH + 2.

Example (see the packing sketch below):

  • with PARTITION_ID_MAX_LENGTH = 20, Uniffle supports 1,048,576 partitions
  • requiring TASK_ATTEMPT_ID_MAX_LENGTH = 22
  • allowing for ATOMIC_INT_MAX_LENGTH = 21.
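
To make the layout concrete, here is a minimal packing sketch. The name packBlockId is hypothetical; the field order follows the code comment above (partitionId, taskAttemptId, atomicInt), but the actual packing lives in ClientUtils.getBlockId and may differ in detail:

// hypothetical sketch: pack the three components into the 63 usable bits of a
// positive long, using the example layout above (20 + 22 + 21 = 63 bits);
// partitionId occupies the remaining 20 high bits after the two shifts below
static long packBlockId(long partitionId, long taskAttemptId, long atomicInt) {
  final int TASK_ATTEMPT_ID_MAX_LENGTH = 22;
  final int ATOMIC_INT_MAX_LENGTH = 21;
  return (partitionId << (TASK_ATTEMPT_ID_MAX_LENGTH + ATOMIC_INT_MAX_LENGTH))
      | (taskAttemptId << ATOMIC_INT_MAX_LENGTH)
      | atomicInt;
}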

Why are the changes needed?

The map index (map partition id) is limited by the number of partitions of the shuffle. The task attempt number is limited by the maximum number of failures configured via spark.task.maxFailures, which defaults to 4. This gives us an id that is unique per shuffle while not growing arbitrarily large the way context.taskAttemptId does.

Fix: #134

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit and integration tests.

@codecov-commenter

codecov-commenter commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7fbe7c9) 54.15% compared to head (60a6931) 55.14%.
Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1529      +/-   ##
============================================
+ Coverage     54.15%   55.14%   +0.98%     
- Complexity     2803     2804       +1     
============================================
  Files           430      410      -20     
  Lines         24417    22056    -2361     
  Branches       2081     2081              
============================================
- Hits          13224    12163    -1061     
+ Misses        10361     9133    -1228     
+ Partials        832      760      -72     

☔ View full report in Codecov by Sentry.


github-actions bot commented Feb 15, 2024

Test Results

2 438 files  + 9  2 438 suites  +9   4h 43m 12s ⏱️ + 3m 32s
  822 tests + 3    821 ✅ + 3   1 💤 ±0  0 ❌ ±0 
9 736 runs  +23  9 722 ✅ +23  14 💤 ±0  0 ❌ ±0 

Results for commit 60a6931. ± Comparison against base commit d120f4b.

♻️ This comment has been updated with latest results.

@EnricoMi
Contributor Author

EnricoMi commented Feb 16, 2024

Re #1514 (comment):

AttemptNo will waste some bits. If we increase the bits, the bitmap will occupy more memory.

The current config is not optimal:

public static final int PARTITION_ID_MAX_LENGTH = 24;
public static final int TASK_ATTEMPT_ID_MAX_LENGTH = 21;
public static final int ATOMIC_INT_MAX_LENGTH = 18;

PARTITION_ID_MAX_LENGTH = 24 supports 16,777,216 partitions; with an assumed optimal partition size of 200 MB, this would easily support a 3 PB dataset (2^24 × 200 MB ≈ 3 PB). I think that can be reduced a bit.

Further, a TASK_ATTEMPT_ID_MAX_LENGTH that is smaller than PARTITION_ID_MAX_LENGTH does not make sense. A single stage with 2^PARTITION_ID_MAX_LENGTH partitions would create at least as many task attempt ids, which immediately exhausts TASK_ATTEMPT_ID_MAX_LENGTH. So there is room for improvement.

With the improvement in #1529 you would set TASK_ATTEMPT_ID_MAX_LENGTH = PARTITION_ID_MAX_LENGTH + 2 (for the default max failures of 4).

If you would like to support 2 million partitions and a maxFailures of 4, then you would use:

public static final int PARTITION_ID_MAX_LENGTH = 21;
public static final int TASK_ATTEMPT_ID_MAX_LENGTH = 23;
public static final int ATOMIC_INT_MAX_LENGTH = 19;

I think 2 million partitions is quite a lot (it supports 400 TB datasets), and ATOMIC_INT_MAX_LENGTH would even increase with that.

In other words, 2^PARTITION_ID_MAX_LENGTH (2^24) partitions have never been supported; at most 2^TASK_ATTEMPT_ID_MAX_LENGTH (2^21) partitions were. That is still the case, with more room left for the sequence number.
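
For illustration, a hypothetical helper (not part of Uniffle) that derives such a layout from the desired partition count and Spark's maxFailures, following the rule TASK_ATTEMPT_ID_MAX_LENGTH = PARTITION_ID_MAX_LENGTH + attempt bits:

// hypothetical: derive {partitionBits, taskAttemptIdBits, atomicIntBits} for a
// 63-bit block id from the desired partition count and maxFailures
static int[] deriveBlockIdLayout(int maxPartitions, int maxFailures, boolean speculation) {
  int partitionBits = 32 - Integer.numberOfLeadingZeros(maxPartitions - 1);
  int maxAttemptNo = (maxFailures < 1 ? 0 : maxFailures - 1) + (speculation ? 1 : 0);
  int attemptBits = 32 - Integer.numberOfLeadingZeros(maxAttemptNo);
  int taskAttemptIdBits = partitionBits + attemptBits;
  int atomicIntBits = 63 - partitionBits - taskAttemptIdBits;
  return new int[] {partitionBits, taskAttemptIdBits, atomicIntBits};
}

Here deriveBlockIdLayout(1 << 21, 4, false) yields {21, 23, 19}, matching the constants above.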

@EnricoMi
Contributor Author

Re #1514 (comment):

#1529 looks fine; can we add an integration test to cover the case of speculative execution?

The added FailingTasksTest simulates speculative execution quite well because the failing tasks actually write shuffle data but do not register the shuffle result, so they represent the killed slow tasks. The succeeding attempts represent the speculative task.

A speculative execution setup requires workers / executors with different host names. I think that has not been done in Uniffle before, so that would require some significant work with little extra benefit.

@zuston
Member

zuston commented Feb 16, 2024

The added FailingTasksTest simulates speculative execution quite well because the failing tasks actually write shuffle data but do not register the shuffle result, so they represent the killed slow tasks. The succeeding attempts represent the speculative task.

Makes sense.

Further, a TASK_ATTEMPT_ID_MAX_LENGTH that is smaller than PARTITION_ID_MAX_LENGTH does not make sense. A single stage with 2^PARTITION_ID_MAX_LENGTH partitions would create at least as many task attempt ids, which immediately exhausts TASK_ATTEMPT_ID_MAX_LENGTH. So there is room for improvement.

Thanks for your explanation, makes sense to me. cc @jerqi

* practically reach LONG.MAX_VALUE. That would overflow the bits in the block id.
*
* <p>Here we use the map index or task id, appended by the attempt number per task. The map index
* is limited by the number of partitions of a stage. The attempt number per task is limited /
Contributor

The attempt number may be larger than maxFailures.
If we fail 3 times, Spark may trigger a speculative task while the fourth attempt runs. We will have 5 attempts.

Contributor Author

EnricoMi Feb 16, 2024

That is a bit surprising, but looking at the relevant code, maxFailures is not considered when resubmitting a task as speculative:
https://github.com/apache/spark/blob/2abd3a2f445e86337ad94da19f301cb2b8bc232f/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L1226-L1227

Contributor Author

EnricoMi Feb 16, 2024

Good catch, accounted for: 2f5bccd

@VisibleForTesting
protected static long getTaskAttemptId(int mapIndex, int attemptNo, int maxFailures) {
  int maxAttemptNo = maxFailures < 1 ? 0 : maxFailures - 1;
  if (attemptNo > maxAttemptNo) {
Contributor

Maybe it's OK if we have this check, but we should consider the case above.

@jerqi
Contributor

jerqi commented Feb 16, 2024

How does the reader process the new taskAttemptId?

* @return a task attempt id unique for a shuffle stage
*/
@VisibleForTesting
protected static long getTaskAttemptId(int mapIndex, int attemptNo, int maxFailures, boolean speculationEnabled) {
Contributor

Do we need to consider the case that mapIndex exceeds the limit?

Contributor Author

Passing the returned long as taskAttemptId to ClientUtils.getBlockId will raise an error that the taskAttemptId is too large for the block id bit layout.

Method getTaskAttemptId will not overflow a long because both mapIndex and maxFailures are ints. Even with maxFailures = Integer.MAX_VALUE, attemptBits is 31 and the returned value is still a valid positive long.
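
For illustration, a quick check of that worst case (a hypothetical snippet, mirroring the sketch further below):

// worst case: mapIndex = Integer.MAX_VALUE, attemptBits = 31,
// attemptNo = Integer.MAX_VALUE - 1
long worst = ((long) Integer.MAX_VALUE << 31) | (Integer.MAX_VALUE - 1L);
assert worst > 0; // 0x3FFFFFFFFFFFFFFE, a valid positive long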

Contributor Author

If you prefer to fail-fast here, I can add an assertion.

Contributor Author

EnricoMi Feb 16, 2024

Fixed in 7b745ed as it is always good to fail-fast with a meaningful error message.
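
A minimal sketch of such a fail-fast variant, assuming the behavior discussed in this thread; the exact bounds checks and exception type in 2f5bccd and 7b745ed may differ:

// sketch: compose a per-stage-unique task attempt id from map index and
// attempt number, failing fast on an unexpectedly large attempt number
protected static long getTaskAttemptId(
    int mapIndex, int attemptNo, int maxFailures, boolean speculationEnabled) {
  // speculation can add one attempt beyond maxFailures (see the discussion above)
  int maxAttemptNo = (maxFailures < 1 ? 0 : maxFailures - 1) + (speculationEnabled ? 1 : 0);
  if (attemptNo > maxAttemptNo) {
    throw new IllegalArgumentException(
        "Attempt number " + attemptNo + " exceeds expected maximum " + maxAttemptNo);
  }
  // bits needed to store maxAttemptNo (zero bits when it is 0)
  int attemptBits = 32 - Integer.numberOfLeadingZeros(maxAttemptNo);
  return ((long) mapIndex << attemptBits) | attemptNo;
}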

@EnricoMi
Contributor Author

How does the reader process the new taskAttemptId?

The reader uses the taskAttemptIds as-is: they are opaque, just unique long values.

EnricoMi force-pushed the blockid-with-mapindex-attemptno branch from dcd03bc to 3abe7ae on February 16, 2024 15:50
EnricoMi force-pushed the blockid-with-mapindex-attemptno branch from 3abe7ae to 7b745ed on February 16, 2024 16:10
@jerqi
Contributor

jerqi commented Feb 16, 2024

How does the reader process the new taskAttemptId?

The reader uses the taskAttemptIds as-is: they are opaque, just unique long values.

The reader will retrieve the taskAttemptIds which it needs to read from the MapOutputTracker. So they are not opaque.

@EnricoMi
Contributor Author

How does the reader process the new taskAttemptId?

The reader uses the taskAttemptIds as-is: they are opaque, just unique long values.

The reader will retrieve the taskAttemptIds which it needs to read from the MapOutputTracker. So they are not opaque.

The RssShuffleWriter adds the taskAttemptId (now generated by RssShuffleManager.getTaskAttemptId) to the MapOutputTracker (by RssShuffleWriter.stop returning a MapStatus):

final BlockManagerId blockManagerId =
    BlockManagerId.apply(
        appId + "_" + taskId,
        DUMMY_HOST,
        DUMMY_PORT,
        Option.apply(Long.toString(taskAttemptId)));
MapStatus mapStatus = MapStatus.apply(blockManagerId, partitionLengths, taskAttemptId);

The RssShuffleManager retrieves that id from the MapOutputTracker to create the RssShuffleReader, as you pointed out. The ids are used to filter for blocks that come from those task attempts. The ids are compared as-is; they are treated as plain longs, and no inner information is (or has to be) extracted from them. This is what opaque means. Block ids, in contrast, are not opaque, as inner information like the partition id is extracted from them.
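
To illustrate the distinction, a sketch of a reader-side filter; the method name is hypothetical, and the assumption that taskAttemptId sits in the lowest TASK_ATTEMPT_ID_MAX_LENGTH bits of the block id is for illustration only:

// block ids are NOT opaque: a field (taskAttemptId) is extracted from their
// bits; the extracted taskAttemptId IS opaque: it is only compared as a long
static boolean blockBelongsToSucceededAttempt(
    long blockId, java.util.Set<Long> succeededTaskAttemptIds) {
  final int TASK_ATTEMPT_ID_MAX_LENGTH = 21;
  long taskAttemptId = blockId & ((1L << TASK_ATTEMPT_ID_MAX_LENGTH) - 1);
  return succeededTaskAttemptIds.contains(taskAttemptId);
}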

@jerqi
Contributor

jerqi commented Feb 17, 2024

How does the reader process the new taskAttemptId?

The reader uses the taskAttemptIds as-is: they are opaque, just unique long values.

The reader will retrieve the taskAttemptIds which it needs to read from the MapOutputTracker. So they are not opaque.

The RssShuffleWriter adds the taskAttemptId (now generated by RssShuffleManager.getTaskAttemptId) to the MapOutputTracker (by RssShuffleWriter.stop returning a MapStatus):

final BlockManagerId blockManagerId =
    BlockManagerId.apply(
        appId + "_" + taskId,
        DUMMY_HOST,
        DUMMY_PORT,
        Option.apply(Long.toString(taskAttemptId)));
MapStatus mapStatus = MapStatus.apply(blockManagerId, partitionLengths, taskAttemptId);

The RssShuffleManager retrieves that id from the MapOutputTracker to create the RssShuffleReader, as you pointed out. The ids are used to filter for blocks that come from those task attempts. The ids are compared as-is; they are treated as plain longs, and no inner information is (or has to be) extracted from them. This is what opaque means. Block ids, in contrast, are not opaque, as inner information like the partition id is extracted from them.

OK. I got it.

@jerqi
Contributor

jerqi commented Feb 17, 2024

@zuston Do you have another suggestion?

@zuston
Member

zuston commented Feb 17, 2024

It looks like this PR is a compatible change, but I'm not sure whether it will affect the local order mechanism. I will review this carefully next week during working days.

Member

zuston left a comment

LGTM. I have checked the LocalOrderSegmentSplitter, and I think the task id is now composed in a way that still maintains the sequence required by local order.

@zuston
Member

zuston commented Feb 20, 2024

@jerqi PTAL again.

zuston merged commit 44eb4e5 into apache:master on Feb 20, 2024
38 of 40 checks passed
@zuston
Member

zuston commented Feb 20, 2024

Thanks @EnricoMi . This is a great improvement! 🎉

@EnricoMi
Contributor Author

Thank you for incorporating this, it greatly helps us in migrating to Uniffle!

EnricoMi deleted the blockid-with-mapindex-attemptno branch on February 20, 2024 08:59
jerqi pushed a commit that referenced this pull request Feb 29, 2024
…kAttemptId (#1544)

### What changes were proposed in this pull request?
Use map index and task attempt number as the task attempt id in Spark2.

### Why are the changes needed?

This aligns Spark2 taskAttemptId of the blockId with Spark3.

See #1529

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing integration tests.
EnricoMi mentioned this pull request on Jul 2, 2024
Development

Successfully merging this pull request may close these issues.

[Improvement] Support more tasks of the application