[#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId #1529
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@             Coverage Diff              @@
    ##             master    #1529      +/-   ##
    ============================================
    + Coverage     54.15%   55.14%   +0.98%
    - Complexity     2803     2804       +1
    ============================================
      Files           430      410      -20
      Lines         24417    22056    -2361
      Branches       2081     2081
    ============================================
    - Hits          13224    12163    -1061
    + Misses        10361     9133    -1228
    + Partials        832      760      -72

☔ View full report in Codecov by Sentry.
Re #1514 (comment):

The current config is not optimal. With the improvement in #1529, if you would like to support 2M partitions and 4 max failures, you would size the block id bit lengths accordingly. I think 2M partitions (2^21) is quite a lot; it supports 400 TB datasets.
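For illustration, a minimal arithmetic sketch of that bit budget (the variable names and the extra speculative attempt below are my assumptions, not actual Uniffle config keys):

```java
public class BitBudgetSketch {
  public static void main(String[] args) {
    int partitions = 2 * 1024 * 1024; // 2M partitions = 2^21
    int partitionBits = 64 - Long.numberOfLeadingZeros(partitions - 1L); // 21
    int maxFailures = 4;
    // assuming one extra attempt number is reserved for speculative execution
    int maxAttemptNo = maxFailures;
    int attemptBits = 32 - Integer.numberOfLeadingZeros(maxAttemptNo); // 3
    // with #1529, the taskAttemptId combines map index and attempt number
    int taskAttemptIdBits = partitionBits + attemptBits; // 24
    int sequenceBits = 63 - partitionBits - taskAttemptIdBits; // 18 bits remain
    System.out.printf("partition=%d attempt=%d taskAttemptId=%d sequence=%d%n",
        partitionBits, attemptBits, taskAttemptIdBits, sequenceBits);
  }
}
```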
Re #1514 (comment):

A speculative execution setup requires workers / executors with different host names. I think that has not been done in Uniffle before, so it would require some significant work for little extra benefit.
Makes sense.

Thanks for your explanation, it makes sense to me. cc @jerqi
    * practically reach Long.MAX_VALUE. That would overflow the bits in the block id.
    *
    * <p>Here we use the map index or task id, appended by the attempt number per task. The map index
    * is limited by the number of partitions of a stage. The attempt number per task is limited
Attempt number may be larger than maxFailures.

If we fail 3 times, then while the fourth attempt runs, Spark may trigger a speculative task at the same time. We will have 5 attempts.
That is a bit surprising, but looking at the relevant code, maxFailures is not considered when resubmitting a task as speculative:
https://github.com/apache/spark/blob/2abd3a2f445e86337ad94da19f301cb2b8bc232f/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L1226-L1227
Good catch, accounted for: 2f5bccd
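A hedged sketch of what that allowance could look like (my reading of the discussion, not necessarily the actual change in 2f5bccd):

```java
// Reserve one extra attempt number when speculative execution is enabled,
// because Spark may launch a speculative copy of the final allowed attempt.
static int maxAttemptNo(int maxFailures, boolean speculationEnabled) {
  int maxAttemptNo = maxFailures < 1 ? 0 : maxFailures - 1;
  return speculationEnabled ? maxAttemptNo + 1 : maxAttemptNo;
}
```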
    @VisibleForTesting
    protected static long getTaskAttemptId(int mapIndex, int attemptNo, int maxFailures) {
      int maxAttemptNo = maxFailures < 1 ? 0 : maxFailures - 1;
      if (attemptNo > maxAttemptNo) {
Maybe it's OK if we have this check, but we should consider the case above.

How does the reader process the new taskAttemptId?
    * @return a task attempt id unique for a shuffle stage
    */
    @VisibleForTesting
    protected static long getTaskAttemptId(
        int mapIndex, int attemptNo, int maxFailures, boolean speculationEnabled) {
Do we need to consider the case where mapIndex exceeds the limit?
Passing the returned long as taskAttemptId to ClientUtils.getBlockId will raise an error that the taskAttemptId is too large for the block id bit layout.

Method getTaskAttemptId will not overflow the long, because mapIndex and maxFailures are both int. Even with maxFailures = Integer.MAX_VALUE, the value of attemptBits is 31 and the returned value is still a valid positive long.
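That no-overflow argument can be checked mechanically. A minimal sketch, under the assumption that attemptBits is derived from the highest allowed attempt number:

```java
public class OverflowCheck {
  public static void main(String[] args) {
    int maxFailures = Integer.MAX_VALUE;
    int maxAttemptNo = maxFailures - 1;                                // 2147483646
    int attemptBits = 32 - Integer.numberOfLeadingZeros(maxAttemptNo); // 31
    int mapIndex = Integer.MAX_VALUE;                                  // worst case
    // (2^31 - 1) << 31 is about 2^62, still below Long.MAX_VALUE (~2^63)
    long taskAttemptId = ((long) mapIndex << attemptBits) | maxAttemptNo;
    System.out.println(taskAttemptId > 0); // true: no long overflow
  }
}
```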
If you prefer to fail-fast here, I can add an assertion.
Fixed in 7b745ed, as it is always good to fail fast with a meaningful error message.

The reader uses the taskAttemptIds as-is; they are opaque, just unique long values.
Force-pushed from dcd03bc to 3abe7ae.
Force-pushed from 3abe7ae to 7b745ed.
The reader will retrieve the taskAttemptIds which it needs to read from the MapOutputTracker, so they are not opaque.
The relevant reader code is at lines 471 to 477 in d120f4b.
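For illustration, if the taskAttemptId occupies the low bits of the block id (as in the layout discussed in this PR), the reader can recover it with a mask. A minimal sketch, assuming a bit count like TASK_ATTEMPT_ID_MAX_LENGTH from Constants.java:

```java
// Extract the low taskAttemptId bits from a block id (sketch, layout assumed).
static long taskAttemptIdOf(long blockId, int taskAttemptIdBits) {
  return blockId & ((1L << taskAttemptIdBits) - 1);
}
```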
OK. I got it.

@zuston Do you have another suggestion?

It looks like this PR is a compatible change, but I'm not sure whether it will affect the local order mechanism. I will review this carefully next week during working days.
LGTM. I have checked the LocalOrderSegmentSplitter, and I think the task id is now composed in a way that still maintains the sequence required by local order.
@jerqi PTAL again.

Thanks @EnricoMi. This is a great improvement! 🎉

Thank you for incorporating this, it helps us greatly in migrating to Uniffle!
…kAttemptId (#1544)

### What changes were proposed in this pull request?

Use map index and task attempt number as the task attempt id in Spark2.

### Why are the changes needed?

This aligns the Spark2 taskAttemptId of the blockId with Spark3. See #1529

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing integration tests.
What changes were proposed in this pull request?
Use map index and task attempt number as the task attempt id in Spark3.
This requires reworking the bits of the blockId to maximize bit utilization for Spark3:
incubator-uniffle/common/src/main/java/org/apache/uniffle/common/util/Constants.java
Lines 30 to 35 in b924aca
Ideally, TASK_ATTEMPT_ID_MAX_LENGTH is set equal to PARTITION_ID_MAX_LENGTH plus the number of bits required to store the largest task attempt number. The largest task attempt number is maxFailures - 1, or maxFailures if speculative execution is enabled (configured via spark.speculation and disabled by default). The maxFailures is configured via spark.task.maxFailures and defaults to 4. So by default, two bits are required to store the largest attempt number, and TASK_ATTEMPT_ID_MAX_LENGTH should be set to PARTITION_ID_MAX_LENGTH + 2.

Example (see the sketch below):
- PARTITION_ID_MAX_LENGTH = 20, so Uniffle supports 1,048,576 partitions
- TASK_ATTEMPT_ID_MAX_LENGTH = 22
- ATOMIC_INT_MAX_LENGTH = 21
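A minimal sketch of how those three lengths compose a 63-bit block id; the shift order mirrors my understanding of ClientUtils.getBlockId and should be treated as an assumption:

```java
public class BlockIdLayoutSketch {
  static final int PARTITION_ID_MAX_LENGTH = 20;    // 1,048,576 partitions
  static final int TASK_ATTEMPT_ID_MAX_LENGTH = 22;
  static final int ATOMIC_INT_MAX_LENGTH = 21;      // sequence number bits
  // 21 + 20 + 22 = 63 bits, so the block id stays a positive signed long

  static long getBlockId(long sequenceNo, long partitionId, long taskAttemptId) {
    return (sequenceNo << (PARTITION_ID_MAX_LENGTH + TASK_ATTEMPT_ID_MAX_LENGTH))
        | (partitionId << TASK_ATTEMPT_ID_MAX_LENGTH)
        | taskAttemptId;
  }

  public static void main(String[] args) {
    System.out.println(Long.toBinaryString(getBlockId(1, 2, 3)));
  }
}
```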
.Why are the changes needed?
The map index (map partition id) is limited to the number of partitions of a shuffle. The task attempt number is limited by the max number of failures configured via spark.task.maxFailures, which defaults to 4. This provides an id that is unique per shuffle while not growing arbitrarily large as context.taskAttemptId does.

Fix: #134
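To make that concrete, a worked example with the defaults (maxFailures = 4, so 2 attempt bits); the composition below is a sketch of the scheme this PR describes:

```java
public class TaskAttemptIdExample {
  public static void main(String[] args) {
    int attemptBits = 2; // enough for attempt numbers 0..3 (maxFailures = 4)
    int mapIndex = 5;
    int attemptNo = 1;
    // taskAttemptId = map index shifted left by the attempt bits, or-ed with attemptNo
    long taskAttemptId = ((long) mapIndex << attemptBits) | attemptNo;
    System.out.println(taskAttemptId); // 5 * 4 + 1 = 21
  }
}
```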
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit and integration tests.