[FLINK-12763][runtime] Fail job immediately if tasks’ resource needs can not be satisfied. #8740
Conversation
@flinkbot: Automated checks - last check on commit 27f9881 (Tue Aug 06 15:39:08 UTC 2019).
Force-pushed from 0485ac2 to d1756a3.
} else {
	clusterInitTime = Time.milliseconds(configuration.getLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT));
}
Parsing of clusterInitTime occurs in both StandaloneResourceManagerFactory and StandaloneResourceManagerWithUUIDFactory. I think it can be moved to somewhere common, so that here we can get the parsed value directly.
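To illustrate the suggestion, the duplicated parsing could be factored into one shared helper, roughly like this (a self-contained sketch with simplified stand-in types and illustrative config keys, not Flink's actual Configuration API):

```java
import java.util.Map;

// Simplified stand-in for Flink's Configuration, so the sketch is self-contained.
final class MiniConfig {
    private final Map<String, Long> values;
    MiniConfig(Map<String, Long> values) { this.values = values; }
    long getLong(String key, long defaultValue) { return values.getOrDefault(key, defaultValue); }
}

public class ClusterInitTimeParser {
    // Keys and default are illustrative only.
    static final String CLUSTER_INIT_TIME_KEY = "resourcemanager.cluster-init-time";
    static final String SLOT_REQUEST_TIMEOUT_KEY = "slot.request.timeout";

    // The shared parsing logic both factories would call: fall back to the
    // slot request timeout when the cluster init time is not configured.
    static long parseClusterInitTimeMillis(MiniConfig config) {
        long fallback = config.getLong(SLOT_REQUEST_TIMEOUT_KEY, 300_000L);
        return config.getLong(CLUSTER_INIT_TIME_KEY, fallback);
    }

    public static void main(String[] args) {
        MiniConfig explicit = new MiniConfig(Map.of(CLUSTER_INIT_TIME_KEY, 60_000L));
        MiniConfig unset = new MiniConfig(Map.of(SLOT_REQUEST_TIMEOUT_KEY, 10_000L));
        System.out.println(parseClusterInitTimeMillis(explicit)); // 60000
        System.out.println(parseClusterInitTimeMillis(unset));    // 10000
    }
}
```

Both factories would then call the helper and receive the parsed value directly.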
 * and failed. If not configured, {@link JobManagerOptions#SLOT_REQUEST_TIMEOUT} will be used by default.
 */
public static final ConfigOption<Long> CLUSTER_INITIALIZATION_TIME = ConfigOptions
	.key("resourcemanager.cluster-init-time")
I feel that cluster-init-time seems too general. Do you think a name like resourcemanager.standalone.slot-resources-registration-timeout would be better?
@@ -133,7 +139,7 @@
	private final FatalErrorHandler fatalErrorHandler;

	/** The slot manager maintains the available slots. */
-	private final SlotManager slotManager;
+	protected final SlotManager slotManager;
I think it may be better to have a protected setAllowedSlotResourceProfiles method in the ResourceManager than making slotManager protected directly.
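As a rough sketch of that suggestion (self-contained, with simplified stand-in types rather than Flink's actual classes): keep slotManager private and let subclasses go through a protected hook.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-ins; the real classes live in org.apache.flink.runtime.resourcemanager.
class MiniSlotManager {
    private final Set<String> allowedProfiles = new HashSet<>();

    void setAllowedSlotResourceProfiles(Set<String> profiles) {
        allowedProfiles.clear();
        allowedProfiles.addAll(profiles);
    }

    Set<String> getAllowedSlotResourceProfiles() { return new HashSet<>(allowedProfiles); }
}

abstract class MiniResourceManager {
    private final MiniSlotManager slotManager = new MiniSlotManager(); // stays private

    // Protected hook: subclasses can update the allowed profiles without
    // reaching into the SlotManager field directly.
    protected void setAllowedSlotResourceProfiles(Set<String> profiles) {
        slotManager.setAllowedSlotResourceProfiles(profiles);
    }

    MiniSlotManager slotManagerForTest() { return slotManager; }
}

class MiniStandaloneResourceManager extends MiniResourceManager {
    void onTaskManagerRegistered(String profile) {
        setAllowedSlotResourceProfiles(Set.of(profile));
    }
}

public class ProtectedSetterSketch {
    public static void main(String[] args) {
        MiniStandaloneResourceManager rm = new MiniStandaloneResourceManager();
        rm.onTaskManagerRegistered("profile-a");
        System.out.println(rm.slotManagerForTest().getAllowedSlotResourceProfiles()); // [profile-a]
    }
}
```

This keeps the SlotManager's field encapsulated while still letting subclasses drive the one operation they need.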
@@ -123,6 +125,9 @@
	/** Release task executor only when each produced result partition is either consumed or failed. */
	private final boolean waitResultConsumedBeforeRelease;

	/** Resource profiles of slots that can be acquired. */
	private final HashSet<ResourceProfile> allowedSlotResourceProfiles = new HashSet<>();
I think the name is a little ambiguous: it sounds like the resource profiles of the slots that are allowed to be created on the TM side. It may be better to use a name like availableSlotResourceProfiles. The comment and the name of setAllowedSlotResourceProfiles may also need to be modified accordingly.
Preconditions.checkArgument(
	ResourceProfile.UNKNOWN.equals(resourceProfile),
	"The YarnResourceManager does not support custom ResourceProfiles yet. It assumes that all containers have the same resources.");
// start new TM with default resource profile
starts
assertTrue(slotManager.registerSlotRequest(slotRequest1));
try {
	slotManager.registerSlotRequest(slotRequest2);
} catch (Exception e) {
Catching SlotManagerException may be better.
assertTrue(slotManager.registerSlotRequest(slotRequest2));
	}
}
I think we may also need to test the case that a pendingRequest is marked as failed when setAllowedSlotResourceProfiles is called.
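A rough shape of such a test, using simplified stand-in types (the real test would use Flink's SlotManager and ResourceProfile; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal model of the behavior under test: pending requests are failed
// when the allowed profiles are narrowed so they can no longer be matched.
class PendingSlotManager {
    private final List<String> pendingProfiles = new ArrayList<>();
    private final List<String> failedProfiles = new ArrayList<>();

    void registerSlotRequest(String profile) { pendingProfiles.add(profile); }

    void setAllowedSlotResourceProfiles(Set<String> allowed) {
        for (String profile : new ArrayList<>(pendingProfiles)) {
            if (!allowed.contains(profile)) {
                pendingProfiles.remove(profile);
                failedProfiles.add(profile); // the pending request is marked as failed
            }
        }
    }

    List<String> failedRequests() { return failedProfiles; }
}

public class PendingRequestFailTest {
    public static void main(String[] args) {
        PendingSlotManager sm = new PendingSlotManager();
        sm.registerSlotRequest("large");
        sm.setAllowedSlotResourceProfiles(Set.of("small"));
        System.out.println(sm.failedRequests()); // [large]
    }
}
```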
private ResourceID taskExecutorResourceID;

private ResourceID resourceManagerResourceID;
resourceManagerResourceID is not used.
private static final Time CLUSTER_INIT_TIME = Time.seconds(1L);

private static final long HEARTBEAT_TIMEOUT = 5000;
HEARTBEAT_TIMEOUT_IN_MILLS
 */
public class StandaloneResourceManagerTest extends TestLogger {

	private static final Time TIMEOUT = Time.seconds(10L);
TIMEOUT seems to be too general; maybe RPC_TIMEOUT would be better.
private HardwareDescription hardwareDescription = new HardwareDescription(1, 2L, 3L, 4L);

Remove this empty line.
 * limitations under the License.
 */

package org.apache.flink.runtime.resourcemanager.utils;
Why is this test located in the utils package? I think it may belong in org.apache.flink.runtime.resourcemanager.
	slotReport,
	TIMEOUT).get();

Thread.sleep(CLUSTER_INIT_TIME.toMilliseconds());
A slack time interval may be needed to ensure stability.
Force-pushed from d1756a3 to 6af2fdd.
@gaoyunhaii Thanks again for the review. Your comments are addressed.
Force-pushed from 36dfa37 to 0cca74c.
@flinkbot attention @tillrohrmann
Force-pushed from 45d4b69 to 063aee3.
Force-pushed from 8864739 to 09b91bd.
I think that the problem you describe is a more general problem for the standalone resource manager. In standalone mode, it can take a long time until the "not enough resources" exception comes for streaming jobs, and until the "no matching slot" exception comes for batch jobs. So why don't we solve it in a more general way? I like the idea of a "startup period" in which the standalone RM waits for a longer timeout for TMs (and thus slots) to appear; after that period, slot requests are failed immediately if no free slot is readily available. That idea has floated around for a bit, maybe it is time to go for it.

What I don't quite understand is the "mixed solution" in this PR, where the startup period is used to discover what resource profiles are available. After that, requests still time out after a long time unless they request a resource profile that is incompatible with the ones seen during the startup period. I think this may lead to strange behavior:

All this becomes both easier and more consistent with a simple startup period for the StandaloneResourceManager: after that, all requests fail immediately unless a slot is directly available. What do you think?

BTW: This would be a change we need to discuss on the dev/user mailing lists, because it changes system behavior. Probably most users would agree that it is for the better, but nonetheless, we need to be transparent there.
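The startup-period idea could be sketched roughly as follows (self-contained stand-ins; the actual Flink classes, method names, and config options differ):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified SlotManager with a fail-fast switch: during the startup period
// unmatched requests stay pending; afterwards they are failed immediately.
class FailFastSlotManager {
    private volatile boolean failUnfulfillableRequest = false;

    void setFailUnfulfillableRequest(boolean failUnfulfillableRequest) {
        this.failUnfulfillableRequest = failUnfulfillableRequest;
    }

    // Returns true if the request is accepted (matched or kept pending).
    boolean registerSlotRequest(boolean freeSlotAvailable) {
        if (freeSlotAvailable) {
            return true;
        }
        if (failUnfulfillableRequest) {
            throw new IllegalStateException("No slot available; failing fast.");
        }
        return true; // kept pending until the (long) slot request timeout
    }
}

public class StartupPeriodSketch {
    public static void main(String[] args) throws Exception {
        FailFastSlotManager slotManager = new FailFastSlotManager();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        long startupPeriodMillis = 100;
        // Flip the switch once the startup period elapses.
        scheduler.schedule(
            () -> slotManager.setFailUnfulfillableRequest(true),
            startupPeriodMillis, TimeUnit.MILLISECONDS);

        System.out.println(slotManager.registerSlotRequest(false)); // kept pending

        Thread.sleep(500); // well past the startup period
        try {
            slotManager.registerSlotRequest(false);
        } catch (IllegalStateException e) {
            System.out.println("after startup period: " + e.getMessage());
        }
        scheduler.shutdownNow();
    }
}
```

The same request that would sit pending during the startup period fails fast afterwards, which is the behavior change being proposed.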
flink-core/src/main/java/org/apache/flink/configuration/ConfigurationUtils.java (outdated; thread resolved)
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java (outdated; thread resolved)
I think you are right; I can see the problem of the current solution. Just one question about the solution you proposed: what do we do for batch jobs?
Force-pushed from 0452d82 to 2767427.
I think the difference between batch and streaming should not manifest in the ResourceManager. It can manifest in the scheduler, so let's see if we can cover this there. What do you think about this approach:

Long Term Approach

Short Term
Force-pushed from 2767427 to 30ef9c3.
@flinkbot attention @StephanEwen Hi @StephanEwen, I've updated this PR, as well as its description. Please take a look at your convenience. Thank you.
…converted from ResourceSpecs.
Force-pushed from 30ef9c3 to a27bf93.
…n slot offers. The test cases should not use ResourceProfile#UNKNOWN in slot offers.
- ResourceProfile#UNKNOWN is used for slot requests whose resource needs are not specified.
- ResourceProfile#ANY is used for task manager slots which can match any slot request.
These cases haven't been failing because, so far, slot requests always have unknown resource profiles.
…if it is set to do so.
…ker when requested resource profile cannot be satisfied.
…ail unfulfillable requests on started.
…ter that after this period StandaloneResourceManager set SlotManager to fail unfulfillable requests.
Force-pushed from a27bf93 to 27f9881.
Rebased onto master after #8704 was merged.
Looks good overall. And thanks for splitting this into finer-grained commits for the review - it helped a lot.

I think we may have a latent bug in the […]. In […]. Especially because the […].
Thank you for the review, @StephanEwen.
True - UNKNOWN has negative requirements (I confused it with ANY, which has infinite requirements).
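A toy model of that distinction (illustrative only, not Flink's ResourceProfile API): an UNKNOWN request requires nothing, so it matches every slot, while an ANY slot offers unbounded resources, so it satisfies every concrete request.

```java
// Single resource dimension (e.g. CPU) for illustration.
public class ProfileMatchSketch {
    static final double UNKNOWN = -1.0;                  // "negative" requirements
    static final double ANY = Double.POSITIVE_INFINITY;  // "infinite" resources

    static boolean isMatching(double requestedCpu, double slotCpu) {
        if (requestedCpu == UNKNOWN) {
            return true;                // unspecified request matches any slot
        }
        return slotCpu >= requestedCpu; // an ANY slot satisfies every concrete request
    }

    public static void main(String[] args) {
        System.out.println(isMatching(UNKNOWN, 2.0)); // true
        System.out.println(isMatching(4.0, ANY));     // true
        System.out.println(isMatching(4.0, 2.0));     // false
    }
}
```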
Manually merged by @StephanEwen in ea032ae.
What is the purpose of the change

This pull request is based on #8704. It makes the SlotManager immediately fail slot requests with resource profiles that cannot be satisfied.

Brief change log

- Convert the ResourceSpec of a job vertex to the ResourceProfile of its slot request, which means that fine grained resource profiles are actually used for scheduling.
- Add a failUnfulfillableRequest switch and corresponding tests in SlotManager. When switched on, the SlotManager should immediately fail pending and newly arriving slot requests that can be satisfied neither by any registered slot nor by any pending slot that can be allocated.
- Check in startNewWorker, according to the default configuration, whether requests can be satisfied. New workers will not be started if the requests cannot be satisfied.
- Active resource managers turn the failUnfulfillableRequest switch of SlotManager on right after they are started, because they can decide from the configuration whether requested resource profiles can be fulfilled.
- The StandaloneResourceManager turns the failUnfulfillableRequest switch of SlotManager on after a start-up period, during which it expects task managers to be registered. A config option is introduced for this start-up period.

Verifying this change

This change added tests and can be verified as follows:

- … the failUnfulfillableRequest switch is turned on.

Does this pull request potentially affect one of the following parts:

- … @Public(Evolving): (no)

Documentation