[FLINK-8431] [mesos] Allow to specify # GPUs for TaskManager in Mesos #5307

Closed
wants to merge 5 commits into from

Conversation

eastcirclek
Contributor

What is the purpose of the change

This PR introduces a new configuration property named "mesos.resourcemanager.tasks.gpus" that lets users specify the number of GPUs for each TaskManager process in Mesos. The property is necessary because TaskManagers that do not explicitly request GPUs cannot see any GPUs when Mesos agents are configured to isolate GPUs, as shown in [1].

[1] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags

Brief change log

  • Modify MesosTaskManagerParameters instead of ContaineredTaskManagerParameters to confine this problem to Mesos
  • Augment data types: (1) offers from Mesos and (2) task requests of Mesos frameworks
  • Add GPU_RESOURCES to the list of framework capabilities if "mesos.resourcemanager.tasks.gpus" > 0. Otherwise, LaunchCoordinator gets no offers from GPU-equipped agents when the Mesos master is configured to withhold such offers from frameworks without GPU_RESOURCES.
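
The capability rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Capability` enum and the `capabilitiesFor` helper are hypothetical stand-ins for the Mesos framework capability types.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the Mesos framework capability type.
enum Capability { GPU_RESOURCES }

final class FrameworkCapabilities {
    private FrameworkCapabilities() {}

    // Opt in to GPU offers only when the configured GPU count is positive.
    static Set<Capability> capabilitiesFor(int gpusPerTaskManager) {
        Set<Capability> capabilities = EnumSet.noneOf(Capability.class);
        if (gpusPerTaskManager > 0) {
            capabilities.add(Capability.GPU_RESOURCES);
        }
        return capabilities;
    }

    public static void main(String[] args) {
        System.out.println(capabilitiesFor(0)); // []
        System.out.println(capabilitiesFor(2)); // [GPU_RESOURCES]
    }
}
```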

Verifying this change

I tested it by launching a standalone Flink cluster using ./bin/mesos-appmaster.sh. I tested the following scenarios with Mesos configured with --filter_gpu_resources.

  • When mesos.resourcemanager.tasks.gpus is not specified or is set to 0.0
    LaunchCoordinator isn't given any offer because MesosFlinkResourceManager does not enable the GPU_RESOURCES capability in either case.
  • When mesos.resourcemanager.tasks.gpus is smaller than or equal to the available GPUs on a node
    Given offers, LaunchCoordinator aggregates offers of different roles from the same node and passes the aggregated offers to Fenzo, which schedules resources across nodes. When Fenzo reports that scheduling succeeded, LaunchCoordinator allocates resources of different roles to tasks and then populates Protos.TaskInfo with the allocated resources, which is then sent to the Mesos master.
  • When mesos.resourcemanager.tasks.gpus is bigger than the available GPUs on a node
    Given offers, LaunchCoordinator aggregates offers of different roles from the same node and passes the aggregated offers to Fenzo. However, Fenzo notifies LaunchCoordinator that scheduling failed, with the following message:
    AssignmentFailure {resource=Other, asking=3.0, used=0.0, available=2.0, message=gpus}.
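
The last two scenarios come down to summing the GPUs offered on one node across roles and comparing the total with the request. The sketch below illustrates that idea only; `GpuOffer`, `GpuFitCheck`, the host name, and the role names are hypothetical, and neither Fenzo's nor Flink's actual types are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical, simplified offer: one role's share of GPUs on one node.
final class GpuOffer {
    final String hostname;
    final String role;
    final double gpus;

    GpuOffer(String hostname, String role, double gpus) {
        this.hostname = hostname;
        this.role = role;
        this.gpus = gpus;
    }
}

final class GpuFitCheck {
    private GpuFitCheck() {}

    // Sum the GPUs offered on each node, regardless of role.
    static Map<String, Double> gpusPerNode(List<GpuOffer> offers) {
        return offers.stream().collect(
                Collectors.groupingBy(o -> o.hostname, Collectors.summingDouble(o -> o.gpus)));
    }

    // A request fits a node only if the aggregated GPUs cover it.
    static boolean fits(Map<String, Double> gpusPerNode, String hostname, double asking) {
        return gpusPerNode.getOrDefault(hostname, 0.0) >= asking;
    }

    public static void main(String[] args) {
        Map<String, Double> perNode = gpusPerNode(Arrays.asList(
                new GpuOffer("node-1", "*", 1.0),
                new GpuOffer("node-1", "flink", 1.0)));
        System.out.println(fits(perNode, "node-1", 2.0)); // true
        System.out.println(fits(perNode, "node-1", 3.0)); // false
    }
}
```

With two GPUs available in total, a request for 3.0 does not fit, which corresponds to the asking=3.0, available=2.0 failure shown above.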

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes, it includes an upgrade (Fenzo)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: yes (JobManager and TaskManager on Mesos)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@eastcirclek
Contributor Author

@EronWright @tillrohrmann Review this PR please

Contributor

@EronWright EronWright left a comment

This looks great. I will review the changelog of Fenzo as I mentioned and reply back.

@@ -88,7 +88,7 @@ under the License.
 <dependency>
 <groupId>com.netflix.fenzo</groupId>
 <artifactId>fenzo-core</artifactId>
-<version>0.9.3</version>
+<version>0.10.0</version>
Contributor

My number one concern is the impact of updating the Fenzo dependency. You're taking a conservative approach here, which is probably wise. But I also have an aversion to .0 releases. Can we compromise and use 0.10.1? Meanwhile I am reviewing the Fenzo changelog.

Contributor Author

0.10.1 works:

  • It passes all tests under flink-mesos (do you think it is enough to run tests under flink-mesos for the verification of Fenzo upgrade?)
  • I tested the scenario above in my test environment (less than and more than available GPUs)

I'm going to test 1.0.1 after the current CI run finishes.

Contributor Author

I repeated the processes for Fenzo-1.0.1 and it all passes.

@@ -174,8 +183,9 @@ public AssignedResources getAssignedResources() {
 public String toString() {
 return "Request{" +
 "cpus=" + getCPUs() +
-"memory=" + getMemory() +
-'}';
+", memory=" + getMemory() +
Contributor

Good catch on those commas.

@@ -661,6 +661,7 @@ private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID,
 // create the specific TM parameters from the resource profile and some defaults
 MesosTaskManagerParameters params = new MesosTaskManagerParameters(
 resourceProfile.getCpuCores() < 1.0 ? taskManagerParameters.cpus() : resourceProfile.getCpuCores(),
+taskManagerParameters.gpus(),
Contributor

We'll need to add GPU at some point to ResourceProfile. @eastcirclek did you investigate that possibility?

throw new IllegalConfigurationException(MESOS_RM_TASKS_GPUS.key() +
" cannot be negative");
}
if (gpus % 1 != 0) {
Contributor

(minor concern) Is there a possibility that the approximate nature of double will bite us? If the number must be whole, we could parse as an integer.

Contributor Author

We'd better be on the safe side as you said.
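
The integer-parsing option raised in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the code that was merged: `GpuCountParser` and `parseGpus` are hypothetical names, and Flink's own configuration API and exception types are deliberately left out.

```java
final class GpuCountParser {
    private GpuCountParser() {}

    // Parse the configured value as a whole, non-negative number of GPUs.
    static int parseGpus(String configured) {
        int gpus;
        try {
            gpus = Integer.parseInt(configured.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "GPU count must be a whole number: " + configured, e);
        }
        if (gpus < 0) {
            throw new IllegalArgumentException("GPU count cannot be negative: " + gpus);
        }
        return gpus;
    }

    public static void main(String[] args) {
        System.out.println(parseGpus("2")); // 2
        try {
            parseGpus("1.5");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // GPU count must be a whole number: 1.5
        }
    }
}
```

Parsing as an integer rejects fractional input outright, so no `gpus % 1 != 0` check on a double is needed.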


@Override
public Map<String, Double> getScalarValues() {
return aggregatedScalarResourceMap;
Contributor

I notice that we return all scalar resource types (cpus, gpus, ...) here, but in LaunchableMesosWorker::getScalarRequests we return only the generic resource types (gpus). Would you please double-check that this is expected by Fenzo? I wouldn't want Fenzo to double-count the cpus or something.

Contributor Author

You are correct. Including an entry for cpus in Offer::aggregatedScalarResourceMap can cause confusion. We need to return only generic resource types (other than cpus, mem, network, and disk) as we do in LaunchableMesosWorker::getScalarRequests.
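
The filtering described here can be sketched as below. This is a hedged illustration: `GenericScalarFilter` and `genericOnly` are hypothetical names, and the four excluded resource names are taken from the comment above rather than from the Flink or Fenzo sources.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class GenericScalarFilter {
    // Resource names assumed to be handled separately from the generic scalars.
    private static final Set<String> BUILT_IN =
            new HashSet<>(Arrays.asList("cpus", "mem", "network", "disk"));

    private GenericScalarFilter() {}

    // Keep only the generic scalar resources, such as gpus.
    static Map<String, Double> genericOnly(Map<String, Double> allScalars) {
        Map<String, Double> generic = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : allScalars.entrySet()) {
            if (!BUILT_IN.contains(entry.getKey())) {
                generic.put(entry.getKey(), entry.getValue());
            }
        }
        return generic;
    }

    public static void main(String[] args) {
        Map<String, Double> all = new LinkedHashMap<>();
        all.put("cpus", 4.0);
        all.put("mem", 8192.0);
        all.put("gpus", 2.0);
        System.out.println(genericOnly(all)); // {gpus=2.0}
    }
}
```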

e -> e.getKey(),
e -> e.getValue().stream().mapToDouble(r -> r.getScalar().getValue()).sum()
));
this.aggregatedScalarResourceMap = Collections.unmodifiableMap(aggregatedScalarResourceMap);
Contributor

I think this change was motivated by a need to implement the new getScalarValues. It is good that the aggregation still occurs eagerly.

The logic could probably be simplified using Map::merge within the for loop and skipping the creation of scalarResourceMap.
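
A minimal sketch of the Map::merge approach, assuming a simplified `ScalarResource` type in place of the Mesos protobuf resource (both `ScalarResource` and `ScalarAggregation` are hypothetical names):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified scalar resource: a name and a value.
final class ScalarResource {
    final String name;
    final double value;

    ScalarResource(String name, double value) {
        this.name = name;
        this.value = value;
    }
}

final class ScalarAggregation {
    private ScalarAggregation() {}

    // Sum values per resource name in one pass, with no intermediate map of lists.
    static Map<String, Double> aggregate(List<ScalarResource> resources) {
        Map<String, Double> totals = new HashMap<>();
        for (ScalarResource resource : resources) {
            totals.merge(resource.name, resource.value, Double::sum);
        }
        return Collections.unmodifiableMap(totals);
    }

    public static void main(String[] args) {
        Map<String, Double> totals = aggregate(Arrays.asList(
                new ScalarResource("gpus", 1.0),
                new ScalarResource("gpus", 2.0)));
        System.out.println(totals); // {gpus=3.0}
    }
}
```

The aggregation still happens eagerly, and the result is wrapped in an unmodifiable map as in the snippet under review.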

Contributor Author

thanks for the tip on Java 8 👍

Contributor

@EronWright EronWright left a comment

+1

@EronWright
Contributor

@tillrohrmann please merge

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for the contribution @eastcirclek. The changes look good to me.

I've got a general question for my understanding. Is the original problem which we want to solve that Flink does not use agents which have GPU resources or that Flink cannot specify the number of GPUs it requires to run? It looks as if the PR solves the latter but I was wondering whether we shouldn't solve the former problem. If a user runs Flink on a Mesos cluster with mixed agents (some have GPUs, others do not), then it can run either on the set of GPU agents or on the set of non-GPU agents. Wouldn't it also make sense to let Flink run on both sets, or is this not in the scope of this PR?

@@ -238,6 +254,12 @@ public static MesosTaskManagerParameters create(Configuration flinkConfig) {
 cpus = Math.max(containeredParameters.numSlots(), 1.0);
 }

+double gpus = Math.floor(flinkConfig.getDouble(MESOS_RM_TASKS_GPUS, 0.0));
Contributor

Are we rounding down because GPUs cannot be shared? If this is the case, why don't we restrict the Flink configuration value to be an integer?

Contributor

My feedback was similar: we could simply parse as an integer. The value must be an integer due to a limitation in Mesos. But the present solution seemed OK.

Contributor Author

@tillrohrmann I took possible changes in the future into consideration: what if Mesos accepts a float for # GPUs? If both of you @tillrohrmann and @EronWright think that taking an integer seems better, I'm going to follow your opinion 😃

Contributor

From the user perspective I think it's clearer to make it an integer because then we don't confuse users who haven't read the code and think that they can configure a fraction of a GPU based on its type. Once Mesos accepts floats, we can change it in Flink as well. I will apply the change while merging the PR.

@EronWright
Contributor

@tillrohrmann regarding your general question, you are right that Flink could, in concept, deploy to GPU hosts even if Flink doesn't require any GPUs. But we should keep in mind the intent of GPU_RESOURCES, which is to reserve GPU hosts for frameworks that need GPUs with an opt-in mechanism.

We could adjust the logic in this PR so that the gpus configuration option has no default value. If any value is configured (including 0.0), add the GPU_RESOURCES capability. This would allow the user to make use of GPU hosts even if not requiring any GPU resources, but it feels a bit exploitative and I would prefer we not do this. WDYT?
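
To make the difference concrete, the sketch below contrasts the rule in this PR with the alternative floated here. Both methods are hypothetical illustrations (`GpuOptIn` is not a Flink class), with an `Optional` standing in for a configuration option that has no default.

```java
import java.util.Optional;

final class GpuOptIn {
    private GpuOptIn() {}

    // Rule in this PR: opt in to GPU offers only when GPUs are actually requested.
    static boolean optInWhenPositive(int gpus) {
        return gpus > 0;
    }

    // Alternative discussed: no default value; any configured value, even 0, opts in.
    static boolean optInWhenConfigured(Optional<Integer> configuredGpus) {
        return configuredGpus.isPresent();
    }

    public static void main(String[] args) {
        System.out.println(optInWhenPositive(0));                  // false
        System.out.println(optInWhenConfigured(Optional.of(0)));   // true
        System.out.println(optInWhenConfigured(Optional.empty())); // false
    }
}
```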

@eastcirclek
Contributor Author

@tillrohrmann

As you pointed out, the discussion we had on the mailing list was about JM not starting TMs on GPU-equipped agents. It turned out that a Mesos framework needs to specify the GPU_RESOURCES capability if it wants to get resource offers that contain GPUs [link]. I managed to start TMs on the GPU-equipped agents by specifying the master flag --filter_gpu_resources when starting the Mesos master. MESOS-7576 introduces --filter_gpu_resources and, when the flag is set to false, Mesos frameworks that do not have the GPU_RESOURCES capability can receive offers that contain GPUs from the Mesos master. So that problem could be solved without modifying Flink.

The reason I created FLINK-8431, which allows specifying the number of GPUs, is that TMs cannot see GPUs unless they request them explicitly when GPUs are isolated, as shown in link.

Regarding your question,

Is the original problem which we want to solve that Flink does not use agents which have GPU resources or that Flink cannot specify the number of GPUs it requires to run? It looks as if the PR solves the latter ...

Yes, the scope of FLINK-8431 and this PR is confined to the latter.

but I was wondering whether we shouldn't solve the former problem.

I don't think we need to take care of the former anymore because GPU_RESOURCES is going to be deprecated in favor of the reservation mechanism, as shown in link and MESOS-7576. Thus, we no longer need to split servers into two categories (CPU-only servers and GPU-equipped servers). Nevertheless, we need to specify GPU_RESOURCES until it is completely deprecated in Mesos-2.x. To this end, I add the GPU_RESOURCES capability if the number of GPUs is larger than 0.

For those whose JM does not get offers that contain GPUs, I suggest restarting the Mesos master with --filter_gpu_resources set to false, as explained above.

@tillrohrmann
Contributor

Thanks for the clarification @EronWright and @eastcirclek. I'll merge the PR then. Thanks for your work :-)

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Jan 30, 2018
[FLINK-8431] Upgrade Fenzo dependency to 0.10.1

[FLINK-8431] Simplify scalar aggregation

[FLINK-8431] Offer::getScalarValues need not contain entries for cpus, mem, disk, and network

[FLINK-8431] Floor # gpus to make sure whole numbers

This closes apache#5307.
@asfgit asfgit closed this in b4e90fe Jan 30, 2018
@eastcirclek eastcirclek deleted the FLINK-8431 branch February 18, 2018 08:39