RayResourceManager and ParallelLocalFoldFittingStrategy #3054
Conversation
Added comments
@@ -1,4 +1,5 @@
 import multiprocessing
+import logging
Can we move the `try_import` logic to `common` to avoid the circular dependency?
Can be a follow-up PR
Yeah, I can do this in a follow-up PR.
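The follow-up refactor discussed above could reuse a generic `try_import` helper living in `autogluon.common`, so that `autogluon.core` never needs to import back from itself. A minimal sketch of that pattern (the function names and error message are illustrative, not the actual autogluon API):

```python
# Sketch of a try_import helper that could live in a shared "common"
# package to break the circular dependency. Hypothetical names, not
# the actual autogluon.common API.
import importlib


def try_import(module_name, pip_name=None):
    """Import a module, raising an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        pip_name = pip_name or module_name
        raise ImportError(
            f"{module_name} is required but not installed. "
            f"Install it via `pip install {pip_name}`."
        )


def try_import_ray():
    # Thin wrapper so call sites stay short and the pip hint stays correct.
    return try_import("ray", pip_name="ray")
```

Because `importlib` is stdlib-only, this helper has no dependencies of its own and can safely sit at the bottom of the package hierarchy.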
class RayResourceManager:
    """Manager that fetches ray cluster resources info. This class should only be used within a ray cluster."""

    @staticmethod
    def _init_ray():
        import ray
        if not ray.is_initialized():
            ray.init(
                address="auto",  # Force ray to connect to an existing cluster. There should be one. Otherwise, something went wrong
                log_to_driver=False,
                logging_level=logging.ERROR
            )

    @staticmethod
    def _get_cluster_sources(key, default_val=0):
        import ray
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
    def get_cpu_count():
        return RayResourceManager._get_cluster_sources("CPU")

    @staticmethod
    def get_gpu_count_all():
        return RayResourceManager._get_cluster_sources("GPU")
Do we want to add some kind of sanity-check assertion method, such as `assert_cluster_exists()`, that is called at the appropriate time?
This is already done by `ray.init(address="auto")`. This line will fail if there's no existing cluster.
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
add type hints + consider allowing to return physical vs virtual cores.
Added type hints. Ray clusters only work with virtual cores.
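For context on the physical-vs-virtual distinction raised above: the stdlib only exposes logical (virtual) core counts, which is also what Ray schedules against; physical core counts would need a third-party library such as `psutil`. A small stdlib-only illustration:

```python
import multiprocessing
import os

# Both calls report logical (virtual) cores, i.e. hyperthreads included.
# This is the same notion of "CPU" that Ray's cluster_resources() uses.
logical_cores = multiprocessing.cpu_count()
assert logical_cores == os.cpu_count()

# Physical core counts are not available in the stdlib; psutil offers
# psutil.cpu_count(logical=False) if that distinction is ever needed.
print(f"Logical cores visible to the scheduler: {logical_cores}")
```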
    def _sync_model_artifact(self, local_path, model_sync_path):
        pass
Docstring + type hints. Do we even want to allow calling this in the scenarios where it doesn't do anything?
This mainly serves as a general interface so that subclasses can provide their own implementation. In the case where no syncing is needed, I think it makes sense to just leave an empty implementation there.
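The no-op-base / overriding-subclass pattern described in this thread could be sketched as follows (class names and the file-copy destination are hypothetical; the real implementation syncs to S3):

```python
import shutil
from pathlib import Path


class FoldFittingStrategy:
    """Base strategy: purely local fitting needs no model syncing,
    so the default implementation is intentionally a no-op."""

    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        pass


class ParallelFoldFittingStrategy(FoldFittingStrategy):
    """Distributed strategy: copies artifacts to a shared location so
    worker and head nodes can exchange them. Illustrative only: copies
    to a local path rather than an S3 bucket."""

    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        # Ensure the destination directory exists, then copy the artifact.
        Path(model_sync_path).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_path, model_sync_path)
```

The empty base implementation lets callers invoke `_sync_model_artifact` unconditionally, keeping the fitting loop free of `isinstance` checks.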
LGTM, great work!
Issue #, if available:

Description of changes:

- Added `RayResourceManager` and `ParallelLocalFoldFittingStrategy`.
- `RayResourceManager` will report the total resources of the cluster instead of a single machine's hardware. The `try_import` function cannot be used here because it's defined within `autogluon.core`, which has a dependency on `autogluon.common`. Using it would result in a circular dependency. Here, we only instantiate `RayResourceManager` when the env var `AG_DISTRIBUTED_MODE` is set. This variable is supposed to be set by the `cloud` module.
- `ParallelLocalFoldFittingStrategy` will have different ray init args and requires an S3 bucket for model synchronization between head and worker nodes. Currently the S3 bucket is defined in the env var `AG_MODEL_SYNC_PATH`. This interface is subject to change.

Attached is a screenshot of AG being run on a cluster of 2 m5.2xlarge machines. Notice how 2 node_ids are present for the tasks and each task is utilizing 2 CPUs, while a single m5.2xlarge only has 8 CPUs. `predictor.predict` also works after the `fit`, meaning model artifacts have been saved correctly.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.