
RayResourceManager and ParallelLocalFoldFittingStrategy #3054

Merged
15 commits merged into autogluon:master on Mar 27, 2023

Conversation

yinweisu
Collaborator

Issue #, if available:

Description of changes:

  • This PR enables distributed training in AutoGluon Tabular by implementing RayResourceManager and ParallelLocalFoldFittingStrategy.
  • RayResourceManager reports the total resources of the cluster instead of a single machine.
    • This functionality requires importing ray. The try_import function cannot be used here because it is defined within autogluon.core, which depends on autogluon.common; importing it from common would create a circular dependency. Instead, we only instantiate RayResourceManager when the env var AG_DISTRIBUTED_MODE is set. This variable is expected to be set by the cloud module.
  • ParallelLocalFoldFittingStrategy uses different ray init args and requires an S3 bucket for model synchronization between the head and worker nodes. Currently the S3 bucket is defined in the env var AG_MODEL_SYNC_PATH; this interface is subject to change.
  • Adding unit tests is hard because the functionality requires a ray cluster to work properly. The cloud module can test it on its side.
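The env-var gating described above can be sketched as follows. This is a minimal illustration, not AutoGluon's actual code: `pick_resource_manager` and the two stand-in classes are hypothetical names.

```python
import os

class LocalResourceManager:
    """Stand-in for the default, single-machine resource manager."""

class RayResourceManager:
    """Stand-in for the cluster-aware manager introduced in this PR."""

def pick_resource_manager(env=None):
    # RayResourceManager is only selected when the cloud module has set
    # AG_DISTRIBUTED_MODE; otherwise ray is never imported, which is how
    # the PR sidesteps the try_import circular-dependency problem.
    env = os.environ if env is None else env
    if env.get("AG_DISTRIBUTED_MODE"):
        return RayResourceManager
    return LocalResourceManager
```

Because the ray import would live inside the distributed branch, a plain local run never touches ray at all.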

Attached is a screenshot of AutoGluon running on a cluster of two m5.2xlarge machines. Note how two node_ids are present for the tasks and each task uses 2 CPUs, while a single m5.2xlarge has only 8 vCPUs.
[screenshot: Ray dashboard showing tasks scheduled across both nodes]

predictor.predict also works after the fit, meaning the model artifacts were saved correctly:

Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1 ...
parallel_distributed
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        0.85     = Validation score   (accuracy)
        0.45s    = Training   runtime
        0.04s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.85     = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 3.73s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230316_203258/")
6118      <=50K
23204     <=50K
29590     <=50K
18116     <=50K
33964      >50K
          ...  
29128     <=50K
23950     <=50K
13700      >50K
35248     <=50K
24772     <=50K

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-3054-eb513d4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3054/eb513d4/index.html


Contributor

@Innixma Innixma left a comment


Added comments

@@ -1,4 +1,5 @@
import multiprocessing
import logging
Contributor


Can we move try_import logic to common to avoid the circular dependency?

Contributor


Can be a follow-up PR

Collaborator Author


Yeah, I can do this in a follow-up PR.

Comment on lines 119 to 144
class RayResourceManager:
    """Manager that fetches ray cluster resources info. This class should only be used within a ray cluster."""

    @staticmethod
    def _init_ray():
        import ray
        if not ray.is_initialized():
            ray.init(
                address="auto",  # Force ray to connect to an existing cluster. There should be one; otherwise, something went wrong.
                log_to_driver=False,
                logging_level=logging.ERROR,
            )

    @staticmethod
    def _get_cluster_sources(key, default_val=0):
        import ray
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
    def get_cpu_count():
        return RayResourceManager._get_cluster_sources("CPU")

    @staticmethod
    def get_gpu_count_all():
        return RayResourceManager._get_cluster_sources("GPU")
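Under the hood, _get_cluster_sources is just a dictionary lookup over ray.cluster_resources(), which returns cluster-wide totals. A self-contained sketch with the ray call stubbed out as a plain dict (get_cluster_resource is an illustrative name, not AutoGluon's):

```python
def get_cluster_resource(cluster_resources, key, default_val=0):
    # ray.cluster_resources() aggregates across ALL nodes: two m5.2xlarge
    # machines (8 vCPUs each) would yield {"CPU": 16.0, ...}, so the
    # manager reports 16 CPUs rather than one machine's 8.
    return cluster_resources.get(key, default_val)
```

For example, `get_cluster_resource({"CPU": 16.0}, "CPU")` returns 16.0, and a missing key like "GPU" falls back to the default of 0.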
Contributor


Do we want to add some kind of sanity check assertion method such as assert_cluster_exists() that is called at the appropriate time?

Collaborator Author


This is already done by ray.init(address="auto"); that call will fail if there is no existing cluster.
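If an explicit assert_cluster_exists() were still wanted, it could be a thin wrapper that converts the connection failure into a clearer error. This is a hedged sketch of the reviewer's suggestion, not part of the merged PR; the ray_init parameter exists only so the behavior can be exercised without a real cluster:

```python
import logging

def assert_cluster_exists(ray_init=None):
    """Fail fast with a descriptive error if no ray cluster is reachable."""
    if ray_init is None:
        import ray
        def ray_init():
            # address="auto" refuses to start a new local cluster and
            # raises ConnectionError when no running cluster is found.
            ray.init(address="auto", log_to_driver=False,
                     logging_level=logging.ERROR)
    try:
        ray_init()
    except ConnectionError as e:
        raise RuntimeError(
            "No running ray cluster found; RayResourceManager must be "
            "used inside a cluster (e.g. one started by the cloud module)"
        ) from e
```

The net effect is the same as the existing _init_ray, just with a more descriptive error message.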

common/src/autogluon/common/utils/resource_utils.py (resolved)
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
Contributor


Add type hints + consider allowing the caller to choose physical vs. virtual cores.

Collaborator Author


Added type hints. Ray clusters only work with virtual cores.

Comment on lines 737 to 767
    def _sync_model_artifact(self, local_path, model_sync_path):
        pass
Contributor


Docstring + type hints. Do we even want to allow calling this in the scenarios where it doesn't do anything?

Collaborator Author


This mainly serves as a general interface so that subclasses can provide their own implementation. In the case where no syncing is needed, I think it makes sense to just leave an empty implementation there.
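The no-op-hook pattern being described can be sketched as follows. These are stand-in classes for illustration; the real strategies live in autogluon.core and take many more arguments.

```python
class LocalFoldFittingStrategy:
    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        # Single-node fitting needs no synchronization, so the base hook
        # is intentionally an empty implementation.
        pass

class ParallelDistributedFoldFittingStrategy(LocalFoldFittingStrategy):
    def __init__(self, upload_fn):
        # upload_fn stands in for an S3 upload, e.g. to the bucket named
        # by AG_MODEL_SYNC_PATH.
        self.upload_fn = upload_fn

    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        # The distributed strategy overrides the hook to push fold-model
        # artifacts from worker nodes to shared storage.
        self.upload_fn(local_path, model_sync_path)
```

Callers can invoke the hook unconditionally after each fold fit; only the distributed subclass actually does work.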


Contributor

@Innixma Innixma left a comment


LGTM, great work!

@yinweisu yinweisu merged commit bdf0c66 into autogluon:master Mar 27, 2023