
RayResourceManager and ParallelLocalFoldFittingStrategy #3054

Merged
15 commits merged into autogluon:master on Mar 27, 2023

Conversation

yinweisu
Collaborator

Issue #, if available:

Description of changes:

  • This PR enables distributed training in AutoGluon Tabular by implementing RayResourceManager and ParallelLocalFoldFittingStrategy.
  • RayResourceManager reports the total resources of the cluster instead of a single machine.
    • This functionality requires importing ray. The try_import function cannot be used here because it is defined within autogluon.core, which depends on autogluon.common; importing it from common would create a circular dependency. Instead, we only instantiate RayResourceManager when the env var AG_DISTRIBUTED_MODE is set. This variable is expected to be set by the cloud module.
  • ParallelLocalFoldFittingStrategy uses different ray init args and requires an S3 bucket for model synchronization between the head and worker nodes. Currently the S3 bucket is defined in the env var AG_MODEL_SYNC_PATH; this interface is subject to change.
  • Adding unit tests is hard because the functionality requires a ray cluster to work properly. The cloud module can test it on its side.
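The env-var gating described above can be sketched as follows. This is a minimal illustration, not AutoGluon's actual code: `pick_resource_manager` and the two stand-in classes are hypothetical names.

```python
import os

class LocalResourceManager:
    """Stand-in for the default, single-machine resource manager."""

class RayResourceManager:
    """Stand-in for the cluster-aware manager introduced in this PR."""

def pick_resource_manager(env=None):
    # RayResourceManager is only selected when the cloud module has set
    # AG_DISTRIBUTED_MODE; otherwise ray is never imported, which is how
    # the PR sidesteps the try_import circular-dependency problem.
    env = os.environ if env is None else env
    if env.get("AG_DISTRIBUTED_MODE"):
        return RayResourceManager
    return LocalResourceManager
```

Because the ray import would live inside the distributed branch, a plain local run never touches ray at all.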

Attached is a screenshot of AutoGluon running on a cluster of two m5.2xlarge machines. Note how two node_ids are present for the tasks and each task uses 2 CPUs, while a single m5.2xlarge has only 8 vCPUs.
[screenshot: Ray dashboard showing tasks scheduled across both nodes]

predictor.predict also works after the fit, meaning the model artifacts were saved correctly:

Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1 ...
parallel_distributed
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        0.85     = Validation score   (accuracy)
        0.45s    = Training   runtime
        0.04s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.85     = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 3.73s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230316_203258/")
6118      <=50K
23204     <=50K
29590     <=50K
18116     <=50K
33964      >50K
          ...  
29128     <=50K
23950     <=50K
13700      >50K
35248     <=50K
24772     <=50K

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-3054-eb513d4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3054/eb513d4/index.html


Contributor

@Innixma Innixma left a comment


Added comments

@@ -1,4 +1,5 @@
import multiprocessing
import logging
Contributor


Can we move try_import logic to common to avoid the circular dependency?

Contributor


Can be a follow-up PR

Collaborator Author


Yeah, I can do this in a follow-up PR.

Comment on lines 119 to 144
class RayResourceManager:
    """Manager that fetches ray cluster resources info. This class should only be used within a ray cluster."""

    @staticmethod
    def _init_ray():
        import ray
        if not ray.is_initialized():
            ray.init(
                address="auto",  # Force ray to connect to an existing cluster. There should be one; otherwise, something went wrong.
                log_to_driver=False,
                logging_level=logging.ERROR,
            )

    @staticmethod
    def _get_cluster_sources(key, default_val=0):
        import ray
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
    def get_cpu_count():
        return RayResourceManager._get_cluster_sources("CPU")

    @staticmethod
    def get_gpu_count_all():
        return RayResourceManager._get_cluster_sources("GPU")
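Under the hood, _get_cluster_sources is just a dictionary lookup over ray.cluster_resources(), which returns cluster-wide totals. A self-contained sketch with the ray call stubbed out as a plain dict (get_cluster_resource is an illustrative name, not AutoGluon's):

```python
def get_cluster_resource(cluster_resources, key, default_val=0):
    # ray.cluster_resources() aggregates across ALL nodes: two m5.2xlarge
    # machines (8 vCPUs each) would yield {"CPU": 16.0, ...}, so the
    # manager reports 16 CPUs rather than one machine's 8.
    return cluster_resources.get(key, default_val)
```

For example, `get_cluster_resource({"CPU": 16.0}, "CPU")` returns 16.0, and a missing key like "GPU" falls back to the default of 0.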
Contributor


Do we want to add some kind of sanity check assertion method such as assert_cluster_exists() that is called at the appropriate time?

Collaborator Author


This is already done by ray.init(address="auto"); that call will fail if there is no existing cluster.
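If an explicit assert_cluster_exists() were still wanted, it could be a thin wrapper that converts the connection failure into a clearer error. This is a hedged sketch of the reviewer's suggestion, not part of the merged PR; the ray_init parameter exists only so the behavior can be exercised without a real cluster:

```python
import logging

def assert_cluster_exists(ray_init=None):
    """Fail fast with a descriptive error if no ray cluster is reachable."""
    if ray_init is None:
        import ray
        def ray_init():
            # address="auto" refuses to start a new local cluster and
            # raises ConnectionError when no running cluster is found.
            ray.init(address="auto", log_to_driver=False,
                     logging_level=logging.ERROR)
    try:
        ray_init()
    except ConnectionError as e:
        raise RuntimeError(
            "No running ray cluster found; RayResourceManager must be "
            "used inside a cluster (e.g. one started by the cloud module)"
        ) from e
```

The net effect is the same as the existing _init_ray, just with a more descriptive error message.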

common/src/autogluon/common/utils/resource_utils.py (resolved)
        RayResourceManager._init_ray()
        return ray.cluster_resources().get(key, default_val)

    @staticmethod
Contributor


Add type hints + consider allowing the caller to choose physical vs. virtual cores.

Collaborator Author


Added type hints. Ray clusters only work with virtual cores.

Comment on lines 737 to 767
    def _sync_model_artifact(self, local_path, model_sync_path):
        pass
Contributor


Docstring + type hints. Do we even want to allow calling this in the scenarios where it doesn't do anything?

Collaborator Author


This mainly serves as a general interface so that subclasses can provide their own implementation. In the case where no syncing is needed, I think it makes sense to just leave an empty implementation there.
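The no-op-hook pattern being described can be sketched as follows. These are stand-in classes for illustration; the real strategies live in autogluon.core and take many more arguments.

```python
class LocalFoldFittingStrategy:
    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        # Single-node fitting needs no synchronization, so the base hook
        # is intentionally an empty implementation.
        pass

class ParallelDistributedFoldFittingStrategy(LocalFoldFittingStrategy):
    def __init__(self, upload_fn):
        # upload_fn stands in for an S3 upload, e.g. to the bucket named
        # by AG_MODEL_SYNC_PATH.
        self.upload_fn = upload_fn

    def _sync_model_artifact(self, local_path: str, model_sync_path: str) -> None:
        # The distributed strategy overrides the hook to push fold-model
        # artifacts from worker nodes to shared storage.
        self.upload_fn(local_path, model_sync_path)
```

Callers can invoke the hook unconditionally after each fold fit; only the distributed subclass actually does work.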


Contributor

@Innixma Innixma left a comment


LGTM, great work!

@yinweisu yinweisu merged commit bdf0c66 into autogluon:master Mar 27, 2023