Maximum resources #3142

Merged: 5 commits, Apr 14, 2023

Conversation

yinweisu
Collaborator

Issue #, if available:
Training of torch models is slowed down when virtual cores are used.

Description of changes:

  • Add maximum resources check when in distributed mode
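The diff itself is not reproduced in this conversation, so the following is only a rough sketch of what a maximum-resources check could look like, not the code merged here. The helper name `cap_to_maximum` and the resource keys are hypothetical, and the core counts are assumed figures for an m5.24xlarge (96 vCPUs backed by 48 physical cores).

```python
from typing import Dict, Union

Number = Union[int, float]


def cap_to_maximum(requested: Dict[str, Number], maximum: Dict[str, Number]) -> Dict[str, Number]:
    """Clamp each requested resource to a per-node maximum (hypothetical helper).

    Resources that have no entry in `maximum` are passed through unchanged.
    """
    capped: Dict[str, Number] = {}
    for name, amount in requested.items():
        limit = maximum.get(name)
        capped[name] = amount if limit is None else min(amount, limit)
    return capped


# Assumed figures for illustration only: request every vCPU,
# but cap the CPU count at the number of physical cores.
requested = {"num_cpus": 96, "num_gpus": 0}
maximum = {"num_cpus": 48}
print(cap_to_maximum(requested, maximum))
```

Capping at the physical core count keeps torch from being scheduled onto hyperthreads, which is the slowdown described above.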

Example run output with a newly launched cluster of 8 m5.24xlarge machines:
The training time matches a local run and appears to be normal now.

Fitting 1 L1 models ...
Fitting model: NeuralNetFastAI_BAG_L1 ...
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        0.9224   = Validation score   (accuracy)
        286.06s  = Training   runtime
        3.44s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.9224   = Validation score   (accuracy)
        0.08s    = Training   runtime
        0.04s    = Validation runtime
AutoGluon training complete, total runtime = 298.45s ... Best model: "WeightedEnsemble_L2"

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@yinweisu yinweisu requested a review from Innixma April 14, 2023 17:56
@github-actions

Job PR-3142-7a800f8 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3142/7a800f8/index.html

total_resources: Optional[Dict[str, Union[int, float]]] = None,
parallel_hpo: bool = False,
**kwargs
):

add docstring explaining this, add return type

return kwargs

def _preprocess_fit_resources(self, silent=False, total_resources=None, parallel_hpo=False, **kwargs):

add return type

@Innixma Innixma left a comment

LGTM! Added a few minor comments

return kwargs

def _preprocess_fit_resources(self, silent=False, total_resources=None, parallel_hpo=False, **kwargs):

add type hints
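For reference, a sketch of what the requested annotations might look like. The `total_resources` type is borrowed from the signature quoted earlier in this review; the `Dict[str, Any]` return type is an assumption based on the `return kwargs` context line, and `ModelSketch` is a stand-in class, not the class touched by this PR.

```python
from typing import Any, Dict, Optional, Union


class ModelSketch:
    """Stand-in class for illustration only."""

    def _preprocess_fit_resources(
        self,
        silent: bool = False,
        total_resources: Optional[Dict[str, Union[int, float]]] = None,
        parallel_hpo: bool = False,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Return the fit kwargs after resource preprocessing.

        Placeholder body: it only forwards `kwargs` and does not
        reproduce the real resource logic.
        """
        return kwargs
```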

@github-actions

Job PR-3142-bd6372d is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3142/bd6372d/index.html

@yinweisu
Collaborator Author

Merging, as unit tests for the previous commits have passed and the most recent commit only added comments.

@yinweisu yinweisu merged commit c651497 into autogluon:master Apr 14, 2023
15 checks passed
@github-actions

Job PR-3142-f76a039 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3142/f76a039/index.html

@github-actions

Job PR-3142-178cf94 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3142/178cf94/index.html

@github-actions

Job PR-3142-d22642f is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3142/d22642f/index.html
