
Tabular Distributed HPO, and Distributed HPO + bagging #3179

Merged: 6 commits into autogluon:master, Apr 27, 2023

Conversation

yinweisu
Collaborator

Issue #, if available:

Description of changes:

  • Enable distributed HPO and Distributed HPO + bagging for Tabular

Log from an example run with 8 trials, each a bagged model consisting of 8 folds, training in parallel on a cluster of 8 m5.2xlarge machines.

Fitting 1 L1 models ...
Hyperparameter tuning model: NeuralNetFastAI_BAG_L1 ...
== Status ==
Current time: 2023-04-26 21:33:53 (running for 00:00:02.87)
Memory usage on this node: 1.6/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 8.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 1/8 (1 RUNNING)


== Status ==
Current time: 2023-04-26 21:33:58 (running for 00:00:08.32)
Memory usage on this node: 1.6/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 40.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 5/8 (5 RUNNING)


== Status ==
Current time: 2023-04-26 21:34:04 (running for 00:00:14.49)
Memory usage on this node: 1.8/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 48.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: 164e3c3a with validation_performance=0.842 and parameters={'layers': None, 'emb_drop': 0.1, 'ps': 0.1, 'bs': 256, 'lr': 0.01, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0005, 'activation': 'relu', 'dropout_prob': 0.1}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (1 PENDING, 6 RUNNING, 1 TERMINATED)



== Status ==
Current time: 2023-04-26 21:34:10 (running for 00:00:19.82)
Memory usage on this node: 1.8/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 40.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: f1c1c24e with validation_performance=0.86 and parameters={'layers': (1000, 500), 'emb_drop': 0.15201944925480249, 'ps': 0.4187612771721835, 'bs': 256, 'lr': 0.029533957578207624, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0021074502225949952, 'activation': 'softrelu', 'dropout_prob': 0.06302284671855163}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (5 RUNNING, 3 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:15 (running for 00:00:25.26)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 24.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: f1c1c24e with validation_performance=0.86 and parameters={'layers': (1000, 500), 'emb_drop': 0.15201944925480249, 'ps': 0.4187612771721835, 'bs': 256, 'lr': 0.029533957578207624, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0021074502225949952, 'activation': 'softrelu', 'dropout_prob': 0.06302284671855163}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (3 RUNNING, 5 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:21 (running for 00:00:30.60)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: b3b5953e with validation_performance=0.864 and parameters={'layers': (1000, 500, 200), 'emb_drop': 0.16326137932493823, 'ps': 0.07573195581645337, 'bs': 1024, 'lr': 0.042738427971420245, 'epochs': 29, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0004787024097994504, 'activation': 'relu', 'dropout_prob': 0.25849267730526676}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (8 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:21 (running for 00:00:30.61)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: b3b5953e with validation_performance=0.864 and parameters={'layers': (1000, 500, 200), 'emb_drop': 0.16326137932493823, 'ps': 0.07573195581645337, 'bs': 1024, 'lr': 0.042738427971420245, 'epochs': 29, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0004787024097994504, 'activation': 'relu', 'dropout_prob': 0.25849267730526676}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (8 TERMINATED)


Fitted model: NeuralNetFastAI_BAG_L1/164e3c3a ...
        0.842    = Validation score   (accuracy)
        7.14s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/f1c1c24e ...
        0.86     = Validation score   (accuracy)
        7.9s     = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/b3b5953e ...
        0.864    = Validation score   (accuracy)
        17.19s   = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/779a4822 ...
        0.848    = Validation score   (accuracy)
        7.52s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/9081165e ...
        0.826    = Validation score   (accuracy)
        9.02s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/63102c55 ...
        0.856    = Validation score   (accuracy)
        10.6s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/f61a7fb8 ...
        0.852    = Validation score   (accuracy)
        8.57s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/815f4587 ...
        0.838    = Validation score   (accuracy)
        7.31s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
        0.864    = Validation score   (accuracy)
        0.35s    = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 32.48s ... Best model: "WeightedEnsemble_L2"

Notice that more than 8 CPUs were utilized. (Logs from a longer training run, where all CPUs can be seen fully utilized, are too long to copy over.)
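The reason more than 8 CPUs appear in the status blocks is that each trial is itself a bagged model: a running trial requests CPUs for all of its folds at once. A minimal sketch of that accounting (the helper function and the one-CPU-per-fold figure are illustrative assumptions, not AutoGluon internals):

```python
# Illustrative accounting for the "Resources requested" lines above:
# each running HPO trial is a bagged model whose 8 folds train in
# parallel, with (assumed) one CPU requested per fold.

def cpus_requested(running_trials: int, k_fold: int = 8,
                   cpus_per_fold: int = 1) -> int:
    """CPUs requested by `running_trials` concurrently running bagged trials."""
    return running_trials * k_fold * cpus_per_fold

# Lines up with the status blocks in the log:
# 1 running trial -> 8 CPUs, 5 -> 40, 6 -> 48, 3 -> 24.
```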

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-3179-1be77cc is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3179/1be77cc/index.html

Contributor

@Innixma Innixma left a comment


LGTM, had 1 comment

@@ -193,6 +196,10 @@ def register_resources(
if minimum_model_num_gpus > 0:
num_jobs_in_parallel_with_gpu = num_gpus // minimum_model_num_gpus
num_jobs_in_parallel = min(num_jobs_in_parallel_with_mem, num_jobs_in_parallel_with_cpu, num_jobs_in_parallel_with_gpu)
if k_fold is not None and k_fold > 0:
num_jobs_in_parallel = min(num_jobs_in_parallel, self.hyperparameter_tune_kwargs.get("num_trials", math.inf) * k_fold)
Contributor

nit: Assign this cap to a variable named max_models so the logic is easier to understand.

Also, this might get more confusing if we add nested bagging or repeated k-fold in the logic.
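The suggested refactor might look like the following sketch (not the actual patch; the names mirror the diff above, and `n_repeats` is a hypothetical extension point for the repeated k-fold case mentioned in the comment):

```python
import math
from typing import Optional

# Sketch of the reviewer's nit: name the cap `max_models` before taking
# the min. `hyperparameter_tune_kwargs` and `k_fold` mirror the diff
# above; `n_repeats` is a hypothetical hook for repeated k-fold.

def cap_parallel_jobs(num_jobs_in_parallel: int,
                      hyperparameter_tune_kwargs: dict,
                      k_fold: Optional[int],
                      n_repeats: int = 1) -> int:
    if k_fold is not None and k_fold > 0:
        num_trials = hyperparameter_tune_kwargs.get("num_trials", math.inf)
        # Maximum number of useful parallel jobs: one per fold of every trial.
        max_models = num_trials * k_fold * n_repeats
        num_jobs_in_parallel = min(num_jobs_in_parallel, max_models)
    return num_jobs_in_parallel
```

With the example run's settings (8 trials, 8 folds), the cap works out to 64 jobs, matching the cluster's 64 CPUs.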

@yinweisu yinweisu merged commit 1153fd8 into autogluon:master Apr 27, 2023
26 checks passed
@github-actions

Job PR-3179-cdb0859 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3179/cdb0859/index.html
