
Tabular Distributed HPO, and Distributed HPO + bagging #3179

Merged: 6 commits into autogluon:master, Apr 27, 2023

Conversation

yinweisu
Collaborator

Issue #, if available:

Description of changes:

  • Enable distributed HPO and Distributed HPO + bagging for Tabular

Log from an example run with 8 trials, each a bagged model consisting of 8 folds, training in parallel on a cluster of 8 m5.2xlarge machines.

Fitting 1 L1 models ...
Hyperparameter tuning model: NeuralNetFastAI_BAG_L1 ...
== Status ==
Current time: 2023-04-26 21:33:53 (running for 00:00:02.87)
Memory usage on this node: 1.6/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 8.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 1/8 (1 RUNNING)


== Status ==
Current time: 2023-04-26 21:33:58 (running for 00:00:08.32)
Memory usage on this node: 1.6/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 40.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 5/8 (5 RUNNING)


== Status ==
Current time: 2023-04-26 21:34:04 (running for 00:00:14.49)
Memory usage on this node: 1.8/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 48.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: 164e3c3a with validation_performance=0.842 and parameters={'layers': None, 'emb_drop': 0.1, 'ps': 0.1, 'bs': 256, 'lr': 0.01, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0005, 'activation': 'relu', 'dropout_prob': 0.1}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (1 PENDING, 6 RUNNING, 1 TERMINATED)



== Status ==
Current time: 2023-04-26 21:34:10 (running for 00:00:19.82)
Memory usage on this node: 1.8/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 40.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: f1c1c24e with validation_performance=0.86 and parameters={'layers': (1000, 500), 'emb_drop': 0.15201944925480249, 'ps': 0.4187612771721835, 'bs': 256, 'lr': 0.029533957578207624, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0021074502225949952, 'activation': 'softrelu', 'dropout_prob': 0.06302284671855163}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (5 RUNNING, 3 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:15 (running for 00:00:25.26)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 24.0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: f1c1c24e with validation_performance=0.86 and parameters={'layers': (1000, 500), 'emb_drop': 0.15201944925480249, 'ps': 0.4187612771721835, 'bs': 256, 'lr': 0.029533957578207624, 'epochs': 30, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0021074502225949952, 'activation': 'softrelu', 'dropout_prob': 0.06302284671855163}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (3 RUNNING, 5 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:21 (running for 00:00:30.60)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: b3b5953e with validation_performance=0.864 and parameters={'layers': (1000, 500, 200), 'emb_drop': 0.16326137932493823, 'ps': 0.07573195581645337, 'bs': 1024, 'lr': 0.042738427971420245, 'epochs': 29, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0004787024097994504, 'activation': 'relu', 'dropout_prob': 0.25849267730526676}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (8 TERMINATED)


== Status ==
Current time: 2023-04-26 21:34:21 (running for 00:00:30.61)
Memory usage on this node: 1.9/30.9 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/0 GPUs, 0.0/174.34 GiB heap, 0.0/70.76 GiB objects
Current best trial: b3b5953e with validation_performance=0.864 and parameters={'layers': (1000, 500, 200), 'emb_drop': 0.16326137932493823, 'ps': 0.07573195581645337, 'bs': 1024, 'lr': 0.042738427971420245, 'epochs': 29, 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10, 'learning_rate': 0.0004787024097994504, 'activation': 'relu', 'dropout_prob': 0.25849267730526676}
Result logdir: /tmp/ray/session_2023-04-26_21-30-34_117853_27096/runtime_resources/working_dir_files/_ray_pkg_8885f3ae4fb837d6/AutogluonModels/ag-20230426_213349/models/NeuralNetFastAI_BAG_L1
Number of trials: 8/8 (8 TERMINATED)


Fitted model: NeuralNetFastAI_BAG_L1/164e3c3a ...
        0.842    = Validation score   (accuracy)
        7.14s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/f1c1c24e ...
        0.86     = Validation score   (accuracy)
        7.9s     = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/b3b5953e ...
        0.864    = Validation score   (accuracy)
        17.19s   = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/779a4822 ...
        0.848    = Validation score   (accuracy)
        7.52s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/9081165e ...
        0.826    = Validation score   (accuracy)
        9.02s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/63102c55 ...
        0.856    = Validation score   (accuracy)
        10.6s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/f61a7fb8 ...
        0.852    = Validation score   (accuracy)
        8.57s    = Training   runtime
        0.0s     = Validation runtime
Fitted model: NeuralNetFastAI_BAG_L1/815f4587 ...
        0.838    = Validation score   (accuracy)
        7.31s    = Training   runtime
        0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        WARNING: Setting `self._oof_pred_proba` by predicting on train directly! This is probably a bug and should be investigated...
        0.864    = Validation score   (accuracy)
        0.35s    = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 32.48s ... Best model: "WeightedEnsemble_L2"

Notice that more than 8 CPUs were utilized. (Logs from a longer training run, where all CPUs can be seen fully utilized, are too long to copy over.)
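The reason more than 8 CPUs appear in the status blocks is that each trial is itself a bagged model: a running trial requests CPUs for all of its folds at once. A minimal sketch of that accounting (the helper function and the one-CPU-per-fold figure are illustrative assumptions, not AutoGluon internals):

```python
# Illustrative accounting for the "Resources requested" lines above:
# each running HPO trial is a bagged model whose 8 folds train in
# parallel, with (assumed) one CPU requested per fold.

def cpus_requested(running_trials: int, k_fold: int = 8,
                   cpus_per_fold: int = 1) -> int:
    """CPUs requested by `running_trials` concurrently running bagged trials."""
    return running_trials * k_fold * cpus_per_fold

# Lines up with the status blocks in the log:
# 1 running trial -> 8 CPUs, 5 -> 40, 6 -> 48, 3 -> 24.
```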

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-3179-1be77cc is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3179/1be77cc/index.html

Contributor

@Innixma Innixma left a comment


LGTM, had 1 comment

@@ -193,6 +196,10 @@ def register_resources(
if minimum_model_num_gpus > 0:
num_jobs_in_parallel_with_gpu = num_gpus // minimum_model_num_gpus
num_jobs_in_parallel = min(num_jobs_in_parallel_with_mem, num_jobs_in_parallel_with_cpu, num_jobs_in_parallel_with_gpu)
if k_fold is not None and k_fold > 0:
num_jobs_in_parallel = min(num_jobs_in_parallel, self.hyperparameter_tune_kwargs.get("num_trials", math.inf) * k_fold)
Contributor

nit: Assign this cap to a variable named max_models so the logic is easier to understand.

Also, this might get more confusing if we add nested bagging or repeated k-fold in the logic.
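The suggested refactor might look like the following sketch (not the actual patch; the names mirror the diff above, and `n_repeats` is a hypothetical extension point for the repeated k-fold case mentioned in the comment):

```python
import math
from typing import Optional

# Sketch of the reviewer's nit: name the cap `max_models` before taking
# the min. `hyperparameter_tune_kwargs` and `k_fold` mirror the diff
# above; `n_repeats` is a hypothetical hook for repeated k-fold.

def cap_parallel_jobs(num_jobs_in_parallel: int,
                      hyperparameter_tune_kwargs: dict,
                      k_fold: Optional[int],
                      n_repeats: int = 1) -> int:
    if k_fold is not None and k_fold > 0:
        num_trials = hyperparameter_tune_kwargs.get("num_trials", math.inf)
        # Maximum number of useful parallel jobs: one per fold of every trial.
        max_models = num_trials * k_fold * n_repeats
        num_jobs_in_parallel = min(num_jobs_in_parallel, max_models)
    return num_jobs_in_parallel
```

With the example run's settings (8 trials, 8 folds), the cap works out to 64 jobs, matching the cluster's 64 CPUs.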

@yinweisu yinweisu merged commit 1153fd8 into autogluon:master Apr 27, 2023
26 checks passed
@github-actions

Job PR-3179-cdb0859 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3179/cdb0859/index.html
