Tabular Distributed Training Resource Constraint #3115

yinweisu · 2023-04-06T22:17:31Z

Issue #, if available:

Noticed that nn_torch and nn_fastai models trained extremely slowly (10x+ slower than normal) on a newly spun-up cluster. After some investigation, the cause is that underlying Ray processes consume CPU resources while the submitted job is running. Trying to use all cores on the machine can therefore lead to resource contention.

Description of changes:

Added logic to reserve 2 CPUs for Ray processes.
Updated Ray to 2.3.0.
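
For illustration only, the reservation idea can be sketched as below. This is not the code from this PR: the names `RESERVED_CPUS_FOR_RAY` and `usable_cpus` are hypothetical, and the floor of 1 CPU is an assumption made so that small machines still get a usable value.

```python
import os

# Assumption: two cores are held back for Ray's background processes,
# mirroring the "reserve 2 CPUs" change described above.
RESERVED_CPUS_FOR_RAY = 2


def usable_cpus(total_cpus: int, reserved: int = RESERVED_CPUS_FOR_RAY) -> int:
    """Return the CPUs to hand to a training task, never fewer than 1."""
    if total_cpus < 1:
        raise ValueError("total_cpus must be at least 1")
    # Subtract the reserved cores, but keep at least one core for training.
    return max(1, total_cpus - reserved)


if __name__ == "__main__":
    total = os.cpu_count() or 1
    print(f"total={total}, usable={usable_cpus(total)}")
```

With 8 CPUs per node this yields 6, which matches the `num_cpus: 6` values printed by the child fits in the run log in this description.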

A run on a newly spun-up 8-node cluster with logs enabled:

The run times of all models are normal (in line with a local run on a g4.12xlarge machine) except for LightGBMXT_BAG_L1. This is likely because it is the first task dispatched, so it incurs overhead for communication between nodes. The logs support this: most of the time was spent waiting for the other nodes to reach the "enter lgb" line.

Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
2023-04-06 20:10:30,915 INFO worker.py:1230 -- Using address 172.31.64.125:6379 set in the environment variable RAY_ADDRESS
2023-04-06 20:10:30,916 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 172.31.64.125:6379...
2023-04-06 20:10:30,921 INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 172.31.64.125:8265 
        0.7775   = Validation score   (accuracy)
        0.04s    = Training   runtime
        0.34s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
        0.7728   = Validation score   (accuracy)
        0.04s    = Training   runtime
        0.35s    = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=4694) enter lgb
(_ray_fit pid=4694) num_cpus: 6
(_ray_fit pid=338, ip=172.31.69.17) enter lgb
(_ray_fit pid=338, ip=172.31.69.17) num_cpus: 6
(_ray_fit pid=371, ip=172.31.70.86) enter lgb
(_ray_fit pid=371, ip=172.31.70.86) num_cpus: 6
(_ray_fit pid=338, ip=172.31.70.212) enter lgb
(_ray_fit pid=338, ip=172.31.70.212) num_cpus: 6
(_ray_fit pid=337, ip=172.31.66.24) enter lgb
(_ray_fit pid=337, ip=172.31.66.24) num_cpus: 6
(_ray_fit pid=336, ip=172.31.75.200) enter lgb
(_ray_fit pid=336, ip=172.31.75.200) num_cpus: 6
(_ray_fit pid=338, ip=172.31.69.172) enter lgb
(_ray_fit pid=338, ip=172.31.69.172) num_cpus: 6
(_ray_fit pid=338, ip=172.31.77.126) enter lgb
(_ray_fit pid=338, ip=172.31.77.126) num_cpus: 6
        0.8683   = Validation score   (accuracy)
        77.38s   = Training   runtime
        0.33s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=4873) enter lgb
(_ray_fit pid=4873) num_cpus: 6
(_ray_fit pid=382, ip=172.31.66.24) enter lgb
(_ray_fit pid=382, ip=172.31.66.24) num_cpus: 6
(_ray_fit pid=416, ip=172.31.70.86) enter lgb
(_ray_fit pid=416, ip=172.31.70.86) num_cpus: 6
(_ray_fit pid=383, ip=172.31.69.172) enter lgb
(_ray_fit pid=383, ip=172.31.69.172) num_cpus: 6
(_ray_fit pid=383, ip=172.31.70.212) enter lgb
(_ray_fit pid=383, ip=172.31.70.212) num_cpus: 6
(_ray_fit pid=381, ip=172.31.75.200) enter lgb
(_ray_fit pid=381, ip=172.31.75.200) num_cpus: 6
(_ray_fit pid=383, ip=172.31.69.17) enter lgb
(_ray_fit pid=383, ip=172.31.69.17) num_cpus: 6
(_ray_fit pid=383, ip=172.31.77.126) enter lgb
(_ray_fit pid=383, ip=172.31.77.126) num_cpus: 6
        0.8745   = Validation score   (accuracy)
        0.75s    = Training   runtime
        0.24s    = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
        0.8569   = Validation score   (accuracy)
        2.76s    = Training   runtime
        1.54s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
        0.8568   = Validation score   (accuracy)
        3.01s    = Training   runtime
        1.53s    = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        0.8737   = Validation score   (accuracy)
        16.42s   = Training   runtime
        0.09s    = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
        0.8509   = Validation score   (accuracy)
        1.96s    = Training   runtime
        1.74s    = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
        0.8518   = Validation score   (accuracy)
        2.01s    = Training   runtime
        1.72s    = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=5127) enter fit
(_ray_fit pid=5127) num cpus: 6
(_ray_fit pid=491, ip=172.31.66.24) enter fit
(_ray_fit pid=491, ip=172.31.66.24) num cpus: 6
(_ray_fit pid=491, ip=172.31.70.212) enter fit
(_ray_fit pid=491, ip=172.31.70.212) num cpus: 6
(_ray_fit pid=492, ip=172.31.69.17) enter fit
(_ray_fit pid=492, ip=172.31.69.17) num cpus: 6
(_ray_fit pid=525, ip=172.31.70.86) enter fit
(_ray_fit pid=525, ip=172.31.70.86) num cpus: 6
(_ray_fit pid=492, ip=172.31.69.172) enter fit
(_ray_fit pid=492, ip=172.31.69.172) num cpus: 6
(_ray_fit pid=490, ip=172.31.75.200) enter fit
(_ray_fit pid=490, ip=172.31.75.200) num cpus: 6
(_ray_fit pid=492, ip=172.31.77.126) enter fit
(_ray_fit pid=492, ip=172.31.77.126) num cpus: 6
(_ray_fit pid=492, ip=172.31.69.17) start fitting nnfastai
(_ray_fit pid=525, ip=172.31.70.86) start fitting nnfastai
(_ray_fit pid=5127) start fitting nnfastai
(_ray_fit pid=491, ip=172.31.66.24) start fitting nnfastai
(_ray_fit pid=491, ip=172.31.70.212) start fitting nnfastai
(_ray_fit pid=492, ip=172.31.69.17) [2023-04-06 20:12:36.429 ip-172-31-69-17:492 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=490, ip=172.31.75.200) start fitting nnfastai
(_ray_fit pid=525, ip=172.31.70.86) [2023-04-06 20:12:36.429 ip-172-31-70-86:525 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=492, ip=172.31.69.172) start fitting nnfastai
(_ray_fit pid=5127) [2023-04-06 20:12:36.457 ip-172-31-64-125:5127 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=5127) [2023-04-06 20:12:36.492 ip-172-31-64-125:5127 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=491, ip=172.31.66.24) [2023-04-06 20:12:36.441 ip-172-31-66-24:491 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=491, ip=172.31.66.24) [2023-04-06 20:12:36.476 ip-172-31-66-24:491 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=491, ip=172.31.70.212) [2023-04-06 20:12:36.430 ip-172-31-70-212:491 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=491, ip=172.31.70.212) [2023-04-06 20:12:36.465 ip-172-31-70-212:491 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=492, ip=172.31.69.17) [2023-04-06 20:12:36.464 ip-172-31-69-17:492 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=490, ip=172.31.75.200) [2023-04-06 20:12:36.523 ip-172-31-75-200:490 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=525, ip=172.31.70.86) [2023-04-06 20:12:36.465 ip-172-31-70-86:525 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=492, ip=172.31.77.126) start fitting nnfastai
(_ray_fit pid=490, ip=172.31.75.200) [2023-04-06 20:12:36.559 ip-172-31-75-200:490 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=492, ip=172.31.77.126) [2023-04-06 20:12:36.650 ip-172-31-77-126:492 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=492, ip=172.31.69.172) [2023-04-06 20:12:36.581 ip-172-31-69-172:492 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=492, ip=172.31.69.172) [2023-04-06 20:12:36.618 ip-172-31-69-172:492 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=492, ip=172.31.77.126) [2023-04-06 20:12:36.687 ip-172-31-77-126:492 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=525, ip=172.31.70.86) No improvement since epoch 6: early stopping
(_ray_fit pid=525, ip=172.31.70.86) fitting took 36.50655770301819 secs
(_ray_fit pid=525, ip=172.31.70.86) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=492, ip=172.31.69.17) fitting took 39.29960346221924 secs
(_ray_fit pid=491, ip=172.31.70.212) fitting took 39.3617377281189 secs
(_ray_fit pid=492, ip=172.31.69.17) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=491, ip=172.31.66.24) fitting took 39.55890727043152 secs
(_ray_fit pid=491, ip=172.31.70.212) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=491, ip=172.31.66.24) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=5127) fitting took 39.82307696342468 secs
(_ray_fit pid=490, ip=172.31.75.200) fitting took 39.92330813407898 secs
(_ray_fit pid=5127) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=490, ip=172.31.75.200) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=492, ip=172.31.69.172) No improvement since epoch 9: early stopping
(_ray_fit pid=492, ip=172.31.69.172) fitting took 42.38680839538574 secs
(_ray_fit pid=492, ip=172.31.69.172) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
(_ray_fit pid=492, ip=172.31.77.126) fitting took 42.7169144153595 secs
(_ray_fit pid=492, ip=172.31.77.126) INFO:botocore.credentials:Found credentials from IAM Role: AGRayCluster-v1
        0.8591   = Validation score   (accuracy)
        44.45s   = Training   runtime
        0.69s    = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=592, ip=172.31.66.24) enter xgb
(_ray_fit pid=592, ip=172.31.66.24) num_cpus: 6
(_ray_fit pid=5368) enter xgb
(_ray_fit pid=5368) num_cpus: 6
(_ray_fit pid=593, ip=172.31.69.17) enter xgb
(_ray_fit pid=593, ip=172.31.69.17) num_cpus: 6
(_ray_fit pid=592, ip=172.31.70.212) enter xgb
(_ray_fit pid=592, ip=172.31.70.212) num_cpus: 6
(_ray_fit pid=627, ip=172.31.70.86) enter xgb
(_ray_fit pid=627, ip=172.31.70.86) num_cpus: 6
(_ray_fit pid=591, ip=172.31.75.200) enter xgb
(_ray_fit pid=591, ip=172.31.75.200) num_cpus: 6
(_ray_fit pid=593, ip=172.31.77.126) enter xgb
(_ray_fit pid=593, ip=172.31.77.126) num_cpus: 6
(_ray_fit pid=593, ip=172.31.69.172) enter xgb
(_ray_fit pid=593, ip=172.31.69.172) num_cpus: 6
        0.8753   = Validation score   (accuracy)
        5.94s    = Training   runtime
        0.28s    = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=640, ip=172.31.69.17) enter _fit
(_ray_fit pid=640, ip=172.31.69.17) 6
(_ray_fit pid=5472) enter _fit
(_ray_fit pid=5472) 6
(_ray_fit pid=639, ip=172.31.66.24) enter _fit
(_ray_fit pid=639, ip=172.31.66.24) 6
(_ray_fit pid=639, ip=172.31.70.212) enter _fit
(_ray_fit pid=639, ip=172.31.70.212) 6
(_ray_fit pid=674, ip=172.31.70.86) enter _fit
(_ray_fit pid=674, ip=172.31.70.86) 6
(_ray_fit pid=638, ip=172.31.75.200) enter _fit
(_ray_fit pid=638, ip=172.31.75.200) 6
(_ray_fit pid=640, ip=172.31.77.126) enter _fit
(_ray_fit pid=640, ip=172.31.77.126) 6
(_ray_fit pid=640, ip=172.31.69.172) enter _fit
(_ray_fit pid=640, ip=172.31.69.172) 6
(_ray_fit pid=639, ip=172.31.66.24) start fitting nntorch
(_ray_fit pid=640, ip=172.31.69.17) start fitting nntorch
(_ray_fit pid=639, ip=172.31.70.212) start fitting nntorch
(_ray_fit pid=5472) start fitting nntorch
(_ray_fit pid=674, ip=172.31.70.86) start fitting nntorch
(_ray_fit pid=638, ip=172.31.75.200) start fitting nntorch
(_ray_fit pid=640, ip=172.31.77.126) start fitting nntorch
(_ray_fit pid=640, ip=172.31.69.172) start fitting nntorch
(_ray_fit pid=639, ip=172.31.66.24) [2023-04-06 20:13:32.436 ip-172-31-66-24:639 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=639, ip=172.31.66.24) [2023-04-06 20:13:32.470 ip-172-31-66-24:639 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=640, ip=172.31.69.17) [2023-04-06 20:13:32.424 ip-172-31-69-17:640 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=640, ip=172.31.69.17) [2023-04-06 20:13:32.458 ip-172-31-69-17:640 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=639, ip=172.31.70.212) [2023-04-06 20:13:32.448 ip-172-31-70-212:639 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=639, ip=172.31.70.212) [2023-04-06 20:13:32.482 ip-172-31-70-212:639 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=5472) [2023-04-06 20:13:32.449 ip-172-31-64-125:5472 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=5472) [2023-04-06 20:13:32.482 ip-172-31-64-125:5472 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=674, ip=172.31.70.86) [2023-04-06 20:13:32.463 ip-172-31-70-86:674 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=674, ip=172.31.70.86) [2023-04-06 20:13:32.496 ip-172-31-70-86:674 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=638, ip=172.31.75.200) [2023-04-06 20:13:32.524 ip-172-31-75-200:638 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=640, ip=172.31.77.126) [2023-04-06 20:13:32.576 ip-172-31-77-126:640 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=640, ip=172.31.69.172) [2023-04-06 20:13:32.610 ip-172-31-69-172:640 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
(_ray_fit pid=638, ip=172.31.75.200) [2023-04-06 20:13:32.558 ip-172-31-75-200:638 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=640, ip=172.31.77.126) [2023-04-06 20:13:32.612 ip-172-31-77-126:640 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=640, ip=172.31.69.172) [2023-04-06 20:13:32.647 ip-172-31-69-172:640 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
(_ray_fit pid=639, ip=172.31.70.212) fitting took 35.58976411819458 secs
(_ray_fit pid=640, ip=172.31.69.172) fitting took 37.566407203674316 secs
(_ray_fit pid=640, ip=172.31.77.126) fitting took 40.31784176826477 secs
(_ray_fit pid=638, ip=172.31.75.200) fitting took 43.3432092666626 secs
(_ray_fit pid=674, ip=172.31.70.86) fitting took 43.86594891548157 secs
(_ray_fit pid=5472) fitting took 48.791943311691284 secs
(_ray_fit pid=639, ip=172.31.66.24) fitting took 49.9268913269043 secs
(_ray_fit pid=640, ip=172.31.69.17) fitting took 50.825636863708496 secs
        0.8586   = Validation score   (accuracy)
        51.87s   = Training   runtime
        0.26s    = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
{'num_gpus': 0, 'num_cpus': 8}
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
(_ray_fit pid=700, ip=172.31.69.17) enter lgb
(_ray_fit pid=700, ip=172.31.69.17) num_cpus: 6
(_ray_fit pid=698, ip=172.31.75.200) enter lgb
(_ray_fit pid=698, ip=172.31.75.200) num_cpus: 6
(_ray_fit pid=5655) enter lgb
(_ray_fit pid=5655) num_cpus: 6
(_ray_fit pid=699, ip=172.31.70.212) enter lgb
(_ray_fit pid=699, ip=172.31.70.212) num_cpus: 6
(_ray_fit pid=700, ip=172.31.77.126) enter lgb
(_ray_fit pid=700, ip=172.31.77.126) num_cpus: 6
(_ray_fit pid=699, ip=172.31.66.24) enter lgb
(_ray_fit pid=699, ip=172.31.66.24) num_cpus: 6
(_ray_fit pid=734, ip=172.31.70.86) enter lgb
(_ray_fit pid=734, ip=172.31.70.86) num_cpus: 6
(_ray_fit pid=700, ip=172.31.69.172) enter lgb
(_ray_fit pid=700, ip=172.31.69.172) num_cpus: 6
        0.8737   = Validation score   (accuracy)
        1.76s    = Training   runtime
        0.31s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.8753   = Validation score   (accuracy)
        10.73s   = Training   runtime
        0.05s    = Validation runtime
AutoGluon training complete, total runtime = 248.92s ... Best model: "WeightedEnsemble_L2"
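
To compare per-worker fit times in a log like the one above, a small parser along these lines could help. It is a rough sketch that assumes the temporary debug line format "fitting took N secs" seen in this run; the function name `fit_durations` and the label "local" for lines without an `ip=` field are hypothetical choices.

```python
import re

# Matches debug lines such as:
#   (_ray_fit pid=525, ip=172.31.70.86) fitting took 36.5 secs
#   (_ray_fit pid=5127) fitting took 39.8 secs
FIT_LINE = re.compile(
    r"\(_ray_fit pid=(?P<pid>\d+)(?:, ip=(?P<ip>[\d.]+))?\) "
    r"fitting took (?P<secs>[\d.]+) secs"
)


def fit_durations(log_text: str) -> list[tuple[str, int, float]]:
    """Return (worker, pid, seconds) for every 'fitting took' line."""
    durations = []
    for match in FIT_LINE.finditer(log_text):
        # Assumption: lines without an ip= field come from the node
        # where the driver runs, labeled "local" here.
        worker = match.group("ip") or "local"
        durations.append(
            (worker, int(match.group("pid")), float(match.group("secs")))
        )
    return durations


if __name__ == "__main__":
    sample = (
        "(_ray_fit pid=525, ip=172.31.70.86) fitting took 36.5 secs\n"
        "(_ray_fit pid=5127) fitting took 39.8 secs\n"
    )
    for worker, pid, secs in fit_durations(sample):
        print(f"{worker} (pid {pid}): {secs:.2f}s")
```

A large gap between the slowest worker's time and the reported training runtime would point to dispatch or communication overhead rather than slow fitting.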

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Innixma

LGTM!

github-actions · 2023-04-10T19:14:45Z

Job PR-3115-68fc5a4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3115/68fc5a4/index.html

Innixma approved these changes Apr 6, 2023

Weisu Yin added 4 commits April 10, 2023 16:40

upload folder

f3827ea

distributed resource

87724ce

add explaination

7957535

clean

68fc5a4

yinweisu force-pushed the master branch from 0303a90 to 68fc5a4 Compare April 10, 2023 16:40

yinweisu merged commit 47a78ee into autogluon:master Apr 10, 2023
28 checks passed
