Optimize distributed model sync #3131

yinweisu · 2023-04-11T21:22:46Z

Issue #, if available:

Description of changes:

Add logic to check whether the Ray task is executed on the head node. If so, skip uploading/downloading the model artifact because it is already present on the local disk.
Sync at the fold level instead of the bag level in order to (1) avoid downloading duplicate models when training with repeats, and (2) sync model artifacts while the head node is waiting for worker nodes to finish.
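
The two changes above can be illustrated with a minimal sketch. This is not the code from this PR: the names `FoldResult`, `needs_remote_sync`, and `folds_to_download`, the IP-based head check, and the example addresses are all hypothetical. It assumes each fold records the IP of the node that trained it. In a Ray cluster that IP could be obtained with `ray.util.get_node_ip_address()`, but the sketch takes plain strings so it runs without Ray.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldResult:
    """Hypothetical record of where one fold model was trained."""
    fold_name: str  # e.g. "S1F2"
    node_ip: str    # IP of the node that ran the training task


def needs_remote_sync(node_ip: str, head_ip: str) -> bool:
    """A fold trained on the head node is already on the head's local disk."""
    return node_ip != head_ip


def folds_to_download(results, head_ip, already_local):
    """Return the fold names the head node still has to fetch.

    Deciding per fold (rather than per bag) lets the caller skip folds that
    were synced during an earlier repeat and fetch each remaining fold on
    its own, without waiting for the whole bag.
    """
    return [
        r.fold_name
        for r in results
        if needs_remote_sync(r.node_ip, head_ip)
        and r.fold_name not in already_local
    ]


# Example: two folds, one trained on the head node (assumed addresses).
head = "10.0.0.1"
results = [FoldResult("S1F1", "10.0.0.1"), FoldResult("S1F2", "10.0.0.2")]
print(folds_to_download(results, head, already_local=set()))  # ['S1F2']
```

Only `S1F2` needs to be fetched here, because `S1F1` was trained on the head node itself.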

Example run with some logs enabled:

Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1 ... Training model for up to 599.77s of the 599.77s of remaining time.
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F2
['model.pkl']
        0.8714   = Validation score   (accuracy)
        0.76s    = Training   runtime
        0.1s     = Validation runtime
Repeating k-fold bagging: 2/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 596.57s of the 596.57s of remaining time.
        Fitting 2 child models (S2F1 - S2F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S2F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S2F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S2F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S2F2
['model.pkl']
        0.8724   = Validation score   (accuracy)
        1.46s    = Training   runtime
        0.2s     = Validation runtime
Repeating k-fold bagging: 3/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 593.33s of the 593.33s of remaining time.
        Fitting 2 child models (S3F1 - S3F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S3F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S3F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S3F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S3F2
['model.pkl']
        0.8723   = Validation score   (accuracy)
        2.14s    = Training   runtime
        0.3s     = Validation runtime
Repeating k-fold bagging: 4/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 590.21s of the 590.21s of remaining time.
        Fitting 2 child models (S4F1 - S4F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S4F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S4F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S4F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S4F2
['model.pkl']
        0.8732   = Validation score   (accuracy)
        2.81s    = Training   runtime
        0.38s    = Validation runtime
Repeating k-fold bagging: 5/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 587.21s of the 587.21s of remaining time.
        Fitting 2 child models (S5F1 - S5F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S5F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S5F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S5F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S5F2
['model.pkl']
        0.8729   = Validation score   (accuracy)
        3.55s    = Training   runtime
        0.5s     = Validation runtime
Repeating k-fold bagging: 6/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 584.08s of the 584.08s of remaining time.
        Fitting 2 child models (S6F1 - S6F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S6F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S6F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S6F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S6F2
['model.pkl']
        0.8733   = Validation score   (accuracy)
        4.18s    = Training   runtime
        0.59s    = Validation runtime
Repeating k-fold bagging: 7/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 580.99s of the 580.99s of remaining time.
        Fitting 2 child models (S7F1 - S7F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S7F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S7F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S7F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S7F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        4.85s    = Training   runtime
        0.69s    = Validation runtime
Repeating k-fold bagging: 8/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 577.96s of the 577.96s of remaining time.
        Fitting 2 child models (S8F1 - S8F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S8F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S8F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S8F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S8F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        5.55s    = Training   runtime
        0.8s     = Validation runtime
Repeating k-fold bagging: 9/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 574.88s of the 574.88s of remaining time.
        Fitting 2 child models (S9F1 - S9F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S9F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S9F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S9F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S9F2
['model.pkl']
        0.8737   = Validation score   (accuracy)
        6.43s    = Training   runtime
        0.93s    = Validation runtime
Repeating k-fold bagging: 10/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 571.79s of the 571.79s of remaining time.
        Fitting 2 child models (S10F1 - S10F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S10F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S10F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S10F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S10F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        7.08s    = Training   runtime
        1.0s     = Validation runtime
Repeating k-fold bagging: 11/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 568.87s of the 568.87s of remaining time.
        Fitting 2 child models (S11F1 - S11F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S11F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S11F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S11F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S11F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        7.79s    = Training   runtime
        1.12s    = Validation runtime
Repeating k-fold bagging: 12/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 565.77s of the 565.77s of remaining time.
        Fitting 2 child models (S12F1 - S12F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S12F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S12F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S12F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S12F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        8.45s    = Training   runtime
        1.2s     = Validation runtime
Repeating k-fold bagging: 13/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 562.76s of the 562.76s of remaining time.
        Fitting 2 child models (S13F1 - S13F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S13F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S13F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S13F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S13F2
['model.pkl']
        0.8738   = Validation score   (accuracy)
        9.09s    = Training   runtime
        1.28s    = Validation runtime
Repeating k-fold bagging: 14/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 559.75s of the 559.75s of remaining time.
        Fitting 2 child models (S14F1 - S14F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S14F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S14F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S14F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S14F2
['model.pkl']
        0.8737   = Validation score   (accuracy)
        9.76s    = Training   runtime
        1.38s    = Validation runtime
Repeating k-fold bagging: 15/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 556.59s of the 556.59s of remaining time.
        Fitting 2 child models (S15F1 - S15F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S15F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S15F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S15F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S15F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        10.49s   = Training   runtime
        1.51s    = Validation runtime
Repeating k-fold bagging: 16/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 552.89s of the 552.89s of remaining time.
        Fitting 2 child models (S16F1 - S16F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S16F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S16F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S16F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S16F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        11.2s    = Training   runtime
        1.62s    = Validation runtime
Repeating k-fold bagging: 17/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 549.7s of the 549.7s of remaining time.
        Fitting 2 child models (S17F1 - S17F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S17F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S17F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S17F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S17F2
['model.pkl']
        0.8738   = Validation score   (accuracy)
        12.03s   = Training   runtime
        1.74s    = Validation runtime
Repeating k-fold bagging: 18/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 546.6s of the 546.59s of remaining time.
        Fitting 2 child models (S18F1 - S18F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S18F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S18F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S18F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S18F2
['model.pkl']
        0.8742   = Validation score   (accuracy)
        12.78s   = Training   runtime
        1.87s    = Validation runtime
Repeating k-fold bagging: 19/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 543.33s of the 543.33s of remaining time.
        Fitting 2 child models (S19F1 - S19F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S19F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S19F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S19F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S19F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        13.38s   = Training   runtime
        1.94s    = Validation runtime
Repeating k-fold bagging: 20/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 540.34s of the 540.34s of remaining time.
        Fitting 2 child models (S20F1 - S20F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S20F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S20F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S20F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S20F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        14.28s   = Training   runtime
        2.11s    = Validation runtime
Completed 20/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 537.07s of remaining time.
        0.874    = Validation score   (accuracy)
        0.01s    = Training   runtime
        0.05s    = Validation runtime
AutoGluon training complete, total runtime = 63.01s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230411_204255/")
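
To check from a log like the one above how many objects were reported for download, the `Will download N objects from ... to ...` lines can be tallied. This is a rough sketch that assumes the message format shown in this run; the function name `count_downloads` is hypothetical, and the two sample lines are copied from the log.

```python
import re

# Matches lines such as:
# "Will download 1 objects from s3://bucket/prefix/ to local/path"
_DOWNLOAD_RE = re.compile(r"^Will download (\d+) objects from (\S+) to (\S+)$")


def count_downloads(log_text: str) -> dict:
    """Map each source prefix to the number of objects the log reports for it."""
    counts = {}
    for line in log_text.splitlines():
        match = _DOWNLOAD_RE.match(line.strip())
        if match:
            n, src, _dst = match.groups()
            counts[src] = counts.get(src, 0) + int(n)
    return counts


sample = """\
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F1
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F2
"""
counts = count_downloads(sample)
print(sum(counts.values()))  # 1
```

In the run above, every repeat reports 0 objects for the first fold and 1 for the second. That is consistent with one fold per repeat having been trained on the head node, although the log itself does not say which node ran which fold.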

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Innixma · 2023-04-11T22:04:42Z

core/src/autogluon/core/models/ensemble/fold_fitting_strategy.py

+        X_ref,
+        y_ref,
+        X_pseudo_ref,
+        y_pseudo_ref,


Is there a reason these aren't just kwargs? Currently if I added some new fit arg like X_unlabeled, I would need to edit this code (at least I think) for it to work in distributed. Is there a way we can avoid this and make the logic more general?

This function is being shared by distributed and local parallel folding. You would need to edit this function anyway to make it work locally.

ok, how about if I remove the comment about distributed?

Is there a reason these aren't just kwargs? Is there a way we can avoid this and make the logic more general? For example, X_val_ref isn't present here.

nvm on X_val_ref, I see you use fold to get that

I dug into the code and now see the reasoning for how it is structured; disregard this comment.

Innixma

LGTM, added one comment above. (Approval is based on the assumption that CI passes.)

github-actions · 2023-04-12T19:54:53Z

Job PR-3131-c398834 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3131/c398834/index.html

Weisu Yin added 3 commits April 11, 2023 21:09

optimize sync

c0345a7

minor

e8380fd

fix

2e712e1

Innixma reviewed Apr 11, 2023

Innixma approved these changes Apr 11, 2023

Weisu Yin added 2 commits April 12, 2023 06:33

fix

d230a61

fix

c398834

yinweisu merged commit 44460e2 into autogluon:master Apr 12, 2023
28 checks passed
