Optimize distributed model sync #3131

Merged
merged 5 commits into from
Apr 12, 2023

Conversation

yinweisu
Collaborator

@yinweisu yinweisu commented Apr 11, 2023

Issue #, if available:

Description of changes:

  • Add logic to check whether the Ray task is executed on the head node. If so, skip uploading/downloading the model artifacts because they are already present on the head node's local disk.
  • Sync at the fold level instead of the bag level in order to (1) avoid downloading duplicate models when training with repeats and (2) sync model artifacts while the head node is waiting for worker nodes to finish.
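
The two changes above can be illustrated with a minimal sketch. This is not the code in this PR: `FoldResult`, `needs_sync`, and `folds_to_download` are hypothetical names, and the node IPs are made-up values. In a real Ray cluster the IP of the current node could presumably be obtained with something like `ray.util.get_node_ip_address()`, but that wiring is an assumption and is left out so the example runs without Ray.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class FoldResult:
    """Hypothetical record describing one finished fold-training task."""
    fold_name: str  # e.g. "S1F1"
    node_ip: str    # IP of the node that trained and saved the fold model


def needs_sync(result: FoldResult, head_ip: str) -> bool:
    """A fold trained on the head node is already on the head's local disk."""
    return result.node_ip != head_ip


def folds_to_download(results: List[FoldResult], head_ip: str,
                      already_synced: Set[str]) -> List[str]:
    """Return the folds the head node still has to download.

    Folds handled in an earlier call (for example an earlier bagging repeat)
    are skipped. `already_synced` is updated in place.
    """
    pending = []
    for result in results:
        if result.fold_name in already_synced:
            continue  # handled during a previous repeat
        if needs_sync(result, head_ip):
            pending.append(result.fold_name)
        already_synced.add(result.fold_name)
    return pending


if __name__ == "__main__":
    head = "10.0.0.1"  # made-up head node IP
    synced: Set[str] = set()

    repeat_1 = [FoldResult("S1F1", head), FoldResult("S1F2", "10.0.0.2")]
    print(folds_to_download(repeat_1, head, synced))  # ['S1F2']

    # The second repeat sees all folds of the bag so far; only the new
    # worker-trained fold is downloaded.
    repeat_2 = repeat_1 + [FoldResult("S2F1", head), FoldResult("S2F2", "10.0.0.2")]
    print(folds_to_download(repeat_2, head, synced))  # ['S2F2']
```

Tracking synced folds by name is only one way to avoid repeated downloads; the PR itself may decide this differently.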

Example run with some logs enabled:

Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1 ... Training model for up to 599.77s of the 599.77s of remaining time.
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S1F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S1F2
['model.pkl']
        0.8714   = Validation score   (accuracy)
        0.76s    = Training   runtime
        0.1s     = Validation runtime
Repeating k-fold bagging: 2/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 596.57s of the 596.57s of remaining time.
        Fitting 2 child models (S2F1 - S2F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S2F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S2F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S2F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S2F2
['model.pkl']
        0.8724   = Validation score   (accuracy)
        1.46s    = Training   runtime
        0.2s     = Validation runtime
Repeating k-fold bagging: 3/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 593.33s of the 593.33s of remaining time.
        Fitting 2 child models (S3F1 - S3F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S3F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S3F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S3F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S3F2
['model.pkl']
        0.8723   = Validation score   (accuracy)
        2.14s    = Training   runtime
        0.3s     = Validation runtime
Repeating k-fold bagging: 4/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 590.21s of the 590.21s of remaining time.
        Fitting 2 child models (S4F1 - S4F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S4F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S4F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S4F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S4F2
['model.pkl']
        0.8732   = Validation score   (accuracy)
        2.81s    = Training   runtime
        0.38s    = Validation runtime
Repeating k-fold bagging: 5/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 587.21s of the 587.21s of remaining time.
        Fitting 2 child models (S5F1 - S5F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S5F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S5F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S5F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S5F2
['model.pkl']
        0.8729   = Validation score   (accuracy)
        3.55s    = Training   runtime
        0.5s     = Validation runtime
Repeating k-fold bagging: 6/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 584.08s of the 584.08s of remaining time.
        Fitting 2 child models (S6F1 - S6F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S6F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S6F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S6F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S6F2
['model.pkl']
        0.8733   = Validation score   (accuracy)
        4.18s    = Training   runtime
        0.59s    = Validation runtime
Repeating k-fold bagging: 7/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 580.99s of the 580.99s of remaining time.
        Fitting 2 child models (S7F1 - S7F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S7F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S7F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S7F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S7F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        4.85s    = Training   runtime
        0.69s    = Validation runtime
Repeating k-fold bagging: 8/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 577.96s of the 577.96s of remaining time.
        Fitting 2 child models (S8F1 - S8F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S8F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S8F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S8F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S8F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        5.55s    = Training   runtime
        0.8s     = Validation runtime
Repeating k-fold bagging: 9/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 574.88s of the 574.88s of remaining time.
        Fitting 2 child models (S9F1 - S9F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S9F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S9F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S9F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S9F2
['model.pkl']
        0.8737   = Validation score   (accuracy)
        6.43s    = Training   runtime
        0.93s    = Validation runtime
Repeating k-fold bagging: 10/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 571.79s of the 571.79s of remaining time.
        Fitting 2 child models (S10F1 - S10F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S10F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S10F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S10F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S10F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        7.08s    = Training   runtime
        1.0s     = Validation runtime
Repeating k-fold bagging: 11/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 568.87s of the 568.87s of remaining time.
        Fitting 2 child models (S11F1 - S11F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S11F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S11F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S11F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S11F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        7.79s    = Training   runtime
        1.12s    = Validation runtime
Repeating k-fold bagging: 12/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 565.77s of the 565.77s of remaining time.
        Fitting 2 child models (S12F1 - S12F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S12F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S12F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S12F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S12F2
['model.pkl']
        0.8736   = Validation score   (accuracy)
        8.45s    = Training   runtime
        1.2s     = Validation runtime
Repeating k-fold bagging: 13/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 562.76s of the 562.76s of remaining time.
        Fitting 2 child models (S13F1 - S13F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S13F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S13F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S13F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S13F2
['model.pkl']
        0.8738   = Validation score   (accuracy)
        9.09s    = Training   runtime
        1.28s    = Validation runtime
Repeating k-fold bagging: 14/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 559.75s of the 559.75s of remaining time.
        Fitting 2 child models (S14F1 - S14F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S14F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S14F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S14F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S14F2
['model.pkl']
        0.8737   = Validation score   (accuracy)
        9.76s    = Training   runtime
        1.38s    = Validation runtime
Repeating k-fold bagging: 15/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 556.59s of the 556.59s of remaining time.
        Fitting 2 child models (S15F1 - S15F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S15F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S15F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S15F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S15F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        10.49s   = Training   runtime
        1.51s    = Validation runtime
Repeating k-fold bagging: 16/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 552.89s of the 552.89s of remaining time.
        Fitting 2 child models (S16F1 - S16F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S16F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S16F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S16F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S16F2
['model.pkl']
        0.8739   = Validation score   (accuracy)
        11.2s    = Training   runtime
        1.62s    = Validation runtime
Repeating k-fold bagging: 17/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 549.7s of the 549.7s of remaining time.
        Fitting 2 child models (S17F1 - S17F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S17F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S17F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S17F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S17F2
['model.pkl']
        0.8738   = Validation score   (accuracy)
        12.03s   = Training   runtime
        1.74s    = Validation runtime
Repeating k-fold bagging: 18/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 546.6s of the 546.59s of remaining time.
        Fitting 2 child models (S18F1 - S18F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S18F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S18F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S18F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S18F2
['model.pkl']
        0.8742   = Validation score   (accuracy)
        12.78s   = Training   runtime
        1.87s    = Validation runtime
Repeating k-fold bagging: 19/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 543.33s of the 543.33s of remaining time.
        Fitting 2 child models (S19F1 - S19F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S19F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S19F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S19F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S19F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        13.38s   = Training   runtime
        1.94s    = Validation runtime
Repeating k-fold bagging: 20/20
Fitting model: LightGBM_BAG_L1 ... Training model for up to 540.34s of the 540.34s of remaining time.
        Fitting 2 child models (S20F1 - S20F2) | Fitting with ParallelDistributedFoldFittingStrategy
Will download 0 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S20F1/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S20F1
['model.pkl']
Will download 1 objects from s3://weisy-personal/test_distributed/utils/LightGBM_BAG_L1/S20F2/ to AutogluonModels/ag-20230411_204255/models/LightGBM_BAG_L1/S20F2
['model.pkl']
        0.874    = Validation score   (accuracy)
        14.28s   = Training   runtime
        2.11s    = Validation runtime
Completed 20/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 537.07s of remaining time.
        0.874    = Validation score   (accuracy)
        0.01s    = Training   runtime
        0.05s    = Validation runtime
AutoGluon training complete, total runtime = 63.01s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230411_204255/")
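
The log lines above pair one S3 prefix with one local directory per fold. As a rough sketch of that mapping, the hypothetical helper below (`fold_sync_paths` is not an AutoGluon function) rebuilds the two paths from a bucket root, a predictor directory, a model name, and a fold name. The layout is inferred from the log output only.

```python
import posixpath
from typing import Tuple


def fold_sync_paths(s3_root: str, local_root: str,
                    model_name: str, fold_name: str) -> Tuple[str, str]:
    """Build the (S3 prefix, local directory) pair for a single fold.

    Layout assumed from the log lines:
      <s3_root>/<model_name>/<fold_name>/
      <local_root>/models/<model_name>/<fold_name>
    """
    s3_prefix = posixpath.join(s3_root.rstrip("/"), model_name, fold_name) + "/"
    local_dir = posixpath.join(local_root.rstrip("/"), "models", model_name, fold_name)
    return s3_prefix, local_dir


if __name__ == "__main__":
    src, dst = fold_sync_paths(
        "s3://weisy-personal/test_distributed/utils",
        "AutogluonModels/ag-20230411_204255",
        "LightGBM_BAG_L1",
        "S1F2",
    )
    print(f"{src} -> {dst}")
```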

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment on lines +608 to +611
X_ref,
y_ref,
X_pseudo_ref,
y_pseudo_ref,
Contributor

Is there a reason these aren't just kwargs? Currently if I added some new fit arg like X_unlabeled, I would need to edit this code (at least I think) for it to work in distributed. Is there a way we can avoid this and make the logic more general?

Collaborator Author

This function is being shared by distributed and local parallel folding. You would need to edit this function anyway to make it work locally.

Contributor

ok, how about if I remove the comment about distributed?

Is there a reason these aren't just kwargs? Is there a way we can avoid this and make the logic more general? For example, X_val_ref isn't present here.

Contributor

nvm on X_val_ref, I see you use fold to get that

Contributor

Deep-dived into the code and I see the reasoning for how it is structured; disregard this comment.

Contributor

@Innixma Innixma left a comment

LGTM, added one comment above. (Approval is based on assumption that CI passes)

Weisu Yin added 2 commits April 12, 2023 06:33
@github-actions

Job PR-3131-c398834 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3131/c398834/index.html

@yinweisu yinweisu merged commit 44460e2 into autogluon:master Apr 12, 2023
28 checks passed

2 participants