AutoMM: Avoid redundant checkpointing and evaluation (2x training speedup) #3001

Merged Mar 6, 2023 (1 commit)

@Innixma Innixma commented Mar 5, 2023

Issue #, if available:

#3000

Description of changes:

AutoMM can evaluate and checkpoint the same global step twice because the checkpointing callback never updates self._last_global_step_saved. The effect is most visible when an epoch consists of a single step: every step is then evaluated and saved twice, which roughly doubles training time.

This PR fixes the issue by setting self._last_global_step_saved in the callback's _save_checkpoint method, matching the logic of the parent method.
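
To illustrate the mechanism, here is a minimal, self-contained sketch. It does not use AutoGluon or PyTorch Lightning: ToyTrainer, ToyCheckpoint, maybe_save, and the two save attempts per step are hypothetical stand-ins chosen for illustration. Only the names _last_global_step_saved and _save_checkpoint come from the description above.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; it only tracks the global step."""

    def __init__(self):
        self.global_step = 0


class ToyCheckpoint:
    """Hypothetical parent callback with a 'skip if already saved' guard."""

    def __init__(self):
        self._last_global_step_saved = 0
        self.saved = []

    def maybe_save(self, trainer, hook):
        # Guard: do nothing if this global step was already checkpointed.
        if self._last_global_step_saved == trainer.global_step:
            return
        self._save_checkpoint(trainer, f"step={trainer.global_step}-{hook}.ckpt")

    def _save_checkpoint(self, trainer, filepath):
        self.saved.append(filepath)
        # Record the step so the guard can skip a second save.
        self._last_global_step_saved = trainer.global_step


class BuggyCheckpoint(ToyCheckpoint):
    def _save_checkpoint(self, trainer, filepath):
        # Overrides the parent but forgets to record the step.
        self.saved.append(filepath)


class FixedCheckpoint(ToyCheckpoint):
    def _save_checkpoint(self, trainer, filepath):
        self.saved.append(filepath)
        # Same bookkeeping as the parent method.
        self._last_global_step_saved = trainer.global_step


def run(callback, steps=3):
    """Simulate training where two hooks try to save at every global step."""
    trainer = ToyTrainer()
    for _ in range(steps):
        trainer.global_step += 1
        callback.maybe_save(trainer, "hook_a")
        callback.maybe_save(trainer, "hook_b")
    return callback.saved


buggy_saves = run(BuggyCheckpoint())
fixed_saves = run(FixedCheckpoint())
print(len(buggy_saves), len(fixed_saves))  # 6 3
```

In this toy model the buggy override saves twice per step because the guard never sees an updated step, while the fixed override saves once per step.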

Code to reproduce:

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 100  # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)

label = 'label'
eval_metric = 'roc_auc'
tuning_data = test_data

# Note: This also reproduces with MultiModalPredictor directly; the behavior is identical.
# TabularPredictor is used here because its logging is cleaner and reports the total training time.
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label=label, eval_metric=eval_metric)
predictor.fit(
    train_data,
    tuning_data=tuning_data,
    hyperparameters={'AG_AUTOMM': {}},
    fit_weighted_ensemble=False,
)

Before Fix:

Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=1-step=1.ckpt' as top 3
Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=1-step=1-v1.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=2-step=2.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=2-step=2-v1.ckpt' as top 3
Epoch 3, global step 3: 'val_roc_auc' reached 0.49901 (best 0.49901), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=3-step=3.ckpt' as top 3
Epoch 3, global step 3: 'val_roc_auc' reached 0.49901 (best 0.49901), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=3-step=3-v1.ckpt' as top 3
Epoch 4, global step 4: 'val_roc_auc' reached 0.52123 (best 0.52123), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=4-step=4.ckpt' as top 3
Epoch 4, global step 4: 'val_roc_auc' reached 0.52123 (best 0.52123), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=4-step=4-v1.ckpt' as top 3
Epoch 5, global step 5: 'val_roc_auc' reached 0.53436 (best 0.53436), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=5-step=5.ckpt' as top 3
Epoch 5, global step 5: 'val_roc_auc' reached 0.53436 (best 0.53436), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=5-step=5-v1.ckpt' as top 3
Epoch 6, global step 6: 'val_roc_auc' reached 0.54119 (best 0.54119), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=6-step=6.ckpt' as top 3
Epoch 6, global step 6: 'val_roc_auc' reached 0.54119 (best 0.54119), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=6-step=6-v1.ckpt' as top 3
Epoch 7, global step 7: 'val_roc_auc' reached 0.54379 (best 0.54379), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=7-step=7.ckpt' as top 3
Epoch 7, global step 7: 'val_roc_auc' reached 0.54379 (best 0.54379), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=7-step=7-v1.ckpt' as top 3
Epoch 8, global step 8: 'val_roc_auc' reached 0.54456 (best 0.54456), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=8-step=8.ckpt' as top 3
Epoch 8, global step 8: 'val_roc_auc' reached 0.54456 (best 0.54456), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=8-step=8-v1.ckpt' as top 3
Epoch 9, global step 9: 'val_roc_auc' reached 0.54456 (best 0.54456), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/epoch=9-step=9.ckpt' as top 3
Epoch 9, global step 9: 'val_roc_auc' was not in top 3
`Trainer.fit` stopped: `max_epochs=10` reached.
Configuration saved in AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/hf_text/config.json
tokenizer config file saved in AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/hf_text/tokenizer_config.json
Special tokens file saved in AutogluonModels/ag-20230305_220745/models/MultiModalPredictor/automm_model/hf_text/special_tokens_map.json
	0.5446	 = Validation score   (roc_auc)
	380.77s	 = Training   runtime
	1.63s	 = Validation runtime
AutoGluon training complete, total runtime = 382.81s ... Best model: "MultiModalPredictor"

After Fix:

Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=1-step=1.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=2-step=2.ckpt' as top 3
Epoch 3, global step 3: 'val_roc_auc' reached 0.49901 (best 0.49901), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=3-step=3.ckpt' as top 3
Epoch 4, global step 4: 'val_roc_auc' reached 0.52123 (best 0.52123), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=4-step=4.ckpt' as top 3
Epoch 5, global step 5: 'val_roc_auc' reached 0.53436 (best 0.53436), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=5-step=5.ckpt' as top 3
Epoch 6, global step 6: 'val_roc_auc' reached 0.54119 (best 0.54119), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=6-step=6.ckpt' as top 3
Epoch 7, global step 7: 'val_roc_auc' reached 0.54379 (best 0.54379), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=7-step=7.ckpt' as top 3
Epoch 8, global step 8: 'val_roc_auc' reached 0.54456 (best 0.54456), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=8-step=8.ckpt' as top 3
Epoch 9, global step 9: 'val_roc_auc' reached 0.54456 (best 0.54456), saving model to '/home/ubuntu/workspace/code/autogluon-scratch/AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/epoch=9-step=9.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=10` reached.
Configuration saved in AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/hf_text/config.json
tokenizer config file saved in AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/hf_text/tokenizer_config.json
Special tokens file saved in AutogluonModels/ag-20230305_224719/models/MultiModalPredictor/automm_model/hf_text/special_tokens_map.json
	0.5446	 = Validation score   (roc_auc)
	208.98s	 = Training   runtime
	1.66s	 = Validation runtime
AutoGluon training complete, total runtime = 211.06s ... Best model: "MultiModalPredictor"
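
A quick way to check a log for the symptom is to count "saving model" lines per global step. The snippet below is a rough sketch: the helper name duplicated_steps and the regular expression are my own, and the sample lines are copied from the logs above with the checkpoint directory abbreviated to <dir>. It also computes the ratio of the two training runtimes reported above.

```python
import re
from collections import Counter

# Matches checkpoint-save lines and captures the global step number.
SAVE_LINE = re.compile(r"global step (\d+): .*saving model to")

# Sample lines from the logs above; <dir> abbreviates the checkpoint directory.
BEFORE_LOG = """\
Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '<dir>/epoch=1-step=1.ckpt' as top 3
Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '<dir>/epoch=1-step=1-v1.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '<dir>/epoch=2-step=2.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '<dir>/epoch=2-step=2-v1.ckpt' as top 3
"""

AFTER_LOG = """\
Epoch 1, global step 1: 'val_roc_auc' reached 0.46303 (best 0.46303), saving model to '<dir>/epoch=1-step=1.ckpt' as top 3
Epoch 2, global step 2: 'val_roc_auc' reached 0.47989 (best 0.47989), saving model to '<dir>/epoch=2-step=2.ckpt' as top 3
"""


def duplicated_steps(log_text):
    """Return the sorted global steps that were checkpointed more than once."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SAVE_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return sorted(step for step, n in counts.items() if n > 1)


print(duplicated_steps(BEFORE_LOG))  # [1, 2]
print(duplicated_steps(AFTER_LOG))   # []

# Training runtimes reported in the two logs above, in seconds.
speedup = 380.77 / 208.98
print(f"{speedup:.2f}x")  # 1.82x
```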

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added the bug, enhancement, and module: multimodal labels Mar 5, 2023
@Innixma Innixma added this to the 0.7.1 Release milestone Mar 5, 2023
@Innixma Innixma linked an issue Mar 5, 2023 that may be closed by this pull request

github-actions bot commented Mar 6, 2023

Job PR-3001-faa0097 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3001/faa0097/index.html

@zhiqiangdon zhiqiangdon left a comment

Thanks for the fix.

@Innixma Innixma merged commit 44de374 into autogluon:master Mar 6, 2023
@Innixma Innixma modified the milestones: 0.7.1 Release, 0.8 Release May 16, 2023
Development

Successfully merging this pull request may close these issues.

[BUG] AutoMM evaluates same global step twice
2 participants