[Post 0.7][multimodal] speedup fusion model training with deepspeed strategy #2932

Merged: 1 commit merged into autogluon:master on Feb 24, 2023

Conversation

@liangfu (Collaborator) commented on Feb 16, 2023

Issue #, if available:

DeepSpeed saves model and optimizer states as sharded checkpoints in a separate folder (even when using a single GPU).
Previously, the conversion from this sharded format took place in the _update_best_and_save function. However, converting the ZeRO checkpoint to a state dict there is a bad idea, since multiple checkpointing processes (when using DDP with multiple GPUs) would call this function almost simultaneously.
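
For context, collecting a sharded ZeRO checkpoint into a single state dict is typically done with DeepSpeed's zero_to_fp32 utility. A minimal sketch of the idea (the checkpoint path is hypothetical, not taken from this PR):

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Hypothetical directory holding the sharded ZeRO checkpoint written during training.
checkpoint_dir = "./save_path/model.ckpt"

# Gathers the fp32 weights from all shards into one ordinary state dict.
# Running this concurrently from every DDP rank merges the same files
# redundantly, which is why this PR defers it to a single load-time call.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)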

Description of changes:

  1. Move the conversion to the _load_state_dict function, so that it is done only once, even when using DDP with multiple GPUs.
  2. Add an explicit conversion to ensure that the weights and inputs have the same data type for all model outputs in the fusion_mlp model (see the sketch below). This becomes necessary when using the deepspeed strategy with a mixed-precision setting.
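
To illustrate change 2, a minimal sketch of such a cast, assuming a plain torch.nn.Linear head (the function and variable names are illustrative, not the actual fusion_mlp code):

import torch

def fused_head_forward(features: torch.Tensor, head: torch.nn.Linear) -> torch.Tensor:
    # Under mixed precision, the per-modality features may arrive as fp16/bf16
    # while the head's weights are kept in a different dtype (or vice versa).
    # Casting the input to the weight dtype keeps the matmul well-defined.
    if features.dtype != head.weight.dtype:
        features = features.to(head.weight.dtype)
    return head(features)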

Results

Tested on a g4dn.12xlarge instance with 4x Tesla T4 GPUs.

strategy=ddp_spawn, yields 1.12 iterations per second
strategy=ddp,       yields 1.37 iterations per second
strategy=deepspeed, yields 1.46 iterations per second

Code to reproduce the results:

import os
import numpy as np
import warnings
import pandas as pd
from autogluon.core.utils.loaders import load_zip
from autogluon.multimodal import MultiModalPredictor

def path_expander(path, base_folder):
    # Expand each semicolon-separated image path into an absolute path.
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, p)) for p in path_l])

def main():
    os.environ["DISABLE_ADDMM_CUDA_LT"] = "1"  # disable PyTorch's cuBLASLt addmm path
    warnings.filterwarnings('ignore')
    np.random.seed(123)

    download_dir = './ag_automm_tutorial'
    zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'
    load_zip.unzip(zip_file, unzip_dir=download_dir)

    dataset_path = download_dir + '/petfinder_for_tutorial'
    train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
    test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
    label_col = 'AdoptionSpeed'

    image_col = 'Images'
    train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
    test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

    train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
    test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

    train_data = pd.concat([train_data] * 20)

    predictor = MultiModalPredictor(label=label_col)
    hyperparameters = {
        "optimization.max_epochs": 1,
        "env.num_gpus": 4,
        "env.per_gpu_batch_size": 12,
        "env.batch_size": 48,
        "env.strategy": "deepspeed",
    }
    predictor.fit(
        train_data=train_data,
        hyperparameters=hyperparameters,
        # time_limit=120, # seconds
    )

    scores = predictor.evaluate(test_data, metrics=["roc_auc"])
    print(f"scores={scores}")

if __name__ == "__main__":
    main()

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@tonyhoo modified the milestone: 0.8 Release (Feb 16, 2023)
@github-actions

Job PR-2932-83b3d62 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2932/83b3d62/index.html

@liangfu changed the title from "[multimodal] speedup fusion model training with deepspeed strategy" to "[Post 0.7][multimodal] speedup fusion model training with deepspeed strategy" on Feb 17, 2023
@github-actions

Job PR-2932-f4913f0 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2932/f4913f0/index.html

@Raldir (Collaborator) left a comment

Nice catch and clean fix. LGTM!

@liangfu merged commit a095330 into autogluon:master on Feb 24, 2023
@liangfu deleted the fusion-deepspeed-1 branch on February 24, 2023 18:43
@FANGAreNotGnu (Contributor) left a comment

LGTM!

shchur pushed a commit to shchur/autogluon that referenced this pull request on Feb 27, 2023