[Post 0.7][multimodal] speedup fusion model training with deepspeed strategy #2932
Issue #, if available:
DeepSpeed saves model and optimizer states as sharded checkpoints in a separate folder (even when using a single GPU). Previously, the conversion to a regular state dict took place in the `_update_best_and_save` function. However, converting the ZeRO checkpoint to a state dict there is a bad idea, since multiple checkpointing processes (when using DDP with multiple GPUs) would call this function almost simultaneously.

Description of changes:
Move the conversion into the `_load_state_dict` function, so that the conversion is done only once, even when using DDP with multiple GPUs.
Results:
Tested on a g4dn.12xl instance with 4x Tesla T4 GPUs.
Code to reproduce the results:
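The reproduction snippet is not included in this excerpt. The following is a hypothetical sketch of the kind of run described above; the `env.strategy` value (`"deepspeed_ddp"`) and the dataset are assumptions, not the authors' exact setup.

```python
# Hypothetical reproduction sketch: fit an AutoGluon multimodal fusion model
# on 4 GPUs with a DeepSpeed strategy (strategy name is an assumption).
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

train_df = pd.read_csv("train.csv")  # placeholder multimodal dataset

predictor = MultiModalPredictor(label="label")
predictor.fit(
    train_data=train_df,
    hyperparameters={
        "env.num_gpus": 4,                # e.g. the 4x T4 setup noted above
        "env.strategy": "deepspeed_ddp",  # assumed name of the DeepSpeed strategy
    },
)
```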
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.