
Switch from nn.DataParallel to nn.DistributedDataParallel for realtime inference #3565

Draft · wants to merge 10 commits into master

Conversation

tonyhoo (Collaborator) commented Oct 6, 2023

Issue #, if available:

Description of changes:
Switch from DataParallel to DistributedDataParallel, as recommended in the latest PyTorch docs.

However, after the switch, real-time inference speed is impacted significantly due to the overhead of spawning new processes, collecting results from those processes, and moving data between the CPU and GPUs. See the plot below:

[Figure: execution_time_plot.png, box plot of per-run execution time (s) for the ddp and dp strategies]

Script used to generate the plot (requires this PR's code changes):

import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from autogluon.multimodal import MultiModalPredictor

warnings.filterwarnings('ignore')
np.random.seed(123)

if __name__ == '__main__':
    download_dir = './ag_multimodal_tutorial'
    dataset_path = f'{download_dir}/petfinder_for_tutorial'
    test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
    label_col = 'AdoptionSpeed'

    predictor = MultiModalPredictor.load("/home/ubuntu/workplace/autogluon/autogluon/multimodal/AutogluonModels/ag-20231016_231146")

    strategies = ["ddp", "dp"]

    num_runs = 30  # number of runs for each strategy
    results = []
    for strategy in strategies:
        for i in range(num_runs):
            start = time.time()
            # the 'strategy' keyword requires this PR's changes
            predictions = predictor.predict(test_data.drop(columns=label_col), realtime=True, strategy=strategy)
            end = time.time()
            if i == 0:
                continue  # discard the first run as warm-up
            results.append({'strategy': strategy, 'execution_time': end - start})

    results_df = pd.DataFrame(results)

    # save results to a csv file
    results_df.to_csv("results.csv", index=False)

    # print the average execution time for each strategy
    average_times = results_df.groupby('strategy')['execution_time'].mean()
    print(average_times)

    # plot the results; save before show() so the written figure is not blank
    results_df.boxplot(column='execution_time', by='strategy')
    plt.ylabel('Execution Time (s)')
    plt.savefig('execution_time_plot.png')
    plt.show()

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@tonyhoo added the label "resource: GPU" (Related to GPU) on Oct 12, 2023
@tonyhoo changed the title from "[Draft] Switch from nn.DataParallel to nn.DistributedDataParallel for realtime inference" to "[WIP] Switch from nn.DataParallel to nn.DistributedDataParallel for realtime inference" on Oct 12, 2023
@github-actions commented:
Job PR-3565-9201202 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3565/9201202/index.html

@tonyhoo changed the title from "[WIP] Switch from nn.DataParallel to nn.DistributedDataParallel for realtime inference" to "Switch from nn.DataParallel to nn.DistributedDataParallel for realtime inference" on Oct 16, 2023
@@ -176,7 +176,7 @@ def infer_batch(
     device = torch.device(device_type)
     batch_size = len(batch[next(iter(batch))])
     if 1 < num_gpus <= batch_size:
-        model = nn.DataParallel(model)
+        model = nn.parallel.DistributedDataParallel(model)
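
For context on the hunk above: unlike DataParallel, DistributedDataParallel expects one process per GPU, each of which must join a process group before the wrapper can be constructed. A minimal, hypothetical sketch of that setup (the toy nn.Linear model and all names below are placeholders, not AutoGluon code), illustrating where the per-call spawn and synchronization overhead discussed in this PR comes from:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn

def run_ddp_inference(rank, world_size):
    # Every replica must join the process group before DDP can wrap the model;
    # DataParallel needs none of this setup.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(16, 4).to(rank)  # toy stand-in for the real model
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    with torch.no_grad():
        shard = torch.randn(8, 16, device=rank)  # this rank's slice of the batch
        _ = ddp_model(shard)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # One process per GPU is spawned, unlike DataParallel's in-process scatter/gather.
    mp.spawn(run_ddp_inference, args=(world_size,), nprocs=world_size)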
Contributor commented:
Did you test realtime inference in a multi-GPU environment and compare the inference times?

tonyhoo (Collaborator, Author) replied:
Updated the code and added testing results for 30 runs each of DDP and DP. DDP has much higher latency for real-time inference, largely due to the results-collection operation. Based on these results, I don't recommend using DDP for real-time inference.
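
For anyone reading along, a hypothetical sketch of the results-collection step described above (gather_predictions and its inputs are illustrative, not the actual AutoGluon code): with DDP each rank holds only its own shard of the predictions, so a collective call is needed per request to reassemble the full output, and that per-request synchronization is the latency cost showing up in the plot.

import torch.distributed as dist

def gather_predictions(local_preds, world_size):
    # Each rank contributes its shard; all_gather_object pickles the objects
    # and moves them across process boundaries, blocking until every rank
    # arrives. This per-request synchronization dominates DDP latency here.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_preds)
    # Flatten the per-rank shards back into a single prediction list.
    return [p for shard in gathered for p in shard]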

@tonyhoo tonyhoo marked this pull request as draft October 23, 2023 21:13