[Performance] Set persistent_worker in prediction dataloader to False + Remove seed in inference #2891
Conversation
I just did some tests with your code snippet. I noticed that the cost mainly comes from the hang between consecutive predictions. I guess the issue comes from combining DP with persistent_workers=True. I tried setting persistent_workers=False only for the prediction dataloader and found that num_workers_evaluation=2 takes 13.7s, less than the 15.9s of num_workers_evaluation=0. So, num_workers_evaluation=2 is faster. This experiment only uses the toy dataset with small images and a single GPU. I think the advantage of num_workers_evaluation=2 would be much larger for larger images and multiple GPUs. My suggestion is to set persistent_workers=False for the prediction dataloader (https://github.com/autogluon/autogluon/blob/master/multimodal/src/autogluon/multimodal/data/datamodule.py#L221).
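For illustration, a minimal sketch of that suggestion (hypothetical class and attribute names, not the actual AutoGluon datamodule code): a Lightning-style DataModule that keeps persistent workers for training but disables them for the prediction dataloader.

# Hypothetical sketch: disable persistent workers only for the prediction dataloader.
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class ToyDataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, predict_dataset, num_workers=2, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.predict_dataset = predict_dataset
        self.num_workers = num_workers
        self.batch_size = batch_size

    def train_dataloader(self):
        # Persistent workers help when the same dataloader is iterated every epoch.
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=self.num_workers > 0,
        )

    def predict_dataloader(self):
        # For one-off predictions, persistent workers only leave processes
        # hanging around after predict() returns, so turn them off.
        return DataLoader(
            self.predict_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=False,
        )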
Sounds good. That's another way to fix the issue (the root cause should be the combination of DP + persistent workers). Maybe both are using multiprocessing so it's creating too many processes?
Yeah, it seems that Lightning doesn't clean up the processes at the end of prediction.
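As an illustrative check (not from this PR), one could count leftover dataloader worker processes around a predict() call; this sketch assumes psutil is installed and a fitted predictor plus test data as in the snippets below.

# Illustrative check for leftover dataloader worker processes after predict().
# Assumes a fitted MultiModalPredictor `predictor` and a DataFrame `test_data`.
import psutil


def count_children():
    # DataLoader workers are spawned as children of the current process.
    return len(psutil.Process().children(recursive=True))


before = count_children()
predictor.predict(test_data, as_pandas=False)
after = count_children()
print(f"child processes before: {before}, after: {after}")
# If workers persist after prediction (e.g. with persistent_workers=True),
# `after` stays larger than `before`.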
LGTM. Thanks for investigating the issue!
Evaluated on multi-GPU.
@zhiqiangdon In my local test, I also find that
I changed back to use

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset


def test_num_workers_evaluation(num_workers=0, nrepeat=1):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    predictor_2 = MultiModalPredictor(
        label="label",
    )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
    )
    prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)  # warm-up prediction, not timed
    start = time.time()
    for i in range(nrepeat):
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


print("Total time for num_workers=0", test_num_workers_evaluation(0, 1))
print("Total time for num_workers=2", test_num_workers_evaluation(2, 1))

Output is
@zhiqiangdon I changed back to
@sxjscience It would be good to measure the speed difference at different inference batch sizes. For example, the time taken for 1 sample, for 10, for 100, 1000, 10000, etc. You can see another scenario where we did this comparison (refer to the plot): #2395. It might be the case that the optimal num_workers changes depending on the batch size.
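A rough sketch of such a sweep (illustrative only, not the benchmark code used later in this thread; it reuses the shopee toy dataset from the snippets here and repeats it to reach larger sizes):

# Illustrative sweep: time predict() for several dataset sizes and worker counts.
import time
import pandas as pd
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset

train_data, test_data = shopee_dataset(download_dir="./")

for num_workers in (0, 2):
    predictor = MultiModalPredictor(label="label")
    predictor.fit(
        train_data=train_data,
        hyperparameters={
            "model.names": ["timm_image"],
            "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
            "env.num_workers_evaluation": num_workers,
        },
        time_limit=0,  # skip real training; we only care about prediction speed
        seed=42,
    )
    for repeat in (1, 10, 100):
        data = pd.concat([test_data] * repeat, ignore_index=True)
        start = time.time()
        predictor.predict(data, as_pandas=False)
        print(f"num_workers={num_workers}, n={len(data)}: {time.time() - start:.1f}s")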
Can you try 10x more samples for one prediction?
@zhiqiangdon have you seen a significant speed boost for larger images? I feel that the performance gain of “num_workers=2” might be negligible. Maybe we can just adopt an accelerated version of Pillow so the speed can be faster?
I did a simple experiment by repeating the test data 10x, still using a single GPU.
test_data_1 = pd.concat([test_data_1]*10, ignore_index=True)
Here is the time comparison with 10x testing data.
- num_workers_evaluation=0, persistent_workers=False: 32.4s
- num_workers_evaluation=2, persistent_workers=False: 13.7s
@zhiqiangdon can you share the code for reproducing the performance numbers so I can verify?
import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset
import pandas as pd


def test_num_workers_evaluation(num_workers=0, nrepeat=20):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    # train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    # predictor_2 = MultiModalPredictor(
    #     label="label",
    # )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
        time_limit=0,
    )
    start = time.time()
    print("start prediction...\n")
    test_data_1 = pd.concat([test_data_1] * 10, ignore_index=True)
    for i in range(nrepeat):
        print(f"iter {i}")
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


if __name__ == "__main__":
    print("Total time for num_workers=0", test_num_workers_evaluation(0, 5))
    print("Total time for num_workers=2", test_num_workers_evaluation(2, 5))
Remember to set persistent_workers=False for the prediction dataloader.
Sounds good. Will check + revise accordingly.
Here are some benchmark results for varying batch_size and num_workers. Key takeaways:
Here is the code to reproduce the experiment:
Here are some more results
Agree! For num_workers=0, it would take about 1 min to predict for batch_size=10000, while num_workers=2 would be 2x ~ 3x faster. In that case, the overhead of creating and releasing the multiprocessing dataloader can simply be ignored. However, if we want such batch prediction speedup, we need to ensure
@liangfu Did you set persistent_workers=False for the prediction dataloader in your experiments? We would like to see the comparison done with persistent_workers=False.
Thanks @liangfu and @zhiqiangdon! I revised the PR according to the performance benchmark.
Issue #, if available:
Description of changes:
.predict().
*Update: @zhiqiangdon noticed that we can simply set persistent_worker to False in the evaluation dataloader.
After setting persistent_worker, there seems to be additional overhead in creating the prediction / evaluation dataloader. I thus compared the speed of num_workers_evaluation set to 0 and 2, and noticed that num_workers_evaluation=0 will be 10x faster.
Output:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.