
[Performance]Set persistent_worker in prediction dataloader to False + Remove seed in inference #2891

Merged (15 commits) Feb 11, 2023

Conversation

@sxjscience (Collaborator) commented Feb 10, 2023

Issue #, if available:

Description of changes:

  • Disable persistent_workers for the evaluation dataloader.
  • Remove the fixed seed in .predict().

Update: @zhiqiangdon noticed that we can simply set persistent_workers to False in the evaluation dataloader.
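
For context, here is a minimal sketch (plain PyTorch, not the actual AutoGluon code) of what the flag controls: with persistent_workers=True the DataLoader keeps its worker processes alive after a pass over the data, which, per the discussion below, interacts badly with repeated .predict() calls when combined with DP; persistent_workers=False tears the workers down after each pass.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Toy dataset standing in for the evaluation data.
    dataset = TensorDataset(torch.randn(64, 10), torch.zeros(64, dtype=torch.long))

    # persistent_workers=False (the change in this PR) shuts the worker processes
    # down once the loader is exhausted, so repeated predictions do not keep
    # idle workers around; persistent_workers=True would keep them alive.
    eval_loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=2,
        shuffle=False,
        persistent_workers=False,
    )

    for batch, _ in eval_loader:
        pass  # run inference on each batch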

With persistent_workers enabled, there seems to be additional overhead in creating the prediction / evaluation dataloader. I therefore compared the speed with num_workers_evaluation set to 0 and 2, and noticed that num_workers_evaluation=0 is about 10x faster:

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset

def test_num_workers_evaluation(num_workers=0, nrepeat=20):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    predictor_2 = MultiModalPredictor(
        label="label",
    )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
    )
    start = time.time()
    for i in range(nrepeat):
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


print("Total time for num_workers=0", test_num_workers_evaluation(0, 20))
print("Total time for num_workers=2", test_num_workers_evaluation(2, 20))

Output:

Total time for num_workers=0 14.960921049118042
Total time for num_workers=2 166.43815088272095

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@zhiqiangdon (Contributor) left a comment

I just did some tests with your code snippet. I noticed that the cost mainly comes from the hanging between different predictions. I guess the issue comes from using DP together with persistent_workers=True. I changed persistent_workers=False only for the prediction dataloader and found that num_workers_evaluation=2 takes 13.7s, less than the 15.9s of num_workers_evaluation=0. So, num_workers_evaluation=2 is faster. This experiment only uses this toy dataset with small images and a single GPU. I think the advantage of num_workers_evaluation=2 would be much larger for larger images and multiple GPUs. My suggestion is to set persistent_workers=False for the prediction dataloader (https://github.com/autogluon/autogluon/blob/master/multimodal/src/autogluon/multimodal/data/datamodule.py#L221).
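
A minimal sketch of the suggested change, assuming a LightningDataModule-style predict_dataloader; the class and attribute names below are illustrative placeholders, not the actual AutoGluon datamodule code:

import pytorch_lightning as pl
from torch.utils.data import DataLoader


class PredictDataModule(pl.LightningDataModule):  # hypothetical stand-in for AutoGluon's datamodule
    def __init__(self, predict_dataset, batch_size, num_workers):
        super().__init__()
        self.predict_dataset = predict_dataset
        self.batch_size = batch_size
        self.num_workers = num_workers

    def predict_dataloader(self):
        return DataLoader(
            self.predict_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=False,
            # The suggestion above: do not keep workers alive between predictions,
            # which avoids the hang observed with DP + persistent workers.
            persistent_workers=False,
        )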

@sxjscience (Collaborator, Author)

Sounds good. That's another way to fix the issue (the root cause should be the combination of DP + persistent workers). Maybe both are using multiprocessing, so it's creating too many processes?

@sxjscience sxjscience changed the title [Performance]Set num_workers_evaluation to be 0 [Performance]Set persistent_worker in prediction dataloader to False Feb 10, 2023
@sxjscience sxjscience added the "model list checked" label Feb 10, 2023
@zhiqiangdon (Contributor)

> Sounds good. That's another way to fix the issue (the root cause should be the combination of DP + persistent workers). Maybe both are using multiprocessing, so it's creating too many processes?

Yeah, it seems that Lightning doesn't clean up the processes at the end of prediction.

@zhiqiangdon (Contributor) left a comment

LGTM. Thanks for investigating the issue!

@github-actions

Job PR-2891-527f600 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2891/527f600/index.html

@github-actions

Job PR-2891-0c63c36 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2891/0c63c36/index.html

@liangfu (Collaborator) commented Feb 11, 2023

Evaluated on multi-GPU:

master
Total time for num_workers=0 13.779603004455566
Total time for num_workers=2: <failure>

PR
Total time for num_workers=0 13.705002784729004
Total time for num_workers=2 20.062042236328125

@sxjscience (Collaborator, Author)

@zhiqiangdon In my local test, I also found that num_workers=0 is faster than num_workers=2 for evaluation.

@sxjscience (Collaborator, Author)

I changed num_workers_evaluation back to 0. The reason is that on a single GPU, num_workers=0 is faster than num_workers=2 when persistent_workers is False. In addition, on multi-GPU, @liangfu observed that num_workers=0 is also faster.

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset

def test_num_workers_evaluation(num_workers=0, nrepeat=1):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    predictor_2 = MultiModalPredictor(
        label="label",
    )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
    )
    prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    start = time.time()
    for i in range(nrepeat):
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


print("Total time for num_workers=0", test_num_workers_evaluation(0, 1))
print("Total time for num_workers=2", test_num_workers_evaluation(2, 1))

Output:

Total time for num_workers=0 0.7635805606842041
Total time for num_workers=2 0.8145883083343506

@sxjscience (Collaborator, Author)

@zhiqiangdon I changed back to num_workers_evaluation=0. See the comment above.

@Innixma (Contributor) commented Feb 11, 2023

@sxjscience It would be good to measure the speed difference at different inference batch sizes.

For example, the time taken for 1 sample, time taken for 10, time taken for 100, 1000, 10000, etc.

You can see another scenario where we did this comparison (refer to the plot): #2395

It might be the case that the optimal num_workers changes depending on the batch size.

@zhiqiangdon (Contributor) commented Feb 11, 2023

> @zhiqiangdon I changed back to num_workers_evaluation=0. See the comment above.

Can you try 10x more samples for one prediction?

@sxjscience (Collaborator, Author)

@zhiqiangdon have you seen a significant speed boost for larger images? I feel that the performance gain of num_workers=2 might be negligible.

Maybe we can just adopt an accelerated version of Pillow so that decoding is faster?
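
One way to check whether image decoding (rather than dataloader workers) dominates the prediction time is to time Pillow decoding directly; the sketch below is illustrative and assumes the Shopee dataframe exposes file paths in an "image" column:

import time
from PIL import Image
from autogluon.multimodal.utils.misc import shopee_dataset

train_data, test_data = shopee_dataset(download_dir="./")

# Time pure PIL decode of the test images to estimate how much of the
# prediction wall time is spent in image I/O + decoding.
start = time.time()
for path in test_data["image"]:  # assumes an "image" column of file paths
    with Image.open(path) as img:
        img.load()  # force the actual decode
print(f"PIL decode of {len(test_data)} images: {time.time() - start:.3f}s")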

@zhiqiangdon (Contributor) left a comment

I did a simple experiment by repeating the test data 10x, still using a single GPU.

test_data_1 = pd.concat([test_data_1]*10, ignore_index=True)

Here is the time comparison with the 10x test data:

  • num_workers_evaluation=0, persistent_workers=False: 32.4s
  • num_workers_evaluation=2, persistent_workers=False: 13.7s

@sxjscience (Collaborator, Author)

@zhiqiangdon can you share the code for reproducing the performance numbers so I can verify?

@zhiqiangdon (Contributor) left a comment

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset
import pandas as pd


def test_num_workers_evaluation(num_workers=0, nrepeat=20):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    # train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    # predictor_2 = MultiModalPredictor(
    #     label="label",
    # )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
        time_limit=0,
    )
    start = time.time()
    print("start prediction...\n")
    test_data_1 = pd.concat([test_data_1]*10, ignore_index=True)
    for i in range(nrepeat):
        print(f"iter {i}")
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


if __name__ == "__main__":
    print("Total time for num_workers=0", test_num_workers_evaluation(0, 5))
    print("Total time for num_workers=2", test_num_workers_evaluation(2, 5))

Remember to set persistent_workers=False for the prediction dataloader.

@sxjscience (Collaborator, Author)

Sounds good. Will check + revise accordingly.

@liangfu (Collaborator) commented Feb 11, 2023

Here are some benchmark results for varying batch_size and num_workers

[benchmark plot: total prediction time vs. batch_size for num_workers=0 and num_workers=2]

Key takeaways

  • with num_workers=2, batch_size=64, total_time=10.786 seconds, there is an error: ValueError: semaphore or lock released too many times
  • num_workers=0 is
    • slow
    • scales linearly, and is reliable
  • num_workers=2 is
    • unreliable when launching multiple batch predictions, due to the overhead of creating / deleting the multi-worker dataloader
    • more efficient for a single large batch prediction

Here is the code to reproduce the experiment

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset
import pandas as pd

def test_num_workers_evaluation():
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 1,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": 0,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
    )
    # Repeat (train, test) 10 times to build a larger prediction set.
    test_data_1 = pd.concat([train_data_1, test_data_1] * 10)
    nrepeat = 1
    for num_workers in [0, 2]:
        for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
            test_data = test_data_1.head(batch_size)
            predictor_1._config.env.num_workers_evaluation = num_workers
            start = time.time()
            for i in range(nrepeat):
                prediction_1 = predictor_1.predict(test_data, as_pandas=False)
            end = time.time()
            print(f"num_workers={num_workers}, batch_size={batch_size}, total_time={end-start:.3f} seconds")

if __name__=="__main__":
    test_num_workers_evaluation()

@liangfu (Collaborator) commented Feb 11, 2023

Here are some more results

[benchmark plot: additional prediction-time results for varying batch_size and num_workers]


> It would be good to measure the speed difference at different inference batch sizes.
> For example, the time taken for 1 sample, time taken for 10, time taken for 100, 1000, 10000, etc.
> You can see another scenario where we did this comparison (refer to the plot): #2395
> It might be the case that optimal num_workers changes depending on the batch size

Agree! For num_workers=0, it would take about 1 min to predict for batch_size=10000, while num_workers=2 would be 2x ~ 3x faster. In that case, the overhead of creating and releasing the multiprocessing dataloader can simply be ignored. However, if we want this batch-prediction speedup, we need to ensure SAGEMAKER_MODEL_SERVER_WORKERS=1 in the DLC inference Docker image.
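
For reference, one way to pin that environment variable when deploying with the SageMaker Python SDK; the image URI, model artifact, role, and instance type below are placeholders, not values from this PR:

from sagemaker.model import Model

# Placeholders -- substitute the actual DLC inference image, model artifact, and IAM role.
model = Model(
    image_uri="<autogluon-inference-dlc-image-uri>",
    model_data="s3://<bucket>/<path>/model.tar.gz",
    role="<sagemaker-execution-role>",
    # Keep a single model-server worker so the multi-worker dataloader
    # speedup is not undone by competing server processes.
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "1"},
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")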

@zhiqiangdon (Contributor) left a comment

@liangfu Did you set persistent_workers=False for the prediction dataloader in your experiments? We would like to see the comparison done with persistent_workers=False.

This reverts commit 327181c.
This reverts commit cbc7bf0.
This reverts commit 4a9ea31.
This reverts commit cbc7bf0.
@sxjscience sxjscience changed the title [Performance]Set persistent_worker in prediction dataloader to False [Performance]Set persistent_worker in prediction dataloader to False + Remove seed in inference Feb 11, 2023
@sxjscience (Collaborator, Author)

Thanks @liangfu and @zhiqiangdon! I revised the PR according to the performance benchmark.

@github-actions

Job PR-2891-4a9ea31 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2891/4a9ea31/index.html

@github-actions

Job PR-2891-64fcc09 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2891/64fcc09/index.html

@sxjscience sxjscience merged commit 69dc475 into autogluon:master Feb 11, 2023