[Performance] Set persistent_worker in prediction dataloader to False + Remove seed in inference #2891
Conversation
I just did some tests with your code snippet. I noticed that the cost mainly comes from the hang between consecutive predictions. I guess the issue comes from combining DP with persistent_workers=True. I tried setting persistent_workers=False only for the prediction dataloader and found that num_workers_evaluation=2 takes 13.7s, less than the 15.9s of num_workers_evaluation=0. So, num_workers_evaluation=2 is faster. This experiment only uses the toy dataset with small images and a single GPU. I think the advantage of num_workers_evaluation=2 would be much larger for larger images and multiple GPUs. My suggestion is to set persistent_workers=False for the prediction dataloader (https://github.com/autogluon/autogluon/blob/master/multimodal/src/autogluon/multimodal/data/datamodule.py#L221).
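For illustration, a minimal sketch of that suggestion (hypothetical class and attribute names, not the actual AutoGluon datamodule code): a Lightning-style DataModule that keeps persistent workers for training but disables them for the prediction dataloader.

# Hypothetical sketch: disable persistent workers only for the prediction dataloader.
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class ToyDataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, predict_dataset, num_workers=2, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.predict_dataset = predict_dataset
        self.num_workers = num_workers
        self.batch_size = batch_size

    def train_dataloader(self):
        # Persistent workers help when the same dataloader is iterated every epoch.
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=self.num_workers > 0,
        )

    def predict_dataloader(self):
        # For one-off predictions, persistent workers only leave processes
        # hanging around after predict() returns, so turn them off.
        return DataLoader(
            self.predict_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=False,
        )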
Sounds good. That's another way to fix the issue (the root cause should be the combination of DP + persistent workers). Maybe both are using multiprocessing so it's creating too many processes?
Yeah, it seems that Lightning doesn't clean up the processes at the end of prediction.
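As an illustrative check (not from this PR), one could count leftover dataloader worker processes around a predict() call; this sketch assumes psutil is installed and a fitted predictor plus test data as in the snippets below.

# Illustrative check for leftover dataloader worker processes after predict().
# Assumes a fitted MultiModalPredictor `predictor` and a DataFrame `test_data`.
import psutil


def count_children():
    # DataLoader workers are spawned as children of the current process.
    return len(psutil.Process().children(recursive=True))


before = count_children()
predictor.predict(test_data, as_pandas=False)
after = count_children()
print(f"child processes before: {before}, after: {after}")
# If workers persist after prediction (e.g. with persistent_workers=True),
# `after` stays larger than `before`.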
LGTM. Thanks for investigating the issue!
Evaluated on multi-GPU.
@zhiqiangdon In my local test, I also find that
I changed back to use

import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset


def test_num_workers_evaluation(num_workers=0, nrepeat=1):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    predictor_2 = MultiModalPredictor(
        label="label",
    )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
    )
    prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)  # warm-up prediction, not timed
    start = time.time()
    for i in range(nrepeat):
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


print("Total time for num_workers=0", test_num_workers_evaluation(0, 1))
print("Total time for num_workers=2", test_num_workers_evaluation(2, 1))

Output is
@zhiqiangdon I changed back to
@sxjscience It would be good to measure the speed difference at different inference batch sizes. For example, the time taken for 1 sample, for 10, for 100, 1000, 10000, etc. You can see another scenario where we did this comparison (refer to the plot): #2395. It might be the case that the optimal num_workers changes depending on the batch size.
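A rough sketch of such a sweep (illustrative only, not the benchmark code used later in this thread; it reuses the shopee toy dataset from the snippets here and repeats it to reach larger sizes):

# Illustrative sweep: time predict() for several dataset sizes and worker counts.
import time
import pandas as pd
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset

train_data, test_data = shopee_dataset(download_dir="./")

for num_workers in (0, 2):
    predictor = MultiModalPredictor(label="label")
    predictor.fit(
        train_data=train_data,
        hyperparameters={
            "model.names": ["timm_image"],
            "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
            "env.num_workers_evaluation": num_workers,
        },
        time_limit=0,  # skip real training; we only care about prediction speed
        seed=42,
    )
    for repeat in (1, 10, 100):
        data = pd.concat([test_data] * repeat, ignore_index=True)
        start = time.time()
        predictor.predict(data, as_pandas=False)
        print(f"num_workers={num_workers}, n={len(data)}: {time.time() - start:.1f}s")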
Can you try 10x more samples for one prediction?
@zhiqiangdon have you seen a significant speed boost for larger images? I feel that the performance gain of “num_workers=2” might be negligible. Maybe we can just adopt an accelerated version of Pillow so the speed can be faster?
I did a simple experiment by repeating the test data 10x, still using a single GPU.
test_data_1 = pd.concat([test_data_1]*10, ignore_index=True)
Here is the time comparison with 10x testing data.
- num_workers_evaluation=0, persistent_workers=False: 32.4s
- num_workers_evaluation=2, persistent_workers=False: 13.7s
@zhiqiangdon can you share the code for reproducing the performance numbers so I can verify?
import time
from autogluon.multimodal import MultiModalPredictor
from autogluon.multimodal.utils.misc import shopee_dataset
import pandas as pd


def test_num_workers_evaluation(num_workers=0, nrepeat=20):
    download_dir = "./"
    train_data_1, test_data_1 = shopee_dataset(download_dir=download_dir)
    # train_data_2, test_data_2 = shopee_dataset(download_dir=download_dir, is_bytearray=True)
    predictor_1 = MultiModalPredictor(
        label="label",
    )
    # predictor_2 = MultiModalPredictor(
    #     label="label",
    # )
    model_names = ["timm_image"]
    hyperparameters = {
        "optimization.max_epochs": 2,
        "model.names": model_names,
        "model.timm_image.checkpoint_name": "swin_tiny_patch4_window7_224",
        "env.num_workers": 2,
        "env.num_workers_evaluation": num_workers,
    }
    predictor_1.fit(
        train_data=train_data_1,
        hyperparameters=hyperparameters,
        seed=42,
        time_limit=0,
    )
    start = time.time()
    print("start prediction...\n")
    test_data_1 = pd.concat([test_data_1] * 10, ignore_index=True)
    for i in range(nrepeat):
        print(f"iter {i}")
        prediction_1 = predictor_1.predict(test_data_1, as_pandas=False)
    end = time.time()
    return end - start


if __name__ == "__main__":
    print("Total time for num_workers=0", test_num_workers_evaluation(0, 5))
    print("Total time for num_workers=2", test_num_workers_evaluation(2, 5))
Remember to set persistent_workers=False for the prediction dataloader.
Sounds good. Will check + revise accordingly.
Here are some benchmark results for varying batch_size and num_workers. Key takeaways:
Here is the code to reproduce the experiment:
Here are some more results
Agree! For num_workers=0, it would take about 1 min to predict for batch_size=10000, while num_workers=2 would be 2x ~ 3x faster. In that case, the overhead of creating and releasing the multiprocessing dataloader can simply be ignored. However, if we want such batch prediction speedup, we need to ensure
@liangfu Did you set persistent_workers=False for the prediction dataloader in your experiments? We would like to see the comparison done with persistent_workers=False.
Thanks @liangfu and @zhiqiangdon! I revised the PR according to the performance benchmark.
Issue #, if available:
Description of changes:
.predict().
*Update: @zhiqiangdon noticed that we can simply set persistent_worker to False in the evaluation dataloader.
After setting persistent_worker, there seems to be additional overhead in creating the prediction / evaluation dataloader. I thus compared the speed of num_workers_evaluation set to 0 and 2, and noticed that num_workers_evaluation=0 will be 10x faster.
Output:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.