
[NLP] Evaluate batched inference calls singularly #2538

Merged
davidkyle merged 7 commits into elastic:main from un-batch on Jun 20, 2023

Conversation

davidkyle (Member) commented:

Batch inference calls use more memory, which can lead to OOM errors in extreme cases. This change iterates over the requests in a batch, evaluating them one at a time.
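For illustration only, a minimal C++ sketch of the idea (the types and function names below are hypothetical, not the actual ml-cpp code): rather than stacking all requests into one padded tensor and running a single forward pass, the handler loops over the batch and evaluates each request on its own, so peak memory stays roughly flat as the batch grows.

```cpp
// Hypothetical sketch: evaluate the items of a batched inference request
// one at a time instead of in a single combined forward pass.
#include <iostream>
#include <vector>

// Placeholder for a single tokenised request (e.g. up to 512 tokens).
struct SRequest {
    std::vector<int> s_TokenIds;
};

// Placeholder for one inference result.
struct SResult {
    std::vector<double> s_Scores;
};

// Stand-in for the model's forward pass on a single request.
SResult inferSingle(const SRequest& request) {
    // A real implementation would run the model here.
    return SResult{std::vector<double>(request.s_TokenIds.size(), 0.0)};
}

// Iterate over the batch and evaluate each request singularly, so peak
// memory is driven by one request rather than the whole stacked batch.
std::vector<SResult> evaluateBatch(const std::vector<SRequest>& batch) {
    std::vector<SResult> results;
    results.reserve(batch.size());
    for (const SRequest& request : batch) {
        results.push_back(inferSingle(request));
    }
    return results;
}

int main() {
    // Ten requests of 512 tokens each, mirroring the benchmark setup below.
    std::vector<SRequest> batch(10, SRequest{std::vector<int>(512, 1)});
    std::vector<SResult> results = evaluateBatch(batch);
    std::cout << "Evaluated " << results.size() << " requests singularly\n";
    return 0;
}
```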

Comparing batched to un-batched evaluation, benchmarking shows that memory usage is significantly lower and the total inference time is similar in both cases. The benchmark data was generated with the ELSER model using batches of different sizes; each item in the batch contained 512 tokens. Inference Time is the time to process the entire batch, whether evaluated singularly or all at once.

| Num items in request | Memory Max RSS (MB) | Batched Memory Max RSS (MB) | Inference Time (ms) | Batched Inference Time (ms) |
|---|---|---|---|---|
| 0 | 946 | 943 | 0 | 0 |
| 10 | 2605 | 1219 | 5022 | 5309 |
| 20 | 4237 | 1234 | 9717 | 9478 |
| 30 | 5960 | 1239 | 14434 | 14408 |
| 40 | 6032 | 1251 | 19902 | 19396 |
| 50 | 6616 | 1257 | 24853 | 24112 |

Co-authored-by: David Roberts <dave.roberts@elastic.co>
droberts195 (Contributor) left a comment:

LGTM

davidkyle merged commit 23b6900 into elastic:main on Jun 20, 2023
13 checks passed
davidkyle deleted the un-batch branch on June 20, 2023 at 08:57