[NLP] Evaluate batched inference calls singularly #2538
Merged
Batched inference calls use more memory, which can lead to OOM errors in extreme cases. This change iterates over the requests in a batch, evaluating them one at a time, as sketched below.
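A minimal sketch of the idea (not the actual ml-cpp code; the types and function names here are hypothetical): rather than handing the whole batch to a single inference call, loop over the requests and evaluate each one singularly, so peak memory stays close to that of a single request.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for one inference request and its result.
struct InferenceRequest {
    std::vector<int> tokenIds; // token IDs for one document
};

struct InferenceResult {
    std::vector<double> embedding; // model output for one document
};

// Stand-in for the model's forward pass on a single request.
InferenceResult evaluateSingle(const InferenceRequest& request) {
    // ... run the model on one request's tokens ...
    return InferenceResult{std::vector<double>(request.tokenIds.size(), 0.0)};
}

// Before: the whole batch was evaluated in one call, so peak memory grew
// with the batch size and could trigger OOM for large batches.
// After: iterate over the batch and evaluate each request one at a time.
std::vector<InferenceResult> evaluateBatch(const std::vector<InferenceRequest>& batch) {
    std::vector<InferenceResult> results;
    results.reserve(batch.size());
    for (const InferenceRequest& request : batch) {
        results.push_back(evaluateSingle(request));
    }
    return results;
}
```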
Benchmarking batched against un-batched evaluation shows that memory usage is significantly lower when requests are evaluated one at a time, while the total inference time is similar in both cases. The benchmark data was generated with the ELSER model using batches of different sizes; each item in a batch contained 512 tokens. Inference time is the time to process the entire batch, whether singularly or all at once.