Vllm model handler #32410
Conversation
Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @shunping for label python.

The PR bot will only process comments in the main thread (not review comments).
formatted = []
for message in messages:
  formatted.append({"role": message.role, "content": message.content})
completion = client.chat.completions.create(
What happens if a server exception occurs during the query?
It will bubble up as an exception and either be retried or sent to a DLQ depending on user configuration. With that said, previously if the server died it would have just stayed dead. I added some logic to handle/avoid that problem by restarting the server if we can't connect.
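For illustration, here is a minimal sketch of that restart-on-connection-failure idea; the model wrapper's get_server_port()/restart_server() helpers and the model_name attribute are hypothetical stand-ins, not the exact code in this PR, and it assumes vLLM's OpenAI-compatible endpoint:

import openai

def _get_client(port):
  # vLLM serves an OpenAI-compatible API locally; the api_key is unused.
  return openai.OpenAI(base_url=f"http://localhost:{port}/v1", api_key="none")

def query_with_restart(model_wrapper, messages, retries=3):
  """Queries the local vLLM server, restarting it if the connection is lost."""
  last_error = None
  for _ in range(retries):
    client = _get_client(model_wrapper.get_server_port())
    try:
      return client.chat.completions.create(
          model=model_wrapper.model_name, messages=messages)
    except openai.APIConnectionError as e:
      # The server process likely died; restart it and try again.
      last_error = e
      model_wrapper.restart_server()
  # Out of retries: let the exception bubble up so the runner can retry the
  # bundle or route the element to a dead-letter queue.
  raise last_error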
client = getVLLMClient(model.get_server_port())
inference_args = inference_args or {}
predictions = []
# TODO(https://github.com/apache/beam/issues/32528): We should add support
Do they support batch mode in the query?
They do not, as far as I can tell, unfortunately. I plan on addressing this in a follow-up PR, though: with vLLM's dynamic batching it still almost certainly makes sense to do something here.
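One possible shape for that follow-up, sketched under the assumption that the vLLM server's dynamic batcher groups concurrent in-flight requests; the function name and arguments are illustrative, not the planned API:

from concurrent.futures import ThreadPoolExecutor

def predict_batch(client, model_name, batch, inference_args=None):
  """Sends one chat request per element concurrently.

  Issuing the requests in parallel lets the server form larger internal
  batches than a serial loop would. `client` is an OpenAI-compatible client
  pointed at the local server; `batch` is a list of message lists.
  """
  inference_args = inference_args or {}

  def one(messages):
    formatted = [{"role": m.role, "content": m.content} for m in messages]
    return client.chat.completions.create(
        model=model_name, messages=formatted, **inference_args)

  with ThreadPoolExecutor(max_workers=max(len(batch), 1)) as pool:
    return list(pool.map(one, batch))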
Thanks! LGTM
This PR adds a model handler for running inference using vLLM. To leverage vLLM's dynamic batching, this spins up a central vLLM serving process, coordinated by a single global model wrapper, which individual worker threads can send RPCs to.
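Roughly, the serving side could look like the sketch below. getVLLMClient and get_server_port appear in the diff above, but everything else (the vllm serve invocation, the port choice, the lack of readiness checks and cleanup) is a simplified assumption rather than the handler's actual implementation:

import subprocess
import openai

class _VLLMModelServer:
  """Wraps a single vllm serve subprocess shared by all worker threads."""

  def __init__(self, model_name, port=8000):
    self._model_name = model_name
    self._port = port
    self._process = None
    self.start_server()

  def start_server(self):
    # (Re)start the server if it has never started or has died.
    if self._process is None or self._process.poll() is not None:
      # `vllm serve <model>` exposes an OpenAI-compatible HTTP API on --port.
      self._process = subprocess.Popen(
          ['vllm', 'serve', self._model_name, '--port', str(self._port)])
      # A real handler would also poll the endpoint until it is ready.

  def get_server_port(self):
    return self._port

def getVLLMClient(port):
  # Local server, so the API key is unused but required by the client.
  return openai.OpenAI(base_url=f'http://localhost:{port}/v1', api_key='none')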
Testing this is tricky since it requires a GPU; I followed the same pattern we use for TensorRT: launch some examples directly on Dataflow and validate that they run to completion. I didn't do any result validation since results produced by an LLM are non-deterministic.
Part of #32528
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.