Update huggingface readme (kserve#3678)
* update wording for huggingface README

small update to make readme easier to understand

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* Update README.md

Signed-off-by: Alexa Griffith agriffith50@bloomberg.net

* Update python/huggingfaceserver/README.md

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* update vllm

Signed-off-by: alexagriffith <agriffith50@bloomberg.net>

* Update README.md

---------

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>
Signed-off-by: Alexa Griffith agriffith50@bloomberg.net
Signed-off-by: alexagriffith <agriffith50@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: asd981256 <asd981256@gmail.com>
3 people authored and asd981256 committed May 14, 2024
1 parent 73bad9a commit 5567a73
Showing 1 changed file with 18 additions and 13 deletions.
31 changes: 18 additions & 13 deletions python/huggingfaceserver/README.md
@@ -1,12 +1,12 @@
# Huggingface Serving Runtime
# HuggingFace Serving Runtime

The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box.
The HuggingFace serving runtime can serve HuggingFace transformer-based models out of the box.
The preprocess and post-process handlers are implemented for different ML tasks, for example text classification,
token classification, text generation, and text2text generation. Depending on your performance requirements, you can choose to run
inference on a more optimized inference engine such as Triton Inference Server or vLLM for text generation.


## Run Huggingface Server Locally
## Run HuggingFace Server Locally

```bash
python -m huggingfaceserver --model_id=bert-base-uncased --model_name=bert
@@ -45,7 +45,7 @@ curl -H "content-type:application/json" -v localhost:8080/v1/models/bert:predict
> 1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
> 2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.
1. Serve the huggingface model using KServe python runtime for both preprocess(tokenization)/postprocess and inference.
1. Serve the BERT model using the KServe HuggingFace python runtime for both preprocess (tokenization)/postprocess and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -70,7 +70,7 @@ spec:
memory: 2Gi
```
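Once the `InferenceService` above is ready, a prediction request can be sent through the cluster ingress. A rough sketch, assuming the resource is named `huggingface-bert` and that `INGRESS_HOST`/`INGRESS_PORT` point at the KServe ingress gateway:
```bash
# Hypothetical names: adjust the InferenceService name and ingress variables for your cluster.
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "content-type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bert:predict" \
  -d '{"instances": ["The capital of France is [MASK]."]}'
```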
2. Serve the huggingface model using triton inference runtime and KServe transformer for the preprocess(tokenization) and postprocess.
2. Serve the BERT model using the Triton inference runtime and a KServe transformer with the HuggingFace runtime for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -111,8 +111,11 @@ spec:
cpu: 100m
memory: 2Gi
```
3. Serve the huggingface model using vllm runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
3. Serve the llama2 model using the KServe HuggingFace vLLM runtime. For the llama2 model, vLLM is supported and used as the default backend.
If vLLM is available for a model, it is set as the default backend; otherwise the KServe HuggingFace runtime is used as a failsafe.
You can find the models supported by vLLM [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -135,10 +138,10 @@ spec:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
```
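Once the vLLM-backed `InferenceService` is ready, a chat completion can be requested through the ingress. A rough sketch, assuming the model is registered under the name `llama2`, the resource is named `huggingface-llama2`, and `INGRESS_HOST`/`INGRESS_PORT` point at the KServe ingress gateway:
```bash
# Hypothetical names: the model and InferenceService names depend on the manifest above and your cluster setup.
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "content-type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Write a short poem about the ocean."}], "stream": false, "max_tokens": 50}'
```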
If vllm needs to be disabled include the flag `--backend=huggingface` in the container args. In this case the KServe python runtime will be used.
If vLLM needs to be disabled, include the flag `--backend=huggingface` in the container args. In this case, the HuggingFace backend is used.
```yaml
apiVersion: serving.kserve.io/v1beta1
@@ -165,18 +168,20 @@ spec:
nvidia.com/gpu: "1"
```
Perform the inference for vllm specific runtime
Perform the inference:
KServe Huggingface runtime deployments supports OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
vLLM runtime deployments only support the OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
Sample OpenAI Completions request:
Sample OpenAI Completions request
```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model": "gpt2", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"gpt2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```
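The completions endpoint can also stream tokens. A minimal sketch, assuming the deployed backend supports streaming; setting `"stream": true` returns server-sent events instead of a single JSON body:
```bash
# Streaming variant of the request above; -N disables curl's output buffering so chunks print as they arrive.
curl -N -H "content-type:application/json" localhost:8080/openai/v1/completions \
  -d '{"model": "gpt2", "prompt": "<prompt>", "stream": true, "max_tokens": 30}'
```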
Sample OpenAI Chat request
Sample OpenAI Chat request:
```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "gpt2", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
