Update huggingface readme (kserve#3678)
* update wording for huggingface README

small update to make readme easier to understand

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* Update README.md

Signed-off-by: Alexa Griffith agriffith50@bloomberg.net

* Update python/huggingfaceserver/README.md

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* update vllm

Signed-off-by: alexagriffith <agriffith50@bloomberg.net>

* Update README.md

---------

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>
Signed-off-by: Alexa Griffith agriffith50@bloomberg.net
Signed-off-by: alexagriffith <agriffith50@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: asd981256 <asd981256@gmail.com>
3 people authored and asd981256 committed May 14, 2024
1 parent 73bad9a commit 5567a73
Showing 1 changed file with 18 additions and 13 deletions.
31 changes: 18 additions & 13 deletions python/huggingfaceserver/README.md
@@ -1,12 +1,12 @@
# Huggingface Serving Runtime
# HuggingFace Serving Runtime

The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box.
The HuggingFace serving runtime can serve HuggingFace transformer-based models out of the box.
The preprocess and post-process handlers are implemented for different ML tasks, for example text classification,
token classification, text generation, and text2text generation. Depending on your performance requirements, you can choose to run
inference on a more optimized inference engine such as Triton Inference Server or vLLM for text generation.


## Run Huggingface Server Locally
## Run HuggingFace Server Locally

```bash
python -m huggingfaceserver --model_id=bert-base-uncased --model_name=bert
@@ -45,7 +45,7 @@ curl -H "content-type:application/json" -v localhost:8080/v1/models/bert:predict
> 1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
> 2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.
1. Serve the huggingface model using KServe python runtime for both preprocess(tokenization)/postprocess and inference.
1. Serve the BERT model using the KServe HuggingFace python runtime for both preprocess (tokenization)/postprocess and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -70,7 +70,7 @@ spec:
memory: 2Gi
```
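Once the `InferenceService` above is ready, a prediction request can be sent through the cluster ingress. A rough sketch, assuming the resource is named `huggingface-bert` and that `INGRESS_HOST`/`INGRESS_PORT` point at the KServe ingress gateway:
```bash
# Hypothetical names: adjust the InferenceService name and ingress variables for your cluster.
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "content-type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bert:predict" \
  -d '{"instances": ["The capital of France is [MASK]."]}'
```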
2. Serve the huggingface model using triton inference runtime and KServe transformer for the preprocess(tokenization) and postprocess.
2. Serve the BERT model using the Triton inference runtime and a KServe transformer with the HuggingFace runtime for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -111,8 +111,11 @@ spec:
cpu: 100m
memory: 2Gi
```
3. Serve the huggingface model using vllm runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
3. Serve the llama2 model using the KServe HuggingFace vLLM runtime. For the llama2 model, vLLM is supported and used as the default backend.
If vLLM is available for a model, it is set as the default backend; otherwise the KServe HuggingFace runtime is used as a failsafe.
You can find the models supported by vLLM [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -135,10 +138,10 @@ spec:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
```
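Once the vLLM-backed `InferenceService` is ready, a chat completion can be requested through the ingress. A rough sketch, assuming the model is registered under the name `llama2`, the resource is named `huggingface-llama2`, and `INGRESS_HOST`/`INGRESS_PORT` point at the KServe ingress gateway:
```bash
# Hypothetical names: the model and InferenceService names depend on the manifest above and your cluster setup.
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "content-type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Write a short poem about the ocean."}], "stream": false, "max_tokens": 50}'
```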
If vllm needs to be disabled include the flag `--backend=huggingface` in the container args. In this case the KServe python runtime will be used.
If vLLM needs to be disabled, include the flag `--backend=huggingface` in the container args. In this case, the HuggingFace backend is used.
```yaml
apiVersion: serving.kserve.io/v1beta1
@@ -165,18 +168,20 @@ spec:
nvidia.com/gpu: "1"
```
Perform the inference for vllm specific runtime
Perform the inference:
KServe Huggingface runtime deployments supports OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
vLLM runtime deployments only support the OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
Sample OpenAI Completions request:
Sample OpenAI Completions request
```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model": "gpt2", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"gpt2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```
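The completions endpoint can also stream tokens. A minimal sketch, assuming the deployed backend supports streaming; setting `"stream": true` returns server-sent events instead of a single JSON body:
```bash
# Streaming variant of the request above; -N disables curl's output buffering so chunks print as they arrive.
curl -N -H "content-type:application/json" localhost:8080/openai/v1/completions \
  -d '{"model": "gpt2", "prompt": "<prompt>", "stream": true, "max_tokens": 30}'
```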
Sample OpenAI Chat request
Sample OpenAI Chat request:
```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "gpt2", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
