
Latency and Throughput Inquiry #20

Open
ctandrewtran opened this issue Jul 23, 2023 · 4 comments

ctandrewtran commented Jul 23, 2023

Hello-

I've been looking into hosting an LLM on AWS infrastructure, mainly Flan T5 XXL. My question is below.

Inquiry: What is the recommended container for hosting Flan T5 XXL?
Context: I've hosted Flan T5 XXL using both the TGI container and the DJL-FasterTransformer container. Using the same prompt, TGI takes around 5-6 seconds whereas DJL-FasterTransformer takes 0.5-1.5 seconds. The DJL-FasterTransformer container has tensor_parallel_degree set to 4, and SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.

  • Are there recommended configs for the TGI Container that I might be missing?
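
(For reference, the DJL-FasterTransformer setup described above is typically configured through a serving.properties file; a minimal sketch along those lines, with illustrative values rather than the exact configuration from this issue:)

engine=FasterTransformer
option.model_id=google/flan-t5-xxl
option.tensor_parallel_degree=4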

xyang16 commented Jul 28, 2023

@ctandrewtran

  • What is the recommended container for hosting Flan T5 XXL?

g5.12xlarge is recommended

  • Are there recommended configs for the TGI Container that I might be missing?

Could you please share your config?


ctandrewtran commented Jul 29, 2023

> Could you please share your config?

I am using ml.g5.12xlarge and it's deployed within a VPC.

config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}

With this, I am getting 5-6 seconds of latency for a prompt + question, whereas with the DJL-FasterTransformer container it is sub-second.

Is this the expected latency?
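
(For reference, a config like the one above is normally passed as the container environment when deploying TGI through the SageMaker Python SDK; a minimal sketch assuming a standard SageMaker execution role, with placeholder names rather than the exact setup from this issue:)

# Minimal sketch: deploying the TGI container on SageMaker with the env above.
# The role and prompt are placeholders/assumptions, not the setup from this issue.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()                      # assumes running inside SageMaker
image_uri = get_huggingface_llm_image_uri("huggingface")   # resolves a TGI (LLM) container image

config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}

model = HuggingFaceModel(image_uri=image_uri, env=config, role=role)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
print(predictor.predict({"inputs": "Answer the question: what is Flan T5?"}))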

@nth-attempt

Hi @ctandrewtran do you also do bitsandbytes quantization for FasterTransformer? Wondering if the latency differences are due to quantization!
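
(One way to check that is to redeploy the TGI endpoint without quantization so both containers serve in half precision; a sketch of the adjusted env, assuming the same deployment flow as above:)

# Same TGI env, but without bitsandbytes quantization, for an apples-to-apples latency comparison.
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',  # shard across the 4 GPUs of ml.g5.12xlarge
    # 'HF_MODEL_QUANTIZE': 'bitsandbytes',  # removed: bitsandbytes int8 generally trades latency for memory
}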


ishaan-jaff commented Nov 23, 2023

@ctandrewtran @nth-attempt

If you're looking to maximize LLM throughput, LiteLLM now has a router to load balance requests.
(I'd love feedback if you are trying to do this.)

Here's the quick start:
doc: https://docs.litellm.ai/docs/simple_proxy#model-alias

Step 1: Create a config.yaml

model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
      "model": "zephyr-beta",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }'
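
(The proxy exposes an OpenAI-compatible chat-completions endpoint, so the same request can also be made from Python with the openai client; a minimal sketch, assuming the proxy is running on port 8000 as above and that the api_key value is just a placeholder:)

# Minimal sketch: calling the LiteLLM proxy with the OpenAI Python client (v1.x).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")  # key is a placeholder; the proxy does the routing
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)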
