
Latency and Throughput Inquiry #20

Open
ctandrewtran opened this issue Jul 23, 2023 · 4 comments

ctandrewtran commented Jul 23, 2023

Hello-

I've been looking into hosting an LLM on AWS infrastructure, mainly Flan T5 XXL. My question is below.

Inquiry: What is the recommended container for hosting Flan T5 XXL?
Context: I've hosted Flan T5 XXL using both the TGI container and the DJL-FasterTransformer container. Using the same prompt, TGI takes around 5-6 seconds whereas DJL-FasterTransformer takes 0.5-1.5 seconds. The DJL-FasterTransformer container has tensor_parallel_degree set to 4, and SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.

  • Are there recommended configs for the TGI Container that I might be missing?
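
(For reference, the DJL-FasterTransformer setup described above is typically configured through a serving.properties file; a minimal sketch along those lines, with illustrative values rather than the exact configuration from this issue:)

engine=FasterTransformer
option.model_id=google/flan-t5-xxl
option.tensor_parallel_degree=4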

xyang16 commented Jul 28, 2023

@ctandrewtran

  • What is the recommended container for hosting Flan T5 XXL?

g5.12xlarge is recommended

  • Are there recommended configs for the TGI Container that I might be missing?

Could you please share your config?


ctandrewtran commented Jul 29, 2023

> Could you please share your config?

I am using ml.g5.12xlarge and it's deployed within a VPC.

config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}

With this, I am getting 5-6 seconds of latency for a prompt + question, whereas with the DJL-FasterTransformer container it is sub-second.

Is this the expected latency?
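
(For reference, a config like the one above is normally passed as the container environment when deploying TGI through the SageMaker Python SDK; a minimal sketch assuming a standard SageMaker execution role, with placeholder names rather than the exact setup from this issue:)

# Minimal sketch: deploying the TGI container on SageMaker with the env above.
# The role and prompt are placeholders/assumptions, not the setup from this issue.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()                      # assumes running inside SageMaker
image_uri = get_huggingface_llm_image_uri("huggingface")   # resolves a TGI (LLM) container image

config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'bitsandbytes'
}

model = HuggingFaceModel(image_uri=image_uri, env=config, role=role)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
print(predictor.predict({"inputs": "Answer the question: what is Flan T5?"}))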

@nth-attempt

Hi @ctandrewtran do you also do bitsandbytes quantization for FasterTransformer? Wondering if the latency differences are due to quantization!
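
(One way to check that is to redeploy the TGI endpoint without quantization so both containers serve in half precision; a sketch of the adjusted env, assuming the same deployment flow as above:)

# Same TGI env, but without bitsandbytes quantization, for an apples-to-apples latency comparison.
config = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'SM_NUM_GPUS': '4',  # shard across the 4 GPUs of ml.g5.12xlarge
    # 'HF_MODEL_QUANTIZE': 'bitsandbytes',  # removed: bitsandbytes int8 generally trades latency for memory
}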


ishaan-jaff commented Nov 23, 2023

@ctandrewtran @nth-attempt

If you're looking to maximize LLM throughput, LiteLLM now has a router to load balance requests.
(I'd love feedback if you are trying to do this.)

Here's the quick start:
doc: https://docs.litellm.ai/docs/simple_proxy#model-alias

Step 1: Create a config.yaml

model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
      "model": "zephyr-beta",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }'
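
(The proxy exposes an OpenAI-compatible chat-completions endpoint, so the same request can also be made from Python with the openai client; a minimal sketch, assuming the proxy is running on port 8000 as above and that the api_key value is just a placeholder:)

# Minimal sketch: calling the LiteLLM proxy with the OpenAI Python client (v1.x).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")  # key is a placeholder; the proxy does the routing
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)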
