Non-descriptive Internal Server Error for OpenAI BadRequestError #42

@dongwang218

Description

Describe the bug:
When calling Azure OpenAI gpt-4o with max_tokens=8192, the response is a generic Internal Server Error.

Looking at the Ray dashboard, the backend error is:

 await app(scope, receive, sender)
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
 response = await f(request)
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/fastapi/routing.py", line 302, in app
 raw_response = await run_endpoint_function(
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/fastapi/routing.py", line 213, in run_endpoint_function
 return await dependant.call(**values)
 File "/storage/home/dongwang/workspace/github/matrix/matrix/app_server/llm/azure_openai_proxy.py", line 73, in create_chat_completion
 return await self.client.chat.completions.create(**completion_request)
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2454, in create
 return await self._post(
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/openai/_base_client.py", line 1791, in post
 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
 File "/home/dongwang/miniconda3/envs/matrix/lib/python3.10/site-packages/openai/_base_client.py", line 1591, in request
 raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': 'max_tokens is too large: 8192. This model supports at most 4096 completion tokens, whereas you provided 8192.', 'type': 'invalid_request_error', 'param': 'max_tokens', 'code': 'invalid_value'}}
INFO 2025-07-31 17:32:18,517 openai_OpenaiDeployment x4dezgdk 29e7dc77-ce9d-4031-8fe8-210ce9ac1f62 -- POST /openai/v1/chat/completions 500 170.8ms

Describe how to reproduce:

matrix check_health --app_name openai --max_tokens 8192 --use_curl False

{'request': {'model': 'gpt-4o', 'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': 'What is 2+4=?'}], 'temperature': 0.7, 'metadata': {'request_timestamp': 1753984008.6355495}}, 'response': {'error': 'Internal Server Error', 'response_timestamp': 1753984021.2901425}}

Describe the expected behavior:
The response.error field should contain the upstream error message (the 400 BadRequestError details from the OpenAI client), not a generic "Internal Server Error".
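A minimal sketch of the fix: catch the upstream client error in the proxy endpoint and forward its status code and message instead of letting it bubble up as a 500. In the real proxy the caught type would be openai.APIStatusError (of which BadRequestError is a subclass); the stand-in class and helper below are hypothetical and dependency-free for illustration only.

```python
# Hypothetical sketch: translate an upstream OpenAI client error into a
# structured proxy response. UpstreamStatusError stands in for
# openai.APIStatusError so the sketch runs without the openai package.

class UpstreamStatusError(Exception):
    """Stand-in for openai.APIStatusError: carries status code and message."""

    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code
        self.message = message


def translate_upstream_error(exc: UpstreamStatusError) -> dict:
    """Build the error payload the proxy should return to the caller,
    preserving the upstream status (400) rather than collapsing to 500."""
    return {
        "status_code": exc.status_code,
        "error": {"message": exc.message, "type": "invalid_request_error"},
    }


try:
    raise UpstreamStatusError(
        400,
        "max_tokens is too large: 8192. This model supports at most "
        "4096 completion tokens, whereas you provided 8192.",
    )
except UpstreamStatusError as e:
    payload = translate_upstream_error(e)
```

In a FastAPI endpoint such as create_chat_completion, the equivalent would be re-raising as an HTTPException with the upstream status_code and detail, so the client sees the real validation message.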

Environment:
pip install -e .[vllm_083]

Additional Context:
Note the limitation applies only to completion (output) tokens, which are capped at 4096; according to this discussion, the context window itself is 128k: https://community.openai.com/t/gpt-4-128k-only-has-4096-completion-tokens/515790.
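Given that cap, a proxy could also clamp max_tokens to the model's completion-token limit before forwarding the request, avoiding the 400 entirely. A minimal sketch, assuming a per-model limit table (the values and the clamp_max_tokens helper are illustrative, not part of the project):

```python
# Hypothetical sketch: clamp max_tokens to a per-model completion-token
# limit before forwarding. gpt-4o's 128k context window does not raise
# its completion cap, so requests above it would otherwise 400 upstream.

COMPLETION_TOKEN_LIMITS = {"gpt-4o": 4096}  # assumed values, not authoritative


def clamp_max_tokens(request: dict) -> dict:
    """Return a copy of the request with max_tokens capped at the
    model's completion limit (if the model is known)."""
    limit = COMPLETION_TOKEN_LIMITS.get(request.get("model"))
    if limit is not None and request.get("max_tokens", 0) > limit:
        request = {**request, "max_tokens": limit}
    return request


req = clamp_max_tokens({"model": "gpt-4o", "max_tokens": 8192})
```

Whether to clamp silently or surface the upstream error is a design choice; either way the caller should not see a bare 500.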
