refactor: remove max_tokens parameter from LLM configurations and upd…#997

Merged: iziang merged 4 commits into main from support/max_tokens on Jun 27, 2025

Conversation

@iziang iziang commented Jun 27, 2025

…ate related logic

  • Removed max_tokens from LLM input and output schemas across multiple YAML files.
  • Updated LLM service logic to dynamically calculate context length based on model's max_tokens.
  • Adjusted frontend components to reflect the removal of max_tokens input.
  • Updated test data to align with the new model configurations.
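
The PR description does not show how the context length is derived, so the following is only a minimal sketch of one plausible calculation: subtract the prompt's token count from the model's context window and cap the result at the model's own output limit. The function name `output_token_budget` and all of its parameters are hypothetical and are not taken from this PR.

```python
from typing import Optional


def output_token_budget(
    context_window: int,
    prompt_tokens: int,
    model_max_output: Optional[int] = None,
    reserve: int = 0,
) -> int:
    """Return how many tokens the model may generate for one request.

    Hypothetical helper: names and behavior are assumptions, not the PR's code.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("prompt_tokens and reserve must be non-negative")

    # Whatever the prompt (plus an optional reserve) does not use
    # is available for the completion.
    remaining = context_window - prompt_tokens - reserve
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room in a "
            f"{context_window}-token context window"
        )

    # Some models cap output length separately from the context window.
    if model_max_output is not None:
        return min(remaining, model_max_output)
    return remaining
```

Under these assumptions, a caller would compute the budget once per request and hand the result to the completion layer, instead of asking users to configure `max_tokens` by hand.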

@apecloud-bot apecloud-bot added the size/L Denotes a PR that changes 100-499 lines. label Jun 27, 2025
cursor[bot] left three comments that were marked as outdated.

@cursor cursor Bot left a comment

Bug: Token Management Logic Ignored

The LLM runner calculates output_max_tokens and passes it to the CompletionService constructor. However, the max_tokens parameter is commented out in every litellm API call within CompletionService, so the dynamically calculated token limit is never sent to the provider. This breaks the intended token management logic: responses can exceed their allocated token budget, or the request can fail with an API error.

aperag/llm/completion/completion_service.py#L84-L155

                temperature=self.temperature,
                # max_tokens=self.max_tokens,
                messages=messages,
                stream=False,
                caching=self.caching,
            )
            return self._extract_content_from_response(response)
        except CompletionError:
            # Re-raise our custom completion errors
            raise
        except Exception as e:
            logger.error(f"Async completion generation failed: {str(e)}")
            raise wrap_litellm_error(e, "completion", self.provider, self.model) from e

    async def _acompletion_stream_raw(self, history, prompt, memory=False) -> AsyncGenerator[str, None]:
        """Core async completion method for streaming responses."""
        try:
            self._validate_inputs(prompt)
            messages = self._build_messages(history, prompt, memory)
            response = await litellm.acompletion(
                custom_llm_provider=self.provider,
                model=self.model,
                base_url=self.base_url,
                api_key=self.api_key,
                temperature=self.temperature,
                # max_tokens=self.max_tokens,
                messages=messages,
                stream=True,
                caching=self.caching,
            )
            # Process the raw stream and yield clean text chunks
            async for chunk in response:
                if not chunk.choices:
                    continue
                choice = chunk.choices[0]
                if choice.finish_reason == "stop":
                    return
                content_to_yield = None
                if choice.delta and choice.delta.content:
                    content_to_yield = choice.delta.content
                elif hasattr(choice.delta, "reasoning_content") and choice.delta.reasoning_content:
                    content_to_yield = choice.delta.reasoning_content
                if content_to_yield:
                    yield content_to_yield
        except CompletionError:
            # Re-raise our custom completion errors
            raise
        except Exception as e:
            logger.error(f"Async streaming generation failed: {str(e)}")
            raise wrap_litellm_error(e, "completion", self.provider, self.model) from e

    def _completion_core(self, history, prompt, memory=False) -> str:
        """Core sync completion method (non-streaming only)."""
        try:
            self._validate_inputs(prompt)
            messages = self._build_messages(history, prompt, memory)
            response = litellm.completion(
                custom_llm_provider=self.provider,
                model=self.model,
                base_url=self.base_url,
                api_key=self.api_key,
                temperature=self.temperature,
                # max_tokens=self.max_tokens,
                messages=messages,
                stream=False,
                caching=self.caching,

aperag/flow/runners/llm.py#L203-L204

cs = CompletionService(custom_llm_provider, model_name, base_url, api_key, temperature, output_max_tokens)
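
One way to address the report above, sketched here under assumptions rather than taken from the repository, is to build the call arguments in a single place and include `max_tokens` only when a limit was actually computed. The helper name `build_completion_kwargs` is hypothetical; the keyword names mirror those visible in the excerpt above.

```python
from typing import Any, Dict, List, Optional


def build_completion_kwargs(
    provider: str,
    model: str,
    messages: List[Dict[str, str]],
    temperature: float,
    max_tokens: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a completion call.

    Hypothetical helper: it is not part of CompletionService in this PR.
    """
    kwargs: Dict[str, Any] = {
        "custom_llm_provider": provider,
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "stream": stream,
    }
    # Forward the limit only when the runner computed one, so that
    # "no limit" stays expressible without sending a placeholder value.
    if max_tokens is not None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be a positive integer")
        kwargs["max_tokens"] = max_tokens
    return kwargs
```

The resulting dictionary could then be unpacked into each of the three call sites, for example `litellm.completion(**kwargs)`, so the sync, async, and streaming paths cannot drift apart on whether the limit is sent.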

@iziang iziang merged commit 771c271 into main Jun 27, 2025
6 of 7 checks passed
@iziang iziang deleted the support/max_tokens branch June 27, 2025 06:33

Development

Successfully merging this pull request may close these issues.

[Bug]: acompletion does not pre-validate max_tokens, leading to a generic BadRequestError when exceeding model limits

2 participants