refactor: remove max_tokens parameter from LLM configurations and upd… by iziang · Pull Request #997 · apecloud/ApeRAG

iziang · 2025-06-27T03:55:54Z

…ate related logic

Removed max_tokens from LLM input and output schemas across multiple YAML files.
Updated LLM service logic to dynamically calculate context length based on model's max_tokens.
Adjusted frontend components to reflect the removal of max_tokens input.
Updated test data to align with the new model configurations.

…ate related logic - Removed max_tokens from LLM input and output schemas across multiple YAML files. - Updated LLM service logic to dynamically calculate context length based on model's max_tokens. - Adjusted frontend components to reflect the removal of max_tokens input. - Updated test data to align with the new model configurations.

cursor

Bug: Token Management Logic Ignored

The LLM runner calculates output_max_tokens and passes it to the CompletionService constructor. However, the max_tokens parameter is commented out in all litellm API calls within CompletionService, causing the dynamically calculated token limits to be ignored. This breaks the intended token management logic and can lead to LLM responses exceeding their allocated token budget or causing API errors.

aperag/llm/completion/completion_service.py#L84-L155

ApeRAG/aperag/llm/completion/completion_service.py

Lines 84 to 155 in 9ddce62

    
                       temperature=self.temperature, 
        
                       # max_tokens=self.max_tokens, 
        
                       messages=messages, 
        
                       stream=False, 
        
                       caching=self.caching, 
        
                   ) 
        
                   return self._extract_content_from_response(response) 
        
               except CompletionError: 
        
                   # Re-raise our custom completion errors 
        
                   raise 
        
               except Exception as e: 
        
                   logger.error(f"Async completion generation failed: {str(e)}") 
        
                   raise wrap_litellm_error(e, "completion", self.provider, self.model) from e 
        
           async def _acompletion_stream_raw(self, history, prompt, memory=False) -> AsyncGenerator[str, None]: 
        
               """Core async completion method for streaming responses.""" 
        
               try: 
        
                   self._validate_inputs(prompt) 
        
                   messages = self._build_messages(history, prompt, memory) 
        
                   response = await litellm.acompletion( 
        
                       custom_llm_provider=self.provider, 
        
                       model=self.model, 
        
                       base_url=self.base_url, 
        
                       api_key=self.api_key, 
        
                       temperature=self.temperature, 
        
                       # max_tokens=self.max_tokens, 
        
                       messages=messages, 
        
                       stream=True, 
        
                       caching=self.caching, 
        
                   ) 
        
                   # Process the raw stream and yield clean text chunks 
        
                   async for chunk in response: 
        
                       if not chunk.choices: 
        
                           continue 
        
                       choice = chunk.choices[0] 
        
                       if choice.finish_reason == "stop": 
        
                           return 
        
                       content_to_yield = None 
        
                       if choice.delta and choice.delta.content: 
        
                           content_to_yield = choice.delta.content 
        
                       elif hasattr(choice.delta, "reasoning_content") and choice.delta.reasoning_content: 
        
                           content_to_yield = choice.delta.reasoning_content 
        
                       if content_to_yield: 
        
                           yield content_to_yield 
        
               except CompletionError: 
        
                   # Re-raise our custom completion errors 
        
                   raise 
        
               except Exception as e: 
        
                   logger.error(f"Async streaming generation failed: {str(e)}") 
        
                   raise wrap_litellm_error(e, "completion", self.provider, self.model) from e 
        
           def _completion_core(self, history, prompt, memory=False) -> str: 
        
               """Core sync completion method (non-streaming only).""" 
        
               try: 
        
                   self._validate_inputs(prompt) 
        
                   messages = self._build_messages(history, prompt, memory) 
        
                   response = litellm.completion( 
        
                       custom_llm_provider=self.provider, 
        
                       model=self.model, 
        
                       base_url=self.base_url, 
        
                       api_key=self.api_key, 
        
                       temperature=self.temperature, 
        
                       # max_tokens=self.max_tokens, 
        
                       messages=messages, 
        
                       stream=False, 
        
                       caching=self.caching,

aperag/flow/runners/llm.py#L203-L204

ApeRAG/aperag/flow/runners/llm.py

Lines 203 to 204 in 9ddce62

    
           cs = CompletionService(custom_llm_provider, model_name, base_url, api_key, temperature, output_max_tokens)

Fix in Cursor

Was this report helpful? Give feedback by reacting with 👍 or 👎

apecloud-bot added the size/L Denotes a PR that changes 100-499 lines. label Jun 27, 2025

This comment was marked as outdated.

Sign in to view

iziang linked an issue Jun 27, 2025 that may be closed by this pull request

[Bug]: acompletion does not pre-validate max_tokens, leading to a generic BadRequestError when exceeding model limits #998

Closed

chore: tidy up

d760620

This comment was marked as outdated.

Sign in to view

chore: tidy up

b967807

This comment was marked as outdated.

Sign in to view

chore: tidy up

9ddce62

cursor Bot reviewed Jun 27, 2025

View reviewed changes

iziang merged commit 771c271 into main Jun 27, 2025
6 of 7 checks passed

iziang deleted the support/max_tokens branch June 27, 2025 06:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: remove max_tokens parameter from LLM configurations and upd…#997

refactor: remove max_tokens parameter from LLM configurations and upd…#997
iziang merged 4 commits into
mainfrom
support/max_tokens

iziang commented Jun 27, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	temperature=self.temperature,
	# max_tokens=self.max_tokens,
	messages=messages,
	stream=False,
	caching=self.caching,
	)

	return self._extract_content_from_response(response)

	except CompletionError:
	# Re-raise our custom completion errors
	raise
	except Exception as e:
	logger.error(f"Async completion generation failed: {str(e)}")
	raise wrap_litellm_error(e, "completion", self.provider, self.model) from e

	async def _acompletion_stream_raw(self, history, prompt, memory=False) -> AsyncGenerator[str, None]:
	"""Core async completion method for streaming responses."""
	try:
	self._validate_inputs(prompt)
	messages = self._build_messages(history, prompt, memory)

	response = await litellm.acompletion(
	custom_llm_provider=self.provider,
	model=self.model,
	base_url=self.base_url,
	api_key=self.api_key,
	temperature=self.temperature,
	# max_tokens=self.max_tokens,
	messages=messages,
	stream=True,
	caching=self.caching,
	)

	# Process the raw stream and yield clean text chunks
	async for chunk in response:
	if not chunk.choices:
	continue
	choice = chunk.choices[0]
	if choice.finish_reason == "stop":
	return
	content_to_yield = None
	if choice.delta and choice.delta.content:
	content_to_yield = choice.delta.content
	elif hasattr(choice.delta, "reasoning_content") and choice.delta.reasoning_content:
	content_to_yield = choice.delta.reasoning_content
	if content_to_yield:
	yield content_to_yield

	except CompletionError:
	# Re-raise our custom completion errors
	raise
	except Exception as e:
	logger.error(f"Async streaming generation failed: {str(e)}")
	raise wrap_litellm_error(e, "completion", self.provider, self.model) from e

	def _completion_core(self, history, prompt, memory=False) -> str:
	"""Core sync completion method (non-streaming only)."""
	try:
	self._validate_inputs(prompt)
	messages = self._build_messages(history, prompt, memory)

	response = litellm.completion(
	custom_llm_provider=self.provider,
	model=self.model,
	base_url=self.base_url,
	api_key=self.api_key,
	temperature=self.temperature,
	# max_tokens=self.max_tokens,
	messages=messages,
	stream=False,
	caching=self.caching,


	cs = CompletionService(custom_llm_provider, model_name, base_url, api_key, temperature, output_max_tokens)

Conversation

iziang commented Jun 27, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Bug: Token Management Logic Ignored

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants