Conversation

@sebastiand-cerebras
Contributor

@sebastiand-cerebras sebastiand-cerebras commented Dec 4, 2025

This PR adds a specific configuration for the Cerebras provider to optimize rate limit handling and integration tracking.

Key changes:

  • Conservative Token Limit: Sets maxCompletionTokens to 16k. The Cerebras rate limiter estimates token consumption by reserving the full max_completion_tokens quota upfront. Using a conservative default prevents premature rate limiting, ensuring smoother operation even when actual generation is small.
  • Integration Header: Adds the `X-Cerebras-3rd-Party-Integration: opencode` header.
  • Configuration: Sets `autoload: false`.
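As a rough illustration, a provider entry covering the three bullets above might look like the sketch below. The field names (`options.headers`, `limit.output`, `autoload`) are assumptions for illustration only; the actual schema used by the opencode provider registry may differ.

```json
{
  "provider": {
    "cerebras": {
      "autoload": false,
      "options": {
        "headers": {
          "X-Cerebras-3rd-Party-Integration": "opencode"
        }
      },
      "models": {
        "gpt-oss-120b": {
          "limit": { "output": 16384 }
        }
      }
    }
  }
}
```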

Testing:
Verified functionality with the following models: gpt-oss-120b, qwen-235, zai-glm4.6.

@rekram1-node
Collaborator

Wouldn’t this kinda neuter a lot of models?

Can you explain why you need this? Models like gpt-oss have a 32k max completion output, and opencode should be respecting that…

What kind of plan are you on where you get rate-limited?

@sebastiand-cerebras
Contributor Author

Cerebras handles rate limiting differently from most providers. It estimates token usage upfront using the `max_completion_tokens` value, so if a client always sends 32k, each request is counted as if it might produce 32k tokens, even when the actual completion is much smaller. On Cerebras Code plans this causes users to hit rate limits very quickly in agentic coding workflows that make many short calls. A more conservative default like 8,192 tokens gives a much smoother experience without materially limiting typical code completions.
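The effect of upfront reservation can be sketched with simple arithmetic. This is a hypothetical model of the behavior described above, not Cerebras's actual limiter; the 1M-token-per-window budget is an assumed figure for illustration.

```python
def requests_before_limit(budget_tokens: int, max_completion_tokens: int) -> int:
    """Number of requests a limiter admits per window when it reserves
    the full max_completion_tokens quota for each request upfront,
    regardless of how many tokens the completion actually produces."""
    return budget_tokens // max_completion_tokens

# Assumed per-window token budget (illustrative only).
budget = 1_000_000

# With a 32k cap, the budget is exhausted after far fewer requests
# than with an 8k cap, even if every completion is short.
print(requests_before_limit(budget, 32_768))  # 30 requests per window
print(requests_before_limit(budget, 8_192))   # 122 requests per window
```

Under these assumptions, lowering the advertised cap from 32k to 8k roughly quadruples the number of short agentic calls that fit in one rate-limit window.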

