<div style="display: flex; align-items: center; gap: 40px;">

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSkez75fZoo82SccEXRMVRlj9sZsQifRUhURQ&s" width="240">
<img src="https://pbs.twimg.com/profile_images/1783589223406415872/3KMxGGrF_400x400.jpg" width="130">






<div>
  <h2>Cerebras Inference</h2>
  <p>Cerebras Systems builds the world's largest computer chip, the Wafer Scale Engine (WSE), designed specifically for AI workloads. The Cerebras cookbook provides examples, tutorials, and best practices for developing and deploying AI models on Cerebras infrastructure, covering both training on WSE clusters and fast inference via Cerebras Cloud. This notebook focuses on inference.</p>

  <h2>LiteLLM 🚅</h2>
    <p>LiteLLM provides a unified API for 100+ large language models (LLMs). It handles model integration, spend tracking, rate limiting, fallbacks, and observability, so developers can work with providers such as OpenAI, Anthropic, Groq, Cohere, and Google from a single interface.</p>
  </div>
</div>



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1KrVzIjva5AqYwuaciHBNY8eIkrnIx1VJ?usp=sharing)

## Get Your API Keys

Before you begin, make sure you have:

1. A Cerebras API key (get one at [Cerebras Cloud](https://cloud.cerebras.ai/))
2. Basic familiarity with Python and Jupyter notebooks

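This notebook is designed to run in Google Colab, so no local Python installation is required. Store your key as a Colab secret named `CEREBRAS_API_KEY`; the setup cell reads it with `userdata.get`.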

### Cerebras using LiteLLM 🚅

### Install Requirements

In [None]:
!pip install -q litellm

### Set Up API Keys

In [None]:
import os
from google.colab import userdata

os.environ["CEREBRAS_API_KEY"] = userdata.get("CEREBRAS_API_KEY")
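The cell above depends on `google.colab`, which is only importable inside Colab. For runs elsewhere, a minimal sketch of a fallback is shown here, assuming the key is already exported in the environment or can be typed in interactively. `load_api_key` is a hypothetical helper written for this notebook, not part of LiteLLM or Colab.

```python
import os
from getpass import getpass


def load_api_key(env_var="CEREBRAS_API_KEY", prompt_if_missing=False):
    """Return the API key from the environment, optionally prompting for it."""
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed key so it does not end up in cell output
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; add it as a Colab secret or export it first."
        )
    os.environ[env_var] = key  # store the cleaned value for later cells
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error from the first completion call.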

### Initialize Cerebras Model via LiteLLM 🚅

In [None]:
import os
import litellm

api_key = os.environ.get("CEREBRAS_API_KEY")

response = litellm.completion(
    model="cerebras/llama-3.3-70b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Explain how Cerebras achieves ultra-fast inference speeds."
        }
    ],
    temperature=0.7,
    max_tokens=500
)

print(response['choices'][0]['message']['content'])
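Every example in this notebook repeats the same `api_key`, `api_base`, and `messages` boilerplate. One way to cut the repetition is a small builder for the keyword arguments. This is a sketch only: `cerebras_request` and `CEREBRAS_API_BASE` are hypothetical names introduced here, and the defaults simply mirror the cell above.

```python
import os

# Same base URL as the cells in this notebook
CEREBRAS_API_BASE = "https://api.cerebras.ai/v1"


def cerebras_request(model, prompt, **overrides):
    """Build keyword arguments for litellm.completion (hypothetical helper)."""
    kwargs = {
        "model": model,
        "api_key": os.environ.get("CEREBRAS_API_KEY"),
        "api_base": CEREBRAS_API_BASE,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 500,
    }
    kwargs.update(overrides)  # e.g. stream=True or a different temperature
    return kwargs
```

A call then becomes `litellm.completion(**cerebras_request("cerebras/gpt-oss-120b", "Hello", temperature=0.2))`.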

###Example 1: Streaming Response

In [None]:
response = litellm.completion(
    model="cerebras/llama-3.3-70b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Explain quantum computing in simple terms"
        }
    ],
    temperature=0.8,
    max_tokens=1024,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='')
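The loop above prints each piece as it arrives but discards the text afterwards. If the full reply is needed later, the pieces can be accumulated while printing. The sketch below assumes chunks expose `choices[0].delta.content` exactly as the loop above reads it; `collect_stream` and `fake_chunk` are hypothetical names, and the stand-in chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print streamed text as it arrives and return the full reply as one string."""
    parts = []
    for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # skip None and empty deltas
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute shape the loop above relies on."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full_text = collect_stream(
    [fake_chunk("Hello"), fake_chunk(None), fake_chunk(" world")],
    echo=False,
)
```

With a real request, pass the streaming `response` object in place of the list of fake chunks.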

### Example 2: Code Generation with Qwen Coder

In [None]:
response = litellm.completion(
    model="cerebras/qwen-3-coder-480b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function to implement a binary search algorithm with detailed comments."
        }
    ],
    temperature=0.2,
    max_tokens=700
)

print(response['choices'][0]['message']['content'])

### Example 3: Comparing Multiple Models

In [None]:
models = [
    "cerebras/llama-3.3-70b",
    "cerebras/qwen-3-coder-480b",
    "cerebras/gpt-oss-120b"
]

prompt = "What are the key features of functional programming?"

for model in models:
    print(f"\n{'='*60}")
    print(f"Model: {model}")
    print(f"{'='*60}")

    response = litellm.completion(
        model=model,
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=300
    )

    print(response['choices'][0]['message']['content'])
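When comparing models, response time is often as interesting as the text. A minimal timing sketch follows; `timed` and `words_per_second` are hypothetical helpers, and the measurement is wall-clock time that includes network latency, so treat the numbers as rough rather than as a benchmark.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def words_per_second(text, elapsed):
    """Rough throughput proxy: whitespace-separated words per second."""
    return len(text.split()) / elapsed if elapsed > 0 else 0.0
```

Inside the loop, `response, seconds = timed(litellm.completion, model=model, ...)` would wrap the existing call without changing its arguments.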

### Example 4: Code Debugging with Qwen Coder

In [None]:
buggy_code = '''
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([1, 2, 3, 4, 5])
print(result)
'''

prompt = f"""Analyze this Python code and suggest improvements for edge cases and error handling:

{buggy_code}

Provide an improved version with better error handling."""

response = litellm.completion(
    model="cerebras/qwen-3-coder-480b",
    api_key=api_key,
    api_base="https://api.cerebras.ai/v1",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.3,
    max_tokens=600
)

print("Original Code:")
print(buggy_code)
print("\nImproved Version:")
print(response['choices'][0]['message']['content'])
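As a reference point when reading the model's answer, here is one hand-written way to harden `calculate_average`. It is not model output, and the choice to raise exceptions instead of returning a default value is an assumption about what the caller wants.

```python
from numbers import Real


def calculate_average(numbers):
    """Return the arithmetic mean of a non-empty iterable of real numbers."""
    values = list(numbers)
    if not values:
        # The original version fails here with ZeroDivisionError
        raise ValueError("cannot average an empty sequence")
    for value in values:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"expected a real number, got {value!r}")
    return sum(values) / len(values)
```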

### Example 5: Different Temperature Settings

In [None]:
temperatures = [0.1, 0.5, 0.9]
prompt = "Write a creative opening line for a science fiction story."

for temp in temperatures:
    print(f"\n--- Temperature: {temp} ---")

    response = litellm.completion(
        model="cerebras/llama-3.3-70b",
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=100
    )

    print(response['choices'][0]['message']['content'])
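To build intuition for why lower temperatures give more predictable text, the sketch below applies the standard textbook temperature scaling to a softmax over made-up scores. How the served models apply temperature internally is not covered in this notebook, so this is an illustration of the general idea only; `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in [0.1, 0.5, 0.9]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperature almost all probability sits on the top-scoring candidate; as the temperature rises, the distribution flattens and sampling becomes more varied.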

### Example 6: Batch Processing Multiple Queries

In [None]:
queries = [
    "What is machine learning?",
    "Explain neural networks.",
    "What is deep learning?",
    "Define artificial intelligence.",
    "What is natural language processing?"
]

print("Batch Processing Results:\n")

for i, query in enumerate(queries, 1):
    print(f"\n{i}. Query: {query}")

    response = litellm.completion(
        model="cerebras/llama-3.3-70b",
        api_key=api_key,
        api_base="https://api.cerebras.ai/v1",
        messages=[{"role": "user", "content": query}],
        temperature=0.5,
        max_tokens=150
    )

    print(f"Answer: {response['choices'][0]['message']['content']}")
    print("-" * 80)
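The loop above sends one request at a time. Since each request spends most of its time waiting on the network, the queries can also be issued from a small thread pool. This is a sketch using only the standard library: `run_batch` is a hypothetical helper, `ask` stands for any function that takes a query string and returns the reply text, and the worker count should stay within your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(queries, ask, max_workers=4):
    """Run ask(query) for every query in parallel threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs, not completion order
        return list(pool.map(ask, queries))
```

Here `ask` would wrap the `litellm.completion` call from the loop above and return `response['choices'][0]['message']['content']`.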

### Building a Simple Chatbot with LiteLLM 🚅

In [None]:
import os
import litellm

print("Cerebras Chatbot: Hello! Type 'exit' to end the conversation.\n")

chat_history = []

while True:
    user_input = input("You: ")

    if user_input.lower() == "exit":
        print("Chatbot: Goodbye! 👋")
        break

    chat_history.append({"role": "user", "content": user_input})

    try:
        response = litellm.completion(
            model="cerebras/llama-3.3-70b",
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=chat_history,
            temperature=0.7,
            max_tokens=500,
        )

        reply = response["choices"][0]["message"]["content"]
        print("Chatbot:", reply)

        chat_history.append({"role": "assistant", "content": reply})

    except Exception as e:
        # Drop the unanswered user turn so the history stays alternating
        chat_history.pop()
        print("Error:", str(e))
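Because `chat_history` grows with every turn, a long conversation will eventually exceed the model's context window. A simple way to bound it is to keep only the most recent messages. The sketch below is one assumption-laden approach: `trim_history` is a hypothetical helper, it counts messages rather than tokens, and it preserves a leading system message if one exists.

```python
def trim_history(history, max_messages=20):
    """Keep any leading system message plus the most recent max_messages turns."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    # Preserve a system prompt at position 0, if present
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    return system + rest[-max_messages:]
```

In the chatbot loop, this would be used as `messages=trim_history(chat_history)`. Choosing an even `max_messages` keeps user and assistant turns paired in most cases.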


### Advanced: Falling Back to a Second Model

In [None]:
def query_with_fallback(prompt, primary_model, fallback_model):
    try:
        print(f"Trying primary model: {primary_model}")
        response = litellm.completion(
            model=primary_model,
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=300
        )
        return response['choices'][0]['message']['content']

    except Exception as e:
        print(f"Primary model failed: {e}")
        print(f"Trying fallback model: {fallback_model}")

        response = litellm.completion(
            model=fallback_model,
            api_key=api_key,
            api_base="https://api.cerebras.ai/v1",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=300
        )
        return response['choices'][0]['message']['content']

prompt = "Explain the concept of recursion in programming."
result = query_with_fallback(
    prompt,
    primary_model="cerebras/llama-3.3-70b",
    fallback_model="cerebras/gpt-oss-120b"
)

print("\nResponse:")
print(result)
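The function above handles exactly two models and repeats the request code. The same idea generalizes to an ordered list of models. In this sketch, `complete_with_fallbacks` is a hypothetical helper and `complete` is any callable taking `(model, prompt)` and returning the reply text, which keeps the fallback logic separate from the LiteLLM call itself.

```python
def complete_with_fallbacks(prompt, models, complete):
    """Try each model in order; return (model, reply) from the first that succeeds."""
    errors = []
    for model in models:
        try:
            return model, complete(model, prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors.append(f"{model}: {exc}")
    # Every model failed; surface all collected errors at once
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the reply makes it easy to log which model actually answered.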

## Conclusion

This notebook demonstrated how to use Cerebras models with LiteLLM, including:

1. Basic completion requests
2. Streaming responses
3. Code generation with specialized models
4. Model comparison
5. Temperature settings
6. Batch processing
7. Building a chatbot
8. Fallback strategies

LiteLLM provides a unified interface for accessing Cerebras's ultra-fast inference platform, making it easy to integrate into your applications.