# Groq Tutorial - Ultra-Fast Inference

This notebook covers working with Groq's lightning-fast LLM inference using the `llm_playbook` package.

## What You'll Learn

- Setting up the Groq client
- Available models (Llama, Mixtral, Gemma)
- Basic chat with incredibly fast responses
- Streaming for real-time output

## Available Models

| Model | Description |
|-------|-------------|
| `llama-3.3-70b-versatile` | Latest Llama 3.3 (default) |
| `llama-3.1-70b-versatile` | Llama 3.1 70B |
| `llama-3.1-8b-instant` | Fast small model |
| `mixtral-8x7b-32768` | Mixtral MoE |
| `gemma2-9b-it` | Google Gemma 2 |

## Why Groq?

- **Blazing fast**: Very low latency; short answers often return in well under a second
- **LPU hardware**: Custom chips designed for LLM inference
- **Free tier**: Generous limits for experimentation
- **Open models**: Access to Llama, Mixtral, and more

## Setup

Install the package and configure your API key.

In [3]:
# Install the package
!pip install -q git+https://github.com/deepakdeo/python-llm-playbook.git

  Preparing metadata (setup.py) ... done


In [5]:
# Setup API Key from Colab Secrets
import os
from google.colab import userdata

# Add your GROQ_API_KEY in the Secrets pane (🔑 icon in left sidebar)
# Get your key at: https://console.groq.com
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')
print("API key configured!")

API key configured!
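
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback for local runs, assuming the key is stored in the `GROQ_API_KEY` environment variable; `resolve_api_key` is a hypothetical helper for illustration, not part of `llm_playbook`:

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass):
    """Return the Groq API key from the environment, prompting if it is missing.

    Illustrative helper only; it is not part of llm_playbook.
    """
    env = os.environ if env is None else env
    key = (env.get("GROQ_API_KEY") or "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is never saved in the notebook.
        key = prompt("GROQ_API_KEY: ").strip()
    if not key:
        raise ValueError("No Groq API key provided")
    return key
```

Calling `os.environ["GROQ_API_KEY"] = resolve_api_key()` before creating the client would then cover both cases.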


## 1. Basic Chat - Feel the Speed!

Groq's main selling point is speed. Let's see it in action.

In [7]:
import time
from llm_playbook import GroqClient

# Initialize the client (uses llama-3.3-70b-versatile by default)
client = GroqClient()

# Time the response
start = time.time()
response = client.chat("What is machine learning in one sentence?")
elapsed = time.time() - start

print(f"Response: {response}")
print(f"\nTime: {elapsed:.2f} seconds")
print("That's FAST! 🚀")

Response: Machine learning is a subset of artificial intelligence that involves training algorithms to learn from data and make predictions, decisions, or take actions without being explicitly programmed for a specific task.

Time: 0.23 seconds
That's FAST! 🚀


In [None]:
# Compare with a longer response
start = time.time()
response = client.chat(
    "Explain the difference between supervised and unsupervised learning.",
    max_tokens=200
)
elapsed = time.time() - start

print(f"Response: {response}")
print(f"\nTime: {elapsed:.2f} seconds (response capped at 200 tokens)")
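
The start/stop timing pattern above repeats in several cells, so it can be factored into a small helper. The sketch below is one possible version. `timed` and `words_per_second` are hypothetical names, and the word count is only a rough proxy for throughput because it does not use the model's tokenizer:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    # perf_counter is a monotonic, high-resolution clock meant for intervals.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def words_per_second(text, elapsed):
    """Rough throughput proxy: whitespace-separated words per second (not tokens)."""
    return len(text.split()) / elapsed if elapsed > 0 else float("inf")
```

With the client from the earlier cell, this might be used as `response, elapsed = timed(client.chat, "What is machine learning in one sentence?")`.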

## 2. Available Models

Groq hosts several open-weight models. The lineup changes over time, so if a model ID below is rejected, check Groq's current model list. Let's try them out.

In [None]:
# Llama 3.3 70B (default)
llama_client = GroqClient(model="llama-3.3-70b-versatile")

response = llama_client.chat("What makes you different from other language models?")
print("Llama 3.3 70B:")
print(response)

In [None]:
# Llama 3.1 8B - even faster!
small_client = GroqClient(model="llama-3.1-8b-instant")

start = time.time()
response = small_client.chat("What is 2+2? Just the number.")
elapsed = time.time() - start

print(f"Llama 3.1 8B: {response}")
print(f"Time: {elapsed:.3f} seconds - instant!")

In [None]:
# Mixtral 8x7B
mixtral_client = GroqClient(model="mixtral-8x7b-32768")

response = mixtral_client.chat("Write a haiku about speed.")
print("Mixtral 8x7B:")
print(response)

In [None]:
# Gemma 2 9B
gemma_client = GroqClient(model="gemma2-9b-it")

response = gemma_client.chat("What's unique about Google's Gemma model?")
print("Gemma 2 9B:")
print(response)

## 3. System Prompts

Control the model's behavior with system prompts.

In [None]:
# Code assistant persona
response = client.chat(
    message="How do I read a CSV file?",
    system_prompt="You are a Python expert. Give brief, practical code examples. No explanations, just code."
)
print(response)

In [None]:
# JSON output requested through the system prompt (the format is not enforced)
response = client.chat(
    message="List 3 colors",
    system_prompt="Respond only in valid JSON format. No markdown, no explanation.",
    temperature=0.0
)
print(response)
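
Because the JSON format is only requested in the prompt, the reply may still arrive wrapped in a Markdown code fence or be invalid. A minimal sketch of defensive parsing, assuming the reply is a plain string; `parse_json_reply` is a hypothetical helper, not part of `llm_playbook`:

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and the closing fence line if present.
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    # json.loads raises json.JSONDecodeError if the reply is still not valid JSON.
    return json.loads(cleaned)
```

Calling `parse_json_reply(response)` on the output above would return a Python object, or raise so the caller can retry.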

## 4. Multi-turn Conversations

Maintain context across multiple exchanges.

In [8]:
from llm_playbook import ChatMessage

history = []
system = "You are a helpful coding tutor. Be concise."

# Turn 1
q1 = "What is a Python list?"
a1 = client.chat(q1, system_prompt=system, history=history)

print(f"Student: {q1}")
print(f"Tutor: {a1}\n")

history.append(ChatMessage(role="user", content=q1))
history.append(ChatMessage(role="assistant", content=a1))

Student: What is a Python list?
Tutor: **Python List**: A mutable, ordered collection of elements that can be of any data type, including strings, integers, floats, and other lists. Defined using square brackets `[]`. Example: `my_list = [1, 2, 3, "hello"]`



In [9]:
# Turn 2
q2 = "How do I add items to it?"
a2 = client.chat(q2, system_prompt=system, history=history)

print(f"Student: {q2}")
print(f"Tutor: {a2}")

Student: How do I add items to it?
Tutor: **Adding Items to a List**:

* **Append**: Add a single item to the end of the list using `append()`. Example: `my_list.append(4)`
* **Extend**: Add multiple items to the end of the list using `extend()`. Example: `my_list.extend([5, 6, 7])`
* **Insert**: Add an item at a specific position using `insert()`. Example: `my_list.insert(0, 0)`
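
Appending both sides of every exchange by hand is easy to forget (the second turn above is never added to `history`). The sketch below wraps that bookkeeping. `Conversation` and `EchoClient` are hypothetical, and the `ChatMessage` dataclass is a stand-in that assumes the real `llm_playbook.ChatMessage` takes `role` and `content`, as in the cell above:

```python
from dataclasses import dataclass, field


@dataclass
class ChatMessage:
    """Stand-in for llm_playbook.ChatMessage (assumed fields: role, content)."""
    role: str
    content: str


@dataclass
class Conversation:
    """Hypothetical wrapper that records each exchange in the history."""
    client: object
    system_prompt: str = ""
    history: list = field(default_factory=list)

    def ask(self, question):
        # Send the question together with all earlier turns.
        answer = self.client.chat(
            question, system_prompt=self.system_prompt, history=self.history
        )
        # Record both sides only after the call has succeeded.
        self.history.append(ChatMessage(role="user", content=question))
        self.history.append(ChatMessage(role="assistant", content=answer))
        return answer


class EchoClient:
    """Offline stub with the same chat() call shape, for trying the wrapper."""

    def chat(self, message, system_prompt=None, history=None):
        return f"reply {len(history or []) // 2 + 1}: {message}"


convo = Conversation(EchoClient(), system_prompt="Be concise.")
print(convo.ask("What is a Python list?"))
print(convo.ask("How do I add items to it?"))
```

With the real client, the stand-in class would be replaced by `from llm_playbook import ChatMessage` and the stub by `client`.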


## 5. Streaming - Real-time Output

Stream tokens as they're generated. Even with Groq's speed, streaming provides better UX.

In [10]:
print("Streaming: ", end="")

for token in client.stream("Write a limerick about programming."):
    print(token, end="", flush=True)

print()

Streaming: There once was a coder so fine,
Whose programs were truly divine.
She coded with care,
And debugged with flair,
And her apps were always on time.


In [None]:
# Streaming a longer response
print("Streaming story:\n")

for token in client.stream(
    message="Write a 4-sentence story about a robot.",
    max_tokens=150
):
    print(token, end="", flush=True)

print()
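
Streaming prints tokens but discards them, and the metric users usually notice is the time until the first token appears. A minimal sketch that keeps the full text and records that delay, assuming `client.stream(...)` yields string chunks as in the cells above; `consume_stream` is a hypothetical helper:

```python
import time


def consume_stream(tokens, echo=True):
    """Print tokens as they arrive; return (full_text, seconds_to_first_token)."""
    start = time.perf_counter()
    first_token_at = None  # stays None if the stream yields nothing
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
        if echo:
            print(token, end="", flush=True)
    if echo:
        print()
    return "".join(parts), first_token_at
```

For example, `text, ttft = consume_stream(client.stream("Write a limerick about programming."))`. The clock starts when `consume_stream` is called, so the measurement includes request setup only if the stream is lazy.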

## 6. Speed Comparison

Let's benchmark different models on Groq.

In [None]:
models = [
    ("llama-3.1-8b-instant", "Llama 3.1 8B"),
    ("gemma2-9b-it", "Gemma 2 9B"),
    ("mixtral-8x7b-32768", "Mixtral 8x7B"),
    ("llama-3.3-70b-versatile", "Llama 3.3 70B"),
]

prompt = "What is Python? Answer in one sentence."
print(f"Prompt: {prompt}\n")
print("-" * 60)

for model_id, name in models:
    try:
        test_client = GroqClient(model=model_id)
        start = time.time()
        response = test_client.chat(prompt)
        elapsed = time.time() - start
        print(f"\n{name} ({elapsed:.2f}s):")
        print(f"  {response[:100]}..." if len(response) > 100 else f"  {response}")
    except Exception as e:
        print(f"\n{name}: Error - {e}")
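
A single timed request per model is noisy, since network jitter can outweigh the difference between models. One way to steady the numbers is to take the median of several runs. The sketch below does this. `median_latency` is a hypothetical helper, and every run is a real API call that counts against rate limits:

```python
import statistics
import time


def median_latency(fn, runs=5, warmup=1):
    """Return the median wall-clock seconds of fn() over `runs` timed calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are made but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Inside the loop above, this could be called as `median_latency(lambda: test_client.chat(prompt), runs=3)`.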

## 7. When to Use Groq

Groq is ideal for:

- **Real-time applications**: Chatbots, live coding assistants
- **High-throughput**: Processing many requests quickly
- **Prototyping**: Fast iteration during development
- **Cost-sensitive**: Free tier + open models

Consider other providers when:

- You need the absolute best quality (try Claude or GPT-4)
- You need multimodal capabilities (images, audio)
- You need very long context windows (1M+ tokens)

## Summary

You've learned:

1. **Speed**: Groq is incredibly fast - often sub-second responses
2. **Models**: Access to Llama, Mixtral, Gemma, and more
3. **Usage**: Same familiar interface as other providers
4. **Streaming**: Real-time output for better UX

## Next Steps

- Try the [Ollama notebook](05_ollama.ipynb) for local LLM inference
- Check out [06_comparison.ipynb](06_comparison.ipynb) for side-by-side comparisons
- Explore Groq for building real-time AI applications