# ChatRoutes AutoBranch - Creative Writing Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chatroutes/chatroutes-autobranch/blob/master/notebooks/creative_writing_colab.ipynb)

This notebook demonstrates **chatroutes-autobranch** for creative writing scenarios using:
- ✅ **100% FREE** local LLMs via Ollama
- ✅ **FREE** embeddings via sentence-transformers
- ⚡ **Optional GPU acceleration** (see cost warning below)

---

## ⚠️ GPU vs CPU: Cost & Performance

| Runtime | Speed (per response) | Cost | Best For |
|---------|---------------------|------|----------|
| **CPU (Free)** | ~40-50s | $0 | Testing, learning |
| **GPU (Free Tier)** | ~1-3s | $0 (limited hours/day) | Quick demos |
| **GPU (Colab Pro)** | ~1-3s | $10/month | Regular use |
| **GPU (Colab Pro+)** | ~0.5-1s | $50/month | Heavy use |

**💡 Recommendation**: Start with **CPU (free, no limits)** for learning. Upgrade to GPU if you need speed.

---

## 🔧 Cell 1: Environment Setup & GPU Detection

In [None]:
import os
import sys
import subprocess

# Detect GPU
def check_gpu():
    """Check if GPU is available and show details."""
    try:
        result = subprocess.run(['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'],
                                capture_output=True, text=True, timeout=5)
        if result.returncode == 0 and result.stdout.strip():
            gpu_info = result.stdout.strip()
            print("✅ GPU DETECTED:")
            print(f"   {gpu_info}")
            print("\n⚡ GPU will significantly speed up inference:")
            print("   - Ollama models: 20-40x faster")
            print("   - Embeddings: 5-10x faster")
            print("   - Total time: ~5-10 minutes (vs 30-40 minutes on CPU)")
            return True
        else:
            print("ℹ️  No GPU detected - using CPU")
            print("\n📊 CPU Performance (expected):")
            print("   - Ollama llama3.1:8b: ~40-50s per response")
            print("   - Total time: ~30-40 minutes for all scenarios")
            print("\n💡 To enable GPU: Runtime → Change runtime type → GPU")
            return False
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # nvidia-smi is missing, or it hung past the 5-second timeout
        print("ℹ️  No GPU detected - using CPU")
        print("\n📊 CPU Performance: ~30-40 minutes total")
        return False

has_gpu = check_gpu()

# Show cost warning for GPU
if has_gpu:
    print("\n⚠️  GPU COST WARNING:")
    print("   - Free tier: Limited GPU hours per day")
    print("   - Colab Pro: $10/month for more GPU hours")
    print("   - This notebook will use ~10-15 minutes of GPU time")
    print("\n   To switch to CPU (free, unlimited):")
    print("   Runtime → Change runtime type → None")
else:
    print("\n✅ CPU is completely FREE with no usage limits!")
    print("   Just slower - perfect for learning and testing.")

## 📦 Cell 2: Install Dependencies

This cell installs:
1. **chatroutes-autobranch** - The main library
2. **sentence-transformers** - Free embeddings
3. **Ollama** - Free local LLM server

⏱️ **First run**: ~2-3 minutes (downloads packages)

In [None]:
# Install chatroutes-autobranch and dependencies
print("📦 Installing chatroutes-autobranch...")
!pip install -q --upgrade chatroutes-autobranch sentence-transformers requests
print("✅ Python packages installed!")

# Install Ollama
print("\n🦙 Installing Ollama...")
!curl -fsSL https://ollama.com/install.sh | sh > /dev/null 2>&1
print("✅ Ollama installed!")
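
Because the installer output above is sent to `/dev/null` and the success message prints regardless of the exit status, a quick check afterwards can catch a silent failure. The following is a minimal sketch using only the standard library; `find_missing_tools` is a hypothetical helper name, not part of chatroutes-autobranch.

```python
import shutil
from importlib import metadata


def find_missing_tools(binaries, packages):
    """Return the binaries not on PATH and the pip packages not installed."""
    missing_bins = [b for b in binaries if shutil.which(b) is None]
    missing_pkgs = []
    for pkg in packages:
        try:
            metadata.version(pkg)
        except metadata.PackageNotFoundError:
            missing_pkgs.append(pkg)
    return missing_bins, missing_pkgs


bins, pkgs = find_missing_tools(
    ["ollama"],
    ["chatroutes-autobranch", "sentence-transformers", "requests"],
)
print("Missing binaries:", bins or "none")
print("Missing packages:", pkgs or "none")
```

If either list is non-empty, re-running the install cell without the output redirection should show the underlying error.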

## 🚀 Cell 3: Start Ollama Server

Ollama runs as a background server. This cell:
1. Starts the Ollama server
2. Waits for it to be ready
3. Verifies the connection

⚠️ **Note**: The server keeps running until the Colab session ends or you stop it in Cell 8.

In [None]:
import subprocess
import time
import requests

print("🚀 Starting Ollama server...")

# Start Ollama in background
ollama_process = subprocess.Popen(
    ['ollama', 'serve'],
    # Discard server logs: a PIPE that is never read can fill up and block the server
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)

# Wait for server to start
max_retries = 30
for i in range(max_retries):
    try:
        response = requests.get('http://localhost:11434/api/tags', timeout=2)
        if response.status_code == 200:
            print("✅ Ollama server is ready!")
            break
    except requests.exceptions.RequestException:
        pass  # server is not accepting connections yet
    time.sleep(1)
    if (i + 1) % 5 == 0:
        print(f"   Waiting for server... ({i+1}/{max_retries}s)")
else:
    print("❌ Failed to start Ollama server")
    raise RuntimeError("Ollama server failed to start")

## 🤖 Cell 4: Download LLM Model

Choose a model based on your needs:

| Model | Size | Speed (CPU) | Speed (GPU) | Quality | RAM |
|-------|------|-------------|-------------|---------|-----|
| **llama3.1:8b** | 4.9 GB | ~40s | ~1s | Good | 8GB |
| **qwen3:14b** | 9.3 GB | ~80s | ~3s | Excellent | 16GB |
| **gpt-oss:20b** | 13 GB | ~120s | ~5s | Best | 24GB |

**💡 Recommendation**: Use **llama3.1:8b** for speed, **qwen3:14b** for quality.

⏱️ **Download time**: 2-5 minutes (one-time per session)
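
The RAM column in the table above can be turned into a simple selection rule. The sketch below reads total physical memory through `os.sysconf` (available on Linux, which Colab uses) and picks the largest listed model whose requirement fits. `pick_model` and `total_ram_gb` are hypothetical helper names, and the thresholds are copied from the table rather than measured.

```python
import os

# (model, RAM needed in GB), copied from the table above, largest first
MODEL_RAM_GB = [("gpt-oss:20b", 24), ("qwen3:14b", 16), ("llama3.1:8b", 8)]


def total_ram_gb():
    """Total physical memory in GB (Linux only, via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def pick_model(ram_gb):
    """Return the largest model whose RAM requirement fits, else None."""
    for name, needed in MODEL_RAM_GB:
        if ram_gb >= needed:
            return name
    return None


print(pick_model(total_ram_gb()))
```

The reported total is usually a little below the machine's nominal size, so a result of `None` on a nominally 8 GB machine does not necessarily mean the smallest model cannot run. On a GPU runtime, available VRAM matters as well, which this sketch does not check.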

In [None]:
import subprocess

# CONFIGURE YOUR MODEL HERE
MODEL = "llama3.1:8b"  # Options: "llama3.1:8b", "qwen3:14b", "gpt-oss:20b"

print(f"📥 Downloading {MODEL}...")
print("   This is a one-time download per Colab session.")
print("   ⏱️  Expected time: 2-5 minutes\n")

result = subprocess.run(
    ['ollama', 'pull', MODEL],
    capture_output=False,
    text=True
)

if result.returncode == 0:
    print(f"\n✅ {MODEL} downloaded successfully!")
    
    # Test generation
    print("\n🧪 Testing model with quick generation...")
    test_result = subprocess.run(
        ['ollama', 'run', MODEL, 'Say hello in one sentence.'],
        capture_output=True,
        text=True,
        timeout=120
    )
    
    if test_result.returncode == 0:
        print("✅ Model is working!")
        print(f"   Response: {test_result.stdout[:100]}...")
    else:
        print("⚠️  Model test failed, but continuing...")
else:
    print(f"\n❌ Failed to download {MODEL}")
    raise RuntimeError(f"Failed to download {MODEL}")

## 📥 Cell 5: Download Embedding Models

Sentence-transformers will download embedding models on first use:

| Model | Size | Dimension | Quality Score |
|-------|------|-----------|---------------|
| jina-embeddings-v2-base-en | 560 MB | 768D | 60.3 |
| all-mpnet-base-v2 | 420 MB | 768D | 57.8 |
| bge-large-en-v1.5 | 1.2 GB | 1024D | 59.5 |

**These download automatically when needed** - no action required!

⏱️ **Download time**: ~1-2 minutes per model (one-time)

In [None]:
from sentence_transformers import SentenceTransformer
import torch

print("📊 Embedding Model Information:")
print("\nModels will download automatically on first use.")
print("Each model downloads once per session (~1-2 min each).\n")

# Show device info
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"🖥️  Embeddings will use: {device.upper()}")

if device == 'cuda':
    print("   ⚡ GPU will accelerate embeddings 5-10x!")
else:
    print("   📊 CPU embeddings are still fast (~1-2s per batch)")

print("\n✅ Ready to generate embeddings!")

## 🎨 Cell 6: Run Creative Writing Demo

This cell runs 4 creative writing scenarios demonstrating different features:

1. **AI Memory Story** - High diversity across genres
2. **Mars Detective Twists** - Clustering similar plot ideas
3. **Rom-Com Endings** - Entropy-based stopping
4. **Style Variations** - Intent alignment
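
This notebook does not show how the library implements entropy-based stopping. As a generic illustration of the idea only, the sketch below computes the Shannon entropy of a set of cluster labels and treats low entropy as a sign that new candidates no longer add variety. The function names and the 0.5-bit threshold are assumptions made for this example, not the library's API or defaults.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the distribution of cluster labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_stop(labels, min_entropy_bits=0.5):
    """Stop branching once candidates collapse into few clusters."""
    return label_entropy(labels) < min_entropy_bits


print(label_entropy(["a", "a", "b", "b"]))  # two equally sized clusters
print(should_stop(["a", "a", "a", "a"]))    # every candidate in one cluster
```

Two equally sized clusters give exactly 1 bit of entropy, while a single cluster gives 0, so the second call reports that branching should stop.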

⏱️ **Expected runtime**:
- **GPU**: ~5-10 minutes
- **CPU**: ~30-40 minutes

**💡 Tip**: You can continue working in other tabs while this runs!

In [None]:
# Download the example script
!wget -q https://raw.githubusercontent.com/chatroutes/chatroutes-autobranch/master/examples/creative_writting_usage.py

# Run the demo
print("🎨 Starting Creative Writing Demo...\n")
print("=" * 80)

!python creative_writting_usage.py

## 📊 Cell 7: Performance Comparison (Optional)

Compare CPU vs GPU performance with a quick benchmark.
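
A single timed call can be skewed by one-off costs such as loading a model into memory. One way to reduce that noise, sketched here under the assumption that a few extra runs are affordable, is to repeat the call and report the median. `time_call` is a hypothetical helper, and the workload is a stand-in.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; swap in the Ollama request or model.encode(texts)
elapsed = time_call(lambda: sum(range(100_000)))
print(f"median: {elapsed:.4f}s")
```

`time.perf_counter` is used instead of `time.time` because it is monotonic and intended for measuring short intervals.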

In [None]:
import requests
import time
import torch
from sentence_transformers import SentenceTransformer

print("⚡ Performance Benchmark\n")
print("=" * 80)

# Test 1: Ollama inference
print("\n🦙 Test 1: Ollama LLM Generation")
print(f"   Model: {MODEL}")
print("   Prompt: 'Write one sentence about AI.'\n")

start = time.time()
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': MODEL,
        'prompt': 'Write one sentence about AI.',
        'stream': False
    },
    timeout=120
)
ollama_time = time.time() - start
response.raise_for_status()  # fail loudly if the server returned an error

print(f"   ⏱️  Time: {ollama_time:.2f}s")
print(f"   📝 Response: {response.json().get('response', '')[:100]}...")

# Test 2: Embeddings
print("\n🔢 Test 2: Sentence Embeddings")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"   Device: {device.upper()}")
print("   Model: all-mpnet-base-v2")
print("   Texts: 10 sentences\n")

model = SentenceTransformer('all-mpnet-base-v2', device=device)
texts = [f"This is test sentence number {i}." for i in range(10)]

start = time.time()
embeddings = model.encode(texts)
embed_time = time.time() - start

print(f"   ⏱️  Time: {embed_time:.3f}s")
print(f"   📊 Generated: {len(embeddings)} embeddings of {len(embeddings[0])}D")

# Summary
print("\n" + "=" * 80)
print("📊 BENCHMARK SUMMARY")
print("=" * 80)
print(f"\nLLM Generation ({MODEL}): {ollama_time:.2f}s per response")
print(f"Embeddings (10 texts):     {embed_time:.3f}s")
print(f"\nRuntime: {'GPU ⚡' if device == 'cuda' else 'CPU 🐢'}")

if device == 'cpu':
    print("\n💡 Switch to GPU for ~20-40x speedup!")
    print("   Runtime → Change runtime type → GPU")
else:
    print("\n✅ Using GPU acceleration!")

## 🧹 Cell 8: Cleanup (Optional)

Stop the Ollama server and free up memory.
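
The cleanup cell sends `SIGTERM` and only prints a warning if the server has not exited within 5 seconds. A more defensive variant, sketched below, escalates to a forced kill after the timeout. `stop_process` is a hypothetical helper, and the demo uses a sleeping Python child process as a stand-in for the Ollama server.

```python
import subprocess
import sys


def stop_process(proc, timeout=5):
    """Ask a child process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


# Demo with a stand-in child process instead of the Ollama server
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("exit code:", stop_process(child))
```

In the notebook, the equivalent call would be `stop_process(ollama_process)`.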

In [None]:
import signal

print("🧹 Cleaning up...\n")

# Stop Ollama
try:
    ollama_process.send_signal(signal.SIGTERM)
    ollama_process.wait(timeout=5)
    print("✅ Ollama server stopped")
except Exception:  # includes a wait timeout or ollama_process not being defined
    print("⚠️  Could not stop Ollama (may already be stopped)")

print("\n✅ Cleanup complete!")
print("\n💡 To restart, run the cells again from Cell 3.")

---

## 💾 How Model Downloads Work in Colab

### Ollama Models
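- **Location**: typically `/root/.ollama/models/` when `ollama serve` is started manually as root, as in Cell 3 (`/usr/share/ollama/.ollama/models/` applies when Ollama runs as a system service)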
- **Persistence**: Lost when runtime disconnects
- **Download**: 2-5 minutes per model
- **Re-download**: Required each new session

### Sentence-Transformers
- **Location**: `/root/.cache/huggingface/`
- **Persistence**: Lost when runtime disconnects
- **Download**: 1-2 minutes per model (auto on first use)
- **Re-download**: Required each new session

### Tips for Faster Startup

1. **Use smaller models**:
   ```python
   MODEL = "llama3.1:8b"  # 4.9 GB, fastest
   ```

2. **Mount Google Drive** (advanced):
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   # Cache models to Drive (persists across sessions)
   ```

3. **Colab Pro**:
   - Faster downloads
   - Longer session timeouts
   - More GPU availability
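
For tip 2, one possible approach is to point both caches at a Drive folder through environment variables. `OLLAMA_MODELS` is read by the Ollama server and `HF_HOME` by the Hugging Face libraries that sentence-transformers builds on. The folder path and the `cache_env` helper are assumptions for this sketch.

```python
import os


def cache_env(drive_root="/content/drive/MyDrive/model-cache"):
    """Build environment variables that redirect model caches to Drive."""
    return {
        "OLLAMA_MODELS": os.path.join(drive_root, "ollama"),
        "HF_HOME": os.path.join(drive_root, "huggingface"),
    }


env = cache_env()
os.environ.update(env)  # child processes started later inherit these
for name, path in env.items():
    print(f"{name}={path}")
```

The variables need to be set after mounting Drive but before Cell 3 starts `ollama serve` and before `sentence_transformers` is imported. Reading multi-gigabyte files from Drive can be slow, so this is not guaranteed to beat a fresh download.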

---

## 📚 Additional Resources

- [ChatRoutes AutoBranch GitHub](https://github.com/chatroutes/chatroutes-autobranch)
- [Ollama Documentation](https://ollama.ai/docs)
- [Sentence-Transformers Docs](https://www.sbert.net/)
- [Google Colab FAQ](https://research.google.com/colaboratory/faq.html)

---

## ❓ Troubleshooting

### Problem: "Ollama server not responding"
**Solution**: Restart Cell 3 (Start Ollama Server)
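
Before restarting, it can help to check whether a server is already answering, since a second `ollama serve` may fail to bind the port while an earlier one is still alive. A minimal sketch using only the standard library follows; `is_ollama_up` is a hypothetical helper, and the URL matches the one used in Cell 3.

```python
import urllib.error
import urllib.request


def is_ollama_up(base_url="http://localhost:11434", timeout=2):
    """Return True if the Ollama HTTP API answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print("Ollama running:", is_ollama_up())
```

If this prints `False`, re-run Cell 3. If it prints `True` but requests still fail, run Cell 8 first and then Cell 3.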

### Problem: "Model download failed"
**Solution**: 
```python
# Check available space
!df -h /

# Use smaller model
MODEL = "llama3.1:8b"  # Only 4.9 GB
```
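
The same free-space check can be done programmatically before pulling a model. The sketch below compares free disk space against the model sizes from the Cell 4 table. `has_room_for` and the 2 GB headroom are assumptions chosen for illustration.

```python
import shutil


def has_room_for(model_gb, path="/", headroom_gb=2.0):
    """Return True if `path` has space for the model plus some headroom."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= model_gb + headroom_gb


# Sizes taken from the model table in Cell 4
for name, size_gb in [("llama3.1:8b", 4.9), ("qwen3:14b", 9.3), ("gpt-oss:20b", 13)]:
    print(f"{name}: {'ok' if has_room_for(size_gb) else 'not enough space'}")
```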

### Problem: "Out of memory"
**Solution**:
- Use smaller model (llama3.1:8b)
- Restart runtime: Runtime → Restart runtime
- For GPU: Use Colab Pro (more VRAM)

### Problem: "GPU not detected"
**Solution**: Runtime → Change runtime type → GPU (T4)

---

**Made with ❤️ using ChatRoutes AutoBranch**