# VeraGPT

In [None]:
# --- Set your GitHub repo URL ---
REPO_URL = "https://github.com/ankush357159/fusion-gpt.git"
REPO_DIR = "/content/fusion-gpt"

# Clone (or re-clone) the repo
import os

# Ensure we are in a stable directory before attempting to remove and clone
%cd /content

if os.path.isdir(REPO_DIR):
    !rm -rf "$REPO_DIR"
!git clone "$REPO_URL" "$REPO_DIR"

In [None]:
# Install veraGPT dependencies
%cd /content/fusion-gpt/veraGPT
!pip -q install -r requirements.txt

In [None]:
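# (Optional) If your model is gated or private, set your Hugging Face token
import os

HF_TOKEN = ""  # <- paste token, or leave blank for public models
if HF_TOKEN:  # only export the variable when a token was actually provided
    os.environ["HUGGINGFACE_HUB_TOKEN"] = HF_TOKEN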

### Check Runtime

**Colab CPU**: Use TinyLlama (smaller model)  
**Colab T4 GPU**: Can use any model (Mistral-7B recommended)

In [None]:
# Detect runtime and recommend model
import torch

if torch.cuda.is_available():
    # Report the actual device name instead of assuming a T4
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print("Recommended: any model (Mistral-7B, Phi-2, or TinyLlama)")
    RECOMMENDED_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
else:
    print("CPU only: no GPU available")
    print("WARNING: Large models (Mistral-7B) are likely to run out of memory on CPU!")
    print("Recommended: Use TinyLlama-1.1B (needs about 2-3 GB of RAM)")
    print("\nTo enable GPU: Runtime → Change runtime type → T4 GPU")
    RECOMMENDED_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"\nRecommended model: {RECOMMENDED_MODEL}")

In [None]:
# OPTION 1: Run a single prompt with the auto-detected model
%cd /content/fusion-gpt/veraGPT

# Use recommended model based on runtime
!python src/main.py --model "$RECOMMENDED_MODEL" --prompt "Write a short welcome message for veraGPT." --timing

### OPTION 2: Persistent Model Server (RECOMMENDED for Colab)

**Load the model once, then ask multiple questions without reloading.**  
Subsequent prompts can be **10-100x faster** because the model stays in memory instead of being reloaded for every prompt.
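The speed-up comes from paying the load cost only once. Below is a minimal sketch of that pattern, not the repo's `ModelServer`: `load_model` and `ask` are hypothetical stand-ins, and a short `time.sleep` simulates the checkpoint load.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model(name: str) -> dict:
    """Stand-in for an expensive model load (hypothetical; simulated with sleep)."""
    time.sleep(0.2)  # pretend this is the multi-second checkpoint load
    return {"name": name}


def ask(name: str, prompt: str) -> str:
    """Answer a prompt, reusing the cached model after the first call."""
    model = load_model(name)  # slow only on the first call for a given name
    return f"[{model['name']}] echo: {prompt}"


start = time.perf_counter()
first = ask("demo-model", "hello")
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second = ask("demo-model", "hello again")
second_elapsed = time.perf_counter() - start

print(f"first call:  {first_elapsed:.3f}s")
print(f"second call: {second_elapsed:.3f}s")
```

The first call includes the simulated load and the second skips it, which is the same reason a loaded server answers follow-up prompts faster than re-running the script.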

In [None]:
# Load the model ONCE (uses recommended model based on your runtime)
%cd /content/fusion-gpt/veraGPT
import sys
sys.path.insert(0, '/content/fusion-gpt/veraGPT/src')

from server import ModelServer
from config import Config

# Build config with recommended model
cfg = Config.from_env()
cfg.model.model_name_or_path = RECOMMENDED_MODEL

# Initialize and load model
server = ModelServer(cfg)
server.load()  # Takes ~15s for TinyLlama, ~60s for Mistral-7B

print(f"\nModel '{RECOMMENDED_MODEL}' loaded! Now you can ask questions quickly.")

In [None]:
# Ask a question (FAST - no model reloading!)
response = server.ask(
    "Please explain Newton's second law of motion",
    show_timing=True
)
print(response)

In [None]:
# Ask another question (still FAST!)
response = server.ask(
    "What is quantum entanglement?",
    show_timing=True
)
print(response)

## Notes
- For quantized loading, add `--quant 4` or `--quant 8` to the `src/main.py` command (CUDA only).
- To load a LoRA adapter, add `--lora-path /path/to/adapter` to the same command.
- Interactive mode is not ideal in Colab; prefer the single-prompt cell or the persistent model server.
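As an illustration of how these flags combine with the single-prompt command, here is a small sketch that assembles the argument list. `build_command` is a hypothetical helper, not part of the repo. It uses only the flags named in this notebook, and it assumes (unverified) that they can be passed together.

```python
import shlex
from typing import List, Optional


def build_command(
    model: str,
    prompt: str,
    quant: Optional[int] = None,
    lora_path: Optional[str] = None,
    timing: bool = False,
) -> List[str]:
    """Assemble an argv list for src/main.py from the flags named in this notebook."""
    if quant not in (None, 4, 8):
        raise ValueError("quant must be 4 or 8")
    cmd = ["python", "src/main.py", "--model", model, "--prompt", prompt]
    if quant is not None:
        cmd += ["--quant", str(quant)]  # CUDA only, per the note above
    if lora_path:
        cmd += ["--lora-path", lora_path]
    if timing:
        cmd.append("--timing")
    return cmd


cmd = build_command(
    "mistralai/Mistral-7B-Instruct-v0.2",
    "Write a short welcome message for veraGPT.",
    quant=4,
    timing=True,
)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting after `!`
```

Building the command as a list and quoting it with `shlex.join` keeps prompts that contain spaces or quotes intact when the line is pasted into a shell cell.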

### Troubleshooting: OOM Errors on CPU

### Problem: Process killed during model loading
```
Loading checkpoint shards:  33% 1/3 [00:24<00:49, 24.59s/it]^C
```

### Cause:
- **Mistral-7B needs ~14-18 GB RAM**
- **The Colab CPU runtime has only ~12 GB RAM**
- The process is killed when it runs out of memory (OOM)
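The 14-18 GB figure is consistent with a back-of-envelope estimate: number of parameters times bytes per weight, plus some headroom for activations and the KV cache. The sketch below covers the weights only. The parameter counts are approximate nominal sizes, and the assumption that the model loads in 16-bit precision is not confirmed by this notebook.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Approximate parameter counts (assumed nominal sizes, not measured).
MODELS = {
    "TinyLlama-1.1B": 1.1e9,
    "Phi-2": 2.7e9,
    "Mistral-7B": 7.2e9,
}

for name, n_params in MODELS.items():
    fp16 = weight_memory_gb(n_params, 2)    # 16-bit weights
    int4 = weight_memory_gb(n_params, 0.5)  # 4-bit weights, ignoring overhead
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Under these assumptions the 16-bit weights of Mistral-7B alone already exceed the ~12 GB available on the Colab CPU runtime, before any working memory is counted.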

### Solutions:

**Option 1: Switch to GPU (Best)**
```
1. Runtime → Change runtime type → T4 GPU
2. Re-run all cells (changing the runtime resets the session)
3. Mistral-7B should now load
```

**Option 2: Use Smaller Model on CPU**
```python
# Already configured: the "Detect runtime and recommend model" cell picks TinyLlama on CPU
# Just run the cells from there onward as usual
```

**Option 3: Manual Override**
```python
# Force TinyLlama even on GPU (for faster responses).
# Run this after the detection cell and before loading the model.
RECOMMENDED_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

### Memory Requirements:

| Model | RAM Needed | Colab CPU | Colab T4 GPU |
|-------|-----------|-----------|--------------|
| TinyLlama-1.1B | 2-3 GB | ✅ Works | ✅ Fast (15 tok/s) |
| Phi-2 | 6-8 GB | ⚠️ Tight | ✅ Works (12 tok/s) |
| Mistral-7B | 14-18 GB | ❌ OOM | ✅ Works (3 tok/s) |

### Speed Comparison:

| Model | CPU (time per response) | T4 GPU (time per response) |
|-------|-------------------------|----------------------------|
| TinyLlama | 2-5 min | 5-10 s |
| Mistral-7B | ❌ Crashes (OOM) | 30-60 s |
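
The T4 response times follow roughly from the throughput figures in the memory table: time is approximately the number of generated tokens divided by tokens per second. Here is a small sketch of that arithmetic, assuming a response of about 100 tokens (an illustrative length, not a measured one).

```python
def response_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Estimated generation time, ignoring prompt processing and model load."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


# T4 throughput figures taken from the memory table in this notebook.
T4_TOKENS_PER_SECOND = {"TinyLlama-1.1B": 15, "Phi-2": 12, "Mistral-7B": 3}

ASSUMED_RESPONSE_TOKENS = 100  # illustrative assumption

for name, rate in T4_TOKENS_PER_SECOND.items():
    seconds = response_seconds(ASSUMED_RESPONSE_TOKENS, rate)
    print(f"{name}: ~{seconds:.0f}s for {ASSUMED_RESPONSE_TOKENS} tokens")
```

Longer answers scale linearly, so capping the number of generated tokens is the most direct way to shorten response time on a given runtime.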