# Multi-Model Inference for Paper Replication

This notebook runs inference across all 7 models from the paper using the Together AI API.

**Models:**
1. Llama2-Chat-7b
2. Mistral-Instruct-7b
3. SOLAR-10.7b
4. Llama2-Chat-13b
5. Vicuna-13b
6. Mixtral-Instruct-8x7b
7. WizardLM-13b

**Setup Required:**
1. Get a Together AI API key from https://api.together.xyz/
2. Upload your `query_data.jsonl` to `/content/data/`
3. Set your API key in the cell below
4. Run all cells

In [None]:
# Install dependencies
!pip install transformers accelerate torch sentencepiece bitsandbytes aiohttp requests --quiet

In [None]:
# Set your Together AI API key here
import os
os.environ["TOGETHER_API_KEY"] = "your-api-key-here"  # Replace with your actual API key

# Verify it's set
if os.environ.get("TOGETHER_API_KEY") == "your-api-key-here":
    print("⚠️  WARNING: Please set your actual Together AI API key above!")
else:
    print("✓ API key is set")

In [None]:
# Check that the multi-model inference files are present.
# If you are running from a cloned repo, they should already exist; otherwise, upload them.
import os
files_needed = ["model_registry.py", "together_client.py", "run_multi_model.py"]
missing = [f for f in files_needed if not os.path.exists(f)]

if missing:
    print(f"Missing files: {missing}")
    print("Please upload these files from the reproduce/ folder")
else:
    print("✓ All required files present")

In [None]:
# Verify data file exists
import os
data_path = "/content/data/query_data.jsonl"

if not os.path.exists(data_path):
    print(f"⚠️  Data file not found at {data_path}")
    print("Please upload your query_data.jsonl to /content/data/")
    
    # Create directory
    os.makedirs("/content/data", exist_ok=True)
else:
    # Count queries
    with open(data_path) as f:
        n_queries = sum(1 for _ in f)
    print(f"✓ Found {n_queries} queries in {data_path}")
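
The line count above does not confirm that every line is valid JSON. A minimal sketch of a stricter check follows; `validate_jsonl` is a hypothetical helper, not part of the repository, and it assumes only that the file holds one JSON object per line:

```python
import json


def validate_jsonl(path):
    """Return (n_valid, bad_line_numbers) for a JSONL file.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    n_valid = 0
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore empty lines, e.g. a trailing newline
            try:
                json.loads(line)
                n_valid += 1
            except json.JSONDecodeError:
                bad_lines.append(lineno)  # 1-based line number of the bad record
    return n_valid, bad_lines
```

Calling `validate_jsonl("/content/data/query_data.jsonl")` before starting a run can surface malformed lines early, before any API calls are made.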

## Option 1: Run All Models with API (Recommended)

This runs all 7 models through the Together AI API. It is the fastest and most reliable option.

**Estimated time:** 15-30 minutes  
**Estimated cost:** $0.10-0.30

In [None]:
# Run all models with API
!python run_multi_model.py --models all --backend api --queries-path /content/data/query_data.jsonl --output-dir /content/outputs

## Option 2: Run Specific Models

Run only the models you need:

In [None]:
# List available models
!python run_multi_model.py --list-models

In [None]:
# Run specific models (example: just Mistral and Llama2-7B)
!python run_multi_model.py --models mistral-7b-instruct llama2-7b-chat --backend api --queries-path /content/data/query_data.jsonl --output-dir /content/outputs

## Option 3: Hybrid Approach

Run small models locally (free) and large models via API:

In [None]:
# Auto-select backend based on model size
!python run_multi_model.py --models all --backend auto --queries-path /content/data/query_data.jsonl --output-dir /content/outputs

## Check Results

View the generated outputs:

In [None]:
# List output files
!ls -lh /content/outputs/
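
Beyond listing the files, it can help to confirm that each model produced one prediction per query. The sketch below counts non-empty lines in every `*_preds.jsonl` file; `count_predictions` is a hypothetical helper, and it assumes the `{model_key}_preds.jsonl` naming used in the sample-output cell of this notebook:

```python
import glob
import os


def count_predictions(output_dir):
    """Map each model key to the number of non-empty lines in its preds file.

    Hypothetical helper; assumes files are named "<model_key>_preds.jsonl".
    """
    suffix = "_preds.jsonl"
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "*" + suffix))):
        model_key = os.path.basename(path)[: -len(suffix)]
        with open(path, encoding="utf-8") as f:
            counts[model_key] = sum(1 for line in f if line.strip())
    return counts


# Only report when the output directory exists (i.e. after a run).
if os.path.isdir("/content/outputs"):
    for key, n in count_predictions("/content/outputs").items():
        print(f"{key}: {n} predictions")
```

A model whose count is lower than the number of queries probably stopped partway through and may need to be rerun.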

In [None]:
# View sample outputs from a model
import json

model_key = "mistral-7b-instruct"  # Change this to view other models
output_file = f"/content/outputs/{model_key}_preds.jsonl"

print(f"Sample outputs from {model_key}:")
print("="*80)

with open(output_file, "r") as f:
    for i, line in enumerate(f):
        if i >= 3:  # Show first 3 examples
            break
        obj = json.loads(line)
        print(f"\nExample {i+1}:")
        print(f"Anchor: {obj['anchor'][:100]}...")
        print(f"Output (first 300 chars): {obj['output'][:300]}...")
        print(f"Backend: {obj['backend']}")
        print(f"Use chat template: {obj['use_chat_template']}")
        print("-"*80)

## Download Results

Download all output files to your local machine:

In [None]:
# Zip all outputs for easy download
!zip -r /content/all_model_outputs.zip /content/outputs/

from google.colab import files
files.download("/content/all_model_outputs.zip")

## Troubleshooting

**"TOGETHER_API_KEY not found"**  
Make sure you set your API key in the API key cell near the top of this notebook and ran that cell.

**"Model not found" or License Issues**  
Some models require accepting licenses on HuggingFace. For Llama2-based models:
1. Go to https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
2. Accept the license
3. Login: `!huggingface-cli login`

**"CUDA out of memory"**  
If running locally, reduce the batch size or switch to the API backend (`--backend api`).

**Spacing issues in output**  
The code uses v2 with the per-example slicing fix. Check that outputs have `"use_chat_template": true`.
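
The `use_chat_template` check can be automated rather than inspected by eye. A minimal sketch, assuming each output record is a JSON object with a boolean `use_chat_template` field as shown in the sample-output cell; `find_missing_chat_template` is a hypothetical helper, not part of the repository:

```python
import json


def find_missing_chat_template(path):
    """Return 1-based line numbers whose "use_chat_template" is not True.

    Hypothetical helper; records lacking the field are flagged as well.
    """
    flagged = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("use_chat_template") is not True:
                flagged.append(lineno)
    return flagged
```

An empty list from `find_missing_chat_template("/content/outputs/mistral-7b-instruct_preds.jsonl")` means every record in that file was generated with the chat template enabled.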

For more help, see `MULTI_MODEL_GUIDE.md` in the repository.