# vLLM Remote Testing - Clean Setup

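This notebook tests vLLM queries through the WebIDE backend running on EC2. It was first written for a remote Jupyter server on EC2; the simplified setup below runs it locally instead.

## Original Setup Requirements (replaced by the simplified setup below):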
1. Jupyter server running on EC2
2. SSH tunnel active
3. WebIDE running on EC2

## SIMPLIFIED Setup - No SSH Needed!

**Since SSH authentication is problematic, let's use the simple approach:**

### Step 1: Use This Notebook Locally
- Just run this notebook in VS Code locally (no SSH tunnel needed)
- It connects directly to your WebIDE on EC2, so the WebIDE address must be reachable from your machine (`10.230.0.178` is a private address)

### Step 2: Make Sure WebIDE is Running
- Your WebIDE should be running on: `http://10.230.0.178:3001`
- That's it! No SSH, no tunnels, no complications.

### Why This Works:
- Tests the exact same `/api/vllm/completions` endpoint
- Uses the same request format as your WebIDE frontend
- Much simpler setup!
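
The request format mentioned above can be sketched as a small helper. This is a minimal sketch, assuming the backend accepts an OpenAI-style completions payload and the `<fim_prefix>` / `<fim_suffix>` / `<fim_middle>` sentinel tokens used in the completion test cell of this notebook. `build_fim_request` is a hypothetical helper name, and the sampling defaults simply mirror that test cell rather than any documented WebIDE setting.

```python
from typing import Any, Dict


def build_fim_request(code: str, cursor_pos: int, model: str,
                      max_tokens: int = 128) -> Dict[str, Any]:
    """Build a completions payload for a fill-in-the-middle request (hypothetical helper)."""
    # Clamp the cursor so slicing always splits inside the code string.
    cursor_pos = max(0, min(cursor_pos, len(code)))
    prefix, suffix = code[:cursor_pos], code[cursor_pos:]
    return {
        "model": model,
        "prompt": f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>",
        "echo": False,
        "n": 1,
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "top_p": 0.95,
    }


# "example-model" is a placeholder, not a model served by the real backend.
payload = build_fim_request("int add(int a, int b) {  }", 23, "example-model")
print(payload["prompt"])
```

Building the payload in one place keeps the notebook and any later validation loop sending the same fields, which makes differences in results easier to attribute to the model rather than to the request.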

In [3]:
# SIMPLIFIED: Connect directly to WebIDE on EC2 (no SSH tunnel)
import requests
import json
from typing import Dict, Any, List

# Direct connection to your WebIDE on EC2
BASE_URL = "http://10.230.0.178:3001"
COMPLETIONS_ENDPOINT = f"{BASE_URL}/api/vllm/completions"
MODELS_ENDPOINT = f"{BASE_URL}/api/vllm/models"

HEADERS = {
    "Content-Type": "application/json"
}

print("🎯 SIMPLE APPROACH: Local Jupyter → EC2 WebIDE")
print(f"🌐 WebIDE URL: {BASE_URL}")
print(f"🚀 Testing endpoint: {COMPLETIONS_ENDPOINT}")
print("📝 No SSH tunnels, no complications!")

# Test connection
try:
    response = requests.get(BASE_URL, timeout=10)
    print(f"✅ WebIDE is reachable! Status: {response.status_code}")
except requests.RequestException as e:
    # Covers connection errors, timeouts, and invalid URLs raised by requests
    print(f"❌ Cannot reach WebIDE: {e}")
    print("🔧 Make sure WebIDE is running on EC2")

🎯 SIMPLE APPROACH: Local Jupyter → EC2 WebIDE
🌐 WebIDE URL: http://10.230.0.178:3001
🚀 Testing endpoint: http://10.230.0.178:3001/api/vllm/completions
📝 No SSH tunnels, no complications!
✅ WebIDE is reachable! Status: 200


In [4]:
# Check available models (using the working URL)
import time

def get_available_models() -> List[str]:
    """Get available models from WebIDE backend."""
    try:
        print(f"🔍 Checking models: {MODELS_ENDPOINT}")
        response = requests.get(MODELS_ENDPOINT, headers=HEADERS, timeout=10)
        response.raise_for_status()
        
        models_data = response.json()
        models = [model['id'] for model in models_data.get('data', [])]
        
        print(f"✅ Found {len(models)} models:")
        for i, model in enumerate(models, 1):
            print(f"  {i}. {model}")
        
        return models
    
    except Exception as e:
        print(f"❌ Error: {e}")
        return []

# Fetch the model list once and reuse it when this cell is re-run.
# `_models_fetched_at` records when the list was last fetched (epoch seconds).
if globals().get('available_models'):
    age = time.time() - globals().get('_models_fetched_at', time.time())
    print(f"✅ Using cached models: {available_models}")
    print(f"   (Found {len(available_models)} models, fetched {age:.1f}s ago)")
else:
    print("🔧 Fetching available models...")
    available_models = get_available_models()
    _models_fetched_at = time.time()

🔧 Fetching available models...
🔍 Checking models: http://10.230.0.178:3001/api/vllm/models
❌ Error: 500 Server Error: Internal Server Error for url: http://10.230.0.178:3001/api/vllm/models
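
When the models endpoint fails as it does here, the model list stays empty and the completion cell falls back to the name `"default"`. The sketch below separates the parsing from the fallback so each can be checked without a live backend. It assumes the OpenAI-style listing shape (`{"data": [{"id": ...}]}`) that the cell above already parses; `extract_model_ids` and `pick_model` are hypothetical helper names.

```python
from typing import Any, List


def extract_model_ids(payload: Any) -> List[str]:
    """Pull model ids out of an OpenAI-style model listing, tolerating odd shapes."""
    if not isinstance(payload, dict):
        return []
    entries = payload.get("data")
    if not isinstance(entries, list):
        return []
    # Keep only entries that are dicts carrying a non-empty string id.
    return [
        entry["id"]
        for entry in entries
        if isinstance(entry, dict) and isinstance(entry.get("id"), str) and entry["id"]
    ]


def pick_model(model_ids: List[str], fallback: str = "default") -> str:
    """Return the first listed model, or the fallback when the listing is empty."""
    return model_ids[0] if model_ids else fallback


# A made-up listing used only to exercise the helpers.
listing = {"data": [{"id": "example-model"}, {"name": "no-id"}]}
print(pick_model(extract_model_ids(listing)))
print(pick_model(extract_model_ids(None)))
```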


In [3]:
# Test exact WebIDE vLLM query
def test_vllm_completion(code: str, cursor_pos: int) -> Dict[str, Any]:
    """Test vLLM completion via WebIDE backend."""
    
    # Create FIM prompt (same as WebIDE)
    prefix = code[:cursor_pos]
    suffix = code[cursor_pos:]
    fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    
    # Use first available model
    model = available_models[0] if available_models else "default"
    
    request_body = {
        "model": model,
        "prompt": fim_prompt,
        "echo": False,
        "n": 1,
        "max_tokens": 128,
        "temperature": 0.2,
        "top_p": 0.95
    }
    
    print(f"🚀 Testing vLLM completion...")
    print(f"📝 Model: {model}")
    print(f"📍 Cursor position: {cursor_pos}")
    print(f"🔍 Prompt preview: {fim_prompt[:100]}...")
    
    try:
        response = requests.post(
            COMPLETIONS_ENDPOINT,
            headers=HEADERS,
            json=request_body,
            timeout=30
        )
        response.raise_for_status()
        
        result = response.json()
        
        print(f"✅ Success! Status: {response.status_code}")
        
        choices = result.get('choices', [])
        print(f"🎯 Generated {len(choices)} completion(s):")
        
        for i, choice in enumerate(choices, 1):
            text = choice.get('text', '')
            print(f"\nCompletion {i}:")
            print(f"'{text}'")
        
        return {"success": True, "result": result, "request": request_body}
        
    except Exception as e:
        print(f"❌ Request failed: {e}")
        return {"success": False, "error": str(e), "request": request_body}

# Test with the exact console log example
test_code = """void AutoRefill::HandleDeleteRequest()
void AutoRefill::HandleModifyRequest(const StrategyModifier& modifier)"""

cursor_position = len(test_code)  # End of code

print("=== Testing Exact WebIDE Example ===")
print(f"Code: {repr(test_code)}")
print(f"Cursor at position: {cursor_position}\n")

result = test_vllm_completion(test_code, cursor_position)

=== Testing Exact WebIDE Example ===
Code: 'void AutoRefill::HandleDeleteRequest()\nvoid AutoRefill::HandleModifyRequest(const StrategyModifier& modifier)'
Cursor at position: 109

🚀 Testing vLLM completion...
📝 Model: /data/starcoder2_7b_triple_Train_w8a8_optimal
📍 Cursor position: 109
🔍 Prompt preview: <fim_prefix>void AutoRefill::HandleDeleteRequest()
void AutoRefill::HandleModifyRequest(const Strate...
✅ Success! Status: 200
🎯 Generated 1 completion(s):

Completion 1:
'    TBDEBUG("HandleModifyRequest: " << modifier);
void AutoRefill::HandleValidateRequest(ValidationContext& context) const
void AutoRefill::HandleStreamOpen(const StreamIdentifier& stream) { TBDEBUG("HandleStreamOpen: " << stream); }
void AutoRefill::HandleStreamFailed(const StreamIdentifier& stream) { TBERROR("HandleStreamFailed: " << stream); }
void AutoRefill::HandleSnapshotEnd(const'
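
The completion above keeps going well past the line at the cursor and starts emitting further function signatures. One way to handle this on the client side is to trim the returned text, sketched below with a hypothetical `truncate_completion` helper. The OpenAI-style completions API also defines a `stop` parameter that vLLM supports, but whether the WebIDE proxy forwards it is not confirmed here, so this sketch only post-processes the text.

```python
def truncate_completion(text: str, max_lines: int = 1) -> str:
    """Keep at most `max_lines` lines of a completion, skipping leading blank lines."""
    if max_lines <= 0:
        return ""
    kept = []
    for line in text.splitlines():
        if not kept and not line.strip():
            continue  # ignore blank lines before any content
        kept.append(line)
        if len(kept) >= max_lines:
            break
    return "\n".join(kept)


# Shortened stand-in for the multi-line completion shown above.
sample = '    TBDEBUG("HandleModifyRequest: " << modifier);\nvoid AutoRefill::HandleValidateRequest()'
print(truncate_completion(sample))
```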


## Results Summary

If the above test works, you have successfully:
- ✅ Connected to WebIDE backend directly
- ✅ Tested vLLM completion with exact WebIDE format
- ✅ Verified the same request path as the frontend

This setup allows you to test vLLM queries exactly as the WebIDE frontend does!

## FIM Validation Results Summary

The validation loop below tests your WebIDE's vLLM backend against real FIM test cases from a JSONL file.

### What It Tests:
- ✅ **Loads FIM tasks** from `fim_singleline_dataset_deterministic.jsonl` (up to 1000 cases)
- ✅ **Sends each task** to your WebIDE backend exactly like the frontend does
- ✅ **Validates completions** against expected outputs
- ✅ **Provides metrics** on success rate and quality

### Key Metrics:
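- **Success Rate**: % of test cases whose completion matches the expected text (exact substring or partial word match)
- **Similarity Score**: How close generated text is to expected (0-1); the validation cell approximates this with EXACT / PARTIAL / NO MATCH classes rather than a numeric score
- **Performance Stats**: Average and total response times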
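
A numeric 0-1 similarity between expected and generated text can be computed in several ways. The sketch below is one option, using `difflib.SequenceMatcher` from the standard library; it is an illustration and not the scoring used by the validation cell. `similarity_score` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def similarity_score(expected: str, generated: str) -> float:
    """Return a 0-1 similarity between expected and generated text."""
    expected, generated = expected.strip(), generated.strip()
    if not expected and not generated:
        return 1.0  # two empty strings are treated as identical
    return SequenceMatcher(None, expected, generated).ratio()


print(similarity_score("Duration", "Duration"))
print(similarity_score("Duration", "Durations"))
```

Because the ratio depends on the combined length of both strings, a completion that starts correctly but runs on for many extra tokens scores low; trimming the generated text to the expected length before scoring is one way to compensate.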

### This Validates:
1. **WebIDE Backend** is properly handling vLLM requests
2. **vLLM Model** is generating reasonable completions
3. **Request Format** matches exactly what your frontend sends
4. **End-to-End Flow** from test cases → WebIDE → vLLM → results

Run the validation to see how well your setup performs on real FIM tasks!

In [None]:
# FIM VALIDATION - 1000 Test Cases (Simplified)
import json
import os
import time

print("🎯 Running FIM validation on 1000 test cases...")

# Check prerequisites (raise instead of calling exit(), which would shut down the notebook kernel)
if 'available_models' not in globals() or not available_models:
    raise RuntimeError("❌ Run previous cells first to initialize models")

# Load test cases
jsonl_path = "/Users/shi.yu/Documents/tbricks_dataviewer/fim_singleline_dataset_deterministic.jsonl"
if not os.path.exists(jsonl_path):
    raise FileNotFoundError(f"❌ Test file not found: {jsonl_path}")

# Load and parse test cases
test_cases = []
with open(jsonl_path, 'r', encoding='utf-8', errors='replace') as f:
    for i, line in enumerate(f):
        if i >= 1000:  # Load up to 1000 cases
            break
        line = line.strip()
        if line:
            try:
                test_cases.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # Skip malformed lines silently

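print(f"✅ Loaded {len(test_cases)} test cases")
if not test_cases:
    # The summary divides by len(test_cases), so stop early on an empty file
    raise RuntimeError("❌ No valid test cases found; nothing to validate")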

# Run validation
successful_tests = 0
failed_tests = 0
total_response_time = 0
results = []

for i, test_case in enumerate(test_cases, 1):
    # Parse FIM content
    content = test_case.get('content', '')
    if '<fim_middle>' not in content:
        failed_tests += 1
        continue
        
    parts = content.split('<fim_middle>')
    prompt = parts[0] + '<fim_middle>'
    expected = parts[1] if len(parts) > 1 else ""
    
    # Make API call
    try:
        request_body = {
            "model": available_models[0],
            "prompt": prompt,
            "max_tokens": 100,
            "temperature": 0.3,
            "echo": False
        }
        
        start_time = time.time()
        response = requests.post(COMPLETIONS_ENDPOINT, headers=HEADERS, json=request_body, timeout=20)
        response_time = (time.time() - start_time) * 1000
        
        response.raise_for_status()
        result = response.json()
        choices = result.get('choices', [])
        
        if choices:
            generated = choices[0].get('text', '').strip()
            
            # Check match quality
            if expected and expected.lower().strip() in generated.lower():
                match_type = "✅ EXACT"
            elif expected and any(word in generated.lower() for word in expected.lower().split()[:3]):
                match_type = "✅ PARTIAL"
            else:
                match_type = "❌ NO MATCH"
            
            if match_type == "❌ NO MATCH":
                failed_tests += 1
            else:
                # Time only successful tests, because the averages divide by successful_tests
                successful_tests += 1
                total_response_time += response_time
                
                
            # Store compact result (only for failed cases or every 100th)
            if match_type == "❌ NO MATCH" or i % 100 == 0:
                results.append(f"Test {i}: {match_type} ({response_time:.0f}ms)")
        else:
            results.append(f"Test {i}: ❌ NO OUTPUT")
            failed_tests += 1
            
    except Exception as e:
        if i % 100 == 0:  # Only log errors for milestone tests
            results.append(f"Test {i}: ❌ ERROR - {str(e)[:50]}")
        failed_tests += 1
    
    # Progress indicator every 50 tests
    if i % 50 == 0 or i == len(test_cases):
        success_rate = (successful_tests / i) * 100
        print(f"Progress: {i}/{len(test_cases)} | Success: {success_rate:.1f}% | Avg: {(total_response_time/max(successful_tests,1)):.0f}ms")
    
    # Rate limiting - shorter delay for large batch
    time.sleep(0.1)

# Summary
print("\n" + "="*60)
print("📊 VALIDATION RESULTS")
print("="*60)
print(f"✅ Success: {successful_tests}/{len(test_cases)} ({(successful_tests/len(test_cases))*100:.1f}%)")
print(f"❌ Failed: {failed_tests}/{len(test_cases)}")

if successful_tests > 0:
    avg_time = total_response_time / successful_tests
    print(f"⚡ Avg response: {avg_time:.0f}ms")
    print(f"⏱️ Total time: {(total_response_time/1000):.1f}s")

print(f"🤖 Model: {available_models[0]}")

# Overall assessment
success_rate = (successful_tests / len(test_cases)) * 100
if success_rate >= 70:
    print("🎉 EXCELLENT - WebIDE pipeline working well!")
elif success_rate >= 50:
    print("👍 GOOD - Backend functional")
elif success_rate >= 30:
    print("⚠️ PARTIAL - Some issues detected")
else:
    print("🚨 POOR - Check configuration")

# Show sample results
show_details = False  # Set to True to see milestone and failed test results
if show_details and results:
    print(f"\nSample Results ({len(results)} shown):")
    for result in results[-10:]:  # Show last 10 results
        print(f"  {result}")

print("✅ Validation completed!")

🔥 COMPREHENSIVE VALIDATION - Testing 10 FIM cases...
🔥 Simple test: 1 + 1 = 2
🔥 Available models: ['/data/starcoder2_7b_triple_Train_w8a8_optimal']
🔥 Variables found - proceeding with 10-case validation...
🔥 Loading from: /Users/shi.yu/Documents/tbricks_dataviewer/fim_singleline_dataset_deterministic.jsonl
🔥 File exists - loading test cases...
🔥 Loaded 10 test cases

🎯 STARTING 10-CASE FIM VALIDATION

🔸 Test Case 1/10
----------------------------------------
📝 Prompt length: 85 chars
🔍 Prompt preview: <fim_prefix>void SetResponseTime( market_t m, const tbricks::<fim_suffix><fim_mi...
🎯 Expected: Duration...
⚡ Response time: 2264.5ms
✨ Generated: Duration& t ) { m_response_times[m] = t; }Duration& t ) { m_...
🔥 ✅ EXACT MATCH FOUND!

🔸 Test Case 2/10
----------------------------------------
📝 Prompt length: 105 chars
🔍 Prompt preview: <fim_prefix>m_validationTable.InsertColumn(etf_static_data::strategy_parameters:...
🎯 Expected: ETF_StaticDataOpenValidationApp...
⚡ Response time: 2261.2m
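
In the first test case of the output above, the expected text `Duration` counts as a match because it appears somewhere in the generated text, even though the model then repeats itself. For single-line FIM it can be useful to distinguish a completion that starts with the expected text from one that merely contains it. The sketch below shows one such classification; `classify_match` and its three labels are hypothetical and are not what the validation cell uses.

```python
def classify_match(expected: str, generated: str) -> str:
    """Classify a completion as 'prefix', 'contains', or 'none'."""
    expected = expected.strip()
    generated = generated.strip()
    if not expected:
        return "none"  # nothing to compare against
    if generated.startswith(expected):
        return "prefix"  # completion begins with the expected text
    if expected in generated:
        return "contains"  # expected text only appears later in the completion
    return "none"


print(classify_match("Duration", "Duration& t ) { m_response_times[m] = t; }"))
print(classify_match("Duration", "const Duration& t"))
```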

In [1]:
# EMERGENCY TEST - New cell to check if kernel is working
print("🆘 EMERGENCY TEST - Can you see this output?")
print("🆘 If yes, the kernel is working but the previous cell was stuck")
print("🆘 Current time:", __import__('datetime').datetime.now())

# Test very basic functionality
import sys
print(f"🆘 Python version: {sys.version}")
print("🆘 Simple math test: 2 + 2 =", 2 + 2)

# Check if we can access previous variables
print("🆘 Globals check:")
available_models_exists = 'available_models' in globals()
print(f"🆘 available_models exists: {available_models_exists}")

if available_models_exists:
    print(f"🆘 available_models value: {available_models}")
    print("🆘 ✅ Previous cells' variables are accessible!")
else:
    print("🆘 ❌ Need to re-run previous cells after kernel restart")

print("🆘 END OF EMERGENCY TEST")

🆘 EMERGENCY TEST - Can you see this output?
🆘 If yes, the kernel is working but the previous cell was stuck
🆘 Current time: 2025-10-01 23:10:22.490316
🆘 Python version: 3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, 11:23:37) [Clang 14.0.6 ]
🆘 Simple math test: 2 + 2 = 4
🆘 Globals check:
🆘 available_models exists: False
🆘 ❌ Need to re-run previous cells after kernel restart
🆘 END OF EMERGENCY TEST
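
The check above shows that variables from earlier cells are lost after a kernel restart. A small guard at the top of dependent cells can report this with a clear error instead of failing later with a `NameError`. The sketch below is one way to write it; `require_names` is a hypothetical helper, and the demo uses a stand-in dictionary where a notebook cell would pass `globals()`.

```python
from typing import Iterable, Mapping


def require_names(namespace: Mapping[str, object], names: Iterable[str]) -> None:
    """Raise a clear error if any required name is missing or empty in `namespace`."""
    missing = [name for name in names if not namespace.get(name)]
    if missing:
        raise RuntimeError(
            "Re-run the setup cells first; missing or empty: " + ", ".join(missing)
        )


# Stand-in namespace; in a notebook cell one would pass globals() instead.
demo_namespace = {"available_models": ["example-model"]}
try:
    require_names(demo_namespace, ["available_models", "COMPLETIONS_ENDPOINT"])
except RuntimeError as err:
    print(err)
```

Raising an exception stops only the current cell, so the kernel and any variables that do exist stay available for inspection.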
