# 10. Evaluate Fine-tuned OpenVLA on LIBERO

**Goal**: Evaluate the fine-tuned OpenVLA model using simulation rollouts in LIBERO.

## Evaluation Strategy

For Vision-Language-Action (VLA) models, proper evaluation means:
- Running the policy in **simulation rollouts** (not just prediction loss)
- Measuring **task success rate** (did the robot complete the task?)
- Using **held-out tasks** for generalization testing
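
The difference between a rollout metric and a prediction loss can be sketched with a toy example. Everything here is hypothetical (`ToyReachEnv`, `greedy_policy`, `rollout`, `success_rate`); it only mirrors the `reset`/`step` loop used later in this notebook and is not part of LIBERO.

```python
import random


class ToyReachEnv:
    """Hypothetical 1-D stand-in for a simulator: succeed once pos >= goal."""

    def __init__(self, goal=5, seed=0):
        self.goal = goal
        self.rng = random.Random(seed)
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Actuation noise: roughly 20% of commanded moves are dropped.
        if self.rng.random() > 0.2:
            self.pos += action
        done = self.pos >= self.goal
        return self.pos, float(done), done, {"success": done}


def greedy_policy(obs):
    """Always move one unit toward the goal."""
    return 1


def rollout(env, policy, max_steps):
    """Run one closed-loop episode and report whether the task succeeded."""
    obs = env.reset()
    info = {"success": False}
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        if done:
            break
    return bool(info["success"])


def success_rate(env, policy, n_trials, max_steps):
    """Fraction of episodes that end in success."""
    wins = sum(rollout(env, policy, max_steps) for _ in range(n_trials))
    return wins / n_trials
```

The policy's per-step choice is always "correct", yet with `max_steps=4` it can never reach a goal of 5. Closed-loop effects like this are invisible to a prediction loss and only show up in rollouts.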

## LIBERO Train/Test Split

| Suite | Tasks | Purpose |
|-------|-------|--------|
| libero_spatial | 10 | Spatial reasoning |
| libero_object | 10 | Object manipulation |
| libero_goal | 10 | Goal-conditioned |
| libero_90 | 90 | Training set |
| libero_10 | 10 | **Held-out test set** |

---
## 1. Setup

In [None]:
# ============================================================
# CRITICAL: Set paths BEFORE importing packages
# ============================================================
import os
import sys

# Auto-detect environment (NERSC Perlmutter vs SciServer)
if os.environ.get('PSCRATCH'):
    SCRATCH = os.environ['PSCRATCH']  # NERSC Perlmutter
elif os.environ.get('SCRATCH'):
    SCRATCH = os.environ['SCRATCH']  # Generic scratch
else:
    SCRATCH = "/home/idies/workspace/Temporary/dpark1/scratch"  # SciServer default

CACHE_DIR = f"{SCRATCH}/.cache"
os.environ['XDG_CACHE_HOME'] = CACHE_DIR
os.environ['HF_HOME'] = f"{CACHE_DIR}/huggingface"
os.environ['TORCH_HOME'] = f"{CACHE_DIR}/torch"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TF warnings

# Add LIBERO to Python path (required for editable install)
LIBERO_PATH = f"{SCRATCH}/LIBERO"
if LIBERO_PATH not in sys.path:
    sys.path.insert(0, LIBERO_PATH)

# Import LIBERO EARLY (before transformers trust_remote_code can interfere)
from libero.libero import benchmark
from libero.libero.envs import OffScreenRenderEnv
print("LIBERO imported successfully!")

# Fine-tuned model checkpoint path
CHECKPOINT_PATH = f"{SCRATCH}/openvla_finetune/final"

print(f"Base directory: {SCRATCH}")
print(f"Checkpoint path: {CHECKPOINT_PATH}")

In [None]:
import torch
import numpy as np
from PIL import Image
from pathlib import Path
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import json
from datetime import datetime

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---
## 2. Load Fine-tuned Model

In [None]:
def load_finetuned_model(checkpoint_path, device="cuda:0"):
    """
    Load a fine-tuned OpenVLA model from a LoRA-adapter or full-model checkpoint.
    
    Args:
        checkpoint_path: Path to the fine-tuned checkpoint
        device: Device to load model on
    
    Returns:
        model, processor
    """
    from transformers import AutoModelForVision2Seq, AutoProcessor
    
    checkpoint_path = Path(checkpoint_path)
    print(f"Loading fine-tuned model from: {checkpoint_path}")
    
    # Debug: Show checkpoint contents
    if checkpoint_path.exists():
        print(f"\nCheckpoint contents:")
        for f in sorted(checkpoint_path.iterdir()):
            size = f.stat().st_size / 1024 / 1024  # MB
            print(f"  {f.name} ({size:.1f} MB)")
    else:
        raise FileNotFoundError(f"Checkpoint not found: {checkpoint_path}")
    
    # Detect checkpoint type
    has_adapter_config = (checkpoint_path / "adapter_config.json").exists()
    has_adapter_model = (checkpoint_path / "adapter_model.safetensors").exists() or \
                        (checkpoint_path / "adapter_model.bin").exists()
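    # Large full-model checkpoints are often saved in shards, so also check for the shard index files
    has_model_safetensors = (checkpoint_path / "model.safetensors").exists() or \
                            (checkpoint_path / "model.safetensors.index.json").exists()
    has_pytorch_model = (checkpoint_path / "pytorch_model.bin").exists() or \
                        (checkpoint_path / "pytorch_model.bin.index.json").exists()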
    
    print(f"\nCheckpoint type detection:")
    print(f"  adapter_config.json: {has_adapter_config}")
    print(f"  adapter_model.*: {has_adapter_model}")
    print(f"  model.safetensors: {has_model_safetensors}")
    print(f"  pytorch_model.bin: {has_pytorch_model}")
    
    # Determine loading strategy
    is_lora = has_adapter_config and has_adapter_model
    is_full_model = has_model_safetensors or has_pytorch_model
    
    if is_lora:
        from peft import PeftModel
        
        print("\nDetected LoRA checkpoint - loading base model + adapters...")
        
        # Load base model
        print("Loading base model...")
        base_model = AutoModelForVision2Seq.from_pretrained(
            "openvla/openvla-7b",
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            cache_dir=f"{CACHE_DIR}/huggingface",
            attn_implementation="eager",
        )
        
        # Load LoRA weights
        print("Loading LoRA adapters...")
        model = PeftModel.from_pretrained(base_model, str(checkpoint_path))
        
        # Merge for faster inference
        print("Merging LoRA weights...")
        model = model.merge_and_unload()
        
    elif is_full_model:
        print("\nDetected full model checkpoint - loading directly...")
        model = AutoModelForVision2Seq.from_pretrained(
            str(checkpoint_path),
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            attn_implementation="eager",
        )
    else:
        # Fallback: try loading as LoRA anyway (might be HF Trainer checkpoint)
        print("\nUnknown checkpoint format - attempting LoRA load...")
        print("If this fails, check that training completed and saved properly.")
        
        from peft import PeftModel
        
        base_model = AutoModelForVision2Seq.from_pretrained(
            "openvla/openvla-7b",
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            cache_dir=f"{CACHE_DIR}/huggingface",
            attn_implementation="eager",
        )
        
        try:
            model = PeftModel.from_pretrained(base_model, str(checkpoint_path))
            model = model.merge_and_unload()
        except Exception as e:
            print(f"\nError loading checkpoint: {e}")
            print("\nFalling back to base model (NOT fine-tuned!)...")
            print("WARNING: This will give zero-shot performance, not fine-tuned!")
            model = base_model
    
    model.to(device)
    model.eval()
    
    # Load processor
    processor = AutoProcessor.from_pretrained(
        "openvla/openvla-7b",
        trust_remote_code=True,
        cache_dir=f"{CACHE_DIR}/huggingface",
    )
    
    print(f"\nModel loaded on {device}")
    return model, processor
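
The checkpoint-type detection inside `load_finetuned_model` can be checked in isolation, without loading any weights. The sketch below uses a hypothetical helper named `detect_checkpoint_type`. It assumes the file names used by the loader above, plus the shard index files (`model.safetensors.index.json`, `pytorch_model.bin.index.json`) that Hugging Face writes for sharded checkpoints.

```python
from pathlib import Path

ADAPTER_WEIGHTS = {"adapter_model.safetensors", "adapter_model.bin"}
FULL_WEIGHTS = {
    "model.safetensors",
    "model.safetensors.index.json",  # sharded safetensors checkpoint
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",  # sharded .bin checkpoint
}


def detect_checkpoint_type(checkpoint_dir):
    """Return 'lora', 'full', or 'unknown' based on the files present."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return "unknown"
    names = {p.name for p in directory.iterdir()}
    # A LoRA checkpoint needs both the adapter config and the adapter weights.
    if "adapter_config.json" in names and names & ADAPTER_WEIGHTS:
        return "lora"
    if names & FULL_WEIGHTS:
        return "full"
    return "unknown"
```

Running this on `CHECKPOINT_PATH` before the slow model load shows which branch will be taken. An `'unknown'` result means the loader may end up on the base-model fallback path.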

In [None]:
# Check if checkpoint exists
checkpoint_exists = Path(CHECKPOINT_PATH).exists()
print(f"Checkpoint exists: {checkpoint_exists}")

if checkpoint_exists:
    print(f"\nCheckpoint contents:")
    for f in Path(CHECKPOINT_PATH).iterdir():
        print(f"  {f.name}")
else:
    print(f"\nCheckpoint not found at {CHECKPOINT_PATH}")
    print("Available checkpoints:")
    finetune_dir = Path(f"{SCRATCH}/openvla_finetune")
    if finetune_dir.exists():
        for f in finetune_dir.iterdir():
            print(f"  {f.name}")

In [None]:
# Load the fine-tuned model
# Change CHECKPOINT_PATH above if needed

DEVICE = "cuda:0"
model, processor = load_finetuned_model(CHECKPOINT_PATH, DEVICE)

---
## 3. Create Policy Wrapper

In [None]:
class OpenVLAPolicy:
    """
    Policy wrapper for OpenVLA inference in LIBERO.
    """
    
    def __init__(self, model, processor, device="cuda:0", unnorm_key="bridge_orig"):
        self.model = model
        self.processor = processor
        self.device = device
        self.unnorm_key = unnorm_key  # Selects which dataset's action statistics un-normalize the predictions
    
    def predict(self, obs, instruction):
        """
        Predict action from observation and instruction.
        
        Args:
            obs: Dictionary with image observation
            instruction: Natural language task instruction
        
        Returns:
            action: 7-DoF action array
        """
        # Get image from observation - LIBERO uses 'agentview_image' key
        if isinstance(obs, dict):
            # Try different possible keys for the image
            image = None
            for key in ['agentview_image', 'agentview_rgb', 'image', 'pixels']:
                if key in obs and obs[key] is not None:
                    image = obs[key]
                    break
            
            if image is None:
                # Debug: print available keys
                print(f"Warning: No image found in obs. Available keys: {list(obs.keys())}")
                return np.zeros(7)
        else:
            image = obs
        
        # Ensure image is a proper array
        if not isinstance(image, np.ndarray) or image.ndim < 2:
            print(f"Warning: Invalid image type={type(image)}, ndim={getattr(image, 'ndim', 'N/A')}")
            return np.zeros(7)
        
        # Rotate image (LIBERO convention)
        image = np.rot90(image, k=2)
        
        # Convert to PIL
        pil_image = Image.fromarray(image.astype(np.uint8))
        
        # Format prompt
        prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"
        
        # Process inputs
        inputs = self.processor(prompt, pil_image, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # Convert pixel_values to bfloat16
        if 'pixel_values' in inputs:
            inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
        
        # Predict action
        with torch.no_grad():
            action = self.model.predict_action(
                **inputs,
                unnorm_key=self.unnorm_key,
                do_sample=False,
            )
        
        # Post-process action
        action = np.array(action)
        
        # Invert gripper for LIBERO convention
        if len(action) >= 7:
            action[6] = -action[6]
        
        return action

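# Create the policy. NOTE: 'bridge_orig' selects the Bridge dataset's action statistics.
# If the checkpoint was fine-tuned with LIBERO-specific normalization statistics, pass that key instead.
policy = OpenVLAPolicy(model, processor, DEVICE, unnorm_key="bridge_orig")
print("Policy created with unnorm_key='bridge_orig'!")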
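
The gripper handling in `predict` only flips the sign of the last action dimension. That is sufficient only if the model already emits gripper commands in [-1, 1]. The sketch below assumes instead that the policy outputs the gripper value in [0, 1] (0 = close, 1 = open) and that the simulator expects [-1, 1] with -1 = open. `to_libero_gripper` is a hypothetical helper name, and whether the rescaling applies depends on how the fine-tuning data encoded the gripper.

```python
import numpy as np


def to_libero_gripper(action, binarize=True):
    """Rescale the last action dimension from [0, 1] to [-1, 1], then flip its sign.

    Assumption: the input gripper value uses 0 = close, 1 = open.
    """
    action = np.array(action, dtype=np.float64)  # copy; the caller's array is untouched
    action[..., -1] = 2.0 * action[..., -1] - 1.0  # [0, 1] -> [-1, 1]
    if binarize:
        action[..., -1] = np.sign(action[..., -1])  # snap to -1, 0, or +1
    action[..., -1] *= -1.0  # flip so that "open" becomes -1
    return action
```

If the rollouts show the arm reaching objects but never grasping them, comparing this mapping against the plain sign flip is a cheap first diagnostic.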

---
## 4. Setup LIBERO Evaluation

In [None]:
# LIBERO was imported at the top to avoid trust_remote_code interference
# benchmark and OffScreenRenderEnv are already available

def get_benchmark_instance(suite_name):
    """Get an instantiated benchmark object for a suite."""
    BenchmarkClass = benchmark.get_benchmark(suite_name)
    return BenchmarkClass()

def create_libero_env(task_id, benchmark_instance, image_size=256):
    """Create LIBERO environment for a task."""
    bddl_file = benchmark_instance.get_task_bddl_file_path(task_id)
    
    env_args = {
        "bddl_file_name": bddl_file,
        "camera_heights": image_size,
        "camera_widths": image_size,
    }
    
    env = OffScreenRenderEnv(**env_args)
    env.seed(0)
    return env

# List available suites
print("Available LIBERO suites:")
for suite in ["libero_spatial", "libero_object", "libero_goal", "libero_90", "libero_10"]:
    try:
        bench = get_benchmark_instance(suite)
        print(f"  {suite}: {bench.n_tasks} tasks")
    except Exception as e:
        print(f"  {suite}: Not available ({e})")

---
## 5. Evaluation Functions

In [None]:
def evaluate_single_episode(policy, env, instruction, max_steps=300, render=False):
    """
    Run a single evaluation episode.
    
    Returns:
        success: bool
        frames: list of frames (if render=True)
    """
    obs = env.reset()
    frames = []
    reward, info = 0.0, {}  # Defined up front so max_steps=0 cannot raise a NameError below
    
    for step in range(max_steps):
        # Get action from policy
        action = policy.predict(obs, instruction)
        
        # Clip action to valid range
        action = np.clip(action, -1, 1)
        
        # Step environment
        obs, reward, done, info = env.step(action)
        
        if render:
            # Get image for rendering - try multiple keys
            for key in ['agentview_image', 'agentview_rgb', 'image']:
                if key in obs and obs[key] is not None:
                    frames.append(obs[key].copy())
                    break
        
        if done:
            break
    
    success = bool(info.get('success', False) or reward > 0)  # Plain bool, safe to serialize
    return success, frames


def evaluate_task(policy, task_id, benchmark_instance, n_trials=10, max_steps=300):
    """
    Evaluate policy on a single task.
    
    Returns:
        dict with success_rate, successes, trials
    """
    task = benchmark_instance.get_task(task_id)
    instruction = task.language
    task_name = benchmark_instance.get_task_names()[task_id]
    
    env = create_libero_env(task_id, benchmark_instance)
    
    successes = 0
    try:
        for trial in range(n_trials):
            success, _ = evaluate_single_episode(policy, env, instruction, max_steps)
            if success:
                successes += 1
    finally:
        env.close()  # Release the simulator even if a rollout raises
    
    return {
        'task_id': task_id,
        'task_name': task_name,
        'instruction': instruction,
        'successes': successes,
        'trials': n_trials,
        'success_rate': successes / n_trials,
    }


def evaluate_suite(policy, suite_name, n_trials=10, max_tasks=None, max_steps=300):
    """
    Evaluate policy on an entire LIBERO suite.
    
    Returns:
        dict with per-task results and overall success rate
    """
    print(f"\nEvaluating on {suite_name}")
    print("=" * 60)
    
    bench = get_benchmark_instance(suite_name)
    n_tasks = bench.n_tasks if max_tasks is None else min(bench.n_tasks, max_tasks)
    
    print(f"Tasks: {n_tasks}")
    print(f"Trials per task: {n_trials}")
    print()
    
    results = []
    
    for task_id in tqdm(range(n_tasks), desc=f"Evaluating {suite_name}"):
        result = evaluate_task(policy, task_id, bench, n_trials=n_trials, max_steps=max_steps)
        results.append(result)
        
        print(f"  {result['task_name']}: {result['success_rate']*100:.1f}% ({result['successes']}/{result['trials']})")
    
    # Calculate overall success rate
    total_successes = sum(r['successes'] for r in results)
    total_trials = sum(r['trials'] for r in results)
    overall_success_rate = total_successes / total_trials
    
    return {
        'suite': suite_name,
        'tasks': results,
        'overall_success_rate': overall_success_rate,
        'total_successes': total_successes,
        'total_trials': total_trials,
    }
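
With only 10 trials per task, a single success rate carries wide uncertainty. One way to quantify it is a Wilson score interval on the binomial success count. This is a minimal sketch: the helper name `wilson_interval` is hypothetical, and the default `z=1.96` assumes an approximate 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (lower, upper) bounds on the true success probability.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `wilson_interval(8, 10)` spans roughly 49% to 94%, so per-task differences from 10 trials should be read cautiously. Applying it to `total_successes` and `total_trials` from `evaluate_suite` gives a much tighter interval for the suite-level rate.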

---
## 6. Run Evaluation

### 6.1 Quick Test (1 task, 1 trial)

In [None]:
# Quick test on a single task
SUITE = "libero_spatial"

bench = get_benchmark_instance(SUITE)
print(f"Testing on {SUITE}")
print(f"Task 0: {bench.get_task_names()[0]}")
print(f"Instruction: {bench.get_task(0).language}")

# Single trial
result = evaluate_task(policy, task_id=0, benchmark_instance=bench, n_trials=1)
print(f"\nResult: {'SUCCESS' if result['success_rate'] > 0 else 'FAILURE'}")

### 6.2 Visualize a Rollout

In [None]:
# Visualize a single rollout
SUITE = "libero_spatial"
TASK_ID = 0

bench = get_benchmark_instance(SUITE)
task = bench.get_task(TASK_ID)
instruction = task.language

env = create_libero_env(TASK_ID, bench)
success, frames = evaluate_single_episode(policy, env, instruction, max_steps=100, render=True)
env.close()

print(f"Task: {instruction}")
print(f"Success: {success}")
print(f"Frames captured: {len(frames)}")

# Show frames
if frames:
    fig, axes = plt.subplots(2, 5, figsize=(15, 6))
    step_indices = np.linspace(0, len(frames)-1, 10, dtype=int)
    
    for idx, ax in enumerate(axes.flat):
        frame_idx = step_indices[idx]
        # Rotate for display (LIBERO convention)
        ax.imshow(np.rot90(frames[frame_idx], k=2))
        ax.set_title(f"Step {frame_idx}")
        ax.axis('off')
    
    plt.suptitle(f"Task: {instruction}\nSuccess: {success}", fontsize=12)
    plt.tight_layout()
    plt.show()
else:
    print("No frames captured - check render=True and observation keys")

### 6.3 Full Suite Evaluation

In [None]:
# ============================================================
# CONFIGURATION - Adjust these settings
# ============================================================

EVAL_SUITE = "libero_spatial"  # Suite to evaluate on
N_TRIALS = 10                   # Trials per task (10 for full eval, 1 for quick test)
MAX_TASKS = None                # None for all tasks, or set a number for quick test
MAX_STEPS = 300                 # Max steps per episode

print(f"Evaluation Configuration:")
print(f"  Suite: {EVAL_SUITE}")
print(f"  Trials per task: {N_TRIALS}")
print(f"  Max tasks: {MAX_TASKS or 'All'}")
print(f"  Max steps: {MAX_STEPS}")

In [None]:
# Run full evaluation
results = evaluate_suite(
    policy,
    EVAL_SUITE,
    n_trials=N_TRIALS,
    max_tasks=MAX_TASKS,
    max_steps=MAX_STEPS,
)

In [None]:
# Print summary
print("\n" + "=" * 60)
print("EVALUATION SUMMARY")
print("=" * 60)
print(f"Suite: {results['suite']}")
print(f"Overall Success Rate: {results['overall_success_rate']*100:.1f}%")
print(f"Total: {results['total_successes']}/{results['total_trials']}")
print()

print("Per-task results:")
for task_result in results['tasks']:
    print(f"  {task_result['task_name']}: {task_result['success_rate']*100:.1f}%")

In [None]:
# Visualize results
task_names = [r['task_name'][:30] for r in results['tasks']]
success_rates = [r['success_rate'] * 100 for r in results['tasks']]

plt.figure(figsize=(12, 6))
bars = plt.barh(task_names, success_rates, color='steelblue')
plt.xlabel('Success Rate (%)')
plt.title(f"Fine-tuned OpenVLA on {EVAL_SUITE}\nOverall: {results['overall_success_rate']*100:.1f}%")
plt.xlim(0, 100)

# Add value labels
for bar, rate in zip(bars, success_rates):
    plt.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2, 
             f'{rate:.0f}%', va='center')

plt.tight_layout()
plt.show()

---
## 7. Save Results

In [None]:
# Save results to JSON
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = f"{SCRATCH}/openvla_finetune/eval_{EVAL_SUITE}_{timestamp}.json"

results['checkpoint'] = CHECKPOINT_PATH
results['timestamp'] = datetime.now().isoformat()
results['config'] = {
    'n_trials': N_TRIALS,
    'max_tasks': MAX_TASKS,
    'max_steps': MAX_STEPS,
}

Path(output_path).parent.mkdir(parents=True, exist_ok=True)  # Ensure the output directory exists
with open(output_path, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Results saved to: {output_path}")
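
Once several evaluation runs have been saved, they can be read back and compared. The sketch below assumes the JSON layout written above (keys `suite`, `checkpoint`, `overall_success_rate`) and the `eval_*.json` naming. `summarize_eval_files` is a hypothetical helper, not an existing utility.

```python
import json
from pathlib import Path


def summarize_eval_files(results_dir, pattern="eval_*.json"):
    """Collect one summary row per saved evaluation file, sorted by file name."""
    rows = []
    for path in sorted(Path(results_dir).glob(pattern)):
        with open(path) as f:
            data = json.load(f)
        rows.append({
            "file": path.name,
            "suite": data.get("suite"),
            "checkpoint": data.get("checkpoint"),
            "success_rate": data.get("overall_success_rate"),
        })
    return rows
```

Because the file names embed a timestamp, the sorted order is chronological within a suite. That makes it easy to see whether a newer checkpoint improved on an older one.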

---
## 8. Compare with Baseline

### Expected Results Reference

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal |
|-------|----------------|---------------|-------------|
| Zero-shot (no fine-tuning) | 0-10% | 0-10% | 0-10% |
| After fine-tuning | 70-80% | 75-85% | 65-75% |
| Paper reported | 84.7% | 88.4% | 79.2% |

In [None]:
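# Compare with the reference numbers from the table above
PAPER_REPORTED = {
    'libero_spatial': 84.7,
    'libero_object': 88.4,
    'libero_goal': 79.2,
}
ZERO_SHOT_APPROX = 5.0  # Midpoint of the 0-10% zero-shot range

our_result = results['overall_success_rate'] * 100
paper_result = PAPER_REPORTED.get(results['suite'])  # None if the suite is not in the table

print(f"\nComparison with Baselines ({results['suite']}):")
print("=" * 40)
print(f"Zero-shot OpenVLA:        ~{ZERO_SHOT_APPROX:.0f}%")
print(f"Our fine-tuned model:     {our_result:.1f}%")
if paper_result is not None:
    print(f"Paper reported:           {paper_result:.1f}%")
else:
    print("Paper reported:           n/a for this suite")
print()

if our_result > 50:
    print("Fine-tuning successful! Model shows significant improvement.")
elif our_result > 20:
    print("Some improvement, but may need more training epochs or data.")
else:
    print("Limited improvement. Check training configuration.")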

---
## 9. Evaluate on Held-Out Tasks (Optional)

To test generalization, evaluate on `libero_10` which contains tasks **not seen during training**.

In [None]:
# Evaluate on held-out tasks
# Only run this if you trained on libero_90

EVAL_HELD_OUT = False  # Set to True to evaluate on held-out tasks

if EVAL_HELD_OUT:
    held_out_results = evaluate_suite(
        policy,
        "libero_10",
        n_trials=10,
        max_tasks=None,
        max_steps=300,
    )
    
    print(f"\nHeld-out (libero_10) Success Rate: {held_out_results['overall_success_rate']*100:.1f}%")
else:
    print("Set EVAL_HELD_OUT = True to evaluate on held-out tasks")

---
## Summary

This notebook evaluated the fine-tuned OpenVLA model on LIBERO tasks.

**Key Takeaways:**
1. VLA models should be evaluated via **simulation rollouts**, not just loss
2. **Task success rate** is the primary metric
3. Use **held-out tasks** (libero_10) to test generalization
4. Expected fine-tuned performance: roughly **65-85%** success rate, depending on the suite (see the reference table in Section 8)

**Next Steps:**
- If success rate is low, try more training epochs
- Try evaluating on different suites (object, goal)
- Fine-tune on libero_90, evaluate on libero_10 for generalization