# Edge Reasoning Evaluation Framework
This notebook provides an interactive workflow for reproducing the evaluation results in our paper. It supports both **server** and **Tegra** platforms.

### Quick Navigation
1. [Environment Setup](#setup)
2. [Server Evaluations](#server)
3. [Tegra Evaluations](#tegra)
4. [Results Processing](#results)
5. [Analytical Models](#analytical)
6. [Validation & Summary](#validation)


## 1. Environment Setup {#setup}

Determine the platform and set up the appropriate environment.


In [19]:
# Platform detection using our make infrastructure
platform = !make platform
platform = platform[0].split(":")[-1].strip()
print("\n* device information:")
!make info



* device information:




/home/modfi/models/edgereasoning/.venv/bin/python setup.py --info-only
Environment Setup
Auto-detected platform: server

Device Information:
  platform: server
  python_version: 3.12.3
  system: Linux
  machine: x86_64
  processor: x86_64
  in_container: False
  gpus:
    GPU 0: NVIDIA RTX A6000 (49140 MiB)
    GPU 1: NVIDIA RTX A6000 (49140 MiB)
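
The detection cell takes the first line of the `make platform` output and keeps the text after the last colon. A slightly more defensive variant could scan for the platform line instead of assuming it comes first. The sketch below assumes the output contains a line of the form `<label> platform: <name>` (as in `Auto-detected platform: server`); `parse_platform` is a hypothetical helper and not part of the repository.

```python
def parse_platform(lines):
    """Return the platform name from `make platform`-style output lines.

    Assumption: the relevant line looks like '<label> platform: <name>'.
    Lines that do not mention 'platform' (for example unrelated warnings)
    are skipped.
    """
    for line in lines:
        if "platform" in line.lower() and ":" in line:
            return line.split(":")[-1].strip()
    raise ValueError("no platform line found in output")


# Illustrative input only; the first line stands in for unrelated noise.
sample = [
    "some unrelated warning: ignored",
    "Auto-detected platform: server",
]
print(parse_platform(sample))
```

In the notebook this could be applied as `platform = parse_platform(platform)` right after `platform = !make platform`.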


## 2. Server Evaluations {#server}

For server platforms, we can run MMLU and Planner evaluations with different modes:

### Available modes:
- **base**: Full reasoning evaluation (4096 tokens)
- **budget**: Budget evaluation (configurable tokens)
- **noreasoning**: Direct answer selection
- **scale**: Parameter scaling experiments
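
As a rough illustration of how a mode maps onto a runner invocation, the sketch below builds the argument list for `run.sh`. Only the `budget` invocation appears in this notebook, so the assumption that the other modes use the same `./run.sh <mode> --model <name>` form is exactly that, an assumption; `build_mmlu_command` is a hypothetical helper.

```python
import shlex

# Mode names taken from the list above.
SERVER_MODES = ("base", "budget", "noreasoning", "scale")


def build_mmlu_command(mode, model):
    """Build the argv for eval/server/mmlu/run.sh.

    Assumption: every mode is invoked as `./run.sh <mode> --model <model>`,
    which this notebook only demonstrates for `budget`.
    """
    if mode not in SERVER_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {SERVER_MODES}")
    return ["./run.sh", mode, "--model", model]


cmd = build_mmlu_command("budget", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print(shlex.join(cmd))
```

Validating the mode up front gives an immediate error for a typo instead of a failure after the model has started loading.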


In [24]:
# Server evaluation options
if platform == "server":
    print("  SERVER EVALUATION OPTIONS")
    print("\n Run evaluations using make commands:")

    # List the available server targets:
    # !make help | grep -A10 "Server Evaluations"

    # Run the MMLU or Planner evaluations through make:
    # !make server-mmlu
    # !make planner

    # Or call the MMLU runner directly with a mode and a model:
    !cd eval/server/mmlu && ./run.sh budget --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

else:
    print("  Server evaluations not available on this platform")
    print("   Detected platform:", platform)


  SERVER EVALUATION OPTIONS

 Run evaluations using make commands:




Starting MMLU Evaluation - Mode: budget
Time: Fri Aug 22 11:43:05 PM PDT 2025
Working directory: /home/modfi/models/edgereasoning/eval/server/mmlu

* Running budget evaluation...
INFO 08-22 23:43:08 [__init__.py:241] Automatically detected platform cuda.
Starting Budget MMLU Evaluation - ALL SUBJECTS
Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Config: configs/budget.yaml
Output base: /home/modfi/models/edgereasoning/data/mmlu/server/DeepSeek-R1-Qwen-1.5B/mmlu_20250822_234309_budget_DeepSeek-R1-Distill-Qwen-1.5B
Timestamp: 20250822_234309

* Setting up model...
Setting up model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Loading tokenizer: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Loading VLLM model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
INFO 08-22 23:43:11 [utils.py:326] non-default args: {'model': 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 1024, 'gpu_memory_utilization': 0.6, 'show_hidden_metrics_for_version': 
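
The `Output base` line in the log suggests that each run directory is named `mmlu_<timestamp>_<mode>_<model>`. The sketch below parses such a name back into its parts, which can help when collecting runs for post-processing. The pattern is inferred from that single log line and may not hold for every mode; `parse_run_dir` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed layout: mmlu_<YYYYMMDD_HHMMSS>_<mode>_<model name>
RUN_DIR_PATTERN = re.compile(
    r"^mmlu_(?P<timestamp>\d{8}_\d{6})_(?P<mode>[a-z]+)_(?P<model>.+)$"
)


def parse_run_dir(name):
    """Split a run directory name into timestamp, mode, and model."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized run directory name: {name}")
    return {
        "timestamp": datetime.strptime(match["timestamp"], "%Y%m%d_%H%M%S"),
        "mode": match["mode"],
        "model": match["model"],
    }


info = parse_run_dir("mmlu_20250822_234309_budget_DeepSeek-R1-Distill-Qwen-1.5B")
print(info["mode"], info["model"], info["timestamp"].isoformat())
```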

In [None]:
platform = !make platform
platform = platform[0].split(":")[-1].strip()
print(platform)

server


## 3. Tegra Evaluations {#tegra}

For Tegra platforms, we can run synthetic benchmarks and MMLU evaluations:

### Available evaluations:
- **base**: Base MMLU evaluation
- **budget**: Budget evaluation
- **scaling**: Test-time scaling evaluation  
- **prefill**: Prefill synthetic benchmarks
- **decode**: Decode synthetic benchmarks
- **synthetic**: All synthetic benchmarks
- **all**: All evaluations


In [None]:
# Tegra evaluation options
if platform == "tegra":
    print("TEGRA EVALUATION OPTIONS")

    print("\nAvailable Tegra targets:")
    # !make help | grep -A15 "Tegra/Jetson Evaluation"

    # Example targets:
    # !make prefill
    # !make tegra-base

    # !cd eval/tegra && ./open.sh 1

    print("\nResults will be saved to: data/mmlu/tegra/")

else:
    print("WARNING: Tegra evaluations not available on this platform")
    print("   Detected platform:", platform)


## 4. Results Processing {#results}

After running evaluations, process the results to generate figures and coefficients for the analytical models:


In [None]:
from pathlib import Path

print("POST-PROCESSING WORKFLOW")

# Step 1: Process benchmark results
# !python postprocess.py --results --sub-config prefill
# !python postprocess.py --results --sub-config decode

# Step 2: Generate analytical models
# (%cd persists for later commands; a bare !cd only affects its own subshell.
#  Use `%cd -` afterwards to return to the previous directory.)
# %cd third_party/token2metrics
# !python -m prefilltokens.main
# !python -m prefillenergy.cli --all
# !python -m decodetokens.main
# !python -m decodeenergy.cli --all

# Step 3: Expected outputs
# - Figures 1-5 in outputs/
# - Fit coefficients for config/analytic.yaml
# - Model parameters in validation/

# Check if results directory exists
results_base = Path("data")
if results_base.exists():
    print(f"\nFound results directory: {results_base}")
    subdirs = [d for d in results_base.iterdir() if d.is_dir()]
    if subdirs:
        print("   Available result sets:")
        for subdir in subdirs[:5]:  
            print(f"   └── {subdir.name}/")
        if len(subdirs) > 5:
            print(f"   └── ... and {len(subdirs)-5} more")
    else:
        print("   (Empty - run evaluations first)")
else:
    print("\nWARNING: No results directory found")
    print("   Run evaluations first to generate data")


## 5. Analytical Models {#analytical}

Test the analytical latency and energy prediction models with different input/output token combinations:


In [None]:
!python energy_model.py --help

In [None]:
import subprocess

print("ANALYTICAL MODEL TESTING")

# (input_tokens, output_tokens) pairs to evaluate
test_cases = [
    (128, 128),
    (256, 64),
    (512, 256),
    (1024, 512),
]

# Run the energy model for each pair. The latency model takes the same flags:
# python latency_model.py -i {input_tokens} -o {output_tokens}
for input_tokens, output_tokens in test_cases:
    subprocess.run([
        "python", "energy_model.py",
        "-i", str(input_tokens),
        "-o", str(output_tokens),
    ])

# To test a custom pair, modify these values and run the commands below:
input_tokens = 384
output_tokens = 256

# !python latency_model.py -i {input_tokens} -o {output_tokens}
# !python energy_model.py -i {input_tokens} -o {output_tokens}


ANALYTICAL MODEL TESTING
python energy_model.py -i 128 -o 128
python energy_model.py -i 256 -o 64
python energy_model.py -i 512 -o 256
python energy_model.py -i 1024 -o 512
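
To exercise both the latency and the energy model over the same token pairs, the command lists can be generated up front. The sketch below only builds the argument lists and prints them; it assumes both scripts accept the `-i`/`-o` flags shown in this notebook, and `build_model_commands` is a hypothetical helper.

```python
import itertools


def build_model_commands(test_cases, scripts=("latency_model.py", "energy_model.py")):
    """Return one argv list per (token pair, script) combination.

    Assumption: each script accepts `-i <input_tokens> -o <output_tokens>`.
    """
    return [
        ["python", script, "-i", str(n_in), "-o", str(n_out)]
        for (n_in, n_out), script in itertools.product(test_cases, scripts)
    ]


commands = build_model_commands([(128, 128), (256, 64), (512, 256), (1024, 512)])
for argv in commands:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.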


## 6. Validation & Summary {#validation}

Validate the analytical models and review the complete workflow:


In [2]:
!pwd
!cd third_party/token2metrics/prefillenergy/ && ./run.sh

/home/modfi/models/edgereasoning
Processing prefill results from Tegra...
Input dir: ../../../data/synthetic/gpu/prefill
Output dir: ../../../data/synthetic/gpu/prefill/processed
🔋 Energy Analysis Pipeline
Collecting energy files...
No energy files found in the specified directory
❌ No energy data found to analyze
🔄 Energy-Performance Correlation Analysis
❌ Error: Performance file not found: ../../../data/synthetic/gpu/prefill/processed/all_results_by_model_*.xlsx
🔍 Power Insights Analysis
🔍 Auto-detected correlation file: /home/modfi/models/edgereasoning/third_party/token2metrics/prefillenergy/energy/../../decodenergy/output/energy_performance_correlation.xlsx
Correlation file: /home/modfi/models/edgereasoning/third_party/token2metrics/prefillenergy/energy/../../decodenergy/output/energy_performance_correlation.xlsx
Auto-detected correlation file: /home/modfi/models/edgereasoning/third_party/token2metrics/prefillenergy/energy/../../decodenergy/output/energy_performance_correlation.xls

In [None]:
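# Call the run.sh script for each energy module.
# Each `!` line runs in its own subshell, so the cd and the script must share one line.

!cd third_party/token2metrics/prefillenergy/ && ./run.sh
!cd third_party/token2metrics/decodeenergy/ && ./run.sh


# Or run individual scripts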

# !cd third_party/token2metrics/prefillenergy/ && python generate_lookup_table.py
# !cd third_party/token2metrics/decodeenergy/ && python generate_lookup_table.py
# !cd third_party/token2metrics/decodeenergy/ && python empirical.py
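
Each `!` line in a notebook runs in its own subshell, so a directory change made by `!cd` does not carry over to the next `!` line. One alternative is to run the script from Python and pass the working directory explicitly. The sketch below shows that approach with `subprocess.run`; `run_in_dir` is a hypothetical helper, and the demonstration call only prints the working directory rather than invoking any repository script.

```python
import subprocess
import sys
from pathlib import Path


def run_in_dir(command, directory):
    """Run `command` with `directory` as its working directory.

    Returns the CompletedProcess so the caller can inspect the return code
    and the captured output.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"directory not found: {directory}")
    return subprocess.run(
        command, cwd=directory, check=False, capture_output=True, text=True
    )


# Demonstration: report the child process's working directory.
result = run_in_dir([sys.executable, "-c", "import os; print(os.getcwd())"], ".")
print(result.stdout.strip())
```

In this notebook the equivalent call would be `run_in_dir(["./run.sh"], "third_party/token2metrics/prefillenergy")`, assuming the script is executable.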


In [None]:

print(f"\nCurrent Platform: {platform.upper()}")
print("Repository Structure:")
print("   eval/server/     - Server MMLU & Planner evaluations")
print("   eval/tegra/      - Tegra containerized evaluations")
print("   data/            - All results organized by platform")
print("   third_party/     - Post-processing and model fitting")
