# LM-Evaluation-Harness Quick Start Guide

This notebook shows how to evaluate language models with the lm-evaluation-harness (lm-eval) command-line interface.

## Prerequisites

Install lm-eval with the `api` extra, which adds support for models served behind an API:

In [None]:
# Install lm-eval with API support
!pip install "lm_eval[api]"

## 1. List Available Tasks

First, let's see what evaluation tasks are available:

In [None]:
# List all available tasks (showing first 20 lines)
!lm-eval ls tasks | head -20

### Search for Specific Tasks

You can search for specific tasks using grep:

In [None]:
# Search for MMLU tasks
!lm-eval ls tasks | grep mmlu | head -20

# Search for math-related tasks
!lm-eval ls tasks | grep -i math

# Search for Chinese language tasks
!lm-eval ls tasks | grep zho

## 2. Set MODEL_NAME and BASE_URL

Before running evaluations, set these to your API server:

- **BASE_URL**: The full endpoint URL of your API.
  - For **chat API**: `http://<host>:<port>/v1/chat/completions` (e.g. `http://localhost:8000/v1/chat/completions`).
  - For **completions API**: `http://<host>:<port>/v1/completions` (e.g. `http://localhost:8000/v1/completions`).
- **MODEL_NAME**: The model ID exposed by your server. Query the server (e.g. `GET http://localhost:8000/v1/models`) and use the `id` of the model you want to evaluate.

Optionally, run the cell below to list the models your server exposes. If your API is not on `localhost:8000`, change `BASE_URL_FOR_MODELS` in that cell. Then set `MODEL_NAME` and `BASE_URL` in the evaluation cells to match your server.

In [None]:
# Optional: list models from your API (adjust the base URL if your server is not on localhost:8000)
import urllib.request
import json
BASE_URL_FOR_MODELS = "http://localhost:8000/v1"  # without /chat/completions or /completions
try:
    with urllib.request.urlopen(f"{BASE_URL_FOR_MODELS}/models") as resp:
        data = json.load(resp)
    for m in data.get("data", []):
        print(m.get("id", m))
except Exception as e:
    print("Could not list models (is your API server running?):", e)
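The URL conventions in this section can be captured in a small helper. This is only an illustrative sketch: `endpoint_urls` is a hypothetical name, not part of lm-eval, and it assumes your server exposes `/v1/models`, `/v1/chat/completions`, and `/v1/completions` as described above.

```python
def endpoint_urls(host: str = "localhost", port: int = 8000, scheme: str = "http") -> dict:
    """Build the URLs used in this guide for an API server (hypothetical helper)."""
    root = f"{scheme}://{host}:{port}/v1"
    return {
        "models": f"{root}/models",            # GET: lists the model IDs to use as MODEL_NAME
        "chat": f"{root}/chat/completions",    # BASE_URL for local-chat-completions
        "completions": f"{root}/completions",  # BASE_URL for local-completions
    }


urls = endpoint_urls()
for name, url in urls.items():
    print(f"{name}: {url}")
```

Building all three URLs from one host and port keeps `BASE_URL_FOR_MODELS` and `BASE_URL` from drifting apart when you point the notebook at a different server.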

### Run Evaluation (chat API)

Set `MODEL_NAME` and `BASE_URL` below, then run the cell. Pass the **`--apply_chat_template`** flag so prompts are sent as chat messages; without it, the run can fail with a "messages as list[dict]" AssertionError.

In [None]:
%%bash
# Set these to your API server.
export MODEL_NAME="your-model-id"
export BASE_URL="http://localhost:8000/v1/chat/completions"

lm-eval --model local-chat-completions \
    --model_args model=$MODEL_NAME,base_url=$BASE_URL \
    --apply_chat_template \
    --tasks gsm8k \
    --batch_size 1 \
    --limit 10 \
    --output_path ./results \
    --log_samples
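If you prefer to launch the run from a Python cell, the same command can be assembled as an argument list. The sketch below uses only the flags shown in the bash cell above; `build_lm_eval_command` is a hypothetical helper name, and the defaults simply mirror that cell.

```python
import shlex


def build_lm_eval_command(model_name, base_url, tasks="gsm8k", limit=10, output_path="./results"):
    """Return the argument list for the chat-API run shown above (hypothetical helper)."""
    return [
        "lm-eval",
        "--model", "local-chat-completions",
        "--model_args", f"model={model_name},base_url={base_url}",
        "--apply_chat_template",
        "--tasks", tasks,
        "--batch_size", "1",
        "--limit", str(limit),
        "--output_path", output_path,
        "--log_samples",
    ]


cmd = build_lm_eval_command("your-model-id", "http://localhost:8000/v1/chat/completions")
print(shlex.join(cmd))  # shell-quoted form, useful for checking the command before running it
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems if the model ID or URL contains special characters.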

## 3. View Results

After evaluation completes, view the results:

In [None]:
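import glob
import json
import os

# Depending on the lm-eval version, results may be written to
# <output_path>/<model_name>/results_<timestamp>.json rather than a fixed file name,
# so search the output directory recursively.
result_files = glob.glob('./results/**/results*.json', recursive=True)

if not result_files:
    print("No results yet. Run 'Run Evaluation (chat API)' above first.")
else:
    latest = max(result_files, key=os.path.getmtime)  # most recently written results file
    with open(latest, 'r') as f:
        results = json.load(f)
    print(f"=== Evaluation Results ({latest}) ===")
    print(json.dumps(results, indent=2))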
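The full JSON dump can be long, so a compact per-task summary is often easier to read. The sketch below assumes the loaded file has a top-level `"results"` key that maps each task name to a dict of metrics; that layout is an assumption about the lm-eval output, `summarize` is a hypothetical helper, and the numbers in `example` are made-up placeholders, not real scores.

```python
def summarize(results: dict) -> list:
    """Flatten {"results": {task: {metric: value}}} into (task, metric, value) rows."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Keep numeric scores only; skip labels such as "alias" and non-numeric placeholders.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, metric, value))
    return rows


# Placeholder data for illustration only: the layout is assumed and the numbers are made up.
example = {
    "results": {
        "gsm8k": {
            "alias": "gsm8k",
            "exact_match,strict-match": 0.5,
            "exact_match_stderr,strict-match": 0.1,
        }
    }
}

for task, metric, value in summarize(example):
    print(f"{task:<10} {metric:<35} {value:.4f}")
```

To use it on a real run, pass the dict loaded from your results file instead of `example`.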

## 4. Example: local-completions with tokenizer (ARC)

Tasks like ARC and MMLU are scored by **loglikelihood**, which requires the **completions API** and a **tokenizer**. This demo uses **arc_easy** (a single, small task) with `--limit 5` so it won't overload smaller models; for full MMLU, use the same command with `--tasks mmlu` (see the full documentation). Set `MODEL_NAME`, `BASE_URL` (ending in `/v1/completions`), and `TOKENIZER_PATH`, then run the cell.

In [None]:
%%bash
# Completions API: BASE_URL must end in /v1/completions.
# Tokenizer: required for loglikelihood; replace with your model's tokenizer (Hugging Face name or path).
export MODEL_NAME="your-model-id"
export BASE_URL="http://localhost:8000/v1/completions"
export TOKENIZER_PATH="Qwen/Qwen2.5-7B"   # replace according to your actual model

lm-eval --model local-completions \
    --model_args model=$MODEL_NAME,base_url=$BASE_URL,tokenizer=$TOKENIZER_PATH \
    --tasks arc_easy \
    --batch_size 1 \
    --limit 5 \
    --output_path ./results_arc
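The choice between the two setups in this guide comes down to whether the task needs loglikelihood scoring. The sketch below encodes that rule; `model_settings` and its parameters are hypothetical names introduced for illustration, and the returned values simply mirror the two bash cells in this notebook.

```python
def model_settings(needs_loglikelihood, model_name, server_root, tokenizer=None):
    """Pick the lm-eval model type and arguments for a task (hypothetical helper)."""
    if needs_loglikelihood:
        # Loglikelihood tasks (e.g. arc_easy, mmlu) use the completions API plus a tokenizer.
        if not tokenizer:
            raise ValueError("loglikelihood tasks require a tokenizer")
        return {
            "model": "local-completions",
            "model_args": f"model={model_name},base_url={server_root}/v1/completions,tokenizer={tokenizer}",
            "extra_flags": [],
        }
    # Otherwise use the chat endpoint and apply the chat template, as in the gsm8k run.
    return {
        "model": "local-chat-completions",
        "model_args": f"model={model_name},base_url={server_root}/v1/chat/completions",
        "extra_flags": ["--apply_chat_template"],
    }


settings = model_settings(True, "your-model-id", "http://localhost:8000", tokenizer="Qwen/Qwen2.5-7B")
print(settings["model"], settings["model_args"])
```

Raising an error when the tokenizer is missing surfaces the problem before any request is sent to the server.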

## 5. Learn more

For configuration files, running multiple tasks, MMLU with the completions API, and caching, see the full documentation.

## Tips

- Use `--batch_size 1` for API evaluation.
- Use the model name returned by `GET /v1/models` and the full `base_url` (e.g. `http://localhost:8000/v1/chat/completions`).
- For more tasks, configuration files, MMLU with the completions API, and caching, see the full documentation (*How to Evaluate LLM*) and the [lm-eval docs](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs).