This repo is a minimal example of how to:
- Write an Eval Protocol `@evaluation_test` that operates on an `EvaluationRow`
- Convert it into an OpenAI RFT Python grader using `build_python_grader_from_evaluation_test`
- Validate and run that grader via the OpenAI `/fine_tuning/alpha/graders/*` HTTP APIs
The OpenAI RFT adapter lets you reuse Eval Protocol evaluation tests as Python graders for OpenAI Reinforcement Fine-Tuning (RFT). Because your grading logic lives in an Eval Protocol @evaluation_test, you can reuse the exact same code as an OpenAI Python grader—making it easy to start with OpenAI RFT and later move to other Eval Protocol–supported training workflows (or vice versa) without rewriting your evals.
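Concretely, the logic that gets reused is just the body of the eval. Here is a minimal sketch of `rapidfuzz_eval`'s core; the decorated, dataset-wired version lives in `example_rapidfuzz.py`, and the `eval_protocol` import path below is an assumption:

```python
# Sketch of the grading logic that example_rapidfuzz.py wraps in @evaluation_test.
# The import path below is an assumption; see example_rapidfuzz.py for the exact one.
from eval_protocol.models import EvaluateResult, EvaluationRow
from rapidfuzz import fuzz, utils


def rapidfuzz_eval(row: EvaluationRow, **kwargs) -> EvaluationRow:
    """Score the last assistant message against row.ground_truth with WRatio."""
    reference = row.ground_truth
    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    prediction = assistant_msgs[-1].content if assistant_msgs else ""
    score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)
    row.evaluation_result = EvaluateResult(score=score)
    return row
```

Once decorated with `@evaluation_test` and pointed at the demo dataset, this same function body is what ends up inside the generated OpenAI Python grader; you can see it echoed back in the `source` field of the validate response later in this README.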
The main files are:
- `example_rapidfuzz.py`: defines an `@evaluation_test` called `rapidfuzz_eval` that uses `rapidfuzz.WRatio` to score similarity between a model's answer and a reference answer.
- `test_openai_grader.py`: builds a Python grader from `rapidfuzz_eval`, validates it with `/graders/validate`, and runs it once with `/graders/run`.
Create and activate a virtual environment (optional but recommended):
```bash
python3 -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install eval-protocol rapidfuzz
```

Set your API keys:

```bash
export OPENAI_API_KEY="sk-..."
export FIREWORKS_API_KEY="fw_"
```

- Run the Eval Protocol test with pytest:

  ```bash
  pytest example_rapidfuzz.py -vs
  ```

- Validate and run the Python grader against OpenAI's `/graders/*` APIs:

  ```bash
  python test_openai_grader.py
  ```

You should see output similar to:

```
validate response: {
"grader": {
"type": "python",
"source": "def _ep_eval(row, **kwargs):\n \"\"\"\n Example @evaluation_test that scores a row using rapidfuzz.WRatio and\n attaches an EvaluateResult.\n \"\"\"\n reference = row.ground_truth\n assistant_msgs = [m for m in row.messages if m.role == 'assistant']\n last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ''\n prediction = last_assistant_content if isinstance(last_assistant_content, str) else ''\n from rapidfuzz import fuzz, utils\n score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)\n row.evaluation_result = EvaluateResult(score=score)\n return row\n\n\nfrom typing import Any, Dict\nfrom types import SimpleNamespace\n\n\nclass EvaluationRow(SimpleNamespace):\n \"\"\"Minimal duck-typed stand-in for an evaluation row.\n\n Extend this with whatever attributes your eval logic uses.\n \"\"\"\n pass\n\n\nclass EvaluateResult(SimpleNamespace):\n \"\"\"Simple stand-in for Eval Protocol's EvaluateResult.\n\n This lets evaluation-style functions that construct EvaluateResult(score=...)\n run inside the Python grader sandbox without importing eval_protocol.\n \"\"\"\n\n def __init__(self, score: float, **kwargs: Any) -> None:\n super().__init__(score=score, **kwargs)\n\n\nclass Message(SimpleNamespace):\n \"\"\"Duck-typed stand-in for eval_protocol.models.Message (role/content).\"\"\"\n pass\n\n\ndef _build_row(sample: Dict[str, Any], item: Dict[str, Any]) -> EvaluationRow:\n # Start from any item-provided messages (EP-style), defaulting to [].\n raw_messages = item.get(\"messages\") or []\n normalized_messages = []\n for m in raw_messages:\n if isinstance(m, dict):\n normalized_messages.append(\n Message(\n role=m.get(\"role\"),\n content=m.get(\"content\"),\n )\n )\n else:\n # Already Message-like; rely on duck typing (must have role/content)\n normalized_messages.append(m)\n\n reference = item.get(\"reference_answer\")\n prediction = sample.get(\"output_text\")\n\n # EP-style: ensure the model prediction is present as the last assistant message\n if prediction is not None:\n normalized_messages = list(normalized_messages) # shallow copy\n normalized_messages.append(Message(role=\"assistant\", content=prediction))\n\n return EvaluationRow(\n ground_truth=reference,\n messages=normalized_messages,\n item=item,\n sample=sample,\n )\n\n\ndef grade(sample: Dict[str, Any], item: Dict[str, Any]) -> float:\n row = _build_row(sample, item)\n result = _ep_eval(row=row)\n\n # Try to normalize different result shapes into a float score\n try:\n from collections.abc import Mapping\n\n if isinstance(result, (int, float)):\n return float(result)\n\n # EvaluateResult-like object with .score\n if hasattr(result, \"score\"):\n return float(result.score)\n\n # EvaluationRow-like object with .evaluation_result.score\n eval_res = getattr(result, \"evaluation_result\", None)\n if eval_res is not None:\n if isinstance(eval_res, Mapping):\n if \"score\" in eval_res:\n return float(eval_res[\"score\"])\n elif hasattr(eval_res, \"score\"):\n return float(eval_res.score)\n\n # Dict-like with score\n if isinstance(result, Mapping) and \"score\" in result:\n return float(result[\"score\"])\n except Exception:\n pass\n\n return 0.0\n",
"name": "grader-R5FhpA6BFQlo"
}
}
run response: {
"reward": 0.7555555555555555,
"metadata": {
"name": "grader-5XXSBZ9B1OJj",
"type": "python",
"errors": {
"formula_parse_error": false,
"sample_parse_error": false,
"sample_parse_error_details": null,
"truncated_observation_error": false,
"unresponsive_reward_error": false,
"invalid_variable_error": false,
"invalid_variable_error_details": null,
"other_error": false,
"python_grader_server_error": false,
"python_grader_server_error_type": null,
"python_grader_runtime_error": false,
"python_grader_runtime_error_details": null,
"model_grader_server_error": false,
"model_grader_refusal_error": false,
"model_grader_refusal_error_details": null,
"model_grader_parse_error": false,
"model_grader_parse_error_details": null,
"model_grader_exceeded_max_tokens_error": false,
"model_grader_server_error_details": null,
"endpoint_grader_internal_error": false,
"endpoint_grader_internal_error_details": null,
"endpoint_grader_server_error": false,
"endpoint_grader_server_error_details": null,
"endpoint_grader_safety_check_error": false
},
"execution_time": 6.831332206726074,
"scores": {},
"token_usage": null,
"sampled_model_name": null
},
"sub_rewards": {},
"model_grader_token_usage_per_model": {}
}
```

When you convert an `@evaluation_test` into an OpenAI Python grader, it must satisfy OpenAI's runtime limits: no network access and a fixed set of preinstalled packages (e.g. `numpy`, `pandas`, `rapidfuzz`). For more details, see the OpenAI graders documentation.
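The `reward` in the run response above is just the WRatio similarity (scaled to 0 to 1) between the demo `model_sample` and `reference_answer` that `test_openai_grader.py` sends, so you can sanity-check it locally; the exact value may drift slightly across rapidfuzz versions:

```python
from rapidfuzz import fuzz, utils

# Same comparison the generated grader performs for the demo payload.
score = fuzz.WRatio(
    "fuzzy wuzzy was a bear",    # model_sample
    "fuzzy wuzzy had no hair",   # item["reference_answer"]
    processor=utils.default_process,
) / 100.0
print(score)  # should roughly match the "reward" in the run response above
```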
- `example_rapidfuzz.py`:
  - Defines a tiny inline demo dataset (`DEMO_ROWS`) with a user/assistant conversation and a `ground_truth`.
  - Implements `rapidfuzz_eval(row: EvaluationRow, **kwargs)`, decorated with `@evaluation_test`, which:
    - Pulls the last assistant message from `row.messages`.
    - Compares it to `row.ground_truth` using `rapidfuzz.fuzz.WRatio`.
    - Writes the score into `row.evaluation_result = EvaluateResult(score=...)`.
- `test_openai_grader.py` (a sketch of this flow follows this list):
  - Imports `rapidfuzz_eval` and builds the grader with `build_python_grader_from_evaluation_test(rapidfuzz_eval)`.
  - Validates the grader via `POST https://api.openai.com/v1/fine_tuning/alpha/graders/validate`.
  - Runs the grader once via `POST https://api.openai.com/v1/fine_tuning/alpha/graders/run`.
  - Uses a simple payload with:
    - `item = {"reference_answer": "fuzzy wuzzy had no hair"}`
    - `model_sample = "fuzzy wuzzy was a bear"`
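A minimal sketch of that flow using `requests`; the adapter's import path (and whether the real script uses `requests` or an OpenAI client) are assumptions here, so treat `test_openai_grader.py` as the authoritative version:

```python
import os
import requests

from example_rapidfuzz import rapidfuzz_eval
# Import path for the adapter is assumed; see test_openai_grader.py for the real one.
from eval_protocol import build_python_grader_from_evaluation_test

# Convert the @evaluation_test into an OpenAI RFT Python grader payload.
grader = build_python_grader_from_evaluation_test(rapidfuzz_eval)

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
base = "https://api.openai.com/v1/fine_tuning/alpha/graders"

# Validate the grader definition without running it.
validate = requests.post(f"{base}/validate", headers=headers, json={"grader": grader})
print("validate response:", validate.json())

# Run the grader once against the demo sample/reference pair.
run = requests.post(
    f"{base}/run",
    headers=headers,
    json={
        "grader": grader,
        "model_sample": "fuzzy wuzzy was a bear",
        "item": {"reference_answer": "fuzzy wuzzy had no hair"},
    },
)
print("run response:", run.json())
```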
Once you have a working Python grader from your Eval Protocol `@evaluation_test`, you can:

- Swap out the example logic in `rapidfuzz_eval` for your own evaluation logic (e.g., exact match, rubric-based scoring, or more complex metrics); see the sketch after this list.
- Use the validated grader in an OpenAI Reinforcement Fine-Tuning job; see OpenAI's docs on preparing your dataset and creating a reinforcement fine-tuning job for full details.
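For example, an exact-match grader keeps the same row-in/row-out shape and only changes the scoring line. This is a sketch, assuming the same imports and `@evaluation_test` wiring as `example_rapidfuzz.py`:

```python
from eval_protocol.models import EvaluateResult, EvaluationRow  # path assumed, as above


def exact_match_eval(row: EvaluationRow, **kwargs) -> EvaluationRow:
    """1.0 if the last assistant message equals row.ground_truth exactly, else 0.0."""
    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    prediction = assistant_msgs[-1].content if assistant_msgs else ""
    score = 1.0 if str(prediction).strip() == str(row.ground_truth).strip() else 0.0
    row.evaluation_result = EvaluateResult(score=score)
    return row
```

Pass the decorated function to `build_python_grader_from_evaluation_test` and the validate/run calls stay exactly the same.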