This repo is a minimal example of how to:
- Write an Eval Protocol `@evaluation_test` that operates on an `EvaluationRow`
- Convert it into an OpenAI RFT Python grader using `build_python_grader_from_evaluation_test`
- Validate and run that grader via the OpenAI `/fine_tuning/alpha/graders/*` HTTP APIs
The OpenAI RFT adapter lets you reuse Eval Protocol evaluation tests as Python graders for OpenAI Reinforcement Fine-Tuning (RFT). Because your grading logic lives in an Eval Protocol @evaluation_test, you can reuse the exact same code as an OpenAI Python grader—making it easy to start with OpenAI RFT and later move to other Eval Protocol–supported training workflows (or vice versa) without rewriting your evals.
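Concretely, the logic that gets reused is just the body of the eval. Here is a minimal sketch of `rapidfuzz_eval`'s core; the decorated, dataset-wired version lives in `example_rapidfuzz.py`, and the `eval_protocol` import path below is an assumption:

```python
# Sketch of the grading logic that example_rapidfuzz.py wraps in @evaluation_test.
# The import path below is an assumption; see example_rapidfuzz.py for the exact one.
from eval_protocol.models import EvaluateResult, EvaluationRow
from rapidfuzz import fuzz, utils


def rapidfuzz_eval(row: EvaluationRow, **kwargs) -> EvaluationRow:
    """Score the last assistant message against row.ground_truth with WRatio."""
    reference = row.ground_truth
    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    prediction = assistant_msgs[-1].content if assistant_msgs else ""
    score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)
    row.evaluation_result = EvaluateResult(score=score)
    return row
```

Once decorated with `@evaluation_test` and pointed at the demo dataset, this same function body is what ends up inside the generated OpenAI Python grader; you can see it echoed back in the `source` field of the validate response later in this README.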
The main files are:
- `example_rapidfuzz.py`: defines an `@evaluation_test` called `rapidfuzz_eval` that uses `rapidfuzz.WRatio` to score similarity between a model's answer and a reference answer.
- `test_openai_grader.py`: builds a Python grader from `rapidfuzz_eval`, validates it with `/graders/validate`, and runs it once with `/graders/run`.
Create and activate a virtual environment (optional but recommended):
```bash
python3 -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install eval-protocol rapidfuzz
```

Set your API keys:

```bash
export OPENAI_API_KEY="sk-..."
export FIREWORKS_API_KEY="fw_"
```

- Run the Eval Protocol test with pytest:

  ```bash
  pytest example_rapidfuzz.py -vs
  ```

- Validate and run the Python grader against OpenAI's `/graders/*` APIs:

  ```bash
  python test_openai_grader.py
  ```

You should see output similar to:

```
validate response: {
"grader": {
"type": "python",
"source": "def _ep_eval(row, **kwargs):\n \"\"\"\n Example @evaluation_test that scores a row using rapidfuzz.WRatio and\n attaches an EvaluateResult.\n \"\"\"\n reference = row.ground_truth\n assistant_msgs = [m for m in row.messages if m.role == 'assistant']\n last_assistant_content = assistant_msgs[-1].content if assistant_msgs else ''\n prediction = last_assistant_content if isinstance(last_assistant_content, str) else ''\n from rapidfuzz import fuzz, utils\n score = float(fuzz.WRatio(str(prediction), str(reference), processor=utils.default_process) / 100.0)\n row.evaluation_result = EvaluateResult(score=score)\n return row\n\n\nfrom typing import Any, Dict\nfrom types import SimpleNamespace\n\n\nclass EvaluationRow(SimpleNamespace):\n \"\"\"Minimal duck-typed stand-in for an evaluation row.\n\n Extend this with whatever attributes your eval logic uses.\n \"\"\"\n pass\n\n\nclass EvaluateResult(SimpleNamespace):\n \"\"\"Simple stand-in for Eval Protocol's EvaluateResult.\n\n This lets evaluation-style functions that construct EvaluateResult(score=...)\n run inside the Python grader sandbox without importing eval_protocol.\n \"\"\"\n\n def __init__(self, score: float, **kwargs: Any) -> None:\n super().__init__(score=score, **kwargs)\n\n\nclass Message(SimpleNamespace):\n \"\"\"Duck-typed stand-in for eval_protocol.models.Message (role/content).\"\"\"\n pass\n\n\ndef _build_row(sample: Dict[str, Any], item: Dict[str, Any]) -> EvaluationRow:\n # Start from any item-provided messages (EP-style), defaulting to [].\n raw_messages = item.get(\"messages\") or []\n normalized_messages = []\n for m in raw_messages:\n if isinstance(m, dict):\n normalized_messages.append(\n Message(\n role=m.get(\"role\"),\n content=m.get(\"content\"),\n )\n )\n else:\n # Already Message-like; rely on duck typing (must have role/content)\n normalized_messages.append(m)\n\n reference = item.get(\"reference_answer\")\n prediction = sample.get(\"output_text\")\n\n # EP-style: ensure the model prediction is present as the last assistant message\n if prediction is not None:\n normalized_messages = list(normalized_messages) # shallow copy\n normalized_messages.append(Message(role=\"assistant\", content=prediction))\n\n return EvaluationRow(\n ground_truth=reference,\n messages=normalized_messages,\n item=item,\n sample=sample,\n )\n\n\ndef grade(sample: Dict[str, Any], item: Dict[str, Any]) -> float:\n row = _build_row(sample, item)\n result = _ep_eval(row=row)\n\n # Try to normalize different result shapes into a float score\n try:\n from collections.abc import Mapping\n\n if isinstance(result, (int, float)):\n return float(result)\n\n # EvaluateResult-like object with .score\n if hasattr(result, \"score\"):\n return float(result.score)\n\n # EvaluationRow-like object with .evaluation_result.score\n eval_res = getattr(result, \"evaluation_result\", None)\n if eval_res is not None:\n if isinstance(eval_res, Mapping):\n if \"score\" in eval_res:\n return float(eval_res[\"score\"])\n elif hasattr(eval_res, \"score\"):\n return float(eval_res.score)\n\n # Dict-like with score\n if isinstance(result, Mapping) and \"score\" in result:\n return float(result[\"score\"])\n except Exception:\n pass\n\n return 0.0\n",
"name": "grader-R5FhpA6BFQlo"
}
}
run response: {
"reward": 0.7555555555555555,
"metadata": {
"name": "grader-5XXSBZ9B1OJj",
"type": "python",
"errors": {
"formula_parse_error": false,
"sample_parse_error": false,
"sample_parse_error_details": null,
"truncated_observation_error": false,
"unresponsive_reward_error": false,
"invalid_variable_error": false,
"invalid_variable_error_details": null,
"other_error": false,
"python_grader_server_error": false,
"python_grader_server_error_type": null,
"python_grader_runtime_error": false,
"python_grader_runtime_error_details": null,
"model_grader_server_error": false,
"model_grader_refusal_error": false,
"model_grader_refusal_error_details": null,
"model_grader_parse_error": false,
"model_grader_parse_error_details": null,
"model_grader_exceeded_max_tokens_error": false,
"model_grader_server_error_details": null,
"endpoint_grader_internal_error": false,
"endpoint_grader_internal_error_details": null,
"endpoint_grader_server_error": false,
"endpoint_grader_server_error_details": null,
"endpoint_grader_safety_check_error": false
},
"execution_time": 6.831332206726074,
"scores": {},
"token_usage": null,
"sampled_model_name": null
},
"sub_rewards": {},
"model_grader_token_usage_per_model": {}
}
```

When you convert an `@evaluation_test` into an OpenAI Python grader, it must satisfy OpenAI's runtime limits: no network access and a fixed set of preinstalled packages (e.g. `numpy`, `pandas`, `rapidfuzz`). For more details, see the OpenAI graders documentation.
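The `reward` in the run response above is just the WRatio similarity (scaled to 0 to 1) between the demo `model_sample` and `reference_answer` that `test_openai_grader.py` sends, so you can sanity-check it locally; the exact value may drift slightly across rapidfuzz versions:

```python
from rapidfuzz import fuzz, utils

# Same comparison the generated grader performs for the demo payload.
score = fuzz.WRatio(
    "fuzzy wuzzy was a bear",    # model_sample
    "fuzzy wuzzy had no hair",   # item["reference_answer"]
    processor=utils.default_process,
) / 100.0
print(score)  # should roughly match the "reward" in the run response above
```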
- `example_rapidfuzz.py`:
  - Defines a tiny inline demo dataset (`DEMO_ROWS`) with a user/assistant conversation and a `ground_truth`.
  - Implements `rapidfuzz_eval(row: EvaluationRow, **kwargs)`, decorated with `@evaluation_test`, which:
    - Pulls the last assistant message from `row.messages`.
    - Compares it to `row.ground_truth` using `rapidfuzz.fuzz.WRatio`.
    - Writes the score into `row.evaluation_result = EvaluateResult(score=...)`.
- `test_openai_grader.py` (a sketch of this flow follows this list):
  - Imports `rapidfuzz_eval` and builds the grader with `build_python_grader_from_evaluation_test(rapidfuzz_eval)`.
  - Validates the grader via `POST https://api.openai.com/v1/fine_tuning/alpha/graders/validate`.
  - Runs the grader once via `POST https://api.openai.com/v1/fine_tuning/alpha/graders/run`.
  - Uses a simple payload with:
    - `item = {"reference_answer": "fuzzy wuzzy had no hair"}`
    - `model_sample = "fuzzy wuzzy was a bear"`
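A minimal sketch of that flow using `requests`; the adapter's import path (and whether the real script uses `requests` or an OpenAI client) are assumptions here, so treat `test_openai_grader.py` as the authoritative version:

```python
import os
import requests

from example_rapidfuzz import rapidfuzz_eval
# Import path for the adapter is assumed; see test_openai_grader.py for the real one.
from eval_protocol import build_python_grader_from_evaluation_test

# Convert the @evaluation_test into an OpenAI RFT Python grader payload.
grader = build_python_grader_from_evaluation_test(rapidfuzz_eval)

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
base = "https://api.openai.com/v1/fine_tuning/alpha/graders"

# Validate the grader definition without running it.
validate = requests.post(f"{base}/validate", headers=headers, json={"grader": grader})
print("validate response:", validate.json())

# Run the grader once against the demo sample/reference pair.
run = requests.post(
    f"{base}/run",
    headers=headers,
    json={
        "grader": grader,
        "model_sample": "fuzzy wuzzy was a bear",
        "item": {"reference_answer": "fuzzy wuzzy had no hair"},
    },
)
print("run response:", run.json())
```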
Once you have a working Python grader from your Eval Protocol `@evaluation_test`, you can:

- Swap out the example logic in `rapidfuzz_eval` for your own evaluation logic (e.g., exact match, rubric-based scoring, or more complex metrics); see the sketch after this list.
- Use the validated grader in an OpenAI Reinforcement Fine-Tuning job; see OpenAI's docs on preparing your dataset and creating a reinforcement fine-tuning job for full details.
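For example, an exact-match grader keeps the same row-in/row-out shape and only changes the scoring line. This is a sketch, assuming the same imports and `@evaluation_test` wiring as `example_rapidfuzz.py`:

```python
from eval_protocol.models import EvaluateResult, EvaluationRow  # path assumed, as above


def exact_match_eval(row: EvaluationRow, **kwargs) -> EvaluationRow:
    """1.0 if the last assistant message equals row.ground_truth exactly, else 0.0."""
    assistant_msgs = [m for m in row.messages if m.role == "assistant"]
    prediction = assistant_msgs[-1].content if assistant_msgs else ""
    score = 1.0 if str(prediction).strip() == str(row.ground_truth).strip() else 0.0
    row.evaluation_result = EvaluateResult(score=score)
    return row
```

Pass the decorated function to `build_python_grader_from_evaluation_test` and the validate/run calls stay exactly the same.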