A suite of tools to collect web page data, train a model, and evaluate it
Ocelot is a small toolkit that enables the automated collection of webpage data for model training. The primary use of the codebase is to train the Ocelot summarisation model, although it can easily be extended to other specific tasks.
The codebase contains a small Playwright script and data API for collecting web page data. It also contains training code for supervised and preference fine-tuning on mixed text and image datasets. Results can be tested with the included LLM-as-a-judge evaluation module.
The data module enables the automated collection of a set of prompts to train a model via Leo in the Brave browser. The browser is pointed at a simple API that receives the page content, makes requests to LLMs, and then stores inputs and responses. This works with both web page text and images.
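The receive/forward/store loop can be pictured as a small request handler. This is an illustrative stub, not the real gateway under src/data/api/; the class names and the in-memory store are hypothetical, and the LLM call is stubbed out:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CollectedExample:
    """One stored training example: the page content sent in plus the LLM response."""
    page_content: str
    response: str

@dataclass
class CollectionStore:
    """Hypothetical in-memory store; the real API persists examples for training."""
    examples: list = field(default_factory=list)

    def handle_request(self, page_content: str, llm: Callable[[str], str]) -> str:
        # Forward the page content to an LLM and record the input/response pair.
        response = llm(page_content)
        self.examples.append(CollectedExample(page_content, response))
        return response

# Usage with a stubbed-out LLM call:
store = CollectionStore()
reply = store.handle_request("<html>page text</html>", llm=lambda text: "summary of page")
```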
Provides an LLM-as-a-judge framework for simultaneously comparing, scoring, and ranking three different responses to a given prompt. This can be used to validate the performance of a trained model.
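The single-call compare/score/rank shape can be sketched as below. The prompt wording and JSON verdict format are hypothetical, chosen for illustration; the real judge lives in src/evaluation/ and is documented there:

```python
import json

def build_judge_prompt(prompt: str, responses: list) -> str:
    """Hypothetical judge prompt: one LLM call scores all candidates at once."""
    numbered = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(responses))
    return (
        f"Prompt: {prompt}\n\nCandidate responses:\n{numbered}\n\n"
        'Reply with JSON like {"scores": [7, 4, 9]}, one 0-10 score per candidate.'
    )

def rank_responses(judge_reply: str, responses: list) -> list:
    """Parse the judge's JSON verdict and return the responses best-first."""
    scores = json.loads(judge_reply)["scores"]
    order = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    return [responses[i] for i in order]

# With a simulated judge verdict for three candidate summaries:
candidates = ["summary a", "summary b", "summary c"]
judge_prompt = build_judge_prompt("Summarise this page", candidates)
ranked = rank_responses('{"scores": [7, 4, 9]}', candidates)
```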
Provides a CLI to train an LLM on the collected prompts and responses via supervised fine-tuning (SFT) and preference fine-tuning.
The packages under src/ are usable on their own and are meant to compose in a simple pipeline: data → training → evaluation.
Setup, CLI usage, configuration, and extension points are documented per module:
| Module | README |
|---|---|
| Data collection & postprocessing | src/data/README.md |
| Data API (FastAPI / OpenAI-compatible gateway) | src/data/api/README.md |
| Training (LoRA, datasets, registry) | src/training/README.md |
| Evaluation (LLM-as-judge) | src/evaluation/README.md |
The summaries below are intentionally short; use the README links for further information.
Further information: src/data/README.md (pipeline, prerequisites, layout). The FastAPI “data API” used in the collection flow lives under src/data/api/ — see src/data/api/README.md for the gateway, config/vllm_config.yaml, and how requests become stored examples.
Further information: src/evaluation/README.md (install, input format, providers, programmatic use).
Further information: src/training/README.md (invoking training, expected dataset shape, registering new methods).
Both the data collection and evaluation can be run via Docker Compose. Note that a built version of the Brave browser is required to run the data collection; further instructions on how to do this can be found in brave-core.
It is recommended to run the training module on a suitable GPU device.
For component-specific setup, follow the Module READMEs table above (data, API, training, or evaluation).
If you wish to contribute, please raise any issues or PRs directly in this repository and we will endeavour to review them as quickly as possible.
Please ensure that any PR also has a linked issue explaining the rationale.
This code is made available under the Mozilla Public License 2.0. Please see the LICENSE for more information.
From the repository root (note that the dependencies for each module will also need to be installed):

```shell
python3 -m pip install -r requirements-dev.txt
python3 -m pytest
```

pyproject.toml sets pythonpath = ["src"] and testpaths = ["test"]. Tests mirror the packages under src/: test/data/, test/evaluation/, and test/training/.
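The pytest settings described above correspond to a pyproject.toml fragment along these lines (a sketch of the stated options only, not the full file):

```toml
[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["test"]
```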
pytest runs everything that is not skipped. That includes data helpers, evaluation/judge logic, training registry checks, and fast training tests such as Parquet roundtrips (test/training/test_prepare_data.py::test_chunk_to_arrow_roundtrip_through_parquet, needs pyarrow + numpy).
Training integration tests are marked integration and skip unless you opt in (below); they do not fail a normal CI run.
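The opt-in pattern looks roughly like this (an illustrative guard, not copied from the repo; the marker and environment variable names follow this README):

```python
import os

import pytest

# Illustrative opt-in guard: the test is collected but skips unless the
# environment variable is set, so a normal CI run stays green.
requires_e2e = pytest.mark.skipif(
    os.environ.get("OCELOT_TRAINING_E2E") != "1",
    reason="set OCELOT_TRAINING_E2E=1 to run training integration tests",
)

@requires_e2e
@pytest.mark.integration
def test_training_smoke():
    ...  # e.g. run one tiny SFT batch
```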
| Path | What it covers |
|---|---|
| test/data/ | Brave path resolution, postprocessing merges |
| test/evaluation/ | Judge config, prompts, LiteLLM judge wiring, CLI smoke |
| test/training/ | Method registry, prepare_data / Parquet helpers, optional GPU E2E |
Filter examples:

```shell
python3 -m pytest test/data/ -q
python3 -m pytest test/evaluation/ -q
python3 -m pytest test/training/ -q
python3 -m pytest -m integration  # only marked tests (still skip without OCELOT_TRAINING_E2E)
```

Install training dependencies for your machine: src/training/requirements-macos.txt on macOS, src/training/requirements-linux-cuda.txt on Linux with NVIDIA CUDA (src/training/requirements.txt documents the split). On Linux + CUDA, optional flash-attn is installed separately with pip install flash-attn --no-build-isolation (see src/training/README.md). You need CUDA or Apple MPS for these tests.
Set OCELOT_TRAINING_E2E=1 to enable the integration tests. Useful environment variables:
- `OCELOT_E2E_MODEL_NAME` — Hugging Face model id (default in tests is a small Qwen3-VL instruct checkpoint).
- `OCELOT_LOAD_IN_4BIT` — on Apple Silicon the tests force `0` (no bitsandbytes); on CUDA you can keep 4-bit if `bitsandbytes` is installed.
- `test/training/test_one_batch_integration.py` — full trainer smoke: one SFT stage, load the saved LoRA checkpoint, then one preference stage (IPO / CPO or DPO, parametrized) on the same tiny JSON dataset.
- `test/training/test_prepare_data.py` — `prepare_data.main()` writes Parquet; `test_prepare_data_parquet_sft_then_ipo_collator_forward_e2e` loads it via `PREPARED_DATA_DIR`, runs one SFT batch through the model, then a preference batch (`TokenizedDPOCollator`, layout shared by IPO and DPO) with chosen and rejected forwards.
Example:
```shell
export OCELOT_TRAINING_E2E=1
python3 -m pytest test/training/test_one_batch_integration.py -v
python3 -m pytest test/training/test_prepare_data.py::test_prepare_data_parquet_sft_then_ipo_collator_forward_e2e -v
```