PeachTree is the recursive learning-tree dataset engine for CyberViser / 0AI projects.
It is designed to become a shared dependency for Hancock, PeachFuzz/CactusFuzz, and future 0AI model-training pipelines.
PeachTree turns repositories, docs, tests, fuzz reports, issue notes, and architecture plans into traceable, safe, deduplicated JSONL datasets for model training.
flowchart TD
A[Training Goal] --> B[Recursive Learning Tree]
B --> C[Source Collection]
C --> D[Safety + License Gate]
D --> E[Dataset Builder]
E --> F[JSONL Training Dataset]
E --> G[Manifest + Provenance]
G --> H[Gap Analysis]
H --> B
PeachTree does not blindly scrape GitHub.
- local/owned repository ingestion is enabled first
- public GitHub collection is disabled by default
- public collection requires explicit opt-in, license allowlists, rate limits, and provenance
- secret/token/private-key patterns are blocked
- provenance metadata is attached to every record
- generated datasets are ignored by default until reviewed
python3 -m venv ~/venvs/peachtree
source ~/venvs/peachtree/bin/activate
python -m pip install -e ".[dev]"
pytest -q
peachtree policy
peachtree plan --goal "Build PeachFuzz training data" --project peachfuzz
peachtree ingest-local --repo . --repo-name peachtree --output data/raw/peachtree.jsonl
peachtree build --source data/raw/peachtree.jsonl --dataset data/datasets/peachtree.jsonl --manifest data/manifests/peachtree.json --domain peachtree
peachtree audit --dataset data/datasets/peachtree.jsonlcd ~
unzip PeachTree-v0.1.0.zip
cd PeachTree-v0.1.0
git init
git branch -M main
git add .
git commit -m "feat: initial PeachTree recursive dataset engine"
gh repo create 0ai-Cyberviser/PeachTree --public --source=. --remote=origin --pushpeachtree ingest-local --repo ~/peachfuzz --repo-name peachfuzz --output data/raw/peachfuzz.jsonl
peachtree build --source data/raw/peachfuzz.jsonl --dataset data/datasets/peachfuzz-instruct.jsonl --manifest data/manifests/peachfuzz.json --domain peachfuzzpeachtree ingest-local --repo ~/Hancock --repo-name hancock --output data/raw/hancock.jsonl
peachtree build --source data/raw/hancock.jsonl --dataset data/datasets/hancock-instruct.jsonl --manifest data/manifests/hancock.json --domain hancock- v0.1.0: local recursive dataset engine
- v0.2.0: safe GitHub connector for owned repos
- v0.3.0: dependency graph across Hancock, PeachFuzz, PeachTree
- v0.4.0: model exporter profiles for ChatML, Alpaca, ShareGPT
- v0.5.0: CI scheduled dataset update PRs
PeachTree v0.2.x adds a review-first owned GitHub connector.
peachtree github-owned --owner 0ai-Cyberviser --limit 25 --output data/manifests/owned.jsonl
peachtree github-plan --inventory data/manifests/owned.jsonl
bash scripts/clone_owned_repos.sh
bash scripts/build_owned_datasets.shThe connector inventories access-authorized repositories and generates reviewable scripts. Public GitHub-wide collection remains disabled by default.
PeachTree v0.3.0 adds local-only graph and lineage reports.
peachtree graph --inventory data/manifests/owned.jsonl --format mermaid --output reports/ecosystem-graph.mmd
peachtree lineage --dataset data/datasets/peachfuzz-instruct.jsonl --format markdown --output reports/peachfuzz-lineage.md
peachtree ecosystem --inventory data/manifests/owned.jsonl --output reports/ecosystem.jsonThese commands read local inventory, datasets, and manifests. They do not contact GitHub or train models.
PeachTree v0.4.0 exports reviewed PeachTree datasets into ChatML, Alpaca, and ShareGPT JSONL.
peachtree export-formats
peachtree export --source data/datasets/peachfuzz-instruct.jsonl --format chatml --output data/exports/peachfuzz-chatml.jsonl
peachtree validate-export --format chatml --path data/exports/peachfuzz-chatml.jsonlExporters are local-only and preserve provenance metadata by default.
PeachTree v0.5.0 adds review-first scheduled update tooling.
peachtree update-plan --repo ~/peachfuzz --repo-name 0ai-Cyberviser/peachfuzz --output data/manifests/update-plan.json
peachtree diff --baseline data/baseline/old.jsonl --candidate data/datasets/new.jsonl --format markdown
peachtree review-report --plan data/manifests/update-plan.json --output reports/update-review.jsonThe included GitHub Actions workflow opens pull requests for dataset updates. It does not train models, upload datasets, or push directly to main.
PeachTree v0.6.0 adds quality scoring, deterministic deduplication, and training readiness checks.
peachtree score --dataset data/datasets/peachfuzz-instruct.jsonl --markdown-output reports/quality.md
peachtree dedup --source data/datasets/peachfuzz-instruct.jsonl --output data/datasets/peachfuzz-deduped.jsonl
peachtree readiness --dataset data/datasets/peachfuzz-deduped.jsonl --output reports/readiness.jsonThese commands are local-only and do not train models or upload datasets.
PeachTree v0.7.0 adds policy-pack evaluation, license/compliance gates, and model-card generation.
peachtree policy-pack --list
peachtree license-gate --dataset data/datasets/peachfuzz-deduped.jsonl --markdown-output reports/license-gate.md
peachtree model-card --dataset data/datasets/peachfuzz-deduped.jsonl --model-name PeachFuzz-Dataset-v1 --output reports/model-card.mdThese commands are local-only and generate review artifacts before downstream model training.
PeachTree v0.8.0 adds dataset registries, artifact signing metadata, SBOM/provenance manifests, and release bundle creation.
peachtree registry data/datasets reports --output reports/registry.json
peachtree sbom --registry reports/registry.json --output reports/sbom.json
peachtree bundle data/datasets/example.jsonl reports/model-card.md --output dist/example-release.zipThese commands are local-only and do not train models, upload datasets, or scrape public GitHub.
PeachTree v0.9.0 adds trainer handoff manifests, LoRA job cards, and dry-run training launch plans.
peachtree handoff --dataset data/exports/example-chatml.jsonl --model-name Example-Lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output reports/handoff.json
peachtree lora-card --dataset data/exports/example-chatml.jsonl --job-name example-lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output-dir outputs/example --output reports/lora-job-card.json
peachtree train-plan --job-card reports/lora-job-card.json --output reports/dry-run-training-plan.jsonThese commands are dry-run only and do not launch training.