Releases: dongkoony/DevOps-Incident-Triage-Model
release-2026.06-rag-preview
Channel
Preview
Summary
This release introduces the first RAG preview layer for the DevOps Incident Triage Model.
The project remains classifier-first. The existing Transformer-based incident classifier is kept intact, while this release adds a preview retrieval layer that can retrieve evidence from domain-specific runbook documents.
LLM-assisted remediation generation is not included yet. That will be introduced in a later incident-assist beta release.
Included
POST /retrieveFastAPI endpoint- Domain-aware runbook retrieval flow
- Local TF-IDF sparse retrieval index for preview use
- Evidence-based retrieval response schema
- Section-level citations from runbook documents
- Retrieval metadata including index type, embedding model placeholder, and latency
- Prometheus-style retrieval metrics:
ditri_retrieval_requests_totalditri_retrieval_latency_seconds
- RAG roadmap documentation
- RAG evaluation plan
- Runbook placeholder document structure
- Release evidence document for
release-2026.06-rag-preview
Validation
GitHub Actions on main: Success
Local validation:
uv run --extra dev --extra api ruff check .
uv run --extra dev --extra api pytest -qResult:
ruff check: passed
pytest: 47 passed, 10 skipped
FastAPI smoke validation:
GET /health -> 200
POST /retrieve -> 200
GET /metrics -> 200
Known Limitations
- This is a preview release, not a production RAG backend.
- Retrieval currently uses a local TF-IDF sparse index, not a production Vector DB.
/assistand LLM-generated remediation guidance are not implemented yet.- Runbooks are portfolio-grade placeholders, not real production incident records.
- The current dataset remains synthetic.
- Retrieval quality still needs formal evaluation with retrieval hit rate, groundedness, citation coverage, and hallucination checks.
Next Release Direction
The next planned release is:
release-2026.07-incident-assist-beta
Planned focus:
- Classifier + RAG integration
POST /assistAPI design and implementation- LLM-generated remediation guidance
- Root cause candidate generation
- Evidence citations from retrieved runbooks
release-2026.05-classifier-core
Channel
Stable
Summary
This release establishes release-2026.05-classifier-core as the stable classifier-core baseline for the DevOps Incident Triage Model.
The current release keeps the project classifier-first. RAG and LLM-assisted incident response are documented as future roadmap work, not implemented in this release.
Included
- Transformer-based DevOps incident classification baseline
- FastAPI inference service
- Batch prediction and async batch job support
- Evaluation and demo showcase workflow
- CI smoke checks for lint/test, showcase, and API health
- Product-style Release Train documentation
- Future RAG roadmap and runbook placeholder structure
- Release evidence document for classifier-core
Validation
- GitHub Actions on
main: Success lint-and-test: passedshowcase-smoke: passedapi-smoke: passed
Local validation:
uv run --extra dev ruff check .
uv run --extra dev --extra api pytest -qResult:
ruff: All checks passed
pytest: 40 passed, 10 skipped
Data And Model Notes
- The current starter dataset is synthetic.
- This release should be treated as a reproducible portfolio-grade classifier baseline.
- It does not claim real production incident generalization.
- Model schema and training parameters are not changed by this release.
Known Limitations
- Docker build validation was not completed locally because the Docker daemon was unavailable.
- RAG retrieval,
/retrieve,/assist, Vector DB integration, and LLM-generated remediation are planned future work. - Hugging Face publishing is intentionally deferred until the model card and publish artifact are release-train aligned.
Next Steps
- Update
docs/model_card.mdfrom legacy version metadata torelease-2026.05-classifier-core. - Confirm the intended Hugging Face model artifact.
- Publish to Hugging Face after the publish gate is met.
v0.3.0
What's Changed
- chore: back-merge main into develop (v0.2.0) by @dongkoony in #8
- feat(api): add /predict/batch endpoint with review summary by @dongkoony in #9
- feat(api): add request tracing and prometheus metrics by @dongkoony in #10
- feat(mlops): add model benchmark matrix automation by @dongkoony in #13
Full Changelog: v0.2.0...v0.3.0
v0.2.0
DevOps Incident Triage Model v0.2.0
Included
- Raw incident ingestion scaffold (
ditri-ingest-raw) - Confidence-threshold based human-review gating
- Evaluation threshold sweep output (
reports/threshold_metrics.json) - Package/API version bump to 0.2.0
Validation
UV_CACHE_DIR=/tmp/uv-cache uv run ruff check .UV_CACHE_DIR=/tmp/uv-cache uv run pytest -q