A web-based incident reasoning platform for understanding system failures through event correlation and timeline reconstruction.
Live Demo: https://blackbox-frontend-hwy0.onrender.com
BLACKBOX is a production-ready incident analysis platform designed to help engineers understand system failures by reconstructing timelines from distributed events. Rather than focusing on detection or alerting, BLACKBOX provides post-event clarity by correlating events across services and presenting them in chronological sequence.
The platform operates on a core principle: sequence before interpretation. By forcing engineers to understand the order of events before jumping to conclusions, BLACKBOX reduces cognitive load during incident response and enables more accurate root cause analysis.
- Deterministic correlation — Three explicit rules, no machine learning
- Immutable event storage — Append-only data model for auditability
- Timeline-focused interface — Chronological ordering as the primary view
- Explainable reasoning — Every correlation decision is documented
- Calm design — Minimal UI to reduce stress during incidents
BLACKBOX consists of three primary components:
Frontend (React)
↓
Backend API (FastAPI)
↓
Database (PostgreSQL)
Backend
- FastAPI (Python 3.11)
- SQLAlchemy ORM
- PostgreSQL 15
- Pydantic validation
Frontend
- React 18
- Vite build system
- Axios for API communication
- React Router for navigation
Infrastructure
- Render (deployment platform)
- Docker (containerization)
- GitHub Actions (CI/CD ready)
Events are immutable records representing something that occurred in a system:
{
"service": "payments",
"environment": "prod",
"level": "error",
"message": "Database timeout after 30s",
"request_id": "req_abc123",
"timestamp": "2026-01-27T10:42:11Z"
}
All events are stored permanently without modification or deletion.
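As an illustration of what immutability means at the code level, the following is a minimal sketch using a frozen dataclass. The `Event` class is hypothetical and is not taken from `backend/models.py`; it only mirrors the fields of the payload shown above.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical in-memory mirror of the event payload (not the real model)."""
    service: str
    environment: str
    level: str
    message: str
    timestamp: datetime
    request_id: Optional[str] = None


event = Event(
    service="payments",
    environment="prod",
    level="error",
    message="Database timeout after 30s",
    timestamp=datetime(2026, 1, 27, 10, 42, 11, tzinfo=timezone.utc),
    request_id="req_abc123",
)

try:
    event.message = "edited"  # assignment on a frozen dataclass raises
except FrozenInstanceError:
    print("events are read-only once created")
```

A frozen dataclass only protects objects in memory; the storage layer still has to refuse updates and deletes for the guarantee to hold end to end.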
Incidents are automatically created when error thresholds are exceeded:
- Threshold: 5 error-level events
- Time Window: 3 minutes (rolling)
- Scope: Per service, per environment
This threshold-based approach is simple, predictable, and configurable.
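The detection rule above can be sketched as a rolling counter per (service, environment) pair. This is an illustrative sketch, not the code in `backend/correlation.py`: the `IncidentDetector` class is hypothetical, and it assumes events are observed in timestamp order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

ERROR_THRESHOLD = 5        # errors needed to open an incident
TIME_WINDOW_MINUTES = 3    # rolling detection window


class IncidentDetector:
    """Hypothetical rolling-window counter keyed by (service, environment)."""

    def __init__(self):
        self._errors = defaultdict(deque)

    def observe(self, service, environment, level, timestamp):
        """Record one event; return True when the error threshold is reached."""
        if level != "error":
            return False
        window = self._errors[(service, environment)]
        window.append(timestamp)
        # Discard errors older than the window that ends at this event.
        limit = timedelta(minutes=TIME_WINDOW_MINUTES)
        while window and timestamp - window[0] > limit:
            window.popleft()
        return len(window) >= ERROR_THRESHOLD


detector = IncidentDetector()
start = datetime(2026, 1, 27, 10, 0, 0, tzinfo=timezone.utc)
results = [
    detector.observe("payments", "prod", "error", start + timedelta(seconds=10 * i))
    for i in range(5)
]
print(results)  # [False, False, False, False, True]
```

Keying the counter by service and environment keeps a burst in staging from opening an incident in prod.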
Events are correlated to incidents using three deterministic rules:
- Same Request ID — Events sharing a request identifier
- Same Service within Time Window — Events from the same service within 10 minutes
- Same Environment during Incident — Events in the environment during an active incident
Each correlation includes the specific rule(s) that matched, making the decision process transparent and debuggable.
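One way to picture the three rules is a function that returns the name of every rule that matched. The sketch below is an assumption-laden illustration rather than the project's correlation engine: the function name, rule labels, and dictionary shapes are hypothetical, the 10-minute window is assumed to be measured from the incident start, and events from a different environment are assumed never to correlate.

```python
from datetime import datetime, timedelta, timezone

CORRELATION_WINDOW = timedelta(minutes=10)


def matched_rules(event, incident, incident_request_ids):
    """Return the name of every rule that links `event` to `incident`."""
    # Assumption: events from another environment are never correlated.
    if event["environment"] != incident["environment"]:
        return []
    rules = []
    request_id = event.get("request_id")
    if request_id is not None and request_id in incident_request_ids:
        rules.append("same_request_id")
    # Assumption: the window is measured from the incident start time.
    near_start = abs(event["timestamp"] - incident["start"]) <= CORRELATION_WINDOW
    if event["service"] == incident["service"] and near_start:
        rules.append("same_service_within_window")
    end = incident.get("end")
    if incident["start"] <= event["timestamp"] and (end is None or event["timestamp"] <= end):
        rules.append("same_environment_during_incident")
    return rules


incident = {
    "service": "payments",
    "environment": "prod",
    "start": datetime(2026, 1, 27, 10, 40, tzinfo=timezone.utc),
    "end": None,  # still open
}
event = {
    "service": "checkout",
    "environment": "prod",
    "request_id": "req_abc123",
    "timestamp": datetime(2026, 1, 27, 10, 42, 11, tzinfo=timezone.utc),
}
print(matched_rules(event, incident, {"req_abc123"}))
# ['same_request_id', 'same_environment_during_incident']
```

Returning the rule names instead of a boolean is what allows the reasoning to be stored next to each correlated event.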
The timeline view presents all correlated events in strict chronological order, grouped by service. This preserves the actual sequence of events and enables pattern recognition across distributed systems.
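A timeline of this kind can be sketched as a sort by timestamp followed by a per-service split. The `build_timeline` helper below is hypothetical and assumes every timestamp uses the same UTC ISO 8601 format as the event example, so that string order equals chronological order.

```python
from collections import defaultdict


def build_timeline(events):
    """Return (events in chronological order, the same events split per service)."""
    # ISO 8601 UTC strings in one fixed format sort chronologically as text.
    ordered = sorted(events, key=lambda e: e["timestamp"])
    by_service = defaultdict(list)
    for e in ordered:
        by_service[e["service"]].append(e)
    return ordered, dict(by_service)


events = [
    {"service": "payments", "level": "error", "timestamp": "2026-01-27T10:42:11Z"},
    {"service": "db", "level": "warning", "timestamp": "2026-01-27T10:41:58Z"},
    {"service": "payments", "level": "info", "timestamp": "2026-01-27T10:41:30Z"},
]
ordered, by_service = build_timeline(events)
print([e["service"] for e in ordered])  # ['payments', 'db', 'payments']
```

Because the per-service lists are filled from the already sorted sequence, each one stays in chronological order as well.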
- Python 3.11 or higher
- Node.js 18 or higher
- PostgreSQL 15 or higher
1. Clone the repository
git clone https://github.com/darshhv/blackbox.git
cd blackbox
2. Set up the database
createdb blackbox
createuser blackbox --pwprompt
# Enter password: blackbox
psql -d blackbox -c "GRANT ALL PRIVILEGES ON DATABASE blackbox TO blackbox;"
3. Configure environment
export DATABASE_URL=postgresql://blackbox:blackbox@localhost:5432/blackbox
4. Start the backend
cd backend
pip install -r requirements.txt
python main.py
Backend runs on http://localhost:8000
5. Start the frontend
cd frontend
npm install
npm run dev
Frontend runs on http://localhost:3000
BLACKBOX is designed for straightforward deployment on Render.
Database
- Create PostgreSQL database
- Copy the Internal Database URL
Backend Service
- Connect GitHub repository
- Configure service:
  - Root Directory: backend
  - Build Command: pip install -r requirements.txt
  - Start Command: uvicorn main:app --host 0.0.0.0 --port $PORT
- Add environment variable:
  DATABASE_URL: [Internal Database URL from step 1]
Frontend Service
- Connect the same GitHub repository
- Configure service:
  - Root Directory: frontend
  - Build Command: npm install && npm run build
  - Publish Directory: dist
- Update frontend/src/services/api.js with the backend URL
The application can be deployed to any platform supporting:
- Python ASGI applications
- Node.js static site hosting
- PostgreSQL databases
Tested on: Render, Railway, Heroku, Vercel (frontend), AWS
Interactive API documentation is available at /docs when the backend is running.
Example: Creating an Event
curl -X POST https://your-backend-url.com/events \
-H "Content-Type: application/json" \
-d '{
"service": "payments",
"environment": "prod",
"level": "error",
"message": "Database connection timeout",
"request_id": "req_001",
"timestamp": "2026-01-27T10:00:00Z"
}'
Example: Listing Incidents
curl https://your-backend-url.com/incidents
Example: Incident Details
curl https://your-backend-url.com/incidents/1
To trigger incident detection, send 5 or more error events from the same service within a 3-minute window:
for i in {1..6}; do
curl -X POST https://your-backend-url.com/events \
-H "Content-Type: application/json" \
-d "{
\"service\": \"api-gateway\",
\"environment\": \"prod\",
\"level\": \"error\",
\"message\": \"Service unavailable\",
\"request_id\": \"req_$i\",
\"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"
}"
sleep 1
done
The incident will appear immediately in the web interface.
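For environments without a POSIX shell, the same burst can be sketched in Python with only the standard library. The helper names are hypothetical, `BASE_URL` is the same placeholder used in the curl examples, and nothing is sent until `send_error_burst()` is called explicitly.

```python
import json
import time
from datetime import datetime, timezone
from urllib.request import Request, urlopen

BASE_URL = "https://your-backend-url.com"  # placeholder, as in the curl examples


def build_event_request(i):
    """Build (but do not send) one POST /events request."""
    payload = {
        "service": "api-gateway",
        "environment": "prod",
        "level": "error",
        "message": "Service unavailable",
        "request_id": f"req_{i}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return Request(
        f"{BASE_URL}/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_error_burst(count=6):
    """Send `count` error events one second apart, like the shell loop."""
    for i in range(1, count + 1):
        with urlopen(build_event_request(i)) as response:
            print(i, response.status)
        time.sleep(1)
```

Calling `send_error_burst()` against a running backend should cross the detection threshold, since six errors arrive well inside the 3-minute window.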
Order matters more than error levels. A warning that occurs before an error may be more important than the error itself. BLACKBOX preserves temporal sequence as the primary organizing principle.
Events must be grouped accurately before attempting to explain what happened. The correlation engine uses simple, deterministic rules rather than heuristics or machine learning to ensure grouping decisions are predictable and debuggable.
Clear data structures and explicit logic are more valuable than sophisticated algorithms. Every decision in BLACKBOX is traceable and explainable.
The platform assists engineers in understanding failures but does not attempt to replace human judgment. Root cause summaries are framed as "likely" rather than definitive.
The interface is deliberately minimal to reduce visual noise during high-stress incidents. No charts, no animations, no distractions—just the timeline and the facts.
Located in backend/correlation.py:
ERROR_THRESHOLD = 5 # Errors to trigger incident
TIME_WINDOW_MINUTES = 3 # Rolling detection window
CORRELATION_WINDOW_MINUTES = 10 # Event correlation window
The database uses three primary tables:
events — Immutable event log
- Indexed by timestamp, service, request_id
- No updates or deletes permitted
incidents — Detected failure windows
- Tracks service, environment, status
- Start and end times
incident_events — Correlation mappings
- Links events to incidents
- Stores correlation reasoning
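To make the three tables concrete, here is a small sketch that uses SQLite from the Python standard library so it runs without a server. It is not the project's PostgreSQL schema or its SQLAlchemy models: the column names, the `correlation_reason` column, and the triggers that reject updates and deletes are all assumptions chosen to illustrate the description above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    service TEXT NOT NULL,
    environment TEXT NOT NULL,
    level TEXT NOT NULL CHECK (level IN ('info', 'warning', 'error')),
    message TEXT NOT NULL,
    request_id TEXT,
    timestamp TEXT NOT NULL
);
CREATE INDEX idx_events_timestamp ON events (timestamp);
CREATE INDEX idx_events_service ON events (service);
CREATE INDEX idx_events_request_id ON events (request_id);

CREATE TABLE incidents (
    id INTEGER PRIMARY KEY,
    service TEXT NOT NULL,
    environment TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'open',
    start_time TEXT NOT NULL,
    end_time TEXT
);

CREATE TABLE incident_events (
    incident_id INTEGER NOT NULL REFERENCES incidents (id),
    event_id INTEGER NOT NULL REFERENCES events (id),
    correlation_reason TEXT NOT NULL,
    PRIMARY KEY (incident_id, event_id)
);

-- Assumed enforcement of immutability: reject any UPDATE or DELETE on events.
CREATE TRIGGER events_no_update BEFORE UPDATE ON events
BEGIN SELECT RAISE(ABORT, 'events are immutable'); END;
CREATE TRIGGER events_no_delete BEFORE DELETE ON events
BEGIN SELECT RAISE(ABORT, 'events are immutable'); END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO events (service, environment, level, message, request_id, timestamp)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("payments", "prod", "error", "Database timeout after 30s",
     "req_abc123", "2026-01-27T10:42:11Z"),
)
try:
    conn.execute("UPDATE events SET message = 'edited'")
except sqlite3.DatabaseError as exc:
    print("rejected:", exc)  # the trigger aborts the statement
```

Enforcing immutability in the database rather than only in application code means a stray script or manual query cannot silently rewrite history.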
Ingest a new event.
Request Body
{
"service": "string",
"environment": "string",
"level": "info|warning|error",
"message": "string",
"request_id": "string (optional)",
"timestamp": "ISO 8601 datetime"
}
Response: 201 Created
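The request body rules above can be sketched as a plain validation function. The backend is described as using Pydantic for this, so the following standard-library version is only an illustration of the constraints; `validate_event` and its error strings are hypothetical.

```python
from datetime import datetime

REQUIRED_FIELDS = ("service", "environment", "level", "message", "timestamp")
ALLOWED_LEVELS = ("info", "warning", "error")


def validate_event(payload):
    """Return a list of problems; an empty list means the payload is acceptable."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"{field}: required non-empty string")
    level = payload.get("level")
    if isinstance(level, str) and level and level not in ALLOWED_LEVELS:
        errors.append("level: must be one of info, warning, error")
    request_id = payload.get("request_id")
    if request_id is not None and not isinstance(request_id, str):
        errors.append("request_id: must be a string when present")
    timestamp = payload.get("timestamp")
    if isinstance(timestamp, str) and timestamp:
        try:
            # Map a trailing "Z" to an explicit UTC offset before parsing.
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp: must be an ISO 8601 datetime")
    return errors


print(validate_event({
    "service": "payments",
    "environment": "prod",
    "level": "error",
    "message": "Database connection timeout",
    "timestamp": "2026-01-27T10:00:00Z",
}))  # []
```

Collecting every problem before returning lets a client fix a rejected payload in one round trip.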
List all incidents.
Query Parameters
- status — Filter by open/resolved
- environment — Filter by environment
Response: Array of incident summaries
Get incident details with full timeline.
Response: Incident object with correlated events
Mark incident as resolved.
Response: Updated incident status
blackbox/
├── backend/
│ ├── main.py # FastAPI application
│ ├── models.py # Database models
│ ├── schemas.py # Pydantic schemas
│ ├── correlation.py # Correlation engine
│ ├── database.py # Database configuration
│ └── requirements.txt # Python dependencies
├── frontend/
│ ├── src/
│ │ ├── pages/
│ │ │ ├── IncidentsList.jsx
│ │ │ └── IncidentDetail.jsx
│ │ ├── services/
│ │ │ └── api.js
│ │ ├── App.jsx
│ │ └── main.jsx
│ ├── package.json
│ └── vite.config.js
├── database/
│ └── generate_sample_data.py
└── README.md
Machine learning would introduce unpredictability in a system where trust is critical. Engineers must be able to verify correlation decisions. Rule-based logic is:
- Completely deterministic
- Fully debuggable
- Easily auditable
- Understandable by all team members
Allowing event modification would undermine trust in the timeline. Immutability provides:
- Clear audit trail
- No possibility of retroactive changes
- Single source of truth
- Simplified debugging
BLACKBOX operates after alerts have already fired. It exists to help engineers understand what happened, not to notify them that something is happening. This focused scope prevents feature creep and maintains clarity of purpose.
Most incident tools focus on aggregation and statistics. BLACKBOX prioritizes sequence because:
- Causal relationships emerge from ordering
- Engineers think in terms of "what happened first"
- Timelines reveal cascade failures
- Chronological views are universally understood
The following features are intentionally not included in version 1.0:
- Authentication and authorization
- Multi-tenant organizations
- Alert/paging integration
- Real-time event streaming
- Metric visualization
- Distributed tracing integration
- Machine learning or AI
- Event search and filtering (UI)
- Automatic incident resolution
These exclusions reflect design restraint and focus on core functionality.
- Timestamp-based detection: Incident detection uses event timestamps, not ingestion time
- Single-environment isolation: Events from different environments never correlate
- No event updates: Once created, events cannot be modified
- Manual resolution: Incidents must be manually marked as resolved
Contributions are welcome. Please follow these guidelines:
- Maintain the philosophy — Any contribution should align with the core design principles
- Preserve explainability — New features must be deterministic and debuggable
- Document reasoning — Explain the "why" behind technical decisions
- Test thoroughly — Include tests for new functionality
- Keep it simple — Favor clarity over cleverness
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Update documentation
- Submit a pull request
Use the sample data generator:
cd database
python generate_sample_data.py
Choose from predefined scenarios:
- Database timeout incident
- Deployment failure
- Cascading service failure
- Multi-environment events
Interactive testing via Swagger UI:
http://localhost:8000/docs
The schema includes indexes on:
- events.timestamp
- events.service
- events.request_id
- events.environment
These support efficient incident detection and correlation queries.
For high-volume deployments:
- Partition events table by timestamp
- Add read replicas for query load
- Implement event batching for ingestion
- Consider time-series database for events
The current implementation is optimized for clarity and maintainability, not maximum throughput.
MIT License
Copyright (c) 2026 BLACKBOX Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Built to demonstrate systems thinking, data modeling, and production deployment capabilities. Designed with the principle that calm systems produce calm engineers.
Author: Darshan Reddy V
Repository: github.com/darshhv/blackbox
Live Demo: blackbox-frontend-hwy0.onrender.com
For questions, issues, or discussions:
- Issues: GitHub Issues
- Documentation: See the /docs folder
- API Reference: /docs endpoint when running
BLACKBOX — Understanding failures through sequence.