A self-hosted, AI-powered data management and analytics platform. Upload any file type, get automatic processing, interactive dashboards, AI-generated reports, and chat with an AI assistant that understands your data.
Cloud-only LLM inference — no local GPU required. Uses Groq (free tier) or Ollama Cloud.
- Universal File Upload — Drag & drop any file: CSV, Excel, PDF, JSON, databases, images, audio, video, archives, and 100+ more formats
- Automatic Processing — Files are detected, parsed, profiled, and stored in the optimal database engine
- Smart Dashboards — Auto-generated charts and insights, plus a drag-and-drop dashboard builder with chart, KPI, table, and text widgets
- AI-Powered Reports — Generate professional reports from your data using 5 templates (Executive Summary, Data Deep-Dive, Monthly Report, Comparison Report, Quick Brief) with real data-driven analysis
- AI Chat Assistant — Full-page chat interface with rich content rendering: SQL syntax highlighting, comparative bar charts, metric highlighting, and conversation history
- Multi-Agent Architecture — 20 specialized AI agents with input/output contracts, execution metrics, dispatch timeouts, and dead letter tracking
- File Manager — Visual file browser with thumbnails, tags, search, and database/archive browsing
- Notes System — Markdown notes with auto-save, linked to files or standalone
- Export — Export dashboards and reports as JSON, export data as CSV, JSON, XLSX
- Settings — Profile management, AI model configuration, server-side storage monitoring, and bulk file reprocessing
- Privacy First — All data stays on your server. LLM calls go to free cloud APIs (Groq or Ollama Cloud)
| Layer | Technology |
|---|---|
| Frontend | Next.js 14 (App Router), React 18, Tailwind CSS, shadcn/ui, Radix UI |
| Charts | Apache ECharts |
| State | Zustand |
| Backend API | Next.js API Routes (BFF) + Python FastAPI (AI Service) |
| Database | SQLite (catalog + user data), DuckDB (OLAP analytics) |
| AI / LLM | Groq (free tier), Ollama Cloud |
| Embeddings | OpenAI-compatible embeddings (via Groq or Ollama Cloud) |
| Auth | JWT + bcrypt, cookie-based sessions |
| Deployment | Docker Compose, Oracle Cloud Always-Free Tier |
git clone <repo-url>
cd data-ruler
cp .env.example .env
# Edit .env — add at least one API key (GROQ_API_KEY or OLLAMA_CLOUD_API_KEY)
docker compose up --build -d
Open http://localhost:3000 and create an account.
Get a free API key from Groq (recommended) or use an Ollama Cloud API key.
Oracle Cloud offers an always-free ARM VM with 4 CPUs, 24GB RAM, and 200GB disk — permanently, no time limit. You run Docker Compose on the VM with Caddy for automatic HTTPS.
Step 1 — Create a free Oracle Cloud account
Go to cloud.oracle.com/free and sign up. A credit card is required for identity verification but you will not be charged — the Always Free tier is permanent and separate from any trial credits.
Step 2 — Create an Always Free VM
- In the Oracle Cloud Console, go to Compute → Instances → Create Instance
- Configure:
- Image: Ubuntu 22.04 (or Oracle Linux)
- Shape: Click "Change Shape" → Ampere → VM.Standard.A1.Flex → 4 OCPUs, 24GB RAM
- Networking: Ensure "Assign a public IPv4 address" is checked
- SSH key: Upload your public key or let Oracle generate one (download it)
- Click Create and wait for the instance to be "Running"
- Copy the Public IP address from the instance details page
Step 3 — Open firewall ports
In Oracle Cloud Console:
- Go to Networking → Virtual Cloud Networks → click your VCN → Security Lists → Default Security List
- Click Add Ingress Rules and add:
- Source CIDR: 0.0.0.0/0, Destination Port: 80, Protocol: TCP
- Source CIDR: 0.0.0.0/0, Destination Port: 443, Protocol: TCP
Step 4 — Set up the VM
SSH into the VM and run the setup script:
ssh ubuntu@<your-public-ip>
git clone <your-repo-url> data-ruler
cd data-ruler
./setup-oracle.sh
This installs Docker, opens OS-level firewall ports (80, 443), and prints next steps. You may need to log out and back in after installation for the Docker group changes to take effect.
Step 5 — Configure environment
cp .env.example .env
nano .env
Set these values:
NEXTAUTH_SECRET=<run: openssl rand -base64 32>
NEXTAUTH_URL=https://yourdomain.com
AI_SERVICE_URL=http://ai-service:8000
DOMAIN=yourdomain.com
GROQ_API_KEY=gsk_your_key_here   # or OLLAMA_CLOUD_API_KEY
Step 6 — Configure HTTPS
Edit the Caddyfile in the project root — replace your-domain.com with your actual domain:
yourdomain.com {
reverse_proxy web:3000
}
No custom domain? Use a free subdomain from DuckDNS — point it to your Oracle VM's IP and use that in the Caddyfile and .env.
Step 7 — Deploy
./deploy.sh
The deploy script auto-detects production mode when DOMAIN is set in .env and uses Caddy for HTTPS.
Your app is live at https://yourdomain.com with 24GB RAM, persistent storage, and no usage limits.
What you get for free, forever: 4 ARM CPUs, 24GB RAM, 200GB boot volume, 10TB/month outbound data. No sleep, no cold starts, no time limits.
# Required: at least ONE cloud LLM API key
GROQ_API_KEY=gsk_... # Groq (recommended, fastest)
OLLAMA_CLOUD_API_KEY=... # Ollama Cloud (remote Ollama instance)
# Optional: model overrides
GROQ_CHAT_MODEL=llama-3.3-70b-versatile
GROQ_FAST_MODEL=llama-3.1-8b-instant
OLLAMA_CLOUD_BASE_URL=https://ollama.com/v1
# Ollama Cloud model is locked to gemini-3-flash-preview (not configurable)
# Auth
NEXTAUTH_SECRET=your-secret-key-here
# Service URLs
AI_SERVICE_URL=http://localhost:8000
# Production (Oracle Cloud)
DOMAIN=your-domain.com
See .env.example for all options.
The UI follows a dark navy design system with emerald green accents, purple highlights, and a polished data-centric aesthetic. Full RTL support for Hebrew.
Centered authentication card on dark navy background. Email + password form with emerald accent buttons and link to registration. Language switcher in top corner.
Account creation form with display name, email, password (8+ chars). Matches the login design with language switcher.
Project Files page with breadcrumb navigation (Repository / Main Files), list/grid view toggle, and Quantum Upload Gateway drop zone. File table features colored category badges (Behavioral, Spatial, Financial), quality score bars with percentages, status dots (Processed/Processing/Failed/Queued), and pagination. Toolbar includes Select All, Bulk Download, Filters, and Sort controls.
Dashboard overview with stats cards (Total Visuals with trend, Active Streams with live indicator, System Performance with AI Core Utilization bar chart). Filter tabs (All Types, Recently Updated, Shared). Dashboard cards show mini chart previews, visibility badges (Public/Private/Internal), widget counts, and update timestamps. Includes "Create New View" card with pagination.
Full-page AI Chat Assistant with Recent Insights panel showing conversation history. Chat supports rich content rendering: highlighted metrics in emerald green, SQL code blocks with PostgreSQL syntax highlighting, comparative bar charts (Revenue Growth by Category with Q3 vs Q4), and action buttons (Helpful, Regenerate). Input bar with attachment, microphone, and send controls. Footer shows "AI-Powered Insights / Verified Data Models".
Notes Explorer sidebar with synced status badges and file association icons. Editor features auto-save indicator, formatting toolbar (Bold, Italic, List, Code), Linked Assets section, and Preview mode. Bottom stats show Sentiment Shift and Tokens Processed metrics.
Report management with 5 template cards for quick creation. Report cards show template icon, status badge (Draft/Generating/Ready/Error), and metadata. Search and filter by status.
"Report Active / Live Stream Connected" status badges. Schema Structure Analysis table with Dataset ID, Columns, Rows, Storage Size, and Health bars. Volume Allocation visualization with Total Payload and Peak Flow metrics. Anomaly Detection alerts with priority levels (High/Medium/Cleared). Cross-Dataset Correlation coefficient matrix. Advanced Curatorial Insight narrative with Index Health and Risk Score cards.
Comprehensive technical breakdown with schema analysis, volume allocation distribution chart, anomaly detection (Skewed Distribution, Missing Entry Pattern, Outlier Resolved), and cross-dataset Pearson correlation matrix heatmap.
Processing pipeline stats (Ingested, Processed, Errors, Pending), category breakdown table, quality trends, and activity timeline.
Dataset Divergence Analysis with Similarity Score badge. Side-by-side comparison table: Attribute column with File A vs File B showing Format, Category, Size (with Rank badges), Structure (Rows/Columns), Data Quality (Completeness + Consistency bars), and Status (Production Ready / Needs Sanitization). Statistical Difference Analysis cards (Schema Mismatch, Unique Entity Analysis, Outlier Detection, Temporal Drift). Performance Rankings with Quality Leader and Efficiency Score.
Dataset Snapshot with Live Analysis indicator. File metadata card (Format, Total Size, Rows, Cols) alongside a quality ring chart (88%). Metric cards: Integrity (Optimal), Velocity (14.2ms), Outliers (3.1%). AI-Generated Insights with "New Insights Available" badge. Side cards: Ingest Trend bar chart, Top Performing Column, and "Automate Clean?" CTA with Apply Fixes button.
Profile management, dark/light theme toggle, language selector (English/Hebrew with RTL), AI model configuration, server-side storage usage with progress bar, cache clearing, and bulk file reprocessing.
GET /health → System health, cloud LLM status, agent count
GET /api/health → Same (alias)
POST /api/chat/chat → Stream AI chat response via orchestrator pipeline
Body: { message, user_id, context_file_id?, context_id?, conversation_history[] }
Response: SSE stream of { content, intent?, context_id? } chunks
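A client-side sketch of consuming that stream, assuming each SSE event arrives as a single `data:` line carrying one JSON chunk (the parser and sample payloads below are illustrative, not taken from the project):

```python
import json

def parse_sse_chunks(raw: str):
    """Parse an SSE body into the {content, intent?, context_id?} chunks
    described above. Assumes one JSON object per `data:` line, which is a
    guess at the wire framing, not confirmed by the source."""
    chunks = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload and payload != "[DONE]":
                chunks.append(json.loads(payload))
    return chunks

# Hypothetical response body for a chat request
body = (
    'data: {"content": "Your top category is "}\n\n'
    'data: {"content": "Electronics.", "intent": "analyze_data"}\n\n'
)
chunks = parse_sse_chunks(body)
full_text = "".join(c["content"] for c in chunks)
```

Concatenating the `content` fields in arrival order reconstructs the assistant's full reply.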
POST /api/files/process → Trigger async file processing
Body: { file_id, user_id, file_path, original_name }
GET /api/files/status/{file_id} → Get processing status
GET /api/agents/ → List all agents with status and execution metrics
GET /api/agents/metrics → Aggregated metrics for all agents
GET /api/agents/bus-stats → Message bus stats + recent dead letters
GET /api/agents/{name} → Agent detail (contract, circuit state, token budget, metrics)
POST /api/agents/reset-circuit → Reset circuit breaker for agent
POST /api/pipelines/orchestrate → Full LLM-powered orchestration
Body: { message, user_id, file_id?, schema_context?, action? }
POST /api/pipelines/query → Natural language → SQL query
Body: { query, user_id, schema_context? }
POST /api/pipelines/analyze → Run analytics pipeline
POST /api/pipelines/visualize → Generate ECharts visualization
POST /api/auth/register → Create account
POST /api/auth/login → Login (sets auth-token cookie)
POST /api/auth/logout → Logout
GET /api/auth/me → Current user
PUT /api/auth/profile → Update user profile (display name)
GET /api/files → List user files (paginated, filterable)
POST /api/files/upload → Upload files (multipart)
GET /api/files/{id} → File details
PATCH /api/files/{id} → Update file metadata
DELETE /api/files/{id} → Delete file
GET /api/files/{id}/preview → Preview file data
GET /api/files/{id}/profile → Data quality profile
POST /api/chat/message → Send message (proxies to AI service, SSE)
GET /api/chat/history → Chat history
GET /api/dashboards → List dashboards
POST /api/dashboards → Create dashboard
GET /api/dashboards/{id} → Get dashboard with widgets
PUT /api/dashboards/{id} → Update dashboard
DELETE /api/dashboards/{id} → Delete dashboard
GET /api/reports → List reports
POST /api/reports → Create report
GET /api/reports/{id} → Get report with content
PUT /api/reports/{id} → Update report
DELETE /api/reports/{id} → Delete report
POST /api/reports/{id}/generate → Generate report content from file data
GET /api/notes → List notes
POST /api/notes → Create note
GET /api/notes/{id} → Get note
PUT /api/notes/{id} → Update note
DELETE /api/notes/{id} → Delete note
GET /api/settings/storage → Server-side storage usage metrics
POST /api/processing/reprocess → Queue all files for reprocessing
GET /api/processing/queue → Processing task queue status
POST /api/data/query → Execute SQL against user data
POST /api/export/data → Export data in various formats
-- Users table
users (id, email, password_hash, display_name, settings, created_at)
-- File registry with full metadata
files (id, user_id, original_name, stored_path, file_type, file_category,
mime_type, size_bytes, content_hash, storage_backend, db_table_name,
schema_snapshot, row_count, column_count, processing_status,
quality_profile, quality_score, ai_summary, tags, created_at)
-- Imported table registry
imported_tables (id, file_id, table_name, schema_snapshot, row_count)
-- Cross-file relationships
file_relationships (id, file_id_a, file_id_b, relationship_type,
column_a, column_b, confidence, confirmed_by_user)
-- Dashboards with widget configs
dashboards (id, user_id, title, description, layout, widgets, is_auto_generated)
-- AI-generated reports
reports (id, user_id, title, description, template, status, file_ids, content, config)
-- Markdown notes
notes (id, user_id, file_id, title, content, content_format)
-- Chat history
chat_messages (id, user_id, role, content, context_file_id, metadata)
-- Processing task queue
processing_tasks (id, user_id, file_id, task_type, status, agent_name, result)
-- Agent performance logs
agent_logs (id, agent_name, task_type, latency_ms, token_count, success)
Each uploaded tabular file gets its own table, file_{file_id}, with all columns stored as TEXT for maximum compatibility. Schema inference metadata is stored in the catalog.
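As an illustration of that per-file convention, a minimal sketch using the stdlib `sqlite3` driver (the function and sample columns are hypothetical; the project's actual import path may differ):

```python
import sqlite3

def import_rows(conn, file_id, header, rows):
    """Create file_{file_id} with every column typed TEXT, as described
    above, and bulk-insert the parsed rows as strings."""
    table = f"file_{file_id}"
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [[str(v) for v in row] for row in rows],
    )
    return table

conn = sqlite3.connect(":memory:")
table = import_rows(conn, "a1b2", ["name", "amount"], [("Alice", 9.5), ("Bob", 3)])
count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

Storing everything as TEXT avoids type-coercion failures at import time; the inferred types live in the catalog's schema_snapshot instead.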
DataRuler uses 20 specialized AI agents coordinated by an LLM-powered orchestrator. This section explains how they work together.
All requests — including chat — flow through the orchestrator pipeline. The orchestrator determines intent, builds an execution plan, dispatches agents in parallel groups with session context, and synthesizes results via LLM.
HTTP Request (user message, file upload, query)
│
▼
┌──────────────────┐
│ FastAPI Router │ /api/chat, /api/files/process, /api/pipelines/*
└────────┬─────────┘
│
▼
┌──────────────────┐ ┌──────────────────┐
│ Orchestrator │────▶│ Cloud LLM (Groq)│ temperature=0.1
│ Agent │◄────│ json_mode=true │ max_tokens=512
└────────┬─────────┘ └──────────────────┘
│
│ Session context (ContextStore)
│ + Execution plan JSON:
│ { intent, confidence, plan: [{agent, parallel_group}], reasoning }
│
▼
┌──────────────────────────────────────────────────────┐
│ Parallel Execution Engine (with dispatch timeouts) │
│ │
│ Group 0 ──▶ [validation_security] ──────────────────│──▶ accumulate context
│ Group 1 ──▶ [file_detection, schema_inference] ────│──▶ accumulate context
│ Group 2 ──▶ [analytics, visualization] ────────────│──▶ accumulate context
│ Group 3 ──▶ [storage_router] ──────────────────────│──▶ accumulate context
│ │
│ Groups run sequentially (0→1→2→3) │
│ Steps WITHIN a group run concurrently (asyncio.gather)│
│ Failed agents don't block other agents in the group │
└────────┬─────────────────────────────────────────────┘
│
▼
┌──────────────────┐
│ Result Synthesis │ LLM combines all agent outputs into final response
└────────┬─────────┘
│
▼
HTTP Response (streamed SSE or JSON)
All inter-agent communication uses the AgentMessage envelope:
AgentMessage:
message_id: UUID # Unique message identifier
correlation_id: UUID # Tracks request/reply chains
type: REQUEST | RESPONSE | ERROR | STATUS
source_agent: str # Sender agent name
target_agent: str # Recipient agent name
priority: LOW(0) | NORMAL(1) | HIGH(2) | CRITICAL(3)
payload: dict # Arbitrary data (input params, results, errors)
ttl: int # Time-to-live in seconds
created_at: datetime # Timestamp
Each agent declares an AgentContract specifying its required/optional inputs and guaranteed output keys. The base class validates contracts at dispatch time, returning clear error messages when required inputs are missing.
AgentContract:
required_inputs: tuple[str, ...] # Keys the agent expects in the payload
optional_inputs: tuple[str, ...] # Keys the agent can use but doesn't require
output_keys: tuple[str, ...] # Keys guaranteed in the response on success
The message bus provides async pub/sub with priority-based dispatch:
- Target-based routing — Messages routed to `target_agent` via registered subscriber callbacks
- Priority queue — `asyncio.PriorityQueue` dequeues highest-priority messages first
- Request/reply — `correlation_id` maps to an `asyncio.Future` for blocking await with timeout (default 30s)
- Fan-out — Multiple subscribers can register for the same agent (all receive the message)
- TTL enforcement — Messages past their TTL are moved to the dead letter queue instead of being delivered
- Dead letter queue — Undeliverable and expired messages are captured with reason codes for operational visibility (`GET /api/agents/bus-stats`)
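The routing, priority, fan-out, and dead-letter behaviors above can be sketched as a toy bus built on `asyncio.PriorityQueue` (class and method names are illustrative, not the project's implementation):

```python
import asyncio
from itertools import count

class MiniBus:
    """Toy message bus: target-based routing, priority dispatch,
    fan-out to all subscribers, and a dead-letter list."""

    def __init__(self):
        self.queue = asyncio.PriorityQueue()
        self.subscribers = {}   # target_agent -> list of async callbacks
        self.dead_letters = []  # undeliverable messages, with a reason
        self._seq = count()     # FIFO tie-breaker for equal priorities

    def subscribe(self, agent, callback):
        self.subscribers.setdefault(agent, []).append(callback)

    async def publish(self, target_agent, payload, priority=1):
        # PriorityQueue pops the SMALLEST item first, so negate priority
        # so that CRITICAL(3)/HIGH(2) dispatch before NORMAL(1)/LOW(0).
        await self.queue.put((-priority, next(self._seq), target_agent, payload))

    async def dispatch_all(self):
        while not self.queue.empty():
            _, _, target, payload = await self.queue.get()
            callbacks = self.subscribers.get(target)
            if not callbacks:
                self.dead_letters.append({"target": target, "reason": "no_subscriber"})
                continue
            for cb in callbacks:  # fan-out: every subscriber receives it
                await cb(payload)

async def demo():
    bus = MiniBus()
    seen = []

    async def on_message(payload):
        seen.append(payload["task"])

    bus.subscribe("analytics", on_message)
    await bus.publish("analytics", {"task": "profile"}, priority=1)
    await bus.publish("analytics", {"task": "alert"}, priority=3)
    await bus.publish("ghost_agent", {"task": "noop"})  # no subscriber
    await bus.dispatch_all()
    return seen, bus.dead_letters

seen, dead = asyncio.run(demo())
```

The higher-priority "alert" is dispatched before "profile", and the message to the unregistered agent lands in the dead-letter list with a reason code.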
The orchestrator has two paths for deciding which agents to invoke:
Path A — LLM Intent Parsing (primary):
- Constructs prompt with user message + file context + schema context + session state
- Calls the LLM with `json_mode=True`, `temperature=0.1` (near-deterministic)
- Returns a structured JSON plan
Path B — Keyword Fallback (when LLM parsing fails):
| Keywords | Intent | Agents |
|---|---|---|
| `query`, `select`, `sql`, `count`, `average` | query_data | sql_agent |
| `chart`, `plot`, `graph`, `visualize` | visualize | analytics → visualization |
| `analyze`, `statistics`, `profile` | analyze_data | schema_inference + analytics → visualization |
| `export`, `download`, `save as` | export | export_agent |
| `relationship`, `foreign key`, `join` | find_relationships | relationship_mining |
| `upload`, `process`, `import` | process_file | validation + detection → schema → storage |
| (anything else) | general_chat | document_qa |
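The fallback can be sketched as a first-match keyword scan over the table above (the function name and substring-matching details are assumptions):

```python
# Rules copied from the fallback table: (keywords, intent, agents)
FALLBACK_RULES = [
    ({"query", "select", "sql", "count", "average"}, "query_data", ["sql_agent"]),
    ({"chart", "plot", "graph", "visualize"}, "visualize",
     ["analytics", "visualization"]),
    ({"analyze", "statistics", "profile"}, "analyze_data",
     ["schema_inference", "analytics", "visualization"]),
    ({"export", "download", "save as"}, "export", ["export_agent"]),
    ({"relationship", "foreign key", "join"}, "find_relationships",
     ["relationship_mining"]),
    ({"upload", "process", "import"}, "process_file",
     ["validation", "detection", "schema", "storage"]),
]

def keyword_fallback(message: str):
    """First matching rule wins; anything else falls through to chat."""
    text = message.lower()
    for keywords, intent, agents in FALLBACK_RULES:
        if any(kw in text for kw in keywords):
            return intent, agents
    return "general_chat", ["document_qa"]

intent, agents = keyword_fallback("Plot revenue by month")
```

"Plot revenue by month" matches the `plot` keyword, so the fallback selects the visualize intent without an LLM call.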
Groups execute sequentially (group 0 finishes before group 1 starts). Steps within a group run concurrently via asyncio.gather. Failed agents within a parallel group do not block other agents in the same group.
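That execution model can be sketched with `asyncio.gather` (the agent functions below are stand-ins, not real agents):

```python
import asyncio

async def run_plan(plan, context):
    """Run groups sequentially; steps inside a group concurrently.
    A failed step is recorded but does not block its group-mates."""
    for group in plan:  # group 0 finishes before group 1 starts
        results = await asyncio.gather(
            *(agent(context) for agent in group), return_exceptions=True
        )
        for agent, result in zip(group, results):
            if isinstance(result, Exception):
                context[agent.__name__] = {"error": str(result)}
            else:
                context.update(result)  # accumulate context for later groups
    return context

# Stand-in agents
async def file_detection(ctx): return {"file_type": "csv"}
async def schema_inference(ctx): return {"columns": ["name", "amount"]}
async def flaky_agent(ctx): raise RuntimeError("boom")

context = asyncio.run(run_plan(
    [[file_detection, flaky_agent], [schema_inference]], {}
))
```

`return_exceptions=True` is what lets a failing agent surface as a value instead of cancelling its concurrent group-mates.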
Per-agent fault tolerance prevents cascading failures:
CLOSED (normal operation)
│
│ failure_count >= 5 (within 10-min window)
▼
OPEN (all calls rejected immediately)
│
│ 60 seconds elapsed
▼
HALF_OPEN (allow exactly 1 probe request)
╱ ╲
success failure
│ │
▼ ▼
CLOSED OPEN
- Threshold: 5 failures within a 10-minute rolling window
- Recovery timeout: 60 seconds before probing
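A minimal sketch of that state machine, with an injectable clock so the timings above can be simulated (the class shape is illustrative, not the project's implementation):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after 5 failures in a 10-minute window;
    OPEN -> HALF_OPEN after 60s; one probe decides the next state."""

    FAILURE_THRESHOLD = 5
    WINDOW_SECONDS = 600
    RECOVERY_TIMEOUT = 60

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.state = "CLOSED"
        self.failures = []   # timestamps of recent failures
        self.opened_at = None

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.RECOVERY_TIMEOUT:
                self.state = "HALF_OPEN"  # let a probe request through
                return True
            return False  # rejected immediately while OPEN
        return True  # CLOSED or HALF_OPEN

    def record_success(self):
        self.state = "CLOSED"
        self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "HALF_OPEN":
            self._trip(now)  # failed probe reopens the circuit
            return
        # Keep only failures inside the rolling window before counting.
        self.failures = [t for t in self.failures if now - t < self.WINDOW_SECONDS]
        self.failures.append(now)
        if len(self.failures) >= self.FAILURE_THRESHOLD:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.opened_at = now

# Simulate the diagram with a fake clock
fake_time = [0.0]
cb = CircuitBreaker(clock=lambda: fake_time[0])
for _ in range(5):
    cb.record_failure()
tripped = cb.state                  # OPEN after 5 failures
blocked = cb.allow_request()        # rejected while OPEN
fake_time[0] += 61
probe_allowed = cb.allow_request()  # HALF_OPEN probe after recovery timeout
cb.record_success()
recovered = cb.state                # back to CLOSED
```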
Two-level budget model prevents runaway LLM costs:
- Global Budget: 2,000,000 tokens/hour (all agents combined)
- Per-Agent Budget: 400,000 tokens/hour each
- Rolling window: 1-hour sliding window with lazy pruning
- Pre-check: `has_budget(agent_name)` is called before dispatch — if the budget is exhausted, the agent is skipped
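The two-level budget can be sketched as a single event list with lazy pruning; only the limits and window come from the text, the class itself is illustrative:

```python
import time

class TokenBudget:
    """Sliding 1-hour window over (timestamp, agent, tokens) events,
    checked against a global cap and a per-agent cap."""

    GLOBAL_LIMIT = 2_000_000  # tokens/hour, all agents combined
    AGENT_LIMIT = 400_000     # tokens/hour, per agent
    WINDOW = 3600

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.events = []  # (timestamp, agent_name, tokens)

    def _prune(self):
        # Lazy pruning: drop events older than the window on access.
        cutoff = self.clock() - self.WINDOW
        self.events = [e for e in self.events if e[0] >= cutoff]

    def record(self, agent, tokens):
        self.events.append((self.clock(), agent, tokens))

    def has_budget(self, agent):
        self._prune()
        total = sum(t for _, _, t in self.events)
        agent_total = sum(t for _, a, t in self.events if a == agent)
        return total < self.GLOBAL_LIMIT and agent_total < self.AGENT_LIMIT

fake = [0.0]
budget = TokenBudget(clock=lambda: fake[0])
budget.record("analytics", 400_000)            # agent cap reached
analytics_ok = budget.has_budget("analytics")  # skipped: per-agent budget gone
sql_ok = budget.has_budget("sql_agent")        # fine: global cap still has room
fake[0] += 3601                                # window slides past the spend
analytics_later = budget.has_budget("analytics")
```

One agent exhausting its own budget does not starve the others; only the global cap can do that.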
Per-session shared state enables agents to collaborate without direct coupling:
- Table Registry — Agents register imported tables with schema info
- Relationship Graph — Discovered foreign keys with confidence scores
- File Catalog — Shared file metadata accessible to all agents
- Cache — Arbitrary key-value store for intermediate results
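A minimal sketch of such a store, covering the four facets above (all names are illustrative):

```python
class ContextStore:
    """Per-session shared state: table registry, relationship graph,
    file catalog, and a scratch cache for intermediate results."""

    def __init__(self):
        self.tables = {}         # table_name -> schema info
        self.relationships = []  # (column_a, column_b, confidence)
        self.files = {}          # file_id -> shared metadata
        self.cache = {}          # arbitrary key-value scratch space

    def register_table(self, name, schema):
        self.tables[name] = schema

    def add_relationship(self, column_a, column_b, confidence):
        self.relationships.append((column_a, column_b, confidence))

ctx = ContextStore()
ctx.register_table("file_a1b2", {"name": "TEXT", "amount": "TEXT"})
ctx.add_relationship("orders.customer_id", "customers.id", 0.92)
ctx.cache["row_count"] = 1200
```

Because agents read and write this shared store rather than calling each other, a relationship discovered by relationship_mining is visible to sql_agent without any direct coupling.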
| Agent | Purpose | Uses LLM? |
|---|---|---|
| orchestrator | LLM-powered intent parsing, execution planning | Yes |
| file_detection | Magic bytes + extension-based file type detection | No |
| tabular_processor | CSV, XLSX, Parquet, TSV, ODS parsing + import | No |
| document_processor | PDF, DOCX, PPTX, TXT, HTML text extraction | No |
| database_importer | SQLite, DuckDB, SQL dump importing | No |
| media_processor | Image metadata, thumbnails, audio/video info | No |
| archive_processor | ZIP, TAR, GZIP extraction (safe, with limits) | No |
| structured_data | JSON, XML, YAML, TOML, INI parsing + flattening | No |
| specialized_format | GeoJSON, Shapefile, HDF5, NetCDF processing | No |
| schema_inference | Column type inference + data quality scoring | Yes |
| relationship_mining | Foreign key + joinable column discovery | Yes |
| storage_router | Route data to SQLite/DuckDB/filesystem | No |
| analytics | Statistical analysis + anomaly detection | Yes |
| visualization | ECharts config generation from data | Yes |
| sql_agent | Natural language → SQL generation + execution | Yes |
| document_qa | RAG-based Q&A over extracted document text | Yes |
| cross_modal | Cross-format queries spanning multiple files | Yes |
| export_agent | Data export (CSV, JSON, XLSX, Markdown) | No |
| validation_security | File security validation + integrity hashing | No |
| scheduler | Recurring task execution with background asyncio loops | No |
data-ruler/
├── apps/
│ ├── web/ # Next.js frontend + BFF API
│ │ ├── app/ # Pages (auth, dashboard, files, notes, reports, chat, settings)
│ │ ├── app/api/ # 30+ API routes
│ │ ├── components/ # 30+ UI components (shadcn/ui based)
│ │ ├── stores/ # 6 Zustand stores (auth, chat, files, dashboard, notes, reports)
│ │ └── lib/ # DB, auth, utils
│ │
│ └── ai-service/ # Python FastAPI AI backend
│ ├── agents/ # 20 specialized AI agents
│ ├── core/ # Agent base, message bus, circuit breaker, token budget, registry
│ ├── services/ # Cloud LLM client, embeddings, RAG, parsers, storage backends
│ ├── routers/ # 5 API routers (health, chat, files, agents, pipelines)
│ └── models/ # Pydantic schemas (15+ models)
│
├── data/ # Runtime data (gitignored)
├── scripts/ # Utility scripts (screenshot generation)
├── Caddyfile # Caddy reverse proxy (HTTPS, production)
├── docker-compose.yml # Docker Compose (base, local dev + production)
├── docker-compose.prod.yml # Production overlay (adds Caddy for HTTPS)
├── deploy.sh # Deploy with auto-detect dev/prod mode
├── setup-oracle.sh # Oracle Cloud VM initial setup
├── start.sh # Local startup (no Docker)
└── .env.example # Configuration template
See the LICENSE file.












