Ingest, validate, transform, store, and retrieve your data — whether you're an AI agent talking through MCP or a developer writing config. One platform for both.
Deploy on any cloud provider, on-premise, or locally — with no vendor lock-in. Define your entire data pipeline through simple JSON configuration, or extend it with AI instructions, JavaScript functions, and REST endpoints at every stage of the flow. Built entirely on open-source infrastructure, it runs anywhere Docker does.
Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.
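A typical MCP client registers the server as a local command or an SSE endpoint. A minimal sketch of a client-side entry (Claude Desktop-style `mcpServers` config) — the command, arguments, and environment variable names here are illustrative assumptions, not taken from the Datris docs:

```json
{
  "mcpServers": {
    "datris": {
      "command": "docker",
      "args": ["exec", "-i", "datris-mcp", "datris-mcp-server", "--transport", "stdio"],
      "env": { "DATRIS_API_URL": "http://localhost:8080" }
    }
  }
}
```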
Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.
- MCP server (AI agent integration) - Built-in MCP server lets AI agents (Claude, Cursor, OpenClaw, custom frameworks) natively interact with the pipeline — register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases. Supports stdio and SSE transports
- AI-powered data quality - Validate with plain-English rules via `aiRule`. The AI model evaluates every row using reasoning and domain knowledge — no regex required. Supports sampling for large files
- AI transformations - Describe row transformations in natural language — date format conversion, data categorization, phone number standardization, entity extraction — no code needed
- AI schema generation - Upload any CSV, JSON, or XML file and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
- AI data profiling - Upload a file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
- AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
- AI providers - Anthropic Claude (Opus 4.6, Sonnet 4.6, Haiku), OpenAI (GPT-5, GPT-4.1, o3, embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
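To make the AI features concrete, here is a sketch of how an `aiRule` and an AI transformation might appear in a pipeline configuration. The exact key names are assumptions for illustration — consult the Pipeline Configuration reference for the real schema:

```json
{
  "name": "customers",
  "dataQuality": {
    "aiRules": [
      { "column": "email", "rule": "must be a plausible business email address" },
      { "column": "country", "rule": "must be a real country name written in English" }
    ]
  },
  "transformations": {
    "aiTransform": "normalize phone numbers to E.164 format"
  }
}
```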
Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.
- 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
- Chunking strategies - Fixed-size, sentence, paragraph, recursive
- Embedding providers - OpenAI or Ollama (local models)
- Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
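The chunking strategies listed above are standard RAG building blocks. A minimal sketch of fixed-size chunking with overlap, illustrating the idea only — this is not Datris's internal implementation:

```python
def fixed_size_chunks(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into windows of `size` characters, each sharing
    `overlap` characters with the next window."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = fixed_size_chunks("The quick brown fox jumps over the lazy dog",
                           size=20, overlap=5)
# Adjacent chunks overlap by 5 characters, so context at chunk
# boundaries is preserved for embedding and retrieval.
```

Sentence, paragraph, and recursive strategies differ only in where they place the split points; the overlap idea carries over.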
- Configuration-driven - Define pipelines entirely through JSON, or extend the pipeline with AI instructions, JavaScript functions, REST endpoints, and preprocessors at every stage of the data flow
- Multiple ingestion methods - File upload API, MinIO bucket events, database polling, Kafka streaming
- Data quality - AI rules, regex column checks, JavaScript row rules, REST endpoint row rules, JSON/XML schema validation
- Transformations - AI transformations, deduplication, whitespace trimming, JavaScript row functions
- Multiple destinations - Write to MinIO (Parquet/ORC), PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, or pgvector in parallel
- Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
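Since destinations run in parallel, a single pipeline can fan the same validated rows out to several sinks. A sketch of what a multi-destination section could look like — the key names (`type`, `bucket`, `collection`, etc.) are illustrative assumptions, not the documented schema:

```json
{
  "destinations": [
    { "type": "objectStore", "bucket": "curated", "format": "parquet" },
    { "type": "postgres", "table": "customers" },
    { "type": "qdrant", "collection": "customer_docs", "chunking": "paragraph" }
  ]
}
```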
Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via API or MCP.
Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:
| Service | Purpose |
|---|---|
| MinIO | S3-compatible object store for file staging and data output |
| MongoDB | Configuration store, job status tracking, metadata |
| ActiveMQ | File notification queue, pipeline event notifications |
| HashiCorp Vault | Secrets management (database credentials, API keys) |
| Apache Kafka | Optional streaming source and destination |
| Apache Spark | Local Spark for writing Parquet/ORC to MinIO |
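The services in the table map naturally onto Docker Compose services. This fragment is purely illustrative — the repository ships its own `docker-compose.yml`, and image tags and options here are assumptions:

```yaml
# Illustrative fragment only — use the docker-compose.yml from the repo
services:
  minio:
    image: minio/minio
    command: server /data
  mongodb:
    image: mongo:7
  activemq:
    image: apache/activemq-classic
  vault:
    image: hashicorp/vault
```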
```
Source (File Upload / MinIO Event / Database Pull / Kafka)
   |
   v
Preprocessor (optional REST endpoint)
   |
   v
Data Quality (AI rules, header validation, column rules, row rules, schema validation)
   |
   v
Transformation (deduplication, trimming, JavaScript row functions)
   |
   v
Destinations (executed in parallel)
├── Object Store (MinIO - Parquet, ORC, CSV)
├── PostgreSQL (COPY bulk insert)
├── MongoDB (document upsert)
├── Kafka (topic producer)
├── ActiveMQ (queue)
├── REST Endpoint (HTTP POST)
├── Qdrant (vector database - chunking, embeddings, RAG)
├── Weaviate (vector database - chunking, embeddings, RAG)
├── Milvus (vector database - chunking, embeddings, RAG)
├── Chroma (vector database - chunking, embeddings, RAG)
└── pgvector (PostgreSQL vector database - chunking, embeddings, RAG)
   |
   v
Notifications (published to ActiveMQ topic)
```
```
Client (AI Agent via MCP / Developer via REST API)
   |
   v
Interface
├── MCP Server (stdio or SSE)
└── REST API (POST /api/v1/query/* and /api/v1/search/*)
   |
   v
Query Source
├── PostgreSQL (read-only SQL SELECT queries)
├── MongoDB (document queries with filters and projections)
├── Qdrant (semantic search)
├── Weaviate (semantic search)
├── Milvus (semantic search)
├── Chroma (semantic search)
├── pgvector (semantic search via PostgreSQL)
├── Pipeline Configurations (list/get)
├── Job Status (by pipeline token or pipeline name)
├── AI Schema Generation (from uploaded files)
└── AI Data Profiling (statistics, quality issues, suggested rules)
```
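The REST side of the retrieval path can be exercised with a plain HTTP POST, e.g. against `/api/v1/search/qdrant`. A sketch of a semantic-search request body — the field names here are assumptions; see the Search API reference for the actual contract:

```json
{
  "collection": "customer_docs",
  "query": "refund policy for enterprise customers",
  "topK": 5
}
```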
| Format | Input | Output |
|---|---|---|
| CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ |
| JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST |
| XML | Single document or one per line | Database, Kafka, REST |
| Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV |
| Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector |
| Archives | .zip, .tar, .gz, .jar | Extracted and processed individually |
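The JSON row above distinguishes a single object from NDJSON (one object per line). The difference matters for streaming large files, as this small sketch shows:

```python
import json

# NDJSON: each line is an independent JSON document, so rows can be
# parsed and processed one at a time without loading the whole file
ndjson = '{"id": 1, "name": "Ada"}\n{"id": 2, "name": "Grace"}\n'
rows = [json.loads(line) for line in ndjson.splitlines() if line.strip()]

# Single-object JSON: the entire payload is one document
single = json.loads('{"id": 1, "name": "Ada"}')
```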
```shell
git clone https://github.com/datris/datris-platform-oss.git
cd datris-platform-oss
cp .env.example .env   # Add your ANTHROPIC_API_KEY and/or OPENAI_API_KEY
docker compose up -d
```

The platform is now running at http://localhost:4200 (UI) and http://localhost:8080 (API).

See Installation for details on API keys, vector databases, and building from source.
- Installation - Get running with Docker Compose
- Quick Start - End-to-end walkthrough
- Pipeline Configuration - Full JSON configuration reference
- Schemas - Schema definition and auto-generation
- Ingestion
- File Upload - API file upload
- Object Store - MinIO bucket notifications
- Database Pull - PostgreSQL, MySQL, MSSQL scheduled pulls
- Kafka - Kafka topic consumption
- Data Types - Supported data types
- Preprocessor - External preprocessing via REST endpoints
- Data Quality
- Header Validation - CSV header matching
- Column Rules & AI Rules - Regex column validation and AI-powered natural language rules
- Row Rules - JavaScript and REST endpoint rules
- Schema Validation - JSON/XML schema validation
- Transformations
- Deduplication - Row deduplication
- Column Trimming - Whitespace trimming
- Dropping Columns - Column filtering via destination schema
- AI Transformation - Natural language row transformations
- Row Functions - JavaScript row transformations
- Destinations
- Object Store - Spark writes to MinIO (Parquet, ORC)
- PostgreSQL - COPY bulk insert
- MongoDB - Document upserts
- Kafka - Topic producer
- ActiveMQ - Queue destination
- REST Endpoint - HTTP POST destination
- Qdrant - Vector database for RAG — ingest PDFs, text docs with chunking and embeddings
- Weaviate - Vector database for RAG — ingest PDFs, Word docs, text with chunking and embeddings
- Milvus - Vector database for RAG — scalable similarity search
- Chroma - Vector database for RAG — lightweight, single container
- pgvector - PostgreSQL vector database for RAG — no separate server required
- Notifications - Pipeline event notifications and subscriptions
- Monitoring - Job status and pipeline tokens
- API Reference
- Pipeline API - CRUD for pipeline configurations
- Ingestion API - File upload and generation
- AI Schema Generation - Generate pipeline configs from files using AI
- Status API - Job status and monitoring
- Query API - Query PostgreSQL and MongoDB
- Search API - Semantic search across vector databases
- AI Data Profiling - Profile data files and get recommended rules
- AI Error Explanation - Automatic plain-English error analysis
- AI Configuration - Configure AI providers (Anthropic, OpenAI, Ollama)
- Version API - Version endpoint
- OpenAPI Spec - OpenAPI 3.0 spec for Postman, code generation, and non-MCP integrations
- Configuration Reference - Full application.yaml reference
- MCP Server - AI agent integration via Model Context Protocol
- Helper Applications - Vector store chat, Kafka loader, preprocessor, and more