
Datris — The First AI Agent-Native Data Platform

datris.ai

Ingest, validate, transform, store, and retrieve your data — whether you're an AI agent talking through MCP or a developer writing config. One platform for both.

Deploy on any cloud provider, on-premise, or locally — with no vendor lock-in. Define your entire data pipeline through simple JSON configuration, or extend it with AI instructions, JavaScript functions, and REST endpoints at every stage of the flow. Built entirely on open-source infrastructure, it runs anywhere Docker does.

Agent-Ready: Built-In MCP Server

Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.
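As an illustration, an MCP-capable client such as Claude Desktop is typically pointed at a stdio MCP server with a config entry like the sketch below. The `command` and `args` here are assumptions for illustration, not the documented Datris launch command — see the MCP setup docs for the real invocation:

```json
{
  "mcpServers": {
    "datris": {
      "command": "docker",
      "args": ["exec", "-i", "datris", "datris-mcp-server", "--transport", "stdio"]
    }
  }
}
```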

AI-Powered Features

Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.

  • MCP server (AI agent integration) - Built-in MCP server gives agents native pipeline access: register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases — over stdio or SSE transports
  • AI-powered data quality - Validate with plain English rules via aiRule. The AI model evaluates every row using reasoning and domain knowledge — no regex required. Supports sampling for large files
  • AI transformations - Describe row transformations in natural language — date format conversion, data categorization, phone number standardization, entity extraction — no code needed
  • AI schema generation - Upload any CSV, JSON, or XML file and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
  • AI data profiling - Upload a file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
  • AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
  • AI providers - Anthropic Claude (Opus 4.6, Sonnet 4.6, Haiku), OpenAI (GPT-5, GPT-4.1, o3, embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
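To make the aiRule idea concrete, a plain-English validation rule in the pipeline JSON might look like the sketch below. The field names (`dataQuality`, `aiRules`, `sampling`) are assumptions based on the feature names above, not the documented schema:

```json
{
  "dataQuality": {
    "aiRules": [
      {
        "name": "valid-shipping-address",
        "instruction": "The address column must contain a plausible street address, not a PO box.",
        "sampling": { "enabled": true, "rows": 1000 }
      }
    ]
  }
}
```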

RAG Pipeline

Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.

  • 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
  • Chunking strategies - Fixed-size, sentence, paragraph, recursive
  • Embedding providers - OpenAI or Ollama (local models)
  • Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
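Combining these options, a vector-store destination in the pipeline config might be sketched as follows. The key names (`chunking`, `embedding`, `collection`) are illustrative assumptions; only the strategy and provider choices come from the lists above:

```json
{
  "destinations": [
    {
      "type": "qdrant",
      "collection": "support-docs",
      "chunking": { "strategy": "recursive", "size": 512, "overlap": 64 },
      "embedding": { "provider": "openai", "model": "text-embedding-3-small" }
    }
  ]
}
```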

Key Features

  • Configuration-driven - Define pipelines entirely through JSON, or extend the pipeline with AI instructions, JavaScript functions, REST endpoints, and preprocessors at every stage of the data flow
  • Multiple ingestion methods - File upload API, MinIO bucket events, database polling, Kafka streaming
  • Data quality - AI rules, regex column checks, JavaScript row rules, REST endpoint row rules, JSON/XML schema validation
  • Transformations - AI transformations, deduplication, whitespace trimming, JavaScript row functions
  • Multiple destinations - Write to MinIO (Parquet/ORC), PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, or pgvector in parallel
  • Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
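Since transformations can be plain JavaScript row functions, a row-level transform might look like this sketch. The signature — a function that takes a row object and returns the transformed row — is an assumption, not the documented contract:

```javascript
// Hypothetical row function: trims all string fields and normalizes a
// date column from MM/DD/YYYY to ISO 8601 (YYYY-MM-DD).
function transformRow(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    out[key] = typeof value === "string" ? value.trim() : value;
  }
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(out.order_date ?? "");
  if (m) {
    out.order_date = `${m[3]}-${m[1]}-${m[2]}`;
  }
  return out;
}
```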

Architecture

Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via API or MCP.

Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:

| Service | Purpose |
| --- | --- |
| MinIO | S3-compatible object store for file staging and data output |
| MongoDB | Configuration store, job status tracking, metadata |
| ActiveMQ | File notification queue, pipeline event notifications |
| HashiCorp Vault | Secrets management (database credentials, API keys) |
| Apache Kafka | Optional streaming source and destination |
| Apache Spark | Local Spark for writing Parquet/ORC to MinIO |
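Mapped onto Docker Compose, the stack above looks roughly like this excerpt. Image names and the lack of ports/volumes are illustrative simplifications, not the project's actual compose file:

```yaml
services:
  minio:
    image: minio/minio
    command: server /data
  mongodb:
    image: mongo
  activemq:
    image: apache/activemq-classic
  vault:
    image: hashicorp/vault
  kafka:          # optional streaming source/destination
    image: apache/kafka
```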

Processing Flow

```
Source (File Upload / MinIO Event / Database Pull / Kafka)
  |
  v
Preprocessor (optional REST endpoint)
  |
  v
Data Quality (AI rules, header validation, column rules, row rules, schema validation)
  |
  v
Transformation (deduplication, trimming, JavaScript row functions)
  |
  v
Destinations (executed in parallel)
  ├── Object Store (MinIO - Parquet, ORC, CSV)
  ├── PostgreSQL (COPY bulk insert)
  ├── MongoDB (document upsert)
  ├── Kafka (topic producer)
  ├── ActiveMQ (queue)
  ├── REST Endpoint (HTTP POST)
  ├── Qdrant (vector database - chunking, embeddings, RAG)
  ├── Weaviate (vector database - chunking, embeddings, RAG)
  ├── Milvus (vector database - chunking, embeddings, RAG)
  ├── Chroma (vector database - chunking, embeddings, RAG)
  └── pgvector (PostgreSQL vector database - chunking, embeddings, RAG)
  |
  v
Notifications (published to ActiveMQ topic)
```

Retrieval Flow

```
Client (AI Agent via MCP / Developer via REST API)
  |
  v
Interface
  ├── MCP Server (stdio or SSE)
  └── REST API (POST /api/v1/query/* and /api/v1/search/*)
  |
  v
Query Source
  ├── PostgreSQL (read-only SQL SELECT queries)
  ├── MongoDB (document queries with filters and projections)
  ├── Qdrant (semantic search)
  ├── Weaviate (semantic search)
  ├── Milvus (semantic search)
  ├── Chroma (semantic search)
  ├── pgvector (semantic search via PostgreSQL)
  ├── Pipeline Configurations (list/get)
  ├── Job Status (by pipeline token or pipeline name)
  ├── AI Schema Generation (from uploaded files)
  └── AI Data Profiling (statistics, quality issues, suggested rules)
```
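For example, a client-side sketch of a semantic-search call against the REST interface. Only the `/api/v1/search/*` path prefix comes from the flow above; the body fields (`query`, `topK`) and the per-store path segment are assumptions. The helper only builds the request — sending it (e.g. with fetch) requires a running Datris instance:

```javascript
// Hypothetical helper that constructs a semantic-search request for the
// /api/v1/search/* interface. Body shape and store path are assumed.
function buildSearchRequest(baseUrl, store, query, topK = 5) {
  return {
    url: `${baseUrl}/api/v1/search/${store}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, topK }),
  };
}

const req = buildSearchRequest("http://localhost:8080", "qdrant", "refund policy");
```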

Supported Data Formats

| Format | Input | Output |
| --- | --- | --- |
| CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ |
| JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST |
| XML | Single document or one per line | Database, Kafka, REST |
| Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV |
| Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector |
| Archives | .zip, .tar, .gz, .jar | Extracted and processed individually |
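For the JSON input shapes, NDJSON means one complete JSON object per line rather than a single array:

```
{"id": 1, "name": "Ada"}
{"id": 2, "name": "Grace"}
```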

Getting Started

```shell
git clone https://github.com/datris/datris-platform-oss.git
cd datris-platform-oss
cp .env.example .env       # Add your ANTHROPIC_API_KEY and/or OPENAI_API_KEY
docker compose up -d
```

The platform is now running at http://localhost:4200 (UI) and http://localhost:8080 (API).

See Installation for details on API keys, vector databases, and building from source.
