Lightweight, serverless data catalog for DataFusion and modern lakehouse pipelines
MetaFuse captures dataset schemas, lineage, and operational metadata automatically from your data pipelines without requiring Kafka, MySQL, or Elasticsearch. Just a SQLite file on object storage.
Status: v0.3.0 Cloud-Ready — Production-hardened with GCS and S3 backends. Supports local SQLite, Google Cloud Storage, and AWS S3 with optimistic concurrency and caching.
- Native DataFusion integration - Emit metadata directly from pipelines
- Serverless-friendly - $0-$5/month on object storage (GCS/S3 supported)
- Zero infrastructure - No databases, no clusters to maintain
- Multi-cloud - Works on local filesystem, Google Cloud Storage, and AWS S3
- Automatic lineage capture - Track data flow through transformations
- Full-text search with FTS5 - Fast search with automatic trigger maintenance (see the sketch after this list)
- Optimistic concurrency - Safe concurrent writes with version-based locking
- Tags and glossary - Organize with business context
- Operational metadata - Track row counts, sizes, partition keys
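To make the FTS5 bullet concrete, here is a minimal, self-contained sketch of full-text search kept in sync by a trigger, using `rusqlite`. The `datasets` / `datasets_fts` table and column names are hypothetical and do not reflect MetaFuse's actual schema.

```rust
// Illustrative only: an FTS5 index maintained by a trigger, queried with MATCH.
// `datasets` / `datasets_fts` are hypothetical names, not MetaFuse's schema.
use rusqlite::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE datasets(name TEXT PRIMARY KEY, description TEXT);
         CREATE VIRTUAL TABLE datasets_fts USING fts5(name, description);
         -- Trigger keeps the FTS index in sync on insert (update/delete triggers omitted).
         CREATE TRIGGER datasets_ai AFTER INSERT ON datasets BEGIN
             INSERT INTO datasets_fts(name, description) VALUES (new.name, new.description);
         END;
         INSERT INTO datasets VALUES ('active_records', 'Filtered analytics records');",
    )?;

    // MATCH runs a full-text query against the FTS5 index.
    let hit: String = conn.query_row(
        "SELECT name FROM datasets_fts WHERE datasets_fts MATCH ?1",
        ["analytics"],
        |row| row.get(0),
    )?;
    println!("found: {hit}");
    Ok(())
}
```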
MetaFuse fills the gap between expensive, complex enterprise catalogs (DataHub, Collibra) and the needs of small-to-medium data teams.
curl -fsSL https://raw.githubusercontent.com/ethan-tyler/MetaFuse/main/install.sh | bash

This downloads pre-built binaries for your platform (Linux/macOS, x86_64/ARM64).
git clone https://github.com/ethan-tyler/MetaFuse.git
cd MetaFuse
# Local-only (default)
cargo build --release
# With GCS support
cargo build --release --features gcs
# With S3 support
cargo build --release --features s3
# With all cloud backends
# With all cloud backends
cargo build --release --features cloud

docker pull ghcr.io/ethan-tyler/metafuse-api:latest
docker run -p 8080:8080 -v $(pwd)/data:/data ghcr.io/ethan-tyler/metafuse-api

MetaFuse supports multiple storage backends:
| Backend | URI Format | Authentication | Features Required |
|---|---|---|---|
| Local SQLite | `file://catalog.db` or `catalog.db` | None | `local` (default) |
| Google Cloud Storage | `gs://bucket/path/catalog.db` | ADC, `GOOGLE_APPLICATION_CREDENTIALS` | `gcs` |
| Amazon S3 | `s3://bucket/key?region=us-east-1` | AWS credential chain | `s3` |
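As a rough sketch of how the URI scheme in the table above selects a backend, here is an illustrative dispatch. It is not MetaFuse's internal API; only `LocalSqliteBackend` appears in the code samples later in this README, and the cloud backend names are implied by the feature flags.

```rust
// Illustrative scheme dispatch for METAFUSE_CATALOG_URI values (not MetaFuse's API).
fn describe_backend(uri: &str) -> &'static str {
    match uri.split_once("://").map(|(scheme, _rest)| scheme) {
        Some("gs") => "Google Cloud Storage backend (build with --features gcs)",
        Some("s3") => "Amazon S3 backend (build with --features s3)",
        // `file://catalog.db` and a bare path both mean local SQLite.
        Some("file") | None => "Local SQLite backend (default `local` feature)",
        Some(_) => "Unsupported URI scheme",
    }
}

fn main() {
    for uri in [
        "catalog.db",
        "file://catalog.db",
        "gs://my-bucket/catalogs/prod.db",
        "s3://my-bucket/catalogs/prod.db?region=us-west-2",
    ] {
        println!("{uri} -> {}", describe_backend(uri));
    }
}
```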
Storage:
- `METAFUSE_CATALOG_URI`: Override the default catalog location
- `METAFUSE_CACHE_TTL_SECS`: Cache TTL for cloud backends (default: 60, 0 to disable)
- `GOOGLE_APPLICATION_CREDENTIALS`: Path to GCS service account JSON (GCS only)
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: AWS credentials (S3 only)
API Server:
- `METAFUSE_PORT`: API server port (default: 8080)
- `METAFUSE_RUN_MIGRATIONS`: Set to `true` to auto-run migrations on startup
Enterprise Features (v0.6.0+):
- `METAFUSE_AUDIT_BATCH_SIZE`: Max events per audit DB write batch (default: 100)
- `METAFUSE_AUDIT_FLUSH_INTERVAL_MS`: Audit flush interval in ms (default: 5000)
- `METAFUSE_USAGE_FLUSH_INTERVAL_MS`: Usage counter flush interval in ms (default: 60000)
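A minimal sketch of how a process might read a few of these variables, using the documented defaults. MetaFuse's actual configuration loading may differ, and the fallback catalog path below is only a placeholder.

```rust
use std::env;

fn main() {
    // Catalog location; the fallback here is only a placeholder for the sketch.
    let catalog_uri =
        env::var("METAFUSE_CATALOG_URI").unwrap_or_else(|_| "file://catalog.db".to_string());

    // Cache TTL for cloud backends: documented default 60 seconds, 0 disables caching.
    let cache_ttl_secs: u64 = env::var("METAFUSE_CACHE_TTL_SECS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(60);

    // API server port: documented default 8080.
    let port: u16 = env::var("METAFUSE_PORT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8080);

    println!("catalog={catalog_uri} cache_ttl={cache_ttl_secs}s port={port}");
}
```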
# Local
export METAFUSE_CATALOG_URI="file://my_catalog.db"
metafuse init
# Google Cloud Storage
export METAFUSE_CATALOG_URI="gs://my-bucket/catalogs/prod.db"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
metafuse init
# Amazon S3
export METAFUSE_CATALOG_URI="s3://my-bucket/catalogs/prod.db?region=us-west-2"
metafuse init

# Or run the locally built binary directly
./target/release/metafuse init

Emit metadata from a DataFusion pipeline:

use datafusion::prelude::*;
use metafuse_catalog_core::OperationalMeta;
use metafuse_catalog_emitter::Emitter;
use metafuse_catalog_storage::LocalSqliteBackend;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx.read_parquet("input.parquet", Default::default()).await?;
    let result = df.filter(col("status").eq(lit("active")))?.select_columns(&["id", "name"])?;

    // Get schema and row count before writing
    let schema = result.schema().inner().clone();
    let batches = result.collect().await?;
    let row_count: usize = batches.iter().map(|b| b.num_rows()).sum();

    // Write output
    ctx.register_batches("active_records", batches.clone())?;
    ctx.sql("SELECT * FROM active_records")
        .await?
        .write_parquet("output.parquet", Default::default())
        .await?;

    // Emit metadata to catalog
    let backend = LocalSqliteBackend::new("metafuse_catalog.db");
    let emitter = Emitter::new(backend);
    emitter.emit_dataset(
        "active_records",
        "output.parquet",
        "parquet",
        Some("Filtered active records"), // Description
        Some("prod"),                    // Tenant
        Some("analytics"),               // Domain
        Some("team@example.com"),        // Owner
        schema,                          // Arrow schema
        Some(OperationalMeta {
            row_count: Some(row_count as i64),
            size_bytes: None,
            partition_keys: vec![],
        }),
        vec!["raw_records".to_string()], // Upstream lineage
        vec!["active".to_string()],      // Tags
    )?;

    Ok(())
}

# List datasets
metafuse list
# Show details with lineage
metafuse show active_records --lineage
# Search
metafuse search "analytics"
# Statistics
metafuse stats

MetaFuse uses a versioned migration system to evolve the database schema. Migrations are forward-only and idempotent.
# Check migration status
metafuse migrate status
# Run pending migrations
metafuse migrate run
# View migration history
metafuse migrate history

Programmatic Usage:
use metafuse_catalog_core::init_catalog;
use rusqlite::Connection;
let conn = Connection::open("catalog.db")?;
// Initialize schema and run all migrations
let migrations_applied = init_catalog(&conn, true)?;
println!("Applied {} migrations", migrations_applied);Migration Notes:
- Migrations are tracked in the `schema_migrations` table
- Advisory locking prevents concurrent migrations
- Version format: `MAJOR*1_000_000 + MINOR*1_000 + PATCH` (e.g., v1.0.0 = 1_000_000); see the worked example after this list
- Rollbacks are not supported - plan migrations carefully
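A worked example of the version encoding described above, simply evaluating the documented formula:

```rust
// MAJOR*1_000_000 + MINOR*1_000 + PATCH, as documented in the migration notes.
fn version_number(major: i64, minor: i64, patch: i64) -> i64 {
    major * 1_000_000 + minor * 1_000 + patch
}

fn main() {
    assert_eq!(version_number(1, 0, 0), 1_000_000); // v1.0.0
    assert_eq!(version_number(0, 3, 0), 3_000);     // v0.3.0
    println!("v1.2.3 -> {}", version_number(1, 2, 3)); // prints 1002003
}
```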
cargo run --bin metafuse-api
# Query endpoints
curl http://localhost:8080/api/v1/datasets
curl http://localhost:8080/api/v1/datasets/active_records
curl "http://localhost:8080/api/v1/search?q=analytics"# Simple pipeline - basic metadata emission
# Simple pipeline - basic metadata emission
cargo run --example simple_pipeline
# Lineage tracking - 3-stage ETL with dependencies
cargo run --example lineage_tracking

See examples/README.md for a detailed walkthrough and expected output.
MetaFuse includes comprehensive integration tests for GCS and S3 backends using Docker-based emulators. These tests validate cloud-specific behavior without requiring cloud credentials.
# Run GCS emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests
# Run S3 emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

What's tested: Versioning (generations/ETags), concurrent writes, retry logic, cache behavior, metadata preservation.
For comprehensive documentation, see Cloud Emulator Testing Guide.
# Run all local tests (no cloud dependencies)
cargo test
# Run cache tests with GCS/S3 features
cargo test -p metafuse-catalog-storage --features gcs,s3

MetaFuse includes comprehensive integration tests for GCS and S3 backends using Docker emulators. These tests validate cloud backend behavior without requiring actual cloud credentials.
Requirements:
- Docker installed and running
- `testcontainers-rs` dependency (included in dev-dependencies)
Note: Tests are skipped by default. Set RUN_CLOUD_TESTS=1 to run. In CI, tests run on Linux only (Docker service containers not supported on macOS/Windows runners).
Running GCS Emulator Tests:
# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests

Uses the fake-gcs-server Docker image. Tests cover:
- Catalog initialization and existence checks
- Upload/download roundtrips with generation-based versioning
- Concurrent write detection and conflict handling
- Retry logic with exponential backoff
- Cache behavior (enabled/disabled)
- Metadata preservation across uploads
Running S3 Emulator Tests:
# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

Uses the MinIO Docker image. Tests cover:
- Catalog initialization and existence checks
- Upload/download roundtrips with ETag-based versioning
- Concurrent write detection and conflict handling
- Retry logic with exponential backoff (see the sketch after this list)
- Cache behavior (enabled/disabled)
- Metadata preservation and region configuration
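As a conceptual illustration of what these tests exercise (conflict detection on versioned writes plus retry with exponential backoff), here is a generic sketch of the pattern. It is not MetaFuse's backend code; the `Conflict` type and the closure-based API are assumptions made for the example.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical error standing in for a version/ETag mismatch on upload.
#[derive(Debug)]
struct Conflict;

// Generic retry-with-exponential-backoff around an optimistic write attempt.
// `attempt` is expected to re-read the latest version and retry the write each call.
fn write_with_retry<F>(mut attempt: F, max_retries: u32) -> Result<(), Conflict>
where
    F: FnMut() -> Result<(), Conflict>,
{
    let mut backoff = Duration::from_millis(100);
    for _ in 0..max_retries {
        if attempt().is_ok() {
            return Ok(());
        }
        sleep(backoff);
        backoff *= 2; // exponential backoff between conflicting writes
    }
    attempt() // final attempt; give up if it still conflicts
}

fn main() {
    // Simulate two conflicting writes before a successful one.
    let mut failures_left = 2;
    let result = write_with_retry(
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err(Conflict)
            } else {
                Ok(())
            }
        },
        5,
    );
    println!("{result:?}"); // Ok(()) after two retries
}
```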
Note: Emulator tests automatically:
- Start Docker containers via `testcontainers-rs`
- Wait for container readiness (30-second timeout)
- Clean up containers after test completion
- Run in isolated environments (no credential leakage)
- Getting Started - 10-minute tutorial
- Architecture - How MetaFuse works
- API Reference - REST API endpoints
- Deployment (GCP) - Cloud Run/GKE setup, auth, cost notes
- Deployment (AWS) - ECS/EKS/Lambda setup, auth, cost notes
- Authentication / auth-aws.md - Credential patterns and best practices
- Migration: Local → Cloud - Step-by-step move to object storage
- Troubleshooting - Common issues and fixes
- Roadmap - What's coming next
MetaFuse/
|-- crates/
| |-- catalog-core/ # Core types and SQLite schema
| |-- catalog-storage/ # Local + cloud catalog backends
| |-- catalog-emitter/ # DataFusion integration
| |-- catalog-api/ # REST API (Axum)
| `-- catalog-cli/ # CLI for catalog operations
|-- docs/ # Documentation
|-- examples/ # Usage examples
`-- tests/ # Integration tests
- Data Discovery: Find datasets with full-text search
- Data Lineage: Track dependencies and impact analysis
- Data Governance: Know what exists, where it lives, and who owns it
- Team Collaboration: Share knowledge with tags and glossary terms
| Feature | MetaFuse | DataHub | AWS Glue |
|---|---|---|---|
| Cost | $0-$5/mo | High (self-hosted) | Pay-per-use |
| Setup | < 30 min | Days | Hours |
| Infrastructure | None | Kafka, MySQL, ES | AWS only |
| DataFusion Integration | Native | Via connector | No |
| Local Development | Yes | No | No |
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Community Guide: docs/COMMUNITY.md - How to participate and get help
- Issues: Report bugs or request features
- Discussions: Ask questions
- Contributing: CONTRIBUTING.md - Development guidelines
Apache License 2.0 - See LICENSE for details.
Built with Apache Arrow, DataFusion, SQLite, and Rust.
Built for data teams who want catalogs, not complexity.