MetaFuse

Lightweight, serverless data catalog for DataFusion and modern lakehouse pipelines

MetaFuse captures dataset schemas, lineage, and operational metadata automatically from your data pipelines without requiring Kafka, MySQL, or Elasticsearch. Just a SQLite file on object storage.

Status: v0.3.0 Cloud-Ready. Production-hardened cloud backends: supports local SQLite, Google Cloud Storage, and AWS S3, with optimistic concurrency and caching.

Why MetaFuse?

  • Native DataFusion integration - Emit metadata directly from pipelines
  • Serverless-friendly - $0-$5/month on object storage (GCS/S3 supported)
  • Zero infrastructure - No databases, no clusters to maintain
  • Multi-cloud - Works on local filesystem, Google Cloud Storage, and AWS S3
  • Automatic lineage capture - Track data flow through transformations
  • Full-text search with FTS5 - Fast search with automatic trigger maintenance
  • Optimistic concurrency - Safe concurrent writes with version-based locking
  • Tags and glossary - Organize with business context
  • Operational metadata - Track row counts, sizes, partition keys

MetaFuse fills the gap between expensive, complex enterprise catalogs (DataHub, Collibra) and the needs of small-to-medium data teams.

Quick Start

Install

Option 1: Install Script (Recommended)

curl -fsSL https://raw.githubusercontent.com/ethan-tyler/MetaFuse/main/install.sh | bash

This downloads pre-built binaries for your platform (Linux/macOS, x86_64/ARM64).

Option 2: Build from Source

git clone https://github.com/ethan-tyler/MetaFuse.git
cd MetaFuse

# Local-only (default)
cargo build --release

# With GCS support
cargo build --release --features gcs

# With S3 support
cargo build --release --features s3

# With all cloud backends
cargo build --release --features cloud

Option 3: Docker

docker pull ghcr.io/ethan-tyler/metafuse-api:latest
docker run -p 8080:8080 -v $(pwd)/data:/data ghcr.io/ethan-tyler/metafuse-api

Backend Options

MetaFuse supports multiple storage backends:

| Backend | URI Format | Authentication | Features Required |
|---------|------------|----------------|-------------------|
| Local SQLite | file://catalog.db or catalog.db | None | local (default) |
| Google Cloud Storage | gs://bucket/path/catalog.db | ADC, GOOGLE_APPLICATION_CREDENTIALS | gcs |
| Amazon S3 | s3://bucket/key?region=us-east-1 | AWS credential chain | s3 |
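
The required cargo feature follows directly from the catalog URI scheme. As a minimal sketch of that mapping, assuming only the URI formats in the table above (the helper name backend_feature is illustrative, not part of the MetaFuse API):

fn backend_feature(uri: &str) -> &'static str {
    // gs:// URIs need the gcs feature, s3:// URIs need the s3 feature;
    // file:// URIs and bare paths use the default local backend.
    if uri.starts_with("gs://") {
        "gcs"
    } else if uri.starts_with("s3://") {
        "s3"
    } else {
        "local"
    }
}

fn main() {
    assert_eq!(backend_feature("gs://my-bucket/catalogs/prod.db"), "gcs");
    assert_eq!(backend_feature("catalog.db"), "local");
}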

Environment Variables

Storage:

  • METAFUSE_CATALOG_URI: Override default catalog location
  • METAFUSE_CACHE_TTL_SECS: Cache TTL for cloud backends (default: 60, 0 to disable)
  • GOOGLE_APPLICATION_CREDENTIALS: Path to GCS service account JSON (GCS only)
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY: AWS credentials (S3 only)

API Server:

  • METAFUSE_PORT: API server port (default: 8080)
  • METAFUSE_RUN_MIGRATIONS: Set to true to auto-run migrations on startup

Enterprise Features (v0.6.0+):

  • METAFUSE_AUDIT_BATCH_SIZE: Max events per audit DB write batch (default: 100)
  • METAFUSE_AUDIT_FLUSH_INTERVAL_MS: Audit flush interval in ms (default: 5000)
  • METAFUSE_USAGE_FLUSH_INTERVAL_MS: Usage counter flush interval in ms (default: 60000)

Examples

# Local
export METAFUSE_CATALOG_URI="file://my_catalog.db"
metafuse init

# Google Cloud Storage
export METAFUSE_CATALOG_URI="gs://my-bucket/catalogs/prod.db"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
metafuse init

# Amazon S3
export METAFUSE_CATALOG_URI="s3://my-bucket/catalogs/prod.db?region=us-west-2"
metafuse init
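
At runtime, a pipeline can resolve the same settings itself. A minimal sketch using only the standard library, reading the METAFUSE_CATALOG_URI and METAFUSE_CACHE_TTL_SECS variables described above (the fallback values here are this sketch's assumptions, not guaranteed library defaults):

use std::env;

fn main() {
    // Catalog location, falling back to a local SQLite file.
    let catalog_uri = env::var("METAFUSE_CATALOG_URI")
        .unwrap_or_else(|_| "file://metafuse_catalog.db".to_string());

    // Cache TTL for cloud backends; the documented default is 60 seconds, 0 disables caching.
    let cache_ttl_secs: u64 = env::var("METAFUSE_CACHE_TTL_SECS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(60);

    println!("catalog: {catalog_uri}, cache TTL: {cache_ttl_secs}s");
}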

Initialize Catalog

./target/release/metafuse init

Emit Metadata from DataFusion

use datafusion::prelude::*;
use metafuse_catalog_core::OperationalMeta;
use metafuse_catalog_emitter::Emitter;
use metafuse_catalog_storage::LocalSqliteBackend;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx.read_parquet("input.parquet", Default::default()).await?;
    let result = df.filter(col("status").eq(lit("active")))?.select_columns(&["id", "name"])?;

    // Get schema and row count before writing
    let schema = result.schema().inner().clone();
    let batches = result.collect().await?;
    let row_count: usize = batches.iter().map(|b| b.num_rows()).sum();

    // Write output
    ctx.register_batches("active_records", batches.clone())?;
    ctx.sql("SELECT * FROM active_records")
        .await?
        .write_parquet("output.parquet", Default::default())
        .await?;

    // Emit metadata to catalog
    let backend = LocalSqliteBackend::new("metafuse_catalog.db");
    let emitter = Emitter::new(backend);
    emitter.emit_dataset(
        "active_records",
        "output.parquet",
        "parquet",
        Some("Filtered active records"),  // Description
        Some("prod"),                       // Tenant
        Some("analytics"),                  // Domain
        Some("team@example.com"),          // Owner
        schema,                            // Arrow schema
        Some(OperationalMeta {
            row_count: Some(row_count as i64),
            size_bytes: None,
            partition_keys: vec![],
        }),
        vec!["raw_records".to_string()],   // Upstream lineage
        vec!["active".to_string()],        // Tags
    )?;

    Ok(())
}
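
In the example above, size_bytes is left as None. When the output lands on the local filesystem, it can be filled from file metadata before calling emit_dataset; a small, hedged refinement using only std::fs (adapt for object-store outputs):

// Populate size_bytes from the written Parquet file instead of passing None.
let size_bytes = std::fs::metadata("output.parquet")
    .ok()
    .map(|m| m.len() as i64);

// Pass this in place of the OperationalMeta constructed inline above.
let meta = OperationalMeta {
    row_count: Some(row_count as i64),
    size_bytes,
    partition_keys: vec![],
};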

Query the Catalog

# List datasets
metafuse list

# Show details with lineage
metafuse show active_records --lineage

# Search
metafuse search "analytics"

# Statistics
metafuse stats

Schema Migrations

MetaFuse uses a versioned migration system to evolve the database schema. Migrations are forward-only and idempotent.

# Check migration status
metafuse migrate status

# Run pending migrations
metafuse migrate run

# View migration history
metafuse migrate history

Programmatic Usage:

use metafuse_catalog_core::init_catalog;
use rusqlite::Connection;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let conn = Connection::open("catalog.db")?;

    // Initialize schema and run all migrations
    let migrations_applied = init_catalog(&conn, true)?;
    println!("Applied {} migrations", migrations_applied);

    Ok(())
}

Migration Notes:

  • Migrations are tracked in schema_migrations table
  • Advisory locking prevents concurrent migrations
  • Version format: MAJOR*1_000_000 + MINOR*1_000 + PATCH (e.g., v1.0.0 = 1_000_000)
  • Rollbacks are not supported - plan migrations carefully
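
To make the version arithmetic concrete, here is a minimal sketch of the encoding described above; the helper name encode_version is illustrative and not part of the MetaFuse API:

fn encode_version(major: i64, minor: i64, patch: i64) -> i64 {
    // MAJOR*1_000_000 + MINOR*1_000 + PATCH, as documented in the notes above.
    major * 1_000_000 + minor * 1_000 + patch
}

fn main() {
    assert_eq!(encode_version(1, 0, 0), 1_000_000); // v1.0.0
    assert_eq!(encode_version(0, 3, 0), 3_000);     // v0.3.0
}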

Run the REST API

cargo run --bin metafuse-api

# Query endpoints
curl http://localhost:8080/api/v1/datasets
curl http://localhost:8080/api/v1/datasets/active_records
curl "http://localhost:8080/api/v1/search?q=analytics"
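
The same endpoints can be called from Rust as well. A minimal sketch assuming the reqwest crate with its blocking feature enabled (an illustration only, not a MetaFuse dependency):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // List all datasets registered in the catalog via the REST API.
    let body = reqwest::blocking::get("http://localhost:8080/api/v1/datasets")?.text()?;
    println!("{body}");
    Ok(())
}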

Try the Examples

# Simple pipeline - basic metadata emission
cargo run --example simple_pipeline

# Lineage tracking - 3-stage ETL with dependencies
cargo run --example lineage_tracking

See examples/README.md for detailed walkthrough and expected output.

Cloud Emulator Tests (Optional)

MetaFuse includes comprehensive integration tests for GCS and S3 backends using Docker-based emulators. These tests validate cloud-specific behavior without requiring cloud credentials.

# Run GCS emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests

# Run S3 emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

What's tested: Versioning (generations/ETags), concurrent writes, retry logic, cache behavior, metadata preservation.

For comprehensive documentation, see Cloud Emulator Testing Guide.

Testing

Running Tests

# Run all local tests (no cloud dependencies)
cargo test

# Run cache tests with GCS/S3 features
cargo test -p metafuse-catalog-storage --features gcs,s3

Emulator-Based Integration Tests

As introduced above, the GCS and S3 backends are validated with Docker-based emulator integration tests, so cloud-specific behavior can be exercised without real cloud credentials.

Requirements:

  • Docker installed and running
  • testcontainers-rs dependency (included in dev-dependencies)

Note: Tests are skipped by default. Set RUN_CLOUD_TESTS=1 to run. In CI, tests run on Linux only (Docker service containers not supported on macOS/Windows runners).

Running GCS Emulator Tests:

# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests

Uses fake-gcs-server Docker image. Tests cover:

  • Catalog initialization and existence checks
  • Upload/download roundtrips with generation-based versioning
  • Concurrent write detection and conflict handling
  • Retry logic with exponential backoff
  • Cache behavior (enabled/disabled)
  • Metadata preservation across uploads

Running S3 Emulator Tests:

# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

Uses MinIO Docker image. Tests cover:

  • Catalog initialization and existence checks
  • Upload/download roundtrips with ETag-based versioning
  • Concurrent write detection and conflict handling
  • Retry logic with exponential backoff
  • Cache behavior (enabled/disabled)
  • Metadata preservation and region configuration

Note: Emulator tests automatically:

  • Start Docker containers via testcontainers-rs
  • Wait for container readiness (30-second timeout)
  • Clean up containers after test completion
  • Run in isolated environments (no credential leakage)

Documentation

Project Structure

MetaFuse/
|-- crates/
|   |-- catalog-core/       # Core types and SQLite schema
|   |-- catalog-storage/    # Local + cloud catalog backends
|   |-- catalog-emitter/    # DataFusion integration
|   |-- catalog-api/        # REST API (Axum)
|   `-- catalog-cli/        # CLI for catalog operations
|-- docs/                   # Documentation
|-- examples/               # Usage examples
`-- tests/                  # Integration tests

Use Cases

  • Data Discovery: Find datasets with full-text search
  • Data Lineage: Track dependencies and impact analysis
  • Data Governance: Know what exists, where it lives, and who owns it
  • Team Collaboration: Share knowledge with tags and glossary terms

Comparison

| Feature | MetaFuse | DataHub | AWS Glue |
|---------|----------|---------|----------|
| Cost | $0-$5/mo | High (self-hosted) | Pay-per-use |
| Setup | < 30 min | Days | Hours |
| Infrastructure | None | Kafka, MySQL, ES | AWS only |
| DataFusion Integration | Native | Via connector | No |
| Local Development | Yes | No | No |

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

Apache License 2.0 - See LICENSE for details.

Acknowledgments

Built with Apache Arrow, DataFusion, SQLite, and Rust.


Built for data teams who want catalogs, not complexity.
