MetaFuse

Lightweight, serverless data catalog for DataFusion and modern lakehouse pipelines

MetaFuse captures dataset schemas, lineage, and operational metadata automatically from your data pipelines without requiring Kafka, MySQL, or Elasticsearch. Just a SQLite file on object storage.

Status: v0.3.0 Cloud-Ready. Production-hardened cloud backends: supports local SQLite, Google Cloud Storage, and AWS S3, with optimistic concurrency and caching.

Why MetaFuse?

  • Native DataFusion integration - Emit metadata directly from pipelines
  • Serverless-friendly - $0-$5/month on object storage (GCS/S3 supported)
  • Zero infrastructure - No databases, no clusters to maintain
  • Multi-cloud - Works on local filesystem, Google Cloud Storage, and AWS S3
  • Automatic lineage capture - Track data flow through transformations
  • Full-text search with FTS5 - Fast search with automatic trigger maintenance
  • Optimistic concurrency - Safe concurrent writes with version-based locking
  • Tags and glossary - Organize with business context
  • Operational metadata - Track row counts, sizes, partition keys

MetaFuse fills the gap between expensive, complex enterprise catalogs (DataHub, Collibra) and the needs of small-to-medium data teams.

Quick Start

Install

Option 1: Install Script (Recommended)

curl -fsSL https://raw.githubusercontent.com/ethan-tyler/MetaFuse/main/install.sh | bash

This downloads pre-built binaries for your platform (Linux/macOS, x86_64/ARM64).

Option 2: Build from Source

git clone https://github.com/ethan-tyler/MetaFuse.git
cd MetaFuse

# Local-only (default)
cargo build --release

# With GCS support
cargo build --release --features gcs

# With S3 support
cargo build --release --features s3

# With all cloud backends
cargo build --release --features cloud

Option 3: Docker

docker pull ghcr.io/ethan-tyler/metafuse-api:latest
docker run -p 8080:8080 -v $(pwd)/data:/data ghcr.io/ethan-tyler/metafuse-api

Backend Options

MetaFuse supports multiple storage backends:

| Backend | URI Format | Authentication | Features Required |
|---------|------------|----------------|-------------------|
| Local SQLite | file://catalog.db or catalog.db | None | local (default) |
| Google Cloud Storage | gs://bucket/path/catalog.db | ADC, GOOGLE_APPLICATION_CREDENTIALS | gcs |
| Amazon S3 | s3://bucket/key?region=us-east-1 | AWS credential chain | s3 |
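
The required cargo feature follows directly from the catalog URI scheme. As a minimal sketch of that mapping, assuming only the URI formats in the table above (the helper name backend_feature is illustrative, not part of the MetaFuse API):

fn backend_feature(uri: &str) -> &'static str {
    // gs:// URIs need the gcs feature, s3:// URIs need the s3 feature;
    // file:// URIs and bare paths use the default local backend.
    if uri.starts_with("gs://") {
        "gcs"
    } else if uri.starts_with("s3://") {
        "s3"
    } else {
        "local"
    }
}

fn main() {
    assert_eq!(backend_feature("gs://my-bucket/catalogs/prod.db"), "gcs");
    assert_eq!(backend_feature("catalog.db"), "local");
}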

Environment Variables

Storage:

  • METAFUSE_CATALOG_URI: Override default catalog location
  • METAFUSE_CACHE_TTL_SECS: Cache TTL for cloud backends (default: 60, 0 to disable)
  • GOOGLE_APPLICATION_CREDENTIALS: Path to GCS service account JSON (GCS only)
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY: AWS credentials (S3 only)

API Server:

  • METAFUSE_PORT: API server port (default: 8080)
  • METAFUSE_RUN_MIGRATIONS: Set to true to auto-run migrations on startup

Enterprise Features (v0.6.0+):

  • METAFUSE_AUDIT_BATCH_SIZE: Max events per audit DB write batch (default: 100)
  • METAFUSE_AUDIT_FLUSH_INTERVAL_MS: Audit flush interval in ms (default: 5000)
  • METAFUSE_USAGE_FLUSH_INTERVAL_MS: Usage counter flush interval in ms (default: 60000)

Examples

# Local
export METAFUSE_CATALOG_URI="file://my_catalog.db"
metafuse init

# Google Cloud Storage
export METAFUSE_CATALOG_URI="gs://my-bucket/catalogs/prod.db"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
metafuse init

# Amazon S3
export METAFUSE_CATALOG_URI="s3://my-bucket/catalogs/prod.db?region=us-west-2"
metafuse init
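
At runtime, a pipeline can resolve the same settings itself. A minimal sketch using only the standard library, reading the METAFUSE_CATALOG_URI and METAFUSE_CACHE_TTL_SECS variables described above (the fallback values here are this sketch's assumptions, not guaranteed library defaults):

use std::env;

fn main() {
    // Catalog location, falling back to a local SQLite file.
    let catalog_uri = env::var("METAFUSE_CATALOG_URI")
        .unwrap_or_else(|_| "file://metafuse_catalog.db".to_string());

    // Cache TTL for cloud backends; the documented default is 60 seconds, 0 disables caching.
    let cache_ttl_secs: u64 = env::var("METAFUSE_CACHE_TTL_SECS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(60);

    println!("catalog: {catalog_uri}, cache TTL: {cache_ttl_secs}s");
}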

Initialize Catalog

./target/release/metafuse init

Emit Metadata from DataFusion

use datafusion::prelude::*;
use metafuse_catalog_core::OperationalMeta;
use metafuse_catalog_emitter::Emitter;
use metafuse_catalog_storage::LocalSqliteBackend;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx.read_parquet("input.parquet", Default::default()).await?;
    let result = df.filter(col("status").eq(lit("active")))?.select_columns(&["id", "name"])?;

    // Get schema and row count before writing
    let schema = result.schema().inner().clone();
    let batches = result.collect().await?;
    let row_count: usize = batches.iter().map(|b| b.num_rows()).sum();

    // Write output
    ctx.register_batches("active_records", batches.clone())?;
    ctx.sql("SELECT * FROM active_records")
        .await?
        .write_parquet("output.parquet", Default::default())
        .await?;

    // Emit metadata to catalog
    let backend = LocalSqliteBackend::new("metafuse_catalog.db");
    let emitter = Emitter::new(backend);
    emitter.emit_dataset(
        "active_records",
        "output.parquet",
        "parquet",
        Some("Filtered active records"),  // Description
        Some("prod"),                       // Tenant
        Some("analytics"),                  // Domain
        Some("team@example.com"),          // Owner
        schema,                            // Arrow schema
        Some(OperationalMeta {
            row_count: Some(row_count as i64),
            size_bytes: None,
            partition_keys: vec![],
        }),
        vec!["raw_records".to_string()],   // Upstream lineage
        vec!["active".to_string()],        // Tags
    )?;

    Ok(())
}
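
In the example above, size_bytes is left as None. When the output lands on the local filesystem, it can be filled from file metadata before calling emit_dataset; a small, hedged refinement using only std::fs (adapt for object-store outputs):

// Populate size_bytes from the written Parquet file instead of passing None.
let size_bytes = std::fs::metadata("output.parquet")
    .ok()
    .map(|m| m.len() as i64);

// Pass this in place of the OperationalMeta constructed inline above.
let meta = OperationalMeta {
    row_count: Some(row_count as i64),
    size_bytes,
    partition_keys: vec![],
};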

Query the Catalog

# List datasets
metafuse list

# Show details with lineage
metafuse show active_records --lineage

# Search
metafuse search "analytics"

# Statistics
metafuse stats

Schema Migrations

MetaFuse uses a versioned migration system to evolve the database schema. Migrations are forward-only and idempotent.

# Check migration status
metafuse migrate status

# Run pending migrations
metafuse migrate run

# View migration history
metafuse migrate history

Programmatic Usage:

use metafuse_catalog_core::init_catalog;
use rusqlite::Connection;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let conn = Connection::open("catalog.db")?;

    // Initialize schema and run all migrations
    let migrations_applied = init_catalog(&conn, true)?;
    println!("Applied {} migrations", migrations_applied);

    Ok(())
}

Migration Notes:

  • Migrations are tracked in schema_migrations table
  • Advisory locking prevents concurrent migrations
  • Version format: MAJOR*1_000_000 + MINOR*1_000 + PATCH (e.g., v1.0.0 = 1_000_000)
  • Rollbacks are not supported - plan migrations carefully
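
To make the version arithmetic concrete, here is a minimal sketch of the encoding described above; the helper name encode_version is illustrative and not part of the MetaFuse API:

fn encode_version(major: i64, minor: i64, patch: i64) -> i64 {
    // MAJOR*1_000_000 + MINOR*1_000 + PATCH, as documented in the notes above.
    major * 1_000_000 + minor * 1_000 + patch
}

fn main() {
    assert_eq!(encode_version(1, 0, 0), 1_000_000); // v1.0.0
    assert_eq!(encode_version(0, 3, 0), 3_000);     // v0.3.0
}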

Run the REST API

cargo run --bin metafuse-api

# Query endpoints
curl http://localhost:8080/api/v1/datasets
curl http://localhost:8080/api/v1/datasets/active_records
curl "http://localhost:8080/api/v1/search?q=analytics"
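
The same endpoints can be called from Rust as well. A minimal sketch assuming the reqwest crate with its blocking feature enabled (an illustration only, not a MetaFuse dependency):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // List all datasets registered in the catalog via the REST API.
    let body = reqwest::blocking::get("http://localhost:8080/api/v1/datasets")?.text()?;
    println!("{body}");
    Ok(())
}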

Try the Examples

# Simple pipeline - basic metadata emission
cargo run --example simple_pipeline

# Lineage tracking - 3-stage ETL with dependencies
cargo run --example lineage_tracking

See examples/README.md for detailed walkthrough and expected output.

Cloud Emulator Tests (Optional)

MetaFuse includes comprehensive integration tests for GCS and S3 backends using Docker-based emulators. These tests validate cloud-specific behavior without requiring cloud credentials.

# Run GCS emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests

# Run S3 emulator tests (requires Docker)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

What's tested: Versioning (generations/ETags), concurrent writes, retry logic, cache behavior, metadata preservation.

For comprehensive documentation, see Cloud Emulator Testing Guide.

Testing

Running Tests

# Run all local tests (no cloud dependencies)
cargo test

# Run cache tests with GCS/S3 features
cargo test -p metafuse-catalog-storage --features gcs,s3

Emulator-Based Integration Tests

As introduced above, the GCS and S3 backends are validated with Docker-based emulator integration tests, so cloud-specific behavior can be exercised without real cloud credentials.

Requirements:

  • Docker installed and running
  • testcontainers-rs dependency (included in dev-dependencies)

Note: Tests are skipped by default. Set RUN_CLOUD_TESTS=1 to run. In CI, tests run on Linux only (Docker service containers not supported on macOS/Windows runners).

Running GCS Emulator Tests:

# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features gcs --test gcs_emulator_tests

Uses fake-gcs-server Docker image. Tests cover:

  • Catalog initialization and existence checks
  • Upload/download roundtrips with generation-based versioning
  • Concurrent write detection and conflict handling
  • Retry logic with exponential backoff
  • Cache behavior (enabled/disabled)
  • Metadata preservation across uploads

Running S3 Emulator Tests:

# Requires Docker (skipped by default)
RUN_CLOUD_TESTS=1 cargo test --features s3 --test s3_emulator_tests

Uses MinIO Docker image. Tests cover:

  • Catalog initialization and existence checks
  • Upload/download roundtrips with ETag-based versioning
  • Concurrent write detection and conflict handling
  • Retry logic with exponential backoff
  • Cache behavior (enabled/disabled)
  • Metadata preservation and region configuration

Note: Emulator tests automatically:

  • Start Docker containers via testcontainers-rs
  • Wait for container readiness (30-second timeout)
  • Clean up containers after test completion
  • Run in isolated environments (no credential leakage)

Documentation

Project Structure

MetaFuse/
|-- crates/
|   |-- catalog-core/       # Core types and SQLite schema
|   |-- catalog-storage/    # Local + cloud catalog backends
|   |-- catalog-emitter/    # DataFusion integration
|   |-- catalog-api/        # REST API (Axum)
|   `-- catalog-cli/        # CLI for catalog operations
|-- docs/                   # Documentation
|-- examples/               # Usage examples
`-- tests/                  # Integration tests

Use Cases

  • Data Discovery: Find datasets with full-text search
  • Data Lineage: Track dependencies and impact analysis
  • Data Governance: Know what exists, where it lives, and who owns it
  • Team Collaboration: Share knowledge with tags and glossary terms

Comparison

| Feature | MetaFuse | DataHub | AWS Glue |
|---------|----------|---------|----------|
| Cost | $0-$5/mo | High (self-hosted) | Pay-per-use |
| Setup | < 30 min | Days | Hours |
| Infrastructure | None | Kafka, MySQL, ES | AWS only |
| DataFusion Integration | Native | Via connector | No |
| Local Development | Yes | No | No |

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

Apache License 2.0 - See LICENSE for details.

Acknowledgments

Built with Apache Arrow, DataFusion, SQLite, and Rust.


Built for data teams who want catalogs, not complexity.
