# pg-warehouse

> Mirror PostgreSQL → DuckDB. Build versioned analytical releases. Export Parquet. No cloud required.


## Why pg-warehouse?

Getting data out of PostgreSQL for analytics or ML usually means stitching together Python scripts, cron jobs, and a cloud warehouse you don't need. pg-warehouse replaces that with a single binary: sync tables into an embedded DuckDB, build versioned analytical releases, and export to Parquet or CSV. Everything runs locally with no external dependencies.

## What makes pg-warehouse a local-first data warehouse?

- Analytical storage (columnar, optimized for reads)
- Separation from OLTP (workload isolation)
- SQL analytical engine
- Data transformation layer (ETL/ELT)
- Durable analytical storage (files or tables)

## Quick comparison

|                    | pg-warehouse           | pg_dump       | Airbyte                | dbt             |
|--------------------|------------------------|---------------|------------------------|-----------------|
| PostgreSQL sync    | Full, incremental, CDC | Full only     | Full, incremental, CDC | --              |
| Local analytics    | DuckDB (columnar)      | --            | --                     | DuckDB adapter  |
| Parquet/CSV export | Built-in               | --            | Via connectors         | Via packages    |
| Model graph (DAG)  | Built-in               | --            | --                     | Core strength   |
| Contracts          | Built-in               | --            | --                     | Via packages    |
| Infrastructure     | Single binary          | Single binary | Docker + Java          | Python + adapter |
| Cloud required     | No                     | No            | Optional               | Optional        |

## What does it solve?

pg-warehouse is a local-first data warehouse engine that mirrors PostgreSQL data into DuckDB using native PostgreSQL CDC. It is best for teams that want CDC + SQL transforms + Parquet without standing up Kafka, Spark, or a warehouse.

## How does it work?

pg-warehouse uses three DuckDB files following the Medallion Architecture:

| File             | Layer  | Purpose                                                           |
|------------------|--------|-------------------------------------------------------------------|
| `raw.duckdb`     | Bronze | CDC black box. Deduped PostgreSQL mirror. CDC owns it exclusively. |
| `silver.duckdb`  | Silver | Development platform. v0 = raw snapshot, v1 = your models.         |
| `feature.duckdb` | Gold   | Analytics output. v0 = silver snapshot, v1 = your features.        |

Models use `ref()` for dependencies. pg-warehouse resolves the DAG and builds in the correct order.
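For illustration, a pair of models might look like the sketch below. The model names, columns, and the exact `ref()`/`source()` call syntax are assumptions for this example, not taken from the project's docs; see the development workflow doc for the real form.

```sql
-- models/silver/orders_clean.sql  (hypothetical model; syntax is illustrative)
SELECT order_id, customer_id, total_amount, created_at
FROM source('orders')            -- reads from the silver v0 snapshot
WHERE status <> 'cancelled';

-- models/features/daily_revenue.sql
SELECT created_at::DATE AS day, SUM(total_amount) AS revenue
FROM ref('orders_clean')         -- declares a DAG edge on orders_clean
GROUP BY 1;
```

The `ref('orders_clean')` call is what lets pg-warehouse place `daily_revenue` after `orders_clean` in the build order.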


**Refresh** — snapshot raw CDC data into silver v0 (14 tables, ~12s)

![Refresh](1.png)

**Build** — resolve the DAG and build all 6 models: silver → feature → Parquet

![Build](2.png)

**Validate** — check contracts, models, graph, releases (zero errors = green)

![Validate](3.png)

**Graph** — show the model dependency DAG (who depends on whom)

![Graph](4.png)

**Contracts** — list registered data shape contracts

![Contracts](5.png)

**Releases** — list versioned release definitions

![Releases](6.png)

**Promote** — deploy a release to an environment (dev/staging/prod)

![Promote](7.png)

**Inspect** — list all tables across all DuckDB files

![Inspect](8.png)

**CDC Status** — check replication health (lag, slot, streaming)

![CDC Status](9.png)

**Repair** — fix orphaned builds and stale locks

![Repair](10.png)

**Pipeline History** — build and promotion history

![Pipeline History](11.png)

## Documentation

| Doc                       | Description                                  |
|---------------------------|----------------------------------------------|
| Architecture              | Hexagonal design, layers, port interfaces    |
| Quickstart                | Full walkthrough with examples               |
| State Database            | SQLite schema and semantics                  |
| CDC Guide                 | Logical replication setup and lifecycle      |
| Sync Modes                | Full vs. incremental vs. CDC                 |
| Configuration             | YAML reference                               |
| Open-Core Strategy        | OSS vs. commercial boundary                  |
| Development Workflow      | Models, contracts, DAG-resolved builds       |
| Multi-DuckDB Architecture | Zero-downtime CDC with three DuckDB files    |
```shell
pg-warehouse refresh     # snapshot raw → silver v0
pg-warehouse build       # build all models in DAG order
```

Full development workflow: https://github.com/burnside-project/pg-warehouse/blob/main/docs/08-development-workflow.md

## Install

**Homebrew (macOS / Linux):**

```shell
brew install burnside-project/tap/pg-warehouse
```

**Go install:**

```shell
go install github.com/burnside-project/pg-warehouse/cmd/pg-warehouse@latest
```

**Download binary** — see Releases for Linux, macOS, and Windows (amd64/arm64).

**Build from source:**

```shell
git clone https://github.com/burnside-project/pg-warehouse.git
cd pg-warehouse
make build
```

## Deployment Layout

pg-warehouse runs from a single working directory. All paths in `pg-warehouse.yml` are relative to this directory.

```
~/pg-warehouse/                  # Working directory
├── pg-warehouse                 # Binary
├── pg-warehouse.yml             # Configuration
├── raw.duckdb                   # CDC black box
├── silver.duckdb                # Silver development platform
├── feature.duckdb               # Feature analytics output
├── .pgwh/
│   └── state.db                 # SQLite state (sync/CDC/builds)
├── models/
│   ├── silver/                  # Silver-layer SQL models
│   └── features/                # Feature-layer SQL models
├── contracts/                   # Data contracts (YAML)
├── releases/                    # Release definitions (YAML)
├── out/                         # Parquet/CSV exports
└── cdc.log                      # CDC log (when running via nohup)
```
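As a sketch, a minimal `pg-warehouse.yml` could look like the following. Every key name below is an illustrative assumption, not the tool's actual schema; consult the Configuration doc for the real YAML reference.

```yaml
# Hypothetical configuration sketch -- key names are assumptions.
postgres:
  dsn: postgres://app:secret@localhost:5432/shop
warehouse:
  raw: raw.duckdb
  silver: silver.duckdb
  feature: feature.duckdb
tables:
  - name: orders             # CDC streaming via logical replication
    mode: cdc
  - name: products           # full table snapshot
    mode: full
  - name: events             # incremental sync via a watermark column
    mode: incremental
    watermark: updated_at
export:
  dir: out/
```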

## Quickstart (2 minutes)

**1. Initialize** — creates the DuckDB files and state DB, and scaffolds the directories:

```shell
$ mkdir -p ~/pg-warehouse && cd ~/pg-warehouse
$ pg-warehouse init --config pg-warehouse.yml
```

**2. Start CDC** — stream changes from PostgreSQL:

```shell
$ pg-warehouse cdc setup --config pg-warehouse.yml
$ nohup pg-warehouse cdc start --config pg-warehouse.yml > cdc.log 2>&1 &
```

**3. Refresh** — snapshot raw data into silver v0:

```shell
$ pg-warehouse refresh
```

**4. Build** — build all models in DAG order:

```shell
$ pg-warehouse build
```

**5. Validate** — check contracts, models, DAG, and releases:

```shell
$ pg-warehouse validate
```

## Features

### Sync

- Full table snapshots
- Incremental sync via watermark columns
- CDC streaming via PostgreSQL logical replication (pglogrepl)
- Automatic sync-mode detection per table

### Analytical Releases

- Models with `ref()` and `source()` dependency declarations
- DAG-resolved build ordering (topological sort, cycle detection)
- Data contracts (YAML) for schema validation
- Named releases with versioned model bundles
- Terraform-style `--plan` with SQL validation via `EXPLAIN`
- Build history and promotion tracking in `_meta`
- Partial builds: `build --select model_name`

### Multi-DuckDB Architecture

- Three-file isolation: raw (CDC), silver (transforms), feature (analytics)
- Zero-downtime CDC — the pipeline never stops CDC
- Reserved-schema protection (`raw`, `stage`, `v0`, `_meta`) with DANGER enforcement
- CDC guardrails: `max_lag_bytes`, `drop_slot_on_exit`, health checks

### Export

- Parquet export
- CSV export

### Developer Experience

- Single binary, zero external dependencies
- `validate` command for contracts, models, DAG, and SQL
- `graph` command to visualize model dependencies
- `history` command for a build and promotion audit trail
- SQLite state tracking that survives warehouse rebuilds
- `repair` command for fixing orphaned builds

## Commands

| Command                                   | What it does                                      |
|-------------------------------------------|---------------------------------------------------|
| `refresh`                                 | Snapshot raw.duckdb → silver.duckdb v0            |
| `validate`                                | Check contracts, models, DAG, releases, SQL syntax |
| `build`                                   | Build all models in DAG order                     |
| `build --release X --version Y`           | Build a specific release                          |
| `build --select model_name`               | Build one model plus its dependencies             |
| `graph`                                   | Show the model dependency DAG                     |
| `history`                                 | Build and promotion history                       |
| `contracts list`                          | List data contracts                               |
| `release list`                            | List releases                                     |
| `promote --release X --version Y --env E` | Promote to an environment                         |
| `inspect tables`                          | List all DuckDB tables                            |
| `cdc status`                              | Check CDC health                                  |
| `repair`                                  | Fix orphaned builds, stale locks                  |

## E-Commerce Recipe

A complete working example with 14 source tables, 6 models, contracts, and releases:

```shell
pg-warehouse refresh
pg-warehouse build
```

See `examples/ecommerce-recipe/README.md` for full details, including the dashboard and AI Q&A.

## Architecture

pg-warehouse uses a hexagonal architecture with clean port/adapter separation. The CLI layer (Cobra) calls services that depend only on port interfaces; adapters for PostgreSQL, DuckDB, SQLite, Parquet, and CSV implement those interfaces. New sources, warehouses, and exporters plug in without changing business logic.

See docs/01-architecture.md for the full design.

## Open Core

The open-source edition covers the full developer workflow: sync, CDC, DuckDB, model graph, contracts, releases, and local export. Production operations -- scheduling, cloud storage export (S3/GCS/Iceberg), remote state, RBAC, and lineage -- are commercial.

See docs/07-open-core.md for the boundary details.


## License

Apache License 2.0 -- Copyright 2025-2026 Burnside Project
