digoal
2024-01-30
PostgreSQL , PolarDB , DuckDB , hyperconvergence , zero-ETL , pg_analytics , compute-storage separation
Imagine that an enterprise's data is spread across many data sources, e.g. the databases of different business lines, or files sitting in object storage. To analyze all of it, one common approach is to synchronize every source into a big-data platform. But there is a cheaper, more real-time, and simpler alternative: hyperconverged computing. Hyperconverged computing can be loosely understood as a "compute + data-access pipelines + arbitrary data sources" architecture. For example, the hyperconverged product of the LotuseeData big-data platform, combined with PolarDB, uses PolarDB as the compute node and, through configured pipelines, accesses any data source in real time to run real-time analytics across the entire data estate.
The compute node of a hyperconverged system can be DuckDB, PostgreSQL, PolarDB, Greenplum, and so on. The open-source PostgreSQL extension pg_analytics is one such hyperconverged computing plugin.
https://github.com/paradedb/paradedb/tree/dev/pg_analytics
https://docs.paradedb.com/blog/introducing_analytics
pg_analytics extension architecture
- embeds Arrow, Parquet, and DataFusion
- uses PostgreSQL's system catalogs to store the metadata of remote data
- accesses remote data through the table access method API; tables are persisted as Parquet files, managed by Delta Lake, which provides ACID transactions
- uses an executor hook to route queries to DataFusion, which generates an execution plan better suited to analytical (AP) workloads and executes the query
- finally returns the results to PostgreSQL
For background on these components, see:
- 《DuckDB ADBC - 通过 Arrow 数据库连接进行 零复制|零格式转换 数据传输 VS ODBC/JDBC》
- 《hydra, 一款基于PostgreSQL的开源HTAP数据库. 支持列存,向量化,物化,冷热分离存储,cloud 等特性》
- 《单机部署体验 - 开源AWS Aurora for PostgreSQL: neon , 存算分离,server less. program by RUST》
- 《开源AWS Aurora for PostgreSQL: neon , 存算分离,server less. program by RUST》
- 《DuckDB 0.8.0 发布, 支持pivot语法, ASOF JOIN, 并行导入导出性能提升, 递归通配符解析文件, arrow 连接器等》
- 《一款兼容mysql,clickhouse 使用rust写的数据湖产品databend(号称开源版snowflake) - 适合"时序、IoT、feed?、分析、数据归档"等场景》
- 《将 "数据结构、数据存储" 从 "数据库管理系统" 剥离后 - 造就了大量大数据产品(DataFusion, arrow-rs, databend等)》
- 《PostgreSQL 大数据场景存储生态: apache arrow - by pg-strom》
- 《PolarDB-PG | PostgreSQL + duckdb_fdw + 阿里云OSS 实现高效低价的海量数据冷热存储分离》
- 《DuckDB DataLake 场景使用举例 - aliyun OSS对象存储parquet》
- 《什么是 Delta Lake (数据湖)》
- 《DuckDB parquet 分区表 / Delta Lake(数据湖) 应用》
- 《《开源大咖说》第1期《为什么PolarDB选择计算存储分离的分布式架构》》
- 《PolarDB for PostgreSQL 开源版 - 计算存储分离版(类似Oracle RAC架构) 部署指南》
- 《PostgreSQL 应用开发解决方案最佳实践系列课程 - 9. 数据存储冷热分离》
- 《PostgreSQL deltaLake 数据湖用法 - arrow + parquet fdw》
pg_analytics is an extension that accelerates analytical query processing inside Postgres. The performance of analytical queries that leverage pg_analytics is comparable to the performance of dedicated OLAP databases, without the need to extract, transform, and load (ETL) the data from your Postgres instance into another system. The purpose of pg_analytics is to be a drop-in solution for fast analytics in Postgres with zero ETL.
The primary dependencies are:
- Apache Arrow for column-oriented memory format
- Apache DataFusion for vectorized query execution with SIMD
- Apache Parquet for persistence
- Delta Lake as a storage framework with ACID properties
- pgrx, the framework for creating Postgres extensions in Rust
With pg_analytics installed, ParadeDB is the fastest Postgres-based analytical database and outperforms many specialized OLAP systems. On ClickBench, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and almost ties ClickHouse.
For an apples-to-apples comparison, these benchmarks were run on a c6a.4xlarge with 500GB storage. None of the databases were tuned. The (Parquet, single) ClickHouse variant was selected because it most closely matches ParadeDB's Parquet storage.
You can view ParadeDB ClickBench results, including how we compare against other Postgres-compatible systems here.
This toy example demonstrates how to get started.
CREATE EXTENSION pg_analytics;
-- Create a deltalake table
CREATE TABLE t (a int) USING deltalake;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a deltalake table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;
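The toy example extends naturally to a more analytics-shaped query. A hedged sketch follows; the events table, its columns, and the chosen types are illustrative assumptions, not from the ParadeDB docs (note that some Postgres types are not yet supported on deltalake tables):

```sql
-- Hypothetical fact table stored column-oriented via pg_analytics
CREATE TABLE events (
    user_id bigint,
    amount  double precision
) USING deltalake;

INSERT INTO events VALUES (1, 9.5), (1, 3.0), (2, 7.25);

-- A typical OLAP aggregation; it reads only the columns it names,
-- which is where the columnar Parquet layout helps
SELECT user_id, count(*) AS n, sum(amount) AS total
FROM events
GROUP BY user_id
ORDER BY total DESC;
```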
You can interact with deltalake tables the same way as with normal Postgres tables. However, there are a few operations specific to deltalake tables.
When deltalake tables are dropped, they remain on disk until VACUUM is run. This operation physically deletes the Parquet files of dropped tables.
The VACUUM FULL <table_name> command is used to optimize a table's storage by bin-packing small Parquet files into larger files, which can significantly improve query time and compression. It also deletes Parquet files belonging to dropped data.
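A minimal sketch of the lifecycle described above, reusing the deltalake table t from the toy example (the second table name, t2, is a hypothetical stand-in for any live deltalake table):

```sql
DROP TABLE t;  -- gone from the catalog, but its Parquet files remain on disk
VACUUM;        -- physically deletes the Parquet files of dropped deltalake tables

-- For a live deltalake table, bin-pack small Parquet files into larger ones
VACUUM FULL t2;
```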
pg_analytics is currently in beta. The following features are supported:
- deltalake tables behave like regular Postgres tables and support most Postgres queries (JOINs, CTEs, window functions, etc.)
- Vacuum and Parquet storage optimization
- INSERT, TRUNCATE, and COPY
The following features are not yet supported. As pg_analytics becomes production-ready, many of these will be resolved.
- UPDATE and DELETE
- Partitioning tables by column
- Some Postgres types like arrays, JSON, time, and timestamp with time zone
- User-defined functions, aggregations, or types
- Referencing deltalake and regular Postgres heap tables in the same query
- Write-ahead log (WAL) support and ROLLBACK
- Foreign keys
- Index scans
- TEMP tables
- Using an external data lake as a table storage provider
- Full text search over deltalake tables with pg_bm25
pg_analytics introduces column-oriented storage and vectorized query execution to Postgres via Apache Parquet, Arrow, and DataFusion. These libraries are the building blocks of many modern analytical databases.
Regular Postgres tables, known as heap tables, are row-oriented. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table. As a result, most dedicated analytical (i.e. OLAP) database systems use a column-oriented layout so that scans only need to access the data from the relevant columns. Column-oriented systems have other advantages for analytics such as improved compression and are more amenable to vectorized execution.
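In pg_analytics terms, the row/column trade-off is simply a choice of access method at CREATE TABLE time. A hedged sketch (table and column names are illustrative):

```sql
-- Row-oriented heap table (Postgres default): suited to OLTP point lookups
CREATE TABLE orders_row (id bigint, region_id int, amount double precision);

-- Same schema, column-oriented via pg_analytics
CREATE TABLE orders_col (id bigint, region_id int, amount double precision)
    USING deltalake;

-- This scan names only 2 of the 3 columns; a columnar layout can skip
-- reading the id column entirely, and compresses each column better
SELECT region_id, avg(amount) FROM orders_col GROUP BY region_id;
```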
Vectorized query execution is a technique that takes advantage of modern CPUs to break column-oriented data into batches and process the batches in parallel.
pg_analytics embeds Arrow, Parquet, and DataFusion inside Postgres via executor hooks and the table access method API. Executor hooks intercept queries to these tables and reroute them to DataFusion, which generates an optimized query plan, executes the query, and sends the results back to Postgres. The table access method persists Postgres tables as Parquet files and registers them with Postgres' system catalogs. The Parquet files are managed by Delta Lake, which provides ACID transactions.
To develop the extension, first install Rust v1.73.0 using rustup. We will soon make the extension compatible with newer versions of Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.73.0
# We recommend setting the default version to 1.73.0 for consistency across your system
rustup default 1.73.0
Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconsistencies with Homebrew's Rust installation on macOS.
Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:
# macOS
brew install postgresql@16
# Ubuntu
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16
If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"
Then, install and initialize pgrx:
# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.11.2
# macOS arm64
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config
# macOS amd64
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config
# Ubuntu
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config
If you prefer to use a different version of Postgres, update the --pg flag accordingly.
Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.
This extension uses Postgres hooks to intercept Postgres queries. In order to enable these hooks, the extension must be added to shared_preload_libraries inside postgresql.conf. If you are using Postgres 16, this file can be found under ~/.pgrx/data-16.
# Inside postgresql.conf
shared_preload_libraries = 'pg_analytics'
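After restarting Postgres, the setting can be checked from psql with standard commands (these checks are generic Postgres, not specific to pg_analytics):

```sql
SHOW shared_preload_libraries;     -- expect the list to include 'pg_analytics'
SELECT extname FROM pg_extension;  -- lists 'pg_analytics' once CREATE EXTENSION has run
```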
The extension can be developed with or without an optimized build. An optimized build improves query times by 10-20x but also significantly increases build times.
To launch the extension without an optimized build, run
cargo pgrx run
To develop with an optimized build, first switch to the latest Rust Nightly (as of writing, 1.77) via:
rustup update nightly
rustup override set nightly
Then, reinstall pgrx for the new version of Rust:
cargo install --locked cargo-pgrx --version 0.11.2 --force
Finally, run the following to build in release mode with SIMD:
cargo pgrx run --release
Note that this may take several minutes to execute.
To revert back to the stable version of Rust, run:
rustup override unset
To run benchmarks locally, enter the pg_analytics/ directory and run cargo clickbench. This runs a minified version of the ClickBench benchmark suite on pg_analytics.