digoal
2024-01-30
PostgreSQL , PolarDB , DuckDB , hyperconvergence , zero-ETL , pg_analytics , compute-storage separation
Imagine that an enterprise's data is spread across many data sources, e.g. the databases of different business lines, or files sitting in object storage. To analyze all of it, one common approach is to synchronize every source into a big-data platform. But there is a cheaper, more real-time, and simpler alternative: hyperconverged computing. Hyperconverged computing can be loosely understood as a "compute + data-access pipelines + arbitrary data sources" architecture. For example, the hyperconverged product of the LotuseeData big-data platform, combined with PolarDB, uses PolarDB as the compute node and, through configured pipelines, accesses any data source in real time to run real-time analytics across the entire data estate.
The compute node of a hyperconverged system can be DuckDB, PostgreSQL, PolarDB, Greenplum, and so on. The open-source PostgreSQL extension pg_analytics is one such hyperconverged computing plugin.
https://github.com/paradedb/paradedb/tree/dev/pg_analytics
https://docs.paradedb.com/blog/introducing_analytics
pg_analytics extension architecture
- embeds Arrow, Parquet, and DataFusion
- uses PostgreSQL's system catalogs to store the metadata of remote data
- accesses remote data through the table access method API; tables are persisted as Parquet files, managed by Delta Lake, which provides ACID transactions
- uses an executor hook to route queries to DataFusion, which generates an execution plan better suited to analytical (AP) workloads and executes the query
- finally returns the results to PostgreSQL
For background on these components, see:
- 《DuckDB ADBC - 通过 Arrow 数据库连接进行 零复制|零格式转换 数据传输 VS ODBC/JDBC》
- 《hydra, 一款基于PostgreSQL的开源HTAP数据库. 支持列存,向量化,物化,冷热分离存储,cloud 等特性》
- 《单机部署体验 - 开源AWS Aurora for PostgreSQL: neon , 存算分离,server less. program by RUST》
- 《开源AWS Aurora for PostgreSQL: neon , 存算分离,server less. program by RUST》
- 《DuckDB 0.8.0 发布, 支持pivot语法, ASOF JOIN, 并行导入导出性能提升, 递归通配符解析文件, arrow 连接器等》
- 《一款兼容mysql,clickhouse 使用rust写的数据湖产品databend(号称开源版snowflake) - 适合"时序、IoT、feed?、分析、数据归档"等场景》
- 《将 "数据结构、数据存储" 从 "数据库管理系统" 剥离后 - 造就了大量大数据产品(DataFusion, arrow-rs, databend等)》
- 《PostgreSQL 大数据场景存储生态: apache arrow - by pg-strom》
- 《PolarDB-PG | PostgreSQL + duckdb_fdw + 阿里云OSS 实现高效低价的海量数据冷热存储分离》
- 《DuckDB DataLake 场景使用举例 - aliyun OSS对象存储parquet》
- 《什么是 Delta Lake (数据湖)》
- 《DuckDB parquet 分区表 / Delta Lake(数据湖) 应用》
- 《《开源大咖说》第1期《为什么PolarDB选择计算存储分离的分布式架构》》
- 《PolarDB for PostgreSQL 开源版 - 计算存储分离版(类似Oracle RAC架构) 部署指南》
- 《PostgreSQL 应用开发解决方案最佳实践系列课程 - 9. 数据存储冷热分离》
- 《PostgreSQL deltaLake 数据湖用法 - arrow + parquet fdw》
pg_analytics is an extension that accelerates analytical query processing inside Postgres. The performance of analytical queries that leverage pg_analytics is comparable to the performance of dedicated OLAP databases, without the need to extract, transform, and load (ETL) the data from your Postgres instance into another system. The purpose of pg_analytics is to be a drop-in solution for fast analytics in Postgres with zero ETL.
The primary dependencies are:
- Apache Arrow for column-oriented memory format
- Apache DataFusion for vectorized query execution with SIMD
- Apache Parquet for persistence
- Delta Lake as a storage framework with ACID properties
- pgrx, the framework for creating Postgres extensions in Rust
With pg_analytics installed, ParadeDB is the fastest Postgres-based analytical database and outperforms many specialized OLAP systems. On ClickBench, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and almost ties ClickHouse.
For an apples-to-apples comparison, these benchmarks were run on a c6a.4xlarge with 500GB storage. None of the databases were tuned. The (Parquet, single) ClickHouse variant was selected because it most closely matches ParadeDB's Parquet storage.
You can view ParadeDB ClickBench results, including how we compare against other Postgres-compatible systems here.
This toy example demonstrates how to get started.
CREATE EXTENSION pg_analytics;
-- Create a deltalake table
CREATE TABLE t (a int) USING deltalake;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a deltalake table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;
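The toy example extends naturally to a more analytics-shaped query. A hedged sketch follows; the events table, its columns, and the chosen types are illustrative assumptions, not from the ParadeDB docs (note that some Postgres types are not yet supported on deltalake tables):

```sql
-- Hypothetical fact table stored column-oriented via pg_analytics
CREATE TABLE events (
    user_id bigint,
    amount  double precision
) USING deltalake;

INSERT INTO events VALUES (1, 9.5), (1, 3.0), (2, 7.25);

-- A typical OLAP aggregation; it reads only the columns it names,
-- which is where the columnar Parquet layout helps
SELECT user_id, count(*) AS n, sum(amount) AS total
FROM events
GROUP BY user_id
ORDER BY total DESC;
```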
You can interact with deltalake tables the same way as with normal Postgres tables. However, there are a few operations specific to deltalake tables.
When deltalake tables are dropped, they remain on disk until VACUUM is run. This operation physically deletes the Parquet files of dropped tables.
The VACUUM FULL <table_name> command is used to optimize a table's storage by bin-packing small Parquet files into larger files, which can significantly improve query time and compression. It also deletes Parquet files belonging to dropped data.
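A minimal sketch of the lifecycle described above, reusing the deltalake table t from the toy example (the second table name, t2, is a hypothetical stand-in for any live deltalake table):

```sql
DROP TABLE t;  -- gone from the catalog, but its Parquet files remain on disk
VACUUM;        -- physically deletes the Parquet files of dropped deltalake tables

-- For a live deltalake table, bin-pack small Parquet files into larger ones
VACUUM FULL t2;
```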
pg_analytics is currently in beta. The following features are supported:
- deltalake tables behave like regular Postgres tables and support most Postgres queries (JOINs, CTEs, window functions, etc.)
- Vacuum and Parquet storage optimization
- INSERT, TRUNCATE, and COPY
The following features are not yet supported. As pg_analytics becomes production-ready, many of these will be resolved.
- UPDATE and DELETE
- Partitioning tables by column
- Some Postgres types like arrays, JSON, time, and timestamp with time zone
- User-defined functions, aggregations, or types
- Referencing deltalake and regular Postgres heap tables in the same query
- Write-ahead log (WAL) support and ROLLBACK
- Foreign keys
- Index scans
- TEMP tables
- Using an external data lake as a table storage provider
- Full text search over deltalake tables with pg_bm25
pg_analytics introduces column-oriented storage and vectorized query execution to Postgres via Apache Parquet, Arrow, and DataFusion. These libraries are the building blocks of many modern analytical databases.
Regular Postgres tables, known as heap tables, are row-oriented. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table. As a result, most dedicated analytical (i.e. OLAP) database systems use a column-oriented layout so that scans only need to access the data from the relevant columns. Column-oriented systems have other advantages for analytics such as improved compression and are more amenable to vectorized execution.
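In pg_analytics terms, the row/column trade-off is simply a choice of access method at CREATE TABLE time. A hedged sketch (table and column names are illustrative):

```sql
-- Row-oriented heap table (Postgres default): suited to OLTP point lookups
CREATE TABLE orders_row (id bigint, region_id int, amount double precision);

-- Same schema, column-oriented via pg_analytics
CREATE TABLE orders_col (id bigint, region_id int, amount double precision)
    USING deltalake;

-- This scan names only 2 of the 3 columns; a columnar layout can skip
-- reading the id column entirely, and compresses each column better
SELECT region_id, avg(amount) FROM orders_col GROUP BY region_id;
```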
Vectorized query execution is a technique that takes advantage of modern CPUs to break column-oriented data into batches and process the batches in parallel.
pg_analytics embeds Arrow, Parquet, and DataFusion inside Postgres via executor hooks and the table access method API. Executor hooks intercept queries to these tables and reroute them to DataFusion, which generates an optimized query plan, executes the query, and sends the results back to Postgres. The table access method persists Postgres tables as Parquet files and registers them with Postgres' system catalogs. The Parquet files are managed by Delta Lake, which provides ACID transactions.
To develop the extension, first install Rust v1.73.0 using rustup. We will soon make the extension compatible with newer versions of Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.73.0
# We recommend setting the default version to 1.73.0 for consistency across your system
rustup default 1.73.0
Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconsistencies with Homebrew's Rust installation on macOS.
Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:
# macOS
brew install postgresql@16
# Ubuntu
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16
If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"
Then, install and initialize pgrx:
# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.11.2
# macOS arm64
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config
# macOS amd64
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config
# Ubuntu
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config
If you prefer to use a different version of Postgres, update the --pg flag accordingly.
Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.
This extension uses Postgres hooks to intercept Postgres queries. In order to enable these hooks, the extension must be added to shared_preload_libraries inside postgresql.conf. If you are using Postgres 16, this file can be found under ~/.pgrx/data-16.
# Inside postgresql.conf
shared_preload_libraries = 'pg_analytics'
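After restarting Postgres, the setting can be checked from psql with standard commands (these checks are generic Postgres, not specific to pg_analytics):

```sql
SHOW shared_preload_libraries;     -- expect the list to include 'pg_analytics'
SELECT extname FROM pg_extension;  -- lists 'pg_analytics' once CREATE EXTENSION has run
```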
The extension can be developed with or without an optimized build. An optimized build improves query times by 10-20x but also significantly increases build times.
To launch the extension without an optimized build, run
cargo pgrx run
To develop with an optimized build, first switch to the latest Rust Nightly (as of writing, 1.77) via:
rustup update nightly
rustup override set nightly
Then, reinstall pgrx for the new version of Rust:
cargo install --locked cargo-pgrx --version 0.11.2 --force
Finally, run the following to build in release mode with SIMD:
cargo pgrx run --release
Note that this may take several minutes to execute.
To revert back to the stable version of Rust, run:
rustup override unset
To run benchmarks locally, enter the pg_analytics/ directory and run cargo clickbench. This runs a minified version of the ClickBench benchmark suite on pg_analytics.