
PostgreSQL zero-ETL hyper-converged computing extension: pg_analytics

Author

digoal

Date

2024-01-30

Tags

PostgreSQL, PolarDB, DuckDB, hyper-converged, zero-ETL, pg_analytics, compute-storage separation


Background

Imagine that an enterprise's data may be spread across many data sources, for example in the databases of different business lines or as files in object storage. To analyze all of this data comprehensively, one common approach is to synchronize every data source into a unified big data platform. There is, however, a cheaper, more real-time, and simpler approach: hyper-converged computing. Hyper-converged computing can be understood simply as a "compute + data access pipeline + arbitrary data sources" architecture. For example, the hyper-converged product of the LotuseeData big data platform, combined with PolarDB, uses PolarDB as the compute node and, through configured pipelines, accesses any data source in real time to perform real-time analysis across the entire data estate.

The compute node of such a hyper-converged setup can be DuckDB, PostgreSQL, PolarDB, Greenplum, and so on. The open-source PostgreSQL extension pg_analytics is one such hyper-converged computing plugin.

https://github.com/paradedb/paradedb/tree/dev/pg_analytics

https://docs.paradedb.com/blog/introducing_analytics

pg_analytics extension architecture

  • Embeds Arrow, Parquet, and DataFusion.
  • Uses PostgreSQL's system catalogs to store the metadata of the remote data (a catalog-inspection sketch follows this list).
  • Uses the table access method API to access the remote data, which is represented as Parquet files; the Parquet files are managed by Delta Lake, which provides ACID transactions.
  • Uses an executor hook to route queries to DataFusion, which generates execution plans better suited to analytical (AP) workloads and executes the query.
  • Finally returns the results to PostgreSQL.
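As a minimal sketch of the catalog integration (standard Postgres catalog queries; the table t is assumed to be created with USING deltalake, as in the Getting Started example further below):

-- list table access methods; deltalake appears once the extension is installed
SELECT amname FROM pg_am WHERE amtype = 't';

-- show which access method backs table t
SELECT c.relname, a.amname
FROM pg_class c
JOIN pg_am a ON c.relam = a.oid
WHERE c.relname = 't';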

For these components, and for deploying and installing pg_analytics, refer to the project README, excerpted below.

Overview

pg_analytics is an extension that accelerates analytical query processing inside Postgres. The performance of analytical queries that leverage pg_analytics is comparable to the performance of dedicated OLAP databases — without the need to extract, transform, and load (ETL) the data from your Postgres instance into another system. The purpose of pg_analytics is to be a drop-in solution for fast analytics in Postgres with zero ETL.

The primary dependencies are Apache Arrow, Apache Parquet, Apache DataFusion, and Delta Lake.

Benchmarks

With pg_analytics installed, ParadeDB is the fastest Postgres-based analytical database and outperforms many specialized OLAP systems. On ClickBench, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and almost ties ClickHouse.

(Figure: ClickBench benchmark results)

For an apples-to-apples comparison, these benchmarks were run on a c6a.4xlarge with 500GB storage. None of the databases were tuned. The (Parquet, single) ClickHouse variant was selected because it most closely matches ParadeDB's Parquet storage.

You can view ParadeDB ClickBench results, including how we compare against other Postgres-compatible systems here.

Getting Started

This toy example demonstrates how to get started.

CREATE EXTENSION pg_analytics;  
-- Create a deltalake table  
CREATE TABLE t (a int) USING deltalake;  
-- pg_analytics supercharges the performance of any  
-- Postgres query run on a deltalake table  
INSERT INTO t VALUES (1), (2), (3);  
SELECT COUNT(*) FROM t;  

Deltalake Tables

You can interact with deltalake tables the same way as with normal Postgres tables. However, there are a few operations specific to deltalake tables.
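For illustration (hypothetical table and data), a window function runs over a deltalake table exactly as it would over a heap table:

-- sales is a hypothetical deltalake table
CREATE TABLE sales (region text, amount int) USING deltalake;
INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);

-- standard Postgres window function over a deltalake table
SELECT region, amount,
       SUM(amount) OVER (PARTITION BY region) AS region_total
FROM sales;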

Storage Optimization

When deltalake tables are dropped, they remain on disk until VACUUM is run. This operation physically
deletes the Parquet files of dropped tables.

The VACUUM FULL <table_name> command is used to optimize a table's storage by bin-packing small Parquet
files into larger files, which can significantly improve query time and compression. It also deletes
Parquet files belonging to dropped data.
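A minimal sketch of both operations, assuming the deltalake tables t and sales from the sketches above:

DROP TABLE t;
-- physically delete the Parquet files of dropped deltalake tables
VACUUM;
-- bin-pack the sales table's small Parquet files into larger ones
VACUUM FULL sales;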

Roadmap

pg_analytics is currently in beta.

Features Supported

  • deltalake tables behave like regular Postgres tables and support most Postgres queries (JOINs, CTEs, window functions, etc.)
  • Vacuum and Parquet storage optimization
  • INSERT, TRUNCATE, and COPY (a COPY sketch follows this list)
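As an illustration of COPY on a deltalake table (the file path is hypothetical; sales is the table from the sketch above), bulk loading works as it would on a heap table:

-- hypothetical CSV file; COPY into a deltalake table uses standard syntax
COPY sales FROM '/tmp/sales.csv' WITH (FORMAT csv);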

Known Limitations

As pg_analytics becomes production-ready, many of these will be resolved.

  • UPDATE and DELETE
  • Partitioning tables by column
  • Some Postgres types like arrays, JSON, time, and timestamp with time zone
  • User-defined functions, aggregations, or types
  • Referencing deltalake and regular Postgres heap tables in the same query
  • Write-ahead-log (WAL) support and ROLLBACK
  • Foreign keys
  • Index scans
  • TEMP tables
  • Using an external data lake as a table storage provider
  • Full text search over deltalake tables with pg_bm25

How It Works

pg_analytics introduces column-oriented storage and vectorized query execution to Postgres via Apache Parquet, Arrow, and DataFusion. These libraries are the building blocks of many modern analytical databases.

Column-Oriented Storage

Regular Postgres tables, known as heap tables, are row-oriented. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table. As a result, most dedicated analytical (i.e. OLAP) database systems use a column-oriented layout so that scans only need to access the data from the relevant columns. Column-oriented systems have other advantages for analytics such as improved compression and are more amenable to vectorized execution.
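To make the access pattern concrete, consider a hypothetical query against a wide table events (both names are made up for illustration): a columnar engine reads only the duration column's data, while a row-oriented heap table must fetch entire rows:

-- events is a hypothetical wide table; only the duration column is scanned
SELECT AVG(duration) FROM events;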

Vectorized Query Execution

Vectorized query execution is a technique that takes advantage of modern CPUs to break column-oriented data into batches and process the batches in parallel.

Postgres Integration

pg_analytics embeds Arrow, Parquet, and DataFusion inside Postgres via executor hooks and the table access method API. Executor hooks intercept queries to these tables and reroute them to DataFusion, which generates an optimized query plan, executes the query, and sends the results back to Postgres. The table access method persists Postgres tables as Parquet files and registers them with Postgres' system catalogs. The Parquet files are managed by Delta Lake, which provides ACID transactions.
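One way to observe this routing (a sketch using plain EXPLAIN; the exact plan text depends on the pg_analytics version) is to compare the plan for a query on a deltalake table with the plan for a heap table:

-- plan for the deltalake table is produced via DataFusion
EXPLAIN SELECT COUNT(*) FROM sales;
-- plan for a hypothetical heap table comes from the regular Postgres planner
EXPLAIN SELECT COUNT(*) FROM some_heap_table;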

Development

Install Rust

To develop the extension, first install Rust v1.73.0 using rustup. We will soon make the extension compatible with newer versions of Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh  
rustup install 1.73.0  
  
# We recommend setting the default version to 1.73.0 for consistency across your system  
rustup default 1.73.0  

Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconsistencies with Homebrew's Rust installation on macOS.

Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:

Install Postgres

# macOS  
brew install postgresql@16  
  
# Ubuntu  
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -  
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'  
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16  

If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:

export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"  

Install pgrx

Then, install and initialize pgrx:

# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)  
cargo install --locked cargo-pgrx --version 0.11.2  
  
# macOS arm64  
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config  
  
# macOS amd64  
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config  
  
# Ubuntu  
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config  

If you prefer to use a different version of Postgres, update the --pg flag accordingly.

Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.

Configure Shared Preload Libraries

This extension uses Postgres hooks to intercept Postgres queries. In order to enable these hooks, the extension
must be added to shared_preload_libraries inside postgresql.conf; changes to this setting take effect only after a server restart. If you are using Postgres 16, this file can be found under ~/.pgrx/data-16.

# Inside postgresql.conf  
shared_preload_libraries = 'pg_analytics'  
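After restarting Postgres, a quick sanity check (a standard Postgres command, nothing extension-specific) confirms the library is in the preload list:

-- should include pg_analytics
SHOW shared_preload_libraries;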

Run Without Optimized Build

The extension can be developed with or without an optimized build. An optimized build improves query times by 10-20x but also significantly increases build times.

To launch the extension without an optimized build, run

cargo pgrx run  

Run With Optimized Build

First, switch to latest Rust Nightly (as of writing, 1.77) via:

rustup update nightly  
rustup override set nightly  

Then, reinstall pgrx for the new version of Rust:

cargo install --locked cargo-pgrx --version 0.11.2 --force  

Finally, run to build in release mode with SIMD:

cargo pgrx run --release  

Note that this may take several minutes to execute.

To revert back to the stable version of Rust, run:

rustup override unset  

Run Benchmarks

To run benchmarks locally, enter the pg_analytics/ directory and run cargo clickbench. This runs a minified version of the ClickBench benchmark suite on pg_analytics.
