Skip to content

Latest commit

 

History

History
176 lines (128 loc) · 8.29 KB

20231130_01.md

File metadata and controls

176 lines (128 loc) · 8.29 KB

如何将PostgreSQL query/table 快速导出为parquet文件或导入DuckDB table?

作者

digoal

日期

2023-11-30

标签

PostgreSQL , PolarDB , duckdb , 存储 , parquet , 数据科学家 , 中小企业 , 数据湖 , 数据挖掘


背景

duckdb通常被用于对共享数据进行高速分析, 数据科学家或个人爱好者进行数据挖掘分析. 中小企业用于高效的分析多个业务库的数据, 省钱又高效. 有了duckdb还要啥自行车?

将数据导出到parquet后, 使用DuckDB读取是非常方便的, 这也是duckdb小巧玲珑性能吊炸天的地方.

使用我的学习镜像可以很方便的做这个实验.

《amd64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
《arm64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》

docker run -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts_arm64  
  
docker exec -ti pg bash    
  
su - postgres  

例子, 将PostgreSQL query导出到parquet.

psql -c "copy (select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id) to stdout with (format csv, header on)" | ./duckdb -c "COPY (SELECT * FROM read_csv('/dev/stdin', delim=',', header=true, columns={'id': 'INTEGER', 'info': 'VARCHAR', 'ts': 'timestamp'})) TO '/tmp/test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);"  
postgres@e77ef660623f:~$ ll /tmp/test.parquet   
-rw-r--r-- 1 postgres postgres 211K Nov 30 02:26 /tmp/test.parquet  

语法解释:

PostgreSQL:

copy (query or table) to stdout  

DuckDB:

copy (select csv from stdin) to (table or file)  

导出到parquet后, 使用DuckDB读取是非常方便的, 这也是duckdb小巧玲珑性能吊炸天的地方.

postgres@e77ef660623f:~$ ./duckdb   
v0.9.2 3c695d7ba9  
Enter ".help" for usage hints.  
Connected to a transient in-memory database.  
Use ".open FILENAME" to reopen on a persistent database.  
D select count(*) from '/tmp/test.parquet';  
┌──────────────┐  
│ count_star() │  
│    int64     │  
├──────────────┤  
│        10000 │  
└──────────────┘  
D select * from '/tmp/test.parquet' limit 5;  
┌───────┬──────────────────────────────────┬────────────────────────────┐  
│  id   │               info               │             ts             │  
│ int32 │             varchar              │         timestamp          │  
├───────┼──────────────────────────────────┼────────────────────────────┤  
│     1 │ ee356eca72a6cb8b51402e3f76808f9e │ 2023-11-30 02:26:47.230087 │  
│     2 │ 917689435db54c521d8434283c156a60 │ 2023-11-30 02:26:47.230102 │  
│     3 │ a9d8a98c65b73fb286de1f1f5286e770 │ 2023-11-30 02:26:47.230104 │  
│     4 │ 5ca54f4cf5d63a62a3279c05df6ecca4 │ 2023-11-30 02:26:47.230106 │  
│     5 │ e048d0fa3a713d038bf518970c94c9ec │ 2023-11-30 02:26:47.230107 │  
└───────┴──────────────────────────────────┴────────────────────────────┘  
D .quit  

详细用法:

postgres=# \h copy  
Command:     COPY  
Description: copy data between a file and a table  
Syntax:  
COPY table_name [ ( column_name [, ...] ) ]  
    FROM { 'filename' | PROGRAM 'command' | STDIN }  
    [ [ WITH ] ( option [, ...] ) ]  
    [ WHERE condition ]  
  
COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }  
    TO { 'filename' | PROGRAM 'command' | STDOUT }  
    [ [ WITH ] ( option [, ...] ) ]  
  
where option can be one of:  
  
    FORMAT format_name  
    FREEZE [ boolean ]  
    DELIMITER 'delimiter_character'  
    NULL 'null_string'  
    HEADER [ boolean ]  
    QUOTE 'quote_character'  
    ESCAPE 'escape_character'  
    FORCE_QUOTE { ( column_name [, ...] ) | * }  
    FORCE_NOT_NULL ( column_name [, ...] )  
    FORCE_NULL ( column_name [, ...] )  
    ENCODING 'encoding_name'  
  
URL: https://www.postgresql.org/docs/14/sql-copy.html  

DuckDB:

https://duckdb.org/docs/archive/0.9.2/data/parquet/overview

-- read a single parquet file  
SELECT * FROM 'test.parquet';  
-- figure out which columns/types are in a parquet file  
DESCRIBE SELECT * FROM 'test.parquet';  
-- create a table from a parquet file  
CREATE TABLE test AS SELECT * FROM 'test.parquet';  
-- if the file does not end in ".parquet", use the read_parquet function  
SELECT * FROM read_parquet('test.parq');  
-- use list parameter to read 3 parquet files and treat them as a single table  
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);  
-- read all files that match the glob pattern  
SELECT * FROM 'test/*.parquet';  
-- read all files that match the glob pattern, and include a "filename" column that specifies which file each row came from  
SELECT * FROM read_parquet('test/*.parquet', filename=true);  
-- use a list of globs to read all parquet files from 2 specific folders  
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);  
-- query the metadata of a parquet file  
SELECT * FROM parquet_metadata('test.parquet');  
-- query the schema of a parquet file  
SELECT * FROM parquet_schema('test.parquet');  
  
-- write the results of a query to a parquet file  
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT 'parquet');  
-- write the results from a query to a parquet file with specific compression and row_group_size  
COPY (FROM generate_series(100000)) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);  
  
-- export the table contents of the entire database as parquet  
EXPORT DATABASE 'target_directory' (FORMAT PARQUET);  

digoal's wechat