digoal
2023-11-30
PostgreSQL , PolarDB , duckdb , 存储 , parquet , 数据科学家 , 中小企业 , 数据湖 , 数据挖掘
duckdb通常被用于对共享数据进行高速分析, 数据科学家或个人爱好者进行数据挖掘分析. 中小企业用于高效的分析多个业务库的数据, 省钱又高效. 有了duckdb还要啥自行车?
将数据导出到parquet后, 使用DuckDB读取是非常方便的, 这也是duckdb小巧玲珑性能吊炸天的地方.
使用我的学习镜像可以很方便的做这个实验.
《amd64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
《arm64 , 使用Dockerfile+docker build制作PolarDB | PostgreSQL 开源docker镜像, 集成大量插件方便学习, 并推送到阿里云镜像服务》
docker run -d -it -P --cap-add=SYS_PTRACE --cap-add SYS_ADMIN --privileged=true --name pg --shm-size=1g registry.cn-hangzhou.aliyuncs.com/digoal/opensource_database:pg14_with_exts_arm64
docker exec -ti pg bash
su - postgres
例子, 将PostgreSQL query导出到parquet.
psql -c "copy (select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id) to stdout with (format csv, header on)" | ./duckdb -c "COPY (SELECT * FROM read_csv('/dev/stdin', delim=',', header=true, columns={'id': 'INTEGER', 'info': 'VARCHAR', 'ts': 'timestamp'})) TO '/tmp/test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);"
postgres@e77ef660623f:~$ ll /tmp/test.parquet
-rw-r--r-- 1 postgres postgres 211K Nov 30 02:26 /tmp/test.parquet
语法解释:
PostgreSQL:
copy (query or table) to stdout
DuckDB:
copy (select csv from stdin) to (table or file)
导出到parquet后, 使用DuckDB读取是非常方便的, 这也是duckdb小巧玲珑性能吊炸天的地方.
postgres@e77ef660623f:~$ ./duckdb
v0.9.2 3c695d7ba9
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/test.parquet';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 10000 │
└──────────────┘
D select * from '/tmp/test.parquet' limit 5;
┌───────┬──────────────────────────────────┬────────────────────────────┐
│ id │ info │ ts │
│ int32 │ varchar │ timestamp │
├───────┼──────────────────────────────────┼────────────────────────────┤
│ 1 │ ee356eca72a6cb8b51402e3f76808f9e │ 2023-11-30 02:26:47.230087 │
│ 2 │ 917689435db54c521d8434283c156a60 │ 2023-11-30 02:26:47.230102 │
│ 3 │ a9d8a98c65b73fb286de1f1f5286e770 │ 2023-11-30 02:26:47.230104 │
│ 4 │ 5ca54f4cf5d63a62a3279c05df6ecca4 │ 2023-11-30 02:26:47.230106 │
│ 5 │ e048d0fa3a713d038bf518970c94c9ec │ 2023-11-30 02:26:47.230107 │
└───────┴──────────────────────────────────┴────────────────────────────┘
D .quit
详细用法:
postgres=# \h copy
Command: COPY
Description: copy data between a file and a table
Syntax:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
TO { 'filename' | PROGRAM 'command' | STDOUT }
[ [ WITH ] ( option [, ...] ) ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
URL: https://www.postgresql.org/docs/14/sql-copy.html
DuckDB:
https://duckdb.org/docs/archive/0.9.2/data/parquet/overview
-- read a single parquet file
SELECT * FROM 'test.parquet';
-- figure out which columns/types are in a parquet file
DESCRIBE SELECT * FROM 'test.parquet';
-- create a table from a parquet file
CREATE TABLE test AS SELECT * FROM 'test.parquet';
-- if the file does not end in ".parquet", use the read_parquet function
SELECT * FROM read_parquet('test.parq');
-- use list parameter to read 3 parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- read all files that match the glob pattern
SELECT * FROM 'test/*.parquet';
-- read all files that match the glob pattern, and include a "filename" column that specifies which file each row came from
SELECT * FROM read_parquet('test/*.parquet', filename=true);
-- use a list of globs to read all parquet files from 2 specific folders
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
-- query the metadata of a parquet file
SELECT * FROM parquet_metadata('test.parquet');
-- query the schema of a parquet file
SELECT * FROM parquet_schema('test.parquet');
-- write the results of a query to a parquet file
COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT 'parquet');
-- write the results from a query to a parquet file with specific compression and row_group_size
COPY (FROM generate_series(100000)) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);
-- export the table contents of the entire database as parquet
EXPORT DATABASE 'target_directory' (FORMAT PARQUET);