Skip to content

Latest commit

 

History

History
220 lines (163 loc) · 15.9 KB

20230518_02.md

File metadata and controls

220 lines (163 loc) · 15.9 KB

DuckDB 0.8.0 发布, 支持pivot语法, ASOF JOIN, 并行导入导出性能提升, 递归通配符解析文件, arrow 连接器等

作者

digoal

日期

2023-05-18

标签

PostgreSQL , PolarDB , DuckDB , arrow , pivot , unpivot , 时序 , 就近JOIN , ASOF JOIN , fuzzy join , 通配符递归解析文件 , 并行导入 , 并行导出


背景

https://duckdb.org/2023/05/17/announcing-duckdb-080.html

0.8.0主要增强:

  • 新的 pivot 和 unpivot 语句 (行列转换)
  • 并行数据导入/导出的性能改进
  • 时间序列 时间就近fuzzy JOIN
  • 递归通配
  • 延迟加载存储元数据以缩短启动时间
  • Python 的用户定义函数
  • Arrow 数据库连接 (ADBC) 支持
  • 新的 Swift 集成

例子

1、PIVOT, 行列变换:

CREATE TABLE sales(year INT, amount INT);  
INSERT INTO sales VALUES (2021, 42), (2022, 100), (2021, 42);  
PIVOT sales ON year USING SUM(amount);  
2021 2022
84 100

https://duckdb.org/docs/sql/statements/pivot.html

2、时间字段 fuzzy JOIN, ASOF JOIN:

匹配最相近的一条记录, 而不需要匹配出所有的记录. 通常用在时序数据中, 弥补两段数据的GAP.

CREATE TABLE a(ts TIMESTAMP);  
CREATE TABLE b(ts TIMESTAMP);  
INSERT INTO a VALUES (TIMESTAMP '2023-05-15 10:31:00'), (TIMESTAMP '2023-05-15 11:31:00');  
INSERT INTO b VALUES (TIMESTAMP '2023-05-15 10:30:00'), (TIMESTAMP '2023-05-15 11:30:00');  
  
FROM a ASOF JOIN b ON a.ts >= b.ts;  
a.ts b.ts
2023-05-15 10:31:00 2023-05-15 10:30:00
2023-05-15 11:31:00 2023-05-15 11:30:00

https://duckdb.org/docs/guides/sql_features/asof_join.html

3、Default Parallel CSV Reader:

CREATE TABLE lineitem AS FROM lineitem.csv;  
v0.7.1 v0.8.0
4.1s 1.2s

4、Parallel Parquet, CSV and JSON Writing:

COPY lineitem TO 'lineitem.csv';  
COPY lineitem TO 'lineitem.parquet';  
COPY lineitem TO 'lineitem.json';  
Format v0.7.1 v0.8.0
CSV 3.9s 0.6s
Parquet 8.1s 1.2s
JSON 4.4s 1.1s

5、Recursive File Globbing using **

This release adds support for recursive globbing where an arbitrary number of subdirectories can be matched using the ** operator (double-star).

**表示可以有任意个目录

FROM 'data/glob/crawl/stackoverflow/**/*.csv';  

https://duckdb.org/docs/data/multiple_files/overview

6、Lazy-Loading Table Metadata

按实际查询的字段, 按需(查到对应文件时)再加载 parquet元数据:

DuckDB’s internal storage format stores metadata for every row group in a table, such as min-max indices and where in the file every row group is stored. In the past, DuckDB would load this metadata immediately once the database was opened. However, once the data gets very big, the metadata can also get quite large, leading to a noticeable delay on database startup. In this release, we have optimized the metadata handling of DuckDB to only read table metadata as its being accessed. As a result, startup is near-instantaneous even for large databases, and metadata is only loaded for columns that are actually used in queries. The benchmarks below are for a database file containing a single large TPC-H lineitem table (120x SF1) with ~770 million rows and 16 columns:

Query v0.6.1 v0.7.1 v0.8.0 Parquet
SELECT 42 1.60s 0.31s 0.02s -
FROM lineitem LIMIT 1; 1.62s 0.32s 0.03s 0.27s

7、Arrow Database Connectivity Support (ADBC)

duckdb/duckdb#7086

ADBC is a database API standard for database access libraries that uses Apache Arrow to transfer query result sets and to ingest data. Using Arrow for this is particularly beneficial for columnar data management systems which traditionally suffered a performance hit by emulating row-based APIs such as JDBC/ODBC. From this release, DuckDB natively supports ADBC. We’re happy to be one of the first systems to offer native support, and DuckDB’s in-process design fits nicely with ADBC.

实际上asof join不仅仅限于时间字段:

git clone --depth 1 -b v0.8.0 https://github.com/duckdb/duckdb.git
cd duckdb
env CC=clang CXX=clang++ CMAKE_BUILD_PARALLEL_LEVEL=4 EXTENSION_STATIC_BUILD=1 BUILD_ICU=1 BUILD_TPCH=1 BUILD_TPCDS=1 BUILD_FTS=1 BUILD_VISUALIZER=1 BUILD_HTTPFS=1 BUILD_JSON=1 BUILD_JEMALLOC=1 BUILD_EXCEL=1 BUILD_INET=1 STATIC_OPENSSL=1 BUILD_SQLSMITH=1 BUILD_TPCE=1 make benchmark -j 4 
cd build/release/
./duckdb 


create table a (id int, info text, ts timestamp); 
create table b (id int, info text, ts timestamp); 
insert into a select id, md5(random()::text), now()+(id||' second')::interval from range(0,20) as t (id);
insert into b select id, md5(random()::text), now()+(id||' second')::interval from range(-10,10) as t (id);

D select * from a limit 10;
┌───────┬──────────────────────────────────┬─────────────────────────┐
│  id   │               info               │           ts            │
│ int32 │             varchar              │        timestamp        │
├───────┼──────────────────────────────────┼─────────────────────────┤
│     0 │ 0a8d26ba0733eae2a837837d83dfc923 │ 2023-05-18 10:50:27.295 │
│     1 │ 691088b52403c57e357e66bc93bda156 │ 2023-05-18 10:50:28.295 │
│     2 │ 931c869da887c00b54e33027ff696eb3 │ 2023-05-18 10:50:29.295 │
│     3 │ ab5093ab06dccebb76a6a0fb1459a133 │ 2023-05-18 10:50:30.295 │
│     4 │ 7f88957772ae69af164b523577912835 │ 2023-05-18 10:50:31.295 │
│     5 │ 66b3f99b4673d340ee4d919d72e3f646 │ 2023-05-18 10:50:32.295 │
│     6 │ afea6c151aae4479c05bdaa89aa0a327 │ 2023-05-18 10:50:33.295 │
│     7 │ c282a6103c185e4e922a4c1203f242c5 │ 2023-05-18 10:50:34.295 │
│     8 │ 1c64b4b6f31870c9a87a2be20c76623b │ 2023-05-18 10:50:35.295 │
│     9 │ 121ab519d77570ded665a066038faa47 │ 2023-05-18 10:50:36.295 │
├───────┴──────────────────────────────────┴─────────────────────────┤
│ 10 rows                                                  3 columns │
└────────────────────────────────────────────────────────────────────┘
D select * from b limit 10;
┌───────┬──────────────────────────────────┬─────────────────────────┐
│  id   │               info               │           ts            │
│ int32 │             varchar              │        timestamp        │
├───────┼──────────────────────────────────┼─────────────────────────┤
│  -100 │ f77064d394c675802806c7afe6c68a1e │ 2023-05-18 10:48:47.511 │
│   -99 │ be63cc120dc07ecf4df8ce93c086f045 │ 2023-05-18 10:48:48.511 │
│   -98 │ f4b81d4a3395a870bf0e1956a22d1c1e │ 2023-05-18 10:48:49.511 │
│   -97 │ 420c04ff196f616cf50334d7c98268f1 │ 2023-05-18 10:48:50.511 │
│   -96 │ ea5973616f5a224c6259c9d130072337 │ 2023-05-18 10:48:51.511 │
│   -95 │ a4f7ef6906492beff8e677439050bf01 │ 2023-05-18 10:48:52.511 │
│   -94 │ 0dc6226f4910ee071b9ae6ba50125da1 │ 2023-05-18 10:48:53.511 │
│   -93 │ 45fc24972cad96928bf3bd4a8ad1c0b4 │ 2023-05-18 10:48:54.511 │
│   -92 │ 0a5474a56b6230ffb5f5b6f476e74e92 │ 2023-05-18 10:48:55.511 │
│   -91 │ 62b2253dec6e3fba180addd092af4c55 │ 2023-05-18 10:48:56.511 │
├───────┴──────────────────────────────────┴─────────────────────────┤
│ 10 rows                                                  3 columns │
└────────────────────────────────────────────────────────────────────┘

D select a.*, b.* from a asof join b on a.id>=b.id order by b.id limit 20; 
┌───────┬──────────────────────────────────┬─────────────────────────┬───────┬──────────────────────────────────┬─────────────────────────┐
│  id   │               info               │           ts            │  id   │               info               │           ts            │
│ int32 │             varchar              │        timestamp        │ int32 │             varchar              │        timestamp        │
├───────┼──────────────────────────────────┼─────────────────────────┼───────┼──────────────────────────────────┼─────────────────────────┤
│     0 │ 5f1bf367cf0e4e2672f10ea4d65353ca │ 2023-05-18 10:52:49.953 │     0 │ 752d283d01125dc2fecf172382532005 │ 2023-05-18 10:52:50.105 │
│     1 │ ac971c6674308d62864763c6ad933d57 │ 2023-05-18 10:52:50.953 │     1 │ 7666a1c40ee743b58d0b7ff323545961 │ 2023-05-18 10:52:51.105 │
│     2 │ ff9e1922ec92c4aa0ad5bc8c530997b5 │ 2023-05-18 10:52:51.953 │     2 │ c380583bdf252e7296f3e396b5bb21a3 │ 2023-05-18 10:52:52.105 │
│     3 │ 27e7f143ab038c53cae8c137d95ef627 │ 2023-05-18 10:52:52.953 │     3 │ 8fdc01b93e48884bc7d498b512c58cd1 │ 2023-05-18 10:52:53.105 │
│     4 │ 9ed64d61a93d97248f1168d247a38be7 │ 2023-05-18 10:52:53.953 │     4 │ 4ffefae6e83dd6014686c7cd8d8b9ed7 │ 2023-05-18 10:52:54.105 │
│     5 │ b0f5efbe91f04faad8d4d958cb88c64c │ 2023-05-18 10:52:54.953 │     5 │ c29d920f56b1b05db0eba3d36e52f37e │ 2023-05-18 10:52:55.105 │
│     6 │ d6377deb173023632706f94f49b8c90d │ 2023-05-18 10:52:55.953 │     6 │ 4c92d0f42f45baf59544934e69f48e18 │ 2023-05-18 10:52:56.105 │
│     7 │ 03571148016362afe2200d40407e3922 │ 2023-05-18 10:52:56.953 │     7 │ 8e59279250cb8079aeaab33a6b4429d0 │ 2023-05-18 10:52:57.105 │
│     8 │ 9eb47b01d835bf55830776bc0787fe0a │ 2023-05-18 10:52:57.953 │     8 │ 4e8a43298ba4ee310606eb13d6311352 │ 2023-05-18 10:52:58.105 │
│     9 │ 8c62f0205c469fac306db3d8e63ddfa3 │ 2023-05-18 10:52:58.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    10 │ 057e233aeb577ee81fc9fc302071a5b3 │ 2023-05-18 10:52:59.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    11 │ 0bdbc62254136ac6777eee52d76b3e25 │ 2023-05-18 10:53:00.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    12 │ 7a33f95cc4f64f22b377a408c7a283c8 │ 2023-05-18 10:53:01.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    13 │ f55273e5b0520f5b1d86a42063ed975a │ 2023-05-18 10:53:02.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    14 │ 5191f6b57d1e64b224f0cd251537f24a │ 2023-05-18 10:53:03.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    15 │ 46cdc89141c4d92d4dac64f8ced92a5f │ 2023-05-18 10:53:04.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    16 │ 6802520d34a27b0ee3147626fd97f9c0 │ 2023-05-18 10:53:05.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    17 │ 15b5e4c33ce8fe99e2dc8d870efb8de8 │ 2023-05-18 10:53:06.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    18 │ 56122d9defdc4cbd448e4696f29c6afa │ 2023-05-18 10:53:07.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
│    19 │ 0f302b35dacc9a731ed0eb15905a3ded │ 2023-05-18 10:53:08.953 │     9 │ 55bc6bccb620bfe6c2725088e42d0653 │ 2023-05-18 10:52:59.105 │
├───────┴──────────────────────────────────┴─────────────────────────┴───────┴──────────────────────────────────┴─────────────────────────┤
│ 20 rows                                                                                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

digoal's wechat