DuckDB 对比 PolarDB for PostgreSQL 全文检索功能

作者

digoal

日期

2023-06-14

背景

PolarDB for PostgreSQL 全文检索的优势: 1、可以通过插件支持几乎所有语言; 2、支持数据写入后自动实时更新全文检索索引; 3、全文检索具有非常完整的语义和函数支持功能;

DuckDB的fts插件目前仅支持英文等西方语言、不支持实时更新全文检索索引. 相对来说功能比较弱.

下面简单介绍一下duckdb的fts使用, PolarDB for PostgreSQL的用法见后面的参考部分的内容.

https://duckdb.org/docs/guides/sql_features/full_text_search

https://duckdb.org/docs/extensions/full_text_search

加载fts插件

D install fts;  
D load fts;

创建测试表和测试数据

D create table test (id int primary key, ct text);  
D insert into test values (1, '你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营.');

创建全文索引

D PRAGMA create_fts_index('test', 'id', 'ct');

全文索引被放到fts_$schema_$tablename的一个MACRO match_bm25中.

This PRAGMA builds the index under a newly created schema. The schema will be named after the input table: if an index is created on table 'main.table_name', then the schema will be named 'fts_main_table_name'.

我们可以通过duckdb_functions() 搜索到对应的macro match_bm25

D .maxwidth -1  
D select * from duckdb_functions() where function_name like '%match%';  
  
  
│ memory        │ fts_main_test │ match_bm25        │ macro         │                                                                                                                          │             │ [docname, query_string, conjunctive, b, k, fields] │ [NULL, NULL, NULL, NULL, NULL, NULL] │         │ (WITH subscores AS (SELECT docs.docid, len, term_tf.termid, tf, df, (main.log(((((SELECT num_docs FROM fts_main_test.stats) - df) + 0.5) / (df + 0.5))) * ((tf * (k + 1)) / (tf + (k * ((1 - b) + (b * (len / (SELECT avgdl FROM fts_main_test.stats)))))))) AS subscore FROM (SELECT termid, docid, main.count_star() AS tf FROM qterms GROUP BY docid, termid) AS term_tf INNER JOIN (SELECT docid FROM qterms GROUP BY docid HAVING CASE  WHEN (conjunctive) THEN ((main.count(DISTINCT termid) = (SELECT main.count_star() FROM tokens))) ELSE 1 END) AS cdocs ON ((term_tf.docid = cdocs.docid)) INNER JOIN fts_main_test.docs AS docs ON ((term_tf.docid = docs.docid)) INNER JOIN fts_main_test.dict AS dict ON ((term_tf.termid = dict.termid))), qterms AS (SELECT termid, docid FROM fts_main_test.terms AS terms WHERE (CASE  WHEN ((fields IS NULL)) THEN (1) ELSE (fieldid = ANY(SELECT * FROM fieldids)) END AND (termid = ANY(SELECT qtermids.termid FROM qtermids)))), fieldids AS (SELECT fieldid FROM fts_main_test.fields WHERE CASE  WHEN ((fields IS NULL)) THEN (1) ELSE (field = ANY(SELECT * FROM (SELECT unnest(main.string_split(fields, ','))) AS fsq)) END), qtermids AS (SELECT termid FROM fts_main_test.dict AS dict , tokens WHERE (dict.term = tokens.t)), tokens AS (SELECT DISTINCT main.stem(unnest(fts_main_test.tokenize(query_string)), 'porter') AS t)SELECT score FROM (SELECT docid, main.sum(subscore) AS score FROM subscores GROUP BY docid) AS scores INNER JOIN fts_main_test.docs AS docs ON (((scores.docid = docs.docid) AND (docs."name" = docname)))) │                  │ false    │         1444 │                        │

使用fts_main_test.match_bm25获取要搜索的内容的score.

D  SELECT fts_main_test.match_bm25(id, '德哥') as score,* from test order by score;  
┌────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
│ score  │  id   │                                                    ct                                                    │  
│ double │ int32 │                                                 varchar                                                  │  
├────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
│        │     1 │ 你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营. │  
└────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
D SELECT fts_main_test.match_bm25(id, '阿里云') as score,* from test order by score;  
┌────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
│ score  │  id   │                                                    ct                                                    │  
│ double │ int32 │                                                 varchar                                                  │  
├────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
│        │     1 │ 你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营. │  
└────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
D SELECT fts_main_test.match_bm25(id, '华为') as score,* from test order by score;  
┌────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
│ score  │  id   │                                                    ct                                                    │  
│ double │ int32 │                                                 varchar                                                  │  
├────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
│        │     1 │ 你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营. │  
└────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
D SELECT fts_main_test.match_bm25(id, '中国') as score,* from test order by score;  
┌────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
│ score  │  id   │                                                    ct                                                    │  
│ double │ int32 │                                                 varchar                                                  │  
├────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
│        │     1 │ 你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营. │  
└────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘  
D SELECT fts_main_test.match_bm25(id, 'PolarDB') as score,* from test order by score;  
┌──────────────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
│        score         │  id   │                                                    ct                                                    │  
│        double        │ int32 │                                                 varchar                                                  │  
├──────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤  
│ -0.47712125471966244 │     1 │ 你好 中国 你好德哥,我是来自阿里巴巴数据库部门的高级数据库产品专家,负责阿里云开源数据库PolarDB的社区运营. │  
└──────────────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

中文没有被记录到索引中, score is NULL.

Unlike standard indexes, full text indexes don’t auto-update as the underlying data is changed, so you need to PRAGMA drop_fts_index(my_fts_index) and recreate it when appropriate.

相比之下, PostgreSQL全文检索真的好用多了, 而且还支持多种类型的索引(btree, hash, gin, brin, gist, spgist, bloom, rum 等)共同使用.

《PostgreSQL 9种索引的原理和应用场景》

《PostgreSQL SQL自动优化案例 - 极简，自动推荐索引》

《自动选择正确索引访问接口(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法》

参考

《配置 jieba结巴分词 for PolarDB 实现数据库高性能文本分词搜索》

《如何加快PostgreSQL结巴分词pg_jieba加载速度》

PolarDB for PostgreSQL 还支持模糊查询索引加速. (通过pg_trgm, pgbigm等插件, 对字符串进行切分, 使用gin索引构建token倒排索引进行模糊查询加速.)

《PolarDB 开源版通过 pg_trgm GIN 索引实现高效率 like '%xxx%' 模糊查询》
《PolarDB 开源版通过 parray_gin 实现高效率数组、JSON 内元素的模糊搜索》
《PostgreSQL 数组或JSON内容的模糊匹配索引插件: parray_gin》
《重新发现PostgreSQL之美 - 16 like '%西出函谷关%' 模糊查询》
《PostgreSQL 应用开发解决方案最佳实践系列课程 - 1. 中文分词与模糊查询》
《[直播]在数据库中跑全文检索、模糊查询SQL会不会被开除?》
《PostgreSQL 模糊查询、相似查询 (like '%xxx%') pg_bigm 比 pg_trgm 优势在哪?》
《PostgreSQL 模糊查询增强插件pgroonga , pgbigm (含单字、双字、多字、多字节字符) - 支持JSON模糊查询等》
《PostgreSQL ghtree实现的海明距离排序索引, 性能不错(模糊图像) - pg-knn_hamming - bit string 比特字符串相似度搜索》
《PostgreSQL VagueGeometry vague spatial data - VASA (Vague Spatial Algebra) for PG - 模糊空间数据》
《PostgreSQL bktree 索引using gist例子 - 海明距离检索 - 短文相似、模糊图像搜索 - bit string 比特字符串相似度搜索》
《PostgreSQL+MySQL 联合解决方案 - 第12课视频 - 全文检索、中文分词、模糊查询、相似文本查询》
《PostgreSQL 数组里面的元素，模糊搜索，模糊查询，like，前后百分号，正则查询，倒排索引》
《PostgreSQL 一复合查询SQL优化例子 - (多个exists , 范围检索 , IN检索 , 模糊检索组合)》
《PostgreSQL 模糊查询+大量重复值匹配实践 - 分区索引 = any (array())》
《PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议》
《用PostgreSQL 做实时高效搜索引擎 - 全文检索、模糊查询、正则查询、相似查询、ADHOC查询》
《HTAP数据库 PostgreSQL 场景与性能测试之 12 - (OLTP) 字符串搜索 - 前后模糊查询》
《HTAP数据库 PostgreSQL 场景与性能测试之 9 - (OLTP) 字符串模糊查询 - 含索引实时写入》
《多国语言字符串的加密、全文检索、模糊查询的支持》
《Greenplum 模糊查询实践》
《PostgreSQL 模糊查询最佳实践 - (含单字、双字、多字模糊查询方法)》
《PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎颤抖了》
《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》
《中文模糊查询性能优化 by PostgreSQL trgm》
《PostgreSQL 百亿数据秒级响应正则及模糊查询》

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20230614_01.md

20230614_01.md

DuckDB 对比 PolarDB for PostgreSQL 全文检索功能

作者

日期

标签

背景

参考

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 云原生分布式开源数据库

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、内核开发公开课、生态合作、写心得拿奖品

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥

Files

20230614_01.md

Latest commit

History

20230614_01.md

File metadata and controls

DuckDB 对比 PolarDB for PostgreSQL 全文检索功能

作者

日期

标签

背景

参考

期望 PostgreSQL|开源PolarDB 增加什么功能?

PolarDB 云原生分布式开源数据库

PolarDB 学习图谱: 训练营、培训认证、在线互动实验、解决方案、内核开发公开课、生态合作、写心得拿奖品

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

购买PolarDB云服务折扣活动进行中, 55元起

About 德哥