feat(query): match function support multiple fields with boost #15196

Merged
merged 4 commits into from Apr 16, 2024

@b41sh b41sh commented Apr 9, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  • The match function now supports multiple fields with boosts; it accepts only plain query text, without query syntax.
  • The query function supports searching with query syntax.
  • Added tests for phrase terms and boolean operators.
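
To make the field argument of match concrete, here is a minimal Python sketch of how a string such as 'title^5, description^1.2' (the form used in the example below) could be split into field/boost pairs. The helper name parse_field_boosts and the default boost of 1.0 for a field without an explicit boost are assumptions for illustration; this is not the parser added in this PR.

```python
def parse_field_boosts(spec):
    """Split a spec like 'title^5, description^1.2' into (field, boost) pairs.

    Hypothetical helper: a field without '^boost' is assumed to get 1.0.
    """
    pairs = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        name, sep, boost = part.partition("^")
        pairs.append((name.strip(), float(boost) if sep else 1.0))
    return pairs


print(parse_field_boosts("title^5, description^1.2"))
```

The second argument of match is then treated as plain text, so operators such as AND or + would have no special meaning there.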

The query syntax supports the following forms:

  1. simple terms, like title:quick
  2. boolean operator terms, like title:fox AND dog OR cat
  3. must and negative operator terms, like title:+fox -cat
  4. phrase terms, like title:"quick brown fox"
  5. multiple fields with boost terms, like title:fox^5 content:dog^2
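
The boolean and must/negative forms can be read as set operations over the documents that contain each term. The Python sketch below illustrates that reading using the book titles from the example session further down. The token sets are hand-made assumptions (the real 'chinese' tokenizer may split differently), every term is matched against title only for simplicity, and AND is assumed to bind tighter than OR, which is consistent with the rows returned in the session.

```python
# Hypothetical tokenization of each title, keyed by book id.
TITLE_TOKENS = {
    1: {"bert", "基础", "教程", "transformer", "大", "模型", "实战"},
    2: {"flask", "web", "开发", "基于", "python", "的", "应用", "实战", "第", "2", "版"},
    3: {"apache", "pulsar", "实战"},
    4: {"rust", "程序", "设计", "第", "2", "版"},
    5: {"vue", "js", "设计", "与", "实现"},
    6: {"前端", "架构", "设计"},
}


def docs_with(term):
    """Return the ids of all titles whose token set contains the term."""
    return {doc_id for doc_id, tokens in TITLE_TOKENS.items() if term in tokens}


# title:设计 AND 实现 OR 教程, read as (设计 AND 实现) OR 教程
bool_hits = (docs_with("设计") & docs_with("实现")) | docs_with("教程")

# title:+设计 -实现: must contain 设计, must not contain 实现
must_hits = docs_with("设计") - docs_with("实现")

print(sorted(bool_hits))
print(sorted(must_hits))
```

This only models which rows match; scoring and ranking are not covered by the sketch.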

The functions and syntax follow the design of Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/8.13/sql-functions-search.html
https://www.elastic.co/guide/en/elasticsearch/reference/8.13/query-dsl-query-string-query.html

For example:

mysql> CREATE TABLE books(
    ->   id int,
    ->   title string,
    ->   author string,
    ->   description string
    -> );
Query OK, 0 rows affected (0.11 sec)

mysql> CREATE INVERTED INDEX IF NOT EXISTS idx1 ON books(title, author, description) tokenizer = 'chinese';
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO books VALUES
    -> (1, 'BERT基础教程:Transformer大模型实战', '[印] 苏达哈尔桑·拉维昌迪兰(Sudharsan Ravichandiran)', '本书聚焦谷歌公司开发的BERT自然语言处理模型,由浅入深地介绍了BERT的工作原理、BERT的各种变体及其应用。本书呈现了大量示意图、代码和实例,详细解析了如何训练BERT模型、如何使用BERT模型执行自然语言推理任务、文本摘要任务、问答任务、命名实体识别任务等各种下游任务,以及如何将BERT模型应用于多种语言。通读本书后,读者不仅能够全面了解有关BERT的各种概念、术语和原理,还能够使用BERT模型及其变体执行各种自然语言处理任务。'),
    -> (2, 'Flask Web开发:基于Python的Web应用开发实战(第2版)', '[美]米格尔•格林贝格(Miguel Grinberg)', '本书共分三部分, 全面介绍如何基于Python微框架Flask进行Web开发。第一部分是Flask简介,介绍使用Flask框架及扩展开发Web程序的必备基础知识。第二部分 则给出一个实例,真正带领大家一步步开发完整的博客和社交应用Flasky,从而将前述知识融会贯通,付诸实践。第三部分介绍了发布应用之前必须考虑的事项,如单元测试策略、性能分析技术、Flask程序的部署方式等。第2版针对Python 3.6全面修订。'),
    -> (3, 'Apache Pulsar实战', '[美]戴维·克杰鲁姆加德(David Kjerrumgaard)', 'Apache Pulsar被誉为下一代分布式消息系统,旨在 打通发布/ 订阅式消息传递和流数据分析。本书作者既与Pulsar项目创始成员共事多年,又有在生产环境中使用Pulsar 的丰富经验。正是这些 宝贵的经验成就了这本Pulsar“避坑指南”,为想轻松上手Pulsar的读者铺平了学习之路。本书分为三大部分,共有12章。第一部分概述Pulsar的设计理念和用途。第二部分介绍Pulsar的特性。第三部分以一个虚构的外卖应用程序为例,详细地介绍Pulsar Functions框架的用法,并展示如何用它实现常见的微服务设计模式。本书示例采用Java语言,并同时提供Python实现。'),
    -> (4, 'Rust程序设计(第2版)', '[美]吉姆 • 布兰迪(Jim Blandy)', '本书是Rust领域经典参考书,由业内资深系统程序员编写,广受读者好评。书中全面介绍了Rust这种新型系统编程语言——具有无与伦比的安全性,兼具C和C++的高性能,并大大简化了并发程序的编写。第2 版对上一版内容进行了重组和完善,新增了对“异步编程”的介绍。借助书中的大量案例,你也能用Rust编写出兼顾安全性与高性能的程序。本书内容包括基本数据类型、所有权、引用、表达式、错误处理、crate与模块、结构、枚举与模式等基础知识,以及特型与泛型、闭包、迭代器、 集合、字符串与文本、输入与输出、并发、异步编程、宏等进阶知识。'),
    -> (5, 'Vue.js设计与实现', '霍春阳(HcySunYang)', '本书基于Vue.js 3,从规范出发,以源码为基础,并结合大量直观的配图,循序渐进地讲解Vue.js中各个功能模块的实现,细致剖析框架设计原理。全书共18章,分为六篇,主要内容包括:框架设计概览、响应系统、渲染器、组件化、编译器和服务端渲染等。通过阅读本书,对Vue.js 2/3具有上手经验的开发人员能够进一步理解Vue.js框架的实现细节,没有Vue.js使用经验但对框架设计感兴趣的前端开发人员,能够快速掌握Vue.js的设计原理。'),
    -> (6, '前端架构设计', '[美]迈卡·高保特(Micah Godbolt)', '本书展示了一名成熟的前端架构师对前端开发全面而深刻的理解。作者结合自己在Red Hat公司的项目实战经历,探讨了前端架构原则和前端架构的核心内容,包括工作流程、测试流程和文档记录,以及作为前端架 构师所要承担的具体开发工作,包括HTML、JavaScript和CSS等。');
Query OK, 6 rows affected (0.20 sec)

# match multiple fields with boosts
mysql> SELECT id, score(), title FROM books WHERE match('title^5, description^1.2', '设计 实现') ORDER BY score() DESC;
+------+-----------+-------------------------------+
| id   | score()   | title                         |
+------+-----------+-------------------------------+
|    5 | 16.537666 | Vue.js设计与实现              |
|    6 | 3.9548686 | 前端架构设计                  |
|    4 | 3.4990246 | Rust程序设计(第2版)         |
|    3 |  3.209204 | Apache Pulsar实战             |
+------+-----------+-------------------------------+
4 rows in set (0.15 sec)
Read 6 rows, 281.00 B in 0.070 sec., 86.23 rows/sec., 3.94 KiB/sec.

# query simple terms
mysql> SELECT id, score(), title FROM books WHERE query('title:实战') ORDER BY score() DESC;
+------+------------+---------------------------------------------------------------------+
| id   | score()    | title                                                               |
+------+------------+---------------------------------------------------------------------+
|    3 |  0.8460867 | Apache Pulsar实战                                                   |
|    1 | 0.66167223 | BERT基础教程:Transformer大模型实战                                 |
|    2 |  0.4986443 | Flask Web开发:基于Python的Web应用开发实战(第2版)                 |
+------+------------+---------------------------------------------------------------------+
3 rows in set (0.16 sec)
Read 6 rows, 281.00 B in 0.064 sec., 93.33 rows/sec., 4.27 KiB/sec.

# query phrase terms
mysql> SELECT id, score(), title FROM books WHERE query('title:"设计与实现"') ORDER BY score() DESC;
+------+-----------+-----------------------+
| id   | score()   | title                 |
+------+-----------+-----------------------+
|    5 | 4.3066816 | Vue.js设计与实现      |
+------+-----------+-----------------------+
1 row in set (0.17 sec)
Read 6 rows, 281.00 B in 0.079 sec., 76.27 rows/sec., 3.49 KiB/sec.

# query bool operator terms
mysql> SELECT id, score(), title FROM books WHERE query('title:设计 AND 实现 OR 教程') ORDER BY score() DESC;
+------+-----------+-----------------------------------------------+
| id   | score()   | title                                         |
+------+-----------+-----------------------------------------------+
|    5 | 2.5488276 | Vue.js设计与实现                              |
|    1 | 1.4704955 | BERT基础教程:Transformer大模型实战           |
+------+-----------+-----------------------------------------------+
2 rows in set (0.16 sec)
Read 6 rows, 281.00 B in 0.073 sec., 82.43 rows/sec., 3.77 KiB/sec.

# query must and negative operator terms
mysql> SELECT id, score(), title FROM books WHERE query('title:+设计 -实现') ORDER BY score() DESC;
+------+------------+-------------------------------+
| id   | score()    | title                         |
+------+------------+-------------------------------+
|    6 |  0.7909737 | 前端架构设计                  |
|    4 | 0.69980496 | Rust程序设计(第2版)         |
+------+------------+-------------------------------+
2 rows in set (0.09 sec)
Read 6 rows, 281.00 B in 0.047 sec., 128.51 rows/sec., 5.88 KiB/sec.

# query multiple fields with boosts
mysql> SELECT id, score(), title FROM books WHERE query('title:教程^5 description:(指南 流程)^1.2') ORDER BY score() DESC;
+------+-----------+-----------------------------------------------+
| id   | score()   | title                                         |
+------+-----------+-----------------------------------------------+
|    1 | 7.3524776 | BERT基础教程:Transformer大模型实战           |
|    6 |  2.859898 | 前端架构设计                                  |
|    3 | 1.7030048 | Apache Pulsar实战                             |
+------+-----------+-----------------------------------------------+
3 rows in set (0.17 sec)
Read 6 rows, 281.00 B in 0.086 sec., 70.15 rows/sec., 3.21 KiB/sec.
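
The scores above suggest that a boost acts as a multiplier on the per-field score. Books 6 and 4 match 设计 only in title, so their scores from match('title^5, description^1.2', '设计 实现') can be compared with their unboosted scores from query('title:+设计 -实现'). The Python sketch below does that arithmetic with the numbers copied from the output above. It is an observation about these results under the assumption that the negative clause does not contribute to the score, not a statement about how scoring is implemented.

```python
import math

TITLE_BOOST = 5.0

# Scores copied from the session output, keyed by book id.
unboosted = {6: 0.7909737, 4: 0.69980496}  # query('title:+设计 -实现')
boosted = {6: 3.9548686, 4: 3.4990246}     # match('title^5, description^1.2', '设计 实现')

for doc_id, base in unboosted.items():
    predicted = TITLE_BOOST * base
    # Allow for the rounding of the printed single-precision scores.
    agrees = math.isclose(predicted, boosted[doc_id], rel_tol=1e-6)
    print(doc_id, predicted, boosted[doc_id], agrees)
```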

Part of #14825

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@b41sh b41sh requested a review from sundy-li April 9, 2024 10:31
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 9, 2024
@b41sh b41sh requested a review from BohuTANG April 9, 2024 10:31
@b41sh b41sh marked this pull request as draft April 9, 2024 11:58
@b41sh b41sh marked this pull request as ready for review April 15, 2024 10:01
@b41sh b41sh added this pull request to the merge queue Apr 16, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Apr 16, 2024
@BohuTANG BohuTANG merged commit 1c5286b into datafuselabs:main Apr 16, 2024
72 checks passed