Merged
12 changes: 6 additions & 6 deletions docs/en/docs/data-table/index/inverted-index.md
@@ -74,15 +74,15 @@ The features for inverted index is as follows:
- missing stands for no parser, the whole field is considered to be a term
- "english" stands for english parser
- "chinese" stands for chinese parser
- "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
- "unicode" stands for multi-language mixed word segmentation, suitable for text that mixes Chinese and English or several languages. It can segment email prefixes and suffixes, IP addresses, and mixed letters and digits, and it segments Chinese text character by character.

- "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
- in "fine_grained" mode, the system will meticulously tokenize each possible segment.
- in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
- in "fine_grained" mode, the system tends to generate short words, e.g. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉市长江大桥'.
- in "coarse_grained" mode, the system tends to generate long words, e.g. 2 words '武汉市' '长江大桥' for '武汉市长江大桥'.
- default mode is "coarse_grained".
- "support_phrase" is utilized to specify if the index requires support for phrase mode.
- "true" indicates that support is needed.
- "false" indicates that support is not needed.
- "support_phrase" specifies whether the index must support the phrase query MATCH_PHRASE.
- "true" indicates that phrase support is needed, at the cost of more index storage.
- "false" indicates that phrase support is not needed, which saves index storage. MATCH_ALL can still be used to match multiple words regardless of their order.
- default mode is "false".
- COMMENT is optional
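
The difference between the two `parser_mode` values can be illustrated with a toy dictionary-based segmenter. This is only a sketch under stated assumptions: the dictionary, the function names, and the matching strategy (all dictionary hits for fine-grained, forward maximum matching for coarse-grained) are hypothetical and are not taken from the Doris tokenizer. The input used is '武汉市长江大桥', the sentence that contains every word listed in the example.

```python
# Toy dictionary-based segmentation; NOT the actual Doris Chinese tokenizer.
# DICTIONARY is a hypothetical word list chosen to reproduce the documented example.
DICTIONARY = {"武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """Emit every dictionary word found at every position (short, overlapping words)."""
    words = []
    for start in range(len(text)):
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in DICTIONARY:
                words.append(text[start:end])
    return words


def coarse_grained(text):
    """Forward maximum matching: at each position keep only the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted so the scan can never stall.
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                pos += length
                break
    return words


if __name__ == "__main__":
    print(fine_grained("武汉市长江大桥"))
    print(coarse_grained("武汉市长江大桥"))
```

With this toy dictionary, the fine-grained function yields the six short words and the coarse-grained function yields the two long ones, which is the trade-off the two modes describe: more terms to match against versus a smaller index.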

2 changes: 1 addition & 1 deletion docs/en/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>

In order to improve the like query performance, the NGram BloomFilter index was implemented, which referenced to the ClickHouse's ngrambf skip indices;
To improve the performance of LIKE queries, the NGram BloomFilter index was implemented.
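
As a rough illustration of why an n-gram Bloom filter can speed up LIKE queries, the sketch below builds one filter per block of values and uses it to skip blocks that cannot contain a pattern. It shows the general technique only, not the Doris implementation: the gram size, filter size, number of hash functions, and all function names are assumptions made for this example.

```python
import hashlib

GRAM_SIZE = 3      # illustrative n-gram length (assumption)
FILTER_BITS = 256  # illustrative Bloom filter size in bits (assumption)
NUM_HASHES = 2     # illustrative number of hash functions (assumption)


def ngrams(text, n=GRAM_SIZE):
    """Return all contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bit_positions(gram):
    """Derive NUM_HASHES bit positions for one n-gram from a SHA-256 digest."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * k:4 * k + 4], "big") % FILTER_BITS
            for k in range(NUM_HASHES)]


def build_filter(values):
    """Build one Bloom filter (an int used as a bitmask) over a block of values."""
    bits = 0
    for value in values:
        for gram in ngrams(value):
            for pos in bit_positions(gram):
                bits |= 1 << pos
    return bits


def may_contain(bits, pattern):
    """False: no value in the block can match LIKE '%pattern%'. True: it might."""
    grams = ngrams(pattern)
    if not grams:  # pattern shorter than GRAM_SIZE, so the filter cannot help
        return True
    return all((bits >> pos) & 1 for gram in grams for pos in bit_positions(gram))


if __name__ == "__main__":
    block = ["hello world", "inverted index"]
    block_filter = build_filter(block)
    print(may_contain(block_filter, "world"))
```

A Bloom filter never reports a false negative, so a block whose filter rejects the pattern can be skipped without reading it; a positive answer still requires scanning the block, because false positives are possible.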

## Create Column With NGram BloomFilter Index

24 changes: 12 additions & 12 deletions docs/zh-CN/docs/data-table/index/inverted-index.md
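@@ -52,7 +52,7 @@ A brief introduction to the features of the Doris inverted index:
- Adds full-text search on string types
- Supports full-text search on strings, including matching all keywords at once with MATCH_ALL, matching any one keyword with MATCH_ANY, and matching a phrase with MATCH_PHRASE
- Supports full-text search on string array types
- Supports English, Chinese, and mixed-type word segmentation
- Supports English, Chinese, and Unicode multi-language word segmentation
- Accelerates ordinary equality and range queries, covering the functionality of the bitmap index, which it will replace in the future
- Supports fast filtering with =, !=, >, >=, <, <= on string, numeric, and datetime types
- Supports =, !=, >, >=, <, <= on string, numeric, and datetime array types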
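@@ -72,16 +72,16 @@ A brief introduction to the features of the Doris inverted index:
- parser specifies the tokenizer
- if not specified (the default), no tokenization is performed
- english is English tokenization, suitable when the indexed column is English; it splits on spaces and punctuation and has high performance
- chinese is Chinese tokenization, suitable when the indexed column contains Chinese or mixed Chinese and English; its performance is lower than english tokenization
- unicode is mixed-type tokenization, suitable for mixed Chinese and English. It can tokenize email prefixes and suffixes, IP addresses, and mixed characters and digits, and can tokenize Chinese characters into 1-grams
- parser_mode specifies the mode of Chinese tokenization
- fine_grained mode: the system exhaustively tokenizes every part that can be segmented
- coarse_grained mode: the system follows the maximization principle, performing accurate and comprehensive tokenization
- the default is coarse_grained mode
- support_phrase specifies whether the index needs to support phrase mode
- true means it is needed
- false means it is not needed
- the default is false (not needed)
- chinese is Chinese tokenization, suitable when the indexed column is mainly Chinese; its performance is lower than english tokenization
- unicode is multi-language mixed tokenization, suitable for mixed Chinese and English or mixed multi-language text. It can tokenize email prefixes and suffixes, IP addresses, and mixed characters and digits, and tokenizes Chinese character by character
- parser_mode specifies the tokenization mode; currently the following modes are supported when parser = chinese:
- fine_grained: fine-grained mode, which tends to produce shorter words, e.g. '武汉市长江大桥' is split into the 6 words '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥'
- coarse_grained: coarse-grained mode, which tends to produce longer words, e.g. '武汉市长江大桥' is split into the 2 words '武汉市' '长江大桥'
- the default is coarse_grained
- support_phrase specifies whether the index supports acceleration of MATCH_PHRASE phrase queries
- true means supported, but the index needs more storage space
- false means not supported, which saves storage space; MATCH_ALL can be used to query multiple keywords
- the default is false
- COMMENT is optional and specifies a comment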
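
The storage trade-off behind `support_phrase` can be sketched with a toy inverted index. An order-insensitive match in the style of MATCH_ALL only needs to know which documents contain each term, while a phrase match in the style of MATCH_PHRASE also needs term positions, which is extra data to store. Everything here is an assumption made for illustration (whitespace tokenization, the function names, the in-memory layout); it is not how Doris stores or evaluates its index.

```python
from collections import defaultdict


def build_index(docs, with_positions):
    """Map each term to {doc_id: [positions]}; positions stay empty if not requested."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            postings = index[term].setdefault(doc_id, [])
            if with_positions:
                postings.append(pos)
    return index


def match_all(index, query):
    """Documents containing every query term, in any order (no positions needed)."""
    terms = query.lower().split()
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def match_phrase(index, query):
    """Documents where the query terms appear consecutively and in order."""
    terms = query.lower().split()
    hits = []
    for doc_id in match_all(index, query):
        starts = index[terms[0]][doc_id]
        # The phrase matches if some start position is followed by each later term.
        if any(all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    docs = ["quick brown fox", "brown quick fox"]
    index = build_index(docs, with_positions=True)
    print(match_all(index, "quick brown"))
    print(match_phrase(index, "quick brown"))
```

In this sketch both documents satisfy the order-insensitive match, but only the first satisfies the phrase match. An index built with `with_positions=False` is smaller but cannot answer the phrase query at all.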

```sql
@@ -150,7 +150,7 @@ USE test_inverted_index;

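-- Create the inverted index idx_comment on comment while creating the table
-- USING INVERTED specifies that the index type is an inverted index
-- PROPERTIES("parser" = "english") selects english tokenization; "chinese" (Chinese) and "unicode" (mixed Chinese-English) tokenization are also supported; omitting the "parser" parameter means no tokenization
-- PROPERTIES("parser" = "english") selects english tokenization; "chinese" (Chinese) and "unicode" (mixed Chinese-English and multi-language) tokenization are also supported; omitting the "parser" parameter means no tokenization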
CREATE TABLE hackernews_1m
(
`id` BIGINT,
@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>

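To improve the performance of like queries, the NGram BloomFilter index was added; its implementation mainly follows ClickHouse's ngrambf
To improve the performance of like queries, the NGram BloomFilter index was added.

## Creating an NGram BloomFilter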
