Support fulltext index #3915

wubx · 2022-01-20T01:40:05Z

Discussed in #3899

^{Originally posted by wubx January 19, 2022}
Why we need fulltest index ：

In APM a lot of query like ： where url like "%/user/%"
The Game system of the service a lot query like : where action like "%win%"
Sample fulltext search replace Es

Clickhouse support fulltext index use :

ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Stores a Bloom filter that contains all ngrams from a block of data. Works only with datatypes: String, FixedString and Map. Can be used for optimization of EQUALS, LIKE and IN expressions.

n — ngram size,
size_of_bloom_filter_in_bytes — Bloom filter size in bytes (you can use large values here, for example, 256 or 512, because it can be compressed well).
number_of_hash_functions — The number of hash functions used in the Bloom filter.
random_seed — The seed for Bloom filter hash functions.

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

The same as ngrambf_v1, but stores tokens instead of ngrams. Tokens are sequences separated by non-alphanumeric characters.

The text was updated successfully, but these errors were encountered:

b41sh · 2022-02-17T06:49:28Z

/assignme

BohuTANG · 2022-06-01T04:06:08Z

Seems we can use tanivy, and store the full text in our fuse engine?
https://github.com/quickwit-oss/tantivy

Example:
https://github.com/quickwit-oss/tantivy/tree/main/examples

cc @b41sh

b41sh · 2022-06-01T13:02:19Z

Seems we can use tanivy, and store the full text in our fuse engine? https://github.com/quickwit-oss/tantivy

Example: https://github.com/quickwit-oss/tantivy/tree/main/examples

cc @b41sh

It's a good idea, we can try it.

sunisdown · 2022-06-08T06:41:30Z

@b41sh @BohuTANG Any plans for development?

wubx · 2022-06-08T09:57:15Z

https://github.com/zhihu/rucene

b41sh · 2022-06-08T13:05:03Z

@b41sh @BohuTANG Any plans for development?

I will start working on this issue this week.

BohuTANG · 2022-06-10T11:16:58Z

Also cc @zhang2014

BohuTANG · 2022-06-11T07:22:12Z

Benchmark

Databend

Table:

create table liket(c STRING);
insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824  rows
insert into liket select * from liket;

select count(*) from liket where c like '%kd%';
+------------+
| count()    |
+------------+
| 1073741824 |
+------------+
1 row in set (22.50 sec)
Read 1073741824 rows, 38.00 GiB in 22.392 sec., 47.95 million rows/sec., 1.70 GiB/sec.

ClickHouse with ngram index

Table:

CREATE TABLE liket
(
    `c` String,
    INDEX fulltext_index c TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY c;

insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824  rows
insert into liket select * from liket;

thinkpad :) select count(*) from liket where c like '%kd%';

SELECT count(*)
FROM liket
WHERE c LIKE '%kd%'

Query id: dce045a0-175b-4b6d-9751-0702fb3cb115

┌────count()─┐
│ 1073741824 │
└────────────┘

1 rows in set. Elapsed: 4.904 sec. Processed 1.07 billion rows, 41.88 GB (218.97 million rows/s., 8.54 GB/s.)

cc @sundy-li @b41sh

sundy-li · 2022-06-11T11:45:24Z

How about this sql :
select ignore( c) from liket;

I think the bottleneck is mostly on IO read.

BohuTANG · 2022-06-11T14:40:56Z

@sundy-li

select ignore(c) from liket where c like '%kd%';

or

select ignore( c) from liket;

have a hang there :/

This test is easy to run on your local, I think there are some issues with the query.

sundy-li · 2022-06-11T14:42:15Z

Test on my laptap, the bottleneck is mostly on IO read & regexp match I think.
ClickHouse:

CREATE TABLE liket_noindex (c String) ENGINE = MergeTree ORDER BY c;


insert into liket  select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648);  -- 30.81 MB/s
insert into liket_noindex  select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648); -- 245.13 MB/s


-- Actually the full index did not help this query, because it can't prune any block.
select count(*) from liket where c like '%kd%';  -- 0.846 sec
select count(*) from liket_noindex where c like '%kd%'; -- 1.053 sec 


select count(ignore(c)) from liket; -- 0.434 sec
select count(ignore(c)) from liket_noindex; -- 0.552 sec

sundy-li · 2022-06-11T23:17:32Z

In memory bench.

databend :) select sum( to_string(number) like '98%') from numbers(5000000000);
+-------------------------------------+
| sum((to_string(number) like '98%')) |
+-------------------------------------+
|                            11111111 |
+-------------------------------------+
1 row in set (6.52 sec)
Read 5000000000 rows, 37.25 GiB in 6.517 sec., 767.27 million rows/sec., 5.72 GiB/sec.

databend :) select sum( to_string(number) like '%98%') from numbers(5000000000);

+--------------------------------------+
| sum((to_string(number) like '%98%')) |
+--------------------------------------+
|                            389599750 |
+--------------------------------------+
1 row in set (17.32 sec)
Read 5000000000 rows, 37.25 GiB in 17.314 sec., 288.78 million rows/sec., 2.15 GiB/sec.

clickhouse :) select sum( toString(number) like '98%') from numbers_mt(5000000000);

SELECT sum(toString(number) LIKE '98%')
FROM numbers_mt(5000000000)

Query id: de75caf9-57b2-4e37-b258-489b76250241

┌─sum(like(toString(number), '98%'))─┐
│                           11111111 │
└────────────────────────────────────┘

1 rows in set. Elapsed: 7.602 sec. Processed 5.00 billion rows, 40.00 GB (657.74 million rows/s., 5.26 GB/s.)

clickhouse :)  select sum( toString(number) like '%98%') from numbers_mt(5000000000);

┌─sum(like(toString(number), '%98%'))─┐
│                           389599750 │
└─────────────────────────────────────┘

1 rows in set. Elapsed: 6.529 sec. Processed 5.00 billion rows, 40.00 GB (765.77 million rows/s., 6.13 GB/s.)

BohuTANG · 2022-06-12T00:42:43Z

Good.
Does the like function in databend have some improvement room?

sundy-li · 2022-06-12T00:48:01Z

Yes, need profile investigation.

sunisdown · 2022-06-14T09:42:07Z

Do we have RFC for the fulltext index? tanivy is a good choice and we can discuss about the design.

b41sh · 2022-06-14T10:42:46Z

Do we have RFC for the fulltext index? tanivy is a good choice and we can discuss about the design.

tantivy only supports fs storage of data, we cannot use it directly.
Maybe we need to implement an inverted index ourselves to support full-text search.

sundy-li · 2022-06-14T12:41:19Z

Do we have RFC for the fulltext index?

A simple RFC in Chinese.

Databend 全文索引.pdf

sunisdown · 2022-06-15T03:07:50Z

Do we have RFC for the fulltext index?

A simple RFC in Chinese.

Databend 全文索引.pdf

Prefer the second solution, the first solution looks like another quickwit.

ariesdevil · 2022-10-04T18:33:58Z

ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

sunisdown · 2022-10-05T01:22:02Z

The implementation is interesting in that it uses an ID to join the data from the Clickhouse query to the data from the Quickwit query. I think it can be used with a small amount of data.

ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

BohuTANG · 2023-01-31T10:11:08Z

Tracking at #9811

ariesdevil · 2023-04-10T07:48:19Z

ClickHouse uses quickwit implements full-text search. clickhouse.com/docs/en/guides/developer/full-text-search↳

ClickHouse delete this content and redirect this link to the new full-text-indexing feature, lol.

BohuTANG · 2023-04-14T06:43:41Z

qdrant supports both fulltext index and vector index:
https://qdrant.tech/documentation/indexing/

BohuTANG · 2024-01-29T05:53:23Z

Move to #14505

BohuTANG added C-feature Category: feature A-query Area: databend query labels Jan 20, 2022

BohuTANG changed the title ~~support fulltext index~~ Support fulltext index Jan 20, 2022

BohuTANG mentioned this issue Jan 20, 2022

Roadmap 2022 #3706

Closed

4 tasks

zbtzbtzbt mentioned this issue Feb 15, 2022

Todo items for index #4163

Open

4 tasks

databend-bot assigned b41sh Feb 17, 2022

databend-bot added the community-take label Feb 17, 2022

BohuTANG mentioned this issue Jun 14, 2022

When running SELECT NULLIF(1, 1); and SELECT NULLIF('a', 'a');, the mysql service will hang. #5959

Closed

BohuTANG mentioned this issue Jan 21, 2023

Roadmap 2023 #9448

Open

9 tasks

BohuTANG mentioned this issue Mar 3, 2023

Release proposal: Nightly v1.1 #10334

Closed

6 tasks

BohuTANG mentioned this issue Apr 14, 2023

Release proposal: Nightly v1.2 #11073

Closed

7 tasks

BohuTANG closed this as completed Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support fulltext index #3915

Support fulltext index #3915

wubx commented Jan 20, 2022

b41sh commented Feb 17, 2022

BohuTANG commented Jun 1, 2022 •

edited

Loading

b41sh commented Jun 1, 2022

sunisdown commented Jun 8, 2022

wubx commented Jun 8, 2022

b41sh commented Jun 8, 2022

BohuTANG commented Jun 10, 2022

BohuTANG commented Jun 11, 2022 •

edited

Loading

sundy-li commented Jun 11, 2022 •

edited

Loading

BohuTANG commented Jun 11, 2022

sundy-li commented Jun 11, 2022 •

edited

Loading

sundy-li commented Jun 11, 2022 •

edited

Loading

BohuTANG commented Jun 12, 2022

sundy-li commented Jun 12, 2022

sunisdown commented Jun 14, 2022

b41sh commented Jun 14, 2022

sundy-li commented Jun 14, 2022

sunisdown commented Jun 15, 2022

ariesdevil commented Oct 4, 2022

sunisdown commented Oct 5, 2022

BohuTANG commented Jan 31, 2023

ariesdevil commented Apr 10, 2023

BohuTANG commented Apr 14, 2023

BohuTANG commented Jan 29, 2024

Support fulltext index #3915

Support fulltext index #3915

Comments

wubx commented Jan 20, 2022

Discussed in #3899

b41sh commented Feb 17, 2022

BohuTANG commented Jun 1, 2022 • edited Loading

b41sh commented Jun 1, 2022

sunisdown commented Jun 8, 2022

wubx commented Jun 8, 2022

b41sh commented Jun 8, 2022

BohuTANG commented Jun 10, 2022

BohuTANG commented Jun 11, 2022 • edited Loading

Benchmark

Databend

ClickHouse with ngram index

sundy-li commented Jun 11, 2022 • edited Loading

BohuTANG commented Jun 11, 2022

sundy-li commented Jun 11, 2022 • edited Loading

sundy-li commented Jun 11, 2022 • edited Loading

BohuTANG commented Jun 12, 2022

sundy-li commented Jun 12, 2022

sunisdown commented Jun 14, 2022

b41sh commented Jun 14, 2022

sundy-li commented Jun 14, 2022

sunisdown commented Jun 15, 2022

ariesdevil commented Oct 4, 2022

sunisdown commented Oct 5, 2022

BohuTANG commented Jan 31, 2023

ariesdevil commented Apr 10, 2023

BohuTANG commented Apr 14, 2023

BohuTANG commented Jan 29, 2024

BohuTANG commented Jun 1, 2022 •

edited

Loading

BohuTANG commented Jun 11, 2022 •

edited

Loading

sundy-li commented Jun 11, 2022 •

edited

Loading

sundy-li commented Jun 11, 2022 •

edited

Loading

sundy-li commented Jun 11, 2022 •

edited

Loading