[fix](inverted index) Fix empty string MATCH on keyword index returning wrong results #60500

Merged
yiguolei merged 1 commit into apache:master from airborne12:fix-empty-string-match
Feb 5, 2026

@airborne12
Member

Proposed changes

Fix empty string MATCH on keyword index returning wrong results.

The multi-analyzer feature commit (2c950e1) incorrectly added an empty string check that prevented MATCH '' from finding rows with empty string values in keyword indexes.

For a keyword index (no tokenization), the empty string is a valid exact-match value and should be matchable. The previous code skipped empty query strings with the comment "empty query should match nothing", which is wrong for keyword indexes.

Problem

-- Table with keyword index (no parser)
CREATE TABLE test (id INT, col TEXT, INDEX idx(col) USING INVERTED);
INSERT INTO test VALUES (1, ''), (2, 'data');

-- Before fix: returns 0 (WRONG!)
-- After fix: returns 1 (CORRECT!)
SELECT count() FROM test WHERE col MATCH '';
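
The before/after behavior of this example can be mimicked with a toy model. The sketch below is not the Doris implementation: `ToyKeywordIndex`, `build_example_index`, and the `skip_empty_query` flag are hypothetical names introduced only for illustration. It assumes that a keyword index stores each column value, including `''`, as a single untokenized term.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model only: this is NOT the Doris inverted index implementation.
// A keyword index is assumed to store each column value as one untokenized term.
class ToyKeywordIndex {
public:
    // Index one row; the whole value (including "") becomes a single term.
    void add_row(int row_id, const std::string& value) {
        assert(row_id >= 0);  // row ids are non-negative in this toy model
        postings_[value].push_back(row_id);
    }

    // Exact-term lookup. `skip_empty_query` is a hypothetical flag that
    // emulates the removed guard ("empty query should match nothing").
    std::size_t count_match(const std::string& query, bool skip_empty_query) const {
        if (skip_empty_query && query.empty()) {
            return 0;  // pre-fix behavior: the empty term is never looked up
        }
        auto it = postings_.find(query);
        return it == postings_.end() ? 0 : it->second.size();
    }

private:
    std::map<std::string, std::vector<int>> postings_;  // term -> row ids
};

// Builds the two-row table from the example: (1, ''), (2, 'data').
inline ToyKeywordIndex build_example_index() {
    ToyKeywordIndex index;
    index.add_row(1, "");
    index.add_row(2, "data");
    return index;
}
```

In this model, `build_example_index().count_match("", true)` gives 0 (the guard short-circuits the lookup) and `count_match("", false)` gives 1, mirroring the counts reported before and after the fix.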

Changes

This fix removes the empty string check for keyword index paths in:

  • be/src/vec/functions/match.cpp (slow path)
  • be/src/olap/rowset/segment_v2/inverted_index_reader.cpp (index path)
  • be/src/olap/rowset/segment_v2/inverted_index/analyzer/analyzer.cpp
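
As a rough illustration of the slow path (evaluation without the index), the sketch below assumes that a keyword MATCH reduces to a per-row exact comparison. It is a toy model, not the code in `match.cpp`; the function name `slow_path_keyword_match` and the `skip_empty_query` switch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only, not the code in match.cpp. Without an index, a keyword
// MATCH is assumed here to reduce to a per-row exact comparison.
// `skip_empty_query` is a hypothetical switch emulating the removed guard.
inline std::vector<std::size_t> slow_path_keyword_match(
        const std::vector<std::string>& column,
        const std::string& query,
        bool skip_empty_query) {
    std::vector<std::size_t> hits;  // positions of matching rows
    if (skip_empty_query && query.empty()) {
        return hits;  // pre-fix: bail out before looking at any row
    }
    for (std::size_t row = 0; row < column.size(); ++row) {
        if (column[row] == query) {  // exact match; "" == "" is a valid hit
            hits.push_back(row);
        }
    }
    assert(hits.size() <= column.size());  // sanity check on the result
    return hits;
}
```

With the guard disabled, a column holding `""` and `"data"` yields one hit for the empty query, which is the same answer the index path is expected to give; keeping the two paths consistent is the point of changing both.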

Added regression test test_empty_string_match.groovy to cover:

  • Empty string match on keyword index (both index and slow paths)
  • Empty string match on tokenized index (should return 0)
  • match_any and match_all with empty string
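
Why a tokenized index should still return 0 for an empty query can be sketched as follows. This is a toy analyzer under stated assumptions, not a Doris parser: `toy_tokenize`, `toy_match_any`, and `toy_match_all` are hypothetical helpers, and the tokenization rule (lowercase, split on non-alphanumeric characters) is chosen only for illustration. The idea is that an empty query yields no tokens, so there is nothing to look up.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Toy analyzer, NOT a Doris parser: lowercases and splits on non-alphanumerics.
inline std::vector<std::string> toy_tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    // Invariant: the tokenizer never emits "", so '' has no term to match.
    for (const std::string& token : tokens) {
        assert(!token.empty());
    }
    return tokens;
}

// match_any: at least one query token occurs in the row.
inline bool toy_match_any(const std::string& row, const std::string& query) {
    const std::vector<std::string> row_tokens = toy_tokenize(row);
    const std::set<std::string> row_terms(row_tokens.begin(), row_tokens.end());
    for (const std::string& token : toy_tokenize(query)) {
        if (row_terms.count(token) > 0) {
            return true;
        }
    }
    return false;  // also reached when the query has no tokens
}

// match_all: every query token occurs in the row. An empty token list is
// rejected explicitly; otherwise the loop below would be vacuously true.
inline bool toy_match_all(const std::string& row, const std::string& query) {
    const std::vector<std::string> query_tokens = toy_tokenize(query);
    if (query_tokens.empty()) {
        return false;
    }
    const std::vector<std::string> row_tokens = toy_tokenize(row);
    const std::set<std::string> row_terms(row_tokens.begin(), row_tokens.end());
    for (const std::string& token : query_tokens) {
        if (row_terms.count(token) == 0) {
            return false;
        }
    }
    return true;
}
```

In this model both `toy_match_any(row, "")` and `toy_match_all(row, "")` are false for every row, including a row whose value is itself empty, which is the tokenized-index behavior the regression test expects.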

Check List (For Author)

  • Test

    • [x] Regression test
    • [x] Unit Test
    • [ ] Manual test
    • [ ] No need to test
  • Behavior changed:

    • [x] Yes. MATCH '' on keyword index now correctly matches rows with empty string values.
  • Does this need documentation?

    • [ ] No.

@airborne12 airborne12 requested a review from zclllyybb as a code owner February 4, 2026 10:24
@Thearas
Contributor

Thearas commented Feb 4, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32021 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 45700bd7b1a461d6112bde23a69355fcb0e6c89c, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17627	5197	5053	5053
q2	2017	311	189	189
q3	10245	1309	764	764
q4	10223	894	309	309
q5	7547	2154	1945	1945
q6	204	181	148	148
q7	886	728	610	610
q8	9269	1376	1102	1102
q9	5168	4773	4904	4773
q10	6880	1960	1557	1557
q11	519	298	275	275
q12	331	378	225	225
q13	17771	4069	3233	3233
q14	232	265	212	212
q15	884	820	829	820
q16	667	671	611	611
q17	642	774	510	510
q18	6771	6557	7394	6557
q19	1277	1023	640	640
q20	415	400	247	247
q21	2856	2228	1955	1955
q22	388	335	286	286
Total cold run time: 102819 ms
Total hot run time: 32021 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5534	5466	5536	5466
q2	286	360	260	260
q3	2466	2848	2487	2487
q4	1470	1871	1534	1534
q5	4632	4612	4760	4612
q6	227	179	139	139
q7	1999	1952	1868	1868
q8	2541	2339	2380	2339
q9	7676	7654	7647	7647
q10	2818	2930	2594	2594
q11	543	475	449	449
q12	684	748	588	588
q13	3723	4022	3230	3230
q14	283	312	267	267
q15	837	798	781	781
q16	634	689	642	642
q17	1086	1248	1258	1248
q18	7495	7247	7308	7247
q19	837	853	844	844
q20	1962	2061	1924	1924
q21	4546	4133	4023	4023
q22	558	562	521	521
Total cold run time: 52837 ms
Total hot run time: 50710 ms

@doris-robot

ClickBench: Total hot run time: 28.49 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 45700bd7b1a461d6112bde23a69355fcb0e6c89c, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.04
query2	0.09	0.04	0.04
query3	0.26	0.09	0.09
query4	1.60	0.11	0.11
query5	0.27	0.24	0.26
query6	1.17	0.68	0.67
query7	0.03	0.03	0.03
query8	0.04	0.04	0.04
query9	0.56	0.50	0.48
query10	0.55	0.55	0.54
query11	0.14	0.09	0.10
query12	0.14	0.10	0.12
query13	0.63	0.63	0.61
query14	1.05	1.08	1.07
query15	0.87	0.87	0.87
query16	0.42	0.45	0.42
query17	1.08	1.10	1.12
query18	0.23	0.21	0.21
query19	2.02	2.03	2.01
query20	0.02	0.01	0.01
query21	15.40	0.23	0.15
query22	5.32	0.05	0.05
query23	16.04	0.28	0.11
query24	1.56	0.59	0.34
query25	0.14	0.06	0.11
query26	0.14	0.14	0.13
query27	0.08	0.08	0.05
query28	5.24	1.15	0.97
query29	12.63	3.94	3.19
query30	0.29	0.13	0.12
query31	2.84	0.63	0.41
query32	3.24	0.60	0.49
query33	3.22	3.26	3.24
query34	16.15	5.43	4.73
query35	4.85	4.80	4.81
query36	0.65	0.50	0.48
query37	0.12	0.08	0.07
query38	0.08	0.04	0.04
query39	0.04	0.03	0.04
query40	0.19	0.16	0.16
query41	0.09	0.03	0.03
query42	0.05	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 99.63 s
Total hot run time: 28.49 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 25.00% (1/4) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.56% (19370/36854)
Line Coverage 36.05% (179935/499172)
Region Coverage 32.38% (139350/430381)
Branch Coverage 33.41% (60398/180779)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 25.00% (1/4) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.26% (26461/36118)
Line Coverage 56.33% (280502/497977)
Region Coverage 53.95% (234577/434797)
Branch Coverage 55.67% (101050/181509)

Contributor

@zzzxl1993 zzzxl1993 left a comment

LGTM

@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) Feb 5, 2026
@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by anyone and no changes requested.

@yiguolei yiguolei added the dev/4.0.x label and removed the approved and reviewed labels Feb 5, 2026
@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) Feb 5, 2026
@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by at least one committer and no changes requested.

@yiguolei yiguolei merged commit 2b0b87b into apache:master Feb 5, 2026
30 of 32 checks passed
@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by anyone and no changes requested.

github-actions bot pushed a commit that referenced this pull request Feb 5, 2026
…ng wrong results (#60500)

@airborne12 airborne12 deleted the fix-empty-string-match branch February 5, 2026 02:46
yiguolei pushed a commit that referenced this pull request Feb 5, 2026
…ndex returning wrong results #60500 (#60516)

Cherry-picked from #60500

Co-authored-by: Jack <jiangkai@selectdb.com>
Labels

approved (indicates a PR has been approved by one committer), dev/4.0.4-merged, reviewed

7 participants