Describe the problem
Hi, CockroachDB developers,
Thank you for reading my report. I'd like to suggest an optimization opportunity regarding SEMI JOIN operations.
SEMI JOIN often performs better with FirstMatch(), since it stops scanning after finding the first match, avoiding unnecessary work for subsequent rows. Yet the optimizer of CockroachDB does not choose it. Since these joins are frequent, prioritizing this strategy could yield broad speedups.
To Reproduce
You can reproduce it as follows. As you can see, the SEMI JOIN query consumes 12.193s sec, even though there is only one row in t1;
CREATE TABLE t1(c1 INT8);
CREATE TABLE t2(c2 INT8);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 SELECT i FROM generate_series(1, 10000000) AS i;
explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);
info
-----------------------------------------------------------------------------------
distribution: local
vectorized: true
• hash join (semi)
│ equality: (c1) = (c2)
│
├── • scan
│ missing stats
│ table: t1@t1_pkey
│ spans: FULL SCAN
│
└── • scan
estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
table: t2@t2_pkey
spans: FULL SCAN
index recommendations: 2
1. type: index creation
SQL command: CREATE INDEX ON defaultdb.public.t1 (c1);
2. type: index creation
SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(21 rows)
Time: 4ms total (execution 3ms / network 1ms)
SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);
c1
------
1
(1 row)
Time: 12.193s total (execution 12.193s / network 0.000s)
Then executing the query again, CockroachDB still performs full scan.
explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);
info
---------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• hash join (right semi)
│ estimated row count: 1
│ equality: (c2) = (c1)
│
├── • scan
│ estimated row count: 10,000,000 (100% of the table; stats collected 29 seconds ago)
│ table: t2@t2_pkey
│ spans: FULL SCAN
│
└── • scan
estimated row count: 1 (100% of the table; stats collected 34 seconds ago)
table: t1@t1_pkey
spans: FULL SCAN
index recommendations: 1
1. type: index creation
SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(20 rows)
Time: 4ms total (execution 4ms / network 1ms)
SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);
c1
------
1
(1 row)
Time: 3.380s total (execution 3.380s / network 0.000s)
Expected behavior
Could CockroachDB perform SEMI JOIN with 'FirstMatch()' as other DBMSs (e.g., MySQL) done. I believe it can help to improve the performance of SEMI JOIN.
mysql> explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2) \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: NULL
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 9746243
filtered: 10.00
Extra: Using where; FirstMatch(t1); Using join buffer (hash join)
2 rows in set, 2 warnings (0.001 sec)
Environment:
Server version: CockroachDB CCL v24.3.13 (x86_64-pc-linux-gnu, built 2025/05/13 17:14:36, go1.22.8X:nocoverageredesign) (same version as client)
Client app: cockroach sql
Jira issue: CRDB-51574
Describe the problem
Hi, CockroachDB developers,
Thank you for reading my report. I'd like to suggest an optimization opportunity regarding
SEMI JOINoperations.SEMI JOINoften performs better withFirstMatch(), since it stops scanning after finding the first match, avoiding unnecessary work for subsequent rows. Yet the optimizer of CockroachDB does not choose it. Since these joins are frequent, prioritizing this strategy could yield broad speedups.To Reproduce
You can reproduce it as follows. As you can see, the
SEMI JOINquery consumes 12.193s sec, even though there is only one row in t1;Then executing the query again, CockroachDB still performs
full scan.Expected behavior
Could CockroachDB perform
SEMI JOINwith 'FirstMatch()' as other DBMSs (e.g., MySQL) done. I believe it can help to improve the performance of SEMI JOIN.Environment:
Server version: CockroachDB CCL v24.3.13 (x86_64-pc-linux-gnu, built 2025/05/13 17:14:36, go1.22.8X:nocoverageredesign) (same version as client)
Client app: cockroach sql
Jira issue: CRDB-51574