SEMI JOIN will performs better with FirstMatch()

**Describe the problem**

Hi, CockroachDB developers,

Thank you for reading my report. I'd like to suggest an optimization opportunity regarding `SEMI JOIN` operations.

`SEMI JOIN` often performs better with `FirstMatch()`, since it stops scanning after finding the first match, avoiding unnecessary work for subsequent rows. Yet the optimizer of CockroachDB does not choose it. Since these joins are frequent, prioritizing this strategy could yield broad speedups.

**To Reproduce**
You can reproduce it as follows.  As you can see, the `SEMI JOIN` query consumes 12.193s sec, even though there is only one row in t1;
```sql
CREATE TABLE t1(c1 INT8);
CREATE TABLE t2(c2 INT8);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 SELECT i FROM generate_series(1, 10000000) AS i;
explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                
                                       info
-----------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • hash join (semi)
  │ equality: (c1) = (c2)
  │
  ├── • scan
  │     missing stats
  │     table: t1@t1_pkey
  │     spans: FULL SCAN
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
        table: t2@t2_pkey
        spans: FULL SCAN

  index recommendations: 2
  1. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t1 (c1);
  2. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(21 rows)

Time: 4ms total (execution 3ms / network 1ms)

SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                        
  c1
------
   1
(1 row)

Time: 12.193s total (execution 12.193s / network 0.000s)
```
Then executing the query again, CockroachDB still performs `full scan`.
```sql
explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                
                                            info
---------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • hash join (right semi)
  │ estimated row count: 1
  │ equality: (c2) = (c1)
  │
  ├── • scan
  │     estimated row count: 10,000,000 (100% of the table; stats collected 29 seconds ago)
  │     table: t2@t2_pkey
  │     spans: FULL SCAN
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 34 seconds ago)
        table: t1@t1_pkey
        spans: FULL SCAN

  index recommendations: 1
  1. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(20 rows)

Time: 4ms total (execution 4ms / network 1ms)

SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                        
  c1
------
   1
(1 row)

Time: 3.380s total (execution 3.380s / network 0.000s)
```

**Expected behavior**
Could CockroachDB perform `SEMI JOIN` with 'FirstMatch()' as other DBMSs (e.g.,  MySQL) done. I believe it can help to improve the performance of SEMI JOIN.
```sql
mysql> explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9746243
     filtered: 10.00
        Extra: Using where; FirstMatch(t1); Using join buffer (hash join)
2 rows in set, 2 warnings (0.001 sec)
```

**Environment:**
Server version: CockroachDB CCL v24.3.13 (x86_64-pc-linux-gnu, built 2025/05/13 17:14:36, go1.22.8X:nocoverageredesign) (same version as client)
Client app: cockroach sql

Jira issue: CRDB-51574

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SEMI JOIN will performs better with FirstMatch() #148366

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SEMI JOIN will performs better with FirstMatch() #148366

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions