Skip to content

SEMI JOIN will performs better with FirstMatch() #148366

@jinhui-lai

Description

@jinhui-lai

Describe the problem

Hi, CockroachDB developers,

Thank you for reading my report. I'd like to suggest an optimization opportunity regarding SEMI JOIN operations.

SEMI JOIN often performs better with FirstMatch(), since it stops scanning after finding the first match, avoiding unnecessary work for subsequent rows. Yet the optimizer of CockroachDB does not choose it. Since these joins are frequent, prioritizing this strategy could yield broad speedups.

To Reproduce
You can reproduce it as follows. As you can see, the SEMI JOIN query consumes 12.193s sec, even though there is only one row in t1;

CREATE TABLE t1(c1 INT8);
CREATE TABLE t2(c2 INT8);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 SELECT i FROM generate_series(1, 10000000) AS i;
explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                
                                       info
-----------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • hash join (semi)
  │ equality: (c1) = (c2)
  │
  ├── • scan
  │     missing stats
  │     table: t1@t1_pkey
  │     spans: FULL SCAN
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
        table: t2@t2_pkey
        spans: FULL SCAN

  index recommendations: 2
  1. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t1 (c1);
  2. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(21 rows)

Time: 4ms total (execution 3ms / network 1ms)

SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                        
  c1
------
   1
(1 row)

Time: 12.193s total (execution 12.193s / network 0.000s)

Then executing the query again, CockroachDB still performs full scan.

explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                
                                            info
---------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • hash join (right semi)
  │ estimated row count: 1
  │ equality: (c2) = (c1)
  │
  ├── • scan
  │     estimated row count: 10,000,000 (100% of the table; stats collected 29 seconds ago)
  │     table: t2@t2_pkey
  │     spans: FULL SCAN
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 34 seconds ago)
        table: t1@t1_pkey
        spans: FULL SCAN

  index recommendations: 1
  1. type: index creation
     SQL command: CREATE INDEX ON defaultdb.public.t2 (c2);
(20 rows)

Time: 4ms total (execution 4ms / network 1ms)

SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2);                                        
  c1
------
   1
(1 row)

Time: 3.380s total (execution 3.380s / network 0.000s)

Expected behavior
Could CockroachDB perform SEMI JOIN with 'FirstMatch()' as other DBMSs (e.g., MySQL) done. I believe it can help to improve the performance of SEMI JOIN.

mysql> explain SELECT * FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE c1=c2) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9746243
     filtered: 10.00
        Extra: Using where; FirstMatch(t1); Using join buffer (hash join)
2 rows in set, 2 warnings (0.001 sec)

Environment:
Server version: CockroachDB CCL v24.3.13 (x86_64-pc-linux-gnu, built 2025/05/13 17:14:36, go1.22.8X:nocoverageredesign) (same version as client)
Client app: cockroach sql

Jira issue: CRDB-51574

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-performancePerf of queries or internals. Solution not expected to change functional behavior.O-communityOriginated from the communityT-sql-queriesSQL Queries TeamX-blathers-triagedblathers was able to find an owner

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions