
Query planner suddenly using very inefficient plans #147960

@Mahdi-zarei

Description


Describe the problem

After we upgraded our cluster from 25.1.4 to 25.2.0, one of our services suddenly began experiencing very high latency. The affected table has this structure:

Table "public.config_parameters"
  Column  |  Type   | Collation | Nullable | Default
----------+---------+-----------+----------+----------
  user_id | integer |           | not null |
  key     | text    |           | not null |
  value   | text    |           | not null |
(3 rows)
                              Indexes
--------------------------------------------------------------------
  config_parameters_pkey PRIMARY KEY, btree (user_id ASC, key ASC)
  config_param_idx_key btree (key ASC)
(2 rows)

and the query was:

SELECT
user_id, key, value
FROM
config_parameters
WHERE
user_id = ANY ($1:::INT4[]) AND key = $2:::STRING
-- Arguments:
-- $1: '{151254526,0}'
-- $2: 'privacy.last_seen'

with more diagnostic data in:

stmt-bundle-1075867753532948482.zip
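
For anyone reproducing this: a bundle like the one above can be generated with EXPLAIN ANALYZE (DEBUG). The statement below is a sketch with the literal arguments substituted in place of the placeholders:

-- Collect a statement diagnostics bundle for the slow query
-- ($1 and $2 replaced with the literal values for illustration).
EXPLAIN ANALYZE (DEBUG)
SELECT user_id, key, value
FROM config_parameters
WHERE user_id = ANY ('{151254526,0}':::INT4[])
AND key = 'privacy.last_seen':::STRING;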

We managed to fix it by dropping the index on key (config_param_idx_key).
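
Concretely, the statement we ran, plus a less destructive alternative we did not end up trying (an index hint pinning the query to the primary index; a sketch only, untested on our side):

-- The fix we applied: drop the secondary index on key.
DROP INDEX config_parameters@config_param_idx_key;

-- Untested alternative: force the primary index with a hint
-- instead of dropping the secondary index.
SELECT user_id, key, value
FROM config_parameters@config_parameters_pkey
WHERE user_id = ANY ($1:::INT4[]) AND key = $2:::STRING;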

Recently, with no changes made on our end, another service suddenly began running with a very inefficient plan, which even SET plan_cache_mode = force_custom_plan could not fix:


Plan Gist: AgGQDRAASwIAAAMTkA0EAwUKBgo= 

• filter
│
└── • index join
    │ table: user_story_primary@user_story_primary_pkey
    │
    └── • filter
        │
        └── • scan
              table: user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx
              spans: 1+ spans

This time we managed to solve it by marking user_story_primary_is_mutual_expires_at_user_id_idx as not visible (exact statement below). I believe this is a serious planner bug that occasionally produces broken plans and can endanger the whole cluster's stability: queries that used to be quite light become very resource-consuming, and there is no discernible pattern to why it happens.
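
The statement, together with the standard inverse for when the planner behaves again (shown for reference):

-- Hide the index from the optimizer; it is still maintained on
-- writes, so it can be made visible again without a rebuild.
ALTER INDEX user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx NOT VISIBLE;

-- To revert later:
ALTER INDEX user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx VISIBLE;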

To Reproduce

I do not know why such plans are chosen. There seems to be no pattern, and it is not triggered by changes on our end; the latest occurrence happened when absolutely no changes had been made to the cluster or to the tables involved in the incident.

Expected behavior
The query planner should continue to produce the efficient plans it was using before.

Additional data / screenshots
They are included in the description above.

Environment:

  • CockroachDB version 25.2
  • Server OS: Ubuntu 24 with kernel 6.8
  • Client app: Go, using jackc/pgx

Additional context
The initial impact was that the whole cluster slowed down due to CPU exhaustion. We reduced the max connection count, which stabilized the cluster and contained the damage to the affected service; after marking the index not visible, everything returned to normal operation.

This only happened after we migrated from 25.1.4 to 25.2.0; before that, no such planner anomaly was present.

CC @DrewKimball

Jira issue: CRDB-51334
