
Query planner suddenly using very inefficient plans #147960

@Mahdi-zarei

Description


Describe the problem

After we upgraded our cluster from 25.1.4 to 25.2.0, one of our services suddenly began experiencing very high latency. The affected table has this structure:

Table "public.config_parameters"
  Column  |  Type   | Collation | Nullable | Default
----------+---------+-----------+----------+----------
  user_id | integer |           | not null |
  key     | text    |           | not null |
  value   | text    |           | not null |
(3 rows)
                              Indexes
--------------------------------------------------------------------
  config_parameters_pkey PRIMARY KEY, btree (user_id ASC, key ASC)
  config_param_idx_key btree (key ASC)
(2 rows)

and the query was:

SELECT
user_id, key, value
FROM
config_parameters
WHERE
user_id = ANY ($1:::INT4[]) AND key = $2:::STRING
-- Arguments:
-- $1: '{151254526,0}'
-- $2: 'privacy.last_seen'

with more diagnostic data in:

stmt-bundle-1075867753532948482.zip
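
For anyone reproducing this: a bundle like the one above can be generated with EXPLAIN ANALYZE (DEBUG). The statement below is a sketch with the literal arguments substituted in place of the placeholders:

-- Collect a statement diagnostics bundle for the slow query
-- ($1 and $2 replaced with the literal values for illustration).
EXPLAIN ANALYZE (DEBUG)
SELECT user_id, key, value
FROM config_parameters
WHERE user_id = ANY ('{151254526,0}':::INT4[])
AND key = 'privacy.last_seen':::STRING;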

We managed to fix it by dropping the index on key (config_param_idx_key).
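
Concretely, the statement we ran, plus a less destructive alternative we did not end up trying (an index hint pinning the query to the primary index; a sketch only, untested on our side):

-- The fix we applied: drop the secondary index on key.
DROP INDEX config_parameters@config_param_idx_key;

-- Untested alternative: force the primary index with a hint
-- instead of dropping the secondary index.
SELECT user_id, key, value
FROM config_parameters@config_parameters_pkey
WHERE user_id = ANY ($1:::INT4[]) AND key = $2:::STRING;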

Recently, with no changes made on our end, another service suddenly began running with a very inefficient plan, which even SET plan_cache_mode = force_custom_plan could not fix:


Plan Gist: AgGQDRAASwIAAAMTkA0EAwUKBgo= 

• filter
│
└── • index join
    │ table: user_story_primary@user_story_primary_pkey
    │
    └── • filter
        │
        └── • scan
              table: user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx
              spans: 1+ spans

This time we managed to solve it by marking user_story_primary_is_mutual_expires_at_user_id_idx as not visible (exact statement below). I believe this is a serious planner bug that occasionally produces broken plans and can endanger the whole cluster's stability: queries that used to be quite light become very resource-consuming, and there is no discernible pattern to why it happens.
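
The statement, together with the standard inverse for when the planner behaves again (shown for reference):

-- Hide the index from the optimizer; it is still maintained on
-- writes, so it can be made visible again without a rebuild.
ALTER INDEX user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx NOT VISIBLE;

-- To revert later:
ALTER INDEX user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx VISIBLE;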

To Reproduce

I do not know why such plans are chosen. There seems to be no pattern, and it is not triggered by changes on our end; the latest occurrence happened when absolutely no changes had been made to the cluster or to the tables involved in the incident.

Expected behavior
The query planner should continue to produce the efficient plans it was using before.

Additional data / screenshots
They are included in the description above.

Environment:

  • CockroachDB version 25.2
  • Server OS: Ubuntu 24 with kernel 6.8
  • Client app: Go, using jackc/pgx

Additional context
The initial impact was that the whole cluster slowed down due to CPU exhaustion. We reduced the max connection count, which stabilized the cluster and contained the damage to the affected service; after marking the index not visible, everything returned to normal operation.

This only happened after we migrated from 25.1.4 to 25.2.0; before that, no such planner anomaly was present.

CC @DrewKimball

Jira issue: CRDB-51334
