Describe the problem
After we upgraded our cluster from 25.1.4 to 25.2.0, one of our services suddenly began experiencing very high latency. The table involved has the following structure:
Table "public.config_parameters"
Column | Type | Collation | Nullable | Default
----------+---------+-----------+----------+----------
user_id | integer | | not null |
key | text | | not null |
value | text | | not null |
(3 rows)
Indexes
--------------------------------------------------------------------
config_parameters_pkey PRIMARY KEY, btree (user_id ASC, key ASC)
config_param_idx_key btree (key ASC)
(2 rows)
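For reference, the schema above corresponds to roughly the following DDL (a reconstruction from the output above, written with CockroachDB type aliases; not the original migration):

CREATE TABLE public.config_parameters (
    user_id INT4 NOT NULL,
    key STRING NOT NULL,
    value STRING NOT NULL,
    CONSTRAINT config_parameters_pkey PRIMARY KEY (user_id ASC, key ASC),
    INDEX config_param_idx_key (key ASC)
);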
The query in question was:
SELECT
user_id, key, value
FROM
config_parameters
WHERE
user_id = ANY ($1:::INT4[]) AND key = $2:::STRING
-- Arguments:
-- $1: '{151254526,0}'
-- $2: 'privacy.last_seen'
More diagnostic data is attached in stmt-bundle-1075867753532948482.zip.
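In case it helps reproduction, a bundle like the attached one can be regenerated with something along these lines (argument values inlined as literals from the example above):

EXPLAIN ANALYZE (DEBUG)
  SELECT user_id, key, value
  FROM config_parameters
  WHERE user_id = ANY (ARRAY[151254526, 0]) AND key = 'privacy.last_seen';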
We managed to fix it by dropping the index on key (config_param_idx_key).
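Concretely, the workaround was roughly the following (a sketch; dropping the secondary index was acceptable for us because the query can be served from the primary key on (user_id, key)):

-- Drop the secondary index on key so the optimizer falls back to the primary key.
DROP INDEX config_parameters@config_param_idx_key;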
Recently, with no changes on our end, another service suddenly started running with a very inefficient plan, which even SET plan_cache_mode = force_custom_plan could not fix:
Plan Gist: AgGQDRAASwIAAAMTkA0EAwUKBgo=
• filter
│
└── • index join
    │ table: user_story_primary@user_story_primary_pkey
    │
    └── • filter
        │
        └── • scan
              table: user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx
              spans: 1+ spans
This time we solved it with ALTER INDEX user_story_primary_is_mutual_expires_at_user_id_idx NOT VISIBLE;. I believe this is a serious planner bug: it occasionally produces broken plans that can endanger the stability of the whole cluster, because queries that used to be quite light become very resource-consuming, and there is no discernible pattern to when or why it happens.
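Spelled out, the attempted and successful mitigations were roughly as follows (the SHOW INDEXES check at the end is our own illustration, not part of the original incident notes):

-- Did not help: force a custom plan instead of reusing a cached/generic one.
SET plan_cache_mode = force_custom_plan;

-- Resolved it: hide the index from the optimizer without dropping it.
ALTER INDEX user_story_primary@user_story_primary_is_mutual_expires_at_user_id_idx NOT VISIBLE;

-- Illustrative check: the index's visibility flag should now read false.
SHOW INDEXES FROM user_story_primary;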
To Reproduce
I do not really know why such plans are chosen. There seems to be no pattern, and it is not triggered by changes on our end; the latest occurrence happened when absolutely no changes had been made to the cluster or to the tables involved in the incident.
Expected behavior
The query planner should keep producing the efficient plans it was using before.
Additional data / screenshots
They are attached in the issue description above.
Environment:
- CockroachDB version 25.2
- Server OS: Ubuntu 24 with kernel 6.8
- Client app: Go (jackc/pgx)
Additional context
The initial impact was that the whole cluster slowed down due to CPU exhaustion. We reduced the maximum connection count, which stabilized the cluster and contained the damage to the affected service; after marking the index as not visible, everything returned to normal operation.
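For completeness, one server-side way to cap connections would be a cluster setting along these lines (the setting name and value here are illustrative assumptions, not necessarily what we changed; lowering the client pool size achieves the same effect):

-- Assumed sketch: limit new SQL connections accepted per gateway node.
SET CLUSTER SETTING server.max_connections_per_gateway = 200;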
This only started after we migrated from 25.1.4 to 25.2.0; before that, no such planner anomaly was present.
CC @DrewKimball
Jira issue: CRDB-51334
