
release-25.2: sqlinstance: deduplicate live rows by rpcAddr #169217

Open
blathers-crl[bot] wants to merge 1 commit into release-25.2 from blathers/backport-release-25.2-169043


blathers-crl Bot commented Apr 28, 2026

Backport 1/1 commits from #169043 on behalf of @shubhamdhama.


selectDistinctLiveRows was deduplicating live SQL instance rows by sqlAddr (the SQL advertise address). This was the wrong key: the SQL advertise address (--advertise-sql-addr) is allowed to be non-unique across nodes, and Kubernetes deployments commonly point it at a regional service DNS name shared by all pods in the region. Deduplicating on sqlAddr silently collapsed distinct live instances down to one entry, breaking downstream consumers like DistSQL placement and execution-locality filtering.

Switch the dedup key to rpcAddr. RPC advertise addresses must be node-unique among live instances — gossip and KV peer dialing depend on this — so rpcAddr is the correct identity for the pod-restart race that motivates dedup in the first place (a SQL pod crashes and a new pod starts at the same rpcAddr with a fresh instance ID before the dead pod's session expires).
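
To illustrate the change in dedup key, here is a minimal Go sketch. It is not the code from this PR: the `instanceRow` type, its fields, and the `dedupByRPCAddr` helper are hypothetical, and the rule of keeping the row whose session expires latest is an assumption made for the example. It shows three live rows that share one sqlAddr but only two distinct rpcAddr values, where the third row models a restarted pod.

```go
package main

import (
	"fmt"
	"sort"
)

// instanceRow is a hypothetical, simplified stand-in for a live SQL
// instance row; it is not the type used in the sqlinstance package.
type instanceRow struct {
	instanceID    int
	rpcAddr       string
	sqlAddr       string
	sessionExpiry int64 // Unix seconds; assumed tie-break field
}

// dedupByRPCAddr keeps one row per rpcAddr. When several rows share an
// rpcAddr, it keeps the one whose session expires latest (an assumed
// rule for this sketch). The input slice is not modified.
func dedupByRPCAddr(rows []instanceRow) []instanceRow {
	sorted := append([]instanceRow(nil), rows...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].rpcAddr != sorted[j].rpcAddr {
			return sorted[i].rpcAddr < sorted[j].rpcAddr
		}
		return sorted[i].sessionExpiry > sorted[j].sessionExpiry
	})
	var out []instanceRow
	for i, r := range sorted {
		// The first row of each rpcAddr group has the latest expiry.
		if i == 0 || r.rpcAddr != sorted[i-1].rpcAddr {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	shared := "sql.us-east.example:26257" // regional DNS name shared by all pods
	rows := []instanceRow{
		{1, "10.0.0.1:26257", shared, 100},
		{2, "10.0.0.2:26257", shared, 100},
		{3, "10.0.0.1:26257", shared, 200}, // restarted pod, same rpcAddr
	}
	for _, r := range dedupByRPCAddr(rows) {
		fmt.Println(r.instanceID, r.rpcAddr)
	}
}
```

With these inputs, keying on sqlAddr would leave a single row, while keying on rpcAddr leaves two: the restarted pod (instance 3) and the unrelated instance 2.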

The wrong key was introduced when the RPC and SQL listen ports were split; before the split there was a single advertise address and the choice was unambiguous.

Resolves: #168991
Epic: CRDB-63207

Release note (bug fix): Fixed a bug where setting --advertise-sql-addr to the same value across multiple SQL instances — for example, in a Kubernetes deployment where all pods in a region share a regional service DNS name — caused distributed SQL query plans to place onto a single instance per region, and could cause changefeeds with execution_locality filters to fail with "no instances found matching locality filter".


Release justification:

blathers-crl Bot force-pushed the blathers/backport-release-25.2-169043 branch from 2879d02 to 6f4cb47 April 28, 2026 07:11
blathers-crl Bot added the blathers-backport and O-robot labels Apr 28, 2026

trunk-io Bot commented Apr 28, 2026

Merging to release-25.2 in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.


blathers-crl Bot commented Apr 28, 2026

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes OR fixes for serious issues. Non-production includes test-only changes, build system changes, etc. Serious issues are defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to. Reference the approved ENGREQ ticket in the PR body (e.g., "Fixes ENGREQ-123").

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

blathers-crl Bot requested a review from dt April 28, 2026 07:11
blathers-crl Bot added the backport label Apr 28, 2026
blathers-crl Bot requested a review from jeffswenson April 28, 2026 07:11
blathers-crl Bot requested a review from a team April 28, 2026 07:11
@cockroach-teamcity

This change is Reviewable


Labels

backport: Label PR's that are backports to older release branches
blathers-backport: This is a backport that Blathers created automatically.
O-robot: Originated from a bot.
T-db-server
