Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfigkvsubscriber: replica of a GC'ed span does not respect SystemSpanConfigs #113867

Open
adityamaru opened this issue Nov 6, 2023 · 3 comments
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-3 Issues/test failures with no fix SLA

Comments

@adityamaru
Copy link
Contributor

adityamaru commented Nov 6, 2023

In our test cluster we observed two ranges corresponding to dropped tables not respecting protection policies enforced by SystemSpanConfigs. The replicas of these empty ranges were still evaluating batch requests and causing them to fail with ‹ERROR: batch timestamp 1698960957.937065386,0 must be after replica GC threshold 1698967771.741796894,0 (SQLSTATE XXUUU)›. The running theory is that with the following sequence of operations:

  • Drop DB (t = 1)
  • SCHEMA CHANGE GC Job writes range deletion at t=1, waits for span to be empty
  • Write SystemSpanConfig PTS (t = 2) covering the dropped range
  • MVCC GC Runs up to time 2
  • Span is now empty, SCHEMA CHANGE GC JOB sees span is empty, deletes span configuration.

We end up in a situation where the spanconfigstore does not find any overlapping span configs for the empty range and so does not run the logic to check if any system span configs apply to that range - https://github.com/cockroachdb/cockroach/blob/master/pkg/spanconfig/spanconfigstore/store.go#L199. In this way it misses any protection policies that should hold up the GCThreshold and allows GC to move past the protected timestamp. We think it makes sense to apply a default zone config with the system span configs combined into it to such ranges. We are still attempting to reproduce this locally.

Jira issue: CRDB-33239

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. GA-blocker O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster labels Nov 6, 2023
Copy link

blathers-crl bot commented Nov 6, 2023

Hi @adityamaru, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@adityamaru
Copy link
Contributor Author

I've slapped on the GA-blocker label, but technically this is not a regression from 22.2 when this infrastructure was first introduced so we can re-evaluate.

@adityamaru adityamaru self-assigned this Nov 7, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 7, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 8, 2023
Previously, if a range did not have an overlapping span
config then `ForEachOverlappingSpanConfig` would not apply
the relevant system span configs that may still apply to
that range. One situation in which we could end up with
such a range when a table is dropped and all of its data (including the
range deletion tombstone installed by the drop) is GC'ed, the associated
schema change GC job will delete the table's span config. In this case, we
will not find any overlapping span configs for the table's span, but a
system span config, such as a cluster wide protection policy, may still be
applicable to the replica with the empty table span. A scan
of that span AOST the timestamp at which we wrote the protection
policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config
to the span with no overlapping span configs, and combining the
system span configs that apply to the span. The change adds a
red-green test to exercise this logic.

Informs: cockroachdb#113867
Release note (bug fix): an empty range corresponding to a drop table
did not respect system level span configurations such as protected
timestamps, potentially causing reads above the protected timestamp
to fail
craig bot pushed a commit that referenced this issue Nov 21, 2023
Previously, if a range did not have an overlapping span
config then `ForEachOverlappingSpanConfig` would not apply
the relevant system span configs that may still apply to
that range. One situation in which we could end up with
such a range when a table is dropped and all of its data (including the
range deletion tombstone installed by the drop) is GC'ed, the associated
schema change GC job will delete the table's span config. In this case, we
will not find any overlapping span configs for the table's span, but a
system span config, such as a cluster wide protection policy, may still be
applicable to the replica with the empty table span. A scan
of that span AOST the timestamp at which we wrote the protection
policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config
to the span with no overlapping span configs, and combining the
system span configs that apply to the span. The change adds a
red-green test to exercise this logic.

Informs: #113867
Release note (bug fix): an empty range corresponding to a drop table
did not respect system level span configurations such as protected
timestamps, potentially causing reads above the protected timestamp
to fail
craig bot pushed a commit that referenced this issue Nov 21, 2023
114050: spanconfigkvsubscriber: fix missing system span configs r=stevendanna a=adityamaru

Previously, if a range did not have an overlapping span config then `ForEachOverlappingSpanConfig` would not apply the relevant system span configs that may still apply to that range. One situation in which we could end up with such a range when a table is dropped and all of its data (including the range deletion tombstone installed by the drop) is GC'ed, the associated schema change GC job will delete the table's span config. In this case, we will not find any overlapping span configs for the table's span, but a system span config, such as a cluster wide protection policy, may still be applicable to the replica with the empty table span. A scan of that span AOST the timestamp at which we wrote the protection policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config to the span with no overlapping span configs, and combining the system span configs that apply to the span. The change adds a red-green test to exercise this logic.

Informs: #113867
Release note (bug fix): an empty range corresponding to a drop table did not respect system level span configurations such as protected timestamps, potentially causing reads above the protected timestamp to fail

114802: obsservice: creating Insights Pipeline r=maryliag a=maryliag

Note to Reviewers: only last commit is relevant to this PR

---

Create basic structure for Insights Pipeline.
Currently the Validator (always return that is valid)
and Processor (just printing the value) are very simple.

This commit is to introduce that structure so can be iterated on.

https://www.loom.com/share/b568f514753542d9bd5e551b530f917d

Part Of [CC-26215](https://cockroachlabs.atlassian.net/browse/CC-26215)

Release note: None

114823: types: make Family.Name return redact.SafeString r=yuzefovich a=yuzefovich

All family names are safe constant strings.

Epic: None

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Nov 21, 2023
Previously, if a range did not have an overlapping span
config then `ForEachOverlappingSpanConfig` would not apply
the relevant system span configs that may still apply to
that range. One situation in which we could end up with
such a range when a table is dropped and all of its data (including the
range deletion tombstone installed by the drop) is GC'ed, the associated
schema change GC job will delete the table's span config. In this case, we
will not find any overlapping span configs for the table's span, but a
system span config, such as a cluster wide protection policy, may still be
applicable to the replica with the empty table span. A scan
of that span AOST the timestamp at which we wrote the protection
policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config
to the span with no overlapping span configs, and combining the
system span configs that apply to the span. The change adds a
red-green test to exercise this logic.

Informs: #113867
Release note (bug fix): an empty range corresponding to a drop table
did not respect system level span configurations such as protected
timestamps, potentially causing reads above the protected timestamp
to fail
@adityamaru
Copy link
Contributor Author

I'll keep this open until the 23.1 backport merges

@adityamaru adityamaru added branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 and removed GA-blocker labels Nov 22, 2023
cockroach-teamcity pushed a commit to cockroach-teamcity/cockroach that referenced this issue Nov 27, 2023
Previously, if a range did not have an overlapping span
config then `ForEachOverlappingSpanConfig` would not apply
the relevant system span configs that may still apply to
that range. One situation in which we could end up with
such a range when a table is dropped and all of its data (including the
range deletion tombstone installed by the drop) is GC'ed, the associated
schema change GC job will delete the table's span config. In this case, we
will not find any overlapping span configs for the table's span, but a
system span config, such as a cluster wide protection policy, may still be
applicable to the replica with the empty table span. A scan
of that span AOST the timestamp at which we wrote the protection
policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config
to the span with no overlapping span configs, and combining the
system span configs that apply to the span. The change adds a
red-green test to exercise this logic.

Informs: cockroachdb#113867
Release note (bug fix): an empty range corresponding to a drop table
did not respect system level span configurations such as protected
timestamps, potentially causing reads above the protected timestamp
to fail
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 1, 2023
Previously, if a range did not have an overlapping span
config then `ForEachOverlappingSpanConfig` would not apply
the relevant system span configs that may still apply to
that range. One situation in which we could end up with
such a range when a table is dropped and all of its data (including the
range deletion tombstone installed by the drop) is GC'ed, the associated
schema change GC job will delete the table's span config. In this case, we
will not find any overlapping span configs for the table's span, but a
system span config, such as a cluster wide protection policy, may still be
applicable to the replica with the empty table span. A scan
of that span AOST the timestamp at which we wrote the protection
policy could result in a BatchTimestampBeforeGCError.

This change fixes the above bug by applying a fallback config
to the span with no overlapping span configs, and combining the
system span configs that apply to the span. The change adds a
red-green test to exercise this logic.

Informs: cockroachdb#113867
Release note (bug fix): an empty range corresponding to a drop table
did not respect system level span configurations such as protected
timestamps, potentially causing reads above the protected timestamp
to fail
@ajstorm ajstorm added the P-3 Issues/test failures with no fix SLA label Feb 1, 2024
@msbutler msbutler assigned stevendanna and unassigned adityamaru Feb 12, 2024
@exalate-issue-sync exalate-issue-sync bot removed the P-3 Issues/test failures with no fix SLA label Mar 19, 2024
@jlinder jlinder assigned stevendanna and unassigned adityamaru Mar 22, 2024
@jlinder jlinder added the P-3 Issues/test failures with no fix SLA label Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-3 Issues/test failures with no fix SLA
Projects
None yet
Development

No branches or pull requests

4 participants