Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade: remove buggy TTL repair #110364

Merged
merged 2 commits into from Sep 12, 2023
Merged

upgrade: remove buggy TTL repair #110364

merged 2 commits into from Sep 12, 2023

Conversation

ecwall
Copy link
Contributor

@ecwall ecwall commented Sep 11, 2023

Fixes #110363

The TTL descriptor repair in FirstUpgradeFromReleasePrecondition incorrectly
removes TTL fields from table descriptors after incorrectly comparing the
table descriptor's TTL job schedule ID to a set of job IDs.

This change removes the repair until tests are properly added.

Release note (bug fix): Remove buggy TTL descriptor repair. Previously,
upgrading from 22.2.X to 23.1.9 incorrectly removed TTL storage params from
tables (visible via SHOW CREATE TABLE <ttl-table>;) while attempting to
repair table descriptors. This resulted in the node that attempts to run the
TTL job crashing due to a panic caused by the missing TTL storage params.
Clusters currently on 22.2.X should NOT be upgraded to 23.1.9 and should
be upgraded to 23.1.10 or later directly.

@ecwall ecwall requested a review from a team as a code owner September 11, 2023 18:52
@blathers-crl
Copy link

blathers-crl bot commented Sep 11, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ecwall ecwall added backport-23.1.x Flags PRs that need to be backported to 23.1 FROZEN.backport-23.1.11-rc labels Sep 11, 2023
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ecwall
Copy link
Contributor Author

ecwall commented Sep 11, 2023

code lgtm!, but please include this test: https://cockroachlabs.slack.com/archives/C05RE28U495/p1694457604827269

Ah missed that, added.

@ecwall ecwall requested a review from rafiss September 11, 2023 19:35
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


-- commits line 12 at r1:
Please extend the release note:

  • what is the symptom of the bug a customer would see.
  • what version the bug was introduced in.
  • what, if anything, does the customer need to do if they encounter the bug other than upgrade to the version containing the fix.

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)


-- commits line 12 at r1:

Previously, knz (Raphael 'kena' Poss) wrote…

Please extend the release note:

  • what is the symptom of the bug a customer would see.
  • what version the bug was introduced in.
  • what, if anything, does the customer need to do if they encounter the bug other than upgrade to the version containing the fix.

the bug was introduced in v23.1.9


pkg/sql/logictest/testdata/logic_test/upgrade line 0 at r1 (raw file):
nit: let's rename the file to something more descriptive. perhaps mixed_version_upgrade_repair_descriptors or mixed_version_row_level_ttl

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


-- commits line 12 at r1:

Previously, rafiss (Rafi Shamim) wrote…

the bug was introduced in v23.1.9

I am personally aware, yes. I am just pointing out that this information should be included in the release note too.

as per https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/186548364/Release+notes#Bug-fix

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


-- commits line 12 at r1:

Previously, knz (Raphael 'kena' Poss) wrote…

I am personally aware, yes. I am just pointing out that this information should be included in the release note too.

as per https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/186548364/Release+notes#Bug-fix

Probably worth mentioning in the RN as well that "clusters using TTL should not be upgraded to 23.1.9 at all and should be upgraded to 23.1.10 or later directly".

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)


-- commits line 12 at r1:

Previously, knz (Raphael 'kena' Poss) wrote…

Probably worth mentioning in the RN as well that "clusters using TTL should not be upgraded to 23.1.9 at all and should be upgraded to 23.1.10 or later directly".

i thought you were probably aware, but added the comment to help with amending the note :)


pkg/sql/catalog/tabledesc/table_desc_builder.go line 246 at r2 (raw file):

			tdb.changes.Add(catalog.StrippedDanglingBackReferences)
		}
	}

one more thing i forgot from the first review; let's add a short-circuit here:

func FirstUpgradeFromRelease(

for 23.1, let's make this repair logic only happen if a certain environment variable is set

Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rafiss)


-- commits line 12 at r1:

Previously, rafiss (Rafi Shamim) wrote…

i thought you were probably aware, but added the comment to help with amending the note :)

Done.


pkg/sql/logictest/testdata/logic_test/upgrade line at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: let's rename the file to something more descriptive. perhaps mixed_version_upgrade_repair_descriptors or mixed_version_row_level_ttl

Done

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)

Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rafiss)


pkg/sql/catalog/tabledesc/table_desc_builder.go line 246 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

one more thing i forgot from the first review; let's add a short-circuit here:

func FirstUpgradeFromRelease(

for 23.1, let's make this repair logic only happen if a certain environment variable is set

Should this be a separate PR that is made against the release-23.1 branch and backported to 23.1.10?

Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rafiss)


pkg/sql/catalog/tabledesc/table_desc_builder.go line 246 at r2 (raw file):

Previously, ecwall (Evan Wall) wrote…

Should this be a separate PR that is made against the release-23.1 branch and backported to 23.1.10?

Also it looks like this is only used for 23.2

firstUpgradeTowardsV23_2 = upgrade.NewTenantUpgrade(  
  "prepare upgrade to v23.2 release",  
  toCV(clusterversion.V23_2Start),  
  FirstUpgradeFromReleasePrecondition,  
  FirstUpgradeFromRelease,  
)

@ecwall ecwall requested review from knz and rafiss September 11, 2023 23:50
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)


pkg/sql/catalog/tabledesc/table_desc_builder.go line 246 at r2 (raw file):

Previously, ecwall (Evan Wall) wrote…

Also it looks like this is only used for 23.2

firstUpgradeTowardsV23_2 = upgrade.NewTenantUpgrade(  
  "prepare upgrade to v23.2 release",  
  toCV(clusterversion.V23_2Start),  
  FirstUpgradeFromReleasePrecondition,  
  FirstUpgradeFromRelease,  
)

this upgrade exists in 23.1 also.

firstUpgradeTowardsV23_1 = upgrade.NewTenantUpgrade(
"prepare upgrade to v23.1 release",
toCV(clusterversion.V23_1Start),
FirstUpgradeFromReleasePrecondition,
NoTenantUpgradeFunc,
)

anyway, sure, let's only add the short-circuit logic in the backport

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like TestStripDanglingBackReferences needs to be updated

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)


pkg/sql/logictest/testdata/logic_test/mixed_version_upgrade_repair_descriptors line 9 at r3 (raw file):


query T
SELECT crdb_internal.node_executable_version()

you can remove this assertion, or if you keep it, the expected value on this branch is 23.1

Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rafiss)


pkg/sql/logictest/testdata/logic_test/mixed_version_upgrade_repair_descriptors line 9 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

you can remove this assertion, or if you keep it, the expected value on this branch is 23.1

I'll remove that.

Is this needed below?

query B retry
SELECT version LIKE '22.2-%' FROM [SHOW CLUSTER SETTING version]
----
true

Shouldn't the version be 23.1 after the upgrade?

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)


pkg/sql/logictest/testdata/logic_test/mixed_version_upgrade_repair_descriptors line 9 at r3 (raw file):

Previously, ecwall (Evan Wall) wrote…

I'll remove that.

Is this needed below?

query B retry
SELECT version LIKE '22.2-%' FROM [SHOW CLUSTER SETTING version]
----
true

Shouldn't the version be 23.1 after the upgrade?

ah, i had this check from when i was testing on the 23.1 branch.

when testing on master, the check should be:

query B retry
SELECT version LIKE '23.1-%' FROM [SHOW CLUSTER SETTING version]
----
true

Note that this is looking at the logical cluster version, not the binary version.

On master, the previous logical version starts at 23.1, and the check for 23.1-% makes sure that the first upgrade migration (the one that runs all repairs) has finished running. (after all upgrade migrations run, SHOW CLUSTER SETTING version would show 23.2)

On the 23.1 branch, this check should be looking for 22.2-% instead

@ecwall ecwall requested a review from rafiss September 12, 2023 13:58
`upgrade all` upgrades all nodes in the cluster. It has the same behavior as
using `upgrade <idx>` for each index in the cluster.

Release note: None
Fixes #110363

The TTL descriptor repair in FirstUpgradeFromReleasePrecondition incorrectly
removes TTL fields from table descriptors after incorrectly comparing the
table descriptor's TTL job schedule ID to a set of job IDs.

This change removes the repair until tests are properly added.

Release note (bug fix): Remove buggy TTL descriptor repair. Previously,
upgrading from 22.2.X to 23.1.9 incorrectly removed TTL storage params from
tables (visible via `SHOW CREATE TABLE <ttl-table>;`) while attempting to
repair table descriptors. This resulted in the node that attempts to run the
TTL job crashing due to a panic caused by the missing TTL storage params.
Clusters currently on 22.2.X should NOT be upgraded to 23.1.9 and should
be upgraded to 23.1.10 or later directly.
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @knz)

@ecwall
Copy link
Contributor Author

ecwall commented Sep 12, 2023

bors r=rafiss

@craig
Copy link
Contributor

craig bot commented Sep 12, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 12, 2023

Build succeeded:

@craig craig bot merged commit 40dd180 into cockroachdb:master Sep 12, 2023
8 checks passed
@blathers-crl
Copy link

blathers-crl bot commented Sep 12, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 48b0df4 to blathers/backport-release-23.1-110364: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 48b0df4 to blathers/backport-release-23.1.10-rc-110364: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.10-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 FROZEN.backport-23.1.11-rc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrade: automated catalog repairs incorrectly removes TTL from the table descriptor
4 participants