Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Fix panic when transitioning from REGIONAL BY ROW #61889

Merged
merged 1 commit into from
Mar 15, 2021

Conversation

ajstorm
Copy link
Collaborator

@ajstorm ajstorm commented Mar 12, 2021

Before this commit, the transition from REGIONAL BY ROW to REGIONAL
BY TABLE would drop the partitioning on the table not drop the
zone configurations. As a result, when the index GC runs, it would
try and cleanup the zone configurations. This hit an issue though,
as the types are not properly hydrated in the index GC code path.
The end result was that the index GC would panic and bring down the
node. To work around this problem this commit adds code to remove
the zone configs when dropping the table partitioning.

Note: This commit does not solve the root cause of this issue, but
instead, prevent the failure from occurring when transitioning from
REGIONAL BY ROW tables to REGIONAL BY TABLE tables. A new issue
will be opened for the general problem.

Release note: None

Resolves #61751 .

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajstorm ajstorm force-pushed the ajstorm_zone_config_cleanup branch 3 times, most recently from 41649e7 to 06d7e89 Compare March 12, 2021 18:07
@blathers-crl
Copy link

blathers-crl bot commented Mar 12, 2021

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Mar 12, 2021
@ajstorm ajstorm force-pushed the ajstorm_zone_config_cleanup branch 3 times, most recently from 180d58b to 51c56e3 Compare March 12, 2021 21:59
Copy link
Contributor

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @arulajmani)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 646 at r1 (raw file):

	res, err = sqlDB.Query(`WITH jobs AS (
		SELECT status, crdb_internal.pb_to_json(

feels like this SQL query can be function-ised with parameter "status", or made a const


pkg/ccl/multiregionccl/regional_by_row_test.go, line 675 at r1 (raw file):

	// Wait.
	time.Sleep(time.Second * 2)

this time.Sleep seems potentially racy, esp under testrace. two possible ways around this, either:

  • use / add a test knob to the GC job to signal that it is complete using a channel, i.e.
done := make(struct{})
... SeverParams.Knob...  = {
   GCCompleteKnob: func() {
     done <- struct{}
   }
}

....

<-done

// Validate indexes are cleaned up.
// ....
  • use testutils.SucceedsSoon` for the rest below.

pkg/sql/region_util.go, line 469 at r1 (raw file):

		for _, indexID := range indexIDs {
			for _, region := range regionConfig.Regions {
				zoneConfig.DeleteSubzone(uint32(indexID), string(region.Name))

urgh, DeleteSubzone is O(n) so this is O(n*regions)
guess it's not important to fix now.


pkg/sql/region_util.go, line 591 at r1 (raw file):

	newZoneConfigIsEmpty := newZoneConfig.Equal(zonepb.NewZoneConfig())
	currentZoneConfigIsEmpty := currentZoneConfig.Equal(zonepb.NewZoneConfig())
	reWriteZoneConfig := !newZoneConfigIsEmpty

nit: style rewrite as a single word, i.e.rewriteZoneConfig

@otan
Copy link
Contributor

otan commented Mar 14, 2021

also to catch this should we consider setting a lower gc ttl for MR alter_table_locality tests?

@ajstorm ajstorm force-pushed the ajstorm_zone_config_cleanup branch from 51c56e3 to cb05804 Compare March 15, 2021 01:20
@blathers-crl blathers-crl bot requested a review from otan March 15, 2021 01:20
Copy link
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we need to assess gc ttl behavior more holistically for MR test cases. I've added it to the test plan.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @arulajmani, and @otan)


pkg/ccl/multiregionccl/regional_by_row_test.go, line 646 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

feels like this SQL query can be function-ised with parameter "status", or made a const

Done.


pkg/ccl/multiregionccl/regional_by_row_test.go, line 675 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

this time.Sleep seems potentially racy, esp under testrace. two possible ways around this, either:

  • use / add a test knob to the GC job to signal that it is complete using a channel, i.e.
done := make(struct{})
... SeverParams.Knob...  = {
   GCCompleteKnob: func() {
     done <- struct{}
   }
}

....

<-done

// Validate indexes are cleaned up.
// ....
  • use testutils.SucceedsSoon` for the rest below.

Thanks for the pointer to testutils.SucceedsSoon. Very helpful.


pkg/sql/region_util.go, line 591 at r1 (raw file):

Previously, otan (Oliver Tan) wrote…

nit: style rewrite as a single word, i.e.rewriteZoneConfig

Fixed.

@ajstorm
Copy link
Collaborator Author

ajstorm commented Mar 15, 2021

Hey @ajwerner. Any chance you could have a look at this sometime today? This issue is a release blocker.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @arulajmani, and @otan)

@ajstorm ajstorm force-pushed the ajstorm_zone_config_cleanup branch from cb05804 to 98b2165 Compare March 15, 2021 15:31
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Regarding the more general problem, it feels to me like the right thing to do is to always remove the index's zone configs in the same transaction that writes the index GC job. You already cannot update the zone configs for an index which has been dropped so that wouldn't be a regression. That alone should fix the bug, right? In short, why don't we just always remove the zone config. We'll need to leave the code in the gcjob for gcjobs created in the older version for one release.

Reviewed 1 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @arulajmani, and @otan)

@ajstorm ajstorm requested a review from ajwerner March 15, 2021 17:13
Copy link
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR @ajwerner. I agree, and was thinking the same thing about the more generalized fix. Once I have the general repro, I'll open an issue.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @arulajmani, and @otan)

Before this commit, the transition from REGIONAL BY ROW to REGIONAL
BY TABLE would drop the partitioning on the table not drop the
zone configurations. As a result, when the index GC runs, it would
try and cleanup the zone configurations. This hit an issue though,
as the types are not properly hydrated in the index GC code path.
The end result was that the index GC would panic and bring down the
node. To work around this problem this commit adds code to remove
the zone configs when dropping the table partitioning.

Note: This commit does not solve the root cause of this issue, but
instead, prevent the failure from occurring when transitioning from
REGIONAL BY ROW tables to REGIONAL BY TABLE tables.  A new issue
will be opened for the general problem.

Release note: None
@ajstorm ajstorm force-pushed the ajstorm_zone_config_cleanup branch from 98b2165 to 70198b3 Compare March 15, 2021 17:24
@ajstorm
Copy link
Collaborator Author

ajstorm commented Mar 15, 2021

Merging with priority, as this is the last remaining release blocker for Beta 1.

bors r+ p=9999

@otan
Copy link
Contributor

otan commented Mar 15, 2021

Should've done single mode as well if you want priority heheh

@craig
Copy link
Contributor

craig bot commented Mar 15, 2021

Build succeeded:

@craig craig bot merged commit 2d3c34a into cockroachdb:master Mar 15, 2021
craig bot pushed a commit that referenced this pull request Mar 16, 2021
62008: sql: Change to using a session variable instead of FORCE syntax for overriding zone configuration settings for MR databases and tables r=otan a=ajstorm

sql: Block zone config updates to multi-region tables

Block users from updating the zone configurations of multi-region
tables, without first having them set the session variable
`override_multi_region_zone_config` to true. We block updates to
multi-region table/partition/index zone configurations because we don't
want users to accidentally override the prescribed settings, leaving
them open to sub-optimal performance. Note that only the multi-region
fields of the zone configuration are blocked behind this variable. All
other fields (gc.ttlseconds, range_min/max_bytes, etc) can be updated
without overriding.

Release note (sql change): Block users from updating the zone
configurations of multi-region tables.

sql: Use session variable instead of FORCE to override zone configs

With #61499 we introduced new syntax (FORCE) to override setting the
zone configurations on multi-region databases. Upon further reflection,
it was decided that a session variable would be better suited to the
task. This commit pulls out the FORCE syntax and replaces it with the
use of override_multi_region_zone_config;

Release note (sql change): Revert the release notes on #61499.  We now
use a session variable (override_multi_region_zone_config) to override
the zone configuration on multi-region databases (and tables).

Resolves: #57668.

Note to reviewers: Please ignore the first commit, as it's being separately reviewed as part of #61889.  That commit was pulle out into a separate PR as it's a release blocker, and there was a need to get it in urgently.

Co-authored-by: Adam Storm <storm@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: Panic after moving table from REGIONAL BY ROW to REGIONAL BY TABLE and back to REGIONAL BY ROW
4 participants