
migration,jobs: refactor long-running migrations and hook up job #59760

Merged
merged 3 commits into cockroachdb:master from ajwerner/migration-job on Feb 10, 2021

Conversation


@ajwerner ajwerner commented Feb 3, 2021

This PR comes in three commits. The first commit reworks the package structure of long-running migrations a bit. The second introduces a migration job. The third introduces a system.migrations table to store completion state for migrations. The idea is that running long-running migrations in a job is handy because it provides leasing, observability, and control to users.

Fixes #58183.

@cockroach-teamcity

This change is Reviewable

@tbg tbg left a comment

Thanks for getting this out!

Reviewed 30 of 30 files at r1, 23 of 23 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)


pkg/migration/migrationmanager/manager_external_test.go, line 102 at r2 (raw file):

             WHERE (
                    crdb_internal.pb_to_json(
                        'cockroach.sql.jobs.jobspb.Payload',

wow, this pb builtin stuff has become wonderful


pkg/migration/migrationmanager/manager.go, line 45 at r2 (raw file):

// NewManager constructs a new Manager.
//
// TODO(irfansharif): We'll need to eventually plumb in on a lease manager here.

Is this TODO obsolete?


pkg/migration/migrationmanager/manager.go, line 280 at r2 (raw file):

  WHERE status IN ` + jobs.NonTerminalStatusTupleString + `
	)
	WHERE pl->'longRunningMigration'->'clusterVersion' = $1::JSON;`

So I understand that if I invoke two overlapping bumps to 21.1 that they will serialize on each other, but what if I do 20.2-15 and 21.1? They will step through a prefix of the same versions, but it looks like they'll be allowed to run in parallel. I don't know how much we need to do to prevent this, but it'd be good to document the behavior. It could backfire if we moved to more rapid "major" releases or extended the allowed gap between major releases (which is, right now, at zero). In both cases we could end up with concurrent upgrades if the cluster got upgraded in rapid succession.
For now, maybe good enough to return an error?

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @tbg)


pkg/migration/migrationmanager/manager.go, line 45 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this TODO obsolete?

Indeed.


pkg/migration/migrationmanager/manager.go, line 280 at r2 (raw file):

For now, maybe good enough to return an error?

In what scenario are you suggesting we return an error? Right now it effectively allows the concurrent migrations to occur, with the caveat that no individual migration will run concurrently with itself. If these things are idempotent and the cluster version never moves backwards (which it shouldn't), then this isn't so hazardous. Sure, there's a small window during which these concurrent operations might happen, but once any of them succeeds in setting the new cluster version and any other concurrent attempts finish, we shouldn't see the migration run again.

You have highlighted a big, important, missing thing here. If we do have clients concurrently trying to jump through a number of versions and it turns out that one of those migrations is buggy or not finishing, then we might be in a bit of a pickle where we re-run migrations potentially a lot of times.

Ideally we'd make the successful completion of any migration store that fact somewhere so that the migration doesn't get run again. Today there's this weird thing where the side effect of setting the cluster version is handled very far away from the code that actually runs the migrations, and the final version is only set after running all of the migrations.

We've got some options:

  1. Detect succeeded jobs here and then not run again.

This approach would probably be okay and is very simple but it isn't perfect. Job records get GC'd and it's rather implicit.

  2. Write to a key at the end of the job which we read either here or in the job itself.

I'm fine with that but it's a bit heavy. We'd probably want a table. I don't think we're all too keen on random keys like the existing migration keys which are used in sqlmigrations today (right?)

  3. Rework the version setting code to always step through the versions and commit them one-by-one.

We've disallowed explicit transactions for setting this cluster setting. It makes it so that you might end up only partially getting the version to where you want, which might be weird. I think this is my preferred approach but I could be convinced of one of the others.

@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 2 times, most recently from 673bea4 to eeaff75 on February 4, 2021 at 21:03
@ajwerner ajwerner marked this pull request as ready for review February 4, 2021 21:04
@ajwerner ajwerner requested a review from a team as a code owner February 4, 2021 21:04
@ajwerner ajwerner left a comment

We had a fun bootstrapping problem to make this all work but I took care of it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @tbg)


pkg/migration/migrationmanager/manager.go, line 280 at r2 (raw file):

Previously, ajwerner wrote… (see the discussion of options 1–3 above)

I went with option 2. It was heavier, but I couldn't justify pushing down the cluster-setting-changing logic. I think this worked reasonably well, and I added a test. The hiccup was the migration to add the table, but I think I dealt with it just fine.
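
To make the option-2 shape concrete, here is a minimal, hypothetical sketch: consult the completion table, run the migration only if needed, then persist the completion marker so that reruns and concurrent attempts become no-ops. The function and parameter names are illustrative, not the ones in this PR.

package migrationsketch

import "context"

// runMigrationOnce sketches the option-2 flow with hypothetical parameters:
// consult the completion table, run the migration only if it hasn't completed,
// then persist the completion marker so that reruns and concurrent attempts
// for the same version become no-ops.
func runMigrationOnce(
	ctx context.Context,
	version string,
	checkCompleted func(context.Context, string) (bool, error),
	run func(context.Context) error,
	markCompleted func(context.Context, string) error,
) error {
	done, err := checkCompleted(ctx, version)
	if err != nil || done {
		// Either we failed to read the completion table or the migration
		// already ran; in both cases there is nothing more to do here.
		return err
	}
	if err := run(ctx); err != nil {
		return err
	}
	return markCompleted(ctx, version)
}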

@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 2 times, most recently from f6e1a34 to af69c1c on February 5, 2021 at 17:07
@ajwerner ajwerner requested review from a team and pbardea and removed request for a team February 5, 2021 17:07
@ajwerner ajwerner left a comment

The bootstrap problems here were actually rather thorny but they have been navigated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, @pbardea, and @tbg)

@ajwerner ajwerner left a comment

@pbardea I wonder if you have thoughts on the backup stuff. Should we back up cluster migration state? What sort of enforcement or invariants do we have about restoring backups and their relationship to cluster versions? I'd hope that we write down a version in the backup and only ever restore into a cluster with a newer active version. @dt if you've got feelings, I'm interested.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, @pbardea, and @tbg)

@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 2 times, most recently from 2018612 to 6c31b48 on February 7, 2021 at 06:26
@tbg tbg left a comment

This all looks good to me, thanks for all of this work! Since a lot of the nitty-gritty is in the jobs code, I would ask you to get a second review for that part from someone more experienced with jobs. I had a few comments, but nothing dramatic.

Reviewed 25 of 25 files at r3, 24 of 24 files at r4, 24 of 24 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @pbardea)


pkg/clusterversion/cockroach_versions.go, line 206 at r5 (raw file):

	// ReplicaVersions enables the versioning of Replica state.
	ReplicaVersions
	// ReplacedTruncatedAndRangeAppliedStateMigration stands in for

Can you unexport these to avoid confusion? They shouldn't be referenced from outside this package, right?


pkg/migration/migration.go, line 177 at r5 (raw file):

// SQL layer. Lower-level migrations may depend solely on the Cluster.
type SQLDeps struct {
	Cluster  Cluster

Cluster should split into two subinterfaces, right?

type SystemTenantCluster interface {
  AnyTenantCluster
  ForEveryNode(ctx context.Context, op string, fn func(context.Context, serverpb.MigrationClient) error) error
  UntilClusterStable(/* ... */)
}

type AnyTenantCluster interface {
  DB() *kv.DB
  IterateRangeDescriptors(/* ... */)
}

That way SQLMigrationFn can be forced to take only an AnyTenantCluster and thus "ensure" (to a reasonable degree) that folks won't accidentally write migrations that don't work in multi-tenancy. (Obviously we need a mixed version test anyway but still).

I realize you're not working on that issue now. Just throwing it out there. You can punt; I'm already so grateful you're doing all of this jobs work.


pkg/migration/migration.go, line 182 at r5 (raw file):

}

type KVMigrationFn func(context.Context, clusterversion.ClusterVersion, Cluster) error

Don't these need comments for the linter?


pkg/migration/migration.go, line 183 at r5 (raw file):

type KVMigrationFn func(context.Context, clusterversion.ClusterVersion, Cluster) error
type SQLMigrationFn func(context.Context, clusterversion.ClusterVersion, SQLDeps) error

❤️
Commenting to tag the issue: #48436 (comment)


pkg/migration/migrationjob/migration_job.go, line 42 at r5 (raw file):

func NewRecord(version clusterversion.ClusterVersion, user security.SQLUsername) jobs.Record {
	return jobs.Record{
		Description: "Long running migration",

Is it supposed to be "Long-running"?


pkg/migration/migrationjob/migration_job.go, line 58 at r5 (raw file):

var _ jobs.Resumer = (*resumer)(nil)

func (r resumer) Resume(ctx context.Context, execCtxI interface{}) error {

Since I'm not an expert on jobs, you probably want to get a second pair of eyes on the jobs-related code.


pkg/migration/migrationjob/migration_job.go, line 67 at r5 (raw file):

	alreadyCompleted, err := CheckIfMigrationCompleted(ctx, nil /* txn */, ie, cv)
	if alreadyCompleted || err != nil {
		return errors.Wrapf(err, "checking migration completion for %v", cv)

I'm always on the fence about this pattern (using errors.Wrap to return nil) but I don't know exactly why, so just a remark and not a suggestion to change.


pkg/migration/migrationjob/migration_job.go, line 76 at r5 (raw file):

	}
	if !ok {
		// TODO(ajwerner): Consider treating this as an assertion failure. Jobs

👍 could there be exceptions though if we reorder versions?


pkg/migration/migrationmanager/manager_external_test.go, line 248 at r5 (raw file):

	defer log.Scope(t).Close(t)

	// We're going to be migrating from startKey to endKey. We end up needing

This is going to be annoying every time we phase out old versions. Could you add a testing knob that can mock out the call to ListBetween in (*Manager).Migrate? There's already the corresponding thing (though unfortunately as a singleton, but whatever) in TestingRegisterMigrationInterceptor which you are already using.


pkg/migration/migrationmanager/manager_external_test.go, line 275 at r5 (raw file):

					ctx context.Context, cv clusterversion.ClusterVersion, c migration.SQLDeps,
				) error {
					migrationRunCounts[cv]++

You could also declare `var active int32 // atomic` above, and here

if atomic.AddInt32(&active, 1) != 1 {
  t.Error("unexpected concurrency")
}
time.Sleep(5*time.Millisecond)
atomic.AddInt32(&active, -1)

Just anticipating that this test won't be run under the race detector forever :S (Plus, not sure how efficient the race detector is for this particular race)


pkg/migration/migrationmanager/manager.go, line 294 at r5 (raw file):

  WHERE status IN ` + jobs.NonTerminalStatusTupleString + `
	)
	WHERE pl->'longRunningMigration'->'clusterVersion' = $1::JSON;`

Is there a danger of us renaming those fields and badly breaking things? Do these fields have explicit json name tags that don't depend on proto name changes? Relying on proto field names always scares me because we always assume these can be mutated.
I assume there's no way for this protoreflect stuff to use field numbers as accessors?

@pbardea pbardea removed their request for review February 8, 2021 15:06
@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 2 times, most recently from 87a16ba to d747903 on February 9, 2021 at 02:07
@ajwerner ajwerner requested a review from a team as a code owner February 9, 2021 02:07
@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 3 times, most recently from 167d0bc to 4e2916c on February 9, 2021 at 05:20
@ajwerner ajwerner left a comment

With this PR and the associated issues, it feels like this subsystem can and should totally subsume pkg/sqlmigrations. I can't think of a reason why sqlmigrations should continue to exist.

@miretskiy any chance I can coax you into looking at the job bits of this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @tbg)


pkg/clusterversion/cockroach_versions.go, line 206 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you unexport these to avoid confusion? They shouldn't be referenced from outside this package, right?

Done.


pkg/migration/migration.go, line 177 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote… (suggesting that Cluster be split into SystemTenantCluster and AnyTenantCluster so that SQLMigrationFn can only take an AnyTenantCluster; see the sketch above)

Good point. I don't even think that IterateRangeDescriptors is needed for the SQL migrations. I've reworked this a bit. It's not amazing, but I think it's better.

I think that getting the tenant work finished here isn't too bad. One of the biggest blockers seems to be #58362 which I'm planning on sprucing up this week.


pkg/migration/migration.go, line 182 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't these need comments for the linter?

Yes, done.


pkg/migration/migrationjob/migration_job.go, line 42 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is it supposed to be "Long-running"?

Done.


pkg/migration/migrationjob/migration_job.go, line 67 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm always on the fence about this pattern (using errors.Wrap to return nil) but I don't know exactly why, so just a remark and not a suggestion to change.

I've come around to using it. I find myself really happy to have annotations indicating the call stack of an error and for various reasons we don't tend to get stack traces back to clients. It should be plenty cheap for everything but very hot code paths.
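
For reference, a small self-contained illustration of why the early return works, assuming the usual Wrapf semantics where wrapping a nil error yields nil; the function names here are illustrative, not the ones in this PR.

package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// check returns (true, nil) when the work is already done. The caller wraps
// whatever it got back: if err is nil, errors.Wrapf returns nil, so the single
// early return handles both the "already done" and the "error" cases.
func check() (done bool, err error) { return true, nil }

func run() error {
	done, err := check()
	if done || err != nil {
		// Returns nil when err is nil, an annotated error otherwise.
		return errors.Wrapf(err, "checking completion")
	}
	return nil
}

func main() {
	fmt.Println(run()) // <nil>
}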


pkg/migration/migrationjob/migration_job.go, line 76 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

👍 could there be exceptions though if we reorder versions?

I suppose, yes. Added more commentary.


pkg/migration/migrationmanager/manager_external_test.go, line 248 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is going to be annoying every time we phase out old versions. Could you add a testing knob that can mock out the call to ListBetween in (*Manager).Migrate? There's already the corresponding thing (though unfortunately as a singleton, but whatever) in TestingRegisterMigrationInterceptor which you are already using.

Done.
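
A minimal sketch of what such a testing knob could look like; the knob and field names below are illustrative, not the ones added in this PR.

package main

import "fmt"

// Version stands in for clusterversion.ClusterVersion in this sketch.
type Version struct{ Major, Minor int32 }

// TestingKnobs sketches the shape of a knob that lets tests override the
// version listing instead of depending on the real registered versions.
type TestingKnobs struct {
	ListBetweenOverride func(from, to Version) []Version
}

// listBetween consults the knob when present and otherwise falls back to the
// real registry (elided here).
func listBetween(knobs *TestingKnobs, from, to Version) []Version {
	if knobs != nil && knobs.ListBetweenOverride != nil {
		return knobs.ListBetweenOverride(from, to)
	}
	return nil // the real implementation would consult the registered versions
}

func main() {
	knobs := &TestingKnobs{
		ListBetweenOverride: func(from, to Version) []Version {
			return []Version{from, to} // tests pin the exact versions they exercise
		},
	}
	fmt.Println(listBetween(knobs, Version{Major: 20, Minor: 2}, Version{Major: 21, Minor: 1}))
}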


pkg/migration/migrationmanager/manager_external_test.go, line 275 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could also declare `var active int32 // atomic` above, and here

if atomic.AddInt32(&active, 1) != 1 {
  t.Error("unexpected concurrency")
}
time.Sleep(5*time.Millisecond)
atomic.AddInt32(&active, -1)

Just anticipating that this test won't be run under the race detector forever :S (Plus, not sure how efficient the race detector is for this particular race)

Done.
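
Spelled out, the suggestion above compiles to roughly the following self-contained sketch; the goroutines are deliberately overlapped here so the counter trips, whereas in the real test any overlap would be a failure (t.Error).

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Each fake migration bumps a counter on entry and drops it on exit; seeing
// the counter above 1 means two migrations overlapped.
func main() {
	var active int32 // accessed atomically
	var overlapped int32
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if atomic.AddInt32(&active, 1) != 1 {
				atomic.StoreInt32(&overlapped, 1) // a real test would t.Error here
			}
			time.Sleep(5 * time.Millisecond) // widen the window so overlap is detectable
			atomic.AddInt32(&active, -1)
		}()
	}
	wg.Wait()
	fmt.Println("overlap detected:", atomic.LoadInt32(&overlapped) == 1)
}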


pkg/migration/migrationmanager/manager.go, line 294 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is there a danger of us renaming those fields and badly breaking things? Do these fields have explicit json name tags that don't depend on proto name changes? Relying on proto field names always scares me because we always assume these can be mutated.
I assume there's no way for this protoreflect stuff to use field numbers as accessors?

Certainly there is. This code needs to stay in sync with the field naming compiled into it. Unit testing here should catch issues. It is somewhat brittle. Fortunately, this isn't a mixed-version problem. We made these pb_to_json functions not eligible for DistSQL for exactly this reason.

@miretskiy miretskiy left a comment

Reviewed 1 of 30 files at r1, 1 of 25 files at r3, 4 of 44 files at r6, 7 of 22 files at r7, 21 of 47 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @tbg)


pkg/ccl/backupccl/system_schema.go, line 232 at r8 (raw file):

	systemschema.LongRunningMigrationsTable.GetName(): {
		includeInClusterBackup: optOutOfClusterBackup,
	},

FYI: you'll need to rebase. This code moved to the backupbase package.


pkg/migration/migration.go, line 17 at r8 (raw file):

activate 

made active?


pkg/migration/migration.go, line 21 at r8 (raw file):

pkg/sqlmigrations.

don't you want to get rid of it eventually? Perhaps drop mention of this?


pkg/migration/migrationcluster/cluster.go, line 74 at r8 (raw file):

	}

	// We'll want to rate limit outgoing RPCs (limit pulled out of thin air).

I'm curious: why do we want to rate limit here?


pkg/migration/migrationjob/migration_job.go, line 78 at r8 (raw file):

		// with a newer version where the migration has been re-ordered to be later.
		// This should only happen between alphas but is theoretically not illegal.
		return nil

So, that would mark the job completed?


pkg/migration/migrationjob/migration_job.go, line 99 at r8 (raw file):

	// no-op and new jobs are not created.
	return errors.Wrapf(markMigrationCompleted(ctx, ie, cv),
		"marking migration complete for %v", cv)

nit: perhaps longhand?
I see you already discussed this... I don't know; most of the code does "if err != nil { return wrapf }"


pkg/migration/migrationmanager/manager.go, line 269 at r8 (raw file):

migrationjob.CheckIfMigrationCompleted

nit: this almost feels like this method might have a better home inside migration package (or migrationmanager) instead of migrationjob?


pkg/migration/migrationmanager/manager.go, line 302 at r8 (raw file):

		SELECT id,
		status,
		crdb_internal.pb_to_json(

🎉


pkg/migration/migrationmanager/manager.go, line 310 at r8 (raw file):

  WHERE status IN ` + jobs.NonTerminalStatusTupleString + `
	)
	WHERE pl->'longRunningMigration'->'clusterVersion' = $1::JSON;`

Do you need to use a CTE here? Can't we just
select id, status, crdb_internal.pb_to_json() pl
where status in ... AND pl->....?


pkg/migration/migrationmanager/manager.go, line 341 at r8 (raw file):

		}
		log.Errorf(ctx, "%s", buf)
		return false, 0, errors.AssertionFailedf("%s", buf)

will the cluster ever recover from this?


pkg/migration/migrationmanager/manager.go, line 360 at r8 (raw file):

		// do something similar would be caught at the sql layer (also tested in
		// the same logictest). We'll just explicitly append the target version
		// here instead, so that we're able to actually migrate into it.

just wondering: can this be made more explicit? Like, perhaps tests (or test cluster?) can set a variable here that this manager is running under test and only then allow this? Otherwise panic or something?

Might be more trouble than worth though.

@ajwerner ajwerner left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, @miretskiy, and @tbg)


pkg/migration/migration.go, line 17 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
activate 

made active?

Done.


pkg/migration/migration.go, line 21 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
pkg/sqlmigrations.

don't you want to get rid of it eventually? Perhaps drop mention of this?

Done.


pkg/migration/migrationcluster/cluster.go, line 74 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

i'm curious why do we want to rate limit here?

I didn't write this code, @irfansharif did. I'm not inclined to change it (Jordan Lewis's law)


pkg/migration/migrationjob/migration_job.go, line 78 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

So, that would mark the job completed?

Yes but it won't mark the migration as completed.


pkg/migration/migrationjob/migration_job.go, line 99 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nit: perhaps long hand?
I see you already discussed this... I don't know; most of the code does "if err != nil { return wrapf }"

Done.


pkg/migration/migrationmanager/manager.go, line 269 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
migrationjob.CheckIfMigrationCompleted

nit: this almost feels like this method might have a better home inside migration package (or migrationmanager) instead of migrationjob?

Heh it started in the migrationmanager package. Ultimately the job needs to write the completion. I suppose that the most correct thing to do would be to abstract it completely but that feels like overkill.


pkg/migration/migrationmanager/manager.go, line 310 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

do you need to use cte here? Can't we just
select id, status, crdb_internal.pb_to_json() pl
where status in ... AND pl->....?

This isn't a CTE; it's a subquery. I'm not sure what you mean. In SQL you can't refer to a projection column in a WHERE clause:

demo@127.0.0.1:44909/movr> SELECT 1 as v WHERE v > 0;
ERROR: column "v" does not exist
SQLSTATE: 42703
demo@127.0.0.1:44909/movr> SELECT v FROM (SELECT 1 as v) WHERE v > 0;
  v
-----
  1
(1 row)

CTE would be:

WITH t AS (SELECT 1 AS v) SELECT v FROM t WHERE v > 0;

pkg/migration/migrationmanager/manager.go, line 341 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

will the cluster ever recover from this?

I believe it ends up just failing the job and sending a Sentry report. This is an invariant; it should not happen.


pkg/migration/migrationmanager/manager.go, line 360 at r8 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

just wondering: can this be made more explicit? Like, perhaps tests (or test cluster?) can set a variable here that this manager is running under test and only then allow this? Otherwise panic or something?

Might be more trouble than worth though.

Done. Injected the testing knob inside the logic test that was depending on this behavior. I was sad about it too.

@ajwerner ajwerner force-pushed the ajwerner/migration-job branch 2 times, most recently from ddd4f45 to c7c5d1a on February 9, 2021 at 21:45
@irfansharif irfansharif left a comment

Thanks for doing this! I've looked at the parts of pkg/migration I'm most comfortable with, and I'm deferring to you + other reviewers for the bits I'm not (basically the jobs stuff).

pkg/migration/migrationmanager/manager.go: two outdated comment threads (resolved)
// (e) We'll continue to loop around until the node ID list stabilizes
//
// [*]: We can be a bit more precise here. What UntilClusterStable gives us is a
// strict causal happenedbefore relation between running the given closure and
Contributor:

"happened-before"?

Contributor Author:

Done.

// [4]: pkg/kv/kvserver/batch_eval/cmd_migrate.go
//
type Migration interface {
	ClusterVersion() clusterversion.ClusterVersion
Contributor:

Given that we want only two real types to implement this, KVMigration and SQLMigration, should we define a hidden method here to restrict the list of types implicitly implementing the Migration interface?

Contributor Author:

Done.
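
For reference, a minimal sketch of the unexported-method ("sealed interface") pattern being suggested; the type names mirror the discussion, but the code is purely illustrative.

package main

// Migration is sealed: the unexported method means only types in this package
// can satisfy the interface, so KVMigration and SQLMigration stay the only
// implementations.
type Migration interface {
	// internal is never called; it exists only to restrict implementers.
	internal()
}

type KVMigration struct{}

func (KVMigration) internal() {}

type SQLMigration struct{}

func (SQLMigration) internal() {}

func main() {
	// Both concrete types satisfy Migration; a type outside this package cannot.
	var _ Migration = KVMigration{}
	var _ Migration = SQLMigration{}
}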

"github.com/cockroachdb/cockroach/pkg/sqlmigrations"
)

func init() {
Contributor:

Remove the empty init?

Contributor Author:

Done.

@@ -473,6 +474,11 @@ func (p *planner) DistSQLPlanner() *DistSQLPlanner {
	return p.extendedEvalCtx.DistSQLPlanner
}

// MigrationRunDependencies returns the migration.Cluster if there is one.
Contributor:

Did we mean to say migration.Cluster here?

Contributor Author:

Fixed. I was toying for a few minutes with what migrations will look like for tenants but backed off a bit.

}

// RunDependencies are used by the job to run migrations.
type RunDependencies interface {
Contributor:

Does "RunDependencies" match with terminology we use elsewhere in the jobs infrastructure? Or is the "Run" here referring the runnable migrations (KVMigrations.Run, SQLMigrations.Run)? I don't have a better name off-hand, though compared to the rest of the interfaces we've defined here, this was a bit less clear.

Contributor Author:

I agree this was a wart. I've reworked it to be migration.JobDeps and pulled the Registry to just be defined inline. I think it's cleaner.

// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package migrationcluster
Contributor:

[nit] Can we use a better file name for this file? At the moment we have this, and cluster.go, the latter of which houses the primary types in this subpackage.

Aside from ClusterConfig, everything here is liveness-related? Should ClusterConfig move closer to migrationcluster.New instead?

Contributor Author:

Removed this file. Moved the dependencies to cluster.go and the NodesFromNodeLiveness to nodes.go

@@ -345,6 +345,17 @@ CREATE TABLE system.sqlliveness (
expiration DECIMAL NOT NULL,
FAMILY fam0_session_id_expiration (session_id, expiration)
)`

LongRunningMigrationsTableSchema = `
CREATE TABLE system.long_running_migrations (
Contributor:

Probably not worth doing now, but if we're imagining that this subsystem will eventually grow to envelop pkg/sqlmigrations, maybe a better table name is in order? internal_migrations?

Again, probably not worth introducing new names for, but curious what you think. I think it would also tie in better with the fact that we've named the top-level package as simply "migrations". I don't think we've used "long-running" anywhere in code before this PR.

Contributor Author:

Done, I've removed "long running" in all of its forms. Everything is now just migrations.

@irfansharif irfansharif left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @miretskiy, and @tbg)


pkg/migration/migrationcluster/cluster.go, line 74 at r8 (raw file):

Previously, ajwerner wrote…

I didn't write this code, @irfansharif did. I'm not inclined to change it (Jordan Lewis's law)

The reasoning was to try to "roll out" the operations run across every node in the cluster, out of an abundance of caution. With a limit as high as 25 (nodes in parallel), it's perhaps too high in practice. Can I cite Jordan's law for constants I introduced myself, to avoid doing any work to change them?
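
For illustration, a generic sketch of that kind of bounded fan-out using a counting semaphore. This is not the actual pkg/migration implementation, and the limit of 25 is just the number mentioned above.

package main

import (
	"context"
	"fmt"
	"sync"
)

// forEveryNode fans an operation out to every node while capping how many run
// at once, in the spirit of the rate limit discussed above.
func forEveryNode(ctx context.Context, nodeIDs []int, limit int, op func(context.Context, int) error) error {
	sem := make(chan struct{}, limit)       // counting semaphore
	errCh := make(chan error, len(nodeIDs)) // buffered so goroutines never block on send
	var wg sync.WaitGroup
	for _, id := range nodeIDs {
		id := id
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` operations are in flight
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			errCh <- op(ctx, id)
		}()
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	nodes := []int{1, 2, 3, 4, 5}
	err := forEveryNode(context.Background(), nodes, 25, func(_ context.Context, id int) error {
		fmt.Println("migrating node", id)
		return nil
	})
	fmt.Println("err:", err)
}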

Commit messages (3 commits):

No tests were harmed in this commit. Some abstractions which felt like cruft
were thinned.

Release note: None
This commit introduces a job to run long-running migrations. This empowers
long-running migrations with leases, pausability, and cancelability.

Fixes cockroachdb#58183

Release note: None
This commit adds a table, `system.migrations`, to store the completion
status of long-running migrations. This table is used to ensure
safety and isolation in the face of concurrent attempts to upgrade the
cluster version.

An important facet of this change is the migration to add this table
to the cluster during upgrades. Given that the implementation of long-running
migrations relies on the table's existence, all versions which are associated
with long-running migrations must follow the version which introduces this
table. This migration needs to utilize the new infrastructure rather than
the `sqlmigrations` infrastructure because of some really bad behavior
introduced in 20.2 (cockroachdb#59850). Given that two such versions already existed, they
have been assigned new version keys, and their old version keys and values
are now merely placeholders. This provides compatibility with the earlier
alphas.

Release note: None
@ajwerner

Created #60307 for a fast follow-on. Also fixed an issue with pausing and added a test.

@ajwerner

TFTR!

bors r+


craig bot commented Feb 10, 2021

Build succeeded.

@craig craig bot merged commit 3a3565a into cockroachdb:master Feb 10, 2021