Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs,importccl: adopt protected timestamps for IMPORT #44739

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Feb 5, 2020

This PR comes in two commits. See each for details.

  1. The first refactors and augments the jobs.Registry API to enable transactional creation of a job which will be started on the local gateway. I would love to discuss this commit itself a bit.
  2. Adopt protected timestamps and demonstrate their effectiveness with a test.

@ajwerner ajwerner requested review from dt and spaskob February 5, 2020 01:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch 2 times, most recently from 3a084b5 to bb36776 Compare February 5, 2020 05:36
@ajwerner
Copy link
Contributor Author

ajwerner commented Feb 5, 2020

cc @lucy-zhang this is the PR with the jobs change

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch from bb36776 to d965916 Compare February 5, 2020 13:50
Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/jobs/registry.go, line 286 at r3 (raw file):

// StartableJob is a job created with a transaction to be started later.
// See Registry.CreateStartableJob
type StartableJob struct {

please move to jobs.go


pkg/jobs/registry.go, line 304 at r3 (raw file):

		}
	}()
	// TODO(ajwerner): Consider adding some protection against starting more than

can you add detail on how that can happen?


pkg/jobs/registry.go, line 320 at r3 (raw file):

// CreateStartableJobWithTxn creates a job to be started later, after the
// creating txn commits. It stores writes job in the jobs table, marks it

... stores writes ... ?


pkg/jobs/registry.go, line 332 at r3 (raw file):

		return nil, err
	}
	j.WithTxn(nil)

please add a comment why we need this


pkg/jobs/registry.go, line 338 at r3 (raw file):

	}
	resumerCtx, cancel := r.makeCtx()
	if err := r.register(*j.ID(), cancel); err != nil {

if the txn is aborted the job will not be unregistered.

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/ccl/importccl/import_stmt.go, line 630 at r4 (raw file):

StartableJob

why can't we do that here using using QueueJob from planner.go?

@ajwerner
Copy link
Contributor Author

ajwerner commented Feb 7, 2020

cc @lucy-zhang this has the jobs change we talked about - @spaskob and I discussed offline that you should certainly merge your change that utilized QueueJob and Run as written today and then later that can be refactored. This doesn't touch that API at all, just an FYI and some reassurance that any API changes in jobs will be out of your way.

// Prepare the protected timestamp record.
spans := make([]roachpb.Span, 0, len(tableDetails))
for i := range tableDetails {
spans = append(spans, tableDetails[i].Desc.TableSpan())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need to add spans for tables where IsNew is false (which could require revert) and then below we can skip if spans is empty

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch from d965916 to d8b6f38 Compare February 26, 2020 20:31
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/import_stmt.go, line 630 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…
StartableJob

why can't we do that here using using QueueJob from planner.go?

I think we discussed this offline.


pkg/ccl/importccl/import_stmt.go, line 652 at r4 (raw file):

Previously, dt (David Taylor) wrote…

I think we only need to add spans for tables where IsNew is false (which could require revert) and then below we can skip if spans is empty

Done.


pkg/jobs/registry.go, line 286 at r3 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

please move to jobs.go

Done.


pkg/jobs/registry.go, line 304 at r3 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

can you add detail on how that can happen?

If there's a programming error in the client. I added protection.


pkg/jobs/registry.go, line 320 at r3 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

... stores writes ... ?

Fixed.


pkg/jobs/registry.go, line 332 at r3 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

please add a comment why we need this

Done.


pkg/jobs/registry.go, line 338 at r3 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

if the txn is aborted the job will not be unregistered.

Good point! I've extended the interface to deal with that case and added some testing. The contract is now that user transactions which are rolled back will need to be cleaned up using CleanupOnRollback. I also added unit testing.

@ajwerner
Copy link
Contributor Author

erm actually let me fix the test, must have broken something

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch from d8b6f38 to 9adacc9 Compare February 26, 2020 22:57
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, RFAL now

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch 3 times, most recently from dc340d3 to 607ea4a Compare February 27, 2020 05:46
@ajwerner
Copy link
Contributor Author

@spaskob can you give the first commit this a look? I'd like to get this merged ASAP

@spaskob
Copy link
Contributor

spaskob commented Feb 27, 2020 via email

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/jobs/jobs.go, line 740 at r5 (raw file):

}

// Start will resume a resumable job. The transaction used to create the

Start will start a job; resume can be confused with RESUME job_id


pkg/jobs/jobs_test.go, line 1994 at r5 (raw file):

}

// TestStartableJobErrors tests that the StartableJob returns the expected

I know you want to get this merged but a good to have is to search for use of CreateAndStartJob in this test file sand see if it may be useful to add similar test or replace with a StartableJob.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/jobs/jobs_test.go, line 1994 at r5 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I know you want to get this merged but a good to have is to search for use of CreateAndStartJob in this test file sand see if it may be useful to add similar test or replace with a StartableJob.

I don't know that I follow. CreateAndStartJob is implemented on top of StartableJob. All tests which exercise it exercise the StartableJob code.

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great then, I missed that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch 4 times, most recently from 4f92a1d to 8d32c56 Compare February 28, 2020 20:39
@ajwerner
Copy link
Contributor Author

The remaining test failures are addressed by #45544

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for jobs code

Reviewed 1 of 7 files at r5, 4 of 9 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/ccl/importccl/import_processor_test.go, line 611 at r6 (raw file):



pkg/ccl/importccl/import_processor_test.go, line 611 at r6 (raw file):



pkg/ccl/importccl/import_processor_test.go, line 612 at r6 (raw file):

			resumer := raw.(*importResumer)

@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch from 8d32c56 to dd9f637 Compare February 29, 2020 09:25
StartableJob isn't great but it's better than what we've got now.
A StartableJob is a job which is created transactionally but also hooked in
to a `resultCh` and can be started on the local gateway after the transaction
commits.

Currently we've got two main ways that jobs are kicked off "synchronously":

 1) Registry.Run()

This doesn't currently do anything to kick off the jobs - it just waits for
all of the relevant jobs to be started. This is used in conjunction with
`planner.QueueJob` and `CreateJobWithTxn`. The idea here is that when a
transaction creates a job we want, upon commit, for it to wait for the
jobs it created to complete.

 2) Registry.CreateAndStartJob()

This method has special powers that prevents the gateway node from adopting
the job and also writes the job with a lease for the local node. It also allows
the caller to pass a resultsCh which will recieve datums from the job.

That resultCh is critical for IMPORT and BACKUP (?). Currently in both of those
cases we create and run the job on a transaction outside of the caller's txn.

This is particularly frustrating because the jobs package assigns IDs at
creation time, making it difficult to transactionally associate a job with
another resource while also retaining the ability to ensure that it starts
locally and is connected to a resultsCh. This commit addresses that.

The fact that `IMPORT` and `BACKUP` commit the job before committing the
planner's txn feels dirty. I'm not planning on mucking with that stuff here in
this PR but ultimately the work here seems to lay a possible path forward. In
particular, instead of calling `Run()` with a slice of IDs we could call it
with a slice of `StartableJob`s which may be hooked up to the channel. In that
way we could unify CreateAndStartJob() and Run() with something better.

Release note: None.
Perhaps this only needed to be applied to IMPORT INTO but it doesn't hurt
import. This change ensures that IMPORT statements protect the spans into
which they'll be writing from being GC'd.

The change adopts the StartableJob change from the previous commit,
protects a timestamp, and then makes sure to clean up that timestamp
on success or failure.

Before this change the test added here would fail with the following error:

```
require.go:1104:
                Error Trace:    import_into_test.go:152
                Error:          Expect "pq: job 526896149330362369: cannot be reverted, manual cleanup may be required: rolling back partially completed IMPORT: cannot revert before replica GC threshold 1580866345.938283735,0" to match "error response from server: 500 Internal Server Error"
                Test:           TestProtectedTimestampsDuringImportInto
```

Release note (bug fix): IMPORT INTO jobs which are canceled or fail
can no longer get stuck in an unrecoverable state if data from the
previous state of the table had expired relative to the GC TTL.
The logic to detect whether a type is clonable cannot traverse into all of the
implementations of itnerfaces. This is problematic when one of the structs
underneath a oneof is unclonable. While a more general solution might be nice,
this workaround allows the addition of a UUID field to a job details struct.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/adopt-protected-timestamps-in-import-into branch from dd9f637 to bfa371b Compare February 29, 2020 18:14
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @spaskob)


pkg/jobs/jobs.go, line 740 at r5 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Start will start a job; resume can be confused with RESUME job_id

Done.

@craig
Copy link
Contributor

craig bot commented Feb 29, 2020

Build succeeded

@craig craig bot merged commit 0dc604b into cockroachdb:master Feb 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants