
provisionerd sends failed or complete last #2732

Merged: spikecurtis merged 5 commits into main from spike/provisionerd_cleanup_complete on Jul 1, 2022

Conversation

spikecurtis (Contributor, author):

Substantially narrows the race window on #2603 but doesn't close it.

This PR refactors provisionerd to include a job runner designed to ensure that we send at most one FailedJob or CompletedJob message per job, and that we cease all updates for a job once that message is sent.

It uses unexported interfaces to distinguish methods that may be called by the main provisionerd goroutines (e.g., canceling the current job) from methods internal to the runner. Similarly, we abstract the provisionerd server behind an interface seen by the runner, so that messages about jobs are always sent through consistent code.
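To make the "at most one terminal message" idea concrete, here is a minimal Go sketch of the guard the description implies; the names (terminalSent, markTerminal) are hypothetical and not the PR's actual code:

```go
package runner

import (
	"errors"
	"sync"
)

// runner tracks whether a terminal message (FailedJob or CompletedJob)
// has already gone out for its job.
type runner struct {
	mu           sync.Mutex
	terminalSent bool
}

var errTerminal = errors.New("job already failed or completed")

// markTerminal flips the guard exactly once. The caller sends the
// FailedJob/CompletedJob only when this returns true.
func (r *runner) markTerminal() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.terminalSent {
		return false
	}
	r.terminalSent = true
	return true
}

// update refuses to forward job updates once a terminal message was sent.
func (r *runner) update() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.terminalSent {
		return errTerminal
	}
	// ... forward the UpdateJobRequest to the server here ...
	return nil
}
```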

Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis requested review from mafredri and a team June 29, 2022 19:13
@@ -7,6 +7,6 @@ func IsInitProcess() bool {
return false
}

func ForkReap(opt ...Option) error {
spikecurtis (author):

This trips up the linter on my machine, so I fixed it. CI missed it because CI linting runs on Linux.
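For context, the usual fix for a lint complaint about an unused parameter in a platform stub is to leave the parameter unnamed. A hypothetical sketch of a non-Linux stub, not necessarily the exact change in this PR (the Option stand-in is illustrative; the real type lives elsewhere in the package):

```go
//go:build !linux

package reaper

// Option is a stand-in for this sketch only.
type Option struct{}

// IsInitProcess is a stub on non-Linux platforms.
func IsInitProcess() bool {
	return false
}

// ForkReap is a no-op outside Linux. Using "_" for the variadic
// parameter satisfies unused-parameter linters without changing the API.
func ForkReap(_ ...Option) error {
	return nil
}
```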

@@ -130,6 +130,7 @@ func (api *API) provisionerJobLogs(rw http.ResponseWriter, r *http.Request, job
for _, log := range logs {
select {
case bufferedLogs <- log:
api.Logger.Debug(r.Context(), "subscribe buffered log", slog.F("job_id", job.ID), slog.F("stage", log.Stage))
Member:

Is this going to be noisy in our logs?

spikecurtis (author):

In Debug!
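(Background on the point being made: with cdr.dev/slog, Debug-level entries are dropped unless the logger runs at a verbose level, so this line is silent in a default deployment. A minimal sketch; the sink wiring here is illustrative, not coderd's actual setup:)

```go
package main

import (
	"context"
	"os"

	"cdr.dev/slog"
	"cdr.dev/slog/sloggers/sloghuman"
)

func main() {
	ctx := context.Background()
	// Leveled(slog.LevelInfo) drops Debug entries entirely, so the
	// "subscribe buffered log" line only appears when verbose logging
	// (LevelDebug) is enabled.
	logger := slog.Make(sloghuman.Sink(os.Stderr)).Leveled(slog.LevelInfo)
	logger.Debug(ctx, "subscribe buffered log") // filtered out
	logger.Info(ctx, "visible at the default level")
}
```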

Comment on lines 66 to 70
type messageSender interface {
updateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error)
failJob(ctx context.Context, in *proto.FailedJob) error
completeJob(ctx context.Context, in *proto.CompletedJob) error
}
Member:

Could this just use the DRPCProvisionerDaemonServer interface instead? Provisionerd seems to be wrapping those methods, which is fine, but that's additional functionality, not something required to fulfill the interface. Reusing it could reduce indirection.

spikecurtis (author):

That interface is larger, giving access to the Conn and the ability to acquire a new job. We don't want the runner doing that.

Member:

[screenshot]

This one seems to be essentially the same; it just adds AcquireJob, which I don't think is a big deal to expose anyway.

spikecurtis (author):

The provisionerd.Server already has an acquireJob() method, and it doesn't have this signature, nor should it.

Also, DRPCProvisionerDaemonServer exposes these as methods visible outside the package, and we don't want that.

Member:


I'm not sure creating a single interface that's used in a single place improves clarity. I feel like a comment explaining why acquireJob shouldn't be passed through would do just as well, because a consumer could simply add it if they really wanted to call it anyway.

spikecurtis (author):

So, with the new structure of Runner in a separate package, we definitely need an interface here: *Runner can't use the *Server type because that would cause an import loop.

I'm pretty strongly against reusing the DRPCProvisionerDaemonServer interface for this purpose because it requires implementing and exposing the AcquireJob() method.

"Keep interfaces small" is generally considered a best practice.

More important in this context, though: we don't want the Runner acquiring jobs. That's not the contract. The Server acquires the job; the Runner updates, fails, or completes it.

Moreover, the Server is presently designed with the assumption that there is only one acquired job at a time. If the Server exposed an external method that allowed other things to acquire jobs, that assumption could easily be violated. We should not leave guns like that lying around for our future selves to get shot in the foot.
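(To illustrate the shape being argued for: a narrow, consumer-side interface defined in the runner package that *Server satisfies implicitly. This is a hypothetical sketch, not the merged code; the method names follow the exported forms seen later in this review, and the proto import path is assumed:)

```go
package runner

import (
	"context"

	"github.com/coder/coder/provisionerd/proto"
)

// JobUpdater is the only contract the Runner needs from the server.
// Defining it on the consumer side avoids an import loop with the
// provisionerd package, and it deliberately omits AcquireJob: the
// Server acquires jobs; the Runner merely updates, fails, or
// completes them.
type JobUpdater interface {
	UpdateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error)
	FailJob(ctx context.Context, in *proto.FailedJob) error
	CompleteJob(ctx context.Context, in *proto.CompletedJob) error
}
```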

mafredri (Member) left a comment:


Some questions and minor stylistic comments, but in general this looks good to me! I'm deferring approval to someone more familiar with the previous logic, though.

Side note: being somewhat unfamiliar with provisionerd, it would have been nice to have this as two PRs, one breaking the functionality out of a single file and another implementing the fix. That would make it clearer which code is new and which is old.

// are canceled but complete asynchronously (although they are prevented from further updating the job to the coder
// server). The provided context sets how long to keep trying to send the FailJob.
func (r *runner) fail(ctx context.Context, f *proto.FailedJob) error {
f.JobId = r.job.JobId
Member:

Should this be protected by a mutex? It's also set in setFail.

spikecurtis (author):

r.job is never changed, so it doesn't need to be protected.

Member:

I was referring to f.JobId, which is re-assigned.

Member:

Ah, I just noticed that f won't actually be re-used between function invocations, so not a problem. Curious why this uses r.job.JobId while the other uses r.job.GetJobId()?

spikecurtis (author):

That's on the passed-in proto, and it isn't protected by any mutex.

spikecurtis (author):

> this uses r.job.JobId and the other r.job.GetJobId()

That's just me being inconsistent for no reason.
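(Background on why the mix is harmless: Go protobuf generated getters are nil-receiver-safe, while direct field access panics on a nil message. A runnable sketch with a hand-written stand-in for the generated type; the real proto.AcquiredJob getter has the same shape:)

```go
package main

import "fmt"

// AcquiredJob is a minimal stand-in for a protobuf-generated message.
type AcquiredJob struct {
	JobId string
}

// Generated getters guard against a nil receiver.
func (j *AcquiredJob) GetJobId() string {
	if j == nil {
		return ""
	}
	return j.JobId
}

func main() {
	var job *AcquiredJob              // nil message
	fmt.Println(job.GetJobId() == "") // true: the getter is nil-safe
	// fmt.Println(job.JobId)         // would panic: nil dereference

	job = &AcquiredJob{JobId: "abc123"}
	fmt.Println(job.JobId, job.GetJobId()) // equivalent when non-nil
}
```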

case <-r.forceStopContext.Done():
return
case <-r.gracefulContext.Done():
_ = stream.Send(&sdkproto.Provision_Request{
Member:

I'm not sure what's supposed to happen here. Since we're closing the stream on function exit (above), we may do stream.Send on a closed stream here. I'm guessing this should be OK, but a comment explaining it may be good.

Member:

@spikecurtis did you see this comment? (Might've gotten lost in the noise, won't object if you feel a comment is not needed.)

spikecurtis (author):

Sorry, I kind of mentally skipped over it because that's existing code, apart from the names of the contexts.

But yeah, I think your analysis is right: the stream might be closed at the point this case fires, but sending the cancel is best effort anyway.
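(A sketch of the pattern under discussion, continuing the excerpt above: two nested contexts and a best-effort cancel send whose error is deliberately discarded. The surrounding code and the exact Cancel message shape are assumptions, not the merged code:)

```go
// Inside the goroutine that watches for shutdown while a provision
// operation is streaming:
select {
case <-r.forceStopContext.Done():
	// Hard stop: give up immediately, no cancel message.
	return
case <-r.gracefulContext.Done():
	// Graceful stop: ask the provisioner to cancel. The stream may
	// already be closed by the deferred Close on function exit, so
	// this send is best effort and the error is intentionally ignored.
	_ = stream.Send(&sdkproto.Provision_Request{
		// Hypothetical oneof wrapper for the cancel request.
		Type: &sdkproto.Provision_Request_Cancel{
			Cancel: &sdkproto.Provision_Cancel{},
		},
	})
}
```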

mafredri (Member) left a comment:


Did a second pass on this and it looks good! I like the refactor into a separate package, turned out well. 👍🏻

(Approving, although I still think approval from someone more familiar with provisionerd might be good.)


kylecarbs (Member) left a comment:


I'm nervous about the lack of tests on the runner package. What do you think about moving those over from provisionerd_test.go?

CompleteJob(ctx context.Context, in *proto.CompletedJob) error
}

func NewRunner(
Member:

This could be called New to eliminate redundancy in the naming.
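(The standard Go naming point: callers already qualify the constructor with the package name, so repeating it stutters. Illustrative call sites only; the parameter list is hypothetical:)

```go
// Stutters: the package name already says "runner".
// r := runner.NewRunner(ctx, job, sender, logger)

// Reads cleanly at the call site.
r := runner.New(ctx, job, sender, logger)
```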

mafredri (Member) commented on Jul 1, 2022:

> I'm nervous about the lack of tests on the runner package. What do you think about moving those over from provisionerd_test.go?

@kylecarbs From what I looked at, there's relatively high coverage via the existing provisionerd tests, but I agree it might be good to test some of the functionality directly. For example, ForceStop() seems to be lacking coverage: https://codecov.io/gh/coder/coder/compare/d9668f7a4ef81ea22fcf27fd188a25e8eae45327...164a3024fe52aa47af23a9605d8b2238a2ad78a6/diff

kylecarbs (Member) commented:

@mafredri I dislike us separating tests from the package at hand. We have a unique case with codersdk, but I think that should stay isolated.

spikecurtis (author) commented:

We can't just move the tests in provisionerd_test.go over wholesale, because they work by setting up the whole omelette. They'd have to be refactored. We'd also then want some tests that actually exercise the provisionerd logic itself (connecting, acquiring jobs, etc.).

It's a good idea, and I'm happy to do it, but I'm not convinced it's the best use of my (or anyone's, really) time right now. WDYT @kylecarbs?

kylecarbs (Member) commented:

Seems fine to me. I use VS Code's test gutters to check what is and isn't covered, and it works pretty well, so I'm sad this won't appear there.

I think we should make an issue for it; then this looks good to me!

spikecurtis (author) commented:

Issue is #2780

@spikecurtis merged commit 22febc7 into main on Jul 1, 2022, and deleted the spike/provisionerd_cleanup_complete branch (July 1, 2022 16:55).