
provisionerd sends failed or complete last #2732

Merged: spikecurtis merged 5 commits into main from spike/provisionerd_cleanup_complete on Jul 1, 2022

Conversation

spikecurtis (Contributor, author):

Substantially narrows the race window on #2603 but doesn't close it.

This PR refactors provisionerd to include a job runner designed to ensure that we send at most one FailedJob or CompletedJob message per job, and that we cease all updates for a job once that message is sent.

It uses unexported interfaces to distinguish methods that may be called by the main provisionerd goroutines (e.g., canceling the current job) from methods internal to the runner. Similarly, we abstract the provisionerd server behind an interface seen by the runner, so that messages about jobs are always sent through consistent code.
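To make the "at most one terminal message" idea concrete, here is a minimal Go sketch of the guard the description implies; the names (terminalSent, markTerminal) are hypothetical and not the PR's actual code:

```go
package runner

import (
	"errors"
	"sync"
)

// runner tracks whether a terminal message (FailedJob or CompletedJob)
// has already gone out for its job.
type runner struct {
	mu           sync.Mutex
	terminalSent bool
}

var errTerminal = errors.New("job already failed or completed")

// markTerminal flips the guard exactly once. The caller sends the
// FailedJob/CompletedJob only when this returns true.
func (r *runner) markTerminal() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.terminalSent {
		return false
	}
	r.terminalSent = true
	return true
}

// update refuses to forward job updates once a terminal message was sent.
func (r *runner) update() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.terminalSent {
		return errTerminal
	}
	// ... forward the UpdateJobRequest to the server here ...
	return nil
}
```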

Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis requested review from mafredri and a team June 29, 2022 19:13
@@ -7,6 +7,6 @@ func IsInitProcess() bool {
return false
}

func ForkReap(opt ...Option) error {
spikecurtis (author):

This trips up the linter on my machine, so I fixed it. CI missed it because CI linting runs on Linux.
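For context, the usual fix for a lint complaint about an unused parameter in a platform stub is to leave the parameter unnamed. A hypothetical sketch of a non-Linux stub, not necessarily the exact change in this PR (the Option stand-in is illustrative; the real type lives elsewhere in the package):

```go
//go:build !linux

package reaper

// Option is a stand-in for this sketch only.
type Option struct{}

// IsInitProcess is a stub on non-Linux platforms.
func IsInitProcess() bool {
	return false
}

// ForkReap is a no-op outside Linux. Using "_" for the variadic
// parameter satisfies unused-parameter linters without changing the API.
func ForkReap(_ ...Option) error {
	return nil
}
```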

@@ -130,6 +130,7 @@ func (api *API) provisionerJobLogs(rw http.ResponseWriter, r *http.Request, job
for _, log := range logs {
select {
case bufferedLogs <- log:
api.Logger.Debug(r.Context(), "subscribe buffered log", slog.F("job_id", job.ID), slog.F("stage", log.Stage))
Member:

Is this going to be noisy in our logs?

spikecurtis (author):

In Debug!
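(Background on the point being made: with cdr.dev/slog, Debug-level entries are dropped unless the logger runs at a verbose level, so this line is silent in a default deployment. A minimal sketch; the sink wiring here is illustrative, not coderd's actual setup:)

```go
package main

import (
	"context"
	"os"

	"cdr.dev/slog"
	"cdr.dev/slog/sloggers/sloghuman"
)

func main() {
	ctx := context.Background()
	// Leveled(slog.LevelInfo) drops Debug entries entirely, so the
	// "subscribe buffered log" line only appears when verbose logging
	// (LevelDebug) is enabled.
	logger := slog.Make(sloghuman.Sink(os.Stderr)).Leveled(slog.LevelInfo)
	logger.Debug(ctx, "subscribe buffered log") // filtered out
	logger.Info(ctx, "visible at the default level")
}
```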

Comment on lines 66 to 70
type messageSender interface {
updateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error)
failJob(ctx context.Context, in *proto.FailedJob) error
completeJob(ctx context.Context, in *proto.CompletedJob) error
}
Member:

Could this just use the DRPCProvisionerDaemonServer interface instead? Provisionerd seems to be wrapping those methods, which is fine, but that's additional functionality, not something required to fulfill the interface. Reusing it could reduce indirection.

spikecurtis (author):

That interface is larger, giving access to the Conn and the ability to acquire a new job. We don't want the runner doing that.

Member:

[screenshot]

This one seems to be essentially the same; it just adds AcquireJob, which I don't think is a big deal to expose anyway.

spikecurtis (author):

The provisionerd.Server already has an acquireJob() method, and it doesn't have this signature, nor should it.

Also, DRPCProvisionerDaemonServer exposes these as methods visible outside the package, and we don't want that.

Member:


I'm not sure creating a single interface that's used in a single place improves clarity. I feel like a comment explaining why acquireJob shouldn't be passed through would do just as well, because a consumer could simply add it if they really wanted to call it anyway.

spikecurtis (author):

So, with the new structure of Runner in a separate package, we definitely need an interface here: *Runner can't use the *Server type because that would cause an import loop.

I'm pretty strongly against reusing the DRPCProvisionerDaemonServer interface for this purpose because it requires implementing and exposing the AcquireJob() method.

"Keep interfaces small" is generally considered a best practice.

More important in this context, though: we don't want the Runner acquiring jobs. That's not the contract. The Server acquires the job; the Runner updates, fails, or completes it.

Moreover, the Server is presently designed with the assumption that there is only one acquired job at a time. If the Server exposed an external method that allowed other things to acquire jobs, that assumption could easily be violated. We should not leave guns like that lying around for our future selves to get shot in the foot.
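(To illustrate the shape being argued for: a narrow, consumer-side interface defined in the runner package that *Server satisfies implicitly. This is a hypothetical sketch, not the merged code; the method names follow the exported forms seen later in this review, and the proto import path is assumed:)

```go
package runner

import (
	"context"

	"github.com/coder/coder/provisionerd/proto"
)

// JobUpdater is the only contract the Runner needs from the server.
// Defining it on the consumer side avoids an import loop with the
// provisionerd package, and it deliberately omits AcquireJob: the
// Server acquires jobs; the Runner merely updates, fails, or
// completes them.
type JobUpdater interface {
	UpdateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error)
	FailJob(ctx context.Context, in *proto.FailedJob) error
	CompleteJob(ctx context.Context, in *proto.CompletedJob) error
}
```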

mafredri (Member) left a comment:


Some questions and minor stylistic comments, but in general this looks good to me! I'm deferring approval to someone more familiar with the previous logic, though.

Side note: being somewhat unfamiliar with provisionerd, it would have been nice to have this as two PRs, one breaking the functionality out of a single file and another implementing the fix. That would make it clearer which code is new and which is old.

// are canceled but complete asynchronously (although they are prevented from further updating the job to the coder
// server). The provided context sets how long to keep trying to send the FailJob.
func (r *runner) fail(ctx context.Context, f *proto.FailedJob) error {
f.JobId = r.job.JobId
Member:

Should this be protected by a mutex? It's also set in setFail.

spikecurtis (author):

r.job is never changed, so it doesn't need to be protected.

Member:

I was referring to f.JobId, which is re-assigned.

Member:

Ah, I just noticed that f won't actually be re-used between function invocations, so not a problem. Curious why this uses r.job.JobId while the other uses r.job.GetJobId()?

spikecurtis (author):

That's on the passed-in proto, and it isn't protected by any mutex.

spikecurtis (author):

> this uses r.job.JobId and the other r.job.GetJobId()

That's just me being inconsistent for no reason.
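(Background on why the mix is harmless: Go protobuf generated getters are nil-receiver-safe, while direct field access panics on a nil message. A runnable sketch with a hand-written stand-in for the generated type; the real proto.AcquiredJob getter has the same shape:)

```go
package main

import "fmt"

// AcquiredJob is a minimal stand-in for a protobuf-generated message.
type AcquiredJob struct {
	JobId string
}

// Generated getters guard against a nil receiver.
func (j *AcquiredJob) GetJobId() string {
	if j == nil {
		return ""
	}
	return j.JobId
}

func main() {
	var job *AcquiredJob              // nil message
	fmt.Println(job.GetJobId() == "") // true: the getter is nil-safe
	// fmt.Println(job.JobId)         // would panic: nil dereference

	job = &AcquiredJob{JobId: "abc123"}
	fmt.Println(job.JobId, job.GetJobId()) // equivalent when non-nil
}
```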

case <-r.forceStopContext.Done():
return
case <-r.gracefulContext.Done():
_ = stream.Send(&sdkproto.Provision_Request{
Member:

I'm not sure what's supposed to happen here. Since we're closing the stream on function exit (above), we may do stream.Send on a closed stream here. I'm guessing this should be OK, but a comment explaining it may be good.

Member:

@spikecurtis did you see this comment? (Might've gotten lost in the noise, won't object if you feel a comment is not needed.)

spikecurtis (author):

Sorry, I kind of mentally skipped over it because that's existing code, apart from the names of the contexts.

But yeah, I think your analysis is right: the stream might be closed at the point this case fires, but sending the cancel is best effort anyway.
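(A sketch of the pattern under discussion, continuing the excerpt above: two nested contexts and a best-effort cancel send whose error is deliberately discarded. The surrounding code and the exact Cancel message shape are assumptions, not the merged code:)

```go
// Inside the goroutine that watches for shutdown while a provision
// operation is streaming:
select {
case <-r.forceStopContext.Done():
	// Hard stop: give up immediately, no cancel message.
	return
case <-r.gracefulContext.Done():
	// Graceful stop: ask the provisioner to cancel. The stream may
	// already be closed by the deferred Close on function exit, so
	// this send is best effort and the error is intentionally ignored.
	_ = stream.Send(&sdkproto.Provision_Request{
		// Hypothetical oneof wrapper for the cancel request.
		Type: &sdkproto.Provision_Request_Cancel{
			Cancel: &sdkproto.Provision_Cancel{},
		},
	})
}
```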

mafredri (Member) left a comment:


Did a second pass on this and it looks good! I like the refactor into a separate package, turned out well. 👍🏻

(Approving, although I still think approval from someone more familiar with provisionerd might be good.)


kylecarbs (Member) left a comment:


I'm nervous about the lack of tests on the runner package. What do you think about moving those over from provisionerd_test.go?

CompleteJob(ctx context.Context, in *proto.CompletedJob) error
}

func NewRunner(
Member:

This could be called New to eliminate redundancy in the naming.
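(The standard Go naming point: callers already qualify the constructor with the package name, so repeating it stutters. Illustrative call sites only; the parameter list is hypothetical:)

```go
// Stutters: the package name already says "runner".
// r := runner.NewRunner(ctx, job, sender, logger)

// Reads cleanly at the call site.
r := runner.New(ctx, job, sender, logger)
```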

mafredri (Member) commented on Jul 1, 2022:

> I'm nervous about the lack of tests on the runner package. What do you think about moving those over from provisionerd_test.go?

@kylecarbs From what I looked at, there's relatively high coverage via the existing provisionerd tests, but I agree it might be good to test some of the functionality directly. For example, ForceStop() seems to be lacking coverage: https://codecov.io/gh/coder/coder/compare/d9668f7a4ef81ea22fcf27fd188a25e8eae45327...164a3024fe52aa47af23a9605d8b2238a2ad78a6/diff

kylecarbs (Member) commented:

@mafredri I dislike us separating tests from the package at hand. We have a unique case with codersdk, but I think that should stay isolated.

spikecurtis (author) commented:

We can't just move the tests in provisionerd_test.go over wholesale, because they work by setting up the whole omelette. They'd have to be refactored. We'd also then want some tests that actually exercise the provisionerd logic itself (connecting, acquiring jobs, etc.).

It's a good idea, and I'm happy to do it, but I'm not convinced it's the best use of my (or anyone's, really) time right now. WDYT @kylecarbs?

kylecarbs (Member) commented:

Seems fine to me. I use VS Code's test gutters to check what is and isn't covered, and it works pretty well, so I'm sad this won't appear there.

I think we should make an issue for it; then this looks good to me!

spikecurtis (author) commented:

Issue is #2780

@spikecurtis merged commit 22febc7 into main on Jul 1, 2022, and deleted the spike/provisionerd_cleanup_complete branch (July 1, 2022 16:55).