
Serialize hosts writes per instance #2753

Merged: 21 commits merged into main, Nov 8, 2021
Conversation

@chiiph (Contributor) commented Nov 1, 2021

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added (for user-visible changes)
  • Documented any API changes (docs/01-Using-Fleet/03-REST-API.md)
  • Documented any permissions changes
  • Added/updated tests
  • Manual QA for all new/changed functionality

@mna (Member) left a comment:

Commenting on the general approach, not the details, since this is still a WIP. My main concern is that the goroutine may not process fast enough, so requests start piling up and waiting for the context to expire (do we set a timeout on request handlers? i.e. would it ever expire?), consuming resources on the Fleet server in the meantime. Saving a host stores a lot of data, especially when packs/software/additional data/users are involved.

There's also a lot of overlap with the async processing work; this is basically addressing the same issue with a different solution. In async mode we aim to process requests very fast by storing them temporarily in Redis and processing the save to MySQL asynchronously, in a controlled manner; here we queue the requests until they get a chance to save to MySQL (or time out waiting).

That being said, while I think the MarkHostsSeen processing would fit well in the same async approach as labels and policies (small payload), I don't think the same can be said of SaveHosts: it would be quite a lot to store in Redis even temporarily. Our load testing already shows that it takes a relatively big Redis cluster to handle labels (not so much due to the size of the payload in memory, but due to the high workload on the Redis process, which would be impacted even more by the size of the save-hosts payload).

So I think this PR's approach is worth a shot, while keeping an eye on the impact of requests waiting idle. But more and more, it seems we need a real queue to decouple requests from processing (gocloud makes that relatively easy to support in an abstract way, so you don't force a specific provider, with implementations for AWS SQS, GCP, Azure, etc.: https://pkg.go.dev/gocloud.dev/pubsub).

@chiiph (Contributor, Author) commented Nov 1, 2021:

> my main concern would be that the goroutine can't process fast enough and requests start piling up and waiting for the context to expire

Yeah, this is my main concern as well (plus testing would be pretty annoying, unless we make it configurable). The first approach didn't have the async save, so I added it with a counter to see how things look. If it works well, the question becomes how to make it solid enough that requests wouldn't pile up. We can probably make it all configurable at least.

@chiiph (Contributor, Author) commented Nov 1, 2021:

I cleaned it up a bit and made this approach configurable. For the time being, I haven't added any tests for this approach; I will if tomorrow's call results in us going with it.

Review comments on: server/datastore/mysql/hosts.go (outdated), server/datastore/mysql/mysql.go
errCh: errCh,
item: host,
}:
return <-errCh
@chiiph (Contributor, Author):

@lucasmrod @mna after a convo with Lucas, he helped me see that I could remove the select here and defer close(errCh) above. Just mentioning it explicitly to bring more eyes here.

A member replied:
Last note on my end: given the latest state of the code, we don't need a buffered errCh channel anymore (the creator will always read it after writing to writeCh). But maybe it makes sense, under high load, to not block the serializing goroutine, so I'm OK with leaving it as-is :)
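The trade-off described here comes down to channel capacity: a buffered errCh lets the sender complete the result write without waiting for a reader. A minimal illustration (names hypothetical):

```go
package main

import "fmt"

func main() {
	// With capacity 1, the send completes immediately even if no
	// reader has arrived yet; an unbuffered channel would block the
	// serializing goroutine here until the caller reads.
	errCh := make(chan error, 1)
	errCh <- nil
	fmt.Println(<-errCh) // <nil>
}
```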

A member replied:
Yes, that works. It didn't work in the previous implementation because the context could be cancelled and cause the function to return between the write to writeCh and the read from errCh. Now it is guaranteed not to return before the other goroutine writes to errCh. Note, though, that this means the context can only unblock the call if it expires before the write to writeCh; after that, the call can block indefinitely waiting for the result from errCh. This should be fine, as the item's ctx is also used in the other goroutine to stop the database actions early if necessary.

}
defer rows.Close()

for rows.Next() {
A member commented:

<Just a nit, as this PR has been up for too long now :)>

A small improvement, to avoid mixing row scanning and deletion in the same for loop:

  1. Make a helper function that runs the SELECT and returns a slice of ids (rows.Err() and rows.Close() get called when the function returns).
  2. Then, in the for loop here, iterate over the returned slice of ids and call DELETE on them.

@chiiph (Contributor, Author):

The reasoning behind the scan+delete in a loop, though, is to avoid storing a potentially silly number of ids in memory.
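One way to reconcile the two concerns above (keeping SELECT and DELETE in separate steps without ever holding every id in memory) would be to page through the ids in fixed-size batches. This is not what the PR does; it is a hedged sketch, with a slice standing in for the table and selectIDs/deleteIDs as hypothetical stand-ins for the real queries.

```go
package main

import "fmt"

// store fakes the table: a slice plays the role of the rows.
type store struct{ rows []int }

// selectIDs returns up to limit ids, so memory stays bounded even for
// very large tables (the real version would run a SELECT ... LIMIT and
// close its rows before returning).
func (s *store) selectIDs(limit int) []int {
	if limit > len(s.rows) {
		limit = len(s.rows)
	}
	return append([]int(nil), s.rows[:limit]...)
}

// deleteIDs removes the given ids (fake: ids are always the head of
// the slice; the real version would run a DELETE ... WHERE id IN (...)).
func (s *store) deleteIDs(ids []int) {
	s.rows = s.rows[len(ids):]
}

// cleanup pages through the table in fixed-size batches, scanning and
// deleting in separate steps each round.
func cleanup(s *store, batchSize int) int {
	deleted := 0
	for {
		ids := s.selectIDs(batchSize)
		if len(ids) == 0 {
			return deleted
		}
		s.deleteIDs(ids)
		deleted += len(ids)
	}
}

func main() {
	s := &store{rows: []int{1, 2, 3, 4, 5}}
	fmt.Println(cleanup(s, 2)) // 5
}
```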

@chiiph chiiph requested review from lucasmrod and mna November 5, 2021 14:14
@mna (Member) left a comment:

I'd be a bit concerned about stalling requests waiting on the write to writeCh, but load tests have shown that it works quite nicely and fast enough, so 👍!

@chiiph chiiph merged commit 7db6de7 into main Nov 8, 2021
@chiiph chiiph deleted the serialize-host-writes branch November 8, 2021 14:42