Serialize hosts writes per instance #2753
Conversation
Commenting on the general approach, not the details, as this is still a WIP. My main concern would be that the goroutine can't process fast enough, so requests start piling up and waiting for the context to expire (do we set a timeout on request handlers? i.e. would it ever expire?), consuming resources on the Fleet server during that time. Saving a host stores a lot of data, especially when packs/software/additional data/users are involved.

There's also a lot of similarity with the async processing work: this is basically addressing the same issues with a different solution. In async mode we aim to process requests very fast by storing temporarily to Redis and asynchronously processing the save to MySQL in a controlled manner; here we queue the requests until they get a chance to save to MySQL (or time out waiting).

That being said, while I think the `MarkHostsSeen` processing would fit well in the same async approach as labels and policies (small payload), I don't think the same can be said of `SaveHosts`: it would be quite a lot to store to Redis even temporarily. Our load testing already shows that it takes a relatively big Redis cluster to handle labels (not so much due to the size of the payload in memory, but due to the high workload of the Redis process, which would be impacted even more by the size of the payload of the save-hosts processing).

So I think this PR's approach is worth a shot, keeping an eye on the impact of requests idling while they wait. But it seems like, more and more, we see the need for a real queue to decouple requests from processing (which gocloud makes relatively easy to support in an abstract way, so you don't force a specific provider, with implementations for AWS SQS, GCP, Azure, etc.: https://pkg.go.dev/gocloud.dev/pubsub).
Yeah, this is my main concern as well (plus testing would be pretty annoying, unless we make it configurable). The first approach didn't have the async save, so I added it with a counter to see how things look. If it works well, then the question would be how to make it solid enough that requests don't pile up. We can probably make it all configurable at least.

I cleaned it up a bit and made this approach configurable. For the time being, I haven't added any tests for this approach; I will if tomorrow's call results in us going with it.
server/datastore/mysql/migrations/tables/20211102135149_AddHostSeenTimesTable.go
```go
		errCh: errCh,
		item:  host,
	}:
		return <-errCh
```
@lucasmrod @mna after a convo with Lucas, he helped me see that I could remove the `select` from here and defer-close `errCh` above. Just mentioning it explicitly to bring more eyes here.
Last note on my end: given the latest state of the code, we don't need a buffered `errCh` channel anymore (the creator will always read it after writing to `writeCh`, but!... maybe it makes sense under high load to not block the serializing goroutine...). So I'm ok with leaving it as-is :)
Yes, that works. It didn't work in the previous implementation because the context could be cancelled and cause the function to return between the write to `writeCh` and the read from `errCh`. Now it is guaranteed not to return before the other goroutine writes to `errCh`, though note that this means the context can only unblock the call if it expires before the write to `writeCh`. After that, it can block indefinitely waiting for the result from `errCh`. This should be fine, though, as the item's `ctx` is also used in the other goroutine to stop the database actions early if necessary.
```go
	}
	defer rows.Close()

	for rows.Next() {
```
<Just a nit, as this PR has been up for too long now :)>

Small improvement to not mix row scanning + deletion in the same for loop:
- Make a helper function that calls the `SELECT` and returns a slice of ids (the `rows.Err()` + `rows.Close()` would get called when the function returns).
- Then in the for loop here you iterate such returned slice of `id`s and call `DELETE` on them.
The reasoning behind the scan+delete in a loop, though, is to avoid storing a potentially silly amount of ids in memory.
server/datastore/mysql/migrations/tables/20211102135149_AddHostSeenTimesTable.go
I'd be a bit concerned about stalling requests waiting on the write to `writeCh`, but load tests have shown that it works quite nicely and fast enough, so 👍!
Checklist for submitter
If some of the following don't apply, delete the relevant line.
- Documented any API changes (docs/01-Using-Fleet/03-REST-API.md)
- Documented any permissions changes