Skip to content

Various fixes.#5

Merged
stgraber merged 13 commits into
canonical:mainfrom
masnax:fixes
Jul 27, 2022
Merged

Various fixes.#5
stgraber merged 13 commits into
canonical:mainfrom
masnax:fixes

Conversation

@masnax
Copy link
Copy Markdown
Contributor

@masnax masnax commented Jul 13, 2022

Depends on #4

This adds several bug fixes that I noticed upon more nuanced testing:

  • Most importantly, this moves initiating the heartbeat out of the dqlite dialFunc. Calls to the dqlite client methods (to check the leader and cluster status during a heartbeat) will initiate further calls to the dialFunc, which calls more heartbeats, and on and on. I noticed my CPU spiking up to 600% from getting slammed with heartbeat requests. The corresponding commit here adds a helper that calls the heartbeat function every second, and is now set up to run in a goroutine upon opening the database.

  • Auto-schema-update logic was missing from the heartbeat sequence. Now there is an environment variable: SCHEMA_UPDATE that will be checked for an auto-update executable.

  • Adds a 5s context timeout for calls to the dqlite client since they can be blocking in some cases. I'm unable to replicate this frequently enough (It seems to happen randomly as far as I can tell) to be able to test how exactly this works, but I noticed that very occasionally, attempting to check if we are the dqlite leader froze the daemon. If this happens on the leader then no heartbeats will be sent until the leader is reloaded.

  • Ensures all API responses are fully read and closed so we can reuse the connection.

  • Prevents erroring out during reload if other cluster members are still updating

  • The heartbeat timeout logic was incorrect, and is now correctly sleeping a maximum of half the client context timeout for heartbeats (HeartbeatTimeout), and a minimum of 2 seconds, as the other cluster members will report their heartbeat times a bit ahead of the leader).

  • Adds some debug logs for the heartbeat.

@masnax masnax force-pushed the fixes branch 5 times, most recently from e89f98d to 748254c Compare July 19, 2022 18:49
masnax added 13 commits July 27, 2022 19:39
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
The heartbeat function was previously part of the dqlite dialFunc, but
since the heartbeat uses the dqlite client helpers, which results in
more calls to the dialFunc, this basically creates an infinite loop.
With the heartbeat lock, this doesn't pose any functional issue but
I've noticed my CPU spiking to 600% during heartbeats, which is
certainly not good!

This commit instead adds a loopHeartbeat helper which calls the hearbeat
function once per second. This helper is called once the database is
opened.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
… be sent

If we are in the middle of an update, it's possible some node is
unreachable so don't fail if we can't send that node a notification that
we have updated our schema.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
The tcp connection can't be reused unless the body is fully read and
closed, so this explicitly makes sure of that for each returned
response.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
I've encountered an odd issue where sometimes the dqlite client helpers
will lock up, so this adds a timeout context of 5 seconds (like the
client context).

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
This properly sets the amount of time to sleep between heartbeat
attempts to a minimum of 2 seconds and a maximum of half the client
HeartbeatTimeout.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
… data

cluster.Query is called with concurrent=true, so access to the hbInfo
map can cause a panic if two threads access it at once.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Additionally close the database on stop.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax
Copy link
Copy Markdown
Contributor Author

masnax commented Jul 27, 2022

This is a bit longer than 8 commits now as I've added the shutdown command as well.

@masnax masnax marked this pull request as ready for review July 27, 2022 19:51
@masnax masnax requested a review from stgraber July 27, 2022 19:51
@stgraber stgraber merged commit 85b63d4 into canonical:main Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants