gravel: replace etcd with kvstore backed by ceph itself #631
Conversation
Force-pushed from b684e7a to 94eee47
Having apparently thought about this further while I was asleep, I woke up this morning and realised this'll be because the second node can't talk to the cluster before it's joined (no ceph.conf & admin keyring), so of course won't be able to retrieve the ntp address from the kvstore. I guess I need to have the join return copies of these so the joining node can talk to the cluster immediately...
Correction: ceph.conf and the admin keyring are already there at that point (the join already does return those things). The problem is just that the cluster connection thread is inevitably in the middle of a 10 second sleep when those things land, so isn't connected when we try to get the ntp address.
OK, 35eefa5 fixes the join case, but there's still something a bit janky about it that I haven't yet been able to nail down (see the comments in that commit) |
Force-pushed from 35eefa5 to 5f209e9
Please ignore the previous comment about jankiness. Turns out what I was seeing was two calls to …
This is now in a reasonable state for manual testing if anyone's keen. It works fine for cluster deployment and node join. I still expect it to potentially behave badly and/or lock up if the cluster goes dead later though.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed from 9ca1645 to 53f4aa3
Rebased, also made black and mypy happy. Still need to fix the tests. |
Force-pushed from 53f4aa3 to d6e1339
jenkins run tumbleweed
Rock 'n' roll.. I don't have unit tests for the new KV class yet, but at least I'm no longer breaking all the other gravel unit tests! |
Overall looks good, but I'm a bit worried about the caching semantics. If feasible, I think we should not update the cache directly with values that we write, but instead rely on watches to update cached values. The same concern applies when a value is deleted: we need to clear that value from our cache as well. Then again, it might just be a matter of always trying to read directly from omap, and only falling back to the cache when a cluster connection is not possible.
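The read-through-with-fallback idea suggested here could look roughly like this. This is an illustrative sketch, not the PR's actual KV class; the names (`KVReadThrough`, `_omap`, `connected`) are invented, and a plain dict stands in for the rados omap object:

```python
from typing import Dict, Optional


class KVReadThrough:
    """Read path sketch: prefer the cluster omap, fall back to the cache."""

    def __init__(self) -> None:
        self._omap: Dict[str, str] = {}   # stand-in for the omap object
        self._cache: Dict[str, str] = {}  # local fallback cache
        self.connected = False            # flips once the cluster is reachable

    def get(self, key: str) -> Optional[str]:
        if not self.connected:
            # Cluster unreachable: serve the (possibly stale) cached value.
            return self._cache.get(key)
        value = self._omap.get(key)
        if value is None:
            # Deleted upstream: drop it from the cache too.
            self._cache.pop(key, None)
        else:
            # Refresh the cache only from a successful cluster read.
            self._cache[key] = value
        return value
```

With this shape the cache can only lag the cluster while disconnected, and a deletion observed on a connected read also clears the cached copy.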
src/gravel/controllers/kv.py (outdated)
if not var_lib_aquarium.exists():
    var_lib_aquarium.mkdir(0o700)
# this will fail with "_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/aquarium/kvstore'"
# if someone else has it open ... need KV to be a singleton?
I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown. Concurrent writes will have to be gated somehow. I'm sure there has to be locks in python somewhere.
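The single-global-instance pattern suggested here could be sketched as below. This is hypothetical, not the PR's actual code; the `instance()`/`open()`/`close()` API is invented, and a dict stands in for the gdbm file or rados handle:

```python
import threading
from typing import Dict, Optional


class KV:
    """Sketch of a single, process-wide KV instance (hypothetical API)."""

    _instance: Optional["KV"] = None
    _instance_lock = threading.Lock()

    def __init__(self) -> None:
        self._store: Optional[Dict[str, str]] = None

    @classmethod
    def instance(cls) -> "KV":
        # Gated creation so every caller shares one open store.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
                cls._instance.open()
            return cls._instance

    def open(self) -> None:
        self._store = {}    # would open the gdbm file / rados ioctx here

    def close(self) -> None:
        self._store = None  # would release handles on service shutdown
```

Opening once at start and closing at shutdown would also sidestep the `_gdbm.error: Resource temporarily unavailable` problem noted in the diff, since only one handle ever exists.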
> I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown.

It kinda is, right now, implicitly, because it's a member of GlobalState, and not of anything else. Still, it'd be better to make that explicit.

> Concurrent writes will have to be gated somehow. I'm sure there has to be locks in python somewhere.

threading.Lock() is probably what I want.
Actually threading.Lock() for locking between different threads, and asyncio.Lock() for locking between asyncio functions in the same thread.
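Combining both locks as suggested might look like this. A sketch only (the class and method names are invented): the asyncio.Lock serialises coroutines on the event loop, while the threading.Lock excludes the background cluster-connection thread:

```python
import asyncio
import threading
from typing import Dict


class LockedKV:
    """Sketch: gate writes from both asyncio callers and plain threads."""

    def __init__(self) -> None:
        self._async_lock = asyncio.Lock()     # between coroutines
        self._thread_lock = threading.Lock()  # between OS threads
        self._store: Dict[str, str] = {}

    async def put(self, key: str, value: str) -> None:
        async with self._async_lock:   # serialise concurrent coroutines
            with self._thread_lock:    # exclude the connection thread
                self._store[key] = value

    def put_from_thread(self, key: str, value: str) -> None:
        with self._thread_lock:        # safe from non-asyncio threads
            self._store[key] = value
```

Note the threading.Lock is only held around the actual write, so the event loop is never blocked for long.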
src/gravel/controllers/kv.py (outdated)
# TODO: deal with this (log it? ignore it?)
# e.g. RADOS rados state (You cannot perform that operation on a Rados object in state configuring.)
# this makes the log pretty messy prior to bootstrap (errors every 10 seconds)
logger.error(str(e))
Maybe handle exceptions on a per-operation basis and log those that make sense? Propagate when the operation should actually fail?
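A per-operation policy like the one proposed could be sketched as follows. Everything here is illustrative: `ClusterNotReady` stands in for the rados "state configuring" error, and the `required` flag is an invented way to mark operations that should actually fail:

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("kv")


class ClusterNotReady(Exception):
    """Stand-in for the rados 'object in state configuring' error."""


def kv_get(store: Optional[Dict[str, str]], key: str,
           *, required: bool = False) -> Optional[str]:
    """Log-and-continue for optional reads; propagate when the caller
    cannot make progress without the value."""
    try:
        if store is None:
            raise ClusterNotReady("no cluster connection yet")
        return store[key]
    except ClusterNotReady:
        if required:
            raise  # the operation should actually fail
        # Expected before bootstrap: keep it quiet to avoid log spam.
        logger.debug("cluster not ready; %r unavailable", key)
        return None
    except KeyError:
        return None  # an absent key is not an error for a get
```

Dropping the expected pre-bootstrap case to debug level would also fix the "errors every 10 seconds" noise mentioned in the TODO.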
await self._client.put(key, value)
# Try to write to the kvstore in our pool, and also
# write to our local cache (whether or not the write
# to the pool succeeds)
I think we should only write to the cache if the write succeeds, and ideally from the watches. If we're not watching for a specific key, I think it's okay to not cache it. This way we ensure that we always have a consistent state with the cluster, even if it gets out of sync when we die.
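The watch-driven cache described here could be sketched like this. The names (`WatchedKV`, `cache_updater`) and the synchronous callback shape are invented for illustration; a dict stands in for the cluster omap and the notify path stands in for a rados watch firing after a committed write:

```python
from typing import Callable, Dict, List, Optional

Callback = Callable[[str, Optional[str]], None]


class WatchedKV:
    """Sketch: the cache is filled only by watch callbacks, so it never
    holds a value the cluster has not acknowledged."""

    def __init__(self) -> None:
        self._cluster: Dict[str, str] = {}            # stand-in omap
        self._cache: Dict[str, str] = {}
        self._watches: Dict[str, List[Callback]] = {}

    def watch(self, key: str, cb: Callback) -> None:
        self._watches.setdefault(key, []).append(cb)

    def _notify(self, key: str, value: Optional[str]) -> None:
        for cb in self._watches.get(key, []):  # fired post-commit
            cb(key, value)

    def put(self, key: str, value: str) -> None:
        self._cluster[key] = value   # write to the cluster first...
        self._notify(key, value)     # ...cache fills in via the watch

    def delete(self, key: str) -> None:
        self._cluster.pop(key, None)
        self._notify(key, None)      # deletion clears the cache too

    def cache_updater(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self._cache.pop(key, None)
        else:
            self._cache[key] = value
```

Unwatched keys simply never enter the cache, matching the "if we're not watching for a specific key, it's okay to not cache it" suggestion.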
Sounds reasonable to me, except for the bootstrap case where we have to write values before we have a cluster at all.
Perhaps those need to be tracked in a separate cache so as not to confuse ourselves with the state of the cluster?
But I do see the below comments on getting back into sync being very difficult.
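The separate-bootstrap-cache idea could be sketched as below. Purely illustrative (class and method names are invented): pre-bootstrap writes are parked in a pending dict and replayed once a connection exists, so the real cache stays a pure mirror of cluster state:

```python
from typing import Dict, Optional


class BootstrapKV:
    """Sketch: values written before a cluster exists go to a separate
    pending dict, flushed once the connection comes up."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}
        self._cluster: Optional[Dict[str, str]] = None  # None until connected

    def put(self, key: str, value: str) -> None:
        if self._cluster is None:
            self._pending[key] = value  # no cluster yet: park the value
        else:
            self._cluster[key] = value

    def on_connect(self) -> None:
        self._cluster = {}
        self._cluster.update(self._pending)  # replay bootstrap writes
        self._pending.clear()
```

Because the pending dict is dropped after replay, there is no lingering pre-bootstrap state to get out of sync with the cluster later.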
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
@tserong happy to merge once the conflict is fixed. :)
This is a bit rough. There's all sorts of sharp corners you can cut your fingers on. Also, I haven't actually removed the etcd bootstrap bits yet, and I haven't touched any of the tests. Still, it does actually work. Well... Almost. I just tried adding a second node, and it's failing with "assert ntp_addr" in gravel/controllers/nodes/deployment.py's join() for some reason. I'm not sure yet if that's due to this change or not. Still, feedback on the general approach would be greatly appreciated :-) Signed-off-by: Tim Serong <tserong@suse.com>
This allows us to kick the connection thread to try to actually get a connection to the cluster when joining new nodes, so that we can pull needed values from the kvstore (e.g. ntp_addr).

Note that if you watch the logs on the joining node, you'll first see:

    -- kv -- ensure_connection: no cluster exists yet after 5 tries

Then, a second or so later, you'll see:

    -- kv -- ensure_connection: cluster state 'connected' after 1 tries

The first invocation is coming from the call to ensure_connection() from GlobalState.init_store(). At that point in time, there's no cluster yet, so that's fine. The second invocation is from NodeDeployment.join(), once we can actually access the cluster.

Signed-off-by: Tim Serong <tserong@suse.com>
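The retry-and-kick behaviour this commit describes can be sketched as a bounded loop with an interruptible sleep. The function shape and parameter names here are illustrative, not the PR's actual code; the `wake` event is what a join() could set to break the loop out of its 10 second sleep early:

```python
import threading
from typing import Callable, Optional


def ensure_connection(
    connect: Callable[[], bool],
    wake: threading.Event,
    tries: int = 5,
    interval: float = 10.0,
) -> Optional[int]:
    """Bounded retry sketch: returns the attempt number on success,
    None if no cluster appears within `tries` attempts."""
    for attempt in range(1, tries + 1):
        if connect():
            return attempt   # connected on this attempt
        wake.wait(interval)  # interruptible sleep; join() can set `wake`
        wake.clear()
    return None
```

This matches the logged behaviour above: the bootstrap-time call exhausts its tries ("no cluster exists yet after 5 tries"), while the post-join call, kicked awake, connects on its first attempt.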
Signed-off-by: Tim Serong <tserong@suse.com>
Without this, KV's cluster connect thread hangs around forever if you try to stop the aquarium service (well, it hangs around for a minute and a half until systemd gets sick of asking nicely and just kills the process - still, better to do this cleanly). Signed-off-by: Tim Serong <tserong@suse.com>
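The clean-shutdown pattern this commit describes can be sketched with a stop event replacing a plain sleep, so the thread wakes immediately when the service stops. The `ConnThread` wrapper and its method names are hypothetical, not the PR's actual code:

```python
import threading


class ConnThread:
    """Sketch: an interruptible connection thread so stopping the
    service doesn't have to wait for systemd to kill the process."""

    def __init__(self, interval: float = 10.0) -> None:
        self._stop = threading.Event()
        self._interval = interval
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self._stop.is_set():
            # ... attempt / maintain the cluster connection here ...
            self._stop.wait(self._interval)  # interruptible sleep

    def shutdown(self) -> None:
        self._stop.set()                 # wakes the sleep immediately
        self._thread.join(timeout=5.0)   # then reap the thread
```

Using `Event.wait(interval)` instead of `time.sleep(interval)` is what lets shutdown() interrupt the loop mid-sleep.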
Signed-off-by: Tim Serong <tserong@suse.com>
This is not technically necessary (the cluster connection thread will have well and truly found the cluster and connected to it by now). But I thought it was neater to make this explicit in the bootstrap case. Also, if we ever change the sleep in the connection thread to something significantly longer than 10 seconds, this might actually become necessary. Signed-off-by: Tim Serong <tserong@suse.com>
This handles the extremely unlikely (impossible?) case where there is somehow a race or bug such that two aquarium instances try to create the same aquarium pool and same kvstore object in the same cluster at the same time. Signed-off-by: Tim Serong <tserong@suse.com>
This commit also removes the related but now unused GlobalState.init_store() method (other cases where this functionality is needed are already directly calling KV.ensure_connection()). Signed-off-by: Tim Serong <tserong@suse.com>
Previously, we'd see "-- kv -- [errno 2] RADOS object not found (error calling conf_read_file)" in the logs every 10 seconds prior to cluster creation. Now we'll only see it once. Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
The KV store creates a pool during cluster bringup, which (because threads) inevitably happens before the default ruleset is set in the cluster config, so we need to ensure we apply that ruleset to any existing pools as well. Signed-off-by: Tim Serong <tserong@suse.com>
Force-pushed from bae058b to e652542
@jecluis ah, dammit :-) Done!
Checklist
Available Jenkins commands:
jenkins test tumbleweed
jenkins run tumbleweed
jenkins test leap
jenkins run leap