gravel: replace etcd with kvstore backed by ceph itself #631
Conversation
Force-pushed from b684e7a to 94eee47
Having apparently thought about this further while I was asleep, I woke up this morning and realised this'll be because the second node can't talk to the cluster before it's joined (no ceph.conf & admin keyring), so of course won't be able to retrieve the ntp address from the kvstore. I guess I need to have the join return copies of these so the joining node can talk to the cluster immediately...
Correction: ceph.conf and the admin keyring are already there at that point (the join already does return those things). The problem is just that the cluster connection thread is inevitably in the middle of a 10 second sleep when those things land, so isn't connected when we try to get the ntp address.
OK, 35eefa5 fixes the join case, but there's still something a bit janky about it that I haven't yet been able to nail down (see the comments in that commit) |
Force-pushed from 35eefa5 to 5f209e9
Please ignore the previous comment about jankiness. Turns out what I was seeing was two calls to …
This is now in a reasonable state for manual testing if anyone's keen. It works fine for cluster deployment and node join. I still expect it to potentially behave badly and/or lock up if the cluster goes dead later though.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Force-pushed from 9ca1645 to 53f4aa3
Rebased, also made black and mypy happy. Still need to fix the tests. |
Force-pushed from 53f4aa3 to d6e1339
jenkins run tumbleweed
Rock 'n' roll.. I don't have unit tests for the new KV class yet, but at least I'm no longer breaking all the other gravel unit tests! |
Overall looks good, but I'm a bit worried about the caching semantics. If feasible, I think we should not update the cache directly with values that we write, but instead rely on watches to update cached values. The same concern applies when a value is deleted: we need to clear that value from our cache as well. Then again, it might just be a matter of always trying to read directly from omap, and only falling back to the cache when a cluster connection is not possible.
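The read-through-with-fallback idea suggested here could look roughly like this. This is an illustrative sketch, not the PR's actual KV class; the names (`KVReadThrough`, `_omap`, `connected`) are invented, and a plain dict stands in for the rados omap object:

```python
from typing import Dict, Optional


class KVReadThrough:
    """Read path sketch: prefer the cluster omap, fall back to the cache."""

    def __init__(self) -> None:
        self._omap: Dict[str, str] = {}   # stand-in for the omap object
        self._cache: Dict[str, str] = {}  # local fallback cache
        self.connected = False            # flips once the cluster is reachable

    def get(self, key: str) -> Optional[str]:
        if not self.connected:
            # Cluster unreachable: serve the (possibly stale) cached value.
            return self._cache.get(key)
        value = self._omap.get(key)
        if value is None:
            # Deleted upstream: drop it from the cache too.
            self._cache.pop(key, None)
        else:
            # Refresh the cache only from a successful cluster read.
            self._cache[key] = value
        return value
```

With this shape the cache can only lag the cluster while disconnected, and a deletion observed on a connected read also clears the cached copy.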
src/gravel/controllers/kv.py (outdated)
if not var_lib_aquarium.exists():
    var_lib_aquarium.mkdir(0o700)
# this will fail with "_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/aquarium/kvstore'"
# if someone else has it open ... need KV to be a singleton?
I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown. Concurrent writes will have to be gated somehow. I'm sure there has to be locks in python somewhere.
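The single-global-instance pattern suggested here could be sketched as below. This is hypothetical, not the PR's actual code; the `instance()`/`open()`/`close()` API is invented, and a dict stands in for the gdbm file or rados handle:

```python
import threading
from typing import Dict, Optional


class KV:
    """Sketch of a single, process-wide KV instance (hypothetical API)."""

    _instance: Optional["KV"] = None
    _instance_lock = threading.Lock()

    def __init__(self) -> None:
        self._store: Optional[Dict[str, str]] = None

    @classmethod
    def instance(cls) -> "KV":
        # Gated creation so every caller shares one open store.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
                cls._instance.open()
            return cls._instance

    def open(self) -> None:
        self._store = {}    # would open the gdbm file / rados ioctx here

    def close(self) -> None:
        self._store = None  # would release handles on service shutdown
```

Opening once at start and closing at shutdown would also sidestep the `_gdbm.error: Resource temporarily unavailable` problem noted in the diff, since only one handle ever exists.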
> I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown.

It kinda is, right now, implicitly, because it's a member of GlobalState, and not of anything else. Still, it'd be better to make that explicit.

> Concurrent writes will have to be gated somehow. I'm sure there has to be locks in python somewhere.

threading.Lock() is probably what I want.
Actually threading.Lock() for locking between different threads, and asyncio.Lock() for locking between asyncio functions in the same thread.
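Combining both locks as suggested might look like this. A sketch only (the class and method names are invented): the asyncio.Lock serialises coroutines on the event loop, while the threading.Lock excludes the background cluster-connection thread:

```python
import asyncio
import threading
from typing import Dict


class LockedKV:
    """Sketch: gate writes from both asyncio callers and plain threads."""

    def __init__(self) -> None:
        self._async_lock = asyncio.Lock()     # between coroutines
        self._thread_lock = threading.Lock()  # between OS threads
        self._store: Dict[str, str] = {}

    async def put(self, key: str, value: str) -> None:
        async with self._async_lock:   # serialise concurrent coroutines
            with self._thread_lock:    # exclude the connection thread
                self._store[key] = value

    def put_from_thread(self, key: str, value: str) -> None:
        with self._thread_lock:        # safe from non-asyncio threads
            self._store[key] = value
```

Note the threading.Lock is only held around the actual write, so the event loop is never blocked for long.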
src/gravel/controllers/kv.py (outdated)
# TODO: deal with this (log it? ignore it?)
# e.g. RADOS rados state (You cannot perform that operation on a Rados object in state configuring.)
# this makes the log pretty messy prior to bootstrap (errors every 10 seconds)
logger.error(str(e))
Maybe handle exceptions on a per-operation basis and log those that make sense? Propagate when the operation should actually fail?
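A per-operation policy like the one proposed could be sketched as follows. Everything here is illustrative: `ClusterNotReady` stands in for the rados "state configuring" error, and the `required` flag is an invented way to mark operations that should actually fail:

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("kv")


class ClusterNotReady(Exception):
    """Stand-in for the rados 'object in state configuring' error."""


def kv_get(store: Optional[Dict[str, str]], key: str,
           *, required: bool = False) -> Optional[str]:
    """Log-and-continue for optional reads; propagate when the caller
    cannot make progress without the value."""
    try:
        if store is None:
            raise ClusterNotReady("no cluster connection yet")
        return store[key]
    except ClusterNotReady:
        if required:
            raise  # the operation should actually fail
        # Expected before bootstrap: keep it quiet to avoid log spam.
        logger.debug("cluster not ready; %r unavailable", key)
        return None
    except KeyError:
        return None  # an absent key is not an error for a get
```

Dropping the expected pre-bootstrap case to debug level would also fix the "errors every 10 seconds" noise mentioned in the TODO.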
await self._client.put(key, value)
# Try to write to the kvstore in our pool, and also
# write to our local cache (whether or not the write
# to the pool succeeds)
I think we should only write to the cache if the write succeeds, and ideally from the watches. If we're not watching for a specific key, I think it's okay to not cache it. This way we ensure that we always have a consistent state with the cluster, even if it gets out of sync when we die.
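The watch-driven cache described here could be sketched like this. The names (`WatchedKV`, `cache_updater`) and the synchronous callback shape are invented for illustration; a dict stands in for the cluster omap and the notify path stands in for a rados watch firing after a committed write:

```python
from typing import Callable, Dict, List, Optional

Callback = Callable[[str, Optional[str]], None]


class WatchedKV:
    """Sketch: the cache is filled only by watch callbacks, so it never
    holds a value the cluster has not acknowledged."""

    def __init__(self) -> None:
        self._cluster: Dict[str, str] = {}            # stand-in omap
        self._cache: Dict[str, str] = {}
        self._watches: Dict[str, List[Callback]] = {}

    def watch(self, key: str, cb: Callback) -> None:
        self._watches.setdefault(key, []).append(cb)

    def _notify(self, key: str, value: Optional[str]) -> None:
        for cb in self._watches.get(key, []):  # fired post-commit
            cb(key, value)

    def put(self, key: str, value: str) -> None:
        self._cluster[key] = value   # write to the cluster first...
        self._notify(key, value)     # ...cache fills in via the watch

    def delete(self, key: str) -> None:
        self._cluster.pop(key, None)
        self._notify(key, None)      # deletion clears the cache too

    def cache_updater(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self._cache.pop(key, None)
        else:
            self._cache[key] = value
```

Unwatched keys simply never enter the cache, matching the "if we're not watching for a specific key, it's okay to not cache it" suggestion.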
Sounds reasonable to me, except for the bootstrap case where we have to write values before we have a cluster at all.
Perhaps those need to be tracked in a separate cache so as not to confuse ourselves with the state of the cluster?
But I do see the below comments on getting back into sync being very difficult.
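The separate-bootstrap-cache idea could be sketched as below. Purely illustrative (class and method names are invented): pre-bootstrap writes are parked in a pending dict and replayed once a connection exists, so the real cache stays a pure mirror of cluster state:

```python
from typing import Dict, Optional


class BootstrapKV:
    """Sketch: values written before a cluster exists go to a separate
    pending dict, flushed once the connection comes up."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}
        self._cluster: Optional[Dict[str, str]] = None  # None until connected

    def put(self, key: str, value: str) -> None:
        if self._cluster is None:
            self._pending[key] = value  # no cluster yet: park the value
        else:
            self._cluster[key] = value

    def on_connect(self) -> None:
        self._cluster = {}
        self._cluster.update(self._pending)  # replay bootstrap writes
        self._pending.clear()
```

Because the pending dict is dropped after replay, there is no lingering pre-bootstrap state to get out of sync with the cluster later.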
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
@tserong happy to merge once the conflict is fixed. :)
This is a bit rough. There's all sorts of sharp corners you can cut your fingers on. Also, I haven't actually removed the etcd bootstrap bits yet, and I haven't touched any of the tests. Still, it does actually work. Well... Almost. I just tried adding a second node, and it's failing with "assert ntp_addr" in gravel/controllers/nodes/deployment.py's join() for some reason. I'm not sure yet if that's due to this change or not. Still, feedback on the general approach would be greatly appreciated :-) Signed-off-by: Tim Serong <tserong@suse.com>
This allows us to kick the connection thread to try to actually get a connection to the cluster when joining new nodes, so that we can pull needed values from the kvstore (e.g. ntp_addr).

Note that if you watch the logs on the joining node, you'll first see:

    -- kv -- ensure_connection: no cluster exists yet after 5 tries

Then, a second or so later, you'll see:

    -- kv -- ensure_connection: cluster state 'connected' after 1 tries

The first invocation is coming from the call to ensure_connection() from GlobalState.init_store(). At that point in time, there's no cluster yet, so that's fine. The second invocation is from NodeDeployment.join(), once we can actually access the cluster.

Signed-off-by: Tim Serong <tserong@suse.com>
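The retry-and-kick behaviour this commit describes can be sketched as a bounded loop with an interruptible sleep. The function shape and parameter names here are illustrative, not the PR's actual code; the `wake` event is what a join() could set to break the loop out of its 10 second sleep early:

```python
import threading
from typing import Callable, Optional


def ensure_connection(
    connect: Callable[[], bool],
    wake: threading.Event,
    tries: int = 5,
    interval: float = 10.0,
) -> Optional[int]:
    """Bounded retry sketch: returns the attempt number on success,
    None if no cluster appears within `tries` attempts."""
    for attempt in range(1, tries + 1):
        if connect():
            return attempt   # connected on this attempt
        wake.wait(interval)  # interruptible sleep; join() can set `wake`
        wake.clear()
    return None
```

This matches the logged behaviour above: the bootstrap-time call exhausts its tries ("no cluster exists yet after 5 tries"), while the post-join call, kicked awake, connects on its first attempt.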
Signed-off-by: Tim Serong <tserong@suse.com>
Without this, KV's cluster connect thread hangs around forever if you try to stop the aquarium service (well, it hangs around for a minute and a half until systemd gets sick of asking nicely and just kills the process - still, better to do this cleanly). Signed-off-by: Tim Serong <tserong@suse.com>
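The clean-shutdown pattern this commit describes can be sketched with a stop event replacing a plain sleep, so the thread wakes immediately when the service stops. The `ConnThread` wrapper and its method names are hypothetical, not the PR's actual code:

```python
import threading


class ConnThread:
    """Sketch: an interruptible connection thread so stopping the
    service doesn't have to wait for systemd to kill the process."""

    def __init__(self, interval: float = 10.0) -> None:
        self._stop = threading.Event()
        self._interval = interval
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self._stop.is_set():
            # ... attempt / maintain the cluster connection here ...
            self._stop.wait(self._interval)  # interruptible sleep

    def shutdown(self) -> None:
        self._stop.set()                 # wakes the sleep immediately
        self._thread.join(timeout=5.0)   # then reap the thread
```

Using `Event.wait(interval)` instead of `time.sleep(interval)` is what lets shutdown() interrupt the loop mid-sleep.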
Signed-off-by: Tim Serong <tserong@suse.com>
This is not technically necessary (the cluster connection thread will have well and truly found the cluster and connected to it by now). But I thought it was neater to make this explicit in the bootstrap case. Also, if we ever change the sleep in the connection thread to something significantly longer than 10 seconds, this might actually become necessary. Signed-off-by: Tim Serong <tserong@suse.com>
This handles the extremely unlikely (impossible?) case where there is somehow a race or bug such that two aquarium instances try to create the same aquarium pool and same kvstore object in the same cluster at the same time. Signed-off-by: Tim Serong <tserong@suse.com>
This commit also removes the related but now unused GlobalState.init_store() method (other cases where this functionality is needed are already directly calling KV.ensure_connection()). Signed-off-by: Tim Serong <tserong@suse.com>
Previously, we'd see "-- kv -- [errno 2] RADOS object not found (error calling conf_read_file)" in the logs every 10 seconds prior to cluster creation. Now we'll only see it once. Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
The KV store creates a pool during cluster bringup, which (because threads) inevitably happens before the default ruleset is set in the cluster config, so we need to ensure we apply that ruleset to any existing pools as well. Signed-off-by: Tim Serong <tserong@suse.com>
Force-pushed from bae058b to e652542
@jecluis ah, dammit :-) Done!
Checklist
Available Jenkins commands:
jenkins test tumbleweed
jenkins run tumbleweed
jenkins test leap
jenkins run leap