This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

gravel: replace etcd with kvstore backed by ceph itself #631

Merged
merged 19 commits into aquarist-labs:main from wip-replace-etcd on Sep 14, 2021

Conversation

tserong
Member

@tserong tserong commented Aug 17, 2021

This is a bit rough. There's all sorts of sharp corners you can cut your fingers on. Also, I haven't actually removed the etcd bootstrap bits yet, and I haven't touched any of the tests. Still, it does actually work. Well... Almost. I just tried adding a second node, and it's failing with "assert ntp_addr" in gravel/controllers/nodes/deployment.py's join() for some reason. I'm not sure yet if that's due to this change or not. Still, feedback on the general approach would be greatly appreciated :-)

Signed-off-by: Tim Serong tserong@suse.com

Checklist

  • References issues, create if required
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands
  • jenkins test tumbleweed
  • jenkins run tumbleweed
  • jenkins test leap
  • jenkins run leap

@tserong tserong added the WIP (Work in progress - Do not merge) and gravel (Related to the Aquarium Backend) labels Aug 17, 2021
@tserong
Member Author

tserong commented Aug 17, 2021

I just tried adding a second node, and it's failing with "assert ntp_addr" in gravel/controllers/nodes/deployment.py's join() for some reason. I'm not sure yet if that's due to this change or not

Having apparently thought about this further while I was asleep, I woke up this morning and realised this'll be because the second node can't talk to the cluster before it's joined (no ceph.conf & admin keyring), so of course won't be able to retrieve the ntp address from the kvstore. I guess I need to have the join return copies of these so the joining node can talk to the cluster immediately...

@tserong
Member Author

tserong commented Aug 18, 2021

Correction: ceph.conf and the admin keyring are already there at that point (the join already does return those things). The problem is just that the cluster connection thread is inevitably in the middle of a 10 second sleep when those things land, so isn't connected when we try to get the ntp address.
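The fix that eventually lands (described in the "kick the connection thread" commit message further down) lets the join path wake that thread instead of waiting out the remainder of the sleep. A minimal sketch of that pattern, using hypothetical names like kick() and _try_connect() rather than the actual gravel code:

    import threading

    class ClusterConnectThread(threading.Thread):
        """Hypothetical sketch: retry connecting, but wake early when kicked."""

        def __init__(self, retry_interval: float = 10.0) -> None:
            super().__init__(daemon=True)
            self._retry_interval = retry_interval
            self._wakeup = threading.Event()
            self._stopping = threading.Event()

        def kick(self) -> None:
            # e.g. called from join() once ceph.conf and the keyring have landed,
            # so we don't sit out the rest of a 10 second sleep
            self._wakeup.set()

        def stop(self) -> None:
            self._stopping.set()
            self._wakeup.set()

        def run(self) -> None:
            while not self._stopping.is_set():
                if self._try_connect():
                    break
                self._wakeup.wait(self._retry_interval)  # returns early if kicked
                self._wakeup.clear()

        def _try_connect(self) -> bool:
            return False  # placeholder for the real librados connection attempt

The stop() method here mirrors the later commit about shutting the thread down cleanly when the aquarium service stops, rather than waiting for systemd to kill the process.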

@tserong
Member Author

tserong commented Aug 18, 2021

OK, 35eefa5 fixes the join case, but there's still something a bit janky about it that I haven't yet been able to nail down (see the comments in that commit)

@tserong
Member Author

tserong commented Aug 19, 2021

Please ignore the previous comment about jankiness. Turns out what I was seeing was two calls to ensure_connect(), one from GlobalState.init_store() followed by one from NodeDeployment.join(). That's fine.

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Aug 19, 2021
@tserong
Member Author

tserong commented Aug 19, 2021

This is now in a reasonable state for manual testing if anyone's keen. It works fine for cluster deployment and node join. I still expect it to potentially behave badly and/or lock up if the cluster goes dead later though.

@tserong tserong added the enhancement (Iteration over existing code or feature) label Aug 19, 2021
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tserong tserong added needs-rebase and removed the documentation (Improvements or additions to documentation) and needs-rebase labels Aug 20, 2021
@tserong
Member Author

tserong commented Aug 20, 2021

Rebased, also made black and mypy happy. Still need to fix the tests.

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Aug 24, 2021
@tserong
Member Author

tserong commented Aug 24, 2021

jenkins run tumbleweed

@tserong
Member Author

tserong commented Aug 24, 2021

Rock 'n' roll... I don't have unit tests for the new KV class yet, but at least I'm no longer breaking all the other gravel unit tests!

src/gravel/controllers/kv.py (outdated; resolved)
Member

@jecluis jecluis left a comment


Overall looks good, but I'm a bit worried about the caching semantics. If feasible, I think we should not update the cache directly with the values that we write, but instead rely on watches to update cached values. The same question applies when a value is deleted, because we need to clear that value from our cache as well. Then again, that might just be a matter of always trying to read directly from omap, and only falling back to the cache if a cluster connection is not possible.
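To illustrate the read path being proposed here, a rough sketch (hypothetical helper names, not the actual KV class): always read from omap, refresh or invalidate the cache from that authoritative answer, and only fall back to the cache when the cluster can't be reached:

    from typing import Dict, Optional

    class KVReadThrough:
        """Hypothetical read path: prefer the cluster, fall back to the cache."""

        def __init__(self) -> None:
            self._cache: Dict[str, str] = {}

        def get(self, key: str) -> Optional[str]:
            try:
                value = self._omap_get(key)   # read from the rados omap object
            except ConnectionError:           # stand-in for "no cluster connection"
                return self._cache.get(key)   # stale-but-available fallback
            if value is None:
                self._cache.pop(key, None)    # deleted upstream: drop it locally too
            else:
                self._cache[key] = value      # refresh cache from the authoritative read
            return value

        def _omap_get(self, key: str) -> Optional[str]:
            raise ConnectionError("placeholder for the real librados omap read")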

images/aquarium/config.sh (resolved)
src/gravel/controllers/kv.py (outdated; resolved)
src/gravel/controllers/kv.py (resolved)
if not var_lib_aquarium.exists():
    var_lib_aquarium.mkdir(0o700)
# this will fail with "_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/aquarium/kvstore'"
# if someone else has it open ... need KV to be a singleton?
Member


I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown. Concurrent writes will have to be gated somehow. I'm sure there have to be locks in Python somewhere.

Member Author


I think the KV class should be a single, global instance that is opened upon start/init and closed on shutdown.

It kinda is, right now, implicitly, because it's a member of GlobalState, and not of anything else. Still, it'd be better to make that explicit.

Concurrent writes will have to be gated somehow. I'm sure there have to be locks in Python somewhere.

threading.Lock() is probably what I want.

Member Author


Actually threading.Lock() for locking between different threads, and asyncio.Lock() for locking between asyncio functions in the same thread.
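For reference, a small sketch of how the two lock types might sit together (assumed structure, not the actual gravel code): threading.Lock guards the shared GDBM handle against the connection thread, while asyncio.Lock serialises coroutines on the event loop without blocking it:

    import asyncio
    import threading

    class KVLocking:
        def __init__(self) -> None:
            self._db_lock = threading.Lock()   # shared with the cluster connect thread
            self._op_lock = asyncio.Lock()     # between coroutines in the same event loop

        def _write_local(self, key: str, value: str) -> None:
            # safe to call from any thread; protects the single GDBM handle
            with self._db_lock:
                pass  # db[key] = value

        async def put(self, key: str, value: str) -> None:
            async with self._op_lock:          # awaiting here never blocks the loop
                self._write_local(key, value)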

src/gravel/controllers/kv.py (outdated; resolved)
src/gravel/controllers/kv.py (resolved)
src/gravel/controllers/kv.py (outdated; resolved)
# TODO: deal with this (log it? ignore it?)
# e.g. RADOS rados state (You cannot perform that operation on a Rados object in state configuring.)
# this makes the log pretty messy prior to bootstrap (errors every 10 seconds)
logger.error(str(e))
Member


Maybe handle exceptions on a per-operation basis and log those that make sense? Propagate when the operation should actually fail?
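Sketching what per-operation handling could look like (an assumption about shape, not what the PR implements): treat "not found" as a normal answer, log transient rados errors at a quiet level, and let everything else propagate:

    import logging
    import rados  # python3-rados bindings

    logger = logging.getLogger("kv")

    class KV:
        def __init__(self) -> None:
            self._cache = {}

        def get(self, key: str):
            try:
                return self._omap_get(key)    # hypothetical omap read helper
            except rados.ObjectNotFound:
                return None                   # a missing key is a normal result
            except rados.Error as e:
                # transient cluster trouble: log quietly and serve the local cache
                logger.debug("get(%s): falling back to cache: %s", key, e)
                return self._cache.get(key)
            # anything else (programming errors, bad arguments) propagates to the caller

        def _omap_get(self, key: str):
            raise NotImplementedError("placeholder for the real librados omap read")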

await self._client.put(key, value)
# Try to write to the kvstore in our pool, and also
# write to our local cache (whether or not the write
# to the pool succeeds)
Member


I think we should only write to the cache if the write succeeds, and ideally from the watches. If we're not watching for a specific key, I think it's okay to not cache it. This way we ensure that we always have a consistent state with the cluster, even if it gets out of sync when we die.

Member Author


Sounds reasonable to me, except for the bootstrap case where we have to write values before we have a cluster at all.

Contributor


Perhaps those need to be tracked in a separate cache, so as not to confuse ourselves with the state of the cluster?

But I do see from the comments below that getting back into sync can be very difficult.
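Combining the two suggestions, a rough sketch (names here are hypothetical): ordinary writes only populate the cache once the omap write has succeeded, while values written before bootstrap go into a separate staging dict that gets flushed once a cluster exists:

    class KV:
        def __init__(self) -> None:
            self._cache = {}     # mirrors what we know made it into the cluster
            self._pending = {}   # written before bootstrap; not in any cluster yet

        async def put(self, key: str, value: str) -> None:
            if not self._connected():
                self._pending[key] = value     # park it until we have a cluster
                return
            await self._omap_put(key, value)   # raises if the cluster write fails
            self._cache[key] = value           # cache only after a successful write

        async def flush_pending(self) -> None:
            # called once bootstrap has produced a working cluster
            for key, value in list(self._pending.items()):
                await self.put(key, value)
                del self._pending[key]

        def _connected(self) -> bool:
            return False                       # placeholder for the real check

        async def _omap_put(self, key: str, value: str) -> None:
            raise ConnectionError("placeholder for the real librados omap write")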

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@jecluis
Member

jecluis commented Sep 13, 2021

@tserong happy to merge once the conflict is fixed. :)

This is a bit rough.  There's all sorts of sharp corners you can cut your
fingers on.  Also, I haven't actually removed the etcd bootstrap bits yet,
and I haven't touched any of the tests.  Still, it does actually work.
Well...  Almost.  I just tried adding a second node, and it's failing with
"assert ntp_addr" in gravel/controllers/nodes/deployment.py's join() for
some reason.  I'm not sure yet if that's due to this change or not.
Still, feedback on the general approach would be greatly appreciated :-)

Signed-off-by: Tim Serong <tserong@suse.com>
This allows us to kick the connection thread to try to actually get
a connection to the cluster when joining new nodes, so that we can
pull needed values from the kvstore (e.g. ntp_addr).  Note that if
you watch the logs, on the joining node, you'll first see:

-- kv -- ensure_connection: no cluster exists yet after 5 tries

Then, a second or so later, you'll see:

-- kv -- ensure_connection: cluster state 'connected' after 1 tries

The first invocation is coming from the call to ensure_connection()
from GlobalState.init_store().  At that point in time, there's no
cluster yet, so that's fine.  The second invocation is from
NodeDeployment.join(), once we can actually access the cluster.

Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Without this, KV's cluster connect thread hangs around forever if you
try to stop the aquarium service (well, it hangs around for a minute
and a half until systemd gets sick of asking nicely and just kills the
process - still, better to do this cleanly).

Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
This is not technically necessary (the cluster connection thread will
have well and truly found the cluster and connected to it by now).
But I thought it was neater to make this explicit in the bootstrap
case.  Also, if we ever change the sleep in the connection thread to
something significantly longer than 10 seconds, this might actually
become necessary.

Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
We need to make sure python3-rados is mocked out, and the connection thread
doesn't keep running when testing.

Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
This handles the extremely unlikely (impossible?) case where there
is somehow a race or bug such that two aquarium instances try to
create the same aquarium pool and same kvstore object in the same
cluster at the same time.

Signed-off-by: Tim Serong <tserong@suse.com>
This commit also removes the related but now unused
GlobalState.init_store() method (other cases where this
functionality is needed are already directly calling
KV.ensure_connection()).

Signed-off-by: Tim Serong <tserong@suse.com>
Previously, we'd see "-- kv -- [errno 2] RADOS object not found
(error calling conf_read_file)" in the logs every 10 seconds
prior to cluster creation.  Now we'll only see it once.

Signed-off-by: Tim Serong <tserong@suse.com>
Signed-off-by: Tim Serong <tserong@suse.com>
The KV store creates a pool during cluster bringup, which (because
threads) inevitably happens before the default ruleset is set in
the cluster config, so we need to ensure we apply that ruleset to any
existing pools as well.

Signed-off-by: Tim Serong <tserong@suse.com>
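For context on that last commit, a hedged sketch of one way to push the default ruleset onto pools that already exist, via the rados python bindings and the "osd pool set" mon command (the rule name here is an assumption, and the actual gravel code may drive this differently):

    import json
    import rados

    def apply_crush_rule(cluster: rados.Rados, rule: str = "replicated_rule") -> None:
        # apply the rule to every existing pool, e.g. the pool the KV store created
        for pool in cluster.list_pools():
            cmd = json.dumps({
                "prefix": "osd pool set",
                "pool": pool,
                "var": "crush_rule",
                "val": rule,
            })
            ret, _out, errs = cluster.mon_command(cmd, b"")
            if ret != 0:
                raise RuntimeError(f"failed to set crush_rule on {pool}: {errs}")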
@tserong
Member Author

tserong commented Sep 14, 2021

@jecluis ah, dammit :-)

Done!

Project Aquarium automation moved this from Review in progress to Reviewer approved Sep 14, 2021
@jecluis jecluis merged commit df7f432 into aquarist-labs:main Sep 14, 2021
Project Aquarium automation moved this from Reviewer approved to Done Sep 14, 2021
@tserong tserong deleted the wip-replace-etcd branch September 14, 2021 10:07
Labels
documentation (Improvements or additions to documentation), enhancement (Iteration over existing code or feature), gravel (Related to the Aquarium Backend)