This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

gravel: add etcd #403

Merged
merged 22 commits into from Apr 13, 2021

Conversation

@jecluis (Member) commented Apr 10, 2021

We need something to share state across multiple nodes, consistently, and, ideally, without a single point of failure. This patchset proposes etcd as that thing.

Why etcd?

Because it exists, is simple enough to set up, and does not require Ceph to work.

Why not Ceph's monitors kvstore?

We could use the monitors' config-key/kv store, but a few things strike me as good enough reasons not to:

  1. That was never what it was supposed to exist for, and everyone keeps abusing it. Ceph needs something like etcd to keep its stuff, or to implement a proper kvstore, be it in the monitors or elsewhere (the manager would be preferable, tbh).
  2. We lose the store if the cluster dies, doesn't start, or a monitor fills up to the brim and causes a quorum loss. If that happens, Aquarium still needs to be available to provide info to the user, and to tell the user which services have gone away, stuff like that.
  3. The ceph-mon kvstore lacks a few things that are pretty cool, like watching key updates, and transactions.

Few things of note

  • We add a new dependency on aetcd3. This was the only asyncio etcd3 library I found that seems recent enough, has decent testing, and actually builds on PyPI, even though it doesn't seem to be particularly active.
  • aetcd3 is far from perfect, and has one thing that bites us on quit: resources are not properly freed and an exception is raised. I've opened a PR against it (client: ensure channels are closed on __del__() martyanov/aetcd#4), but I have very little hope of getting it merged. This is why, for the purposes of this patchset, we're relying on a custom-compiled library sitting on one of my servers, somewhere.
  • Unless there's suddenly more activity on that lib, we may have to fork it and carry it ourselves.

Resolves #32

Signed-off-by: Joao Eduardo Luis <joao@suse.com>

@jecluis jecluis added feature New feature gravel Related to the Aquarium Backend milestone: required Required for the assigned Milestone labels Apr 10, 2021
@jecluis jecluis added this to the Milestone 3 milestone Apr 10, 2021
@jecluis jecluis added this to In progress in Project Aquarium via automation Apr 10, 2021
@jecluis jecluis force-pushed the wip-etcd branch 2 times, most recently from 44c4ba4 to 19b1758 Compare April 10, 2021 21:38
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -41,6 +41,7 @@ class WelcomeMessageModel(BaseModel):
pubkey: str
cephconf: str
keyring: str
peer_url: str
@tserong (Member) commented Apr 12, 2021

Maybe I'm just nitpicking, but can we give this a slightly different name? Because it's peer_url, it makes me think it's a URL, but it's actually a string in the form "name=URL". Maybe call it, ummm... etcd_peer? Just something to make it slightly more obvious it's weird.

@tserong (Member):

(oops, that comment should have been part of the review below)

@jecluis (Member, Author):

yep, sounds good.

@tserong (Member) left a comment:

Looks like it should work :-) Missing a cache, for the case where etcd is dead, but you already noted that we'll need a cache in one of the commit messages, so that's fine. I have no particular reason not to approve this, but a few questions inline.

@@ -174,7 +174,8 @@ EOF
fi

if [[ "$kiwi_profiles" == *"Ceph"* ]]; then
pip install fastapi uvicorn aetcd3
pip install fastapi uvicorn \
https://celeuma.wipwd.dev/aqr/aetcd3/aetcd3-0.1.0a5.dev2+g0aa852e.d20210410-py3-none-any.whl
@tserong (Member):

Is this referring to the fix for martyanov/aetcd#4, which was since closed? In that case, do we still need this?

@jecluis (Member, Author):

I'll update the initial comment, but no. Even though that PR was closed, there are a couple of additional patches on top of the original repo: one fixes the aiofiles dependency version (the original repo requires <0.6, while uvicorn requires >=0.6, and it's annoying to have to deal with those conflicts when pip installs), and another adds a missing await when shutting down the lib, IIRC.

@@ -7,4 +7,4 @@ starlette==0.13.6
uvicorn==0.13.3
pip
websockets==8.1
aetcd3
https://celeuma.wipwd.dev/aqr/aetcd3/aetcd3-0.1.0a5.dev2+g0aa852e.d20210410-py3-none-any.whl
@tserong (Member):

Same comment as in pip install line

logger.info(f"started etcd process pid = {process.pid}")
t = threading.Thread(target=_bootstrap_process)
t.start()
logger.info("started etcd thread")
@tserong (Member):

"Starting it as a process, for some reason, allows etcd to squash the event loop's signal handlers, which breaks our "on shutdown" cleanup." Huh? As a separate process it shouldn't be able to do anything to the parent, should it? And if it can mess up the parent process, why wouldn't it also mess up the process when run as a thread? (I kinda feel like etcd should really be a separate process, just because it's its own daemon, rather than being some class or library that we import...)

@jecluis (Member, Author):

From the docs, that seems to be the daemon's instance name. initial-cluster seems much like mon_initial_cluster from Ceph. Even if that might set the cluster's name to the name of the first node, maybe there's a way to do that without having to name the first node something that does not match the hostname? (I'll look into that)

@jecluis (Member, Author):

wtf - this last comment should have been a reply to something else entirely. github seems to be on fire today.

Anyhoo, to @tserong's actual comment:

"Starting it as a process, for some reason, allows etcd to squash the event loop's signal handlers, which breaks our "on shutdown" cleanup." Huh? As a separate process it shouldn't be able to do anything to the parent, should it? And if it can mess up the parent process, why wouldn't it also mess up the process when run as a thread? (I kinda feel like etcd should really be a separate process, just because it's its own daemon, rather than being some class or library that we import...)

Yeah, I know. This is weird, really. I assumed the same thing, but for some reason, when starting an asyncio subprocess from a multiprocessing.Process, the signal handler is squashed; when starting the asyncio subprocess from a threading.Thread, it is not. I'll try getting a script to reproduce the behavior, just to ensure I'm not imagining things.

@jecluis (Member, Author):

Okay, I'm struggling to reproduce this with a script because it works fine with a Process. I truly can't tell wtf was going on. I'll try reverting it and see what happens, I guess.

async def ensure_connection(self) -> None:
""" Open k/v store connection """
# try getting the status, loop until we make it.
opened = False
@tserong (Member):

What happens if we never make it? The KV store is unavailable forever?

@jecluis (Member, Author):

yeah. This might warrant a timeout or something. But yeah, that's it.

@tserong (Member):

It's probably fine for now. It'll also presumably benefit once we have a cache later (at least we'll be able to read cached K/Vs while we're waiting for etcd)

@jecluis (Member, Author) left a comment:

For some reason github did not allow me to reply to @tserong without reviewing my own PR? 🤷


def _load(self) -> None:
self._token = self._load_token()
self._token = await self._load_token()
await self._kvstore.watch("/nodes/token", _watcher)
@jhesketh (Contributor):

Why is a token obtained, and then the key watched for updates? Do we expect the token to change?

@jecluis (Member, Author):

I hope so. In this case it was meant mostly as a PoC that watching keys works, but it's also thinking ahead, given I hope we'll let users refresh their token should they want to.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Add new members on join. Relies on python's aetcd3 library
for etcd shenanigans.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
So we can have a consistent and up to date ceph.conf across all nodes,
let's rely on cephadm to manage it.

We don't drop the ceph.conf initially shared with a joining node because
we want to ensure that node is able to perform operations on the cluster
as soon as join finishes, and we don't want to have to wait for cephadm
to write the ceph.conf.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Also adds needed typings for aetcd3.locks module.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
We needed to patch aetcd3 and the fix hasn't been merged upstream yet.
So, we created our own package and uploaded it somewhere it can be
reached. This is what we are installing now.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Obtain state and watch changes, update as needed.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
We're going to cache things in the near-future, and obtain state from
the kvstore, watches and whatnot. We need it to be a full fledged
service.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
For those services needing to cleanup state, add a shutdown method to be
called when we are shutting down.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
@jecluis (Member, Author) commented Apr 12, 2021

@tserong addressed all comments, I think.

Project Aquarium automation moved this from In progress to Reviewer approved Apr 13, 2021
@jhesketh (Contributor) left a comment:

I haven't tested this manually, but I'm in favour of the general approach etc.

@jecluis jecluis merged commit 649a96b into aquarist-labs:main Apr 13, 2021
Project Aquarium automation moved this from Reviewer approved to Done Apr 13, 2021
@jecluis jecluis deleted the wip-etcd branch April 13, 2021 07:00
mgfritch added a commit to mgfritch/aquarium that referenced this pull request Apr 15, 2021
introduced by PR aquarist-labs#403

Signed-off-by: Michael Fritch <mfritch@suse.com>

Successfully merging this pull request may close these issues.

gravel: persistent state
3 participants